Python for Data Analysis
Research Computing Services
Website: [Link]
Tutorial materials: [Link]
In [1]:
#Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
#Read csv file
df = pd.read_csv("[Link]
In [3]:
#Display a few first records
df.head()
Out[3]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
1 Prof A 12 6 Male 93000
2 Prof A 23 20 Male 110515
3 Prof A 40 31 Male 131205
4 Prof B 20 18 Male 104800
Exercise
In [4]:
#Display first 10 records
# <your code goes here>
df.head(10)
Out[4]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
1 Prof A 12 6 Male 93000
2 Prof A 23 20 Male 110515
3 Prof A 40 31 Male 131205
4 Prof B 20 18 Male 104800
5 Prof A 20 20 Male 122400
6 AssocProf A 20 17 Male 81285
7 Prof A 18 18 Male 126300
8 Prof A 29 19 Male 94350
9 Prof A 51 51 Male 57800
In [5]:
#Display first 20 records
# <your code goes here>
df.head(20)
Out[5]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
1 Prof A 12 6 Male 93000
2 Prof A 23 20 Male 110515
3 Prof A 40 31 Male 131205
4 Prof B 20 18 Male 104800
5 Prof A 20 20 Male 122400
6 AssocProf A 20 17 Male 81285
7 Prof A 18 18 Male 126300
8 Prof A 29 19 Male 94350
9 Prof A 51 51 Male 57800
10 Prof B 39 33 Male 128250
11 Prof B 23 23 Male 134778
12 AsstProf B 1 0 Male 88000
13 Prof B 35 33 Male 162200
14 Prof B 25 19 Male 153750
15 Prof B 17 3 Male 150480
16 AsstProf B 8 3 Male 75044
17 AsstProf B 4 0 Male 92000
18 Prof A 19 7 Male 107300
19 Prof A 29 27 Male 150500
In [6]:
#Display the last 5 records
# <your code goes here>
df.tail()
Out[6]:
rank discipline phd service sex salary
73 Prof B 18 10 Female 105450
74 AssocProf B 19 6 Female 104542
75 Prof B 17 17 Female 124312
76 Prof A 28 14 Female 109954
77 Prof A 23 15 Female 109646
In [7]:
#Identify the type of df object
type(df)
Out[7]:
pandas.core.frame.DataFrame
In [8]:
#Check the type of a column "salary"
df['salary'].dtype
Out[8]:
dtype('int64')
In [9]:
#List the types of all columns
df.dtypes
Out[9]:
rank object
discipline object
phd int64
service int64
sex object
salary int64
dtype: object
In [10]:
#List the column names
df.columns
Out[10]:
Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')
In [11]:
#List the row labels and the column names
df.axes
Out[11]:
[RangeIndex(start=0, stop=78, step=1),
 Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')]
In [12]:
#Number of dimensions
df.ndim
Out[12]:
2
In [13]:
#Total number of elements in the Data Frame
df.size
Out[13]:
468
In [14]:
#Number of rows and columns
df.shape
Out[14]:
(78, 6)
In [15]:
#Output basic statistics for the numeric columns
df.describe()
Out[15]:
phd service salary
count 78.000000 78.000000 78.000000
mean 19.705128 15.051282 108023.782051
std 12.498425 12.139768 28293.661022
min 1.000000 0.000000 57800.000000
25% 10.250000 5.250000 88612.500000
50% 18.500000 14.500000 104671.000000
75% 27.750000 20.750000 126774.750000
max 56.000000 51.000000 186960.000000
In [16]:
#Calculate mean for all numeric columns
df.mean()
Out[16]:
phd 19.705128
service 15.051282
salary 108023.782051
dtype: float64
Exercise
In [17]:
#Calculate the standard deviation (std() method) for all numeric columns
# <your code goes here>
df.std()
Out[17]:
phd 12.498425
service 12.139768
salary 28293.661022
dtype: float64
In [18]:
#Calculate average of the columns in the first 50 rows
# <your code goes here>
df.head(50).mean()
Out[18]:
phd 21.52
service 17.60
salary 113789.14
dtype: float64
Data slicing and grouping
In [19]:
df_sex = df.groupby('sex')
In [20]:
#Extract a column by name (method 1)
df['sex'].head()
Out[20]:
0 Male
1 Male
2 Male
3 Male
4 Male
Name: sex, dtype: object
In [21]:
#Extract a column by name (method 2)
df.sex.head()
Out[21]:
0 Male
1 Male
2 Male
3 Male
4 Male
Name: sex, dtype: object
Exercise
In [22]:
#Calculate the basic statistics for the salary column (use the describe() method)
# <your code goes here>
df['salary'].describe()
Out[22]:
count 78.000000
mean 108023.782051
std 28293.661022
min 57800.000000
25% 88612.500000
50% 104671.000000
75% 126774.750000
max 186960.000000
Name: salary, dtype: float64
In [23]:
#Calculate how many values are in the salary column (use the count() method)
# <your code goes here>
df['salary'].count()
Out[23]:
78
In [24]:
#Calculate the average salary
df['salary'].mean()
Out[24]:
108023.78205128205
In [25]:
#Group data using rank
df_rank = df.groupby('rank')
In [26]:
#Calculate mean of all numeric columns for the grouped object
df_rank.mean()
Out[26]:
phd service salary
rank
AssocProf 15.076923 11.307692 91786.230769
AsstProf 5.052632 2.210526 81362.789474
Prof 27.065217 21.413043 123624.804348
In [27]:
#Calculate the mean salary for men and women. The following produces a Pandas Series
#(single brackets around salary)
df.groupby('sex')['salary'].mean()
Out[27]:
sex
Female 101002.410256
Male 115045.153846
Name: salary, dtype: float64
In [28]:
# If we use double brackets Pandas will produce a DataFrame
df.groupby('sex')[['salary']].mean()
Out[28]:
salary
sex
Female 101002.410256
Male 115045.153846
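Grouped selections also work with agg(); a small added sketch (not part of the original notebook), assuming the same df frame, computing several statistics per group in one call:
# Mean, count and maximum salary for each sex (illustrative)
df.groupby('sex')['salary'].agg(['mean', 'count', 'max'])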
In [29]:
# Group using 2 variables - sex and rank:
df.groupby(['sex','rank'], sort=False)[['salary']].mean()
Out[29]:
salary
sex rank
Male Prof 124690.142857
AssocProf 102697.666667
AsstProf 85918.000000
Female Prof 121967.611111
AssocProf 88512.800000
AsstProf 78049.909091
Exercise
In [30]:
# Group data by the discipline and find the average salary for each group
df.groupby('discipline')['salary'].mean()
Out[30]:
discipline
A 98331.111111
B 116331.785714
Name: salary, dtype: float64
Filtering
In [31]:
#Select observation with the value in the salary column > 120K
df_sub = df[ df['salary'] > 120000]
df_sub.head()
Out[31]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
3 Prof A 40 31 Male 131205
5 Prof A 20 20 Male 122400
7 Prof A 18 18 Male 126300
10 Prof B 39 33 Male 128250
In [32]:
#Select data for female professors
df_w = df[ df['sex'] == 'Female']
df_w.head()
Out[32]:
rank discipline phd service sex salary
39 Prof B 18 18 Female 129000
40 Prof A 39 36 Female 137000
41 AssocProf A 13 8 Female 74830
42 AsstProf B 4 2 Female 80225
43 AsstProf B 5 0 Female 77000
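Conditions can also be combined. A minimal added sketch (not part of the original exercises): & is logical AND, | is logical OR, and each condition needs its own parentheses.
# Female professors earning more than 120K (combines the two filters above)
df[(df['sex'] == 'Female') & (df['salary'] > 120000)].head()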
Exercise
In [33]:
# Using filtering, find the mean value of the salary for the discipline A
df[df['discipline'] == 'A']['salary'].mean()
Out[33]:
98331.111111111109
In [34]:
# Challenge:
# Extract (filter) only observations with high salary ( > 120K) and find how many female
# and male professors are in each group
df[df['salary'] > 120000].groupby('sex')['salary'].count()
Out[34]:
sex
Female 9
Male 16
Name: salary, dtype: int64
More on slicing the dataset
In [35]:
#Select column salary
df1 = df['salary']
In [36]:
#Check data type of the result
type(df1)
Out[36]:
pandas.core.series.Series
In [37]:
#Look at the first few elements of the output
df1.head()
Out[37]:
0 186960
1 93000
2 110515
3 131205
4 104800
Name: salary, dtype: int64
In [38]:
#Select column salary and make the output a data frame
df2 = df[['salary']]
In [39]:
#Check the type
type(df2)
Out[39]:
pandas.core.frame.DataFrame
In [40]:
#Select a subset of rows (based on their position):
# Note 1: The location of the first row is 0
# Note 2: The last value in the range is not included
df[0:10]
Out[40]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
1 Prof A 12 6 Male 93000
2 Prof A 23 20 Male 110515
3 Prof A 40 31 Male 131205
4 Prof B 20 18 Male 104800
5 Prof A 20 20 Male 122400
6 AssocProf A 20 17 Male 81285
7 Prof A 18 18 Male 126300
8 Prof A 29 19 Male 94350
9 Prof A 51 51 Male 57800
In [41]:
#If we want to select both rows and columns we can use method .loc
df.loc[10:20, ['rank', 'sex', 'salary']]
Out[41]:
rank sex salary
10 Prof Male 128250
11 Prof Male 134778
12 AsstProf Male 88000
13 Prof Male 162200
14 Prof Male 153750
15 Prof Male 150480
16 AsstProf Male 75044
17 AsstProf Male 92000
18 Prof Male 107300
19 Prof Male 150500
20 AsstProf Male 92000
In [42]:
#Let's see what we get for our df_sub data frame
# Method .loc subsets the data frame based on the labels:
df_sub.loc[10:20,['rank','sex','salary']]
Out[42]:
rank sex salary
10 Prof Male 128250
11 Prof Male 134778
13 Prof Male 162200
14 Prof Male 153750
15 Prof Male 150480
19 Prof Male 150500
In [43]:
# Unlike method .loc, method iloc selects rows (and columns) by position:
df_sub.iloc[10:20, [0,3,4,5]]
Out[43]:
rank service sex salary
26 Prof 19 Male 148750
27 Prof 43 Male 155865
29 Prof 20 Male 123683
31 Prof 21 Male 155750
35 Prof 23 Male 126933
36 Prof 45 Male 146856
39 Prof 18 Female 129000
40 Prof 36 Female 137000
44 Prof 19 Female 151768
45 Prof 25 Female 140096
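To make the contrast explicit, here is a short added sketch (not from the original notebook) expressing a similar request with both indexers on the unsorted df, where row labels and positions still coincide:
# .loc uses labels: rows labelled 10 through 20 inclusive, columns by name
df.loc[10:20, ['rank', 'salary']]
# .iloc uses positions: rows 10 through 19, columns 0 (rank) and 5 (salary)
df.iloc[10:20, [0, 5]]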
Sorting the Data
In [44]:
#Sort the data frame by the service column and create a new data frame
df_sorted = df.sort_values(by = 'service')
df_sorted.head()
Out[44]:
rank discipline phd service sex salary
55 AsstProf A 2 0 Female 72500
23 AsstProf A 2 0 Male 85000
43 AsstProf B 5 0 Female 77000
17 AsstProf B 4 0 Male 92000
12 AsstProf B 1 0 Male 88000
In [45]:
#Sort the data frame by the service column and overwrite the original dataset
df.sort_values(by = 'service', ascending = False, inplace = True)
df.head()
Out[45]:
rank discipline phd service sex salary
9 Prof A 51 51 Male 57800
0 Prof B 56 49 Male 186960
36 Prof B 45 45 Male 146856
27 Prof A 45 43 Male 155865
40 Prof A 39 36 Female 137000
In [46]:
# Restore the original order (by sorting using index)
df.sort_index(axis=0, ascending = True, inplace = True)
df.head()
Out[46]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
1 Prof A 12 6 Male 93000
2 Prof A 23 20 Male 110515
3 Prof A 40 31 Male 131205
4 Prof B 20 18 Male 104800
Exercise
In [47]:
# Sort data frame by the salary (in descending order) and display the first few records
# of the output (head)
df.sort_values(by='salary', ascending=False).head()
Out[47]:
rank discipline phd service sex salary
0 Prof B 56 49 Male 186960
13 Prof B 35 33 Male 162200
72 Prof B 24 15 Female 161101
27 Prof A 45 43 Male 155865
31 Prof B 22 21 Male 155750
In [48]:
#Sort the data frame using 2 or more columns:
df_sorted = df.sort_values(by = ['service', 'salary'], ascending = [True,False])
df_sorted.head(10)
Out[48]:
rank discipline phd service sex salary
52 Prof A 12 0 Female 105000
17 AsstProf B 4 0 Male 92000
12 AsstProf B 1 0 Male 88000
23 AsstProf A 2 0 Male 85000
43 AsstProf B 5 0 Female 77000
55 AsstProf A 2 0 Female 72500
57 AsstProf A 3 1 Female 72500
28 AsstProf B 7 2 Male 91300
42 AsstProf B 4 2 Female 80225
68 AsstProf A 4 2 Female 77500
Missing Values
In [49]:
# Read a dataset with missing values
flights = pd.read_csv("[Link]
flights.head()
Out[49]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum fligh
0 2013 1 1 517.0 2.0 830.0 11.0 UA N14228 1545
1 2013 1 1 533.0 4.0 850.0 20.0 UA N24211 1714
2 2013 1 1 542.0 2.0 923.0 33.0 AA N619AA 1141
3 2013 1 1 554.0 -6.0 812.0 -25.0 DL N668DN 461
4 2013 1 1 554.0 -4.0 740.0 12.0 UA N39463 1696
In [50]:
# Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()
Out[50]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum fli
330 2013 1 1 1807.0 29.0 2251.0 NaN UA N31412 12
403 2013 1 1 NaN NaN NaN NaN AA N3EHAA 79
404 2013 1 1 NaN NaN NaN NaN AA N3EVAA 19
855 2013 1 2 2145.0 16.0 NaN NaN UA N12221 12
858 2013 1 2 NaN NaN NaN NaN AA NaN 13
In [51]:
# Filter all the rows where arr_delay value is missing:
flights1 = flights[ flights['arr_delay'].notnull( )]
flights1.head()
Out[51]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum fligh
0 2013 1 1 517.0 2.0 830.0 11.0 UA N14228 1545
1 2013 1 1 533.0 4.0 850.0 20.0 UA N24211 1714
2 2013 1 1 542.0 2.0 923.0 33.0 AA N619AA 1141
3 2013 1 1 554.0 -6.0 812.0 -25.0 DL N668DN 461
4 2013 1 1 554.0 -4.0 740.0 12.0 UA N39463 1696
In [52]:
# Remove all the observations with missing values
flights2 = flights.dropna()
In [53]:
# Fill missing values with zeros
nomiss = flights['dep_delay'].fillna(0)
nomiss.isnull().any()
Out[53]:
False
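Zeros are only one possible replacement. A hedged sketch (added, not in the original): filling the missing departure delays with the column mean instead.
# Fill missing dep_delay values with the column's mean
mean_filled = flights['dep_delay'].fillna(flights['dep_delay'].mean())
mean_filled.isnull().any()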
Exercise
In [54]:
# Count how many missing values are in the dep_delay and arr_delay columns
flights[['dep_delay','arr_delay']].isnull().sum()
Out[54]:
dep_delay 2336
arr_delay 2827
dtype: int64
Common Aggregation Functions:
Function Description
min minimum
max maximum
count number of non-null observations
sum sum of values
mean arithmetic mean of values
median median
mad mean absolute deviation
mode mode
prod product of values
std standard deviation
var unbiased variance
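As a quick illustration of the table above (added, not part of the original notebook), several of these functions can be applied to a single column in one call:
# A few of the aggregation functions from the table, applied to dep_delay
flights['dep_delay'].agg(['min', 'max', 'mean', 'median', 'std'])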
In [55]:
# Find the number of non-missing values in each column
flights.count()
Out[55]:
year 160754
month 160754
day 160754
dep_time 158418
dep_delay 158418
arr_time 158275
arr_delay 157927
carrier 160754
tailnum 159321
flight 160754
origin 160754
dest 160754
air_time 157927
distance 160754
hour 158418
minute 158418
dtype: int64
In [56]:
# Find the minimum value for each column in the dataset
flights.min()
Out[56]:
year 2013
month 1
day 1
dep_time 1
dep_delay -33
arr_time 1
arr_delay -75
carrier AA
flight 1
origin EWR
dest ANC
air_time 21
distance 17
hour 0
minute 0
dtype: object
In [57]:
# Let's compute a summary statistic per group:
flights.groupby('carrier')['dep_delay'].mean()
Out[57]:
carrier
AA 8.586016
AS 5.804775
DL 9.264505
UA 12.106073
US 3.782418
Name: dep_delay, dtype: float64
In [58]:
# We can use the agg() method for aggregation:
flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
Out[58]:
dep_delay arr_delay
min -33.000000 -75.000000
mean 9.463773 2.094537
max 1014.000000 1007.000000
In [59]:
# An example of computing different statistics for different columns
flights.agg({'dep_delay':['min','mean','max'], 'carrier':['nunique']})
Out[59]:
dep_delay carrier
max 1014.000000 NaN
mean 9.463773 NaN
min -33.000000 NaN
nunique NaN 5.0
Basic descriptive statistics
Function Description
min minimum
max maximum
mean arithmetic mean of values
median median
mad mean absolute deviation
mode mode
std standard deviation
var unbiased variance
sem standard error of the mean
skew sample skewness
kurt kurtosis
quantile value at %
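An added illustrative sketch, assuming the same df frame: a few of the statistics from the table computed directly on the salary column.
# Skewness, kurtosis and the 90th percentile of salary
df['salary'].skew(), df['salary'].kurt(), df['salary'].quantile(0.9)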
In [60]:
# The convenient describe() function computes a variety of statistics
flights.dep_delay.describe()
Out[60]:
count 158418.000000
mean 9.463773
std 36.545109
min -33.000000
25% -5.000000
50% -2.000000
75% 7.000000
max 1014.000000
Name: dep_delay, dtype: float64
In [61]:
# find the index of the maximum or minimum value
# if there are multiple matching values, idxmin() and idxmax() will return the first match
flights['dep_delay'].idxmin() #minimum value
Out[61]:
54111
In [62]:
# Count the number of records for each different value in a vector
flights['carrier'].value_counts()
Out[62]:
UA 58665
DL 48110
AA 32729
US 20536
AS 714
Name: carrier, dtype: int64
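value_counts() can also report relative frequencies; a minimal added sketch:
# Share of flights per carrier instead of raw counts
flights['carrier'].value_counts(normalize=True)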
Explore data using graphics
In [63]:
#Show graphs within the Python notebook
%matplotlib inline
In [64]:
#Use matplotlib to draw a histogram of the salary data
plt.hist(df['salary'], bins=8, normed=1)
Out[64]:
(array([ 7.14677085e-06, 8.73494215e-06, 1.74698843e-05,
8.73494215e-06, 9.52902780e-06, 6.35268520e-06,
3.17634260e-06, 7.94085650e-07]),
array([ 57800., 73945., 90090., 106235., 122380., 138525.,
154670., 170815., 186960.]),
<a list of 8 Patch objects>)
In [65]:
#Use seaborn package to draw a histogram
sns.distplot(df['salary']);
In [66]:
# Use regular matplotlib function to display a barplot
df.groupby(['rank'])['salary'].count().plot(kind='bar')
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff58213f860>
In [67]:
# Use seaborn package to display a barplot
sns.set_style("whitegrid")
ax = sns.barplot(x='rank', y='salary', data=df, estimator=len)
In [68]:
# Split into 2 groups:
ax = sns.barplot(x='rank', y='salary', hue='sex', data=df, estimator=len)
In [69]:
#Violinplot
[Link](x = "salary", data=df)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5819b79e8>
In [70]:
#Scatterplot in seaborn
sns.jointplot(x='service', y='salary', data=df)
Out[70]:
<seaborn.axisgrid.JointGrid at 0x7ff581984550>
In [71]:
#If we are interested in a linear regression plot for 2 numeric variables we can use regplot
sns.regplot(x='service', y='salary', data=df)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff58184c470>
In [72]:
# box plot
sns.boxplot(x='rank', y='salary', data=df)
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff58170de80>
In [73]:
# side-by-side box plot
sns.boxplot(x='rank', y='salary', data=df, hue='sex')
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5818edc18>
In [74]:
# swarm plot
sns.swarmplot(x='rank', y='salary', data=df)
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5814f75c0>
In [75]:
#factorplot
sns.factorplot(x='carrier', y='dep_delay', data=flights, kind='bar')
Out[75]:
<seaborn.axisgrid.FacetGrid at 0x7ff58178a198>
In [76]:
# Pairplot
sns.pairplot(df)
Out[76]:
<seaborn.axisgrid.PairGrid at 0x7ff5822296a0>
Exercise
In [77]:
#Using the seaborn package, explore the dependency of arr_delay on dep_delay
#(scatterplot or regplot) for the flights dataset
sns.jointplot(x='dep_delay', y='arr_delay', data=flights)
Out[77]:
<seaborn.axisgrid.JointGrid at 0x7ff580cb7a20>
Basic statistical Analysis
Linear Regression
In [78]:
# Import Statsmodel functions:
import statsmodels.formula.api as smf
In [79]:
# create a fitted model
lm = smf.ols(formula='salary ~ service', data=df).fit()
#print model summary
print(lm.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.283
Model:                            OLS   Adj. R-squared:                  0.274
Method:                 Least Squares   F-statistic:                     30.03
Date:                Fri, 15 Sep 2017   Prob (F-statistic):           5.31e-07
Time:                          [Link]   Log-Likelihood:                -896.72
No. Observations:                  78   AIC:                             1797.
Df Residuals:                      76   BIC:                             1802.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8.935e+04   4365.651     20.468      0.000    8.07e+04     9.8e+04
service     1240.3567    226.341      5.480      0.000     789.560    1691.153
==============================================================================
Omnibus:                       12.741   Durbin-Watson:                   1.630
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               21.944
Skew:                          -0.576   Prob(JB):                     1.72e-05
Kurtosis:                       5.329   Cond. No.                         30.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [80]:
# print the coefficients
lm.params
Out[80]:
Intercept 89354.824215
service 1240.356654
dtype: float64
In [81]:
#using scikit-learn:
from sklearn import linear_model
est = linear_model.LinearRegression(fit_intercept = True) # create estimator object
est.fit(df[['service']], df[['salary']])
#print result
print("Coef:", est.coef_, "\nIntercept:", est.intercept_)
Coef: [[ 1240.3566535]]
Intercept: [ 89354.82421525]
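Once fitted, the scikit-learn estimator can also be used for prediction; a small added sketch with hypothetical service values:
# Predict salaries for 0, 10 and 20 years of service (illustrative inputs)
est.predict([[0], [10], [20]])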
Exercise
In [82]:
# Build a linear model for arr_delay ~ dep_delay
lm = smf.ols(formula='arr_delay ~ dep_delay', data=flights).fit()
#print model summary
print(lm.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:              arr_delay   R-squared:                       0.794
Model:                            OLS   Adj. R-squared:                  0.794
Method:                 Least Squares   F-statistic:                 6.074e+05
Date:                Fri, 15 Sep 2017   Prob (F-statistic):               0.00
Time:                          [Link]   Log-Likelihood:            -6.8778e+05
No. Observations:              157927   AIC:                         1.376e+06
Df Residuals:                  157925   BIC:                         1.376e+06
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -7.4457      0.049   -152.050      0.000      -7.542      -7.350
dep_delay      1.0138      0.001    779.358      0.000       1.011       1.016
==============================================================================
Omnibus:                    38155.693   Durbin-Watson:                   1.467
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           159178.104
Skew:                           1.141   Prob(JB):                         0.00
Kurtosis:                       7.357   Cond. No.                         38.9
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Student T-test
In [83]:
# Using scipy package:
from scipy import stats
df_w = df[ df['sex'] == 'Female']['salary']
df_m = df[ df['sex'] == 'Male']['salary']
stats.ttest_ind(df_w, df_m)
Out[83]:
Ttest_indResult(statistic=-2.2486865976699053, pvalue=0.027429778657910103)
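An added variation, not part of the original notebook: if the two groups cannot be assumed to have equal variances, Welch's version of the test is obtained with equal_var=False:
# Welch's t-test, which does not assume equal group variances
stats.ttest_ind(df_w, df_m, equal_var=False)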