Introduction to Exploratory Data Analysis
STATISTICAL THINKING IN PYTHON (PART 1)
Justin Bois, Lecturer at the California Institute of Technology
Exploratory data analysis
The process of organizing, plotting, and summarizing a data set
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”
John Tukey
2008 US swing state election results
Data retrieved from Data.gov (https://www.data.gov/)
2008 US swing state election results
    import pandas as pd

    df_swing = pd.read_csv('2008_swing_states.csv')
    df_swing[['state', 'county', 'dem_share']]

      state              county  dem_share
    0    PA         Erie County      60.08
    1    PA     Bradford County      40.64
    2    PA        Tioga County      36.07
    3    PA       McKean County      41.21
    4    PA       Potter County      31.04
    5    PA        Wayne County      43.78
    6    PA  Susquehanna County      44.08
    7    PA       Warren County      46.85
    8    OH     Ashtabula County      56.94
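
The "organizing" and "summarizing" steps can be tried without the CSV file. The sketch below rebuilds only the nine rows shown on the slide into a small frame; the name `df_sample` and the 50 percent threshold are illustrative assumptions, not part of the original slides, and the full data set has many more counties.

```python
import pandas as pd

# Only the nine rows visible on the slide, not the full 2008_swing_states.csv
rows = [
    ("PA", "Erie County", 60.08),
    ("PA", "Bradford County", 40.64),
    ("PA", "Tioga County", 36.07),
    ("PA", "McKean County", 41.21),
    ("PA", "Potter County", 31.04),
    ("PA", "Wayne County", 43.78),
    ("PA", "Susquehanna County", 44.08),
    ("PA", "Warren County", 46.85),
    ("OH", "Ashtabula County", 56.94),
]
df_sample = pd.DataFrame(rows, columns=["state", "county", "dem_share"])

# Organize: how many rows and columns, and how many counties per state
print(df_sample.shape)
print(df_sample["state"].value_counts())

# Summarize: count, mean, quartiles, min, and max of the vote share
print(df_sample["dem_share"].describe())

# Number of sampled counties where the Democratic share exceeded 50 percent
n_majority = int((df_sample["dem_share"] > 50).sum())
print(n_majority)
```

Numbers like these are a quick sanity check before plotting, but they describe only this nine-row sample.
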
Let's practice!
Plotting a histogram
2008 US swing state election results
Generating a histogram
    import matplotlib.pyplot as plt

    # Histogram of county-level Democratic vote share (Matplotlib defaults to 10 bins)
    _ = plt.hist(df_swing['dem_share'])
    _ = plt.xlabel('percent of vote for Obama')
    _ = plt.ylabel('number of counties')
    plt.show()
Always label your axes
Histograms with different binning
Setting the bins of a histogram
    # Explicit bin edges: ten bins, each 10 percentage points wide
    bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    _ = plt.hist(df_swing['dem_share'], bins=bin_edges)
    plt.show()
Setting the bins of a histogram
    # An integer requests that many equal-width bins spanning the data range
    _ = plt.hist(df_swing['dem_share'], bins=20)
    plt.show()
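
To see what the two forms of the `bins` argument do, the counts can be computed directly with `np.histogram`, which accepts `bins` in the same two forms. This is a minimal sketch using only the nine `dem_share` values shown earlier in the deck, so the counts are illustrative rather than the real distribution.

```python
import numpy as np

# The nine dem_share values shown on the data slide (a sample, not the full data)
dem_share = np.array([60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94])

# A list gives the bin edges explicitly: 11 edges make 10 bins
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
counts_edges, edges = np.histogram(dem_share, bins=bin_edges)
print(counts_edges)  # [0 0 0 2 5 1 1 0 0 0]

# An integer gives the number of equal-width bins between the data min and max
counts_auto, auto_edges = np.histogram(dem_share, bins=20)
print(len(auto_edges))                # 21 edges for 20 bins
print(auto_edges[0], auto_edges[-1])  # span runs from the smallest to the largest value
```

With explicit edges the bins stay the same for any data set, which makes histograms comparable. With an integer the edges move with the data range.
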
Seaborn
An excellent Matplotlib-based statistical data visualization package written by Michael Waskom
Setting Seaborn styling
    import seaborn as sns

    # Apply Seaborn's default styling to subsequent Matplotlib plots
    sns.set()
    _ = plt.hist(df_swing['dem_share'])
    _ = plt.xlabel('percent of vote for Obama')
    _ = plt.ylabel('number of counties')
    plt.show()
A Seaborn-styled histogram
Let's practice!
Plot all of your data: Bee swarm plots
2008 US swing state election results
Binning bias
The same data may be interpreted differently depending on choice of bins
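
Binning bias can be seen numerically by counting the same values with two sets of bin edges that differ only by a shift. The sketch below assumes the nine `dem_share` values shown on the data slide; the two edge lists are illustrative choices, not taken from the slides.

```python
import numpy as np

# The nine dem_share values shown on the data slide (a sample, not the full data)
dem_share = np.array([60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94])

# Same bin width (10 points), edges shifted by 5 points
edges_a = [30, 40, 50, 60, 70]
edges_b = [25, 35, 45, 55, 65]

counts_a, _ = np.histogram(dem_share, bins=edges_a)
counts_b, _ = np.histogram(dem_share, bins=edges_b)

print(counts_a)  # [2 5 1 1]
print(counts_b)  # [1 5 1 2]
```

The first set of edges suggests a tail that thins out steadily toward high vote shares, while the second suggests a small second cluster at the top. Nothing about the data changed, only the bins.
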
Bee swarm plot
Organization of the data frame
Generating a bee swarm plot
    # One point per county, grouped along the x-axis by state
    _ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
    _ = plt.xlabel('state')
    _ = plt.ylabel('percent of vote for Obama')
    plt.show()
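
The `x` and `y` arguments name columns of the data frame, so the call relies on the frame holding one county per row and one feature per column. As a rough sketch of the grouping this implies, assuming a four-row sample built from values shown earlier in the deck (the names `df_sample` and `swarms` are illustrative):

```python
import pandas as pd

# A four-row sample: one row per county, one column per feature
df_sample = pd.DataFrame({
    "state": ["PA", "PA", "PA", "OH"],
    "county": ["Erie County", "Bradford County", "Tioga County", "Ashtabula County"],
    "dem_share": [60.08, 40.64, 36.07, 56.94],
})

# Collect the y values for each x category: the points each swarm would contain
swarms = {
    state: group["dem_share"].tolist()
    for state, group in df_sample.groupby("state")
}
print(swarms)
```

Each key corresponds to one position on the x-axis and each listed value to one plotted point, which is why every data point stays visible in a bee swarm plot.
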
2008 US swing state election results
Let's practice!
Plot all of your data: ECDFs
2008 US swing state election results
2008 US election results: East and West
Empirical cumulative distribution function (ECDF)
Making an ECDF
    import numpy as np

    # x: sorted data; y: fraction of data points less than or equal to each x
    x = np.sort(df_swing['dem_share'])
    y = np.arange(1, len(x)+1) / len(x)
    _ = plt.plot(x, y, marker='.', linestyle='none')
    _ = plt.xlabel('percent of vote for Obama')
    _ = plt.ylabel('ECDF')
    plt.margins(0.02)  # Keeps data off plot edges
    plt.show()
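
The sort-and-count recipe can be wrapped in a small reusable function and read off numerically. This is a sketch under stated assumptions: `ecdf` and `ecdf_at` are hypothetical helper names, not functions from NumPy or from the slides, and the data are only the nine `dem_share` values shown earlier in the deck.

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and ECDF heights y for a 1D data set (hypothetical helper)."""
    x = np.sort(data)
    n = len(x)
    y = np.arange(1, n + 1) / n  # y[i] is the fraction of points <= x[i]
    return x, y

def ecdf_at(x_sorted, value):
    """Fraction of data points less than or equal to value (hypothetical helper)."""
    return np.searchsorted(x_sorted, value, side="right") / len(x_sorted)

# The nine dem_share values shown on the data slide (a sample, not the full data)
dem_share = np.array([60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94])

x, y = ecdf(dem_share)
print(x[0], y[0])    # smallest value sits at height 1/9
print(x[-1], y[-1])  # largest value sits at height 1.0

# Fraction of sampled counties where the Democratic share was at most 50 percent
frac_at_most_50 = ecdf_at(x, 50)
print(frac_at_most_50)  # 7 of the 9 sampled counties
```

Reading the ECDF at 50 on the x-axis gives the fraction of counties in which Obama received at most half the vote, a figure a histogram cannot show without re-binning.
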
2008 US swing state election ECDF
2008 US swing state election ECDFs
Let's practice!
Onward toward the whole story!
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.”
John Tukey
Coming up…
  - Thinking probabilistically
  - Discrete and continuous distributions
  - The power of hacker statistics using np.random
Let's practice!