Introduction to Exploratory Data Analysis



  1. Introduction to Exploratory Data Analysis
     STATISTICAL THINKING IN PYTHON (PART 1)
     Justin Bois, Lecturer at the California Institute of Technology

  2. Exploratory data analysis: the process of organizing, plotting, and summarizing a data set

  3. “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” (John Tukey)

  4. 2008 US swing state election results
     Data retrieved from Data.gov (https://www.data.gov/)

  5. 2008 US swing state election results

         import pandas as pd

         df_swing = pd.read_csv('2008_swing_states.csv')
         df_swing[['state', 'county', 'dem_share']]

     Output:

            state              county  dem_share
         0     PA         Erie County      60.08
         1     PA     Bradford County      40.64
         2     PA        Tioga County      36.07
         3     PA       McKean County      41.21
         4     PA       Potter County      31.04
         5     PA        Wayne County      43.78
         6     PA  Susquehanna County      44.08
         7     PA       Warren County      46.85
         8     OH    Ashtabula County      56.94

  6. 2008 US swing state election results

  7. Let's practice!

  8. Plotting a histogram

  9. 2008 US swing state election results

  10. Generating a histogram

          import matplotlib.pyplot as plt

          _ = plt.hist(df_swing['dem_share'])
          _ = plt.xlabel('percent of vote for Obama')
          _ = plt.ylabel('number of counties')
          plt.show()

  11. Always label your axes

  12. 2008 US swing state election results

  13. Histograms with different binning

  14. Setting the bins of a histogram

          bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
          _ = plt.hist(df_swing['dem_share'], bins=bin_edges)
          plt.show()

  15. Setting the bins of a histogram

          _ = plt.hist(df_swing['dem_share'], bins=20)
          plt.show()
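
Slides 14 and 15 pass `bins` either as an explicit list of edges or as an integer count. As a rough sketch of the integer case, assuming the usual NumPy/Matplotlib behaviour of splitting the data range into equal-width bins, the edges can be worked out by hand. The helper name `equal_width_edges` is hypothetical, and the sample values are only the nine `dem_share` entries shown on slide 5, not the full data set.

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 equally spaced edges spanning min(values)..max(values).

    Hypothetical helper for illustration; it mimics what an integer
    `bins` argument asks the plotting library to do.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Pin the last edge to the maximum so rounding never leaves it short.
    return [lo + i * width for i in range(n_bins)] + [hi]


# The nine dem_share values listed on slide 5 (a small sample only).
dem_share = [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]

edges = equal_width_edges(dem_share, 4)
print([round(e, 2) for e in edges])
```

An explicit edge list, as on slide 14, fixes the bins regardless of where the data happen to start and end; an integer count lets the edges move with the data range.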

  16. Seaborn: an excellent Matplotlib-based statistical data visualization package written by Michael Waskom

  17. Setting Seaborn styling

          import seaborn as sns

          sns.set()
          _ = plt.hist(df_swing['dem_share'])
          _ = plt.xlabel('percent of vote for Obama')
          _ = plt.ylabel('number of counties')
          plt.show()

  18. A Seaborn-styled histogram

  19. Let's practice!

  20. Plot all of your data: Bee swarm plots

  21. 2008 US swing state election results

  22. 2008 US swing state election results

  23. Binning bias: the same data may be interpreted differently depending on the choice of bins
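
A small sketch may make binning bias concrete. It counts the nine `dem_share` values from slide 5 in two sets of bins that have the same width but are shifted by 5 percentage points. The helper `bin_counts` is hypothetical and uses half-open bins `[low, high)`, which is a simplifying assumption rather than an exact copy of any library's rule.

```python
def bin_counts(values, edges):
    """Count how many values fall in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# The nine dem_share values listed on slide 5 (a small sample only).
dem_share = [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]

# Same data, same bin width (10), but the edges are shifted by 5.
counts_a = bin_counts(dem_share, [30, 40, 50, 60, 70])
counts_b = bin_counts(dem_share, [25, 35, 45, 55, 65])

print(counts_a)  # [2, 5, 1, 1]
print(counts_b)  # [1, 5, 1, 2]
```

The first set of edges puts more counties in the lowest bin than in the highest; the second reverses that, although the data have not changed.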

  24. Bee swarm plot

  25. Organization of the data frame

  26. Organization of the data frame

  27. Organization of the data frame

  28. Generating a bee swarm plot

          _ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
          _ = plt.xlabel('state')
          _ = plt.ylabel('percent of vote for Obama')
          plt.show()
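
The call on slide 28 relies on the data frame holding one row per county, with the category (`state`) and the measurement (`dem_share`) in separate columns. The following is a conceptual sketch only, not how Seaborn is implemented: it groups the measurement by category in plain Python, using the rows shown on slide 5. The names `rows` and `by_state` are illustrative.

```python
from collections import defaultdict

# One row per county, as on slide 5: (state, county, dem_share).
rows = [
    ("PA", "Erie County", 60.08),
    ("PA", "Bradford County", 40.64),
    ("PA", "Tioga County", 36.07),
    ("PA", "McKean County", 41.21),
    ("PA", "Potter County", 31.04),
    ("PA", "Wayne County", 43.78),
    ("PA", "Susquehanna County", 44.08),
    ("PA", "Warren County", 46.85),
    ("OH", "Ashtabula County", 56.94),
]

# Group the y values (dem_share) by the x category (state); a swarm plot
# draws one column of points for each key.
by_state = defaultdict(list)
for state, county, dem_share in rows:
    by_state[state].append(dem_share)

for state, shares in by_state.items():
    print(state, len(shares), min(shares), max(shares))
```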

  29. 2008 US swing state election results

  30. Let's practice!

  31. Plot all of your data: ECDFs

  32. 2008 US swing state election results

  33. 2008 US election results: East and West

  34. Empirical cumulative distribution function (ECDF)

  35. Empirical cumulative distribution function (ECDF)

  36. Empirical cumulative distribution function (ECDF)

  37. Making an ECDF

          import numpy as np

          x = np.sort(df_swing['dem_share'])
          y = np.arange(1, len(x)+1) / len(x)
          _ = plt.plot(x, y, marker='.', linestyle='none')
          _ = plt.xlabel('percent of vote for Obama')
          _ = plt.ylabel('ECDF')
          plt.margins(0.02)  # Keeps data off plot edges
          plt.show()
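
The y value of an ECDF at a given x is the fraction of observations less than or equal to x, which is what `np.arange(1, len(x)+1) / len(x)` yields at each sorted data point. As a minimal sketch for reading a single value without plotting, the hypothetical helper `ecdf_at` below uses only the standard library and the nine `dem_share` values from slide 5.

```python
from bisect import bisect_right


def ecdf_at(data, value):
    """Fraction of observations in `data` that are <= `value`."""
    ordered = sorted(data)
    # bisect_right returns how many sorted entries are <= value.
    return bisect_right(ordered, value) / len(ordered)


# The nine dem_share values listed on slide 5 (a small sample only).
dem_share = [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]

print(ecdf_at(dem_share, 45.0))   # 6 of the 9 values are <= 45
print(ecdf_at(dem_share, 60.08))  # 1.0, the largest value
```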

  38. 2008 US swing state election ECDF

  39. 2008 US swing state election ECDFs
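
Slide 39 overlays the ECDFs of several states. One way to prepare such a plot, offered here as an assumption rather than the lecturer's own code, is to wrap the computation from slide 37 in a reusable function and apply it once per state. The function name `ecdf` and the `samples` dictionary are illustrative; the values are the rows from slide 5, so the Ohio sample contains a single county.

```python
import numpy as np


def ecdf(data):
    """Return sorted x values and the matching ECDF y values."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Small samples taken from the rows on slide 5; a real plot would use every county.
samples = {
    "PA": [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85],
    "OH": [56.94],
}

# One (x, y) pair per state, ready to be drawn on shared axes.
curves = {state: ecdf(values) for state, values in samples.items()}

x_pa, y_pa = curves["PA"]
print(x_pa[0], y_pa[0])  # smallest PA value and its ECDF height, 1/8
```

Each `(x, y)` pair can then be passed to `plt.plot(x, y, marker='.', linestyle='none')` as on slide 37, one call per state.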

  40. Let's practice!

  41. Onward toward the whole story!

  42. STATISTICAL THINKING IN PYTHON (PART 1)

  43. “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” (John Tukey)

  44. Coming up…
      Thinking probabilistically
      Discrete and continuous distributions
      The power of hacker statistics using np.random

  45. Let's practice!
