CS 109a: Data Science. Effective Exploratory Data Analysis and Visualization


  1. CS 109a: Data Science. Effective Exploratory Data Analysis and Visualization. Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

  2. Ask an interesting question. What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?
      Get the data. How were the data sampled? Which data are relevant? Are there privacy issues?
      Explore the data. Plot the data. Are there anomalies? Are there patterns?
      Model the data. Build a model. Fit the model. Validate the model.
      Communicate and visualize the results. What did we learn? Do the results make sense? Can we tell a story?

  3. Ask an interesting question. What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?
      Get the data. How were the data sampled? Which data are relevant? Are there privacy issues?
      Explore the data. Plot the data. Are there anomalies? Are there patterns? VISUALIZE THE DATA
      Model the data. Build a model. Fit the model. Validate the model.
      Communicate and visualize the results. What did we learn? Do the results make sense? Can we tell a story?

  4. https://www.autodeskresearch.com/publications/samestats

  5. https://www.autodeskresearch.com/publications/samestats

  6. Example: Antibiotics. Will Burtin, 1951

  7. Data

  8. Data Genus, Species

  9. Data Genus, Species

  10. Data: Genus, Species; Gram stain (+ / -)

  11. Data: Genus, Species; Gram stain (+ / -); Min. Inhibitory Concentration [ml/g]

  12. What Questions?

  13. Gram Positive / Gram Negative. M. Bostock, Protovis after W. Burtin, 1951

  14. How effective are the drugs? Gram Positive / Gram Negative. M. Bostock, Protovis after W. Burtin, 1951

  15. How effective are the drugs? If the bacteria are Gram-positive, Penicillin & Neomycin are most effective. If the bacteria are Gram-negative, Neomycin is most effective. M. Bostock, Protovis after W. Burtin, 1951

  16. Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

  17. How do the bacteria compare? Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

  18. How do the bacteria compare? Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

  19. How do the bacteria compare? Not a streptococcus! (realized ~30 years later) Really a streptococcus! (realized ~20 years later) Wainer & Lysen, “That’s funny...” American Scientist, 2009 Adapted from Brian Schmotzer

  20. How do the bacteria compare? Wainer & Lysen, “That’s funny...” American Scientist, 2009

  21. How do the bacteria compare? Wainer & Lysen, “That’s funny...” American Scientist, 2009

  22. Exploratory Data Analysis “The greatest value of a picture is when it forces us to notice what we never expected to see.” John Tukey

  23. Visualization Goals
      Communicate (Explanatory)
      • Present data and ideas
      • Explain and inform
      • Provide evidence and support
      • Influence and persuade
      Analyze (Exploratory)
      • Explore the data
      • Assess a situation
      • Determine how to proceed
      • Decide what to do

  24. Communicate New York Times

  25. Explore

  26. EDA Workflow
      1. Build a DataFrame from the data (ideally, put all data in this object).
      2. Clean the DataFrame. It should have the following properties:
         • Each row describes a single object
         • Each column describes a property of that object
         • Columns are numeric whenever appropriate
         • Columns contain atomic properties that cannot be further decomposed
      3. Explore global properties. Use histograms, scatter plots, and aggregation functions to summarize the data.
      4. Explore group properties. Use groupby and small multiples to compare subsets of the data.
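The four workflow steps can be sketched with pandas. This is only a minimal illustration under stated assumptions, not the course's own code: the tiny table, its column names (`name`, `mpg`, `cyl`, `weight`), and its values are made up for the example.

```python
import pandas as pd

# 1. Build a DataFrame (hypothetical toy data; values are illustrative only).
raw = {
    "name": ["car_a", "car_b", "car_c", "car_d"],
    "mpg": ["21.0", "22.8", "18.7", "14.3"],  # numbers that arrived as strings
    "cyl": [6, 4, 8, 8],
    "weight": [2.62, 2.32, 3.44, 3.57],
}
df = pd.DataFrame(raw)

# 2. Clean: each row is one car, each column one property,
#    and columns are numeric whenever appropriate.
df["mpg"] = pd.to_numeric(df["mpg"])

# 3. Explore global properties with aggregation functions.
global_summary = df[["mpg", "weight"]].describe()

# 4. Explore group properties with groupby.
mpg_by_cyl = df.groupby("cyl")["mpg"].mean()

print(global_summary)
print(mpg_by_cyl)
```

Histograms and scatter plots for step 3 would typically follow from the same DataFrame, for example through its `plot` accessor.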

  27. Viz options
      • Pandas Visualization module
      • Matplotlib
      • Seaborn
      • Above 3 are inter-mixable
      • Be lazy (to an extent…)
      • Other options: Bokeh, Vega, Vincent, Altair

  28. Cars Dataset: Basic Pandas/matplotlib

  29. You can set limits, tick styles, and scales, and add lines, annotations, titles, and legends. Seaborn provides a different visual style and lots of canned plots.
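A rough sketch of those customizations in matplotlib follows. The weight and mpg values are hypothetical stand-ins for the cars data, and the particular limits, reference line, and annotation text are arbitrary choices made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical toy values standing in for the cars data.
weight = [2.3, 2.6, 3.2, 3.4, 3.6]
mpg = [22.8, 21.0, 19.2, 18.7, 14.3]

fig, ax = plt.subplots()
ax.scatter(weight, mpg, label="cars")
ax.set_xlim(2.0, 4.0)                           # axis limits
ax.set_yscale("log")                            # axis scale
ax.tick_params(axis="both", direction="in")     # tick style
ax.axhline(20, linestyle="--", label="20 mpg")  # added reference line
ax.annotate("lowest mpg", xy=(3.6, 14.3), xytext=(3.0, 12.0),
            arrowprops={"arrowstyle": "->"})    # annotation with an arrow
ax.set_title("Fuel economy vs. weight (toy data)")
ax.set_xlabel("weight")
ax.set_ylabel("mpg")
ax.legend()
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk.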

  30. Effective Visualizations

  31. Not Effective... Sources: US Treasury and WHO reports

  32. Effective EDA Viz 1. Have graphical integrity 2. Keep it simple 3. Use the right display 4. Use color sensibly

  33. 1. Graphical Integrity

  34. Graphical Integrity Flowing Data

  35. Scale Distortions Flowing Data

  36. Scale Distortions

  37. “Double the axes, double the mischief” (quote from Gary Smith’s Standard Deviations). Graphic from Robert Reich’s Saving Capitalism. http://www.thefunctionalart.com/2015/10/double-axes-double-mischief.html Alberto Cairo • University of Miami • www.thefunctionalart.com • Twitter: @albertocairo

  38. Be Proportional

  39. Alberto Cairo

  40. 2012. Alberto Cairo

  41. Include Uncertainty

  42. Alberto Cairo

  43. What you show: forecast cone for Hurricane CAIRO (category 5), with positions marked at 8PM Saturday, 8PM Sunday, 8PM Monday, and 8PM Tuesday. Alberto Cairo

  44. What non-scientists are not aware of (cone is just 66% probability): forecast cone for Hurricane CAIRO (category 5), 8PM Saturday to 8PM Tuesday, with the figure labeled 2/3 and 1/3. Alberto Cairo

  45. What we could be showing instead: Hurricane CAIRO (category 5). Alberto Cairo

  46. Plot all your data

  47. Counties with the LOWEST kidney cancer death rates (1980-1989) / Counties with the HIGHEST kidney cancer death rates (1980-1989). Alberto Cairo

  48. Counties with the LOWEST kidney cancer death rates (1980-1989) / Counties with the HIGHEST kidney cancer death rates (1980-1989). Alberto Cairo

  49. 2. Keep It Simple

  50. Avoid Chartjunk: extraneous visual elements that distract from the message. ongoing, Tim Bray

  51. 1 2 3 4

  52. Don’t! matplotlib gallery; Excel Charts Blog

  53. 3. Use The Right Display

  54. http://extremepresentation.typepad.com/blog/files/choosing_a_good_chart.pdf

  55. Comparisons

  56. Bars vs. Lines Zacks 1999

  57. Proportions

  58. Pie Charts

  59. Stacked Bar Chart S. Few

  60. Stacked Area Chart S. Few

  61. Correlations

  62. Scatterplots http://xkcd.com/388/

  63. London Cholera Epidemic From Edward Tufte, Visual and Statistical Thinking

  64. Don’t! matplot3d tutorial

  65. Trends

  66. Yahoo! Finance

  67. Distributions

  68. Histogram ggplot2

  69. Bin Width: binwidth = 0.1 vs. binwidth = 0.01. ggplot2
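The effect of bin width can be seen numerically as well as visually. The sketch below bins the same synthetic sample (an assumption made for the example, not the slide's data) at widths 0.1 and 0.01; `histogram_counts` is a hypothetical helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
values = rng.normal(loc=0.0, scale=1.0, size=1000)  # synthetic sample


def histogram_counts(data, binwidth):
    """Count observations in equal-width bins of the given width."""
    lo = np.floor(data.min() / binwidth) * binwidth
    hi = np.ceil(data.max() / binwidth) * binwidth
    n_bins = int(round((hi - lo) / binwidth))
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    return counts, edges


coarse, _ = histogram_counts(values, 0.1)
fine, _ = histogram_counts(values, 0.01)

# Narrower bins mean many more bars, each holding far fewer points,
# so the histogram looks noisier even though the data are unchanged.
print("binwidth 0.1: ", len(coarse), "bins, tallest bin =", coarse.max())
print("binwidth 0.01:", len(fine), "bins, tallest bin =", fine.max())
```

Trying a few widths before settling on one is a reasonable habit, since a single choice can hide or exaggerate structure.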

  70. Density Plots

  71. https://www.autodeskresearch.com/publications/samestats

  72. https://www.autodeskresearch.com/publications/samestats

  73. https://www.autodeskresearch.com/publications/samestats

  74. https://www.autodeskresearch.com/publications/samestats

  75. GROUP getting complex…

  76. Faceting and Small Multiples: use seaborn or multiple plots in matplotlib.
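As one possible illustration of the "multiple plots in matplotlib" route, the sketch below draws one panel per group with shared axes. The grouping by cylinder count and all values are hypothetical toy data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical toy data: (weight, mpg) lists keyed by cylinder count.
groups = {
    4: ([2.1, 2.3, 2.5], [30.4, 22.8, 24.4]),
    6: ([2.6, 2.9, 3.2], [21.0, 21.4, 19.2]),
    8: ([3.4, 3.6, 3.8], [18.7, 14.3, 15.2]),
}

# One panel per group; shared axes keep the panels directly comparable.
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True,
                         figsize=(9, 3))
for ax, (cyl, (weight, mpg)) in zip(axes, groups.items()):
    ax.scatter(weight, mpg)
    ax.set_title(f"{cyl} cylinders")
    ax.set_xlabel("weight")
axes[0].set_ylabel("mpg")
fig.tight_layout()
```

Sharing the x and y axes is the key design choice here: without it, each panel would rescale itself and the visual comparison across groups would be misleading.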

  77. Small multiples

  78. SPLOM
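A scatter-plot matrix (SPLOM) can be produced with pandas' `scatter_matrix`. This is a minimal sketch on made-up data; the column names and values are assumptions for the example.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy data.
df = pd.DataFrame({
    "mpg": [30.4, 22.8, 21.0, 19.2, 18.7, 14.3],
    "weight": [2.1, 2.3, 2.6, 3.2, 3.4, 3.6],
    "cyl": [4, 4, 6, 6, 8, 8],
})

# Every pair of columns gets a scatter plot; the diagonal shows histograms.
axes = scatter_matrix(df, diagonal="hist", figsize=(6, 6))
print(axes.shape)
```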

  79. Design Exercise Hands-On Exercise

  80. How do you feel about doing science?
      Interest             Before (%)   After (%)
      Excited              19           38
      Kind of interested   25           30
      OK                   40           14
      Not great            5            6
      Bored                11           12
      Data courtesy of Cole Nussbaumer

  81. Come up with multiple visualizations. Pen and Paper Only.

  82. Pie; Side-by-side bar

  83. Stacked bar, not very useful; Data Transposed Bar Chart

  84. Difference Bar Chart

  85. Slopegraph

  86. After the pilot program, 68% of kids expressed interest in science, compared to 44% going into the program.
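The 44% and 68% figures can be recovered from the survey table in the exercise. The sketch below assumes that "expressed interest" means the top two categories, which is an interpretation that matches the numbers rather than something the slides state.

```python
# Survey table from the design exercise (each column sums to 100).
before = {"Excited": 19, "Kind of interested": 25, "OK": 40,
          "Not great": 5, "Bored": 11}
after = {"Excited": 38, "Kind of interested": 30, "OK": 14,
         "Not great": 6, "Bored": 12}

# Assumption: "expressed interest" = Excited + Kind of interested.
interested = ["Excited", "Kind of interested"]
interest_before = sum(before[k] for k in interested)
interest_after = sum(after[k] for k in interested)

print(interest_before, interest_after)  # 44 68
```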

  87. Perceptual Effectiveness

  88. Stevens’ Power Law, 1961; J. Bertin, 1967; Cleveland / McGill, 1984; J. Mackinlay, 1986; Heer / Bostock, 2010

  89. How much longer? A B

  90. How much longer? A B (answer: 4x)

  91. How much steeper slope? A B

  92. How much steeper slope? A B (answer: 4x)

  93. How much larger area? A B

  94. How much larger area? A B (answer: 10x)

  95. How much darker? A B

  96. How much darker? A B (answer: 2x)
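Stevens' power law models perceived magnitude as a power of the physical magnitude, so a true ratio r is perceived roughly as r raised to an exponent that depends on the visual channel. The sketch below uses that form; the exponents (1.0 for length, 0.7 for area) are illustrative assumptions, not values taken from these slides, and `perceived_ratio` is a hypothetical helper.

```python
def perceived_ratio(true_ratio, exponent):
    """Perceived ratio under a power-law model: true_ratio ** exponent."""
    return true_ratio ** exponent


# Assumed, illustrative exponents (not from the slides).
ASSUMED_EXPONENTS = {"length": 1.0, "area": 0.7}

# Length: a 4x difference is perceived as about 4x.
print(perceived_ratio(4, ASSUMED_EXPONENTS["length"]))           # 4.0

# Area: a 10x difference is perceived as only about 5x.
print(round(perceived_ratio(10, ASSUMED_EXPONENTS["area"]), 2))  # 5.01
```

Under these assumptions, encoding a quantity as length conveys ratios more faithfully than encoding it as area, which is one way to read the quiz above.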
