Data Viz
April 2, 2020
Data Science, CSCI 1951A, Brown University
Instructor: Ellie Pavlick
HTAs: Josh Levin, Diane Mutako, Sol Zitter
Announcements
• Videos on if you can! Use the raise-hand feature for questions.
• Any questions/concerns logistically?
• Extra Office Hours tomorrow
Today
• Questions from previous lectures? (Dimensionality Reduction, Classification, Regularization)
• Data Viz tips and best practices
When do I do data viz during a project?
Hypothesis: CS students sleep less than Brown students in general
Viz #1: Quick side-by-side histogram of CS students’ sleep vs. the rest. Means + CIs
Run linear regression, control for various things, find large coefficient on whether student has two concentrations
Viz #2: Quick histograms (or box-whiskers maybe) of hours of sleep vs. number of concentrations
Viz #3: Quick histogram of number of concentrations for CS vs. non-CS students
Viz #4: Final polished visualizations for poster/paper/report
When do I do data viz during a project?
Hypothesis: CS students sleep less than Brown students in general
Viz #1: Quick side-by-side histogram of CS students’ sleep vs. the rest. Means + CIs
while not converged:
    Run linear regression, control for various things, find large coefficient on whether student has two concentrations
    Viz #ia: Quick histograms (or box-whiskers maybe) of hours of sleep vs. number of concentrations
    Viz #ib: Quick histogram of number of concentrations for CS vs. non-CS students
Viz #N+1: Final polished visualizations for poster/paper/report
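A Viz #1-style quick look can be sketched in a few lines. This is a minimal sketch, not the course's code: the sleep numbers are made up, `mean_ci` and `quick_hist` are hypothetical helper names, and the interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption that is only reasonable for moderately large samples.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Return (mean, low, high) for an approximate 95% confidence interval.

    Assumes a normal approximation: mean +/- z * (sample std / sqrt(n)).
    """
    n = len(values)
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


def quick_hist(cs_sleep, other_sleep):
    """Side-by-side histogram with each group's mean and CI marked."""
    import matplotlib.pyplot as plt  # imported here so mean_ci works without it

    fig, ax = plt.subplots()
    # Passing a list of datasets draws the bars side by side in each bin.
    ax.hist([cs_sleep, other_sleep], bins=8, label=["CS", "Everyone else"])
    for values, color in [(cs_sleep, "C0"), (other_sleep, "C1")]:
        mean, low, high = mean_ci(values)
        ax.axvline(mean, color=color, linestyle="--")  # group mean
        ax.axvspan(low, high, color=color, alpha=0.2)  # approximate 95% CI
    ax.set_xlabel("Hours of sleep per night")
    ax.set_ylabel("Number of students")
    ax.legend()
    return fig


# Made-up data, for illustration only.
cs_sleep = [5.5, 6.0, 6.5, 6.0, 7.0, 5.0, 6.5, 6.0]
other_sleep = [7.0, 7.5, 6.5, 8.0, 7.0, 7.5, 6.0, 8.5]
```

Calling `quick_hist(cs_sleep, other_sleep)` in a notebook is enough for this stage; the figure is for you, so nothing beyond axis labels and a legend is polished.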
When do I do data viz during a project?
• At the very start of analysis, to find out wth is going on in my data
• Periodically throughout, to vet the quantitative trends I am seeing
• At the very end of a project, to showcase the results
The first two are more important (matplotlib, Excel, whatever is easy). You are the main audience; the goal is to make sure you understand what you are looking at.
The last one gets the most attention, because it’s fun ;) (D3, etc.). Everyone else is the main audience; the goal is to make the point as clearly and concisely as possible.
So many bad figures… Diane, Maggie, Neil
My “three pillars”* of Data Viz
• Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort.
• Honesty: Don’t obfuscate the data or hide the process you used to come to your conclusions. Give people enough data so that they can disagree with you if they want to.
• Minimalism: Substance over style. Make your point concisely, without redundant or distracting information or ornamentation.
*:)
Ellie rants about culture for 2 seconds. Indulge me… “form follows function”
Great tangent to go on… Edward Tufte: dogma of data viz
My “three pillars” of Data Viz: Your figures should speak for themselves. The analysis should be understandable and your conclusions should be obviously supported, without too much effort.
Missing or Cryptic Labels
[Figure: “Learning curve” plot, shown first with unlabeled axes, then with the y-axis labeled “Classification Accuracy (%)” (0 to 100) and the x-axis labeled “Training Size” (ticks from 10 to 1000000).]
Skewed or Crunched Data
[Figure: Frequency vs. Age (20 to 100) for Population 1 and Population 2, shown first on a linear scale (0 to 10000), then as Log Frequency (0 to 10).]
Sometimes can use logs (but say you did so…)
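The log-scale fix above can be sketched as follows. This is a minimal sketch assuming base-10 logs and made-up bin counts; `log10_counts` is a hypothetical helper, and mapping empty bins to 0 is purely a plotting convenience chosen here, which is one more reason the axis label has to say what was done.

```python
import math


def log10_counts(counts):
    """Convert raw bin counts to log10 counts.

    Empty bins have no logarithm; they are returned as 0.0 so they can
    still be drawn. Note that a bin with exactly one item also maps to 0.0.
    """
    return [math.log10(c) if c > 0 else 0.0 for c in counts]


# Made-up bin counts per age bucket, for illustration only.
age_bins = ["0-20", "20-40", "40-60", "60-80", "80-100"]
population_1 = [10000, 8000, 6000, 3000, 500]
population_2 = [10, 40, 25, 5, 0]

log_pop_1 = log10_counts(population_1)
log_pop_2 = log10_counts(population_2)

# Say what you did in the axis label itself.
y_label = "log10(Frequency)"
```

In matplotlib, `ax.set_yscale("log")` is an alternative that puts the axis on a log scale while keeping the tick labels in the original units.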
Skewed or Crunched Data
[Figure: Frequency (0 to 90) vs. Age (20 to 100), then the same data restricted to ages 4 to 20 (Frequency 0 to 40).]
Sometimes can remove outliers (but say you did so…)
[Figure: the full chart alongside two separate charts, one for ages 4 to 20 and one for ages 90 to 100.]
Sometimes better to analyze separately. (Look at your data!)
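Dropping outliers and analyzing them separately both start the same way: split the data and record how much went where. This is a minimal sketch with made-up ages; `split_at` is a hypothetical helper, and the cutoff of 20 is only read off the example axes, not a recommended rule.

```python
def split_at(values, cutoff):
    """Split values into (main, outliers) around a cutoff, keeping both."""
    main = [v for v in values if v <= cutoff]
    outliers = [v for v in values if v > cutoff]
    return main, outliers


# Made-up ages: mostly children, plus a few much older people.
ages = [4, 5, 5, 6, 8, 8, 9, 12, 12, 13, 16, 18, 20, 91, 94, 96]

main, outliers = split_at(ages, cutoff=20)

# A caption that says what was set aside, so readers can judge for themselves.
caption = (
    f"Ages 20 and under (n={len(main)}); "
    f"{len(outliers)} of {len(ages)} values above 20 plotted separately."
)
print(caption)
```

Keeping `outliers` around, rather than filtering in place, makes it easy to plot the second group on its own axes.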
Skewed or Crunched Data
[Figure: one chart with both axes running 0 to 100, then the same data split across five smaller charts with the same axes.]
Sometimes better to split into multiple charts…
Chart/Data Type Mismatch
[Figure: “Company Earnings by Year (in millions)” drawn with one segment per year, 2012 to 2017 (values 1.3, 2.3, 1.7, 2.1, 2.1, 2.0).]
Not really interpretable as “parts of a whole”…
[Figure: “Company Earnings by Year (in millions)” plotted against year, 2012 to 2017, with the earnings axis running 1.3 to 2.3.]
Chart/Data Type Mismatch
[Figure: “Earnings Gap in Canada is Smaller”: Earnings (0 to 70) for US and Canada, with separate College and No College values.]
[Figure: “Earnings Gap in Canada is Smaller”: Earnings Gap (0 to 16) plotted directly for US and Canada.]
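If the claim is about a gap, compute the gap and plot that. This is a minimal sketch with made-up earnings figures (the underlying values behind the slide are not given); `earnings_gap` is a hypothetical helper.

```python
def earnings_gap(earnings):
    """Return {country: college earnings minus no-college earnings}."""
    return {
        country: groups["College"] - groups["No College"]
        for country, groups in earnings.items()
    }


# Made-up numbers, for illustration only.
earnings = {
    "US": {"College": 65, "No College": 50},
    "Canada": {"College": 60, "No College": 52},
}

gaps = earnings_gap(earnings)
print(gaps)  # {'US': 15, 'Canada': 8}
```

A bar chart of `gaps` has two bars instead of four, so the comparison named in the title is the one thing the reader sees.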
Clicker Question!
[Figure: chart titled “States I have lived in”: Years (0 to 18) for Michigan, Maryland, Pennsylvania, New York, Rhode Island.]
What is the biggest problem with this?
(a) Crunched/Skewed Data
(b) Missing/Cryptic Labels
(c) Chart/Data Type Mismatch
(d) It’s just ugly
My “three pillars” of Data Viz: Honesty. Don’t obfuscate the data or hide the process you used to come to your conclusions. Give people enough data so that they can disagree with you if they want to.