Introduction to Data Science
Winter Semester 2019/20
Oliver Ernst
TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik
Lecture Slides
Organizational Issues
• Module: Introduction to Data Science (M24, Einführung in Data Science)
• Class web page: www.tu-chemnitz.de/mathematik/numa/lehre/ds-2018/
• Lecturer: Oliver Ernst (oernst@math.tu-chemnitz.de)
• Class meets Mon 9:15 and Thu 13:45 in Rh70 Room B202.
• Course Assistant: Jan Blechschmidt (jan.blechschmidt@math.tu-chemnitz.de)
• Lab exercises Mon 13:45 Rh39/41 Room 738.
• 50 % of coursework must be handed in correctly to register for the exam.
• Oral exam, 30 minutes, at the end of the teaching term (February).
• Textbook: James, Witten, Hastie & Tibshirani. An Introduction to Statistical Learning. Springer, 2013.
Note: the majority of images appearing in these lecture slides are taken from this textbook; permission for their use for teaching is gratefully acknowledged.
Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 2 / 461
Contents
1 What is Data Science?
2 Learning Theory
  2.1 What is Statistical Learning?
  2.2 Assessing Model Accuracy
3 Linear Regression
  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors
4 Classification
  4.1 Overview of Classification
  4.2 Why Not Linear Regression?
  4.3 Logistic Regression
  4.4 Linear Discriminant Analysis
  4.5 A Comparison of Classification Methods
5 Resampling Methods
  5.1 Cross Validation
  5.2 The Bootstrap
6 Linear Model Selection and Regularization
  6.1 Subset Selection
  6.2 Shrinkage Methods
  6.3 Dimension Reduction Methods
  6.4 Considerations in High Dimensions
  6.5 Miscellanea
7 Nonlinear Regression Models
  7.1 Polynomial Regression
  7.2 Step Functions
  7.3 Regression Splines
  7.4 Smoothing Splines
  7.5 Generalized Additive Models
8 Tree-Based Methods
  8.1 Decision Tree Fundamentals
  8.2 Bagging, Random Forests and Boosting
9 Unsupervised Learning
  9.1 Principal Components Analysis
  9.2 Clustering Methods
1 What is Data Science?
What is Data Science?
Point of departure
• Explosion of data: sensor technology (e.g. weather); purchase histories (customer loyalty programs, fraud detection); soon every person on the planet (≈ 8 billion) will have a smartphone and generate GPS traces and trillions of photos each year; DNA sequencing . . .
  Nagel (2018):
  • Gigabyte (10^9): your hard disk
  • Terabyte (10^12): Facebook, 500 TB/day ¹
  • Petabyte (10^15): CERN Large Hadron Collider, 15 PB/year
  • Zettabyte (10^21): mobile network traffic 2016
  • Yottabyte (10^24): event analysis
  • Brontobyte (10^27): sensor data from the IoT (Internet of Things)
• Heterogeneity: besides being big, data are unstructured, noisy, heterogeneous, and not collected systematically.
• Enabling technologies: storage capacity; computing hardware; algorithms; 350 years of statistics; 100 years of numerical analysis.
¹ Facebook currently has 2.2 billion users worldwide; see "Can Mark Zuckerberg Fix Facebook Before It Breaks Democracy?", New Yorker, 09/2018.
What is Data Science?
Buzz words
• Data Science
• Big Data
• Data Mining
• Data Analysis
• Business Intelligence
• Analytics
What is Data Science?
Buzz words
[Figure: chart from trends.google.com]
What is Data Science?
Conway’s data science Venn diagram (2010) . . .
[Figure: Venn diagram; source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram]
. . . and variations
• Brendan Tierney (2012)
• Joel Grus (2013), at the time of the Snowden revelations
• Shelly Palmer (2015)
[Figures: Venn diagram variations; source: http://www.prooffreader.com/2016/09/battle-of-data-science-venn-diagrams.html]
What is Data Science?
. . . and from the twitterverse:
“A data scientist is a statistician who lives in San Francisco.”
“Data Science is statistics on a Mac.”
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.”
Josh Wills, Director of Data Science at Cloudera
“A data scientist is someone who is worse at statistics than any statistician and worse at software engineering than any software engineer.”
Will Cukierski, Data Scientist at Kaggle
https://twitter.com/cdixon/status/428914681911070720
What is Data Science?
Donoho’s first definition
David Donoho (2015) ² classifies the activities of Greater Data Science (GDS) into six divisions.
(GDS1) Data Gathering, Preparation and Exploration: experimental design; collect; reformat, treat anomalies, expose unexpected features.
(GDS2) Data Representation and Transformation: databases; represent e.g. acoustic, image, sensor, network data.
(GDS3) Computing with Data: software engineering; cluster computing; developing workflows.
(GDS4) Data Visualization and Presentation: classical EDA plots; high-dimensional or streaming data.
(GDS5) Data Modeling: generative vs. predictive modeling.
(GDS6) Science about Data Science: evaluate results, processes, workflows; analysis of methods.
² 50 Years of Data Science. Journal of Computational and Graphical Statistics 26 (2017).
Some Examples
Celestial motion
Brahe, Kepler & Newton:
• Tycho Brahe (1546–1601): Accurate and detailed observations of positions of celestial bodies over many years. (data collection)
• Johannes Kepler (1571–1630): Based on Brahe’s observations, derives laws of orbital motion. (model fitting)
• Isaac Newton (1643–1727): Derives Newtonian mechanics, from which Kepler’s laws follow. (deeper underlying truths)
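The "model fitting" step attributed to Kepler can be illustrated with a small sketch. It assumes a power law T = c · a^k between orbital period T and semi-major axis a, and estimates k by least squares on the logarithms. The planetary values are approximate modern figures, not Brahe's observations, and the function name fit_power_law is a hypothetical helper chosen for this illustration.

```python
import math

# Approximate modern values (assumption: NOT Brahe's original data).
# Each entry: (semi-major axis in astronomical units, orbital period in years).
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(pairs):
    """Least-squares fit of log T = log c + k * log a; returns (k, c)."""
    xs = [math.log(a) for a, _ in pairs]
    ys = [math.log(t) for _, t in pairs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution of simple linear regression in log-log space.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(y_mean - k * x_mean)
    return k, c


exponent, constant = fit_power_law(list(PLANETS.values()))
print(f"T ~ {constant:.3f} * a^{exponent:.3f}")
```

With these inputs the fitted exponent should come out close to 3/2, which is Kepler's third law (T² proportional to a³) recovered purely from data. Newton's mechanics then explains why the exponent takes that value.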