Lecture 1: Introduction
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
25th March 2019
What is Big Data?
What is Big Data?
▶ It's a huge topic in science! Almost 5 million hits on Google Scholar.
▶ Is it just a buzz word?
▶ Is it a cure to everything? See e.g. [1]
▶ Big Data - Big Problems?
  ▶ Big Data does not mean correct answers, see e.g. [2]
  ▶ Privacy concerns, see e.g. [3]
  ▶ Big Data is often not collected systematically, see e.g. [4]

[1] https://www.businessinsider.com/big-data-and-cancer-2015-9?r=US&IR=T&IR=T
[2] Lazer et al. (2014) The Parable of Google Flu: Traps in Big Data Analysis. Science 343(6176):1203–1205. doi 10.1126/science.1248506
[3] https://www.nytimes.com/2018/03/22/opinion/democracy-survive-data.html
[4] https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0#axzz2yQ2QQfQX
So Big Data is about size?
Yes and no. Note that size is a flexible term. Here mostly:
▶ Size as in: Number of observations (Big-n setting)
▶ Size as in: Number of variables (Big-p setting)
▶ Size as in: Number of observations and variables (Big-n/Big-p setting)
Is this all?
The Four Vs of Big Data
Four attributes commonly assigned to Big Data.
Volume Large scale of the data. Challenges are storage, computation, finding the interesting parts, …
Variety Different data types, data sources, many variables, …
Veracity Uncertainty of data due to convenience sampling, missing values, varying data quality, insufficient data cleaning/preparation, …
Velocity Data arriving at high speed that needs to be dealt with immediately (e.g. production plant, self-driving cars)
See also https://www.ibmbigdatahub.com/infographic/four-vs-big-data
How does statistics come into play?
Statistics as a science has always been concerned with…
▶ sampling designs
▶ modelling of data and underlying assumptions
▶ inference of parameters
▶ uncertainty quantification in estimated parameters/predictions
Focus is on the last three in this course.
Statistical challenges in Big Data
▶ Increase in sample size often leads to an increase in complexity and variety of data (p grows with n)
▶ More data ≠ less uncertainty
▶ A lot of classical theory is for fixed p and growing n
▶ Exploration and visualisation of Big Data can already require statistics
▶ Probability of extreme values: Unlikely results become much more likely with an increase in n
▶ Curse of dimensionality: Lots of space between data points in high-dimensional space (see the small numerical sketch below)
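To make the last point concrete, here is a small numerical sketch (my own illustration, not from the lecture): for points drawn uniformly in the unit cube in dimension p, pairwise distances concentrate as p grows, so the nearest point is barely closer than a typical one. The sample size and dimensions below are arbitrary choices.

```python
# Sketch: distance concentration in high dimensions (illustration only).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200  # arbitrary number of points

for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, p))   # n points uniform in the unit cube
    d = pdist(X)                   # all pairwise Euclidean distances
    # ratio -> 1 as p grows: "nearest" is barely nearer than average
    print(f"p = {p:4d}: min/mean pairwise distance = {d.min() / d.mean():.2f}")
```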
Course Overview & Expectations
A clarification upfront
This course focusses on statistics, not on the logistics of data processing.
▶ Understanding of algorithms, modelling assumptions and reasonable interpretations are our main goals.
▶ We will focus on well-understood methods supported by theory and their modifications for big data sets.
▶ No neural networks or deep learning. There are specialised courses for this (e.g. FFR135/FIM720 or TDA231/DIT380).
Themes
▶ Statistical learning/prediction: Regression and classification
▶ Unsupervised classification, i.e. clustering
▶ Variable selection, both explicit and implicit
▶ Data representations/Dimension reduction
▶ Large sample methods
Who's involved
▶ Felix Held, felix.held@chalmers.se
▶ Rebecka Jörnsten, jornsten@chalmers.se
▶ Juan Inda Diaz, inda@chalmers.se
A course in three parts
1. Lectures
2. Projects
3. Take-home exam
Course literature
Hastie, T, Tibshirani, R, and Friedman, J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media, LLC
▶ Covers a lot of statistical methods
▶ Freely available online
▶ Balanced presentation of theory and application
▶ Not always very detailed. Other suggestions on course website.
Projects
▶ Five (small) projects throughout the course
▶ Purpose:
  ▶ Hands-on experience in data analysis
  ▶ Further exploration of course topics
  ▶ Practice how to present statistical results
▶ You will work in groups and have at least one week per project
▶ Projects will be presented in class
▶ Attendance (and presenting) of project presentations is mandatory to be allowed to take the exam
▶ More information next week
Exam
▶ Take-home exam
▶ Structure:
  ▶ 50% of the exam/grade: Revise your projects individually
  ▶ 50% of the exam/grade: Additional data analysis/statistical tasks
▶ Exam will be handed out on 24th May
▶ Hard deadline on 14th June
Statistical Learning
Basics about random variables
▶ We will consider discrete and continuous random quantities
▶ Probability mass function (pmf) $p(k)$ for a discrete variable
▶ Probability density function (pdf) $p(x)$ for a continuous variable
Two important rules (and a consequence)
Marginalisation For a joint density $p(x, y)$ it holds that
  $p(x) = \sum_y p(x, y)$ or $p(x) = \int p(x, y) \,\mathrm{d}y$
Conditioning For a joint density $p(x, y)$ it holds that
  $p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$
Both rules together imply Bayes' law
  $p(x \mid y) = \dfrac{p(y \mid x)\,p(x)}{p(y)}$
A small numerical check follows below.
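Both rules and Bayes' law can be verified numerically on a discrete joint pmf. The toy probability table below is my own illustration, not data from the course.

```python
# Numerical check of marginalisation, conditioning and Bayes' law
# on an arbitrary discrete joint pmf p(x, y) (illustration only).
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],     # rows index x, columns index y
                 [0.25, 0.05, 0.30]])
assert np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)                   # marginalisation: p(x) = sum_y p(x, y)
p_y = p_xy.sum(axis=0)                   # p(y) = sum_x p(x, y)
p_y_given_x = p_xy / p_x[:, None]        # conditioning: p(y|x) = p(x, y) / p(x)
p_x_given_y = p_xy / p_y[None, :]        # p(x|y) = p(x, y) / p(y)

# Bayes' law: p(x|y) = p(y|x) p(x) / p(y)
bayes = p_y_given_x * p_x[:, None] / p_y[None, :]
print(np.allclose(bayes, p_x_given_y))   # True
```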
Expectation and variance
Expectations and variance depend on an underlying pdf/pmf. Notation:
▶ $\mathbb{E}_{p(x)}[f(x)] = \int f(x)\,p(x) \,\mathrm{d}x$
▶ $\mathrm{Var}_{p(x)}[f(x)] = \mathbb{E}_{p(x)}\big[(f(x) - \mathbb{E}_{p(x)}[f(x)])^2\big]$
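As a sanity check, both quantities can be approximated by Monte Carlo whenever sampling from $p(x)$ is possible; the choices $p(x) = N(0, 1)$ and $f(x) = x^2$ below are arbitrary illustrations, not part of the slides.

```python
# Monte Carlo approximation of E_{p(x)}[f(x)] and Var_{p(x)}[f(x)]
# for an arbitrary example: p(x) = N(0, 1), f(x) = x^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)   # samples from p(x)
fx = x ** 2                      # f(x)

print("E[f(x)]   ~", fx.mean())  # exact value is 1
print("Var[f(x)] ~", fx.var())   # exact value is 2
```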
What is Statistical Learning?
Learn a model from data by minimizing expected prediction error determined by a loss function.
▶ Model: Find a model that is suitable for the data
▶ Data: Data with known outcomes is needed (predictive modelling)
▶ Expected prediction error: Focus on quality of prediction
▶ Loss function: Quantifies the discrepancy between observed data and predictions
Linear regression - An old friend
[Figure: scatterplot of a response y against a single predictor x.]
Statistical Learning and Linear Regression
▶ Data: Training data consists of independent pairs $(y_i, \mathbf{x}_i)$, $i = 1, \dots, n$: observed response $y_i \in \mathbb{R}$ for predictors $\mathbf{x}_i \in \mathbb{R}^p$, and the design matrix $\mathbf{X}$ has rank $p + 1$
▶ Model: $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma^2)$ independent
▶ Loss function: Least squares solves standard linear regression problems, i.e. squared error loss
  $L(y, \hat{y}) = (y - \hat{y})^2$ with $\hat{y} = \mathbf{x}^T \underbrace{(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}}_{\hat{\boldsymbol{\beta}}}$
A small numerical sketch follows below.
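A minimal numerical sketch of the least-squares estimate above on simulated data; the dimensions, coefficients and noise level are arbitrary choices.

```python
# Least squares via the normal equations: beta_hat = (X^T X)^{-1} X^T y.
# Data is simulated; an intercept column gives X rank p + 1.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(beta_hat)                                # close to beta_true
print("mean squared error loss:", np.mean((y - X @ beta_hat) ** 2))
```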
Statistical decision theory for regression (I)
▶ Squared error loss between outcome $y$ and a prediction $f(\mathbf{x})$ dependent on the variable(s) $\mathbf{x}$:
  $L(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$
▶ Assume we want to find the "best" $f$ that can be learned from training data
▶ When a new pair of data $(y, \mathbf{x})$ from the same distribution (population) as the training data arrives, the expected prediction loss for a given $f$ is
  $J(f) = \mathbb{E}_{p(\mathbf{x}, y)}[L(y, f(\mathbf{x}))] = \mathbb{E}_{p(\mathbf{x})}\big[\mathbb{E}_{p(y \mid \mathbf{x})}[L(y, f(\mathbf{x}))]\big]$
▶ Define "best" by: $\hat{f} = \arg\min_f J(f)$
Statistical decision theory for regression (II)
▶ It can be derived (see blackboard; a short sketch is given below) that
  $\hat{f}(\mathbf{x}) = \mathbb{E}_{p(y \mid \mathbf{x})}[y]$,
  the expectation of $y$ given that $\mathbf{x}$ is fixed (conditional mean)
▶ Regression methods approximate the conditional mean
▶ For many observations $y$ with identical $\mathbf{x}$ we could use
  $\mathbb{E}_{p(y \mid \mathbf{x})}[y] \approx \frac{1}{|\{y_i : \mathbf{x}_i = \mathbf{x}\}|} \sum_{\mathbf{x}_i = \mathbf{x}} y_i$
▶ Probably more realistic to look for the $k$ closest neighbours of $\mathbf{x}$ in the training data, $N_k(\mathbf{x}) = \{\mathbf{x}_{i_1}, \dots, \mathbf{x}_{i_k}\}$. Then
  $\mathbb{E}_{p(y \mid \mathbf{x})}[y] \approx \frac{1}{k} \sum_{\mathbf{x}_{i_l} \in N_k(\mathbf{x})} y_{i_l}$
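A compact version of the blackboard argument, in the slide's notation (a sketch, not the full derivation): it suffices to minimise the inner expectation of $J(f)$ pointwise in $\mathbf{x}$ over the constant $c = f(\mathbf{x})$. Expanding around the conditional mean gives

$\mathbb{E}_{p(y \mid \mathbf{x})}\big[(y - c)^2\big] = \mathbb{E}_{p(y \mid \mathbf{x})}\big[(y - \mathbb{E}_{p(y \mid \mathbf{x})}[y])^2\big] + \big(\mathbb{E}_{p(y \mid \mathbf{x})}[y] - c\big)^2,$

since the cross term vanishes. This is smallest when $c = \mathbb{E}_{p(y \mid \mathbf{x})}[y]$, so $\hat{f}(\mathbf{x}) = \mathbb{E}_{p(y \mid \mathbf{x})}[y]$.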
Average of k neighbours
[Figure: scatterplot of y against x with k-nearest-neighbour averages shown for k = 2 and k = 5.]
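A minimal sketch of the estimate behind this kind of figure, i.e. averaging the responses of the $k$ nearest training points; the simulated data and the query point are my own choices, not the data in the plot.

```python
# k-nearest-neighbour regression: f_hat(x0) = mean of y_i over the k
# training points x_i closest to x0 (simulated 1-d data, illustration only).
import numpy as np

rng = np.random.default_rng(3)
x_train = rng.uniform(0.0, 2.5, size=50)
y_train = 1.5 + 0.3 * x_train + rng.normal(scale=0.2, size=50)

def knn_average(x0, k):
    """Average the responses of the k training points closest to x0."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

for k in (2, 5):
    print(f"k = {k}: f_hat(1.0) = {knn_average(1.0, k):.3f}")
```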
Back to linear regression
Linear regression is a model-based approach and assumes that the dependence of $y$ on $\mathbf{x}$ can be written as a weighted sum
  $\mathbb{E}_{p(y \mid \mathbf{x})}[y] \approx \mathbf{x}^T \boldsymbol{\beta}$
A simple example of classification
How do we classify a pair of new coordinates $\mathbf{x} = (x_1, x_2)$?
[Figure: scatterplot of $x_2$ against $x_1$ with training points coloured by class (1 or 2).]
k-nearest neighbour classifier (kNN)
▶ Find the $k$ predictors $N_k(\mathbf{x}) = \{\mathbf{x}_{i_1}, \dots, \mathbf{x}_{i_k}\}$ in the training sample that are closest to $\mathbf{x}$ in the Euclidean norm
▶ Majority vote: Assign $\mathbf{x}$ to the class that most predictors in $N_k(\mathbf{x})$ belong to (highest frequency)
A minimal sketch follows below.
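A minimal sketch of such a classifier, assuming two Gaussian classes similar to the figure on the previous slide; the training data, class labels and the choice $k = 15$ are arbitrary illustrations.

```python
# k-nearest-neighbour classification: Euclidean distances to all training
# points, then a majority vote among the k closest (illustration only).
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(loc=-1.5, size=(50, 2)),
                     rng.normal(loc=+1.5, size=(50, 2))])
labels = np.array([1] * 50 + [2] * 50)

def knn_classify(x_new, k=15):
    """Assign x_new to the most frequent class among its k nearest neighbours."""
    dist = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean norm
    nearest = labels[np.argsort(dist)[:k]]
    return Counter(nearest).most_common(1)[0][0]

print(knn_classify(np.array([0.5, -0.2])))
```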