Lecture 16: Summary and outlook
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
24th May 2019
The big topics
1. Statistical Learning
2. Supervised learning
  ▶ Classification
  ▶ Regression
3. Unsupervised learning
  ▶ Clustering
4. Data representations and dimension reduction
5. Large scale methods
The big data paradigms
▶ Small to medium sized data
  ▶ “Good old stats”
  ▶ Typical methods: k-nearest neighbour (kNN), linear and quadratic discriminant analysis (LDA and QDA), Gaussian mixture models, …
▶ High-dimensional data
  ▶ big-p paradigm
  ▶ Curse of dimensionality
  ▶ Typical methods: Feature selection, penalized regression and classification (Lasso, ridge regression, shrunken centroids, …), low-rank approximations (SVD, NMF), …
▶ Large scale data
  ▶ big-n paradigm (sometimes in combination with big-p)
  ▶ Typical methods: Random forests (with its big-n extensions), subspace clustering, low-rank approximations (randomized SVD), …
Statistical Learning
What is Statistical Learning?
Learn a model from data by minimizing expected prediction error determined by a loss function.
▶ Model: Find a model that is suitable for the data
▶ Data: Data with known outcomes is needed
▶ Expected prediction error: Focus on quality of prediction (predictive modelling)
▶ Loss function: Quantifies the discrepancy between observed data and predictions
Statistical Learning for Regression
▶ Theoretically best regression function for squared error loss
$$\hat f(\mathbf{x}) = \mathbb{E}_{p(y|\mathbf{x})}[y]$$
▶ Approximate (1) or make model-assumptions (2)
1. k-nearest neighbour regression
$$\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \frac{1}{k} \sum_{\mathbf{x}_l \in N_k(\mathbf{x})} y_l$$
2. linear regression (viewpoint: generalized linear models (GLM))
$$\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \mathbf{x}^T \boldsymbol{\beta}$$
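A minimal Python sketch of the k-nearest-neighbour approximation above (not part of the original slides); the function name knn_regress and the simulated data are illustrative choices.

import numpy as np

def knn_regress(X_train, y_train, x0, k=5):
    """Approximate E[y | x = x0] by the mean response of the k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distances to x0
    nearest = np.argsort(dists)[:k]                # indices of N_k(x0)
    return y_train[nearest].mean()                 # (1/k) * sum of y_l

# Tiny usage example on simulated data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
print(knn_regress(X, y, x0=np.array([1.0]), k=10))  # roughly sin(1) ≈ 0.84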
Statistical Learning for Classification
▶ Theoretically best classification rule for 0-1 loss and K possible classes (Bayes rule)
$$\hat c(\mathbf{x}) = \arg\max_{1 \le i \le K} p(i|\mathbf{x})$$
▶ Approximate (1) or make model-assumptions (2)
1. k-nearest neighbour classification
$$p(i|\mathbf{x}) \approx \frac{1}{k} \sum_{\mathbf{x}_l \in N_k(\mathbf{x})} 1(i_l = i)$$
2. Multi-class logistic regression
$$p(i|\mathbf{x}) = \frac{e^{\mathbf{x}^T \boldsymbol{\beta}^{(i)}}}{1 + \sum_{l=1}^{K-1} e^{\mathbf{x}^T \boldsymbol{\beta}^{(l)}}} \text{ for } 1 \le i \le K-1, \quad \text{and} \quad p(K|\mathbf{x}) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\mathbf{x}^T \boldsymbol{\beta}^{(l)}}}$$
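A hedged sketch of the multi-class logistic model above, assuming the coefficient vectors β^(1), …, β^(K−1) have already been fitted; class K is the reference class. The names multiclass_logistic_probs and betas are mine, not from the lecture.

import numpy as np

def multiclass_logistic_probs(x, betas):
    """betas has shape (K-1, p); returns the vector (p(1|x), ..., p(K|x))."""
    scores = np.exp(betas @ x)              # exp(x^T beta^(i)) for i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom   # last entry is the reference class K

# Usage: K = 3 classes, p = 2 features
betas = np.array([[1.0, -0.5],
                  [0.2,  0.8]])
p_hat = multiclass_logistic_probs(np.array([0.5, 1.0]), betas)
print(p_hat, p_hat.sum())                   # probabilities sum to 1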
Empirical error rates (I)
▶ Training error
$$R_{tr} = \frac{1}{n} \sum_{l=1}^{n} L(y_l, \hat f(\mathbf{x}_l \mid \mathcal{T})) \quad \text{where } \mathcal{T} = \{(y_l, \mathbf{x}_l) : 1 \le l \le n\}$$
▶ Test error
$$R_{te} = \frac{1}{m} \sum_{l=1}^{m} L(\tilde y_l, \hat f(\tilde{\mathbf{x}}_l \mid \mathcal{T}))$$
where $(\tilde y_l, \tilde{\mathbf{x}}_l)$ for $1 \le l \le m$ are new samples from the same distribution as $\mathcal{T}$, i.e. $p(\mathcal{T})$.
Splitting up the data
▶ Holdout method: If we have a lot of samples, randomly split available data into training set and test set
▶ c-fold cross-validation: If we have few samples
  1. Randomly split available data into c equally large subsets, so-called folds.
  2. By taking turns, use c − 1 folds as the training set and the last fold as the test set
Approximations of expected prediction error
▶ Use the test error for the hold-out method, i.e.
$$R_{te} = \frac{1}{m} \sum_{l=1}^{m} L(\tilde y_l, \hat f(\tilde{\mathbf{x}}_l \mid \mathcal{T}))$$
where $(\tilde y_l, \tilde{\mathbf{x}}_l)$ for $1 \le l \le m$ are the elements in the test set.
▶ Use the average test error for c-fold cross-validation, i.e.
$$R_{cv} = \frac{1}{n} \sum_{j=1}^{c} \sum_{(y_l, \mathbf{x}_l) \in \mathcal{F}_j} L(y_l, \hat f(\mathbf{x}_l \mid \mathcal{F}_{-j}))$$
where $\mathcal{F}_j$ is the $j$-th fold and $\mathcal{F}_{-j}$ is all data except fold $j$.
Note: For c = n, i.e. each fold contains exactly one observation, this is called leave-one-out cross-validation (LOOCV).
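The cross-validation error R_cv can be computed directly from its definition. The sketch below is not from the slides: it assumes generic fit/predict callables and squared-error loss, with least-squares linear regression used only as an example model.

import numpy as np

def cv_error(X, y, fit, predict, c=5, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, c)                 # c roughly equal folds F_j
    total = 0.0
    for j in range(c):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(c) if i != j])
        model = fit(X[train], y[train])            # fit on F_{-j}
        total += np.sum((y[test] - predict(model, X[test])) ** 2)
    return total / n                               # (1/n) * total loss

# Usage with least-squares linear regression as the model
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=100)
print(cv_error(X, y, fit, predict, c=5))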
Careful data splitting
▶ Note: For the approximations to be justifiable, test and training sets need to be identically distributed
▶ Splitting has to be done randomly
▶ If data is unbalanced, then stratification is necessary. Examples:
  ▶ Class imbalance
  ▶ Continuous outcome is observed more often in some intervals than others (e.g. high values more often than low values)
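For class imbalance, stratified splitting can be done with scikit-learn's StratifiedKFold, which keeps the class proportions roughly constant across folds. A small sketch, assuming scikit-learn is available; the simulated 10%-positive data is illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (rng.uniform(size=200) < 0.1).astype(int)      # ~10% positives: imbalanced

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # class proportions are (approximately) preserved in every test fold
    print(fold, np.bincount(y[test_idx], minlength=2))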
Bias-Variance Tradeoff
Bias-Variance Decomposition
$$\mathbb{E}_{p(\mathcal{T},\mathbf{x},y)}\big[(y - \hat f(\mathbf{x}))^2\big] = \mathbb{E}_{p(\mathbf{x})}\big[(f(\mathbf{x}) - \mathbb{E}_{p(\mathcal{T})}[\hat f(\mathbf{x})])^2\big] + \mathbb{E}_{p(\mathbf{x})}\big[\operatorname{Var}_{p(\mathcal{T})}[\hat f(\mathbf{x})]\big] + \sigma^2$$
i.e. Total expected prediction error = Bias² averaged over $\mathbf{x}$ + Variance of $\hat f$ averaged over $\mathbf{x}$ + Irreducible Error
[Figure: squared bias, variance, and total expected prediction error as a function of model complexity, with underfitting (high bias) at low complexity and overfitting (high variance) at high complexity.]
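The decomposition can be illustrated by simulation (my example, not from the slides): repeatedly drawing training sets from a known model and recording the k-NN prediction at a fixed point x0 gives Monte Carlo estimates of bias² and variance, with small k giving low bias/high variance and large k the opposite.

import numpy as np

def f(x):                                      # true regression function
    return np.sin(3 * x)

sigma, x0, n, reps = 0.3, 0.5, 100, 2000
rng = np.random.default_rng(0)

for k in (1, 25):
    preds = np.empty(reps)
    for r in range(reps):                      # draw a fresh training set T
        X = rng.uniform(-1, 1, n)
        y = f(X) + rng.normal(scale=sigma, size=n)
        nearest = np.argsort(np.abs(X - x0))[:k]
        preds[r] = y[nearest].mean()           # k-NN prediction at x0
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}  "
          f"total={bias2 + var + sigma**2:.4f}")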
Classification
Overview
1. k-nearest neighbours (Lecture 1)
2. 0-1 regression (Lecture 2)
  ▶ just an academic example - do not use in practice
3. Logistic regression (Lecture 2, both binary and multi-class; Lecture 11 for sparse case)
4. Nearest Centroids (Lecture 2) and shrunken centroids (Lecture 10)
5. Discriminant analysis (Lecture 2)
  ▶ Many variants: linear (LDA), quadratic (QDA), diagonal/Naive Bayes, regularized (RDA; Lecture 5), Fisher's LDA/reduced-rank LDA (Lecture 6), mixture DA (Lecture 8)
6. Classification and Regression Trees (CART) (Lecture 4)
7. Random Forests (Lecture 5 & 15)
Multiple angles on the same problem
1. Bayes rule: Approximate $p(i|\mathbf{x})$ and choose largest
  ▶ e.g. kNN or logistic regression
2. Model of the feature space: Assume models for $p(\mathbf{x}|i)$ and $p(i)$ separately
  ▶ e.g. discriminant analysis
3. Partitioning methods: Create explicit partitions of the feature space and assign each a class
  ▶ e.g. CART or Random Forests
Finding the parameters of DA
▶ Notation: Write $p(i) = \pi_i$ and consider them as unknown parameters
▶ Given data $(i_l, \mathbf{x}_l)$ the likelihood maximization problem is
$$\arg\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}} \prod_{l=1}^{n} N(\mathbf{x}_l \mid \boldsymbol{\mu}_{i_l}, \boldsymbol{\Sigma}_{i_l}) \, \pi_{i_l} \quad \text{subject to} \quad \sum_{i=1}^{K} \pi_i = 1.$$
▶ Can be solved using a Lagrange multiplier (try it!) and leads to
$$\hat\pi_i = \frac{n_i}{n}, \quad n_i = \sum_{l=1}^{n} 1(i_l = i), \quad \hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{i_l = i} \mathbf{x}_l, \quad \hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i - 1} \sum_{i_l = i} (\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)^T$$
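A direct NumPy translation of these estimates, as a sketch (my naming; X has shape (n, p) and labels holds the class of each row):

import numpy as np

def estimate_da_parameters(X, labels):
    classes = np.unique(labels)
    n = X.shape[0]
    pi_hat, mu_hat, Sigma_hat = {}, {}, {}
    for c in classes:
        Xc = X[labels == c]
        pi_hat[c] = Xc.shape[0] / n                 # n_i / n
        mu_hat[c] = Xc.mean(axis=0)                 # class mean
        Sigma_hat[c] = np.cov(Xc, rowvar=False)     # uses 1/(n_i - 1)
    return pi_hat, mu_hat, Sigma_hat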
Performing classification in DA
Bayes' rule implies the classification rule
$$\hat c(\mathbf{x}) = \arg\max_{1 \le i \le K} N(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \, \pi_i$$
Note that since log is strictly increasing this is equivalent to
$$\hat c(\mathbf{x}) = \arg\max_{1 \le i \le K} \delta_i(\mathbf{x})$$
where
$$\delta_i(\mathbf{x}) = \log N(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) + \log \pi_i = \log \pi_i - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{1}{2} \log |\boldsymbol{\Sigma}_i| \; (+\, C)$$
This is a quadratic function in $\mathbf{x}$.
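A sketch of the quadratic discriminant δ_i(x) above, dropping the constant C; it reuses the dictionaries from the estimation sketch (my code, not the lecture's).

import numpy as np

def qda_discriminant(x, mu, Sigma, pi):
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)            # numerically stable log|Sigma|
    return (np.log(pi)
            - 0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * logdet)

def qda_classify(x, pi_hat, mu_hat, Sigma_hat):
    # argmax over classes i of delta_i(x)
    return max(pi_hat, key=lambda c: qda_discriminant(x, mu_hat[c], Sigma_hat[c], pi_hat[c]))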
Different levels of complexity
▶ This method is called Quadratic Discriminant Analysis (QDA)
▶ Problem: Many parameters that grow quickly with the dimension p
  ▶ K − 1 for all $\pi_i$
  ▶ $p \cdot K$ for all $\boldsymbol{\mu}_i$
  ▶ $p(p + 1)/2 \cdot K$ for all $\boldsymbol{\Sigma}_i$ (most costly)
▶ Solution: Replace covariance matrices $\boldsymbol{\Sigma}_i$ by a pooled estimate
$$\hat{\boldsymbol{\Sigma}} = \sum_{i=1}^{K} \frac{n_i - 1}{n - K} \hat{\boldsymbol{\Sigma}}_i = \frac{1}{n - K} \sum_{i=1}^{K} \sum_{i_l = i} (\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)^T$$
▶ Simpler correlation and variance structure: All classes are assumed to have the same correlation structure between features
Performing classification in the simplified case
As before, consider $\hat c(\mathbf{x}) = \arg\max_{1 \le i \le K} \delta_i(\mathbf{x})$ where
$$\delta_i(\mathbf{x}) = \log \pi_i + \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i \; (+\, C)$$
This is a linear function in $\mathbf{x}$. The method is therefore called Linear Discriminant Analysis (LDA).
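The same idea with the pooled covariance gives the linear discriminant; a sketch under the same assumptions as the previous snippets.

import numpy as np

def pooled_covariance(X, labels):
    classes = np.unique(labels)
    n, p = X.shape
    Sigma = np.zeros((p, p))
    for c in classes:
        Xc = X[labels == c]
        Sigma += (Xc.shape[0] - 1) * np.cov(Xc, rowvar=False)  # class scatter
    return Sigma / (n - len(classes))               # divide by n - K

def lda_discriminant(x, mu, Sigma, pi):
    a = np.linalg.solve(Sigma, mu)                  # Sigma^{-1} mu_i
    return np.log(pi) + x @ a - 0.5 * mu @ a        # linear in x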
Even more simplifications
Other simplifications of the correlation structure are possible
▶ Ignore all correlations between features but allow different variances, i.e. $\boldsymbol{\Sigma}_i = \boldsymbol{\Lambda}_i$ for a diagonal matrix $\boldsymbol{\Lambda}_i$ (Diagonal QDA or Naive Bayes' Classifier)
▶ Ignore all correlations and make feature variances equal, i.e. $\boldsymbol{\Sigma}_i = \boldsymbol{\Lambda}$ for a diagonal matrix $\boldsymbol{\Lambda}$ (Diagonal LDA)
▶ Ignore correlations and variances, i.e. $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}_{p \times p}$ (Nearest Centroids adjusted for class frequencies $\pi_i$)
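In the simplest case Σ_i = σ²I the discriminant reduces to nearest centroids adjusted for class frequencies, as in this small sketch (a common noise variance sigma2 is assumed to be given):

import numpy as np

def nearest_centroid_classify(x, mu_hat, pi_hat, sigma2):
    # maximize log(pi_i) - ||x - mu_i||^2 / (2 sigma^2)
    scores = {c: np.log(pi_hat[c]) - 0.5 * np.sum((x - mu_hat[c]) ** 2) / sigma2
              for c in mu_hat}
    return max(scores, key=scores.get)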
Classification and Regression Trees (CART)
▶ Complexity of partitioning: Arbitrary Partition > Rectangular Partition > Partition from a sequence of binary splits
▶ Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce variability of values/classes in each region
[Figure: fitted classification tree (splits x2 >= 2.2, then x1 >= 3.5) and the corresponding rectangular partition of the (x1, x2) feature space.]
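A scikit-learn sketch of such axis-parallel splits (the lecture's figure is not reproduced here; feature names and the simulated rectangle are my choices):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 2))
y = ((X[:, 1] < 2.2) & (X[:, 0] < 3.5)).astype(int)    # class 1 inside one rectangle

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))   # the sequence of binary splits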