Lecture 11: Data representations
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
3rd May 2019
Recap: Elnet and group lasso
▶ The lasso sets variables exactly to zero either on a corner (all
but one) or along an edge (fewer).
▶ The elastic net similarly sets variables exactly to zero on a
corner or along an edge. In addition, the curved edges encourage coefficients to be closer together.
▶ The group lasso has actual information about groups of variables. It encourages whole groups to be zero simultaneously. Within a group, it encourages the coefficients to be as similar as possible.
1/23
The lasso and bias (I)
One problem with penalisation methods is the bias that is introduced by shrinkage.

Remember: For orthogonal predictors, i.e. X^T X = I_p, we have

β̂_lasso,k = ST(β̂_OLS,k, λ)

The least squares estimates are unbiased (i.e. E[β̂_OLS] = β_true) and therefore any non-linear transformation (like soft-thresholding) creates biased estimates. Shrinkage is good for variable/model selection but can decrease predictive performance.

Ideal case: Oracle procedure (Fan and Li, 2001)
Assume the true subset of non-zero coefficients is 𝒜 = {k : β_true,k ≠ 0}. An oracle procedure
1. Identifies the right variables, i.e. {k : β̂_k ≠ 0} = 𝒜
2. Has the optimal estimation rate: √n (β̂ − β_true) →_d N(0, Σ)
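A minimal numerical sketch of this soft-thresholding bias (illustrative; the coefficient values and λ are made up):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Soft-thresholding operator ST(beta, lambda)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# hypothetical OLS estimates for orthogonal predictors
beta_ols = np.array([0.1, -0.3, 2.0, -5.0])
lam = 0.5

print(soft_threshold(beta_ols, lam))
# approximately [0, 0, 1.5, -4.5]: small coefficients are set exactly to zero,
# large coefficients are shifted towards zero by lambda (the bias)
```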
2/23
The lasso and bias (II)
▶ We saw that if the irrepresentable condition is fulfilled (i.e. the correlation between relevant and irrelevant predictors is small), then the lasso does a good job in uncovering the true subset of variables
▶ The lasso, however, introduces bias that will not vanish asymptotically. It therefore produces inconsistent estimates (i.e. β̂_lasso ↛ β_true as n → ∞)
▶ Solution: Different penalty function? An ideal penalty would be
  ▶ singular at zero, leading to sparsity
  ▶ no penalty for large coefficients, leading to unbiased estimates away from zero
  ▶ convex and differentiable
▶ The smoothly clipped absolute deviation (SCAD) penalty
combines all these apart from convexity
3/23
Smoothly clipped absolute deviation (SCAD)
The penalty is defined by its derivative

p'_{λ,a}(θ) = λ ( 1(θ ≤ λ) + (aλ − θ)_+ / ((a − 1)λ) · 1(θ > λ) )

for θ > 0, λ ≥ 0, and a > 2. This integrates to

p_{λ,a}(θ) =
  λθ                                  for 0 < θ ≤ λ
  −(θ² − 2aλθ + λ²) / (2(a − 1))      for λ < θ ≤ aλ
  (a + 1)λ² / 2                       for θ > aλ
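A small sketch of this penalty and its derivative (illustrative; λ = 2 and a = 3.5 below match the SCAD panel in the figure):

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_{lam,a}(theta) of the SCAD penalty for theta > 0."""
    theta = np.abs(theta)
    return lam * ((theta <= lam) +
                  np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam) * (theta > lam))

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_{lam,a}(theta), piecewise as defined above."""
    theta = np.abs(theta)
    linear = lam * theta
    quadratic = -(theta ** 2 - 2 * a * lam * theta + lam ** 2) / (2 * (a - 1))
    constant = (a + 1) * lam ** 2 / 2
    return np.where(theta <= lam, linear,
                    np.where(theta <= a * lam, quadratic, constant))

theta = np.linspace(0, 10, 6)
print(scad_penalty(theta, lam=2, a=3.5))   # flattens out at (a + 1) * lam^2 / 2 = 9
```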
[Figure: the penalty functions λ|β| (Lasso) and p_{2,3.5}(|β|) (SCAD) plotted against β]
4/23
SCAD and bias (I)
The SCAD penalty is applied to each coefficient of β and (in the case of orthonormal predictors) leads to the estimate

β̂_SCAD,k =
  ST(β̂_OLS,k, λ)                                     for |β̂_OLS,k| ≤ 2λ
  ((a − 1) β̂_OLS,k − sign(β̂_OLS,k) aλ) / (a − 2)     for 2λ < |β̂_OLS,k| ≤ aλ
  β̂_OLS,k                                            for |β̂_OLS,k| > aλ
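A sketch of this thresholding rule (illustrative; assumes the orthonormal setting above):

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def scad_estimate(beta_ols, lam, a=3.7):
    """SCAD estimate computed from the OLS estimate (orthonormal predictors)."""
    b = np.asarray(beta_ols, dtype=float)
    middle = ((a - 1) * b - np.sign(b) * a * lam) / (a - 2)
    return np.where(np.abs(b) <= 2 * lam, soft_threshold(b, lam),
                    np.where(np.abs(b) <= a * lam, middle, b))

beta_ols = np.array([0.5, 1.5, 3.0, 8.0])
print(scad_estimate(beta_ols, lam=1.0, a=3.5))
# approximately [0, 0.5, 2.67, 8]: small values are soft-thresholded,
# values beyond a*lam are left unbiased
```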
[Figure: thresholding functions for Ridge, Lasso, and SCAD, plotted against the OLS estimate β]
5/23
SCAD and bias (II)
▶ Good news: The SCAD penalty gets rid of bias for larger
coefficients, but also creates sparsity
▶ It can be shown theoretically that if λ_n → 0 and √n λ_n → ∞ as n → ∞, then the SCAD penalty leads to an oracle procedure
▶ Bad news: The penalty is not convex and the standard optimisation approaches cannot be used. The authors of the method (Fan and Li, 2001) proposed an algorithm based on local approximations.
▶ Is there a way to stay in the realm of convex functions?
6/23
Adaptive Lasso (I)
Consider the weighted lasso problem

β̂_ada = argmin_β (1/2)‖y − Xβ‖²₂ + λ ∑_{k=1}^{p} w_k |β_k|

where w_k ≥ 0 for all k. This is called the adaptive lasso.
Surprisingly, it turns out that the right choice of weights can turn this procedure into an oracle procedure, i.e.
▶ large coefficients are unbiased
▶ the convergence rate is optimal
▶ the right variables are identified
7/23
Adaptive Lasso (II)
Let β* be a √n-consistent estimate of β_true, i.e. β* − β_true converges (in probability) to 0 at rate n^(−1/2), e.g. β̂_OLS or β̂_ridge.

Computation:
▶ For γ > 0 set w*_k = 1/|β*_k|^γ and X* = X diag(w*)^(−1).
▶ Solve the unweighted lasso problem for X* to get β*_lasso.
▶ Set β̂_ada = diag(w*)^(−1) β*_lasso, which is the solution.

It can be shown (Zou, 2006) that if β* is a √n-consistent estimate, λ_n/√n → 0, and λ_n n^((γ−1)/2) → ∞, then the adaptive lasso is an oracle procedure.
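A minimal sketch of these steps with scikit-learn (illustrative; the simulated data, the ridge pilot estimate, γ = 1, and the fixed λ are arbitrary choices, and no intercept is used):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.standard_normal(n)

# 1) root-n-consistent pilot estimate beta*, here a lightly penalised ridge fit
beta_star = Ridge(alpha=0.1, fit_intercept=False).fit(X, y).coef_

# 2) weights w*_k = 1 / |beta*_k|^gamma and rescaled design matrix X* = X diag(w*)^{-1}
gamma = 1.0
w = 1.0 / np.abs(beta_star) ** gamma
X_star = X / w

# 3) unweighted lasso on X*, then transform the solution back
beta_lasso_star = Lasso(alpha=0.1, fit_intercept=False).fit(X_star, y).coef_
beta_ada = beta_lasso_star / w
print(np.round(beta_ada, 2))
```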
8/23
Penalisation in GLMs
Penalisation can also be used in generalised linear models (GLMs), e.g. to perform sparse logistic regression. Given p(y|β, x), the log-likelihood of the model is

ℒ(β|y, X) = ∑_{i=1}^{n} log p(y_i | β, x_i)

Instead of penalising the minimisation of the residual sum of squares (RSS), the minimisation of the negative log-likelihood is penalised, i.e.

argmin_β −ℒ(β|y, X) + λ‖β‖₁

Note: If p(y|β, x) is Gaussian and the linear model y = Xβ + ε is assumed, this is equivalent to RSS minimisation.
9/23
Sparse logistic regression
Recall: For logistic regression with y_i ∈ {0, 1} it holds that

p(1|β, x) = exp(x^T β) / (1 + exp(x^T β))   and   p(0|β, x) = 1 / (1 + exp(x^T β))

and the penalised minimisation problem becomes

argmin_β −∑_{i=1}^{n} ( y_i x_i^T β − log(1 + exp(x_i^T β)) ) + λ‖β‖₁
▶ The minimisation problem is still convex, but non-linear in β. Iterative quadratic approximations combined with coordinate descent can be used to solve this problem.
▶ This is another way to perform sparse classification (like, e.g., nearest shrunken centroids); a small example follows below.
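A sketch of sparse (ℓ1-penalised) logistic regression with scikit-learn (illustrative; note that scikit-learn parameterises the penalty as C = 1/λ, so smaller C means stronger penalisation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 300, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -2.0, 1.5, -1.0, 1.0]          # only 5 relevant predictors
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# the 'liblinear' and 'saga' solvers support the l1 penalty
clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_), "out of", p)
```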
10/23
Sparse multi-class logistic regression
In multi-class logistic regression with y_i ∈ {1, …, K}, there is a matrix of coefficients B ∈ ℝ^{p×(K−1)} and it holds for k = 1, …, K−1 that

p(k|B, x) = exp(x^T β_k) / (1 + ∑_{l=1}^{K−1} exp(x^T β_l))   and   p(K|B, x) = 1 / (1 + ∑_{l=1}^{K−1} exp(x^T β_l))

▶ As in the two-class case, the absolute value of each entry of B can be penalised (sketched below).
▶ Another possibility is to use the group lasso on all coefficients for one variable, i.e. penalise with ‖B_k·‖₂ for k = 1, …, p.
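A sketch of the entry-wise ℓ1 variant with scikit-learn on a small digits data set (illustrative; scikit-learn's multinomial formulation fits a full p × K coefficient matrix instead of p × (K − 1), and the group lasso per variable is not available there):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)      # 8x8 digit images, 10 classes
X = X / 16.0                             # scale pixel values to [0, 1]

# multinomial logistic regression with an entry-wise l1 penalty ('saga' supports l1)
clf = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000).fit(X, y)
print("coefficient matrix shape:", clf.coef_.shape)          # (10, 64)
print("fraction of exact zeros:", np.mean(clf.coef_ == 0.0))
```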
11/23
Example for sparse multi-class logistic regression
MNIST-derived zip code digits (n = 7291, p = 256)
Sparse multi-class logistic regression was applied to the whole data set and the penalisation parameter was selected by 10-fold CV.
[Figure: fitted coefficient images for the digit classes 1–9. Orange tiles show positive coefficients and blue tiles show negative coefficients. The numbers below are the class averages.]
12/23
The lasso and significance testing
▶ Calculating p-values or performing significance testing on
sparse coefficient vectors is tricky
▶ Probabilistic point-of-view: Sparsity and exact zeros create a
point mass at zero, but otherwise coefficients have an approximately Gaussian distribution (spike-and-slab distribution)
▶ In practice: Naive bootstrap is not going to work well since
small changes in the data can change a coefficient from exact zero to non-zero
▶ Some other approaches (Wasserman and Roeder, 2009; Meinshausen et al., 2009) split the data once or multiple times into two subsets, perform variable selection on one part and use the other to perform least squares on the selected variables (a rough sketch follows below)
▶ Still an active research topic, but the R package hdi contains
some recent approaches
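A rough sketch of the single-split idea mentioned above (illustrative; this is a simplification, not the exact procedure of the cited papers or of the hdi package): the lasso selects variables on one half of the data and ordinary least squares provides p-values on the other half.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)

# split the data: selection on one half, inference on the other
X_sel, X_inf, y_sel, y_inf = train_test_split(X, y, test_size=0.5, random_state=0)

selected = np.flatnonzero(LassoCV(cv=5).fit(X_sel, y_sel).coef_)

# least squares with p-values on the held-out half, using only the selected variables
ols = sm.OLS(y_inf, sm.add_constant(X_inf[:, selected])).fit()
print("selected variables:", selected)
print("p-values:", np.round(ols.pvalues[1:], 4))     # drop the intercept's p-value
```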
13/23
Data representations
Goals of data representation
Dimension reduction while retaining important aspects of the data. Goals can be
▶ Visualisation
▶ Interpretability/Variable selection
▶ Data compression
▶ Finding a representation of the data that is more suitable to the posed question

Let us start with linear dimension reduction.
14/23
Short re-cap: SVD
The singular value decomposition (SVD) of a matrix X ∈ ℝ^{n×p}, n ≥ p, is

X = U D V^T

where U ∈ ℝ^{n×p} and V ∈ ℝ^{p×p} with U^T U = I_p and V^T V = V V^T = I_p, and D ∈ ℝ^{p×p} is diagonal. Usually d_11 ≥ d_22 ≥ ⋯ ≥ d_pp holds for the diagonal elements of D.
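A quick numpy check of these properties (illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 8
X = rng.standard_normal((n, p))

# economy-size SVD: U is n x p, d holds the p singular values, Vt = V^T is p x p
U, d, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(X, U @ np.diag(d) @ Vt))     # True: X = U D V^T
print(np.allclose(U.T @ U, np.eye(p)))         # True: U^T U = I_p
print(np.allclose(Vt @ Vt.T, np.eye(p)))       # True: V^T V = V V^T = I_p
print(np.all(d[:-1] >= d[1:]))                 # True: singular values sorted decreasingly
```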
15/23
SVD and best rank-q-approximation (I)
Write u_k and v_k for the columns of U and V, respectively. Then

X = U D V^T = ∑_{k=1}^{p} d_kk u_k v_k^T

where each term d_kk u_k v_k^T is a rank-1 matrix.

Best rank-q-approximation: For q < p

X_q = ∑_{k=1}^{q} d_kk u_k v_k^T

with approximation error

‖X − X_q‖²_F = ‖ ∑_{k=q+1}^{p} d_kk u_k v_k^T ‖²_F = ∑_{k=q+1}^{p} d²_kk

16/23
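A short numpy sketch of the truncation and its approximation error (illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 100, 20, 5
X = rng.standard_normal((n, p))

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# best rank-q approximation: keep only the q largest singular values
X_q = U[:, :q] @ np.diag(d[:q]) @ Vt[:q, :]

frob_error = np.linalg.norm(X - X_q, "fro") ** 2
print(np.isclose(frob_error, np.sum(d[q:] ** 2)))   # True: error = sum of the remaining d_kk^2
```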
SVD and best rank-q-approximation (II)
Notes
▶ X_q = ∑_{k=1}^{q} d_kk u_k v_k^T approximates X as a sum of layers
▶ This is the best possible rank-q-approximation
▶ How to choose q? Possibility: Look at the singular values and decide on a cut-off
▶ Interpretation is difficult since layers both add and subtract information
▶ U and V are not unique and could be made sparse
17/23
Alternative view of the best rank-q-approximation
The matrix X_q is the unique solution to the following minimisation problem (see notes on SVD on the website)

argmin_{rank(M)=q} ‖X − M‖²_F

Alternative view: Assume X stores samples as columns, i.e. X ∈ ℝ^{p×n}. Set H := D_q U_q^T ∈ ℝ^{q×n} and W := V_q ∈ ℝ^{p×q}, where U_q and V_q contain only the first q columns and D_q only the first q singular values.

Then X_q = W H is a solution of

argmin_{W ∈ ℝ^{p×q}, H ∈ ℝ^{q×n}} ‖X − W H‖²_F

Note: Whereas X_q is the unique minimiser of the first minimisation problem, the matrices W and H are not unique.
18/23
Low-rank matrix factorisation
Let q < min(p, n)

argmin_{W ∈ ℝ^{p×q}, H ∈ ℝ^{q×n}} ‖X − W H‖²_F

Interpretation
▶ The columns of W can be seen as basis vectors or coordinates of a subspace in feature space
▶ The columns of H provide coefficients that combine the basis vectors in W to the closest q-dimensional approximation of the respective observation
▶ In the framework of factor analysis the columns of W are called factors and the columns of H are called loadings
19/23
Notes on factor analysis
▶ Originated mostly in psychometrics with the idea that
factors could describe unobservable (latent) properties (e.g. intelligence)
▶ Typically assumes that W is orthogonal
▶ Even orthogonality of W does not ensure identifiability, since for an orthogonal matrix R ∈ ℝ^{q×q}
W'H' := (WR)(R^T H) = WH
and W' is orthogonal if W is (a small numerical check follows below)
▶ Every orthogonal matrix describes a rotation and when
applied to factors and loadings it is called a factor rotation
▶ Can be used to make either factors (varimax rotation) or
loadings (quartimax rotation) sparse
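A small numpy check of this non-identifiability (illustrative): an arbitrary orthogonal matrix R can be absorbed into the factors and loadings without changing their product.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, n = 10, 3, 40
W = np.linalg.qr(rng.standard_normal((p, q)))[0]   # orthogonal factor matrix, W^T W = I_q
H = rng.standard_normal((q, n))                    # loadings

R = np.linalg.qr(rng.standard_normal((q, q)))[0]   # a random orthogonal q x q matrix
W_rot, H_rot = W @ R, R.T @ H                      # rotated factors and loadings

print(np.allclose(W @ H, W_rot @ H_rot))           # True: the product W H is unchanged
print(np.allclose(W_rot.T @ W_rot, np.eye(q)))     # True: W' = W R is still orthogonal
```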
20/23
Non-negative Matrix Factorization (NMF)
Idea: We can add constraints to the low-rank matrix factorisation problem.

Non-negative matrix factorisation (NMF): Let q < min(p, n)

argmin_{W ∈ ℝ^{p×q}, H ∈ ℝ^{q×n}} ‖X − W H‖²_F   such that W ≥ 0, H ≥ 0
▶ W and H are again not uniquely identifiable.
▶ No analytic solution (the numerical problem is actually NP-hard in general)
▶ Choice of q not as straightforward as for the SVD
▶ Not directly applicable to data matrices with negative entries (can be solved through translation)
▶ So, why even bother?
21/23
Advantages of NMF
▶ Interpretability: As in the case of truncated SVD we are
adding layers, but now all layers are positive and each layer adds information
▶ Clustering interpretation:
  ▶ The columns of W can be interpreted as cluster centroids
  ▶ Cluster membership of each observation is determined by the columns of H
  ▶ Observation i is assigned to the cluster k such that H_ki > H_li for all l ≠ k
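A sketch of this with scikit-learn on a non-negative data set (illustrative; note that scikit-learn stores observations as rows, so its W and H play the transposed roles compared to the convention on these slides):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

X, y = load_digits(return_X_y=True)          # non-negative pixel intensities, samples as rows

model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)                   # (n_samples, 10): one coefficient vector per image
H = model.components_                        # (10, n_features): non-negative basis "layers"

# clustering interpretation: assign each observation to its largest coefficient
clusters = np.argmax(W, axis=1)
print("cluster sizes:", np.bincount(clusters))
```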
22/23
Take-home message
▶ Bias in lasso estimates can be bad for predictive strength.
Using modified penalties can help
▶ Linear dimension reduction helps to factorise matrices
into more interpretable components
▶ By adding non-negativity constraints to the matrix factorisation (NMF), each layer adds information and the factorisation gains a clustering interpretation