Lecture 11: Data representations
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
3rd May 2019
Recap: Elnet and group lasso
▶ The lasso sets variables exactly to zero either on a corner (all but one) or along an edge (fewer).
▶ The elastic net similarly sets variables exactly to zero on a corner or along an edge. In addition, the curved edges encourage coefficients to be closer together.
▶ The group lasso has actual information about groups of variables. It encourages whole groups to be zero simultaneously. Within a group, it encourages the coefficients to be as similar as possible.
The lasso and bias (I)
Shrinkage is good for variable/model selection but can decrease predictive performance. One problem with penalisation methods is the bias that is introduced by shrinkage.
Remember: For orthogonal predictors, i.e. X^T X = I_p, we have
    \hat{\beta}_{\mathrm{lasso},k} = \mathrm{ST}(\hat{\beta}_{\mathrm{OLS},k}, \lambda)
The least squares estimates are unbiased (i.e. E[β̂_OLS] = β_true) and therefore any non-linear transformation (like soft-thresholding) creates biased estimates.
Ideal case: Oracle procedure (Fan and Li, 2001)
Assume the true subset of non-zero coefficients is A = {k : β_true,k ≠ 0}. An oracle procedure
1. identifies the right variables, i.e. {k : β̂_k ≠ 0} = A
2. has the optimal estimation rate:
    \sqrt{n}\,(\hat{\beta}_A - \beta_{\mathrm{true},A}) \to_d N(0, \Sigma)
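Illustration (not from the slides): the soft-thresholding operator ST, and the bias it introduces, can be coded directly in R; the function name below is ad hoc.

# ST(z, lambda) = sign(z) * max(|z| - lambda, 0)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Under orthogonal predictors the lasso estimate is ST(beta_OLS, lambda):
# small coefficients are set exactly to zero, but large (clearly non-zero)
# coefficients are still moved towards zero by lambda -- the bias discussed above.
beta_ols <- c(-6, -0.5, 0.2, 3, 10)
soft_threshold(beta_ols, lambda = 1)   # -5  0  0  2  9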
The lasso and bias (II)
▶ We saw that if the irrepresentable condition is fulfilled (i.e. correlation between relevant and irrelevant predictors is small), then the lasso does a good job in uncovering the true subset of variables.
▶ The lasso however introduces bias that will not vanish asymptotically. It therefore produces inconsistent estimates (i.e. β̂_lasso ↛ β_true for n → ∞).
▶ Solution: Different penalty function? An ideal penalty would be
  ▶ singular at zero, leading to sparsity
  ▶ flat (no penalty) for large coefficients, leading to unbiased estimates away from zero
  ▶ convex and differentiable
▶ The smoothly clipped absolute deviation (SCAD) penalty combines all these apart from convexity.
Smoothly clipped absolute deviation (SCAD)
The penalty is defined by its derivative
    p'_{\lambda,a}(\theta) = \lambda \left( 1(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, 1(\theta > \lambda) \right)
for θ > 0, λ ≥ 0, and a > 2. This integrates to
    p_{\lambda,a}(\theta) =
    \begin{cases}
      \lambda\theta & 0 < \theta \le \lambda \\
      -\dfrac{\theta^2 - 2a\lambda\theta + \lambda^2}{2(a-1)} & \lambda < \theta \le a\lambda \\
      \dfrac{(a+1)\lambda^2}{2} & \theta > a\lambda
    \end{cases}
[Plot: the lasso penalty λ|β| (left) next to the SCAD penalty p_{2,3.5}(|β|) (right), both as functions of β.]
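A direct R transcription of the piecewise penalty may help to see its shape; the function name is ad hoc and a = 3.7 is the default recommended by Fan and Li (2001).

# SCAD penalty p_{lambda,a}(theta), vectorised over theta
scad_penalty <- function(theta, lambda, a = 3.7) {
  theta <- abs(theta)
  ifelse(theta <= lambda,
         lambda * theta,                                                          # linear, like the lasso
         ifelse(theta <= a * lambda,
                -(theta^2 - 2 * a * lambda * theta + lambda^2) / (2 * (a - 1)),   # quadratic blend
                (a + 1) * lambda^2 / 2))                                          # constant for large theta
}

scad_penalty(c(0.5, 1.5, 3, 10, 20), lambda = 1)
# the last two values are equal: no additional penalty for large coefficients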
SCAD and bias (I)
The SCAD penalty is applied to each coefficient of β and (in case of orthonormal predictors) leads to the estimate
    \hat{\beta}_{\mathrm{SCAD},k} =
    \begin{cases}
      \mathrm{ST}(\hat{\beta}_{\mathrm{OLS},k}, \lambda) & |\hat{\beta}_{\mathrm{OLS},k}| \le 2\lambda \\
      \dfrac{(a-1)\hat{\beta}_{\mathrm{OLS},k} - \mathrm{sign}(\hat{\beta}_{\mathrm{OLS},k})\, a\lambda}{a-2} & 2\lambda < |\hat{\beta}_{\mathrm{OLS},k}| \le a\lambda \\
      \hat{\beta}_{\mathrm{OLS},k} & |\hat{\beta}_{\mathrm{OLS},k}| > a\lambda
    \end{cases}
[Plot: the resulting thresholding functions for ridge, lasso, and SCAD, plotted against β̂_OLS.]
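The corresponding thresholding rule, again as an ad hoc R sketch for the orthonormal case:

# SCAD estimate applied coefficient-wise to the OLS estimates
scad_threshold <- function(z, lambda, a = 3.7) {
  st <- sign(z) * pmax(abs(z) - lambda, 0)        # soft-thresholding branch
  ifelse(abs(z) <= 2 * lambda,
         st,
         ifelse(abs(z) <= a * lambda,
                ((a - 1) * z - sign(z) * a * lambda) / (a - 2),
                z))                               # large coefficients are left untouched
}

beta_ols <- c(-6, -0.5, 0.2, 3, 10)
scad_threshold(beta_ols, lambda = 1)
# -6 and 10 are returned unchanged (no bias), -0.5 and 0.2 are set to zero,
# and 3 falls in the intermediate zone and is only mildly shrunk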
SCAD and bias (II)
▶ Good news: The SCAD penalty gets rid of bias for larger coefficients, but also creates sparsity.
▶ It can be shown theoretically that if λ_n → 0 and √n λ_n → ∞ when n → ∞, then the SCAD penalty leads to an oracle procedure.
▶ Bad news: The penalty is not convex and the standard optimization approaches cannot be used. The authors of the method (Fan and Li, 2001) proposed an algorithm based on local approximations (a sketch with an off-the-shelf implementation follows below).
▶ Is there a way to stay in the realm of convex functions?
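In practice one rarely implements this from scratch. Below is a sketch using the R package ncvreg, which fits SCAD-penalised regression via a non-convex coordinate-descent strategy rather than the original local approximations; the simulated data and all settings are illustrative assumptions.

library(ncvreg)   # assumed to be installed

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -2, 1.5, rep(0, p - 3))
y <- as.vector(X %*% beta_true + rnorm(n))

fit <- ncvreg(X, y, penalty = "SCAD", gamma = 3.7)   # gamma plays the role of a above
cvfit <- cv.ncvreg(X, y, penalty = "SCAD")           # lambda chosen by cross-validation
coef(cvfit)                                          # coefficients at the selected lambda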
Adaptive Lasso (I)
Consider the weighted lasso problem
    \hat{\beta}_{\mathrm{ada}} = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{k=1}^{p} w_k |\beta_k|
where w_k ≥ 0 for all k. This is called the adaptive lasso.
Surprisingly, it turns out that the right choice of weights can turn this procedure into an oracle procedure, i.e.
▶ the right variables are identified
▶ large coefficients are unbiased
▶ the convergence rate is optimal
Adaptive Lasso (II)
Computation:
Let β* be a √n-consistent estimate of β_true, i.e. β* − β_true converges (in probability) to 0 at rate n^{−1/2}, e.g. β̂_OLS or β̂_ridge.
▶ For δ > 0 set w*_k = 1/|β*_k|^δ and X* = X diag(w*)^{−1}.
▶ Solve the unweighted lasso problem for X* to get β̂*_lasso.
▶ Set β̂_ada = diag(w*)^{−1} β̂*_lasso, which is the solution.
It can be shown (Zou, 2006) that if β* is a √n-consistent estimate, λ_n/√n → 0, and λ_n n^{(δ−1)/2} → ∞, then the adaptive lasso is an oracle procedure.
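A minimal sketch of this recipe in R using glmnet, whose penalty.factor argument applies the weights w*_k directly (equivalent to the column rescaling above); the data and tuning choices are illustrative assumptions.

library(glmnet)   # assumed to be installed

set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, -1.5, 0, 0, 1, rep(0, p - 5))
y <- as.vector(X %*% beta_true + rnorm(n))

beta_init <- coef(lm(y ~ X))[-1]          # root-n-consistent pilot estimate (OLS, since n > p)
delta <- 1
w <- 1 / abs(beta_init)^delta             # adaptive weights w*_k = 1/|beta*_k|^delta

# weighted lasso: penalty.factor rescales lambda per coefficient
cvfit <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
coef(cvfit, s = "lambda.min")             # adaptive lasso coefficients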
Penalisation in GLMs
Penalisation can also be used in generalised linear models (GLMs), e.g. to perform sparse logistic regression.
Given p(y|β, x), the log-likelihood of the model is
    \mathcal{L}(\beta \mid y, X) = \sum_{i=1}^{n} \log\big(p(y_i \mid \beta, x_i)\big)
Instead of penalising the minimisation of the residual sum of squares (RSS), the minimisation of the negative log-likelihood is penalised, i.e.
    \arg\min_{\beta} \; -\mathcal{L}(\beta \mid y, X) + \lambda \|\beta\|_1
Note: If p(y|β, x) is Gaussian and the linear model y = Xβ + ε is assumed, this is equivalent to RSS minimisation.
Sparse logistic regression
Recall: For logistic regression with y_i ∈ {0, 1} it holds that
    p(1 \mid \beta, x) = \frac{\exp(x^T \beta)}{1 + \exp(x^T \beta)} \quad \text{and} \quad p(0 \mid \beta, x) = \frac{1}{1 + \exp(x^T \beta)}
and the penalised minimisation problem becomes
    \arg\min_{\beta} \; -\sum_{i=1}^{n} \Big( y_i x_i^T \beta - \log\big(1 + \exp(x_i^T \beta)\big) \Big) + \lambda \|\beta\|_1
▶ The minimisation problem is still convex, but non-linear in β. Iterative quadratic approximations combined with coordinate descent can be used to solve this problem (sketched below).
▶ Another way to perform sparse classification (like e.g. nearest shrunken centroids)
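In R, glmnet fits exactly this penalised binomial likelihood using the quadratic-approximation plus coordinate-descent strategy mentioned above; the simulated data below are an illustrative assumption.

library(glmnet)   # assumed to be installed

set.seed(1)
n <- 300; p <- 30
X <- matrix(rnorm(n * p), n, p)
eta <- 2 * X[, 1] - 1.5 * X[, 2]                           # only two relevant predictors
y <- rbinom(n, 1, 1 / (1 + exp(-eta)))

cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)   # L1-penalised logistic regression
coef(cvfit, s = "lambda.1se")                              # sparse coefficient vector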
Sparse multi-class logistic regression
In multi-class logistic regression with y_i ∈ {1, …, K}, there is a matrix of coefficients B ∈ ℝ^{p×(K−1)} and it holds for j = 1, …, K−1 that
    p(j \mid B, x) = \frac{\exp(x^T \beta_j)}{1 + \sum_{k=1}^{K-1} \exp(x^T \beta_k)}
and
    p(K \mid B, x) = \frac{1}{1 + \sum_{k=1}^{K-1} \exp(x^T \beta_k)}
▶ As in the two-class case, the absolute value of each entry in B can be penalised.
▶ Another possibility is to use the group lasso on all coefficients for one variable, i.e. penalise with ‖B_{k·}‖_2 for k = 1, …, p (both options are sketched below).
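A hedged sketch of both penalties with glmnet; note that glmnet parameterises the multinomial model with one coefficient vector per class (K vectors rather than K − 1), but the two penalisation strategies correspond to the two bullets above. Data are simulated for illustration.

library(glmnet)   # assumed to be installed

set.seed(1)
n <- 300; p <- 20; K <- 3
X <- matrix(rnorm(n * p), n, p)
y <- sample(1:K, n, replace = TRUE)

# entrywise L1 penalty on the coefficient matrix
fit_l1 <- glmnet(X, y, family = "multinomial")

# group lasso across classes for each variable (rows of the coefficient matrix)
fit_grp <- glmnet(X, y, family = "multinomial", type.multinomial = "grouped")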
Example for sparse multi-class logistic regression
MNIST-derived zip code digits (n = 7291, p = 256). Sparse multi-class logistic regression was applied to the whole data set and the penalisation parameter was selected by 10-fold CV.
[Figure: estimated coefficient images for the digits 0–9. Orange tiles show positive coefficients and blue tiles show negative coefficients. The numbers below are the class averages.]
The lasso and significance testing
▶ Calculating p-values or performing significance testing on sparse coefficient vectors is tricky.
▶ Probabilistic point-of-view: Sparsity and exact zeros create a point mass at zero, but otherwise coefficients have an approximately Gaussian distribution (spike-and-slab distribution).
▶ In practice: Naive bootstrap is not going to work well, since small changes in the data can change a coefficient from exact zero to non-zero.
▶ Some other approaches (Wasserman and Roeder, 2009; Meinshausen et al., 2009) split the data once or multiple times into two subsets, perform variable selection on one part, and use the other to perform least squares on the selected variables.
▶ Still an active research topic, but the R package hdi contains some recent approaches (see the sketch below).
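A sketch of the multi sample-splitting approach via hdi; the exact function and field names should be checked against the package documentation, and the data are simulated for illustration.

library(hdi)   # assumed to be installed

set.seed(1)
n <- 100; p <- 40
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)

# multi sample-splitting (Meinshausen et al., 2009): select on one half,
# compute least-squares p-values on the other, aggregate over many splits
fit_ms <- multi.split(X, y)
fit_ms$pval.corr[1:5]   # multiplicity-adjusted p-values for the first five variables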
Data representations
Goals of data representation
Dimension reduction while retaining important aspects of the data. Goals can be
▶ Visualisation
▶ Interpretability/Variable selection
▶ Data compression
▶ Finding a representation of the data that is more suitable to the posed question
Let us start with linear dimension reduction.
Short re-cap: SVD
The singular value decomposition (SVD) of a matrix X ∈ ℝ^{n×p}, n ≥ p, is
    X = U D V^T
where U ∈ ℝ^{n×p} and V ∈ ℝ^{p×p} with U^T U = I_p, V^T V = V V^T = I_p, and D ∈ ℝ^{p×p} is diagonal. Usually
    d_{11} ≥ d_{22} ≥ ⋯ ≥ d_{pp}
holds for the diagonal elements of D.
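In R, the decomposition is available through the base function svd(); a quick numerical check with illustrative dimensions:

set.seed(1)
n <- 6; p <- 4
X <- matrix(rnorm(n * p), n, p)

s <- svd(X)                                   # s$u is n x p, s$d the singular values, s$v is p x p
all.equal(X, s$u %*% diag(s$d) %*% t(s$v))    # TRUE: X = U D V^T
zapsmall(crossprod(s$u))                      # U^T U = I_p (up to rounding)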
SVD and best rank-r-approximation (I)
Write u_k and v_k for the columns of U and V, respectively. Then
    X = U D V^T = \sum_{k=1}^{p} d_{kk} \, u_k v_k^T
where each u_k v_k^T is a rank-1 matrix.
Best rank-r-approximation: For r < p
    X_r = \sum_{k=1}^{r} d_{kk} \, u_k v_k^T
with approximation error
    \|X - X_r\|_F^2 = \Big\| \sum_{k=r+1}^{p} d_{kk} \, u_k v_k^T \Big\|_F^2 = \sum_{k=r+1}^{p} d_{kk}^2
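The rank-r approximation and its error formula can be verified numerically; the dimensions and the choice of r below are arbitrary.

set.seed(1)
X <- matrix(rnorm(20 * 8), 20, 8)
s <- svd(X)

r <- 3
X_r <- s$u[, 1:r] %*% diag(s$d[1:r]) %*% t(s$v[, 1:r])   # keep the r leading singular triplets

sum((X - X_r)^2)                 # squared Frobenius norm of the error
sum(s$d[(r + 1):ncol(X)]^2)      # equals the sum of the discarded squared singular values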