Lecture 11: Data representations
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
3rd May 2019
Recap: Elnet and group lasso
▶ The lasso sets variables exactly to zero either on a corner (all
but one) or along an edge (fewer).
▶ The elastic net similarly sets variables exactly to zero on a
corner or along an edge. In addition, the curved edges encourage coefficients to be closer together.
▶ The group lasso has actual information about groups of variables. It encourages whole groups to be zero simultaneously. Within a group, it encourages the coefficients to be as similar as possible.
1/23
The lasso and bias (I)
One problem with penalisation methods is the bias that is introduced by shrinkage.

Remember: For orthogonal predictors, i.e. X^T X = I_p, we have

β̂_lasso,k = ST(β̂_OLS,k, λ)

The least squares estimates are unbiased (i.e. E[β̂_OLS] = β_true) and therefore any non-linear transformation (like soft-thresholding) creates biased estimates. Shrinkage is good for variable/model selection but can decrease predictive performance.

Ideal case: Oracle procedure (Fan and Li, 2001)
Assume the true subset of non-zero coefficients is 𝒜 = {k : β_true,k ≠ 0}. An oracle procedure
1. Identifies the right variables, i.e. {k : β̂_k ≠ 0} = 𝒜
2. Has the optimal estimation rate: √n (β̂ − β_true) →_d N(0, Σ)
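A minimal numerical sketch of this soft-thresholding bias (illustrative; the coefficient values and λ are made up):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Soft-thresholding operator ST(beta, lambda)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# hypothetical OLS estimates for orthogonal predictors
beta_ols = np.array([0.1, -0.3, 2.0, -5.0])
lam = 0.5

print(soft_threshold(beta_ols, lam))
# approximately [0, 0, 1.5, -4.5]: small coefficients are set exactly to zero,
# large coefficients are shifted towards zero by lambda (the bias)
```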
2/23
The lasso and bias (II)
▶ We saw that if the irrepresentable condition is fulfilled (i.e. the correlation between relevant and irrelevant predictors is small), then the lasso does a good job in uncovering the true subset of variables
▶ The lasso, however, introduces bias that will not vanish asymptotically. It therefore produces inconsistent estimates (i.e. β̂_lasso ↛ β_true as n → ∞)
▶ Solution: Different penalty function? An ideal penalty would be
  ▶ singular at zero, leading to sparsity
  ▶ no penalty for large coefficients, leading to unbiased estimates away from zero
  ▶ convex and differentiable
▶ The smoothly clipped absolute deviation (SCAD) penalty
combines all these apart from convexity
3/23
Smoothly clipped absolute deviation (SCAD)
The penalty is defined by its derivative

p'_{λ,a}(θ) = λ ( 1(θ ≤ λ) + (aλ − θ)_+ / ((a − 1)λ) · 1(θ > λ) )

for θ > 0, λ ≥ 0, and a > 2. This integrates to

p_{λ,a}(θ) =
  λθ                                  for 0 < θ ≤ λ
  −(θ² − 2aλθ + λ²) / (2(a − 1))      for λ < θ ≤ aλ
  (a + 1)λ² / 2                       for θ > aλ
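A small sketch of this penalty and its derivative (illustrative; λ = 2 and a = 3.5 below match the SCAD panel in the figure):

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_{lam,a}(theta) of the SCAD penalty for theta > 0."""
    theta = np.abs(theta)
    return lam * ((theta <= lam) +
                  np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam) * (theta > lam))

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_{lam,a}(theta), piecewise as defined above."""
    theta = np.abs(theta)
    linear = lam * theta
    quadratic = -(theta ** 2 - 2 * a * lam * theta + lam ** 2) / (2 * (a - 1))
    constant = (a + 1) * lam ** 2 / 2
    return np.where(theta <= lam, linear,
                    np.where(theta <= a * lam, quadratic, constant))

theta = np.linspace(0, 10, 6)
print(scad_penalty(theta, lam=2, a=3.5))   # flattens out at (a + 1) * lam^2 / 2 = 9
```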
[Figure: the penalty functions λ|β| (Lasso) and p_{2,3.5}(|β|) (SCAD) plotted against β]
4/23
SCAD and bias (I)
The SCAD penalty is applied to each coefficient of β and (in the case of orthonormal predictors) leads to the estimate

β̂_SCAD,k =
  ST(β̂_OLS,k, λ)                                     for |β̂_OLS,k| ≤ 2λ
  ((a − 1) β̂_OLS,k − sign(β̂_OLS,k) aλ) / (a − 2)     for 2λ < |β̂_OLS,k| ≤ aλ
  β̂_OLS,k                                            for |β̂_OLS,k| > aλ
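A sketch of this thresholding rule (illustrative; assumes the orthonormal setting above):

```python
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def scad_estimate(beta_ols, lam, a=3.7):
    """SCAD estimate computed from the OLS estimate (orthonormal predictors)."""
    b = np.asarray(beta_ols, dtype=float)
    middle = ((a - 1) * b - np.sign(b) * a * lam) / (a - 2)
    return np.where(np.abs(b) <= 2 * lam, soft_threshold(b, lam),
                    np.where(np.abs(b) <= a * lam, middle, b))

beta_ols = np.array([0.5, 1.5, 3.0, 8.0])
print(scad_estimate(beta_ols, lam=1.0, a=3.5))
# approximately [0, 0.5, 2.67, 8]: small values are soft-thresholded,
# values beyond a*lam are left unbiased
```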
[Figure: thresholding functions for Ridge, Lasso, and SCAD, plotted against the OLS estimate β]
5/23
SCAD and bias (II)
▶ Good news: The SCAD penalty gets rid of bias for larger
coefficients, but also creates sparsity
▶ It can be shown theoretically that if λ_n → 0 and √n λ_n → ∞ as n → ∞, then the SCAD penalty leads to an oracle procedure
▶ Bad news: The penalty is not convex and the standard optimisation approaches cannot be used. The authors of the method (Fan and Li, 2001) proposed an algorithm based on local approximations.
▶ Is there a way to stay in the realm of convex functions?
6/23
Adaptive Lasso (I)
Consider the weighted lasso problem

β̂_ada = argmin_β (1/2)‖y − Xβ‖²₂ + λ ∑_{k=1}^{p} w_k |β_k|

where w_k ≥ 0 for all k. This is called the adaptive lasso.
Surprisingly, it turns out that the right choice of weights can turn this procedure into an oracle procedure, i.e.
▶ large coefficients are unbiased
▶ the convergence rate is optimal
▶ the right variables are identified
7/23
Adaptive Lasso (II)
Let β* be a √n-consistent estimate of β_true, i.e. β* − β_true converges (in probability) to 0 at rate n^(−1/2), e.g. β̂_OLS or β̂_ridge.

Computation:
▶ For γ > 0 set w*_k = 1/|β*_k|^γ and X* = X diag(w*)^(−1).
▶ Solve the unweighted lasso problem for X* to get β*_lasso.
▶ Set β̂_ada = diag(w*)^(−1) β*_lasso, which is the solution.

It can be shown (Zou, 2006) that if β* is a √n-consistent estimate, λ_n/√n → 0, and λ_n n^((γ−1)/2) → ∞, then the adaptive lasso is an oracle procedure.
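A minimal sketch of these steps with scikit-learn (illustrative; the simulated data, the ridge pilot estimate, γ = 1, and the fixed λ are arbitrary choices, and no intercept is used):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.standard_normal(n)

# 1) root-n-consistent pilot estimate beta*, here a lightly penalised ridge fit
beta_star = Ridge(alpha=0.1, fit_intercept=False).fit(X, y).coef_

# 2) weights w*_k = 1 / |beta*_k|^gamma and rescaled design matrix X* = X diag(w*)^{-1}
gamma = 1.0
w = 1.0 / np.abs(beta_star) ** gamma
X_star = X / w

# 3) unweighted lasso on X*, then transform the solution back
beta_lasso_star = Lasso(alpha=0.1, fit_intercept=False).fit(X_star, y).coef_
beta_ada = beta_lasso_star / w
print(np.round(beta_ada, 2))
```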
8/23
Penalisation in GLMs
Penalisation can also be used in generalised linear models (GLMs), e.g. to perform sparse logistic regression. Given p(y|β, x), the log-likelihood of the model is

ℒ(β|y, X) = ∑_{i=1}^{n} log p(y_i | β, x_i)

Instead of penalising the minimisation of the residual sum of squares (RSS), the minimisation of the negative log-likelihood is penalised, i.e.

argmin_β −ℒ(β|y, X) + λ‖β‖₁

Note: If p(y|β, x) is Gaussian and the linear model y = Xβ + ε is assumed, this is equivalent to RSS minimisation.
9/23
Sparse logistic regression
Recall: For logistic regression with y_i ∈ {0, 1} it holds that

p(1|β, x) = exp(x^T β) / (1 + exp(x^T β))   and   p(0|β, x) = 1 / (1 + exp(x^T β))

and the penalised minimisation problem becomes

argmin_β −∑_{i=1}^{n} ( y_i x_i^T β − log(1 + exp(x_i^T β)) ) + λ‖β‖₁
▶ The minimisation problem is still convex, but non-linear in β. Iterative quadratic approximations combined with coordinate descent can be used to solve this problem.
▶ This is another way to perform sparse classification (like, e.g., nearest shrunken centroids); a small example follows below.
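A sketch of sparse (ℓ1-penalised) logistic regression with scikit-learn (illustrative; note that scikit-learn parameterises the penalty as C = 1/λ, so smaller C means stronger penalisation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 300, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -2.0, 1.5, -1.0, 1.0]          # only 5 relevant predictors
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# the 'liblinear' and 'saga' solvers support the l1 penalty
clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_), "out of", p)
```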
10/23
Sparse multi-class logistic regression
In multi-class logistic regression with y_i ∈ {1, …, K}, there is a matrix of coefficients B ∈ ℝ^{p×(K−1)} and it holds for k = 1, …, K−1 that

p(k|B, x) = exp(x^T β_k) / (1 + ∑_{l=1}^{K−1} exp(x^T β_l))   and   p(K|B, x) = 1 / (1 + ∑_{l=1}^{K−1} exp(x^T β_l))

▶ As in the two-class case, the absolute value of each entry of B can be penalised (sketched below).
▶ Another possibility is to use the group lasso on all coefficients for one variable, i.e. penalise with ‖B_k·‖₂ for k = 1, …, p.
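A sketch of the entry-wise ℓ1 variant with scikit-learn on a small digits data set (illustrative; scikit-learn's multinomial formulation fits a full p × K coefficient matrix instead of p × (K − 1), and the group lasso per variable is not available there):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)      # 8x8 digit images, 10 classes
X = X / 16.0                             # scale pixel values to [0, 1]

# multinomial logistic regression with an entry-wise l1 penalty ('saga' supports l1)
clf = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000).fit(X, y)
print("coefficient matrix shape:", clf.coef_.shape)          # (10, 64)
print("fraction of exact zeros:", np.mean(clf.coef_ == 0.0))
```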
11/23
Example for sparse multi-class logistic regression
MNIST-derived zip code digits (n = 7291, p = 256)
Sparse multi-class logistic regression was applied to the whole data set and the penalisation parameter was selected by 10-fold CV.
[Figure: fitted coefficient images for the digit classes 1–9. Orange tiles show positive coefficients and blue tiles show negative coefficients. The numbers below are the class averages.]
12/23
The lasso and significance testing
▶ Calculating p-values or performing significance testing on
sparse coefficient vectors is tricky
▶ Probabilistic point-of-view: Sparsity and exact zeros create a
point mass at zero, but otherwise coefficients have an approximately Gaussian distribution (spike-and-slab distribution)
▶ In practice: Naive bootstrap is not going to work well since
small changes in the data can change a coefficient from exact zero to non-zero
▶ Some other approaches (Wasserman and Roeder, 2009; Meinshausen et al., 2009) split the data once or multiple times into two subsets, perform variable selection on one part and use the other to perform least squares on the selected variables (a rough sketch follows below)
▶ Still an active research topic, but the R package hdi contains
some recent approaches
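A rough sketch of the single-split idea mentioned above (illustrative; this is a simplification, not the exact procedure of the cited papers or of the hdi package): the lasso selects variables on one half of the data and ordinary least squares provides p-values on the other half.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)

# split the data: selection on one half, inference on the other
X_sel, X_inf, y_sel, y_inf = train_test_split(X, y, test_size=0.5, random_state=0)

selected = np.flatnonzero(LassoCV(cv=5).fit(X_sel, y_sel).coef_)

# least squares with p-values on the held-out half, using only the selected variables
ols = sm.OLS(y_inf, sm.add_constant(X_inf[:, selected])).fit()
print("selected variables:", selected)
print("p-values:", np.round(ols.pvalues[1:], 4))     # drop the intercept's p-value
```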
13/23
Data representations
Goals of data representation
Dimension reduction while retaining important aspects of the data. Goals can be
▶ Visualisation
▶ Interpretability/Variable selection
▶ Data compression
▶ Finding a representation of the data that is more suitable to the posed question

Let us start with linear dimension reduction.
14/23
Short re-cap: SVD
The singular value decomposition (SVD) of a matrix X ∈ ℝ^{n×p}, n ≥ p, is

X = U D V^T

where U ∈ ℝ^{n×p} and V ∈ ℝ^{p×p} with U^T U = I_p and V^T V = V V^T = I_p, and D ∈ ℝ^{p×p} is diagonal. Usually d_11 ≥ d_22 ≥ ⋯ ≥ d_pp holds for the diagonal elements of D.
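A quick numpy check of these properties (illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 8
X = rng.standard_normal((n, p))

# economy-size SVD: U is n x p, d holds the p singular values, Vt = V^T is p x p
U, d, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(X, U @ np.diag(d) @ Vt))     # True: X = U D V^T
print(np.allclose(U.T @ U, np.eye(p)))         # True: U^T U = I_p
print(np.allclose(Vt @ Vt.T, np.eye(p)))       # True: V^T V = V V^T = I_p
print(np.all(d[:-1] >= d[1:]))                 # True: singular values sorted decreasingly
```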
15/23
SVD and best rank-q-approximation (I)
Write u_k and v_k for the columns of U and V, respectively. Then

X = U D V^T = ∑_{k=1}^{p} d_kk u_k v_k^T

where each term d_kk u_k v_k^T is a rank-1 matrix.

Best rank-q-approximation: For q < p

X_q = ∑_{k=1}^{q} d_kk u_k v_k^T

with approximation error

‖X − X_q‖²_F = ‖ ∑_{k=q+1}^{p} d_kk u_k v_k^T ‖²_F = ∑_{k=q+1}^{p} d²_kk

16/23
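A short numpy sketch of the truncation and its approximation error (illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 100, 20, 5
X = rng.standard_normal((n, p))

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# best rank-q approximation: keep only the q largest singular values
X_q = U[:, :q] @ np.diag(d[:q]) @ Vt[:q, :]

frob_error = np.linalg.norm(X - X_q, "fro") ** 2
print(np.isclose(frob_error, np.sum(d[q:] ** 2)))   # True: error = sum of the remaining d_kk^2
```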
SVD and best rank-q-approximation (II)
Notes
▶ X_q = ∑_{k=1}^{q} d_kk u_k v_k^T approximates X as a sum of layers
▶ This is the best possible rank-q-approximation
▶ How to choose q? Possibility: Look at the singular values and decide on a cut-off
▶ Interpretation is difficult since layers both add and subtract information
▶ U and V are not unique and could be made sparse
17/23
Alternative view of the best rank-q-approximation
The matrix X_q is the unique solution to the following minimisation problem (see notes on SVD on the website)

argmin_{rank(M)=q} ‖X − M‖²_F

Alternative view: Assume X stores samples as columns, i.e. X ∈ ℝ^{p×n}. Set H := D_q U_q^T ∈ ℝ^{q×n} and W := V_q ∈ ℝ^{p×q}, where U_q and V_q contain only the first q columns and D_q only the first q singular values.

Then X_q = W H is a solution of

argmin_{W ∈ ℝ^{p×q}, H ∈ ℝ^{q×n}} ‖X − W H‖²_F

Note: Whereas X_q is the unique minimiser of the first minimisation problem, the matrices W and H are not unique.
18/23
Low-rank matrix factorisation
Let q < min(p, n)

argmin_{W ∈ ℝ^{p×q}, H ∈ ℝ^{q×n}} ‖X − W H‖²_F

Interpretation
▶ The columns of W can be seen as basis vectors or coordinates of a subspace in feature space
▶ The columns of H provide coefficients that combine the basis vectors in W to the closest q-dimensional approximation of the respective observation
▶ In the framework of factor analysis the columns of W are called factors and the columns of H are called loadings
19/23
Notes on factor analysis
▶ Originated mostly in psychometrics with the idea that
factors could describe unobservable (latent) properties (e.g. intelligence)
▶ Typically assumes that W is orthogonal
▶ Even orthogonality of W does not ensure identifiability, since for an orthogonal matrix R ∈ ℝ^{q×q}
W'H' := (WR)(R^T H) = WH
and W' is orthogonal if W is (a small numerical check follows below)
▶ Every orthogonal matrix describes a rotation and when
applied to factors and loadings it is called a factor rotation
▶ Can be used to make either factors (varimax rotation) or
loadings (quartimax rotation) sparse
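A small numpy check of this non-identifiability (illustrative): an arbitrary orthogonal matrix R can be absorbed into the factors and loadings without changing their product.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, n = 10, 3, 40
W = np.linalg.qr(rng.standard_normal((p, q)))[0]   # orthogonal factor matrix, W^T W = I_q
H = rng.standard_normal((q, n))                    # loadings

R = np.linalg.qr(rng.standard_normal((q, q)))[0]   # a random orthogonal q x q matrix
W_rot, H_rot = W @ R, R.T @ H                      # rotated factors and loadings

print(np.allclose(W @ H, W_rot @ H_rot))           # True: the product W H is unchanged
print(np.allclose(W_rot.T @ W_rot, np.eye(q)))     # True: W' = W R is still orthogonal
```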
20/23
Non-negative Matrix Factorization (NMF)
Idea: We can add constraints to the low-rank matrix factorisation problem.

Non-negative matrix factorisation (NMF): Let q < min(p, n)

argmin_{W ∈ ℝ^{p×q}, H ∈ ℝ^{q×n}} ‖X − W H‖²_F   such that W ≥ 0, H ≥ 0
▶ W and H are again not uniquely identifiable.
▶ No analytic solution (the numerical problem is actually NP-hard in general)
▶ Choice of q not as straightforward as for the SVD
▶ Not directly applicable to data matrices with negative entries (can be solved through translation)
▶ So, why even bother?
21/23
Advantages of NMF
▶ Interpretability: As in the case of truncated SVD we are
adding layers, but now all layers are positive and each layer adds information
▶ Clustering interpretation:
  ▶ The columns of W can be interpreted as cluster centroids
  ▶ Cluster membership of each observation is determined by the columns of H
  ▶ Observation i is assigned to the cluster k such that H_ki > H_li for all l ≠ k
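A sketch of this with scikit-learn on a non-negative data set (illustrative; note that scikit-learn stores observations as rows, so its W and H play the transposed roles compared to the convention on these slides):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

X, y = load_digits(return_X_y=True)          # non-negative pixel intensities, samples as rows

model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)                   # (n_samples, 10): one coefficient vector per image
H = model.components_                        # (10, n_features): non-negative basis "layers"

# clustering interpretation: assign each observation to its largest coefficient
clusters = np.argmax(W, axis=1)
print("cluster sizes:", np.bincount(clusters))
```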
22/23
Take-home message
▶ Bias in lasso estimates can be bad for predictive strength.
Using modified penalties can help
▶ Linear dimension reduction helps to factorise matrices
into more interpretable components
▶ By adding non-negativity constraints to the matrix factorisation (NMF), each layer adds information and the factorisation gains a clustering interpretation