Lecture 5: Classification and dimension reduction
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
4th April 2019
Random Forests
1. Given a training sample with $p$ features, do for $b = 1, \dots, B$:
   1.1 Draw a bootstrap sample of size $n$ from the training data (with replacement)
   1.2 Grow a tree $T_b$ until each node reaches the minimal node size $n_{\min}$:
       1.2.1 Randomly select $m$ variables from the $p$ available
       1.2.2 Find the best splitting variable among these $m$
       1.2.3 Split the node
2. For a new $\mathbf{x}$ predict
   - Regression: $\hat{f}_{\mathrm{rf}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$
   - Classification: majority vote at $\mathbf{x}$ across trees

Note: Step 1.2.1 leads to less correlation between the trees built on bootstrapped data (a minimal code sketch follows below).
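A minimal sketch of this procedure using scikit-learn's RandomForestClassifier on synthetic data (the data and all parameter values are illustrative, not from the lecture): n_estimators plays the role of $B$, max_features of $m$, and min_samples_leaf of $n_{\min}$.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data: n = 200 samples, p = 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# n_estimators = B trees, max_features = m variables per split (step 1.2.1),
# min_samples_leaf acts as the minimal node size n_min (step 1.2)
rf = RandomForestClassifier(n_estimators=100, max_features=2,
                            min_samples_leaf=5, oob_score=True,
                            random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # out-of-bag accuracy (out-of-bag error is discussed below)
```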
Comparison of RF, Bagging and CART
Toy example:

$$y = x_1^2 + \epsilon$$

where $\epsilon \sim N(0, 1)$, $\mathbf{x} \sim N(\mathbf{0}, \Sigma)$, $\mathbf{x} \in \mathbb{R}^5$, $\Sigma_{ii} = 1$, $\Sigma_{ij} = 0.98$ for $i \neq j$.

Training and test data were sampled from the true model. Results for RF, bagged CART and a single CART, using $x_1, \dots, x_5$ as predictor variables ($n_{\mathrm{train}} = 50$, $n_{\mathrm{test}} = 100$).
[Figure: test error vs. number of trees for RF, bagging and a single CART]
Variable importance
1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure of variable importance.
2. Out-of-bag error:
   - During bootstrapping, for large enough $n$, each sample has a chance of about 63% of being selected (the inclusion probability is $1 - (1 - 1/n)^n \to 1 - e^{-1} \approx 0.632$).
   - As in bagging, the remaining samples are out-of-bag.
   - The out-of-bag samples for tree $T_b$ can be used as a test set for that particular tree, since they were not used during training. This results in a test error $E_0$.
   - Permute variable $j$ in the out-of-bag samples and calculate the test error again, giving $E_1^{(j)}$.
   - The increase in error $E_1^{(j)} - E_0 \ge 0$ serves as an importance measure for variable $j$ (a simplified sketch follows below).
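A simplified sketch of the permutation measure, assuming a forest rf fitted as above with oob_score=True. For brevity the permuted error is evaluated on the full training data rather than strictly on each tree's own out-of-bag samples, so it only approximates $E_1^{(j)}$:

```python
import numpy as np

def permutation_importance_sketch(rf, X, y, seed=0):
    """Increase in error after permuting each variable j.
    Simplified: uses the whole sample instead of per-tree OOB sets."""
    rng = np.random.default_rng(seed)
    e0 = 1.0 - rf.oob_score_                   # baseline OOB error E_0
    importances = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # break association with y
        e1_j = 1.0 - rf.score(Xp, y)           # approximates E_1^(j)
        importances[j] = e1_j - e0
    return importances
```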
RF applied to cardiovascular dataset
Monica dataset (http://thl.fi/monica, $n = 6367$, $p = 11$)

Predicting whether or not patients survive a 10-year period given a number of cardiovascular risk factors (class ratio 1.25 alive : 1 dead).
[Figure: out-of-bag error estimate vs. number of trees, shown per class (alive, dead) and overall]
[Figure: variable importance (decrease in alive accuracy, dead accuracy, mean accuracy and mean Gini) for age, angina, diabetes, hichol, highbp, hosp, premi, sex, smstat, stroke, yronset]
RF applied to heart disease dataset
South African coronary heart disease (SAheart) dataset

$n = 462$, $p = 9$, predicting cholesterol levels in the variable ldl.
[Figure: out-of-bag error (MSE) vs. number of trees]
[Figure: variable importance (decrease in mean accuracy and mean MSE) for adiposity, age, alcohol, chd, famhist, obesity, sbp, tobacco, typea]
[Figure: ldl plotted against each predictor: sbp, tobacco, adiposity, typea, obesity, alcohol, age, famhist, chd]
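As an illustration of how an out-of-bag MSE like the one on this slide could be computed, a sketch with scikit-learn; loading the SAheart predictors into X and the ldl values into y is assumed and not shown:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Assumes X (n x 9 predictor matrix) and y (ldl values) are already loaded
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
oob_mse = np.mean((y - rf.oob_prediction_) ** 2)  # out-of-bag MSE
print(oob_mse)
```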
Principal Component Analysis
Projection onto a subspace
Assume $\mathbf{x} \in \mathbb{R}^p$. Given orthonormal vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$, i.e. $\|\mathbf{v}_j\| = 1$ and $\mathbf{v}_i^T \mathbf{v}_j = 0$ for $i \neq j$, where $k < p$, the projection of $\mathbf{x}$ onto the $k$-dimensional linear subspace $V = \operatorname{span}(\mathbf{v}_1, \dots, \mathbf{v}_k)$ is

$$\tilde{\mathbf{x}} = \sum_{j=1}^{k} (\mathbf{x}^T \mathbf{v}_j)\, \mathbf{v}_j = \underbrace{\left( \sum_{j=1}^{k} \mathbf{v}_j \mathbf{v}_j^T \right)}_{\text{projection matrix}} \mathbf{x}$$

The projection is orthogonal, i.e. $(\mathbf{x} - \tilde{\mathbf{x}})^T \mathbf{v}_j = 0$ for all $\mathbf{v}_j$.
Rayleigh Quotient
Let $\mathbf{A} \in \mathbb{R}^{p \times p}$ be a symmetric matrix. For $\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^p$ define

$$R(\mathbf{x}) = \frac{\mathbf{x}^T \mathbf{A} \mathbf{x}}{\mathbf{x}^T \mathbf{x}}$$

$R(\mathbf{x})$ is called the Rayleigh quotient for $\mathbf{A}$.

Maximizing the Rayleigh quotient: The maximization problem

$$\max_{\mathbf{x}} R(\mathbf{x}) \quad \text{subject to} \quad \mathbf{x}^T \mathbf{x} = 1$$

is solved by a unit eigenvector $\mathbf{x}$ of $\mathbf{A}$ corresponding to the largest eigenvalue $\lambda$ of $\mathbf{A}$. Note: $-\mathbf{x}$ is also a solution.
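A quick numerical illustration of this result, using numpy's symmetric eigensolver (random matrix for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B + B.T                                       # a symmetric matrix

rayleigh = lambda x: (x @ A @ x) / (x @ x)

eigvals, eigvecs = np.linalg.eigh(A)              # eigenvalues in ascending order
x_star = eigvecs[:, -1]                           # unit eigenvector, largest eigenvalue
print(np.isclose(rayleigh(x_star), eigvals[-1]))  # True: maximum equals lambda_max
print(np.isclose(rayleigh(-x_star), eigvals[-1])) # True: -x is also a solution
```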
Principal Component Analysis (PCA) (I)
Goal: Given continuous data, find an orthogonal coordinate system such that the variance of the data is maximal along each direction.

Given data points $\mathbf{x}_1, \dots, \mathbf{x}_n$ and a unit vector $\mathbf{w}$, the variance of the data along $\mathbf{w}$ is

$$v(\mathbf{w}) = \sum_{i=1}^{n} \left( \mathbf{w}^T (\mathbf{x}_i - \bar{\mathbf{x}}) \right)^2 = (n-1)\, \mathbf{w}^T \hat{\Sigma} \mathbf{w}$$

where $\hat{\Sigma}$ is the empirical covariance matrix.

[Figure: a data cloud shown with Cartesian axes and principal component axes]
Principal Component Analysis (PCA) (II)
Direction with maximal variance: Find $\mathbf{w}$ such that

$$\max_{\mathbf{w}} v(\mathbf{w}) \quad \text{subject to} \quad \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = 1$$

- This is the same problem as maximizing the Rayleigh quotient for the matrix $\hat{\Sigma}$.
- The solution is the eigenvector $\mathbf{w}_1$ of $\hat{\Sigma}$ corresponding to the largest eigenvalue $\lambda_1$.

How do we find the other directions? Project the data onto the orthogonal complement of $\mathbf{w}_1$, i.e.

$$\tilde{\mathbf{x}}_i = (\mathbf{I}_p - \mathbf{w}_1 \mathbf{w}_1^T)\, \mathbf{x}_i$$

and repeat the procedure above (a sketch of this deflation idea follows below).
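A sketch of the deflation idea, illustrative only; slide (III) below gives the practical procedure via a single eigendecomposition:

```python
import numpy as np

def pc_directions_by_deflation(X, k):
    """First k PC directions by repeatedly taking the top eigenvector of the
    covariance and projecting the data onto its orthogonal complement."""
    Xc = X - X.mean(axis=0)                     # centre the data
    p = X.shape[1]
    directions = []
    for _ in range(k):
        S = np.cov(Xc, rowvar=False)            # empirical covariance
        _, eigvecs = np.linalg.eigh(S)
        w = eigvecs[:, -1]                      # direction of maximal variance
        directions.append(w)
        Xc = Xc @ (np.eye(p) - np.outer(w, w))  # deflate: x_i -> (I - w w^T) x_i
    return np.column_stack(directions)
```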
Principal Component Analysis (PCA) (III)
Computational procedure:

1. Centre and standardize the columns of the data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$
2. Calculate the empirical covariance matrix $\hat{\Sigma} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$
3. Determine the eigenvalues $\lambda_j$ and corresponding orthonormal eigenvectors $\mathbf{w}_j$ of $\hat{\Sigma}$ for $j = 1, \dots, p$ and order them such that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$
4. The vectors $\mathbf{w}_j$ give the directions of the principal components (PCs) $\mathbf{w}_j^T \mathbf{x}$ and the eigenvalues $\lambda_j$ are the variances along the PC directions

Note: Set $\mathbf{W} = (\mathbf{w}_1, \dots, \mathbf{w}_p)$ and $\mathbf{\Lambda} = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$; then $\hat{\Sigma} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$ and $\mathbf{W}^T \mathbf{W} = \mathbf{W} \mathbf{W}^T = \mathbf{I}_p$. A numpy sketch of steps 1-4 follows below.
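The four steps as a minimal numpy sketch (the function name and return values are my own):

```python
import numpy as np

def pca(X):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # 1. centre and standardize
    S = Xs.T @ Xs / (Xs.shape[0] - 1)                  # 2. empirical covariance
    eigvals, W = np.linalg.eigh(S)                     # 3. eigendecomposition ...
    order = np.argsort(eigvals)[::-1]                  # ... ordered decreasingly
    eigvals, W = eigvals[order], W[:, order]
    scores = Xs @ W                                    # 4. PCs w_j^T x per sample
    return eigvals, W, scores
```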
PCA and Dimension Reduction
Recall: For a matrix $\mathbf{A} \in \mathbb{R}^{p \times p}$ with eigenvalues $\lambda_1, \dots, \lambda_p$ it holds that $\operatorname{tr}(\mathbf{A}) = \sum_{j=1}^{p} \lambda_j$.

For the empirical covariance matrix $\hat{\Sigma}$ and the variance of the $j$-th feature $\operatorname{Var}[x_j]$,

$$\operatorname{tr}(\hat{\Sigma}) = \sum_{j=1}^{p} \operatorname{Var}[x_j] = \sum_{j=1}^{p} \lambda_j$$

is called the total variation. Using only the first $k < p$ principal components leads to

$$\frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \dots + \lambda_p} \cdot 100\%$$

of explained variance.
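Continuing the pca sketch above, the explained-variance fraction for the first k components:

```python
def explained_variance(eigvals, k):
    # eigvals sorted decreasingly; returns (lambda_1 + ... + lambda_k) / total variation
    return eigvals[:k].sum() / eigvals.sum()
```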
PCA and Dimension Reduction: Example (I)
Variant of the MNIST handwritten digits dataset ($n = 7291$, $16 \times 16$ greyscale images, i.e. $p = 256$)

Digit:     0    1    2    3    4    5    6    7    8    9
Frequency: 0.16 0.14 0.10 0.09 0.09 0.08 0.09 0.09 0.07 0.09

[Figure: example digit images (7, 3, 6, 6, 5, 4)]
PCA and Dimension Reduction: Example (II)
For standardized variables $\operatorname{tr}(\hat{\Sigma}) = p$.

Typical selection rule: keep components with $\lambda_j \ge \frac{1}{p} \operatorname{tr}(\hat{\Sigma})\ (= 1)$.

[Figure: scree plot of eigenvalues (log scale) against principal component index]
PCA and Dimension Reduction: Example (III)
Using the selection rule leads to 44 components. Using the projection

$$\tilde{\mathbf{x}} = \left( \sum_{j=1}^{44} \mathbf{w}_j \mathbf{w}_j^T \right) \mathbf{x}$$

creates a reconstruction of $\mathbf{x}$.

[Figure: original and reconstructed digit images (4, 7, 6, 5)]
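With W and the standardized data Xs from the pca sketch earlier, the reconstruction could look like this (k = 44 per the selection rule):

```python
k = 44
W_k = W[:, :k]             # first k PC directions
X_rec = Xs @ W_k @ W_k.T   # reconstruction (sum_j w_j w_j^T) x for every sample
```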
PCA and Dimension Reduction: Example (IV)
Projecting the digits onto the first two principal component directions gives a very clear distinction of digits 0 and 1.
[Figure: scatter plot of the digits projected onto $\mathbf{w}_1^T \mathbf{x}$ (PC1) and $\mathbf{w}_2^T \mathbf{x}$ (PC2), coloured by digit (0 or 1)]
Running QDA naively on all 256 variables to predict the digits does not work. Two reduced alternatives: the first two principal components, or the two most variable features across both classes.

Table 1: Misclassification rate (20-fold CV)

Method          Digit 0   Digit 1   Overall
QDA + PCA       0.000     0.010     0.005
LDA + PCA       0.044     0.000     0.024
LDA + max var   0.007     0.024     0.015
QDA + max var   0.015     0.028     0.021
Singular Value Decomposition
Singular Value Decomposition (SVD)
The singular value decomposition (SVD) of a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, $n \ge p$, is

$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T$$

where $\mathbf{U} \in \mathbb{R}^{n \times p}$ and $\mathbf{V} \in \mathbb{R}^{p \times p}$ with $\mathbf{U}^T \mathbf{U} = \mathbf{I}_p$ and $\mathbf{V}^T \mathbf{V} = \mathbf{V} \mathbf{V}^T = \mathbf{I}_p$, and $\mathbf{D} \in \mathbb{R}^{p \times p}$ is diagonal. Usually $d_{11} \ge d_{22} \ge \dots \ge d_{pp}$.

Note: Due to the orthogonality conditions for $\mathbf{U}$ and $\mathbf{V}$,

$$\mathbf{X}^T \mathbf{X} = \mathbf{V} \mathbf{D}^2 \mathbf{V}^T \qquad \mathbf{X} \mathbf{X}^T = \mathbf{U} \mathbf{D}^2 \mathbf{U}^T$$
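A numerical check of the definition with numpy (full_matrices=False gives exactly the $n \times p$ "thin" $\mathbf{U}$ used here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                        # n = 6 >= p = 3

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^T, d decreasing
print(np.allclose(U @ np.diag(d) @ Vt, X))         # True
print(np.allclose(U.T @ U, np.eye(3)))             # True: U^T U = I_p
print(np.allclose(Vt @ Vt.T, np.eye(3)))           # True: V orthogonal
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))  # True: X^T X = V D^2 V^T
```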
SVD and PCA
In PCA the empirical covariance matrix $\hat{\Sigma}$ is in focus, whereas the SVD works on the data matrix $\mathbf{X}$ directly.

Connection: For centred variables

$$\hat{\Sigma} = \frac{\mathbf{X}^T \mathbf{X}}{n - 1} = \frac{\mathbf{V} \mathbf{D} \mathbf{U}^T \mathbf{U} \mathbf{D} \mathbf{V}^T}{n - 1} = \mathbf{V} \left( \frac{\mathbf{D}^2}{n - 1} \right) \mathbf{V}^T$$

The PC directions are the columns of $\mathbf{V}$ and the eigenvalues of $\hat{\Sigma}$ are $d_{jj}^2 / (n - 1)$.

Note: This is how PCA is typically calculated. The SVD is a more general tool and is used in many other contexts as well.
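The connection as a sketch, equivalent to the eigendecomposition-based pca above up to the signs of the directions:

```python
import numpy as np

def pca_via_svd(X):
    Xc = X - X.mean(axis=0)                           # centred variables
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = d**2 / (X.shape[0] - 1)                 # eigenvalues of Sigma-hat
    W = Vt.T                                          # PC directions = columns of V
    return eigvals, W
```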
SVD and best rank-$k$ approximation / dimension reduction
Write $\mathbf{u}_j$ and $\mathbf{v}_j$ for the columns of $\mathbf{U}$ and $\mathbf{V}$, respectively. Then

$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T = \sum_{j=1}^{p} d_{jj} \underbrace{\mathbf{u}_j \mathbf{v}_j^T}_{\text{rank-1 matrix}}$$

Best rank-$k$ approximation: For $k < p$,

$$\mathbf{X}_k = \sum_{j=1}^{k} d_{jj}\, \mathbf{u}_j \mathbf{v}_j^T$$

with approximation error (in the Frobenius norm)

$$\|\mathbf{X} - \mathbf{X}_k\|_F^2 = \left\| \sum_{j=k+1}^{p} d_{jj}\, \mathbf{u}_j \mathbf{v}_j^T \right\|_F^2 = \sum_{j=k+1}^{p} d_{jj}^2$$
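A numerical check of the error identity (random matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
U, d, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Xk = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]   # best rank-k approximation
err = np.sum((X - Xk) ** 2)                  # squared Frobenius error
print(np.isclose(err, np.sum(d[k:] ** 2)))   # True: sum of discarded d_jj^2
```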
Connections to Discriminant Analysis
Discriminant Analysis and the Inverse Covariance Matrix
From PCA or SVD we get $\hat{\Sigma} = \mathbf{W} \mathbf{\Lambda} \mathbf{W}^T$ where $\mathbf{W}^T \mathbf{W} = \mathbf{W} \mathbf{W}^T = \mathbf{I}_p$ and $\lambda_{11} \ge \dots \ge \lambda_{pp} \ge 0$. Then

$$\hat{\Sigma}^{-1} = \mathbf{W} \mathbf{\Lambda}^{-1} \mathbf{W}^T = \mathbf{W} \mathbf{\Lambda}^{-1/2} \mathbf{\Lambda}^{-1/2} \mathbf{W}^T = (\hat{\Sigma}^{-1/2})^T\, \hat{\Sigma}^{-1/2}$$

where $(\mathbf{\Lambda}^{-1/2})_{jj} := 1 / \sqrt{\lambda_{jj}}$ and $\hat{\Sigma}^{-1/2} := \mathbf{\Lambda}^{-1/2} \mathbf{W}^T$. In DA the term involving the inverse covariance matrix is then

$$(\mathbf{x} - \hat{\boldsymbol{\mu}})^T \hat{\Sigma}^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}) = (\mathbf{x} - \hat{\boldsymbol{\mu}})^T (\hat{\Sigma}^{-1/2})^T \hat{\Sigma}^{-1/2} (\mathbf{x} - \hat{\boldsymbol{\mu}}) = \left( \mathbf{W}^T (\mathbf{x} - \hat{\boldsymbol{\mu}}) \right)^T \mathbf{\Lambda}^{-1} \left( \mathbf{W}^T (\mathbf{x} - \hat{\boldsymbol{\mu}}) \right) = \sum_{j=1}^{p} \frac{1}{\lambda_{jj}} (\tilde{x}_j - \tilde{\mu}_j)^2$$

where $\tilde{x}_j$ and $\tilde{\mu}_j$ are the coordinates of $\mathbf{W}^T \mathbf{x}$ and $\mathbf{W}^T \hat{\boldsymbol{\mu}}$.

Inversion of the eigenvalues can lead to numerical instability!
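The quadratic term computed through the eigendecomposition, as in the sum above (a sketch; tiny eigenvalues blow up the $1/\lambda_{jj}$ factors, which is exactly the instability noted on the slide):

```python
import numpy as np

def da_quadratic_term(x, mu, Sigma_hat):
    lam, W = np.linalg.eigh(Sigma_hat)   # Sigma_hat = W diag(lam) W^T
    z = W.T @ (x - mu)                   # rotated coordinates W^T (x - mu)
    return np.sum(z**2 / lam)            # sum_j (1/lambda_j) * z_j^2
```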
Regularized Discriminant Analysis (RDA)
The empirical covariance matrix can be stabilized:

$$\hat{\Sigma}_\gamma := \hat{\Sigma} + \gamma \mathbf{I}_p = \mathbf{W} (\mathbf{\Lambda} + \gamma \mathbf{I}_p) \mathbf{W}^T$$

where $\gamma > 0$ is a tuning parameter.

- Using $\hat{\Sigma}_\gamma$ in LDA is called regularized discriminant analysis (RDA).
- Instead of $1/\lambda_{jj}$, the values $1/(\lambda_{jj} + \gamma)$ are now involved (see the sketch after this list).
- For small $\lambda_{jj}$ this can lead to numerical stability, whereas large $\lambda_{jj}$ are not much affected.
- For large $\gamma$ the $\lambda_{jj}$ will have diminishing impact and RDA starts to behave like nearest centroids.
- RDA can be used with QDA as well by considering

$$\hat{\Sigma}_{\gamma,c} := \underbrace{\hat{\Sigma}_c}_{\text{QDA}} + \gamma \underbrace{\hat{\Sigma}}_{\text{LDA}}$$
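The stabilized inverse as a short sketch; note how $\gamma$ enters only through the eigenvalues:

```python
import numpy as np

def rda_inverse(Sigma_hat, gamma):
    lam, W = np.linalg.eigh(Sigma_hat)
    return W @ np.diag(1.0 / (lam + gamma)) @ W.T   # uses 1/(lambda_j + gamma)
```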
Take-home message
- Random forests are very flexible and can determine variable importance
- Principal component analysis gives a convenient decomposition of the data with respect to variance
- Singular value decomposition is a universal workhorse for dimension reduction and matrix approximation