SLIDE 1

Lecture 5: Classification and dimension reduction

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data, 4th April 2019

SLIDE 2

Random Forests

1. Given a training sample with $p$ features, do for $b = 1, \dots, B$:
   1.1 Draw a bootstrap sample of size $n$ from the training data (with replacement)
   1.2 Grow a tree $T_b$ until each node reaches minimal node size $n_{\min}$:
       1.2.1 Randomly select $m$ variables from the $p$ available
       1.2.2 Find the best splitting variable among these $m$
       1.2.3 Split the node
2. For a new $\mathbf{x}$ predict
   Regression: $\hat{f}_{\mathrm{rf}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$
   Classification: majority vote at $\mathbf{x}$ across trees

Note: Step 1.2.1 leads to less correlation between the trees built on bootstrapped data. A runnable sketch of the algorithm follows below.
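To make the steps concrete, here is a minimal sketch using scikit-learn's RandomForestClassifier; the dataset and all hyperparameter values are illustrative choices, not from the lecture.

```python
# Minimal random forest sketch (illustrative; dataset and settings are
# placeholder choices, not from the lecture).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,     # B: number of trees, each grown on a bootstrap sample
    max_features="sqrt",  # m: variables considered at each split (step 1.2.1)
    min_samples_leaf=1,   # controls the minimal node size n_min
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))  # majority vote over trees
```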

SLIDE 3

Comparison of RF, Bagging and CART

Toy example: $y = x_1^2 + \varepsilon$ where $\varepsilon \sim N(0, 1)$ and $\mathbf{x} \sim N(\mathbf{0}, \boldsymbol{\Sigma})$, $\mathbf{x} \in \mathbb{R}^5$, with $\Sigma_{ii} = 1$ and $\Sigma_{ij} = 0.98$ for $i \neq j$.

Training and test data were sampled from the true model. Results for RF, bagged CART and a single CART, using $x_1, \dots, x_5$ as predictor variables ($n_{\mathrm{tr}} = 50$, $n_{\mathrm{te}} = 100$).

[Figure: test error versus number of trees (100 to 300) for the three methods.]
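A possible way to reproduce the flavour of this comparison: the simulation below follows the stated model, while the estimator settings and seeds are assumptions.

```python
# Sketch of the toy comparison (assumed setup, not the lecture's exact code):
# y = x_1^2 + eps with five strongly correlated predictors.
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
p, n_tr, n_te = 5, 50, 100
Sigma = np.full((p, p), 0.98) + 0.02 * np.eye(p)   # Sigma_ii = 1, Sigma_ij = 0.98

def sample(n):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X[:, 0] ** 2 + rng.standard_normal(n)
    return X, y

X_tr, y_tr = sample(n_tr)
X_te, y_te = sample(n_te)

models = {
    "CART": DecisionTreeRegressor(random_state=0),
    "Bagging": BaggingRegressor(DecisionTreeRegressor(), n_estimators=300, random_state=0),
    "RF": RandomForestRegressor(n_estimators=300, max_features=2, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"{name}: test MSE = {mse:.2f}")
```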

SLIDE 4

Variable importance

1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure of variable importance.
2. Out-of-bag error (a sketch follows after this list):
   β–Ά During bootstrapping, for large enough $n$, each sample has a chance of about 63% of being selected.
   β–Ά For bagging, the remaining samples are out-of-bag.
   β–Ά These out-of-bag samples for tree $T_b$ can be used as a test set for that particular tree, since they were not used during training, resulting in a test error $E_0$.
   β–Ά Permute variable $j$ in the out-of-bag samples and calculate the test error again, giving $E_1^{(j)}$.
   β–Ά The increase in error $E_1^{(j)} - E_0 \geq 0$ serves as an importance measure for variable $j$.
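Both importance measures are available in scikit-learn. Note that permutation_importance below uses a held-out test set rather than per-tree out-of-bag samples, so it is an approximation of the OOB recipe on the slide; dataset and settings are again placeholders.

```python
# Permutation importance, a minimal sketch (illustrative, not lecture code).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: criterion decreases summed over all trees
print("impurity importance:", rf.feature_importances_[:5])

# Permutation importance on held-out data: E_1^(j) - E_0 per feature j
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importance:", result.importances_mean[:5])
```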

SLIDE 5

RF applied to cardiovascular dataset

Monica dataset (http://thl.fi/monica, $n = 6367$, $p = 11$)

Predicting whether or not patients survive a 10-year period given a number of cardiovascular risk factors (class ratio 1.25 alive : 1 dead).

[Figure: out-of-bag error estimates (overall OOB and per class, alive/dead) versus number of trees (50 to 200).]
[Figure: variable importance (decrease in alive accuracy, dead accuracy, mean accuracy and mean Gini) for age, angina, diabetes, hichol, highbp, hosp, premi, sex, smstat, stroke, yronset.]

SLIDE 6

RF applied to heart disease dataset

South African coronary heart disease (SAheart) dataset: $n = 462$, $p = 9$, predicting cholesterol levels in variable ldl.

[Figure: out-of-bag error (MSE) versus number of trees (100 to 300).]
[Figure: variable importance (decrease in mean accuracy and mean MSE) for adiposity, age, alcohol, chd, famhist, obesity, sbp, tobacco, typea.]
[Figure: scatter plots of ldl against each predictor: sbp, tobacco, adiposity, typea, obesity, alcohol, age, famhist (Absent/Present) and chd (CHD/No CHD).]

SLIDE 7

Principal Component Analysis

SLIDE 8

Projection onto a subspace

Assume $\mathbf{x} \in \mathbb{R}^p$. Given orthonormal vectors $\mathbf{b}_1, \dots, \mathbf{b}_m$, i.e. $\|\mathbf{b}_k\| = 1$ and $\mathbf{b}_k^T \mathbf{b}_l = 0$ for $k \neq l$, where $m < p$, the projection of $\mathbf{x}$ onto the $m$-dimensional linear subspace $V_m = \mathrm{span}(\mathbf{b}_1, \dots, \mathbf{b}_m)$ is

$$\hat{\mathbf{x}} = \sum_{k=1}^{m} (\mathbf{x}^T \mathbf{b}_k) \mathbf{b}_k = \underbrace{\left( \sum_{k=1}^{m} \mathbf{b}_k \mathbf{b}_k^T \right)}_{\text{projection matrix}} \mathbf{x}$$

The projection is orthogonal, i.e. $(\mathbf{x} - \hat{\mathbf{x}})^T \mathbf{b}_k = 0$ for all $\mathbf{b}_k$.
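A small numpy sketch of this projection; the random data and dimensions are arbitrary.

```python
# Projection onto a subspace spanned by orthonormal vectors (illustration).
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2

# Build m orthonormal vectors in R^p via a QR decomposition
B = np.linalg.qr(rng.standard_normal((p, m)))[0]   # columns b_1, ..., b_m

x = rng.standard_normal(p)
P = B @ B.T                 # projection matrix: sum_k b_k b_k^T
x_hat = P @ x               # projection of x onto span(b_1, ..., b_m)

# Orthogonality check: the residual is perpendicular to every b_k
print(np.allclose(B.T @ (x - x_hat), 0))   # True
```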
SLIDE 9

Rayleigh Quotient

Let $\mathbf{A} \in \mathbb{R}^{k \times k}$ be a symmetric matrix. For $\mathbf{0} \neq \mathbf{x} \in \mathbb{R}^k$ define

$$R(\mathbf{x}) = \frac{\mathbf{x}^T \mathbf{A} \mathbf{x}}{\mathbf{x}^T \mathbf{x}}$$

$R(\mathbf{x})$ is called the Rayleigh quotient for $\mathbf{A}$.

Maximizing the Rayleigh quotient: The maximization problem

$$\max_{\mathbf{x}} R(\mathbf{x}) \quad \text{subject to} \quad \mathbf{x}^T \mathbf{x} = 1$$

is solved by a unit eigenvector $\mathbf{x}$ of $\mathbf{A}$ corresponding to the largest eigenvalue $\lambda$ of $\mathbf{A}$. Note: $-\mathbf{x}$ is also a solution.
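A quick numerical check of this result (assumed example, not lecture code):

```python
# The top eigenvector maximizes the Rayleigh quotient (numerical check).
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M + M.T                            # symmetric matrix

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
v_max = eigvecs[:, -1]                 # unit eigenvector, largest eigenvalue

print(rayleigh(A, v_max), eigvals[-1])  # these two values agree

# Random vectors never exceed the largest eigenvalue:
xs = rng.standard_normal((1000, 4))
print(max(rayleigh(A, x) for x in xs) <= eigvals[-1] + 1e-12)   # True
```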

SLIDE 10

Principal Component Analysis (PCA) (I)

Goal: Given continuous data, find an orthogonal coordinate system such that the variance of the data is maximal along each direction.

Given data points $\mathbf{x}_1, \dots, \mathbf{x}_n$ and a unit vector $\mathbf{r}$, the variance of the data along $\mathbf{r}$ is

$$S(\mathbf{r}) = \sum_{i=1}^{n} \left( \mathbf{r}^T (\mathbf{x}_i - \bar{\mathbf{x}}) \right)^2 = (n - 1) \, \mathbf{r}^T \hat{\boldsymbol{\Sigma}} \mathbf{r}$$

where $\hat{\boldsymbol{\Sigma}}$ is the empirical covariance matrix.

[Figure: point cloud with Cartesian and principal component axes overlaid.]

SLIDE 11

Principal Component Analysis (PCA) (II)

Direction with maximal variance: Find $\mathbf{r}$ such that

$$\max_{\mathbf{r}} S(\mathbf{r}) \quad \text{subject to} \quad \|\mathbf{r}\|^2 = \mathbf{r}^T \mathbf{r} = 1$$

β–Ά This is the same problem as maximizing the Rayleigh quotient for the matrix $\hat{\boldsymbol{\Sigma}}$.
β–Ά The solution is the eigenvector $\mathbf{r}_1$ of $\hat{\boldsymbol{\Sigma}}$ corresponding to the largest eigenvalue $\lambda_1$.

How do we find the other directions? Project the data onto the orthogonal complement of $\mathbf{r}_1$, i.e.

$$\hat{\mathbf{x}}_i = (\mathbf{I}_p - \mathbf{r}_1 \mathbf{r}_1^T) \, \mathbf{x}_i$$

and repeat the procedure above.

SLIDE 12

Principal Component Analysis (PCA) (III)

Computational procedure (a numpy sketch follows below):

1. Centre and standardize the columns of the data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$.
2. Calculate the empirical covariance matrix $\hat{\boldsymbol{\Sigma}} = \frac{1}{n - 1} \mathbf{X}^T \mathbf{X}$.
3. Determine the eigenvalues $\lambda_k$ and corresponding orthonormal eigenvectors $\mathbf{r}_k$ of $\hat{\boldsymbol{\Sigma}}$ for $k = 1, \dots, p$ and order them such that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$.
4. The vectors $\mathbf{r}_k$ give the directions of the principal components (PCs) $\mathbf{r}_k^T \mathbf{x}$ and the eigenvalues $\lambda_k$ are the variances along the PC directions.

Note: Set $\mathbf{R} = (\mathbf{r}_1, \dots, \mathbf{r}_p)$ and $\mathbf{D} = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$; then $\hat{\boldsymbol{\Sigma}} = \mathbf{R} \mathbf{D} \mathbf{R}^T$ and $\mathbf{R}^T \mathbf{R} = \mathbf{R} \mathbf{R}^T = \mathbf{I}_p$.
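The sketch below mirrors the four steps with numpy on simulated data; names and data are illustrative.

```python
# PCA via the eigendecomposition of the empirical covariance matrix,
# following the four steps above (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 3))  # toy n x p data

# 1. Centre the columns (standardization skipped in this sketch)
Xc = X - X.mean(axis=0)

# 2. Empirical covariance matrix
S = Xc.T @ Xc / (Xc.shape[0] - 1)

# 3. Eigendecomposition, reordered so lambda_1 >= ... >= lambda_p
lam, R = np.linalg.eigh(S)
lam, R = lam[::-1], R[:, ::-1]

# 4. Principal component scores r_k^T x for all samples
scores = Xc @ R
print("variances along PCs:", scores.var(axis=0, ddof=1))  # equals lam
print("eigenvalues:        ", lam)
```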

SLIDE 13

PCA and Dimension Reduction

Recall: For a matrix $\mathbf{A} \in \mathbb{R}^{k \times k}$ with eigenvalues $\lambda_1, \dots, \lambda_k$ it holds that $\mathrm{tr}(\mathbf{A}) = \sum_{j=1}^{k} \lambda_j$.

For the empirical covariance matrix $\hat{\boldsymbol{\Sigma}}$ and the variance of the $k$-th feature $\mathrm{Var}[x_k]$,

$$\mathrm{tr}(\hat{\boldsymbol{\Sigma}}) = \sum_{k=1}^{p} \mathrm{Var}[x_k] = \sum_{k=1}^{p} \lambda_k$$

is called the total variation. Using only the first $m < p$ principal components leads to

$$\frac{\lambda_1 + \dots + \lambda_m}{\lambda_1 + \dots + \lambda_p} \cdot 100\%$$

of explained variance.
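Continuing the previous sketch, where `lam` holds the ordered eigenvalues, the explained-variance share of the first $m$ components is:

```python
# Explained variance of the first m PCs, continuing the sketch above
# (assumes `lam` holds the eigenvalues in decreasing order).
m = 2
explained = lam[:m].sum() / lam.sum() * 100
print(f"first {m} PCs explain {explained:.1f}% of the total variation")
```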

SLIDE 14

PCA and Dimension Reduction: Example (I)

Variant of the MNIST handwritten digits dataset ($n = 7291$, $16 \times 16$ greyscale images, i.e. $p = 256$).

Digit:      0     1     2     3     4     5     6     7     8     9
Frequency:  0.16  0.14  0.10  0.09  0.09  0.08  0.09  0.09  0.07  0.09

[Figure: example digit images: 7, 3, 6, 6, 5, 4.]

SLIDE 15

PCA and Dimension Reduction: Example (II)

For standardized variables $\mathrm{tr}(\hat{\boldsymbol{\Sigma}}) = p$. Typical selection rule: keep components with

$$\lambda_k \geq \frac{1}{p} \mathrm{tr}(\hat{\boldsymbol{\Sigma}}) \;(= 1)$$

[Figure: scree plot of the eigenvalues (log scale, roughly 0.1 to 10) against the principal component index (1 to 256).]

SLIDE 16

PCA and Dimension Reduction: Example (III)

Using the selection rule leads to 44 components. Using the projection

$$\hat{\mathbf{x}} = \left( \sum_{k=1}^{44} \mathbf{r}_k \mathbf{r}_k^T \right) \mathbf{x}$$

creates a reconstruction of $\mathbf{x}$.

[Figure: reconstructed digit images: 4, 7, 6, 5.]
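A rough analogue with scikit-learn, using its built-in 8x8 digits rather than the 16x16 dataset from the slides; the number of components is an arbitrary choice here.

```python
# PCA reconstruction of handwritten digits (illustrative analogue).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                            # n x 64 pixel matrix
pca = PCA(n_components=10).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))   # rank-10 reconstruction

print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
print("explained variance:", pca.explained_variance_ratio_.sum())
```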

SLIDE 17

PCA and Dimension Reduction: Example (IV)

Projecting the digits onto the first two principal component directions gives a very clear distinction of digits 0 and 1.

[Figure: scatter plot of $\mathbf{r}_1^T \mathbf{x}$ (PC1) against $\mathbf{r}_2^T \mathbf{x}$ (PC2), coloured by digit (0 or 1).]

Running QDA naively on all 256 variables to predict the digits does not work. Alternatives: project onto the first two PCs, or use the two most variable features across both classes.

Table 1: Misclassification rate (20-fold CV)

Method          Digit 0   Digit 1   Overall
QDA + PCA       0.000     0.010     0.005
LDA + PCA       0.044     0.000     0.024
LDA + max var   0.007     0.024     0.015
QDA + max var   0.015     0.028     0.021

SLIDE 18

Singular Value Decomposition

SLIDE 19

Singular Value Decomposition (SVD)

The singular value decomposition (SVD) of a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, $n \geq p$, is

$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T$$

where $\mathbf{U} \in \mathbb{R}^{n \times p}$ and $\mathbf{V} \in \mathbb{R}^{p \times p}$ with $\mathbf{U}^T \mathbf{U} = \mathbf{I}_p$ and $\mathbf{V}^T \mathbf{V} = \mathbf{V} \mathbf{V}^T = \mathbf{I}_p$, and $\mathbf{D} \in \mathbb{R}^{p \times p}$ is diagonal. Usually $d_{11} \geq d_{22} \geq \dots \geq d_{pp}$.

Note: Due to the orthogonality conditions for $\mathbf{U}$ and $\mathbf{V}$,

$$\mathbf{X} \mathbf{X}^T \mathbf{U} = \mathbf{U} \mathbf{D}^2, \qquad \mathbf{X}^T \mathbf{X} \mathbf{V} = \mathbf{V} \mathbf{D}^2$$
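A thin SVD with numpy, verifying the stated orthogonality conditions (illustrative example):

```python
# Thin SVD: full_matrices=False gives U of shape n x p and V of shape p x p.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 3))

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # d: d_11 >= ... >= d_pp
print(np.allclose(X, U @ np.diag(d) @ Vt))          # True: X = U D V^T
print(np.allclose(U.T @ U, np.eye(3)))              # True: U^T U = I_p
print(np.allclose(Vt @ Vt.T, np.eye(3)))            # True: V^T V = I_p
```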

SLIDE 20

SVD and PCA

In PCA the empirical covariance matrix $\hat{\boldsymbol{\Sigma}}$ is in focus, whereas SVD works on the data matrix $\mathbf{X}$ directly.

Connection: For centred variables

$$\hat{\boldsymbol{\Sigma}} = \frac{\mathbf{X}^T \mathbf{X}}{n - 1} = \frac{\mathbf{V} \mathbf{D} \mathbf{U}^T \mathbf{U} \mathbf{D} \mathbf{V}^T}{n - 1} = \mathbf{V} \left( \frac{\mathbf{D}^2}{n - 1} \right) \mathbf{V}^T$$

The PC directions are in $\mathbf{V}$ and the eigenvalues of $\hat{\boldsymbol{\Sigma}}$ are $d_{kk}^2 / (n - 1)$.

Note: This is how PCA is typically calculated. SVD is a more general tool and is used in many other contexts as well.
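A numerical check of this connection (assumed example):

```python
# The squared singular values of the centred data, divided by n - 1,
# equal the eigenvalues of the empirical covariance matrix.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 4))
Xc = X - X.mean(axis=0)                       # centred variables
n = Xc.shape[0]

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]   # descending order

print(np.allclose(d**2 / (n - 1), eigvals))   # True
```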

SLIDE 21

SVD and best rank-$q$ approximation / dimension reduction

Write $\mathbf{u}_k$ and $\mathbf{v}_k$ for the columns of $\mathbf{U}$ and $\mathbf{V}$, respectively. Then

$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^T = \sum_{k=1}^{p} d_{kk} \underbrace{\mathbf{u}_k \mathbf{v}_k^T}_{\text{rank-1 matrix}}$$

Best rank-$q$ approximation: For $q < p$

$$\mathbf{X}_q = \sum_{k=1}^{q} d_{kk} \mathbf{u}_k \mathbf{v}_k^T$$

with approximation error

$$\left\| \mathbf{X} - \mathbf{X}_q \right\|_2^2 = \left\| \sum_{k=q+1}^{p} d_{kk} \mathbf{u}_k \mathbf{v}_k^T \right\|_2^2 = \sum_{k=q+1}^{p} d_{kk}^2$$

SLIDE 22

Connections to Discriminant Analysis

SLIDE 23

Discriminant Analysis and the Inverse Covariance Matrix

From PCA or SVD we get $\hat{\boldsymbol{\Sigma}} = \mathbf{V} \mathbf{D} \mathbf{V}^T$ where $\mathbf{V}^T \mathbf{V} = \mathbf{V} \mathbf{V}^T = \mathbf{I}_p$ and $d_{11} \geq \dots \geq d_{pp} \geq 0$. Then

$$\hat{\boldsymbol{\Sigma}}^{-1} = \mathbf{V} \mathbf{D}^{-1} \mathbf{V}^T = \mathbf{V} \mathbf{D}^{-1/2} \mathbf{D}^{-1/2} \mathbf{V}^T = \left( \hat{\boldsymbol{\Sigma}}^{-1/2} \right)^T \hat{\boldsymbol{\Sigma}}^{-1/2}$$

where $(\mathbf{D}^{-1/2})_{kk} := 1 / \sqrt{d_{kk}}$ and $\hat{\boldsymbol{\Sigma}}^{-1/2} := \mathbf{D}^{-1/2} \mathbf{V}^T$.

In DA the term involving the inverse covariance matrix is then

$$(\mathbf{x} - \hat{\boldsymbol{\mu}})^T \hat{\boldsymbol{\Sigma}}^{-1} (\mathbf{x} - \hat{\boldsymbol{\mu}}) = \left( \mathbf{V}^T (\mathbf{x} - \hat{\boldsymbol{\mu}}) \right)^T \mathbf{D}^{-1} \left( \mathbf{V}^T (\mathbf{x} - \hat{\boldsymbol{\mu}}) \right) = \sum_{k=1}^{p} \frac{1}{d_{kk}} (\tilde{x}_k - \tilde{\mu}_k)^2$$

where $\tilde{\mathbf{x}} = \mathbf{V}^T \mathbf{x}$ and $\tilde{\boldsymbol{\mu}} = \mathbf{V}^T \hat{\boldsymbol{\mu}}$. The inverse of the eigenvalues can lead to numerical instability!
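A sketch of the rotated-coordinates computation (illustrative), confirming that the DA term reduces to the weighted sum above:

```python
# Mahalanobis-type DA term computed through the eigendecomposition.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 3))
Sigma = np.cov(A, rowvar=False)
mu = A.mean(axis=0)
x = A[0]

d, V = np.linalg.eigh(Sigma)
x_t, mu_t = V.T @ x, V.T @ mu            # rotated coordinates x~, mu~
term = np.sum((x_t - mu_t) ** 2 / d)     # sum_k (x~_k - mu~_k)^2 / d_kk

# Agrees with the direct computation via the inverse covariance matrix:
print(np.isclose(term, (x - mu) @ np.linalg.solve(Sigma, x - mu)))  # True
```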

SLIDE 24

Regularized Discriminant Analysis (RDA)

The empirical covariance matrix can be stabilized:

$$\hat{\boldsymbol{\Sigma}}_\lambda := \hat{\boldsymbol{\Sigma}} + \lambda \mathbf{I}_p = \mathbf{V} (\mathbf{D} + \lambda \mathbf{I}_p) \mathbf{V}^T$$

where $\lambda > 0$ is a tuning parameter.

β–Ά Using $\hat{\boldsymbol{\Sigma}}_\lambda$ in LDA is called regularized discriminant analysis (RDA).
β–Ά Instead of $1/d_{kk}$, the values $1/(d_{kk} + \lambda)$ are now involved.
β–Ά For small $d_{kk}$ this improves numerical stability, whereas large $d_{kk}$ are not much affected.
β–Ά For large $\lambda$ the $d_{kk}$ have diminishing impact and RDA approaches nearest centroids.
β–Ά RDA can be used with QDA as well by considering

$$\hat{\boldsymbol{\Sigma}}_{j,\lambda} := \underbrace{\hat{\boldsymbol{\Sigma}}_j}_{\text{QDA}} + \lambda \underbrace{\hat{\boldsymbol{\Sigma}}}_{\text{LDA}}$$

A small numerical sketch of the stabilization follows below.
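A minimal sketch of the stabilization, assuming a nearly singular empirical covariance; scikit-learn's LinearDiscriminantAnalysis offers a related shrinkage option, but the snippet below only shows the conditioning effect of the ridge term.

```python
# Effect of the ridge term lambda * I on the conditioning of Sigma-hat
# (illustration; data and lambda values are arbitrary).
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 5))       # fewer samples than features:
Sigma = np.cov(A.T)                   # the 5x5 covariance matrix is singular

for lam in [0.0, 0.1, 1.0]:
    Sigma_lam = Sigma + lam * np.eye(5)
    print(f"lambda = {lam}: condition number = {np.linalg.cond(Sigma_lam):.2e}")
```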

SLIDE 25

Take-home message

β–Ά Random forests are very flexible and can determine variable importance.
β–Ά Principal component analysis gives a convenient decomposition of the data with respect to variance.
β–Ά Singular value decomposition is a universal workhorse for dimension reduction.