Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics – Spring 2012
Overview
- Review: Conditional Probability
- LDA / QDA: Theory
- Fisher's Discriminant Analysis
- LDA: Example
- Quality control: Testset and Crossvalidation
- Case study: Text recognition

Conditional Probability
Events in the sample space: T = medical test is positive; C = patient has cancer
(Marginal) probability: P(T), P(C)
Conditional probability: P(T|C), P(C|T)
- P(T|C): the new sample space is the people with cancer; P(T|C) is large
- P(C|T): the new sample space is the people with a positive test; P(C|T) is small
Bayes theorem:
$$P(C \mid T) = \frac{P(T \mid C)\, P(C)}{P(T)}$$
with posterior P(C|T), prior P(C) and class conditional probability P(T|C)

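A small numerical sketch of Bayes theorem for the test/cancer setting. The prevalence, the probability of a positive test given cancer, and the false positive rate below are hypothetical numbers chosen only for illustration; they do not come from the slides. P(T) is obtained from the law of total probability.

```python
def posterior(prior, p_pos_given_cancer, p_pos_given_healthy):
    """Return P(C|T) from P(C), P(T|C) and P(T|not C) via Bayes theorem."""
    # Law of total probability: P(T) = P(T|C) P(C) + P(T|not C) P(not C)
    p_pos = p_pos_given_cancer * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_cancer * prior / p_pos


# Hypothetical values (assumptions, not from the slides)
p_cancer_given_pos = posterior(prior=0.01,
                               p_pos_given_cancer=0.9,
                               p_pos_given_healthy=0.05)
print(round(p_cancer_given_pos, 3))
```

With these assumed numbers P(T|C) is large (0.9) while P(C|T) stays small (roughly 0.15), because the prior P(C) is low.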
One approach to supervised learning
$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)} \propto P(C)\, P(X \mid C)$$
- P(C): prior / prevalence, the fraction of samples in that class
- P(X|C): find some estimate; assume $X \mid C \sim N(\mu_C, \Sigma_C)$
Bayes rule: choose the class where P(C|X) is maximal (the rule is "optimal" if all types of error are equally costly)
Special case: two classes (0/1)
- choose c=1 if P(C=1|X) > 0.5, or equivalently
- choose c=1 if the posterior odds P(C=1|X)/P(C=0|X) > 1
In practice: estimate $P(C)$, $\mu_C$, $\Sigma_C$

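The two-class rule based on posterior odds can be sketched for a single predictor with normal class conditionals. The means, standard deviations and priors below are hypothetical; the function names are introduced only for this illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_odds(x, prior1, mean0, sd0, mean1, sd1):
    """P(C=1|x) / P(C=0|x); the denominator P(x) cancels in the ratio."""
    numerator = prior1 * normal_pdf(x, mean1, sd1)
    denominator = (1.0 - prior1) * normal_pdf(x, mean0, sd0)
    return numerator / denominator


# Hypothetical classes: class 0 ~ N(0, 1), class 1 ~ N(2, 1)
odds_equal_prior = posterior_odds(1.5, prior1=0.5, mean0=0.0, sd0=1.0, mean1=2.0, sd1=1.0)
odds_rare_class = posterior_odds(1.5, prior1=0.1, mean0=0.0, sd0=1.0, mean1=2.0, sd1=1.0)
print("choose c=1" if odds_equal_prior > 1 else "choose c=0")
print("choose c=1" if odds_rare_class > 1 else "choose c=0")
```

The same observation is assigned to different classes depending on the prior: a low prevalence of class 1 pulls the posterior odds below 1.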
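QDA: Doing the math…
$$P(C \mid X) \propto P(C)\, P(X \mid C), \qquad P(X \mid C) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_C|}} \exp\left(-\tfrac{1}{2} (x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C)\right)$$
Use the fact: maximizing P(C|X) is equivalent to maximizing log P(C|X)
$$\delta_c(x) = \log P(C \mid X) = \log P(C) + \log P(X \mid C) + \text{const}$$
$$= \log P(C) - \tfrac{1}{2} \log |\Sigma_C| - \tfrac{1}{2} (x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C) + \text{const}$$
- $\log P(C)$: prior
- $-\tfrac{1}{2} \log |\Sigma_C|$: additional term
- $(x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C)$: squared Mahalanobis distance
- const: terms that do not depend on the class
Choose the class where $\delta_c(x)$ is maximal
Special case: two classes. Decision boundary: the values of x where $\delta_0(x) = \delta_1(x)$; it is quadratic in x
Quadratic Discriminant Analysis (QDA)
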
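A minimal sketch of the QDA score for two predictors, written with the standard library only (the 2x2 inverse is coded by hand). The priors, means and covariance matrices are hypothetical, and the function names are introduced only for this illustration; in practice these parameters would be estimated from training data.

```python
import math


def qda_score(x, prior, mean, cov):
    """log P(C) - 1/2 log|Sigma_C| - 1/2 (x-mu_C)^T Sigma_C^{-1} (x-mu_C), for 2-D x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # Squared Mahalanobis distance of x from the class mean
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * maha


def qda_classify(x, classes):
    """Return the label with the largest score."""
    return max(classes, key=lambda label: qda_score(x, *classes[label]))


# Hypothetical parameters: label -> (prior, mean, covariance matrix)
classes = {
    0: (0.5, (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))),
    1: (0.5, (3.0, 0.0), ((4.0, 0.0), (0.0, 4.0))),
}
for point in [(0.0, 0.0), (3.0, 0.0), (-6.0, 0.0)]:
    print(point, "->", qda_classify(point, classes))
```

In this assumed setup the point (-6, 0) lies on the far side of class 0 and is still assigned to class 1, because class 1 has the larger covariance. A linear boundary could not produce this; it is a consequence of the boundary being quadratic in x.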
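Simplification
Assume the same covariance matrix in all classes, i.e. $X \mid C \sim N(\mu_C, \Sigma)$
$$\delta_c(x) = \log P(C) - \tfrac{1}{2} \log |\Sigma| - \tfrac{1}{2} (x - \mu_C)^T \Sigma^{-1} (x - \mu_C) + \text{const}$$
$$= \log P(C) - \tfrac{1}{2} (x - \mu_C)^T \Sigma^{-1} (x - \mu_C) + \text{const}'$$
- $\log P(C)$: prior
- $-\tfrac{1}{2} \log |\Sigma|$: fixed for all classes, so it is absorbed into the constant
- $(x - \mu_C)^T \Sigma^{-1} (x - \mu_C)$: squared Mahalanobis distance
Expanding the quadratic form and dropping the terms that do not depend on the class:
$$\delta_c(x) = \log P(C) + x^T \Sigma^{-1} \mu_C - \tfrac{1}{2} \mu_C^T \Sigma^{-1} \mu_C$$
Decision boundary is linear in x
Linear Discriminant Analysis (LDA)
[Figure: a point lying between the means of class 0 and class 1] Classify to which class (assume equal priors)?
• Physical distance in space is equal
• Classify to class 0, since the Mahalanobis distance is smaller
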
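The situation in the figure can be reproduced with a small sketch: a point that has the same Euclidean distance to both class means but a smaller Mahalanobis distance to class 0. The shared covariance matrix, the means and the point are hypothetical values chosen to make this happen; the helper names are introduced only for this illustration.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def quad_form(u, m, v):
    """u^T m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def sq_mahalanobis(x, mean, cov_inv):
    diff = (x[0] - mean[0], x[1] - mean[1])
    return quad_form(diff, cov_inv, diff)


def lda_score(x, prior, mean, cov_inv):
    """log P(C) + x^T Sigma^{-1} mu_C - 1/2 mu_C^T Sigma^{-1} mu_C (linear in x)."""
    return math.log(prior) + quad_form(x, cov_inv, mean) - 0.5 * quad_form(mean, cov_inv, mean)


# Hypothetical shared covariance: large variance along the first axis
sigma_inv = inv2(((4.0, 0.0), (0.0, 1.0)))
mu0, mu1 = (2.0, 0.0), (0.0, 2.0)
x = (0.0, 0.0)

print(math.dist(x, mu0), math.dist(x, mu1))                    # equal Euclidean distances
print(sq_mahalanobis(x, mu0, sigma_inv), sq_mahalanobis(x, mu1, sigma_inv))
print(lda_score(x, 0.5, mu0, sigma_inv) > lda_score(x, 0.5, mu1, sigma_inv))
```

Class 0 lies along the direction in which the data vary most, so the same physical distance counts for less there, and with equal priors the LDA score prefers class 0.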
LDA vs. QDA
LDA:
+ Only few parameters to estimate; accurate estimates
- Inflexible (linear decision boundary)
QDA:
- Many parameters to estimate; less accurate estimates
+ More flexible (quadratic decision boundary)

Fisher’s Discriminant Analysis: Idea
Find direction(s) in which groups are separated best
• Class Y, predictors $X = (X_1, \dots, X_d)$, projection $U = w^T X$
• Find w so that groups are separated along U best
• Measure of separation: Rayleigh coefficient
$$J(w) = \frac{D(U)^2}{Var(U)}, \quad \text{where } D(U) = E[U \mid Y = 0] - E[U \mid Y = 1]$$
• $E[X \mid Y = j] = \mu_j$, $Var(X \mid Y = j) = \Sigma$, hence $E[U \mid Y = j] = w^T \mu_j$, $Var(U) = w^T \Sigma w$
• Concept extendable to many groups
[Figure: the 1st principal component compared with the 1st linear discriminant (= 1st canonical variable)]
[Figure: two projections showing D(U) and Var(U); left: J(w) small, right: J(w) large]

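The Rayleigh coefficient can be evaluated for a few candidate directions. This sketch assumes two groups with a shared covariance matrix, uses hypothetical values for the means and covariance, and relies on the standard result that for two groups J(w) is maximized by w proportional to $\Sigma^{-1}(\mu_0 - \mu_1)$. The function names are introduced only for this illustration.

```python
def rayleigh(w, mu0, mu1, cov):
    """J(w) = D(U)^2 / Var(U) with D(U) = w^T (mu0 - mu1) and Var(U) = w^T Sigma w."""
    d_u = sum(wi * (a - b) for wi, a, b in zip(w, mu0, mu1))
    var_u = sum(w[i] * cov[i][j] * w[j] for i in range(2) for j in range(2))
    return d_u ** 2 / var_u


def fisher_direction(mu0, mu1, cov):
    """Sigma^{-1} (mu0 - mu1) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    diff = (mu0[0] - mu1[0], mu0[1] - mu1[1])
    return ((d * diff[0] - b * diff[1]) / det,
            (-c * diff[0] + a * diff[1]) / det)


# Hypothetical two-group setup
cov = ((4.0, 0.0), (0.0, 1.0))
mu0, mu1 = (2.0, 0.0), (0.0, 2.0)

w_fisher = fisher_direction(mu0, mu1, cov)
w_mean_diff = (mu0[0] - mu1[0], mu0[1] - mu1[1])   # line joining the two means
w_max_var = (1.0, 0.0)                              # direction of largest variance

for name, w in [("Fisher", w_fisher), ("mean difference", w_mean_diff), ("largest variance", w_max_var)]:
    print(name, rayleigh(w, mu0, mu1, cov))
```

In this assumed setup the direction of largest variance separates the groups worst, which mirrors the contrast between the 1st principal component and the 1st linear discriminant. J(w) does not change when w is rescaled, so only the direction matters.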
LDA and Linear Discriminants
- Direction with largest J(w): 1st linear discriminant (LD1)
- Orthogonal to LD1, again with largest J(w): LD2
- etc.
At most min(number of dimensions, number of groups - 1) LDs; e.g. 3 groups in 10 dimensions need 2 LDs
Computed using eigenvalue decomposition or singular value decomposition
Proportion of trace: percentage of the variance between group means captured by each LD
R: function «lda» in package MASS does LDA and computes linear discriminants (function «qda» is also available)

Example: Classification of Iris flowers
Species: Iris setosa, Iris versicolor, Iris virginica
Classify according to sepal/petal length/width

Quality of classification
- Use training data also as test data: overfitting; too optimistic for the error on new data
- Separate test data: split the data into a training part and a test part
- Cross validation (CV; e.g. "leave-one-out cross validation"): every row is the test case once, the rest is the training data

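Leave-one-out cross validation can be sketched with any classifier. To keep the example short, a nearest-class-mean rule on one predictor is used as a stand-in; it is not LDA in general. The data are hypothetical and the function names are introduced only for this illustration.

```python
def fit_means(xs, ys):
    """Return {label: mean of the training values with that label}."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(means, x):
    """Assign x to the class with the closest mean."""
    return min(means, key=lambda label: abs(x - means[label]))


def training_error(xs, ys):
    """Error rate when the training data are reused as test data."""
    means = fit_means(xs, ys)
    wrong = sum(predict(means, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def loocv_error(xs, ys):
    """Every row is the test case once; the rest is the training data."""
    wrong = 0
    for i in range(len(xs)):
        means = fit_means(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if predict(means, xs[i]) != ys[i]:
            wrong += 1
    return wrong / len(xs)


# Hypothetical data: one predictor, two classes
xs = [0.0, 1.9, 2.1, 4.0]
ys = [0, 0, 1, 1]
print("training error:", training_error(xs, ys))
print("LOOCV error:   ", loocv_error(xs, ys))
```

On this toy data the error measured on the training data is lower than the leave-one-out estimate, which illustrates why reusing the training data as test data is too optimistic.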
Measures for prediction error
Confusion matrix (e.g. 100 samples)

              Truth = 0   Truth = 1   Truth = 2
Estimate = 0      23           7           6
Estimate = 1       3          27           4
Estimate = 2       3           1          26

Error rate: 1 – sum(diagonal entries) / (number of samples) = 1 – 76/100 = 0.24
We expect that our classifier predicts 24% of new observations incorrectly (this is just a rough estimate)

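The error rate computation can be written out for the confusion matrix above. The layout (rows = estimates, columns = truth) follows the slide; the function name is introduced only for this illustration.

```python
# Rows: estimate = 0, 1, 2; columns: truth = 0, 1, 2
confusion = [
    [23, 7, 6],
    [3, 27, 4],
    [3, 1, 26],
]


def error_rate(matrix):
    """1 - sum(diagonal entries) / (number of samples)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total


print(round(error_rate(confusion), 2))
```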
Example: Digit recognition
- 7129 hand-written digits
- Each (centered) digit was put in a 16*16 grid
- Measure the grey value in each part of the grid, i.e. 256 grey values
[Figure: sample of digits; example with an 8*8 grid]

Concepts to know
- Idea of LDA / QDA
- Meaning of Linear Discriminants
- Cross Validation
- Confusion matrix, error rate

R functions to know
- lda