  1. Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics – Spring 2012

  2. Overview
      • Review: Conditional probability
      • LDA / QDA: Theory
      • Fisher’s Discriminant Analysis
      • LDA: Example
      • Quality control: Test set and cross-validation
      • Case study: Text recognition

  3. Conditional Probability
      • Sample space with events T: medical test positive, C: patient has cancer
      • (Marginal) probability: P(T), P(C)
      • Conditional probability: restrict to a new sample space
          P(T|C): probability of a positive test among people with cancer
          P(C|T): probability of cancer among people with a positive test
          P(T|C) can be large while P(C|T) is small
      • Bayes theorem: P(C|T) = P(T|C) P(C) / P(T)
          posterior = class conditional probability × prior / P(T)
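A minimal numeric sketch of Bayes' theorem in R; the prevalence, sensitivity P(T|C), and false-positive rate below are invented purely for illustration:

    p_C      <- 0.01   # prior / prevalence P(C) (invented value)
    p_T_C    <- 0.90   # class conditional P(T|C): positive test given cancer (invented)
    p_T_notC <- 0.05   # P(T | not C): positive test given no cancer (invented)

    p_T   <- p_T_C * p_C + p_T_notC * (1 - p_C)   # marginal P(T), law of total probability
    p_C_T <- p_T_C * p_C / p_T                    # posterior P(C|T) by Bayes theorem
    p_C_T                                         # about 0.15: small, although P(T|C) is large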

  4. One approach to supervised learning
      • P(C|X) = P(C) P(X|C) / P(X) ∝ P(C) P(X|C)
      • Prior / prevalence P(C): fraction of samples in that class
      • Assume X|C ~ N(μ_c, Σ_c) and find some estimate of its parameters
      • Bayes rule: choose the class where P(C|X) is maximal
          (rule is “optimal” if all types of error are equally costly)
      • Special case, two classes (0/1):
          choose c = 1 if P(C=1|X) > 0.5, or equivalently
          choose c = 1 if the posterior odds P(C=1|X) / P(C=0|X) > 1
      • In practice: estimate P(C), μ_c, Σ_c from the training data
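A sketch of the two-class rule in R, assuming for illustration univariate normal class conditionals with known, invented parameters:

    prior0 <- 0.7; prior1 <- 0.3            # priors P(C=0), P(C=1) (invented)
    mu0 <- 0; mu1 <- 2; sigma <- 1          # class conditionals X|C ~ N(mu_c, sigma^2) (invented)

    x <- 1.4                                # a new observation
    odds <- (prior1 * dnorm(x, mu1, sigma)) /
            (prior0 * dnorm(x, mu0, sigma)) # posterior odds P(C=1|x) / P(C=0|x); P(x) cancels
    ifelse(odds > 1, 1, 0)                  # classify to class 1 if the odds exceed 1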

  5. QDA: Doing the math…
      • Class conditional density: P(x|C) = 1 / √((2π)^d |Σ_c|) · exp(−½ (x − μ_c)ᵀ Σ_c⁻¹ (x − μ_c))
      • P(C|X) ∝ P(C) P(X|C)
      • Use the fact that maximizing P(C|X) is the same as maximizing log P(C|X)
      • δ_c(x) = log P(C) + log P(X|C)
               = log P(C) − ½ log|Σ_c| − ½ (x − μ_c)ᵀ Σ_c⁻¹ (x − μ_c) + const.
          (prior term, an additional log-determinant term, and the squared Mahalanobis distance)
      • Choose the class where δ_c(x) is maximal
      • Special case, two classes: the decision boundary, i.e. the values of x where δ_0(x) = δ_1(x),
        is quadratic in x → Quadratic Discriminant Analysis (QDA)
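A direct transcription of the QDA score δ_c(x) in R; all parameter values below are invented for illustration (in practice they would be estimated from the training data):

    qda_score <- function(x, mu_c, Sigma_c, prior_c) {
      # delta_c(x) = log P(C) - 1/2 log|Sigma_c| - 1/2 (x - mu_c)' Sigma_c^{-1} (x - mu_c)
      log(prior_c) - 0.5 * log(det(Sigma_c)) -
        0.5 * drop(t(x - mu_c) %*% solve(Sigma_c) %*% (x - mu_c))
    }

    mu0 <- c(0, 0); S0 <- diag(2)                       # class 0 parameters (invented)
    mu1 <- c(2, 1); S1 <- matrix(c(1, 0.5, 0.5, 2), 2)  # class 1 parameters (invented)
    x <- c(1, 1)
    scores <- c(qda_score(x, mu0, S0, 0.5), qda_score(x, mu1, S1, 0.5))
    which.max(scores) - 1                               # classify to the class with the largest score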

  6. Simplification
      • Assume the same covariance matrix in all classes, i.e. X|C ~ N(μ_c, Σ)
      • δ_c(x) = log P(C) − ½ log|Σ| − ½ (x − μ_c)ᵀ Σ⁻¹ (x − μ_c) + const.
               = log P(C) − ½ (x − μ_c)ᵀ Σ⁻¹ (x − μ_c) + const.
               (= log P(C) + xᵀ Σ⁻¹ μ_c − ½ μ_cᵀ Σ⁻¹ μ_c + const.)
          prior term plus squared Mahalanobis distance; terms not depending on the class are constant
      • Decision boundary is linear in x → Linear Discriminant Analysis (LDA)
      • Figure: a point between classes 0 and 1 (equal priors) whose Euclidean distance to both class
        means is equal is classified to class 0, since its Mahalanobis distance to class 0 is smaller
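The corresponding LDA score, with one shared covariance matrix, written as the linear function of x from the last line of the slide (again with invented parameters):

    lda_score <- function(x, mu_c, Sigma, prior_c) {
      # delta_c(x) = log P(C) + x' Sigma^{-1} mu_c - 1/2 mu_c' Sigma^{-1} mu_c
      Sigma_inv <- solve(Sigma)
      log(prior_c) + drop(t(x) %*% Sigma_inv %*% mu_c) -
        0.5 * drop(t(mu_c) %*% Sigma_inv %*% mu_c)
    }

    Sigma <- matrix(c(1, 0.3, 0.3, 1), 2)   # shared covariance matrix (invented)
    mu0 <- c(0, 0); mu1 <- c(2, 1); x <- c(1, 1)
    scores <- c(lda_score(x, mu0, Sigma, 0.5), lda_score(x, mu1, Sigma, 0.5))
    which.max(scores) - 1                   # the boundary between the two scores is linear in x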

  7. LDA vs. QDA
      LDA:  + only few parameters to estimate; accurate estimates
            − inflexible (linear decision boundary)
      QDA:  − many parameters to estimate; less accurate estimates
            + more flexible (quadratic decision boundary)

  8. Fisher’s Discriminant Analysis: Idea
      • Find the direction(s) in which the groups are separated best
      • Class Y, predictors X = (X_1, …, X_d)
      • Project onto a direction w: U = wᵀX, the 1st linear discriminant
        (also called 1st canonical variable; compare with the 1st principal component)
      • Find w so that the groups are separated best along U
      • Measure of separation: Rayleigh coefficient J(w) = D(U)² / Var(U),
        where D(U) = E[U|Y=0] − E[U|Y=1]
      • With E[X|Y=j] = μ_j and Var(X|Y=j) = Σ: E[U|Y=j] = wᵀμ_j and Var(U) = wᵀΣw
      • Concept extendable to many groups
      • Figure: J(w) is small when Var(U) is large relative to D(U), and large when D(U) is large
        relative to Var(U)
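A small sketch of the two-group case in R: the direction maximizing J(w) is proportional to Σ⁻¹(μ₀ − μ₁); the data below are simulated purely for illustration:

    set.seed(1)
    X0 <- matrix(rnorm(100, mean = 0), ncol = 2)   # group 0 (simulated toy data)
    X1 <- matrix(rnorm(100, mean = 2), ncol = 2)   # group 1 (simulated toy data)
    mu0 <- colMeans(X0); mu1 <- colMeans(X1)
    Sigma <- (cov(X0) + cov(X1)) / 2               # pooled within-group covariance

    w <- solve(Sigma, mu0 - mu1)                   # Fisher direction (LD 1, up to scaling)
    D <- sum(w * mu0) - sum(w * mu1)               # D(U) = E[U|Y=0] - E[U|Y=1]
    J <- D^2 / drop(t(w) %*% Sigma %*% w)          # Rayleigh coefficient J(w)
    J                                              # large J(w) means well separated groups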

  9. LDA and Linear Discriminants
      • Direction with largest J(w): 1st Linear Discriminant (LD 1)
      • Orthogonal to LD 1, again with largest J(w): LD 2, etc.
      • At most min(number of dimensions, number of groups − 1) LDs,
        e.g. 3 groups in 10 dimensions need 2 LDs
      • Computed using an eigenvalue decomposition or singular value decomposition
        (a rough sketch follows below)
      • Proportion of trace: the share of the variance between the group means captured by each LD
      • R: function «lda» in package MASS does LDA and computes the linear discriminants
        («qda» is also available)
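A rough sketch of the computation, using iris as toy data: the LDs can be obtained from an eigenvalue decomposition of W⁻¹B, with W the within-group and B the between-group covariance. The scaling conventions below differ from what MASS uses internally, so the result only approximately reproduces lda's “Proportion of trace”:

    X <- as.matrix(iris[, 1:4]); y <- iris$Species
    group_means <- t(sapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE])))
    W <- Reduce(`+`, lapply(levels(y), function(k) cov(X[y == k, ]))) / nlevels(y)  # within
    B <- cov(group_means)                       # between: spread of the group means
    ev <- eigen(solve(W) %*% B)
    lds <- Re(ev$values[1:2])                   # at most min(4, 3 - 1) = 2 non-zero eigenvalues
    lds / sum(lds)                              # roughly the "proportion of trace" reported by lda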

  10. Example: Classification of Iris flowers
      • Three species: Iris setosa, Iris versicolor, Iris virginica (figure: one photo per species)
      • Classify according to sepal/petal length/width
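A minimal lda call for this example (the course's actual code is not shown in the slides):

    library(MASS)

    fit <- lda(Species ~ ., data = iris)   # 3 groups in 4 dimensions: at most 2 LDs
    fit                                    # prints priors, group means, LD coefficients,
                                           # and the "Proportion of trace" for LD1 and LD2
    plot(fit)                              # training samples projected onto LD1 / LD2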

  11. Quality of classification
      • Using the training data also as test data leads to overfitting:
        the estimated error on new data is too optimistic
      • Separate test data: split the data into a training set and a test set
      • Cross-validation (CV; e.g. “leave-one-out” cross-validation):
        every row is the test case once, the rest is the training data
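MASS::lda can do leave-one-out cross-validation directly via CV = TRUE; a minimal sketch on the iris data:

    library(MASS)

    cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)  # each row is left out and predicted once
    head(cv_fit$class)                                  # cross-validated class predictions
    head(cv_fit$posterior)                              # cross-validated posterior probabilities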

  12. Measures for prediction error
      • Confusion matrix (e.g. 100 samples):

                         Truth = 0   Truth = 1   Truth = 2
          Estimate = 0       23           7           6
          Estimate = 1        3          27           4
          Estimate = 2        3           1          26

      • Error rate: 1 − sum(diagonal entries) / (number of samples) = 1 − 76/100 = 0.24
      • We expect that our classifier predicts 24% of new observations incorrectly
        (this is just a rough estimate)
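The confusion matrix and error rate can be computed from cross-validated predictions, e.g. continuing the iris sketch above:

    library(MASS)

    cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)
    conf <- table(Estimate = cv_fit$class, Truth = iris$Species)  # confusion matrix
    conf
    1 - sum(diag(conf)) / sum(conf)                               # estimated error rate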

  13. Example: Digit recognition
      • 7129 hand-written digits (figure: sample of digits)
      • Each (centered) digit was put in a 16×16 grid
      • Measure the grey value in each cell of the grid, i.e. 256 grey values per digit
        (figure: example with an 8×8 grid)
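A stand-in sketch of the feature construction (random grey values used in place of the real digit data, which are not shown in the slides):

    digit <- matrix(runif(16 * 16), 16, 16)   # stand-in for one centered digit's grey values
    x <- as.vector(digit)                     # flatten the grid into one feature vector
    length(x)                                 # 256 grey values, usable as predictors for LDA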

  14. Concepts to know
      • Idea of LDA / QDA
      • Meaning of linear discriminants
      • Cross-validation
      • Confusion matrix, error rate

  15. R functions to know
      • lda (package MASS)
