  1. Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics – Spring 2012

  2. Overview
      • Review: Conditional probability
      • LDA / QDA: Theory
      • Fisher’s Discriminant Analysis
      • LDA: Example
      • Quality control: Test set and cross-validation
      • Case study: Text recognition

  3. Conditional Probability
      • Sample space with events T: medical test positive, C: patient has cancer
      • (Marginal) probability: P(T), P(C)
      • Conditional probability: restrict to a new sample space
          P(T|C): probability of a positive test among people with cancer
          P(C|T): probability of cancer among people with a positive test
          P(T|C) can be large while P(C|T) is small
      • Bayes theorem: P(C|T) = P(T|C) P(C) / P(T)
          posterior = class conditional probability × prior / P(T)
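A minimal numeric sketch of Bayes' theorem in R; the prevalence, sensitivity P(T|C), and false-positive rate below are invented purely for illustration:

    p_C      <- 0.01   # prior / prevalence P(C) (invented value)
    p_T_C    <- 0.90   # class conditional P(T|C): positive test given cancer (invented)
    p_T_notC <- 0.05   # P(T | not C): positive test given no cancer (invented)

    p_T   <- p_T_C * p_C + p_T_notC * (1 - p_C)   # marginal P(T), law of total probability
    p_C_T <- p_T_C * p_C / p_T                    # posterior P(C|T) by Bayes theorem
    p_C_T                                         # about 0.15: small, although P(T|C) is large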

  4. One approach to supervised learning
      • P(C|X) = P(C) P(X|C) / P(X) ∝ P(C) P(X|C)
      • Prior / prevalence P(C): fraction of samples in that class
      • Assume X|C ~ N(μ_c, Σ_c) and find some estimate of its parameters
      • Bayes rule: choose the class where P(C|X) is maximal
          (rule is “optimal” if all types of error are equally costly)
      • Special case, two classes (0/1):
          choose c = 1 if P(C=1|X) > 0.5, or equivalently
          choose c = 1 if the posterior odds P(C=1|X) / P(C=0|X) > 1
      • In practice: estimate P(C), μ_c, Σ_c from the training data
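A sketch of the two-class rule in R, assuming for illustration univariate normal class conditionals with known, invented parameters:

    prior0 <- 0.7; prior1 <- 0.3            # priors P(C=0), P(C=1) (invented)
    mu0 <- 0; mu1 <- 2; sigma <- 1          # class conditionals X|C ~ N(mu_c, sigma^2) (invented)

    x <- 1.4                                # a new observation
    odds <- (prior1 * dnorm(x, mu1, sigma)) /
            (prior0 * dnorm(x, mu0, sigma)) # posterior odds P(C=1|x) / P(C=0|x); P(x) cancels
    ifelse(odds > 1, 1, 0)                  # classify to class 1 if the odds exceed 1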

  5. QDA: Doing the math…
      • Class conditional density: P(x|C) = 1 / √((2π)^d |Σ_c|) · exp(−½ (x − μ_c)ᵀ Σ_c⁻¹ (x − μ_c))
      • P(C|X) ∝ P(C) P(X|C)
      • Use the fact that maximizing P(C|X) is the same as maximizing log P(C|X)
      • δ_c(x) = log P(C) + log P(X|C)
               = log P(C) − ½ log|Σ_c| − ½ (x − μ_c)ᵀ Σ_c⁻¹ (x − μ_c) + const.
          (prior term, an additional log-determinant term, and the squared Mahalanobis distance)
      • Choose the class where δ_c(x) is maximal
      • Special case, two classes: the decision boundary, i.e. the values of x where δ_0(x) = δ_1(x),
        is quadratic in x → Quadratic Discriminant Analysis (QDA)
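A direct transcription of the QDA score δ_c(x) in R; all parameter values below are invented for illustration (in practice they would be estimated from the training data):

    qda_score <- function(x, mu_c, Sigma_c, prior_c) {
      # delta_c(x) = log P(C) - 1/2 log|Sigma_c| - 1/2 (x - mu_c)' Sigma_c^{-1} (x - mu_c)
      log(prior_c) - 0.5 * log(det(Sigma_c)) -
        0.5 * drop(t(x - mu_c) %*% solve(Sigma_c) %*% (x - mu_c))
    }

    mu0 <- c(0, 0); S0 <- diag(2)                       # class 0 parameters (invented)
    mu1 <- c(2, 1); S1 <- matrix(c(1, 0.5, 0.5, 2), 2)  # class 1 parameters (invented)
    x <- c(1, 1)
    scores <- c(qda_score(x, mu0, S0, 0.5), qda_score(x, mu1, S1, 0.5))
    which.max(scores) - 1                               # classify to the class with the largest score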

  6. Simplification
      • Assume the same covariance matrix in all classes, i.e. X|C ~ N(μ_c, Σ)
      • δ_c(x) = log P(C) − ½ log|Σ| − ½ (x − μ_c)ᵀ Σ⁻¹ (x − μ_c) + const.
               = log P(C) − ½ (x − μ_c)ᵀ Σ⁻¹ (x − μ_c) + const.
               (= log P(C) + xᵀ Σ⁻¹ μ_c − ½ μ_cᵀ Σ⁻¹ μ_c + const.)
          prior term plus squared Mahalanobis distance; terms not depending on the class are constant
      • Decision boundary is linear in x → Linear Discriminant Analysis (LDA)
      • Figure: a point between classes 0 and 1 (equal priors) whose Euclidean distance to both class
        means is equal is classified to class 0, since its Mahalanobis distance to class 0 is smaller
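The corresponding LDA score, with one shared covariance matrix, written as the linear function of x from the last line of the slide (again with invented parameters):

    lda_score <- function(x, mu_c, Sigma, prior_c) {
      # delta_c(x) = log P(C) + x' Sigma^{-1} mu_c - 1/2 mu_c' Sigma^{-1} mu_c
      Sigma_inv <- solve(Sigma)
      log(prior_c) + drop(t(x) %*% Sigma_inv %*% mu_c) -
        0.5 * drop(t(mu_c) %*% Sigma_inv %*% mu_c)
    }

    Sigma <- matrix(c(1, 0.3, 0.3, 1), 2)   # shared covariance matrix (invented)
    mu0 <- c(0, 0); mu1 <- c(2, 1); x <- c(1, 1)
    scores <- c(lda_score(x, mu0, Sigma, 0.5), lda_score(x, mu1, Sigma, 0.5))
    which.max(scores) - 1                   # the boundary between the two scores is linear in x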

  7. LDA vs. QDA
      LDA:  + only few parameters to estimate; accurate estimates
            − inflexible (linear decision boundary)
      QDA:  − many parameters to estimate; less accurate estimates
            + more flexible (quadratic decision boundary)

  8. Fisher’s Discriminant Analysis: Idea
      • Find the direction(s) in which the groups are separated best
      • Class Y, predictors X = (X_1, …, X_d)
      • Project onto a direction w: U = wᵀX, the 1st linear discriminant
        (also called 1st canonical variable; compare with the 1st principal component)
      • Find w so that the groups are separated best along U
      • Measure of separation: Rayleigh coefficient J(w) = D(U)² / Var(U),
        where D(U) = E[U|Y=0] − E[U|Y=1]
      • With E[X|Y=j] = μ_j and Var(X|Y=j) = Σ: E[U|Y=j] = wᵀμ_j and Var(U) = wᵀΣw
      • Concept extendable to many groups
      • Figure: J(w) is small when Var(U) is large relative to D(U), and large when D(U) is large
        relative to Var(U)
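A small sketch of the two-group case in R: the direction maximizing J(w) is proportional to Σ⁻¹(μ₀ − μ₁); the data below are simulated purely for illustration:

    set.seed(1)
    X0 <- matrix(rnorm(100, mean = 0), ncol = 2)   # group 0 (simulated toy data)
    X1 <- matrix(rnorm(100, mean = 2), ncol = 2)   # group 1 (simulated toy data)
    mu0 <- colMeans(X0); mu1 <- colMeans(X1)
    Sigma <- (cov(X0) + cov(X1)) / 2               # pooled within-group covariance

    w <- solve(Sigma, mu0 - mu1)                   # Fisher direction (LD 1, up to scaling)
    D <- sum(w * mu0) - sum(w * mu1)               # D(U) = E[U|Y=0] - E[U|Y=1]
    J <- D^2 / drop(t(w) %*% Sigma %*% w)          # Rayleigh coefficient J(w)
    J                                              # large J(w) means well separated groups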

  9. LDA and Linear Discriminants
      • Direction with largest J(w): 1st Linear Discriminant (LD 1)
      • Orthogonal to LD 1, again with largest J(w): LD 2, etc.
      • At most min(number of dimensions, number of groups − 1) LDs,
        e.g. 3 groups in 10 dimensions need 2 LDs
      • Computed using an eigenvalue decomposition or singular value decomposition
        (a rough sketch follows below)
      • Proportion of trace: the share of the variance between the group means captured by each LD
      • R: function «lda» in package MASS does LDA and computes the linear discriminants
        («qda» is also available)
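A rough sketch of the computation, using iris as toy data: the LDs can be obtained from an eigenvalue decomposition of W⁻¹B, with W the within-group and B the between-group covariance. The scaling conventions below differ from what MASS uses internally, so the result only approximately reproduces lda's “Proportion of trace”:

    X <- as.matrix(iris[, 1:4]); y <- iris$Species
    group_means <- t(sapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE])))
    W <- Reduce(`+`, lapply(levels(y), function(k) cov(X[y == k, ]))) / nlevels(y)  # within
    B <- cov(group_means)                       # between: spread of the group means
    ev <- eigen(solve(W) %*% B)
    lds <- Re(ev$values[1:2])                   # at most min(4, 3 - 1) = 2 non-zero eigenvalues
    lds / sum(lds)                              # roughly the "proportion of trace" reported by lda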

  10. Example: Classification of Iris flowers
      • Three species: Iris setosa, Iris versicolor, Iris virginica (figure: one photo per species)
      • Classify according to sepal/petal length/width
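A minimal lda call for this example (the course's actual code is not shown in the slides):

    library(MASS)

    fit <- lda(Species ~ ., data = iris)   # 3 groups in 4 dimensions: at most 2 LDs
    fit                                    # prints priors, group means, LD coefficients,
                                           # and the "Proportion of trace" for LD1 and LD2
    plot(fit)                              # training samples projected onto LD1 / LD2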

  11. Quality of classification
      • Using the training data also as test data leads to overfitting:
        the estimated error on new data is too optimistic
      • Separate test data: split the data into a training set and a test set
      • Cross-validation (CV; e.g. “leave-one-out” cross-validation):
        every row is the test case once, the rest is the training data
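MASS::lda can do leave-one-out cross-validation directly via CV = TRUE; a minimal sketch on the iris data:

    library(MASS)

    cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)  # each row is left out and predicted once
    head(cv_fit$class)                                  # cross-validated class predictions
    head(cv_fit$posterior)                              # cross-validated posterior probabilities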

  12. Measures for prediction error
      • Confusion matrix (e.g. 100 samples):

                         Truth = 0   Truth = 1   Truth = 2
          Estimate = 0       23           7           6
          Estimate = 1        3          27           4
          Estimate = 2        3           1          26

      • Error rate: 1 − sum(diagonal entries) / (number of samples) = 1 − 76/100 = 0.24
      • We expect that our classifier predicts 24% of new observations incorrectly
        (this is just a rough estimate)
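The confusion matrix and error rate can be computed from cross-validated predictions, e.g. continuing the iris sketch above:

    library(MASS)

    cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)
    conf <- table(Estimate = cv_fit$class, Truth = iris$Species)  # confusion matrix
    conf
    1 - sum(diag(conf)) / sum(conf)                               # estimated error rate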

  13. Example: Digit recognition
      • 7129 hand-written digits (figure: sample of digits)
      • Each (centered) digit was put in a 16×16 grid
      • Measure the grey value in each cell of the grid, i.e. 256 grey values per digit
        (figure: example with an 8×8 grid)
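A stand-in sketch of the feature construction (random grey values used in place of the real digit data, which are not shown in the slides):

    digit <- matrix(runif(16 * 16), 16, 16)   # stand-in for one centered digit's grey values
    x <- as.vector(digit)                     # flatten the grid into one feature vector
    length(x)                                 # 256 grey values, usable as predictors for LDA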

  14. Concepts to know
      • Idea of LDA / QDA
      • Meaning of linear discriminants
      • Cross-validation
      • Confusion matrix, error rate

  15. R functions to know
      • lda (package MASS)
