Linear Discriminant Functions (5.8, 5.9, 5.11)
Jacob Hays, Amit Pillay, James DeFelice

Minimum Squared Error
• The previous methods (perceptron, relaxation) work only on linearly separable cases, correcting the error by looking at misclassified samples
• MSE uses all of the samples, solving a set of linear equations to find an estimate
Minimum Squared Error
• The x space is mapped to a y space: for every sample x_i in dimension d there is a y_i of dimension d̂
• Goal: find a vector a making all a^t y_i > 0
• Stack all samples y_i as the rows of a matrix Y of dimension n × d̂ and solve Ya = b, where b is a vector of positive constants (our margin for error):

  \begin{pmatrix} y_{10} & y_{11} & \cdots & y_{1\hat{d}} \\ y_{20} & y_{21} & \cdots & y_{2\hat{d}} \\ \vdots & \vdots & & \vdots \\ y_{n0} & y_{n1} & \cdots & y_{n\hat{d}} \end{pmatrix}
  \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{\hat{d}} \end{pmatrix} =
  \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}

Minimum Squared Error
• Y is rectangular (n × d̂), so it has no direct inverse with which to solve Ya = b
• Ya − b = e gives the error; minimize its squared norm:

  J_s(a) = \|Ya - b\|^2 = \sum_{i=1}^{n} (a^t y_i - b_i)^2

• Take the gradient:

  \nabla J_s = \sum_{i=1}^{n} 2 (a^t y_i - b_i)\, y_i = 2 Y^t (Ya - b)

• Setting the gradient to zero gives Y^t Y a = Y^t b
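As a concrete illustration of the criterion and its gradient, here is a minimal NumPy sketch; the function names and the toy data (with the class-2 samples already negated) are mine, not from the slides.

  import numpy as np

  def mse_criterion(a, Y, b):
      # Sum-of-squared-error criterion J_s(a) = ||Ya - b||^2
      e = Y @ a - b
      return float(e @ e)

  def mse_gradient(a, Y, b):
      # Gradient of J_s with respect to a: 2 Y^t (Ya - b)
      return 2.0 * Y.T @ (Y @ a - b)

  # Toy example: n = 4 augmented samples in d-hat = 3 dimensions
  Y = np.array([[ 1.0,  1.0,  2.0],
                [ 1.0,  2.0,  0.0],
                [-1.0, -3.0, -1.0],
                [-1.0, -2.0, -3.0]])
  b = np.ones(Y.shape[0])        # margin vector of positive constants
  a = np.zeros(Y.shape[1])
  print(mse_criterion(a, Y, b), mse_gradient(a, Y, b))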
Minimum Squared Error
• Y^t Y a = Y^t b leads to a = (Y^t Y)^{-1} Y^t b
• (Y^t Y)^{-1} Y^t is the pseudo-inverse of Y, of dimension d̂ × n, written Y†
• Y† Y = I, but in general Y Y† ≠ I
• a = Y† b gives us a solution, with b acting as a margin

Minimum Squared Error
(figure)
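To make the pseudo-inverse solution concrete, a minimal sketch in Python/NumPy; the toy Y and the margin choice b = 1 are illustrative, not from the slides.

  import numpy as np

  def mse_solution(Y, b):
      # MSE weight vector a = Y† b via the pseudo-inverse.
      # np.linalg.pinv also handles the rank-deficient case; when Y^t Y
      # is invertible this equals (Y^t Y)^{-1} Y^t b.
      return np.linalg.pinv(Y) @ b

  Y = np.array([[ 1.0,  1.0,  2.0],
                [ 1.0,  2.0,  0.0],
                [-1.0, -3.0, -1.0],
                [-1.0, -2.0, -3.0]])
  b = np.ones(4)
  a = mse_solution(Y, b)
  print(a, Y @ a)    # Ya should be close to the margin vector b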
Fisher's Linear Discriminant
• Based on projecting the d-dimensional data onto a line
• The projection loses a lot of information, but some orientation of the line may still give a good split
• y = w^t x, with ||w|| = 1; y_i is the projection of x_i onto the line in direction w
• Goal: find the best w to separate the two classes
• Performs poorly on highly overlapping data

Fisher's Linear Discriminant
• Mean of each class D_i:

  m_i = \frac{1}{n_i} \sum_{x \in D_i} x

• w = (m_1 − m_2) / ||m_1 − m_2|| is the direction that best separates the projected means
Fisher's Linear Discriminant
• Scatter matrices:

  S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t

• Within-class scatter: S_W = S_1 + S_2
• Fisher's solution: w = S_W^{-1} (m_1 − m_2)

Fisher's Relation to MSE
• MSE and Fisher's discriminant are equivalent for a specific choice of b
  ◦ n_i = number of samples x ∈ D_i
  ◦ 1_i is a column vector of n_i ones
• Take

  a = \begin{pmatrix} w_0 \\ w \end{pmatrix}, \quad
  Y = \begin{pmatrix} 1_1 & X_1 \\ -1_2 & -X_2 \end{pmatrix}, \quad
  b = \begin{pmatrix} \frac{n}{n_1} 1_1 \\ \frac{n}{n_2} 1_2 \end{pmatrix}

• Plug into Y^t Y a = Y^t b:

  \begin{pmatrix} 1_1^t & -1_2^t \\ X_1^t & -X_2^t \end{pmatrix}
  \begin{pmatrix} 1_1 & X_1 \\ -1_2 & -X_2 \end{pmatrix}
  \begin{pmatrix} w_0 \\ w \end{pmatrix} =
  \begin{pmatrix} 1_1^t & -1_2^t \\ X_1^t & -X_2^t \end{pmatrix}
  \begin{pmatrix} \frac{n}{n_1} 1_1 \\ \frac{n}{n_2} 1_2 \end{pmatrix}

• Solving yields w = \alpha\, n\, S_W^{-1} (m_1 - m_2), i.e. the Fisher direction up to a scale factor
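A minimal sketch of computing the Fisher direction directly from raw class samples; the helper name and the randomly generated toy data are illustrative.

  import numpy as np

  def fisher_direction(X1, X2):
      # Fisher's linear discriminant direction w = S_W^{-1} (m1 - m2).
      # X1, X2: arrays of shape (n_i, d), one row per sample of each class.
      m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
      S1 = (X1 - m1).T @ (X1 - m1)      # scatter matrix of class 1
      S2 = (X2 - m2).T @ (X2 - m2)      # scatter matrix of class 2
      Sw = S1 + S2                      # within-class scatter
      return np.linalg.solve(Sw, m1 - m2)

  rng = np.random.default_rng(0)
  X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
  X2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))
  w = fisher_direction(X1, X2)
  print(w)
  print((X1 @ w).mean(), (X2 @ w).mean())   # projected class means are well separated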
Relation to Optimal Discriminant
• With b = 1_n, the MSE solution approaches the optimal Bayes discriminant g_0 as the number of samples approaches infinity (see 5.8.3):

  g_0(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)

• g(x), the MSE solution, is a minimum-squared-error approximation of g_0(x)

Widrow-Hoff / LMS
• LMS = Least Mean Squares
• Still yields a solution when Y^t Y is singular
• Algorithm (a code sketch follows):

  begin initialize a, b, threshold θ, step η(·), k = 0
     do k ← (k + 1) mod n
        a ← a + η(k) (b_k − a^t y_k) y_k
     until |η(k) (b_k − a^t y_k) y_k| < θ
     return a
  end
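A sketch of the single-sample LMS loop in Python; the decreasing step schedule η(k) = η(1)/k and the loop bound are illustrative choices, not prescribed by the slides.

  import numpy as np

  def lms(Y, b, eta0=0.1, theta=1e-6, max_passes=1000):
      # Widrow-Hoff / LMS: single-sample gradient descent on ||Ya - b||^2.
      # Y holds the (already sign-normalized) samples, one per row.
      n, d = Y.shape
      a = np.zeros(d)
      k = 0
      for _ in range(max_passes * n):
          i = k % n                      # cycle through the samples
          k += 1
          eta = eta0 / k                 # decreasing step size
          update = eta * (b[i] - a @ Y[i]) * Y[i]
          a = a + update
          if np.linalg.norm(update) < theta:
              break
      return a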
Widrow-Hoff / LMS
• LMS is not guaranteed to converge to a separating hyperplane, even if one exists

Procedural Differences
• Perceptron, relaxation
  ◦ If the samples are linearly separable, they find a solution
  ◦ Otherwise, they do not converge
• MSE
  ◦ Always yields a weight vector
  ◦ But it may not be the best solution, and it is not guaranteed to be a separating vector
Choosing b
• For arbitrary b, MSE minimizes ||Ya − b||^2
• If the samples are linearly separable, we can choose b more cleverly
  ◦ Define â and b̂ such that Yâ = b̂ > 0
  ◦ Every component of b̂ is positive

Modified MSE
• J_s(a, b) = ||Ya − b||^2
• Both a and b are allowed to vary, subject to b > 0
• The minimum of J_s is zero
• The a that achieves min J_s = 0 is a separating vector
Ho-Kashyap / Descent Procedure
• Gradients of J_s(a, b) with respect to a and b:

  \nabla_a J_s = 2 Y^t (Ya - b)
  \nabla_b J_s = -2 (Ya - b)

• For any b, a = Y† b gives \nabla_a J_s = 0 — so are we done? No:
  – we must avoid b = 0
  – we must avoid b with negative components

Ho-Kashyap / Descent Procedure
• Start from a positive b
• Never allow a component of b to decrease
• To do this, set the positive components of \nabla_b J_s to zero:
  ◦ b(k+1) = b(k) − η c, where

  c_i = (\nabla_b J_s)_i \text{ if } (\nabla_b J_s)_i \le 0, \qquad c_i = 0 \text{ otherwise}

  or equivalently

  c = \tfrac{1}{2}\left( \nabla_b J_s - |\nabla_b J_s| \right)
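Spelling out the small algebra step that links this clipped-gradient update to the e⁺ form used on the next slide (using \nabla_b J_s = -2e with e = Ya - b):

  \nabla_b J_s = -2e, \qquad e = Ya - b, \qquad e^+ \equiv \tfrac{1}{2}(e + |e|)

  c = \tfrac{1}{2}\bigl(\nabla_b J_s - |\nabla_b J_s|\bigr)
    = \tfrac{1}{2}\bigl(-2e - 2|e|\bigr)
    = -(e + |e|) = -2e^+

  b(k+1) = b(k) - \eta\, c = b(k) + 2\eta\, e^+(k)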
Ho-Kashyap / Descent Procedure
• With e = Ya − b and \nabla_b J_s = −2(Ya − b), the update becomes

  b_{k+1} = b_k - \eta \cdot \tfrac{1}{2}\left( \nabla_b J_s - |\nabla_b J_s| \right)
          = b_k + 2\eta\, e_k^+, \qquad e_k^+ = \tfrac{1}{2}(e_k + |e_k|)

  a_k = Y^\dagger b_k

Ho-Kashyap
• Algorithm 11 (a code sketch follows the list)
  ◦ begin initialize a, b, η(·) < 1, threshold b_min, k_max
       do k ← (k + 1) mod n
          e ← Ya − b
          e⁺ ← ½(e + abs(e))
          b ← b + 2η(k) e⁺
          a ← Y† b
          if abs(e) ≤ b_min then return a, b and exit
       until k = k_max
       print "NO SOLUTION"
    end
• When e(k) = 0, we have a solution
• When e(k) ≤ 0 (no positive components), the samples are not linearly separable
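A compact Python sketch of the procedure above; the constant step η = 0.5, the initial b = 1, and the tolerance handling are illustrative choices.

  import numpy as np

  def ho_kashyap(Y, eta=0.5, b_min=1e-6, k_max=10000):
      # Ho-Kashyap descent following Algorithm 11 above.
      # Returns (a, b, separable_flag).
      n, d = Y.shape
      b = np.ones(n)                      # any positive starting b
      Y_pinv = np.linalg.pinv(Y)          # Y† is fixed, compute it once
      a = Y_pinv @ b
      for _ in range(k_max):
          e = Y @ a - b
          e_plus = 0.5 * (e + np.abs(e))  # keep only positive error components
          if np.all(np.abs(e) <= b_min):
              return a, b, True           # e ~ 0: separating vector found
          if np.all(e <= 0) and np.any(e < -b_min):
              return a, b, False          # e <= 0 and e != 0: not separable
          b = b + 2.0 * eta * e_plus      # components of b never decrease
          a = Y_pinv @ b
      return a, b, False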
Convergence (Separable Case)
• If 0 < η < 1 and the samples are linearly separable
  ◦ a solution vector exists
  ◦ we will find it in a finite number of steps k
• Two possibilities
  ◦ e(k) = 0 for some finite k_0
  ◦ e(k) is never zero
• If e(k_0) = 0
  ◦ a(k), b(k), e(k) stop changing
  ◦ Ya(k) = b(k) > 0 for all k > k_0
  ◦ so if we reach k_0, the algorithm terminates with a solution vector

Convergence (Separable Case)
• Suppose instead that e(k) is never zero for any finite k
• If the samples are linearly separable, there exist â, b̂ with Yâ = b̂ > 0
• Because b̂ is positive, either
  ◦ e(k) is zero, or
  ◦ e(k) has at least one positive component
• Since e(k) is never zero (first bullet), it must have a positive component
Convergence (Separable Case)
• A direct calculation gives

  \tfrac{1}{4}\left( \|e_k\|^2 - \|e_{k+1}\|^2 \right)
    = \eta (1 - \eta) \|e_k^+\|^2 + \eta^2\, e_k^{+t} Y Y^\dagger e_k^+

• Y Y† is symmetric and positive semi-definite, and 0 < η < 1
• Therefore ||e_{k+1}||^2 < ||e_k||^2 whenever e_k^+ ≠ 0
• ||e|| eventually converges to zero, and a converges to a solution vector

Convergence (Non-Separable Case)
• If the samples are not linearly separable, the algorithm may reach a non-zero error vector with no positive components
• The identity above still holds:

  \tfrac{1}{4}\left( \|e_k\|^2 - \|e_{k+1}\|^2 \right)
    = \eta (1 - \eta) \|e_k^+\|^2 + \eta^2\, e_k^{+t} Y Y^\dagger e_k^+

• But now the limiting ||e|| cannot be zero; it converges to a non-zero value
• Summary of the convergence results
  ◦ separable case: e_k^+ = 0 for some finite k
  ◦ non-separable case: e_k^+ converges to zero while ||e_k|| stays bounded away from zero
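A quick numerical sanity check of the identity above for a single Ho-Kashyap step, on randomly generated data; the toy sizes and the value of η are arbitrary.

  import numpy as np

  rng = np.random.default_rng(1)
  n, d = 8, 3
  Y = rng.normal(size=(n, d))             # random full-column-rank Y
  Y_pinv = np.linalg.pinv(Y)
  eta = 0.7                               # any 0 < eta < 1

  b = np.abs(rng.normal(size=n)) + 0.1    # positive starting b
  a = Y_pinv @ b
  e = Y @ a - b
  e_plus = 0.5 * (e + np.abs(e))

  # One Ho-Kashyap step
  b_next = b + 2 * eta * e_plus
  a_next = Y_pinv @ b_next
  e_next = Y @ a_next - b_next

  lhs = 0.25 * (e @ e - e_next @ e_next)
  rhs = eta * (1 - eta) * (e_plus @ e_plus) + eta**2 * e_plus @ (Y @ Y_pinv) @ e_plus
  print(np.isclose(lhs, rhs))             # True (up to floating-point error)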
Support Vector Machines (SVMs)

SVMs
• By representing the data in a higher-dimensional space, an SVM constructs a separating hyperplane in that space, the one which maximizes the margin between the two data sets
Applications
• Face detection, verification, and recognition
• Object detection and recognition
• Handwritten character and digit recognition
• Text detection and categorization
• Speech and speaker verification and recognition
• Information and image retrieval

Formalization
• We are given some training data, a set of points of the form

  D = \{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\; y_i \in \{-1, 1\} \}, \quad i = 1, \ldots, n

• Equation of the separating hyperplane:

  w \cdot x - b = 0

• The vector w is a normal vector to the hyperplane
• The parameter b/||w|| determines the offset of the hyperplane from the origin along the normal vector
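A small illustration of the geometry: the signed distance of points to the hyperplane w·x − b = 0; the particular w, b, and points are made up for the example.

  import numpy as np

  def signed_distance(X, w, b):
      # Signed distance of each row of X to the hyperplane w.x - b = 0
      return (X @ w - b) / np.linalg.norm(w)

  w = np.array([2.0, 1.0])                # illustrative normal vector
  b = 3.0                                 # illustrative offset term
  X = np.array([[2.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
  print(signed_distance(X, w, b))         # sign indicates which side of the hyperplane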
Formalization (cont.)
• Define two parallel hyperplanes:

  H1: w \cdot x - b = 1
  H2: w \cdot x - b = -1

• These hyperplanes are chosen so that no training point lies between them
• To keep the data points from falling between them, impose the two constraints:

  w \cdot x_i - b \ge 1 \quad \text{for } y_i = +1
  w \cdot x_i - b \le -1 \quad \text{for } y_i = -1

Formulation (cont.)
• The two constraints can be rewritten as a single one:

  y_i (w \cdot x_i - b) \ge 1, \quad 1 \le i \le n

• So the formulation of the optimization problem is
  ◦ choose w, b to minimize ||w|| subject to y_i (w \cdot x_i - b) \ge 1 for all i
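One way to solve this small constrained problem numerically is a general-purpose solver; the sketch below uses SciPy's SLSQP on toy data. The data, the variable packing (b stored as the last entry), and the solver choice are illustrative; dedicated SVM solvers are what is used in practice.

  import numpy as np
  from scipy.optimize import minimize

  # Toy 2-D data: two separable classes with labels +1 / -1
  X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                [0.0, 0.0], [1.0, 0.5], [0.5, -0.5]])
  y = np.array([1, 1, 1, -1, -1, -1])

  def objective(p):
      w = p[:-1]
      return 0.5 * w @ w                  # minimizing ||w|| <=> minimizing ||w||^2 / 2

  def margin_constraints(p):
      w, b = p[:-1], p[-1]
      return y * (X @ w - b) - 1.0        # must be >= 0 for every sample

  res = minimize(objective,
                 x0=np.zeros(X.shape[1] + 1),
                 method="SLSQP",
                 constraints=[{"type": "ineq", "fun": margin_constraints}])
  w, b = res.x[:-1], res.x[-1]
  print(w, b, y * (X @ w - b))            # all margins should come out >= 1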
SVM Hyperplane Example
(figure)

SVM Training
• Training is a Lagrange optimization problem
• Introducing a Lagrange multiplier α_i for each margin constraint gives the reformulated (primal) optimization problem L_P
• The new optimization problem is to minimize L_P with respect to w and b, subject to α_i ≥ 0 (a standard form of L_P is sketched below)
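The standard hard-margin primal Lagrangian, stated here as an assumption about the exact form the slide used, together with the stationarity conditions obtained by setting its derivatives with respect to w and b to zero:

  L_P(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i (w \cdot x_i - b) - 1 \,\bigr],
  \qquad \alpha_i \ge 0

  \frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
  \qquad
  \frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

Substituting these conditions back into L_P yields the dual problem in the α_i alone, which is how SVMs are usually trained.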