


  1. Data Mining: Linear & nonlinear classifiers. Hamid Beigy, Sharif University of Technology, Fall 1396.

  2. Table of contents
     1. Introduction
     2. Linear discriminant analysis
     3. Linear classifiers
     4. Support vector machines
     5. Non-linear support vector machine
     6. Multi-class Classifiers
        One-against-all classification
        One-against-one classification
        Error correcting coding classification


  4. Introduction. In classification, the goal is to find a mapping from inputs x to outputs t ∈ {1, 2, ..., C}, given a labeled set of input-output pairs (the training set) S = {(x_1, t_1), (x_2, t_2), ..., (x_N, t_N)}. Each training input x is a D-dimensional vector of numbers. There are two approaches for building a classifier. Generative approach: first build a joint model of the form p(x, C_n) and then condition on x to derive the posterior p(C_n | x), as spelled out below. Discriminative approach: build a model of the form p(C_n | x) directly.
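The step from the joint model to the posterior in the generative approach is just Bayes' rule; written out for completeness (this standard derivation is not on the slide itself):

    p(C_n \mid x) = \frac{p(x, C_n)}{p(x)} = \frac{p(x \mid C_n)\, p(C_n)}{\sum_{k=1}^{C} p(x \mid C_k)\, p(C_k)}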


  6. Linear discriminant analysis (LDA). One way to view a linear classification model is in terms of dimensionality reduction. Assume that we want to project a vector onto another vector to obtain a new point after a change of the basis vectors. Let a, b ∈ R^n be two n-dimensional vectors. The orthogonal decomposition of b in the direction of a is $b = b_\parallel + b_\perp = p + r$, where $p = b_\parallel$ is parallel to a and $r = b_\perp$ is perpendicular to a. [Figure: b decomposed into p = b_∥ along a and r = b_⊥ perpendicular to a, shown in the X1-X2 plane.] The vector p is called the orthogonal projection, or simply the projection, of b onto a.

  7. Linear discriminant analysis (LDA). The projection p can be written as p = ca, where c is a scalar and p is parallel to a. [Figure: the same decomposition of b into p = b_∥ and r = b_⊥.] Thus r = b − p = b − ca. Since p and r are orthogonal, we have $p^T r = (ca)^T (b - ca) = c\, a^T b - c^2\, a^T a = 0$, which implies $c = \frac{a^T b}{a^T a}$. Therefore, the projection of b onto a equals $p = b_\parallel = ca = \left(\frac{a^T b}{a^T a}\right) a$.
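As a quick illustration, here is a minimal NumPy sketch of this projection formula; the particular vectors a and b are made-up example values, not taken from the slides.

    import numpy as np

    def project(b, a):
        # Orthogonal projection of b onto a: p = (a^T b / a^T a) * a
        c = np.dot(a, b) / np.dot(a, a)
        return c * a

    a = np.array([5.0, 1.0])
    b = np.array([4.0, 4.0])
    p = project(b, a)          # component of b parallel to a
    r = b - p                  # component of b perpendicular to a
    print(p, r, np.dot(p, r))  # p^T r is zero up to floating-point error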

  8. Linear discriminant analysis (LDA). Consider a two-class problem and suppose we take a D-dimensional input vector x and project it down to one dimension using $z = W^T x$. If we place a threshold on z and classify $z \ge w_0$ as class C_1, and otherwise as class C_2, then we obtain the standard linear classifier. [Figure: two-class data in two dimensions together with the projection direction w.]
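A minimal sketch of this project-then-threshold rule, assuming a data matrix X, a projection direction w, and a threshold w0 (the numbers below are illustrative, not from the slides):

    import numpy as np

    def classify(X, w, w0):
        # Project each row of X onto w and threshold: z >= w0 -> class 1, else class 2
        z = X @ w
        return np.where(z >= w0, 1, 2)

    w = np.array([1.0, -0.5])
    X = np.array([[6.0, 3.0], [4.5, 4.0]])
    print(classify(X, w, w0=2.0))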

  9. Linear discriminant analysis (cont.). Consider a two-class problem in which there are N_1 points of class C_1 and N_2 points of class C_2. The mean vector of class C_j is given by $\mu_j = \frac{1}{N_j} \sum_{i \in C_j} x_i$. The simplest measure of the separation of the classes, when projected onto W, is the separation of the projected class means. This suggests choosing W so as to maximize $m_2 - m_1 = W^T (\mu_2 - \mu_1)$, where $m_j = W^T \mu_j$. This expression can be made arbitrarily large simply by increasing the magnitude of W. To solve this problem, we constrain W to have unit length, so that $\sum_i w_i^2 = 1$. Using a Lagrange multiplier to perform the constrained maximization, we then find that $W \propto (\mu_2 - \mu_1)$.
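A minimal NumPy sketch of this means-difference direction, assuming a data matrix X and a label vector t with values in {1, 2} (both names are illustrative, not from the slides):

    import numpy as np

    def means_difference_direction(X, t):
        # W proportional to mu_2 - mu_1, normalized to unit length
        mu1 = X[t == 1].mean(axis=0)
        mu2 = X[t == 2].mean(axis=0)
        w = mu2 - mu1
        return w / np.linalg.norm(w)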

  10. Linear discriminant analysis (cont.). This approach has a problem: the figure on this slide shows two classes that are well separated in the original two-dimensional space but that have considerable overlap when projected onto the line joining their means. [Figure: two elongated class clusters whose projections onto the line joining the class means overlap heavily.] This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap.

  11. Linear discriminant analysis (cont.). The projection $z = W^T x$ transforms the set of labeled data points in x into a labeled set in the one-dimensional space z. The within-class variance of the transformed data from class C_j equals $s_j^2 = \sum_{i \in C_j} (z_i - m_j)^2$, where $z_i = W^T x_i$. We define the total within-class variance for the whole data set to be $s_1^2 + s_2^2$. The Fisher criterion is defined as the ratio of the between-class variance to the within-class variance, $J(W) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$.
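A minimal sketch of evaluating the Fisher criterion for a candidate direction w, assuming the same illustrative X and t as above:

    import numpy as np

    def fisher_criterion(w, X, t):
        # J(w) = (m2 - m1)^2 / (s1^2 + s2^2) for the projected data z = X w
        z = X @ w
        m1, m2 = z[t == 1].mean(), z[t == 2].mean()
        s1 = ((z[t == 1] - m1) ** 2).sum()
        s2 = ((z[t == 2] - m2) ** 2).sum()
        return (m2 - m1) ** 2 / (s1 + s2)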

  12. Linear discriminant analysis (cont.). The between-class covariance matrix is $S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$. The total within-class covariance matrix is $S_W = \sum_{i \in C_1} (x_i - \mu_1)(x_i - \mu_1)^T + \sum_{i \in C_2} (x_i - \mu_2)(x_i - \mu_2)^T$. We then have $(m_1 - m_2)^2 = \left(W^T \mu_1 - W^T \mu_2\right)^2 = W^T \underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B}\, W = W^T S_B W$.

  13. Linear discriminant analysis (cont.). Similarly, for the within-class variances we have $s_1^2 = \sum_{i \in C_1} \left(W^T x_i - m_1\right)^2 = \sum_{i \in C_1} W^T (x_i - \mu_1)(x_i - \mu_1)^T W = W^T \underbrace{\left(\sum_{i \in C_1} (x_i - \mu_1)(x_i - \mu_1)^T\right)}_{S_1} W = W^T S_1 W$, and likewise $s_2^2 = W^T S_2 W$, so that $S_W = S_1 + S_2$. Hence, J(W) can be written as $J(W) = \frac{W^T S_B W}{W^T S_W W}$. Setting the derivative of J(W) with respect to W to zero (using $\frac{\partial\, x^T A x}{\partial x} = (A + A^T) x$) gives $W \propto S_W^{-1} (\mu_2 - \mu_1)$. This result is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projecting the data down to one dimension.
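Putting the pieces together, a minimal NumPy sketch of Fisher's direction $W \propto S_W^{-1} (\mu_2 - \mu_1)$, again assuming an illustrative data matrix X and label vector t with values in {1, 2}:

    import numpy as np

    def fisher_direction(X, t):
        # Fisher's linear discriminant: w proportional to S_W^{-1} (mu2 - mu1)
        X1, X2 = X[t == 1], X[t == 2]
        mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - mu1).T @ (X1 - mu1)   # within-class scatter of class C1
        S2 = (X2 - mu2).T @ (X2 - mu2)   # within-class scatter of class C2
        SW = S1 + S2
        w = np.linalg.solve(SW, mu2 - mu1)
        return w / np.linalg.norm(w)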


  15. Linear classifiers. We consider the following type of linear classifier: $y(x_n) = g(x_n) = \mathrm{sign}(w_1 x_{n1} + w_2 x_{n2} + \cdots + w_D x_{nD}) = \mathrm{sign}(w^T x_n) = \mathrm{sign}\left(\sum_{j=1}^{D} w_j x_{nj}\right) \in \{-1, +1\}$, where $w = (w_1, w_2, \ldots, w_D)^T \in R^D$ and $x_n = (x_{n1}, x_{n2}, \ldots, x_{nD})^T$ is a column vector of real values. Different values of w give different functions. This classifier changes its prediction only when the argument to the sign function changes from positive to negative (or vice versa). Geometrically, this transition in the feature space corresponds to crossing the decision boundary where the argument is exactly zero: all x such that $w^T x = 0$. [Figure: two examples of linear decision boundaries in a two-dimensional feature space.]
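A minimal sketch of this sign-based linear classifier; the weight vector and inputs below are made-up illustrative values.

    import numpy as np

    def predict(X, w):
        # y(x_n) = sign(w^T x_n) in {-1, +1}; points with w^T x = 0 lie on the
        # decision boundary and are mapped to +1 here by convention
        s = np.sign(X @ w)
        return np.where(s == 0, 1, s).astype(int)

    w = np.array([2.0, -1.0])
    X = np.array([[1.0, 0.5], [-3.0, 2.0]])
    print(predict(X, w))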

