COMS 4721: Machine Learning for Data Science
Lecture 8, 2/14/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University
LINEAR CLASSIFICATION
BINARY CLASSIFICATION

We focus on binary classification, with input $x_i \in \mathbb{R}^d$ and output $y_i \in \{\pm 1\}$.

◮ We define a classifier $f$, which makes the prediction $y_i = f(x_i, \Theta)$ based on a function of $x_i$ and parameters $\Theta$. In other words, $f : \mathbb{R}^d \to \{-1, +1\}$.

Last lecture, we discussed the Bayes classification framework.

◮ There, $\Theta$ contains: (1) class prior probabilities on $y$, (2) parameters for the class-dependent distribution on $x$.

This lecture we'll introduce the linear classification framework.

◮ In this approach the prediction is linear in the parameters $\Theta$.
◮ In fact, there is an intersection between the two that we discuss next.
A BAYES CLASSIFIER

Bayes decisions

With the Bayes classifier we predict the class of a new $x$ to be the most probable label given the model and training data $(x_1, y_1), \ldots, (x_n, y_n)$.

In the binary case, we declare class $y = 1$ if
$$
p(x \mid y=1)\underbrace{P(y=1)}_{\pi_1} \;>\; p(x \mid y=0)\underbrace{P(y=0)}_{\pi_0}
\quad\Longleftrightarrow\quad
\ln \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=0)\,P(y=0)} > 0.
$$

This second expression is referred to as the log odds.
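As a minimal numeric sketch of this rule (my own illustration, not from the slides): evaluate the log odds for two univariate Gaussian class-conditional densities with made-up means, variance, and priors, and declare class $+1$ when it is positive.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors (illustrative values only)
pi1, pi0 = 0.5, 0.5
mu1, mu0, sigma = 2.0, -1.0, 1.5

def bayes_predict(x):
    # log odds: ln[ p(x|y=1) pi_1 / ( p(x|y=0) pi_0 ) ]
    log_odds = (norm.logpdf(x, mu1, sigma) + np.log(pi1)) \
             - (norm.logpdf(x, mu0, sigma) + np.log(pi0))
    return 1 if log_odds > 0 else -1

print(bayes_predict(1.0))   # declares +1 when the log odds are positive
```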
A BAYES CLASSIFIER

Gaussian with shared covariance

Let's look at the log odds for the special case where $p(x \mid y) = N(x \mid \mu_y, \Sigma)$ (i.e., a single Gaussian per class with a shared covariance matrix):
$$
\ln \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=0)\,P(y=0)}
= \underbrace{\ln\frac{\pi_1}{\pi_0} - \tfrac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)}_{\text{a constant, call it } w_0}
\;+\; x^T \underbrace{\Sigma^{-1}(\mu_1 - \mu_0)}_{\text{a vector, call it } w}.
$$

This is also called "linear discriminant analysis" (LDA).
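For completeness, here is a sketch of the algebra behind the boxed terms (not spelled out on the slide): because both classes share $\Sigma$, the Gaussian normalizers and the $-\tfrac{1}{2}x^T\Sigma^{-1}x$ terms cancel in the log odds, leaving

$$
\begin{aligned}
\ln\frac{N(x\mid\mu_1,\Sigma)\,\pi_1}{N(x\mid\mu_0,\Sigma)\,\pi_0}
&= \ln\frac{\pi_1}{\pi_0} - \tfrac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0) \\
&= \ln\frac{\pi_1}{\pi_0} - \tfrac{1}{2}\big(\mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0\big) + x^T\Sigma^{-1}(\mu_1-\mu_0),
\end{aligned}
$$

and since $\mu_1^T\Sigma^{-1}\mu_1 - \mu_0^T\Sigma^{-1}\mu_0 = (\mu_1+\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)$ (the cross terms cancel by symmetry of $\Sigma^{-1}$), this matches the constant $w_0$ and vector $w$ above.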
A BAYES CLASSIFIER

So we can write the decision rule for this Bayes classifier as a linear one:
$$ f(x) = \operatorname{sign}(x^T w + w_0). $$

◮ This is what we saw last lecture (but now class 0 is called $-1$).
◮ The Bayes classifier produced a linear decision boundary in the data space when $\Sigma_1 = \Sigma_0$.
◮ $w$ and $w_0$ are obtained through a specific equation.

[Figure: two Gaussian classes with equal priors $P(\omega_1) = P(\omega_2) = 0.5$ and shared covariance; the boundary between decision regions $R_1$ and $R_2$ is a line.]
LINEAR CLASSIFIERS

This Bayes classifier is one instance of a linear classifier
$$ f(x) = \operatorname{sign}(x^T w + w_0), $$
where
$$
w_0 = \ln\frac{\pi_1}{\pi_0} - \tfrac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0),
\qquad
w = \Sigma^{-1}(\mu_1 - \mu_0),
$$
with MLE used to find values for $\pi_y$, $\mu_y$ and $\Sigma$ (a plug-in sketch follows below).

Setting $w_0$ and $w$ this way may be too restrictive:
◮ This Bayes classifier assumes a single Gaussian with shared covariance.
◮ Maybe if we relax what values $w_0$ and $w$ can take we can do better.
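A minimal plug-in implementation of this classifier, assuming `X` is an $n \times d$ array and `y` holds labels in $\{-1,+1\}$ (function and variable names are my own, not from the course):

```python
import numpy as np

def lda_fit(X, y):
    """Plug-in Bayes classifier with shared covariance (LDA), labels in {-1, +1}."""
    X1, X0 = X[y == 1], X[y == -1]
    pi1, pi0 = len(X1) / len(X), len(X0) / len(X)
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    # pooled (shared) MLE covariance
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X0 - mu0).T @ (X0 - mu0)) / len(X)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    w0 = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)
    return w, w0

def lda_predict(X, w, w0):
    return np.sign(X @ w + w0)
```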
LINEAR CLASSIFIERS (BINARY CASE)

Definition: Binary linear classifier
A binary linear classifier is a function of the form
$$ f(x) = \operatorname{sign}(x^T w + w_0), $$
where $w \in \mathbb{R}^d$ and $w_0 \in \mathbb{R}$.

Since the goal is to learn $w, w_0$ from data, we are assuming that linear separability in $x$ is an accurate property of the classes.

Definition: Linear separability
Two sets $A, B \subset \mathbb{R}^d$ are called linearly separable if
$$
x^T w + w_0 \;
\begin{cases}
> 0 & \text{if } x \in A \ (\text{e.g., class } +1) \\
< 0 & \text{if } x \in B \ (\text{e.g., class } -1)
\end{cases}
$$

The pair $(w, w_0)$ defines an affine hyperplane. It is important to develop the right geometric understanding about what this is doing.
HYPERPLANES

Geometric interpretation of linear classifiers:

A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension $(d-1)$.
◮ An $\mathbb{R}^2$-hyperplane is a line.
◮ An $\mathbb{R}^3$-hyperplane is a plane.
◮ As a linear subspace, a hyperplane always contains the origin.

A hyperplane $H$ can be represented by a vector $w$ as follows:
$$ H = \left\{\, x \in \mathbb{R}^d \mid x^T w = 0 \,\right\}. $$

[Figure: a line $H$ through the origin in the $(x_1, x_2)$ plane, with $w$ orthogonal to it.]
WHICH SIDE OF THE PLANE ARE WE ON?

Distance from the plane
◮ How close is a point $x$ to $H$?
◮ Cosine rule: $x^T w = \|x\|_2 \|w\|_2 \cos\theta$.
◮ The distance of $x$ to the hyperplane is $\|x\|_2 \cdot |\cos\theta| = |x^T w| / \|w\|_2$. So $|x^T w|$ gives a sense of distance.

Which side of the hyperplane?
◮ The cosine satisfies $\cos\theta > 0$ if $\theta \in (-\frac{\pi}{2}, \frac{\pi}{2})$.
◮ So the sign of $\cos(\cdot)$ tells us the side of $H$, and by the cosine rule $\operatorname{sign}(\cos\theta) = \operatorname{sign}(x^T w)$ (a small numeric example follows below).

[Figure: point $x$ at angle $\theta$ from $w$, with $H$ orthogonal to $w$; the projection $\|x\|_2 \cos\theta$ gives the signed distance to $H$.]
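A tiny sketch of these two computations, using made-up values for $w$ and $x$:

```python
import numpy as np

# Illustrative values (not from the slides): hyperplane normal w and a point x
w = np.array([2.0, 1.0])
x = np.array([1.0, 3.0])

side = np.sign(x @ w)                      # which side of H the point lies on
distance = abs(x @ w) / np.linalg.norm(w)  # |x^T w| / ||w||_2
print(side, distance)
```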
AFFINE HYPERPLANES

Affine hyperplanes
◮ An affine hyperplane $H$ is a hyperplane translated (shifted) using a scalar $w_0$.
◮ Think of $H:\ x^T w + w_0 = 0$.
◮ Setting $w_0 > 0$ moves the hyperplane in the opposite direction of $w$ ($w_0 < 0$ in the figure).

Which side of the hyperplane now?
◮ The plane has been shifted by distance $-\frac{w_0}{\|w\|_2}$ in the direction $w$.
◮ For a given $w$, $w_0$ and input $x$, the inequality $x^T w + w_0 > 0$ says that $x$ is on the far side of the affine hyperplane $H$ in the direction $w$ points.

[Figure: the shifted hyperplane $H$ with normal $w$, offset $-w_0 / \|w\|_2$ from the origin.]
CLASSIFICATION WITH AFFINE HYPERPLANES

[Figure: the affine hyperplane $H$ with normal vector $w$, offset $-\frac{w_0}{\|w\|_2}$ from the origin; points with $\operatorname{sign}(x^T w + w_0) > 0$ lie on the side $w$ points toward, and points with $\operatorname{sign}(x^T w + w_0) < 0$ lie on the other side.]
POLYNOMIAL GENERALIZATIONS

The same generalizations from regression also hold for classification (see the feature-map sketch below):
◮ (left) A linear classifier using $x = (x_1, x_2)$.
◮ (right) A linear classifier using $x = (x_1, x_2, x_1^2, x_2^2)$. The decision boundary is linear in $\mathbb{R}^4$, but isn't when plotted in $\mathbb{R}^2$.
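A hedged sketch of this idea (my own illustration, not course code): expand each input with squared features, then run any linear classifier in the expanded space.

```python
import numpy as np

def quadratic_features(X):
    """Map (x1, x2) -> (x1, x2, x1^2, x2^2); the classifier stays linear in this space."""
    return np.column_stack([X, X**2])

# Example: a boundary that is linear in R^4 but curved when drawn in the original R^2
X = np.array([[0.5, -1.0], [2.0, 1.5]])
print(quadratic_features(X))
```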
ANOTHER BAYES CLASSIFIER

Gaussian with different covariance

Let's look at the log odds for the general case where $p(x \mid y) = N(x \mid \mu_y, \Sigma_y)$ (i.e., now each class has its own covariance):
$$
\ln \frac{p(x \mid y=1)\,P(y=1)}{p(x \mid y=0)\,P(y=0)}
= \underbrace{\text{something complicated not involving } x}_{\text{a constant}}
\;+\; \underbrace{x^T(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0)}_{\text{a part that's linear in } x}
\;+\; \underbrace{x^T(\Sigma_0^{-1}/2 - \Sigma_1^{-1}/2)\,x}_{\text{a part that's quadratic in } x}
$$

This is also called "quadratic discriminant analysis," but it's linear in the weights.
ANOTHER BAYES CLASSIFIER

◮ We also saw this last lecture.
◮ Notice that
$$ f(x) = \operatorname{sign}(x^T A x + x^T b + c) $$
is linear in $A$, $b$, $c$.
◮ When $x \in \mathbb{R}^2$, rewrite as $x \leftarrow (x_1, x_2, 2x_1 x_2, x_1^2, x_2^2)$ and do linear classification in $\mathbb{R}^5$ (sketched below).

Whereas the Bayes classifier with shared covariance is a version of linear classification, using different covariances is like polynomial classification.

[Figure: class-conditional Gaussian densities $p$ with different covariances; the resulting decision boundary is curved in the data space.]
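A minimal sketch of that rewriting (my own code): after this feature map, $x^T A x + x^T b + c$ becomes an ordinary inner product with a single weight vector in $\mathbb{R}^5$ plus a bias, e.g. weights $(b_1, b_2, a_{12}, a_{11}, a_{22})$ for a symmetric $A$.

```python
import numpy as np

def quadratic_to_linear_features(X):
    """Map (x1, x2) -> (x1, x2, 2*x1*x2, x1^2, x2^2), so that x^T A x + x^T b + c
    is a linear function of the expanded features (plus a bias term)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, 2 * x1 * x2, x1**2, x2**2])
```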
LEAST SQUARES ON $\{-1, +1\}$

How do we define more general classifiers of the form $f(x) = \operatorname{sign}(x^T w + w_0)$?

◮ One simple idea is to treat classification as a regression problem (see the sketch below):
 1. Let $y = (y_1, \ldots, y_n)^T$, where $y_i \in \{-1, +1\}$ is the class of $x_i$.
 2. Add a dimension equal to 1 to each $x_i$ and construct the matrix $X = [x_1, \ldots, x_n]^T$.
 3. Learn the least squares weight vector $w = (X^T X)^{-1} X^T y$.
 4. For a new point $x_0$, declare $y_0 = \operatorname{sign}(x_0^T w)$ (note that $w_0$ is included in $w$).

◮ Another option: instead of plain least squares, use $\ell_p$ regularization.
◮ These are "baseline" options. We can use them, along with $k$-NN, to get a quick sense of what performance we're aiming to beat.
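A minimal sketch of this baseline (my own code; the `reg` term gives one possible $\ell_2$-regularized variant, and all names are illustrative):

```python
import numpy as np

def ls_classifier_fit(X, y, reg=0.0):
    """Least squares on labels in {-1, +1}; reg > 0 gives a ridge-regularized variant."""
    Xb = np.column_stack([X, np.ones(len(X))])   # append the constant-1 dimension
    d = Xb.shape[1]
    w = np.linalg.solve(Xb.T @ Xb + reg * np.eye(d), Xb.T @ y)
    return w

def ls_classifier_predict(X, w):
    Xb = np.column_stack([X, np.ones(len(X))])
    return np.sign(Xb @ w)
```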
SENSITIVITY TO OUTLIERS

Least squares can do well, but it is sensitive to outliers. In general we can find better classifiers that focus more on the decision boundary.

◮ (left) Least squares (purple) does well compared with another method.
◮ (right) Least squares does poorly because of outliers.

[Figure: two scatter plots of the same two classes comparing decision boundaries; the right panel contains outlying points.]
THE PERCEPTRON ALGORITHM
EASY CASE: LINEARLY SEPARABLE DATA

(Assume each data point $x_i$ has a 1 attached.)

Suppose there is a linear classifier with zero training error:
$$ y_i = \operatorname{sign}(x_i^T w), \quad \text{for all } i. $$
Then the data is "linearly separable."

[Figure, left: the classes can be separated with a line; in fact, an infinite number of separating lines can be found.]
PERCEPTRON (ROSENBLATT, 1958)

Using the linear classifier $y = f(x) = \operatorname{sign}(x^T w)$, the Perceptron seeks to minimize
$$ \mathcal{L} = -\sum_{i=1}^n \left( y_i \cdot x_i^T w \right) \mathbb{1}\{\, y_i \neq \operatorname{sign}(x_i^T w) \,\}. $$

Because $y \in \{-1, +1\}$,
$$
y_i \cdot x_i^T w \;\text{ is }\;
\begin{cases}
> 0 & \text{if } y_i = \operatorname{sign}(x_i^T w) \\
< 0 & \text{if } y_i \neq \operatorname{sign}(x_i^T w)
\end{cases}
$$

By minimizing $\mathcal{L}$ we're trying to always predict the correct label (a small implementation of this loss follows below).
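A short sketch of the loss, assuming `X` already includes the constant-1 column and `y` holds labels in $\{-1,+1\}$ (my own code):

```python
import numpy as np

def perceptron_loss(X, y, w):
    """L = - sum_i (y_i * x_i^T w) over the misclassified points only."""
    margins = y * (X @ w)          # y_i * x_i^T w for every point
    misclassified = margins < 0    # 1{ y_i != sign(x_i^T w) }
    return -np.sum(margins[misclassified])
```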
LEARNING THE PERCEPTRON

◮ Unlike other techniques we've talked about, we can't find the minimum of $\mathcal{L}$ by taking a derivative and setting it to zero: $\nabla_w \mathcal{L} = 0$ cannot be solved for $w$ analytically. However, $\nabla_w \mathcal{L}$ does tell us the direction in which $\mathcal{L}$ is increasing in $w$.
◮ Therefore, for a sufficiently small $\eta$, if we update $w' \leftarrow w - \eta \nabla_w \mathcal{L}$, then $\mathcal{L}(w') < \mathcal{L}(w)$, i.e., we have a better value for $w$.
◮ This is a very general method for optimizing objective functions called gradient descent. The Perceptron uses a "stochastic" version of this, sketched below.
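A minimal sketch of the stochastic version (the standard perceptron update rule; my own implementation, not the course's code, again assuming `X` carries the constant-1 column and `y` is in $\{-1,+1\}$):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Stochastic gradient step on the perceptron loss: a misclassified (x_i, y_i)
    contributes gradient -y_i * x_i, so the update is w <- w + eta * y_i * x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        updated = False
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ w) <= 0:       # misclassified (or exactly on the boundary)
                w = w + eta * y[i] * X[i]
                updated = True
        if not updated:                      # no mistakes this pass: separable data, done
            break
    return w
```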