  1. CS 6316 Machine Learning Linear Predictors Yangfeng Ji Department of Computer Science University of Virginia

  2. Overview
  1. Review: Linear Functions
  2. Perceptron
  3. Logistic Regression
  4. Linear Regression

  3. Review: Linear Functions

  5. Linear Predictors
  Linear predictors discussed in this course:
  ◮ halfspace predictors
  ◮ logistic regression classifiers
  ◮ linear SVMs (lecture on support vector machines)
  ◮ naive Bayes classifiers (lecture on generative models)
  ◮ linear regression predictors
  A common core form of these linear predictors:
  h_{w,b}(x) = ⟨w, x⟩ + b = ∑_{i=1}^d w_i x_i + b   (1)
  where w is the weight vector and b is the bias

  6. Alternative Form
  Given the original definition of a linear function
  h_{w,b}(x) = ⟨w, x⟩ + b = ∑_{i=1}^d w_i x_i + b,   (2)
  we can redefine it in a more compact form:
  w ← (w_1, w_2, ..., w_d, b)^T
  x ← (x_1, x_2, ..., x_d, 1)^T
  and then
  h_{w,b}(x) = ⟨w, x⟩   (3)
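The two equivalent forms above can be sketched in a few lines of NumPy; the function names and the weight/input values here are made up for illustration:

```python
import numpy as np

def predict_wb(w, b, x):
    # Original form: h_{w,b}(x) = <w, x> + b
    return np.dot(w, x) + b

def predict_augmented(w_aug, x_aug):
    # Compact form: h(x) = <w, x>, with b folded into w and 1 appended to x
    return np.dot(w_aug, x_aug)

# Made-up example values
w = np.array([1.0, 1.0])
b = -0.5
x = np.array([0.3, 0.4])

w_aug = np.append(w, b)    # (w_1, ..., w_d, b)^T
x_aug = np.append(x, 1.0)  # (x_1, ..., x_d, 1)^T

# Both forms give the same value, here 0.3 + 0.4 - 0.5 = 0.2
assert np.isclose(predict_wb(w, b, x), predict_augmented(w_aug, x_aug))
```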

  7. Linear Functions
  Consider a two-dimensional case with w = (1, 1, −0.5):
  f(x) = w^T x = x_1 + x_2 − 0.5   (4)
  Different values of f(x) map to different areas of this 2-D space. For example, the following equation defines the blue line L:
  f(x) = w^T x = 0   (5)

  8. Properties of Linear Functions (II)
  For any two points x and x′ lying on the line,
  f(x) − f(x′) = w^T x − w^T x′ = 0   (6)
  [Friedman et al., 2001, Section 4.5]

  9. Properties of Linear Functions (III)
  Furthermore,
  f(x) = x_1 + x_2 − 0.5 = 0   (7)
  separates the 2-D space ℝ² into two halfspaces: one where f(x) > 0 and one where f(x) < 0

  11. Properties of Linear Functions (IV)
  From the perspective of linear projection, f(x) = 0 defines the vectors in this 2-D space whose projections onto the direction (1, 1) have the same magnitude 0.5:
  x_1 + x_2 − 0.5 = 0  ⇒  (x_1, x_2) · (1, 1)^T = 0.5   (8)
  This idea can be generalized to compute the distance between a point and a line.
  [Friedman et al., 2001, Section 4.5]

  12. Properties of Linear Functions (IV)
  The distance of a point x to the line L: f(x) = ⟨w, x⟩ = 0 is given by
  f(x)/‖w‖₂ = ⟨w, x⟩/‖w‖₂ = ⟨w/‖w‖₂, x⟩   (9)
  [Friedman et al., 2001, Section 4.5]
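Eq. 9 is easy to check numerically; a minimal sketch (the helper name `signed_distance` and the example point are ours, not from the slides):

```python
import numpy as np

def signed_distance(w, x):
    # Signed distance of x to the line <w, x> = 0 (Eq. 9)
    return np.dot(w, x) / np.linalg.norm(w)

# For the line x_1 + x_2 = 0 (w = (1, 1)), the point (1, 1) lies at
# distance 2 / sqrt(2) = sqrt(2)
d = signed_distance(np.array([1.0, 1.0]), np.array([1.0, 1.0]))
```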

  13. Perceptron

  14. Halfspace Hypothesis Class
  ◮ X = ℝ^d
  ◮ Y = {−1, +1}
  ◮ Halfspace hypothesis class:
  H_half = {sign(⟨w, x⟩) : w ∈ ℝ^d}   (10)
  which is an infinite hypothesis space. The sign function y = sign(a) returns +1 for a > 0 and −1 for a < 0
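A halfspace predictor h(x) = sign(⟨w, x⟩) is a one-liner. Note the slides leave sign(0) unspecified, so breaking the tie toward +1 below is our own convention:

```python
import numpy as np

def halfspace_predict(w, x):
    # h(x) = sign(<w, x>); ties at 0 broken toward +1 (our convention)
    return 1 if np.dot(w, x) >= 0 else -1

# Augmented weights from the earlier 2-D example: f(x) = x_1 + x_2 - 0.5
w = np.array([1.0, 1.0, -0.5])
print(halfspace_predict(w, np.array([1.0, 1.0, 1.0])))  # 1 + 1 - 0.5 > 0, predicts +1
print(halfspace_predict(w, np.array([0.0, 0.0, 1.0])))  # -0.5 < 0, predicts -1
```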

  15. Linearly Separable Cases
  The algorithm can find a hyperplane that separates all positive examples from the negative examples.
  The definition of linearly separable cases is with respect to the training set S instead of the distribution D

  17. Prediction Rule
  The prediction rule of a halfspace predictor is based on the sign of ⟨w, x⟩: h(x) = sign(⟨w, x⟩), i.e.,
  h(x) = +1 if ⟨w, x⟩ > 0, and −1 if ⟨w, x⟩ < 0   (11)
  or, for y′ ∈ {−1, +1}:
  h(x) = y′ if y′ ⟨w, x⟩ > 0   (12)


  21. Perceptron Algorithm
  The perceptron algorithm is defined as:
  1: Input: S = {(x_1, y_1), ..., (x_m, y_m)}
  2: Initialize w^(0) = (0, ..., 0)
  3: for t = 1, 2, ..., T do
  4:   i ← t mod m
  5:   if y_i ⟨w^(t), x_i⟩ ≤ 0 then
  6:     w^(t+1) ← w^(t) + y_i x_i   // updating rule
  7:   end if
  8: end for
  9: Output: w^(T)
  Exercise: Implement this algorithm with a simple example
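As a sketch of the exercise, the pseudocode above translates almost line-for-line into NumPy; the toy dataset (an AND-style linearly separable problem in augmented form) is made up for illustration:

```python
import numpy as np

def perceptron(X, y, T):
    w = np.zeros(X.shape[1])          # 2: initialize w^(0) = (0, ..., 0)
    m = len(y)
    for t in range(1, T + 1):         # 3: for t = 1, 2, ..., T
        i = t % m                     # 4: i <- t mod m
        if y[i] * np.dot(w, X[i]) <= 0:
            w = w + y[i] * X[i]       # 6: updating rule
    return w                          # 9: output w^(T)

# Toy linearly separable data (AND); the last column is the constant-1 feature
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([-1, -1, -1, 1])
w = perceptron(X, y, T=100)
# After training, every example is classified with a positive margin
assert all(y[i] * np.dot(w, X[i]) > 0 for i in range(len(y)))
```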

  23. Two Questions
  The updating rule can be broken down into two cases:
  w^(t+1) ← w^(t) + y_i x_i   (13)
  ◮ For y_i = +1: w^(t+1) ← w^(t) + x_i
  ◮ For y_i = −1: w^(t+1) ← w^(t) − x_i
  Two questions:
  ◮ How does the updating rule help?
  ◮ How many updating steps does the algorithm need?

  24. The Updating Rule
  At time step t, given the training example (x_i, y_i) and the current weight w^(t):
  y_i ⟨w^(t+1), x_i⟩ = y_i ⟨w^(t) + y_i x_i, x_i⟩   (14)
                     = y_i ⟨w^(t), x_i⟩ + ‖x_i‖²   (15)
  since y_i² = 1. Therefore:
  ◮ w^(t+1) gives a higher value of y_i ⟨w, x_i⟩ on predicting x_i than w^(t)
  ◮ the update is affected by the norm of x_i, ‖x_i‖²
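Eqs. 14–15 can be confirmed numerically with made-up values, since y_i² = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)   # current weight w^(t) (random, for illustration)
x = rng.normal(size=3)   # training example x_i
y = -1.0                 # label y_i in {-1, +1}

w_new = w + y * x        # the perceptron update (Eq. 13)
lhs = y * np.dot(w_new, x)
rhs = y * np.dot(w, x) + np.dot(x, x)   # Eq. 15: old margin + ||x_i||^2
assert np.isclose(lhs, rhs)
```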

  25. Theorem
  Assume that {(x_i, y_i)}_{i=1}^m is separable. Let
  ◮ B = min{‖w‖ : ∀i ∈ [m], y_i ⟨w, x_i⟩ ≥ 1}, and
  ◮ R = max_i ‖x_i‖.
  Then the perceptron algorithm stops after at most (RB)² iterations, and when it stops it holds that
  ∀i ∈ [m], y_i ⟨w^(t), x_i⟩ > 0   (16)
  ◮ A realizable case with an infinite hypothesis space
  ◮ Training finishes in a finite number of steps

  26. Example [Bishop, 2006, Page 195]

  30. The XOR Example: a Non-separable Case
  ◮ X_1, X_2 ∈ {0, 1}
  ◮ the XOR operation is defined as Y = X_1 ⊕ X_2, where Y = 1 if X_1 ≠ X_2 and Y = 0 if X_1 = X_2
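A short argument (with a numerical spot-check) for why no halfspace fits XOR: any linear f(x) = w_1 x_1 + w_2 x_2 + b satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), yet XOR would need the left side negative and the right side positive:

```python
import numpy as np

def f(w, b, x):
    return np.dot(w, x) + b

# XOR would require f(0,0) < 0 and f(1,1) < 0 but f(0,1) > 0 and f(1,0) > 0,
# contradicting the identity checked below, which holds for any w, b.
rng = np.random.default_rng(1)
for _ in range(1000):
    w, b = rng.normal(size=2), rng.normal()
    lhs = f(w, b, [0, 0]) + f(w, b, [1, 1])
    rhs = f(w, b, [0, 1]) + f(w, b, [1, 0])
    assert np.isclose(lhs, rhs)   # both sides equal w_1 + w_2 + 2b
```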

  31. The XOR Example: Further Comment

  32. Logistic Regression

  33. Hypothesis Class
  ◮ The hypothesis class of logistic regression is defined as
  H_LR = {σ(⟨w, x⟩) : w ∈ ℝ^d}   (17)
  ◮ The sigmoid function σ(a), with a ∈ ℝ, is
  σ(a) = 1/(1 + exp(−a))   (18)
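The sigmoid of Eq. 18 in code, applied as a logistic hypothesis σ(⟨w, x⟩); the weight and input values are made up:

```python
import numpy as np

def sigmoid(a):
    # Eq. 18: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

# A logistic hypothesis sigma(<w, x>) with made-up augmented weights
w = np.array([1.0, 1.0, -0.5])
x = np.array([1.0, 1.0, 1.0])
p = sigmoid(np.dot(w, x))   # sigma(1.5), roughly 0.82
```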

  35. Unified Form for Logistic Predictors
  ◮ A unified form for y ∈ {−1, +1}:
  h(x, y) = 1/(1 + exp(−y ⟨w, x⟩))   (19)
  which is similar to the halfspace predictors
  ◮ Prediction:
  1. Compute the values from Eq. 19 with y ∈ {−1, +1}
  2. Pick the y that has the bigger value:
  y = +1 if h(x, +1) > h(x, −1), and −1 if h(x, +1) < h(x, −1)   (20)

  36. A Predictor
  Take a closer look at the unified definition of h(x, y):
  ◮ When y = +1:
  h_w(x, +1) = 1/(1 + exp(−⟨w, x⟩))
  ◮ When y = −1:
  h_w(x, −1) = 1/(1 + exp(⟨w, x⟩)) = exp(−⟨w, x⟩)/(1 + exp(−⟨w, x⟩)) = 1 − 1/(1 + exp(−⟨w, x⟩)) = 1 − h_w(x, +1)
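The identity h(x, −1) = 1 − h(x, +1) derived above can be confirmed with arbitrary made-up numbers:

```python
import numpy as np

def h(w, x, y):
    # Eq. 19: h(x, y) = 1 / (1 + exp(-y <w, x>))
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

w = np.array([0.7, -1.2, 0.4])   # made-up weights
x = np.array([1.0, 0.5, 1.0])    # made-up (augmented) input
# The two class scores always sum to one
assert np.isclose(h(w, x, +1) + h(w, x, -1), 1.0)
# Prediction rule (Eq. 20): pick the y with the larger score
pred = +1 if h(w, x, +1) > h(w, x, -1) else -1
```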

  37. A Linear Classifier?
  To justify that this is a linear classifier, let us look at the decision boundary given by
  h(x, +1) = h(x, −1)   (21)
  Specifically, we have
  1/(1 + exp(−⟨w, x⟩)) = 1/(1 + exp(⟨w, x⟩))
  exp(−⟨w, x⟩) = exp(⟨w, x⟩)
  −⟨w, x⟩ = ⟨w, x⟩
  2 ⟨w, x⟩ = 0
  The decision boundary is a straight line

  38. Risk/Loss Function
  For a given training example (x, y), the risk/loss function is defined as the negative log of h(x, y):
  L(h_w, (x, y)) = −log(1/(1 + exp(−y ⟨w, x⟩))) = log(1 + exp(−y ⟨w, x⟩))   (22)
  Intuitively, minimizing the risk will increase the value of h(x, y)
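A sketch of the loss in Eq. 22; `log1p` is used for a numerically safer log(1 + z), and the example values are made up. A confidently correct prediction (large positive margin y ⟨w, x⟩) gives a small loss, while a wrong prediction gives a large one:

```python
import numpy as np

def logistic_loss(w, x, y):
    # Eq. 22: L = log(1 + exp(-y <w, x>)); log1p(z) = log(1 + z)
    return np.log1p(np.exp(-y * np.dot(w, x)))

w = np.array([1.0, 1.0, -0.5])   # made-up augmented weights
x = np.array([1.0, 1.0, 1.0])
# Margin +1.5 (correct, confident) vs. -1.5 (wrong)
small = logistic_loss(w, x, +1)
large = logistic_loss(w, x, -1)
```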
