Logistic regression (CS 446)


  1. Logistic regression CS 446

  2. 1. Linear classifiers

  3. Linear regression. Last two lectures, we studied linear regression; the output/label space $\mathcal{Y}$ was $\mathbb{R}$.
  [Figure: scatter plot of delay versus duration.]

  4. Linear classification. Today, the goal is a linear classifier; the output/label space $\mathcal{Y}$ is discrete.
  [Figure: two-class data in the plane.]

  5. Notation. For now, let's consider binary classification: $\mathcal{Y} = \{-1, +1\}$. A linear predictor $w \in \mathbb{R}^d$ classifies according to $\mathrm{sign}(w^\top x) \in \{-1, +1\}$. Given examples $((x_i, y_i))_{i=1}^{n}$ and a predictor $w \in \mathbb{R}^d$, we want $\mathrm{sign}(w^\top x_i)$ and $y_i$ to agree.
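  As a concrete illustration (not part of the original deck), a minimal NumPy sketch of such a predictor; the arrays X, y, and w are hypothetical.

    import numpy as np

    # Hypothetical data: rows of X are examples x_i, labels y_i are in {-1, +1}.
    X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0]])
    y = np.array([+1, -1, +1])
    w = np.array([0.7, -0.3])

    predictions = np.sign(X @ w)            # sign(w^T x_i) for every example
    agreement = np.mean(predictions == y)   # fraction of examples where they agree
    print(predictions, agreement)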

  6. Geometry of linear classifiers. A hyperplane in $\mathbb{R}^d$ is a linear subspace of dimension $d - 1$.
  ◮ A hyperplane in $\mathbb{R}^2$ is a line.
  ◮ A hyperplane in $\mathbb{R}^3$ is a plane.
  ◮ As a linear subspace, a hyperplane always contains the origin.
  A hyperplane $H$ can be specified by a (non-zero) normal vector $w \in \mathbb{R}^d$. The hyperplane with normal vector $w$ is the set of points orthogonal to $w$: $H = \{x \in \mathbb{R}^d : x^\top w = 0\}$. Given $w$ and its corresponding $H$: $H$ splits $\mathbb{R}^d$ into the set labeled positive, $\{x : w^\top x > 0\}$, and the set labeled negative, $\{x : w^\top x < 0\}$.
  [Figure: a hyperplane $H$ in the $(x_1, x_2)$ plane with normal vector $w$.]

  7. Classification with a hyperplane.
  [Figure: the hyperplane $H$, its normal vector $w$, and the line $\mathrm{span}\{w\}$.]

  8. Classification with a hyperplane. The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos(\theta)$, where
  $\cos(\theta) = \dfrac{x^\top w}{\|w\|_2 \, \|x\|_2}.$
  (The distance to the hyperplane is $\|x\|_2 \cdot |\cos(\theta)|$.)
  [Figure: the point $x$, the angle $\theta$ between $x$ and $w$, and the projection of $x$ onto $\mathrm{span}\{w\}$.]

  9. Classification with a hyperplane. The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos(\theta)$, where
  $\cos(\theta) = \dfrac{x^\top w}{\|w\|_2 \, \|x\|_2}.$
  (The distance to the hyperplane is $\|x\|_2 \cdot |\cos(\theta)|$.)
  The decision boundary is the hyperplane (oriented by $w$):
  $x^\top w > 0 \iff \|x\|_2 \cos(\theta) > 0 \iff x$ lies on the same side of $H$ as $w$.
  [Figure: the point $x$, the angle $\theta$ between $x$ and $w$, and the projection of $x$ onto $\mathrm{span}\{w\}$.]

  10. Classification with a hyperplane. The projection of $x$ onto $\mathrm{span}\{w\}$ (a line) has coordinate $\|x\|_2 \cos(\theta)$, where
  $\cos(\theta) = \dfrac{x^\top w}{\|w\|_2 \, \|x\|_2}.$
  (The distance to the hyperplane is $\|x\|_2 \cdot |\cos(\theta)|$.)
  The decision boundary is the hyperplane (oriented by $w$):
  $x^\top w > 0 \iff \|x\|_2 \cos(\theta) > 0 \iff x$ lies on the same side of $H$ as $w$.
  What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?
  [Figure: the point $x$, the angle $\theta$ between $x$ and $w$, and the projection of $x$ onto $\mathrm{span}\{w\}$.]
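  A small sketch (with made-up vectors, not from the slides) checking these quantities: the signed projection coordinate $x^\top w / \|w\|_2$ determines the side of $H$, and its absolute value is the distance to $H$.

    import numpy as np

    w = np.array([3.0, 4.0])   # normal vector defining H = {x : x^T w = 0}
    x = np.array([1.0, 2.0])   # a point to classify

    coord = (x @ w) / np.linalg.norm(w)   # ||x||_2 * cos(theta), signed
    distance = abs(coord)                 # distance from x to H
    side = np.sign(x @ w)                 # +1: same side as w, -1: opposite side
    print(coord, distance, side)          # 2.2 2.2 1.0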

  11. Linear separability. Is it always possible to find $w$ with $\mathrm{sign}(w^\top x_i) = y_i$? Is it always possible to find a hyperplane separating the data? (Appending a 1 to each $x_i$ means the hyperplane need not go through the origin.)
  [Figures: a linearly separable data set and a data set that is not linearly separable.]
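  The "append a 1" trick from the slide, as a hedged sketch with hypothetical arrays: a predictor on the augmented data realizes $\mathrm{sign}(w^\top x + b)$, i.e. a hyperplane that need not pass through the origin.

    import numpy as np

    X = np.array([[0.2, 0.9], [1.5, 0.1], [-0.7, 0.4]])   # hypothetical inputs
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])      # append a constant 1 feature

    # w_aug = (w, b) in R^{d+1}; the last coordinate acts as a bias/offset b.
    w_aug = np.array([1.0, -2.0, 0.5])
    print(np.sign(X_aug @ w_aug))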

  12. Decision boundary with quadratic feature expansion.
  [Figures: an elliptical decision boundary and a hyperbolic decision boundary.]

  13. Decision boundary with quadratic feature expansion. The same feature expansions we saw for linear regression models can also be used here to "upgrade" linear classifiers.
  [Figures: an elliptical decision boundary and a hyperbolic decision boundary.]
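  One possible quadratic expansion in two dimensions (the slide does not spell out the exact feature map, so this particular map is an assumption): a linear classifier on these features yields elliptical or hyperbolic boundaries in the original plane.

    import numpy as np

    def quadratic_features(X):
        """Map each (x1, x2) to (1, x1, x2, x1^2, x2^2, x1*x2)."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

    X = np.array([[1.0, 2.0], [0.5, -1.0]])
    print(quadratic_features(X))
    # A linear predictor w in R^6 on these features corresponds to a quadratic
    # decision boundary (ellipse, hyperbola, ...) in the original (x1, x2) plane.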

  14. Finding linear classifiers with ERM. Why not feed our goal into an optimization package, in the form
  $\arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i]$?

  15. Finding linear classifiers with ERM. Why not feed our goal into an optimization package, in the form
  $\arg\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i]$?
  ◮ Discrete/combinatorial search; often NP-hard.

  16. Relaxing the ERM problem. Let's remove one source of discreteness:
  $\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i] \;\longrightarrow\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y_i (w^\top x_i) \leq 0].$
  Did we lose something in this process? Should it be "$>$" or "$\geq$"?

  17. Relaxing the ERM problem. Let's remove one source of discreteness:
  $\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\mathrm{sign}(w^\top x_i) \neq y_i] \;\longrightarrow\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[y_i (w^\top x_i) \leq 0].$
  Did we lose something in this process? Should it be "$>$" or "$\geq$"?
  $y_i (w^\top x_i)$ is the (unnormalized) margin of $w$ on example $i$; we have written this problem with a margin loss:
  $\widehat{R}_{\mathrm{zo}}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{zo}}(y_i w^\top x_i),$ where $\ell_{\mathrm{zo}}(z) = \mathbf{1}[z \leq 0].$
  (The remainder of the lecture will use single-parameter margin losses.)
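  The empirical zero-one (margin) risk $\widehat{R}_{\mathrm{zo}}$ from the slide, as a short sketch; the arguments w, X, y are hypothetical placeholders.

    import numpy as np

    def risk_zero_one(w, X, y):
        """Empirical zero-one risk: the average of 1[y_i * (w^T x_i) <= 0]."""
        margins = y * (X @ w)          # unnormalized margins y_i * w^T x_i
        return np.mean(margins <= 0.0)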

  18. 2. Logistic loss and risk

  19. Logistic loss. Let's state our classification goal with a generic margin loss $\ell$:
  $\widehat{R}_{\ell}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i);$
  the key properties we want:
  ◮ $\ell$ is continuous;
  ◮ $\ell(z) \geq c \, \mathbf{1}[z \leq 0] = c \, \ell_{\mathrm{zo}}(z)$ for some $c > 0$ and every $z \in \mathbb{R}$, which implies $\widehat{R}_{\ell}(w) \geq c \, \widehat{R}_{\mathrm{zo}}(w)$;
  ◮ $\ell'(0) < 0$ (pushes stuff from wrong to right).

  20. Logistic loss. Let's state our classification goal with a generic margin loss $\ell$:
  $\widehat{R}_{\ell}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i);$
  the key properties we want:
  ◮ $\ell$ is continuous;
  ◮ $\ell(z) \geq c \, \mathbf{1}[z \leq 0] = c \, \ell_{\mathrm{zo}}(z)$ for some $c > 0$ and every $z \in \mathbb{R}$, which implies $\widehat{R}_{\ell}(w) \geq c \, \widehat{R}_{\mathrm{zo}}(w)$;
  ◮ $\ell'(0) < 0$ (pushes stuff from wrong to right).
  Examples.
  ◮ Squared loss, written in margin form: $\ell_{\mathrm{ls}}(z) := (1 - z)^2$; note that for $y \in \{-1, +1\}$, $\ell_{\mathrm{ls}}(y\hat{y}) = (1 - y\hat{y})^2 = y^2 (y - \hat{y})^2 = (y - \hat{y})^2$.
  ◮ Logistic loss: $\ell_{\log}(z) = \ln(1 + \exp(-z))$.
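  A sketch of these margin losses, plus a numerical check of the upper-bound property for the logistic loss, where $c = \ln 2$ works since $\ell_{\log}(z) \geq \ln 2$ whenever $z \leq 0$:

    import numpy as np

    def loss_zero_one(z):
        return (z <= 0).astype(float)        # 1[z <= 0]

    def loss_squared(z):
        return (1.0 - z) ** 2                # squared loss in margin form

    def loss_logistic(z):
        return np.log1p(np.exp(-z))          # ln(1 + exp(-z))

    z = np.linspace(-5, 5, 1001)
    print(np.all(loss_logistic(z) >= np.log(2) * loss_zero_one(z)))   # True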

  21. Squared and logistic losses on linearly separable data I.
  [Figures: loss contours and decision boundaries on linearly separable data; left: logistic loss, right: squared loss.]

  22. Squared and logistic losses on linearly separable data II.
  [Figures: loss contours and decision boundaries on a second linearly separable data set; left: logistic loss, right: squared loss.]

  23. Logistic risk and separation. If there exists a perfect linear separator, empirical logistic risk minimization should find it.

  24. Logistic risk and separation. If there exists a perfect linear separator, empirical logistic risk minimization should find it.
  Theorem. If there exists $\bar{w}$ with $y_i \bar{w}^\top x_i > 0$ for all $i$, then every $w$ with $\widehat{R}_{\log}(w) < \ln(2)/(2n) + \inf_{v} \widehat{R}_{\log}(v)$ also satisfies $y_i w^\top x_i > 0$ for all $i$.

  25. Logistic risk and separation. If there exists a perfect linear separator, empirical logistic risk minimization should find it.
  Theorem. If there exists $\bar{w}$ with $y_i \bar{w}^\top x_i > 0$ for all $i$, then every $w$ with $\widehat{R}_{\log}(w) < \ln(2)/(2n) + \inf_{v} \widehat{R}_{\log}(v)$ also satisfies $y_i w^\top x_i > 0$ for all $i$.
  Proof.
  Step 1: low risk implies few mistakes. For any $w$ with $y_j w^\top x_j \leq 0$ for some $j$,
  $\widehat{R}_{\log}(w) \geq \frac{1}{n} \ln(1 + \exp(-y_j w^\top x_j)) \geq \frac{\ln(2)}{n}.$
  By the contrapositive, any $w$ with $\widehat{R}_{\log}(w) < \ln(2)/n$ makes no mistakes.
  Step 2: $\inf_{v} \widehat{R}_{\log}(v) = 0$. Note:
  $0 \leq \inf_{v} \widehat{R}_{\log}(v) \leq \inf_{r > 0} \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-r \, y_i \bar{w}^\top x_i)) = 0.$
  This completes the proof.
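  Step 2 can also be checked numerically: on separable data, scaling any separator $\bar{w}$ by a growing factor $r$ drives the empirical logistic risk toward 0. A hedged sketch with a tiny made-up separable data set:

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
    y = np.array([+1, +1, -1, -1])
    w_bar = np.array([1.0, 1.0])          # separates: y_i * w_bar^T x_i > 0 for all i

    def risk_log(w):
        return np.mean(np.log1p(np.exp(-y * (X @ w))))

    for r in [1, 10, 100]:
        print(r, risk_log(r * w_bar))     # decreases toward 0 as r grows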

  26. 3. Minimizing the empirical logistic risk

  27. Least squares and logistic ERM.
  Least squares:
  ◮ Take the gradient of $\|Aw - b\|^2$, set it to 0; obtain the normal equations $A^\top A w = A^\top b$.
  ◮ One choice is the minimum norm solution $A^{+} b$.

  28. Least squares and logistic ERM.
  Least squares:
  ◮ Take the gradient of $\|Aw - b\|^2$, set it to 0; obtain the normal equations $A^\top A w = A^\top b$.
  ◮ One choice is the minimum norm solution $A^{+} b$.
  Logistic loss:
  ◮ Take the gradient of $\widehat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i w^\top x_i))$ and set it to 0 ???

  29. Least squares and logistic ERM.
  Least squares:
  ◮ Take the gradient of $\|Aw - b\|^2$, set it to 0; obtain the normal equations $A^\top A w = A^\top b$.
  ◮ One choice is the minimum norm solution $A^{+} b$.
  Logistic loss:
  ◮ Take the gradient of $\widehat{R}_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i w^\top x_i))$ and set it to 0 ???
  Remark. Is $A^{+} b$ a "closed form expression"?
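  For contrast, the least-squares side really is one line: the minimum norm solution $A^{+} b$ via the pseudoinverse. Setting the logistic gradient to zero gives a nonlinear system with no comparable closed form, which is why the lecture turns to iterative methods next. A sketch with hypothetical A and b:

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # hypothetical design matrix
    b = np.array([1.0, 2.0, 2.5])

    w_ls = np.linalg.pinv(A) @ b        # minimum norm least squares solution A^+ b
    print(w_ls)
    # No analogous one-liner exists for the stationarity condition of the
    # empirical logistic risk; it must be solved iteratively.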

  30. Decreasing $\widehat{R}$. We need to move down the contours of $\widehat{R}_{\log}$:
  [Figure: contour plot of the empirical logistic risk.]

  31. Gradient descent. Given a function $F : \mathbb{R}^d \to \mathbb{R}$, gradient descent is the iteration
  $w_{i+1} := w_i - \eta_i \nabla_w F(w_i),$
  where $w_0$ is given, and $\eta_i$ is a learning rate / step size.
  [Figure: contour plot of the empirical logistic risk.]
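  Putting the pieces together, a minimal gradient descent on the empirical logistic risk; the data, constant step size, and iteration count are all hypothetical choices, not taken from the slides.

    import numpy as np

    def grad_risk_log(w, X, y):
        """Gradient of (1/n) * sum_i ln(1 + exp(-y_i w^T x_i))."""
        margins = y * (X @ w)                    # y_i * w^T x_i
        coeffs = -y / (1.0 + np.exp(margins))    # y_i * d/dz ln(1+exp(-z)) at z = margins[i]
        return (X * coeffs[:, None]).mean(axis=0)

    # Hypothetical separable data and settings.
    X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
    y = np.array([+1, +1, -1, -1])
    w = np.zeros(2)
    eta = 1.0                                    # constant learning rate

    for _ in range(200):
        w = w - eta * grad_risk_log(w, X, y)     # w_{i+1} = w_i - eta * grad F(w_i)

    print(w, np.sign(X @ w) == y)                # learned w classifies every point correctly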
