New Lagrangian

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^{m} ξ_i   (10)
                   − Σ_{i=1}^{m} α_i [y_i (w · x_i + b) − 1 + ξ_i]   (11)
                   − Σ_{i=1}^{m} β_i ξ_i   (12)

Taking the gradients (∇_w L, ∇_b L, ∇_{ξ_i} L) and setting them to zero gives us

w = Σ_{i=1}^{m} α_i y_i x_i   (13),    Σ_{i=1}^{m} α_i y_i = 0   (14),    α_i + β_i = C   (15)
Simplifying the dual objective

Substituting the stationarity conditions

w = Σ_{i=1}^{m} α_i y_i x_i,    Σ_{i=1}^{m} α_i y_i = 0,    α_i + β_i = C

back into the Lagrangian

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^{m} ξ_i − Σ_{i=1}^{m} α_i [y_i (w · x_i + b) − 1 + ξ_i] − Σ_{i=1}^{m} β_i ξ_i

eliminates w, b, ξ, and β.
Dual problem

max_α   Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j (x_j · x_i)
s.t.    C ≥ α_i ≥ 0,  i ∈ [1, m]
        Σ_i α_i y_i = 0
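As a concrete reference, here is a minimal numpy sketch that evaluates this dual objective for a candidate α. The function name and array layout are illustrative assumptions, not from the slides:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """Soft-margin SVM dual: sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j).

    X: (m, d) matrix of training points; y: (m,) vector of +/-1 labels.
    """
    v = alpha * y        # elementwise alpha_i * y_i
    K = X @ X.T          # Gram matrix of inner products x_i . x_j
    return alpha.sum() - 0.5 * v @ K @ v
```

Note the constraints C ≥ α_i ≥ 0 and Σ_i α_i y_i = 0 are not enforced here; they are exactly what makes the maximization nontrivial.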
Karush-Kuhn-Tucker (KKT) conditions

Primal and dual feasibility
y_i (w · x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   C ≥ α_i ≥ 0,   β_i ≥ 0   (16)

Stationarity
w = Σ_{i=1}^{m} α_i y_i x_i,   Σ_{i=1}^{m} α_i y_i = 0,   α_i + β_i = C   (17)

Complementary slackness
α_i [y_i (w · x_i + b) − 1 + ξ_i] = 0,   β_i ξ_i = 0   (18)
More on Complementary Slackness

α_i [y_i (w · x_i + b) − 1 + ξ_i] = 0,   β_i ξ_i = 0   (19)

• x_i satisfies the margin, y_i (w · x_i + b) > 1  ⇒  α_i = 0
• x_i does not satisfy the margin, y_i (w · x_i + b) < 1  ⇒  α_i = C
• x_i is on the margin, y_i (w · x_i + b) = 1  ⇒  0 ≤ α_i ≤ C
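These three cases are exactly what an SMO-style solver checks when looking for KKT violations. A hedged numpy sketch (all names are mine) that reports each point's case for a trained (w, b, α):

```python
import numpy as np

def kkt_cases(alpha, X, y, w, b, C, tol=1e-6):
    """Report each point's complementary-slackness case (a sketch)."""
    margins = y * (X @ w + b)            # functional margins y_i (w . x_i + b)
    for i, (a, m) in enumerate(zip(alpha, margins)):
        if a < tol:                      # alpha_i = 0   -> expect margin >= 1
            ok = m >= 1 - tol
        elif a > C - tol:                # alpha_i = C   -> expect margin <= 1
            ok = m <= 1 + tol
        else:                            # 0 < alpha_i < C -> expect margin = 1
            ok = abs(m - 1) <= tol
        print(f"point {i}: alpha={a:.3f}, margin={m:.3f}, KKT {'ok' if ok else 'violated'}")
```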
Outline

Duality
Slack variables
Sequential Minimal Optimization
Recap
Sequential Minimal Optimization: Trivia

• Invented by John Platt in 1998 at Microsoft Research
• Called "Minimal" because it solves the smallest possible sub-problems (two variables at a time)
Dual problem

max_α   Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j (x_j · x_i)
s.t.    C ≥ α_i ≥ 0,  i ∈ [1, m]
        Σ_i α_i y_i = 0
Brief Interlude: Coordinate Ascent

max_α   Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j (x_j · x_i)
s.t.    C ≥ α_i ≥ 0,  i ∈ [1, m]
        Σ_i α_i y_i = 0

Idea: loop over each training example and change α_i alone to maximize the function above.

Although coordinate ascent works well for many problems, here we have the constraint Σ_i α_i y_i = 0, which leaves no room for any single α_i to move on its own (see the one-line derivation below).
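Concretely, fixing every multiplier except α_i pins it completely, which is why SMO must update two multipliers at a time:

Σ_{k=1}^{m} α_k y_k = 0   ⇒   α_i y_i = − Σ_{k≠i} α_k y_k   ⇒   α_i = − y_i Σ_{k≠i} α_k y_k   (using y_i² = 1),

so single-coordinate ascent has no feasible direction in which to move.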
Outline for SVM Optimization (SMO)

1. Select two examples i, j
2. Update α_j, α_i to maximize the dual objective above
Karush-Kuhn-Tucker (KKT) conditions (recap)

Primal and dual feasibility
y_i (w · x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   C ≥ α_i ≥ 0,   β_i ≥ 0   (20)

Stationarity
w = Σ_{i=1}^{m} α_i y_i x_i,   Σ_{i=1}^{m} α_i y_i = 0,   α_i + β_i = C   (21)

Complementary slackness
α_i [y_i (w · x_i + b) − 1 + ξ_i] = 0,   β_i ξ_i = 0   (22)
Outline for SVM Optimization (SMO)

Since Σ_k α_k y_k = 0 must continue to hold while every other multiplier stays fixed, the chosen pair must conserve

y_i α_i + y_j α_j = y_i α_i^{old} + y_j α_j^{old} = γ
Step 2: Optimize α_j

1. Compute upper (H) and lower (L) bounds that ensure 0 ≤ α_j ≤ C.

If y_i ≠ y_j:
    L = max(0, α_j − α_i)   (23)
    H = min(C, C + α_j − α_i)   (24)

If y_i = y_j:
    L = max(0, α_i + α_j − C)   (25)
    H = min(C, α_j + α_i)   (26)

The two cases arise because the update for α_i depends on y_i y_j (the sign matters); a code sketch follows.
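A minimal sketch of the bound computation, following the two cases above (function and variable names are illustrative):

```python
def alpha_bounds(alpha_i, alpha_j, y_i, y_j, C):
    """Box bounds [L, H] for the new alpha_j."""
    if y_i != y_j:
        L = max(0.0, alpha_j - alpha_i)
        H = min(C, C + alpha_j - alpha_i)
    else:
        L = max(0.0, alpha_i + alpha_j - C)
        H = min(C, alpha_i + alpha_j)
    return L, H
```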
Step 2: Optimize α_j (continued)

Compute the errors for i and j:
E_k ≡ f(x_k) − y_k   (27)
η = 2 x_i · x_j − x_i · x_i − x_j · x_j   (28)

Then the new value for α_j is
α_j* = α_j^{old} − y_j (E_i − E_j) / η   (29)
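In numpy this step might look as follows (a sketch: f stands for the current decision function f(x) = w · x + b, and the final clip to [L, H] applies the step-1 bounds, which exist precisely to keep the new α_j feasible):

```python
import numpy as np

def update_alpha_j(f, x_i, x_j, y_i, y_j, alpha_j, L, H):
    E_i = f(x_i) - y_i                               # E_k = f(x_k) - y_k
    E_j = f(x_j) - y_j
    eta = 2 * x_i @ x_j - x_i @ x_i - x_j @ x_j      # curvature along the constraint line
    alpha_j_new = alpha_j - y_j * (E_i - E_j) / eta  # unconstrained optimum
    return float(np.clip(alpha_j_new, L, H))         # project back into [L, H]
```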
Step 3: Optimize α_i

Set α_i:
α_i* = α_i^{old} + y_i y_j (α_j^{old} − α_j*)   (30)

This balances out the move that we made for α_j.
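The corresponding one-liner (again a sketch), which keeps y_i α_i + y_j α_j at its old value γ:

```python
def update_alpha_i(alpha_i, alpha_j_old, alpha_j_new, y_i, y_j):
    # move alpha_i so that y_i*alpha_i + y_j*alpha_j is conserved
    return alpha_i + y_i * y_j * (alpha_j_old - alpha_j_new)
```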
Overall algorithm

Repeat until the KKT conditions are met:
    Iterate over i = {1, ..., m}
        Choose j randomly from the m − 1 other options
        Update α_i, α_j
Find w, b based on the stationarity conditions

(The pieces above are assembled into a runnable sketch below.)
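Here is a compact end-to-end sketch under the slides' assumptions: linear kernel, random partner j, the simple "no changes in a full sweep" stopping rule, and the slides' single-sided bias update (Platt's full SMO uses choice heuristics and averages two bias candidates). All names are mine, not an official implementation:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, max_passes=5, rng=None):
    """Simplified SMO: sweep over i, pick a random partner j, update the pair."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(y)
    alpha = np.zeros(m)
    b = 0.0
    f = lambda x: (alpha * y) @ (X @ x) + b        # f(x) = sum_i alpha_i y_i (x_i . x) + b

    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(X[i]) - y[i]
            # skip i unless it violates the KKT conditions (within tol)
            if not ((y[i] * E_i < -tol and alpha[i] < C) or
                    (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            j = int(rng.choice([k for k in range(m) if k != i]))
            E_j = f(X[j]) - y[j]
            if y[i] != y[j]:                       # bounds ensuring 0 <= alpha_j <= C
                L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
            else:
                L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
            eta = 2 * X[i] @ X[j] - X[i] @ X[i] - X[j] @ X[j]
            if L == H or eta >= 0:                 # degenerate pair: skip it
                continue
            a_i_old, a_j_old = alpha[i], alpha[j]
            alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
            alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])
            # bias update as on the slides (full SMO averages two candidates)
            b = b - E_i - y[i] * (alpha[i] - a_i_old) * (X[i] @ X[i]) \
                        - y[j] * (alpha[j] - a_j_old) * (X[i] @ X[j])
            changed += 1
        passes = passes + 1 if changed == 0 else 0 # stop after clean sweeps
    w = (alpha * y) @ X                            # stationarity: w = sum_i alpha_i y_i x_i
    return w, b, alpha
```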
Iterations / Details

• What if i doesn’t violate the KKT conditions? Skip it!
• What if η ≥ 0? Skip it! (This should not happen except under numerical instability.)
• When do we stop? When we can go through all the α’s without changing anything
SMO Algorithm

Negative      Positive
(−2, −3)      (−2, 2)
(0, −1)       (0, 4)
(2, −3)       (2, 1)

[Figure: the six points in the plane; the labels used below index the positives as 0–2 and the negatives as 3–5, with x_0 = (−2, 2) and x_4 = (0, −1).]

• Initially, all alphas are zero: α = ⟨0, 0, 0, 0, 0, 0⟩
• Intercept b is also zero
• Capacity C = π
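For readers following along in code, here is the toy problem in numpy. The index order is not spelled out in the text; the assignment below is inferred from the figure and from the inner products used on the next slides (x_0 · x_0 = 8, x_4 · x_4 = 1, x_0 · x_4 = −2):

```python
import numpy as np

X = np.array([[-2.0,  2.0],   # 0: positive
              [ 0.0,  4.0],   # 1: positive
              [ 2.0,  1.0],   # 2: positive
              [-2.0, -3.0],   # 3: negative
              [ 0.0, -1.0],   # 4: negative
              [ 2.0, -3.0]])  # 5: negative
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha = np.zeros(6)           # all alphas start at zero
b, C = 0.0, np.pi             # intercept zero; capacity C = pi
```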
SMO Optimization for i = 0, j = 4: Predictions and Step

• Prediction: f(x_0) = 0
• Prediction: f(x_4) = 0
• Error: E_0 = −1
• Error: E_4 = +1

η = 2⟨x_0, x_4⟩ − ⟨x_0, x_0⟩ − ⟨x_4, x_4⟩ = 2 · (−2) − 8 − 1 = −13
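Continuing the toy setup above, these numbers check out in a few lines:

```python
i, j = 0, 4
f = lambda x: (alpha * y) @ (X @ x) + b            # all alphas zero, so f(x) = 0
E_i, E_j = f(X[i]) - y[i], f(X[j]) - y[j]          # E_0 = -1, E_4 = +1
eta = 2 * X[i] @ X[j] - X[i] @ X[i] - X[j] @ X[j]  # 2*(-2) - 8 - 1 = -13
```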
SMO Optimization for i = 0, j = 4: Bounds

• Lower and upper bounds for α_j (the y_i ≠ y_j case applies, since y_0 ≠ y_4):

L = max(0, α_j − α_i) = 0   (31)
H = min(C, C + α_j − α_i) = π   (32)
SMO Optimization for i = 0, j = 4: α update

New value for α_j:
α_j* = α_j − y_j (E_i − E_j) / η = −2/η = 2/13   (33)

New value for α_i:
α_i* = α_i + y_i y_j (α_j^{old} − α_j*) = α_j* = 2/13   (34)
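In code, continuing the snippet above (the bounds L = 0, H = π come from the previous slide):

```python
L, H = 0.0, np.pi
alpha_j_new = float(np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H))  # 2/13
alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)          # also 2/13
alpha[i], alpha[j] = alpha_i_new, alpha_j_new
```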
Margin

[Figure: the decision boundary and margin after this update.]
Find weight vector and bias

• Weight vector
w = Σ_i α_i y_i x_i = (2/13) · (−2, 2) − (2/13) · (0, −1) = (−4/13, 6/13)   (35)

• Bias
b = b^{old} − E_i − y_i (α_i* − α_i^{old}) x_i · x_i − y_j (α_j* − α_j^{old}) x_i · x_j
  = 1 − (2/13) · 8 + (2/13) · (−2) = −0.54   (36)
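The stationarity condition and the slides' bias formula, still continuing the same snippet:

```python
w = (alpha * y) @ X          # (2/13)*(-2, 2) - (2/13)*(0, -1) = (-4/13, 6/13)
b = b - E_i - y[i] * (alpha_i_new - 0.0) * (X[i] @ X[i]) \
          - y[j] * (alpha_j_new - 0.0) * (X[i] @ X[j])
# b = 0 + 1 - (2/13)*8 - 4/13 = -7/13, about -0.54
```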
SMO Optimization for i = 2, j = 4

Let’s skip the boring stuff:
• E_2 = −1.69
• E_4 = 0.00
• η = −8
• α_4 = α_4^{old} − y_j (E_i − E_j) / η = 0.15 + (−1.69)/(−8) = 0.37
• α_2 = α_2^{old} + y_i y_j (α_4^{old} − α_4) = 0 − (0.15 − 0.37) ≈ 0.21
Margin

[Figure: the decision boundary and margin after the second update.]
Weight vector and bias

• Bias b = −0.12
• Weight vector
w = Σ_i α_i y_i x_i = (0.12, 0.88)   (38)
Another Iteration (i = 0, j = 2)

[Figure: the result of one more pair update.]
SMO Algorithm

• A convenient approach for solving all of our variants: vanilla, slack, and kernel SVMs
• The problem is convex
• Scales to large datasets (an SMO-style solver underlies scikit-learn; usage sketch below)
• What we didn’t do in the worked example:
  ◦ Check the KKT conditions
  ◦ Randomly choose indices
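For practical use, scikit-learn's SVC (built on libsvm, which implements an SMO-type solver) handles all of the above; a brief usage sketch on the toy data from earlier:

```python
from sklearn.svm import SVC

clf = SVC(C=1.0, kernel="linear")    # libsvm's SMO-style solver under the hood
clf.fit(X, y)                        # X, y as in the toy example above
print(clf.support_)                  # indices of the support vectors (nonzero alphas)
print(clf.dual_coef_)                # y_i * alpha_i for the support vectors
print(clf.coef_, clf.intercept_)     # w and b (linear kernel only)
```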
Outline

Duality
Slack variables
Sequential Minimal Optimization
Recap