
Applied Machine Learning: Perceptron and Support Vector Machines. Siamak Ravanbakhsh, COMP 551 (Winter 2020). Learning objectives: geometry of linear ...


  1. Perceptron: example (Iris dataset, linearly separable case). Note that the code does not check for convergence. (Figure: decision boundary at iteration 1.)

     import numpy as np

     def Perceptron(X, y, max_iters):
         N, D = X.shape
         w = np.random.rand(D)                    # random initial weights
         for t in range(max_iters):
             n = np.random.randint(N)             # pick a random training instance
             yh = np.sign(np.dot(X[n, :], w))     # predict its label
             if yh != y[n]:                       # on a mistake, move w toward the correct side
                 w = w + y[n] * X[n, :]
         return w

  2. Perceptron: example (Iris dataset, linearly separable case); same code as above, with no convergence check. (Figure: initial decision boundary $w^\top x = 0$, iteration 1.)

  3. Perceptron: example (Iris dataset, linearly separable case); same code as above. (Figure: decision boundary at iteration 10.)

  4. Perceptron: example (Iris dataset, linearly separable case); same code as above. (Figure: decision boundary at iteration 10.)

  5. Perceptron: example (Iris dataset, linearly separable case); same code as above, shown at iteration 10. Observations: after finding a linear separator, no further updates happen; the final boundary depends on the order of instances (different from all previous methods).

  6. Perceptron: example; the same code as above (still with no convergence check).

  7. Perceptron: example (Iris dataset, NOT linearly separable case); same code as above.

  8. Perceptron: example (Iris dataset, NOT linearly separable case); same code as above. The algorithm does not converge: there is always a wrong prediction, so the weights keep being updated. (A variant with an explicit convergence check is sketched below.)
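As an aside, a minimal sketch (not from the slides) of how a convergence check could be added to the Perceptron above. The helper name perceptron_with_convergence and its arguments are made up for this illustration; it assumes labels in {-1, +1} and that X already contains a bias column if one is wanted, and it stops as soon as a full pass over the data makes no update, which can only happen when the data are linearly separable.

     import numpy as np

     def perceptron_with_convergence(X, y, max_epochs=1000):
         N, D = X.shape
         w = np.zeros(D)
         for epoch in range(max_epochs):
             updated = False
             for n in np.random.permutation(N):
                 if y[n] * np.dot(X[n, :], w) <= 0:   # mistake (or exactly on the boundary)
                     w = w + y[n] * X[n, :]
                     updated = True
             if not updated:                          # a full clean pass: a separator was found
                 break
         return w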

  9. Perceptron: issues. Cyclic updates if the data is not linearly separable; try to make the data separable using additional features? The data may be inherently noisy.

  10. Perceptron: issues. Cyclic updates if the data is not linearly separable (try to make the data separable using additional features? the data may be inherently noisy). Even if the data is linearly separable, convergence could take many iterations.

  11. Perceptron: issues. Cyclic updates if the data is not linearly separable (try to make the data separable using additional features? the data may be inherently noisy). Even if the data is linearly separable, convergence could take many iterations, and the decision boundary may be suboptimal.

  12. Perceptron: issues. Cyclic updates if the data is not linearly separable; even if linearly separable, convergence could take many iterations and the decision boundary may be suboptimal. Let's fix this last problem first, assuming linear separability.

  13. Margin: the margin of a classifier (assuming correct classification) is the distance of the closest point to the decision boundary; this is positive for correctly classified points.

  14. Margin: the margin of a classifier (assuming correct classification) is the distance of the closest point to the decision boundary (positive for correctly classified points). The signed distance of a point to the boundary is $\frac{1}{\|w\|_2}(w^\top x^{(n)} + w_0)$.

  15. Margin: the signed distance of a point to the boundary is $\frac{1}{\|w\|_2}(w^\top x^{(n)} + w_0)$; correcting for the sign gives the margin $\frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0)$, which is positive for correctly classified points. (A small numeric version follows below.)
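A small numeric counterpart of these formulas; a sketch, not from the slides. The helper name margin is made up, and it assumes labels y in {-1, +1} and a fixed boundary given by (w, w0).

     import numpy as np

     def margin(X, y, w, w0):
         # y times the signed distance: positive only for correctly classified points
         dist = y * (np.dot(X, w) + w0) / np.linalg.norm(w)
         return dist.min()   # the classifier's margin is the smallest of these distances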

  16. Max margin classifier: find the decision boundary with maximum margin. (Figure: a separating boundary whose margin is not maximal.)

  17. Max margin classifier: find the decision boundary with maximum margin. (Figure: a boundary whose margin is not maximal next to the maximum-margin boundary.)

  18. Max margin classifier: find the decision boundary with maximum margin, $\max_{w, w_0} M$ subject to $\frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0) \ge M \;\;\forall n$.

  19. Max margin classifier: same objective as above. Only the points $x^{(n)}$ with $\frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0) = M$ matter in finding the boundary.

  20. Max margin classifier: only the points with $\frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0) = M$ matter in finding the boundary; these are called support vectors.

  21. Max margin classifier: only the points with $\frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0) = M$ matter in finding the boundary; these are called support vectors, and the max-margin classifier is called a support vector machine (SVM). (A library-based illustration follows below.)
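As an illustration (not part of the slides), scikit-learn's linear SVC can be used to find the max-margin boundary and read off the support vectors; a very large C approximates the hard-margin case on separable data. The toy data below is made up for the example.

     import numpy as np
     from sklearn.svm import SVC

     # Tiny separable toy data: two well-separated clusters, labels in {-1, +1}.
     rng = np.random.default_rng(0)
     X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
                    rng.normal(+2.0, 0.5, size=(20, 2))])
     y = np.array([-1] * 20 + [+1] * 20)

     # A very large C approximates the hard-margin SVM on separable data.
     clf = SVC(kernel='linear', C=1e6).fit(X, y)
     print(clf.coef_, clf.intercept_)   # w and w0 of the max-margin boundary
     print(clf.support_vectors_)        # the training points that attain the margin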

  22. Support Vector Machine: find the decision boundary with maximum margin, $\max_{w, w_0} M$ subject to $\frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0) \ge M \;\;\forall n$.

  23. Support Vector Machine: same objective as above. Observation: suppose $w^*, w_0^*$ is an optimal solution.

  24. Support Vector Machine: observation: if $w^*, w_0^*$ is an optimal solution, then $cw^*, cw_0^*$ (for any $c > 0$) is also optimal, with the same margin.

  25. Support Vector Machine: since any positive rescaling of an optimal $w^*, w_0^*$ is also optimal (same margin), fix the norm $\|w\|_2 = \frac{1}{M}$ to avoid this ambiguity.

  26. Support Vector Machine: find the decision boundary with maximum margin (same objective as above).

  27. Support Vector Machine: same objective as above. (Figure: the margin of width $\frac{1}{\|w\|_2}$ on either side of the boundary.)

  28. Support Vector Machine: fixing $\|w\|_2 = \frac{1}{M}$, the problem becomes $\max_{w, w_0} \frac{1}{\|w\|_2}$ subject to $\frac{1}{\|w\|_2} \le \frac{1}{\|w\|_2}\, y^{(n)}(w^\top x^{(n)} + w_0) \;\;\forall n$.

  29. Support Vector Machine: simplifying, we get the hard-margin SVM objective $\min_{w, w_0} \frac{1}{2}\|w\|_2^2$ subject to $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1 \;\;\forall n$. (A direct numerical solution is sketched below.)
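For concreteness, a sketch of solving this constrained problem with a generic optimizer; it is not from the slides, the helper name hard_margin_svm is made up, it assumes separable data with labels in {-1, +1}, and in practice a dedicated QP solver (or the SVC call above) would be the usual choice.

     import numpy as np
     from scipy.optimize import minimize

     def hard_margin_svm(X, y):
         # Solve  min 1/2 ||w||^2  s.t.  y^(n) (w^T x^(n) + w0) >= 1  for all n,
         # over theta = [w, w0].
         N, D = X.shape
         objective = lambda theta: 0.5 * np.dot(theta[:D], theta[:D])
         constraints = [{'type': 'ineq',   # 'ineq' means fun(theta) >= 0
                         'fun': lambda theta, n=n: y[n] * (np.dot(X[n], theta[:D]) + theta[D]) - 1}
                        for n in range(N)]
         res = minimize(objective, np.zeros(D + 1), method='SLSQP', constraints=constraints)
         return res.x[:D], res.x[D]        # w, w0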

  30. Perceptron: issues. Cyclic updates if the data is not linearly separable (try to make the data separable using additional features? the data may be inherently noisy); even if linearly separable, convergence could take many iterations and the decision boundary may be suboptimal.

  31. Perceptron: issues. As above; the suboptimal decision boundary is addressed by maximizing the hard margin.

  32. Perceptron: issues. The suboptimal boundary is addressed by maximizing the hard margin; now let's fix the non-separable (possibly noisy) case by maximizing a soft margin instead.

  33. Soft margin constraints: allow points inside the margin and on the wrong side, but penalize them, instead of the hard constraint $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1 \;\;\forall n$. (Figure: the margin of width $\frac{1}{\|w\|_2}$ on either side of the boundary, with the slack $\xi^{(n)}$ marked for a violating point.)

  34. Soft margin constraints: instead of the hard constraint, use $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1 - \xi^{(n)} \;\;\forall n$.

  35. Soft margin constraints: use $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1 - \xi^{(n)}$ with slack variables $\xi^{(n)} \ge 0$ (one for each n).

  36. Soft margin constraints: $\xi^{(n)} = 0$ if the point satisfies the original margin constraint.

  37. Soft margin constraints: $\xi^{(n)} = 0$ if the point satisfies the original margin constraint; $0 < \xi^{(n)} < 1$ if it is correctly classified but inside the margin.

  38. Soft margin constraints: $\xi^{(n)} = 0$ if the point satisfies the original margin constraint; $0 < \xi^{(n)} < 1$ if correctly classified but inside the margin; $\xi^{(n)} > 1$ if incorrectly classified.

  39. Soft margin constraints: the soft-margin objective is $\min_{w, w_0} \frac{1}{2}\|w\|_2^2 + \gamma \sum_n \xi^{(n)}$ subject to $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1 - \xi^{(n)}$ and $\xi^{(n)} \ge 0$ for all n.

  40. Soft margin constraints: in the soft-margin objective, $\gamma$ is a hyper-parameter that defines the importance of the constraints; for very large $\gamma$ this becomes similar to the hard-margin SVM.

  41. Hinge loss: it would be nice to turn this into an unconstrained optimization. Recall the soft-margin problem: $\min_{w, w_0} \frac{1}{2}\|w\|_2^2 + \gamma \sum_n \xi^{(n)}$ subject to $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1 - \xi^{(n)}$, $\xi^{(n)} \ge 0 \;\;\forall n$.

  42. Hinge loss: if a point satisfies the margin, $y^{(n)}(w^\top x^{(n)} + w_0) \ge 1$, the minimum slack is $\xi^{(n)} = 0$.

  43. Hinge loss: otherwise $y^{(n)}(w^\top x^{(n)} + w_0) < 1$, and the smallest slack is $\xi^{(n)} = 1 - y^{(n)}(w^\top x^{(n)} + w_0)$.

  44. Hinge loss: so the optimal slack satisfying both cases is $\xi^{(n)} = \max(0,\, 1 - y^{(n)}(w^\top x^{(n)} + w_0))$. (A one-line numeric version follows below.)
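A one-line numeric version of this formula; a sketch, not from the slides. The helper name slacks is made up, and it assumes labels in {-1, +1} and a fixed boundary (w, w0).

     import numpy as np

     def slacks(X, y, w, w0):
         # Optimal slack per point: 0 if the margin is satisfied,
         # between 0 and 1 if inside the margin, greater than 1 if misclassified.
         return np.maximum(0.0, 1.0 - y * (np.dot(X, w) + w0))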

  45. Hinge loss: turning the soft-margin problem into an unconstrained optimization (same objective and constraints as above).

  46. Hinge loss: replace the slack by its optimal value $\xi^{(n)} = \max(0,\, 1 - y^{(n)}(w^\top x^{(n)} + w_0))$.

  47. Hinge loss: substituting the optimal slack, we get $\min_{w, w_0} \frac{1}{2}\|w\|_2^2 + \gamma \sum_n \max(0,\, 1 - y^{(n)}(w^\top x^{(n)} + w_0))$.

  48. Hinge loss: this is the same as $\min_{w, w_0} \sum_n \max(0,\, 1 - y^{(n)}(w^\top x^{(n)} + w_0)) + \frac{1}{2\gamma}\|w\|_2^2$.

  49. Hinge loss: the per-example term $L_{\text{hinge}}(y, \hat{y}) = \max(0,\, 1 - y\hat{y})$ is called the hinge loss.

  50. Hinge loss: soft-margin SVM is therefore L2-regularized hinge loss minimization.

  51. Perceptron vs. SVM. Perceptron: the per-example cost is zero if the point is correctly classified, and $-y^{(n)}(w^\top x^{(n)} + w_0)$ otherwise, minimized over $w, w_0$.

  52. Perceptron vs. SVM. The Perceptron cost can be written as $\sum_n \max(0,\, -y^{(n)}(w^\top x^{(n)} + w_0))$.

  53. Perceptron vs. SVM. Perceptron: $\sum_n \max(0,\, -y^{(n)}(w^\top x^{(n)} + w_0))$. SVM: $\sum_n \max(0,\, 1 - y^{(n)}(w^\top x^{(n)} + w_0)) + \frac{\lambda}{2}\|w\|_2^2$.

  54. Perceptron vs. SVM. The "1" inside the max is the difference (plus the regularization term).

  55. Perceptron vs. SVM. The Perceptron finds some linear decision boundary if one exists; for small lambda, the SVM finds the max-margin decision boundary.

  56. Perceptron vs. SVM. Optimization: the Perceptron uses stochastic gradient descent with a fixed learning rate; for the SVM, depending on the formulation, we have many choices. (The two per-example losses are sketched below.)
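To make the comparison concrete, a small sketch (not from the slides) of the two per-example losses as a function of the score $z = w^\top x + w_0$; the helper names are made up, and labels are assumed to be in {-1, +1}.

     import numpy as np

     def perceptron_loss(y, z):
         # zero as soon as the point is on the correct side (y*z >= 0)
         return np.maximum(0, -y * z)

     def hinge_loss(y, z):
         # still positive until the point clears the margin (y*z >= 1)
         return np.maximum(0, 1 - y * z)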

  57. Perceptron vs. SVM. Cost: $J(w) = \sum_n \max(0,\, 1 - y^{(n)} w^\top x^{(n)}) + \frac{\lambda}{2}\|w\|_2^2$, where the bias is now included in $w$.

  58. Perceptron vs. SVM. Cost as above; check that this cost function is convex in $w$ (exercise).

  59. Perceptron vs. SVM. The same cost in code (the hinge term is averaged over the data here, and the bias, stored as the last entry of w, is not regularized); check that the cost function is convex in w.

     def cost(X, y, w, lamb=1e-3):
         z = np.dot(X, w)                # scores w^T x for all points
         J = np.mean(np.maximum(0, 1 - y*z)) + lamb * np.dot(w[:-1], w[:-1]) / 2
         return J

  60. Perceptron vs. SVM. Same cost and code as above; note that the hinge loss is not smooth (it is piecewise linear).

  61. Perceptron vs. SVM. The hinge loss is not smooth (piecewise linear); if we use "stochastic" sub-gradient descent, the update looks like the Perceptron's: if $y^{(n)}\hat{y}^{(n)} < 1$, minimize $-y^{(n)} w^\top x^{(n)} + \frac{\lambda}{2}\|w\|_2^2$; otherwise, do nothing.

  62. Perceptron vs. SVM. The (batch) sub-gradient of the cost:

     def subgradient(X, y, w, lamb):
         N, D = X.shape
         z = np.dot(X, w)
         violations = np.nonzero(z*y < 1)[0]                   # points that violate the margin
         grad = -np.dot(X[violations, :].T, y[violations]) / N
         grad[:-1] += lamb * w[:-1]                            # regularize everything except the bias
         return grad

  63. Perceptron vs. SVM. Same sub-gradient code as above. (A single-example, Perceptron-like update is sketched below.)
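A sketch (not from the slides) of the single-example update described two slides back; the helper name sgd_update and its parameters lr and lamb are made up, and it assumes, as in the code above, that the bias is the last entry of w and is not regularized, with x_n already carrying the corresponding constant feature.

     import numpy as np

     def sgd_update(x_n, y_n, w, lr, lamb):
         if y_n * np.dot(x_n, w) < 1:                  # margin violated: take a sub-gradient step
             g = -y_n * np.asarray(x_n, dtype=float)
             g[:-1] += lamb * w[:-1]                   # regularize all weights except the bias
             return w - lr * g
         return w                                      # margin satisfied: do nothing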

  64. Example: Iris dataset (D=2), linearly separable case.

     def SubGradientDescent(X, y, lr=1, eps=1e-18, max_iters=1000, lamb=1e-8):
         N, D = X.shape
         w = np.zeros(D)
         t = 0
         w_old = w + np.inf
         while np.linalg.norm(w - w_old) > eps and t < max_iters:   # stop when w barely moves
             g = subgradient(X, y, w, lamb=lamb)
             w_old = w
             w = w - lr*g/np.sqrt(t+1)                               # decaying step size
             t += 1
         return w

  65. Example: Iris dataset (D=2), linearly separable case; same code as above.

  66. Example: Iris dataset (D=2), linearly separable case; same code as above. (Figure: the max-margin boundary obtained with a small lambda, $\lambda = 10^{-8}$.)

  67. Example: Iris dataset (D=2), linearly separable case; same code as above. The max-margin boundary (small lambda, $\lambda = 10^{-8}$); compare it to the Perceptron's decision boundary.

  68. Example: Iris dataset (D=2), NOT linearly separable case ($\lambda = 10^{-8}$); the same method, with the stopping test on the sub-gradient norm instead.

     def SubGradientDescent(X, y, lr=1, eps=1e-18, max_iters=1000, lamb=1e-8):
         N, D = X.shape
         w = np.zeros(D)
         g = np.inf
         t = 0
         while np.linalg.norm(g) > eps and t < max_iters:   # stop when the sub-gradient vanishes
             g = subgradient(X, y, w, lamb=lamb)
             w = w - lr*g/np.sqrt(t+1)
             t += 1
         return w

  69. Example: Iris dataset (D=2), NOT linearly separable case; same code as above. (Figure: soft margins obtained with a small lambda, $\lambda = 10^{-8}$.)

  70. Example: Iris dataset (D=2), NOT linearly separable case; same code as above. The Perceptron does not converge on this data.

  71. SVM vs. logistic regression. Recall the simplified logistic regression cost for $y \in \{0, 1\}$: $J(w) = \sum_{n=1}^N y^{(n)} \log(1 + e^{-z^{(n)}}) + (1 - y^{(n)}) \log(1 + e^{z^{(n)}})$, where $z^{(n)} = w^\top x^{(n)}$ and $w$ includes the bias.

  72. SVM vs. logistic regression. For $y \in \{-1, +1\}$ we can write this as $J(w) = \sum_{n=1}^N \log(1 + e^{-y^{(n)} z^{(n)}}) + \frac{\lambda}{2}\|w\|_2^2$, where L2 regularization has also been added. (Figure: the losses as a function of $yz$.) (A small check of the two forms is sketched below.)
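A small sketch (not from the slides) of the two equivalent per-example forms: for y = 1 both give log(1 + e^{-z}), and for y = 0 in the first encoding (y = -1 in the second) both give log(1 + e^{z}). The helper names are made up for the illustration.

     import numpy as np

     def logistic_cost_01(z, y01):
         # labels in {0, 1}:  y log(1 + e^{-z}) + (1 - y) log(1 + e^{z})
         return y01 * np.log1p(np.exp(-z)) + (1 - y01) * np.log1p(np.exp(z))

     def logistic_cost_pm1(z, ypm1):
         # labels in {-1, +1}:  log(1 + e^{-y z}), the same quantity written compactly
         return np.log1p(np.exp(-ypm1 * z))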
