Learning, Linear Separability and Linear Programming

CS 573: Algorithms, Fall 2013. Lecture 22, November 12, 2013. Sariel (UIUC).

Labeling: given examples, e.g. a database of cars, we would like to determine which ...


Learning linear separation

1. Given red and blue points, how do we compute a separating line ℓ?
2. A line/plane/hyperplane is the zero set of a linear function.
3. Form: f(x) = ⟨a, x⟩ + b for all x ∈ ℝ^d, where a = (a_1, ..., a_d) ∈ ℝ^d, b ∈ ℝ, and ⟨a, x⟩ = Σ_i a_i x_i is the dot product of a and x.
4. Classification is done by computing the sign of f(x): sign(f(x)).
5. If sign(f(x)) is negative, x is not in the class; if positive, it is.
6. A set of training examples: S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℝ^d and y_i ∈ {-1, 1}, for i = 1, ..., n.
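As a quick illustration (not from the slides), here is this classifier in Python; the weight vector a and offset b below are made-up values:

```python
import numpy as np

def classify(a, b, x):
    # Label x by the sign of f(x) = <a, x> + b.
    return 1 if np.dot(a, x) + b >= 0 else -1

# Hypothetical 2-d example: the line x1 + 2*x2 - 1 = 0.
a, b = np.array([1.0, 2.0]), -1.0
print(classify(a, b, np.array([2.0, 3.0])))  # 1: positive side, in the class
print(classify(a, b, np.array([0.0, 0.0])))  # -1: negative side, not in it
```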

Classification...

1. A linear classifier h is a pair (w, b), where w ∈ ℝ^d and b ∈ ℝ.
2. The classification of x ∈ ℝ^d is sign(⟨w, x⟩ + b).
3. Given a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y.
4. Assume a linear classifier exists.
5. Given n labeled examples, how does one compute a linear classifier for these examples?
6. Use linear programming...
7. We are looking for (w, b) such that for every (x_i, y_i) we have sign(⟨w, x_i⟩ + b) = y_i; that is, ⟨w, x_i⟩ + b ≥ 0 if y_i = 1, and ⟨w, x_i⟩ + b ≤ 0 if y_i = -1.

Classification...

Equivalently, write x_i = (x_i^1, ..., x_i^d) ∈ ℝ^d for i = 1, ..., n, and w = (w^1, ..., w^d). Then we get the linear constraints

Σ_{k=1}^{d} w^k x_i^k + b ≥ 0 if y_i = 1, and Σ_{k=1}^{d} w^k x_i^k + b ≤ 0 if y_i = -1.

Thus, we get a set of linear constraints, one for each training example, and we need to solve the resulting linear program.
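The slides leave the solver unspecified; as a sketch, the feasibility LP can be set up with scipy.optimize.linprog. Here the strict separation is encoded as y_i(⟨w, x_i⟩ + b) ≥ 1, which is only a rescaling of (w, b); the function name fit_linear_classifier is ours:

```python
import numpy as np
from scipy.optimize import linprog

def fit_linear_classifier(X, y):
    # Feasibility LP: find z = (w_1, ..., w_d, b) with
    #   y_i * (<w, x_i> + b) >= 1   for every example i
    # (the margin 1 replaces the strict inequality; it only rescales (w, b)).
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])  # -y_i*(x_i, 1).z <= -1
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))  # variables are free, not >= 0
    if not res.success:
        return None  # infeasible: no separating hyperplane
    return res.x[:d], res.x[d]
```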

Linear programming for learning?

1. Stumbling block: linear programming is very sensitive to noise.
2. If some points are misclassified ⟹ no solution.
3. Instead, use an iterative algorithm that converges to the optimal solution if it exists...

Perceptron algorithm...

perceptron(S: a set of labeled examples)
    w_0 ← 0, k ← 0
    R = max_{(x, y) ∈ S} ‖x‖
    repeat
        for (x, y) ∈ S do
            if sign(⟨w_k, x⟩) ≠ y then
                w_{k+1} ← w_k + y·x
                k ← k + 1
    until no mistakes are made in the classification
    return w_k and k
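A direct Python transcription of the pseudocode above; the max_epochs guard is our addition (the loop never terminates on non-separable data), and there is no bias term, exactly as in the slides:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    # X: n x d array of examples, y: labels in {-1, +1}.
    # Returns the final weight vector w_k and the number of updates k.
    # Bias-free, as in the slides; append a constant-1 coordinate to each
    # x to learn an offset b as well.
    n, d = X.shape
    w, k = np.zeros(d), 0
    for _ in range(max_epochs):  # guard: never ends on non-separable data
        mistakes = 0
        for xi, yi in zip(X, y):
            if np.sign(np.dot(w, xi)) != yi:
                w = w + yi * xi  # the fix-up step w_{k+1} <- w_k + y*x
                k += 1
                mistakes += 1
        if mistakes == 0:
            return w, k
    return w, k  # may not have converged
```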

Perceptron algorithm

1. Why does the perceptron algorithm converge?
2. Assume it made a mistake on a sample (x, y) with y = 1. Then ⟨w_k, x⟩ < 0, and
   ⟨w_{k+1}, x⟩ = ⟨w_k + y·x, x⟩ = ⟨w_k, x⟩ + y⟨x, x⟩ = ⟨w_k, x⟩ + y‖x‖² > ⟨w_k, x⟩.
3. The algorithm is "walking" in the right direction:
4. the new value assigned to x by w_{k+1} is larger ("more positive") than the old value assigned to x by w_k.
5. After enough iterations of such fix-ups, the label of x would change...

Perceptron algorithm converges

Theorem. Let S be a training set of examples, and let R = max_{(x, y) ∈ S} ‖x‖. Suppose that there exists a vector w_opt with ‖w_opt‖ = 1, and a number γ > 0, such that y⟨w_opt, x⟩ ≥ γ for all (x, y) ∈ S. Then the number of mistakes made by the online perceptron algorithm on S is at most (R/γ)².
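A small numeric sanity check of the theorem, reusing the perceptron sketch above; the data, w_opt, and γ are made up, with points of margin below γ filtered out so the hypothesis holds:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up separable data: labels come from a unit vector w_opt, and points
# with margin below gamma are discarded.
w_opt, gamma = np.array([0.6, 0.8]), 0.5     # note ||w_opt|| = 1
X = rng.uniform(-3, 3, size=(500, 2))
X = X[np.abs(X @ w_opt) >= gamma]
y = np.sign(X @ w_opt)

w, k = perceptron(X, y)                      # perceptron() from the sketch above
R = np.max(np.linalg.norm(X, axis=1))
print(k, "updates; theorem's bound:", (R / gamma) ** 2)
```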

Claim by figure...

[Figure: two instances of radius R with separating line ℓ and optimal direction w_opt. The "hard" instance has a small margin γ, the "easy" one a larger margin γ′; the corresponding error bounds are (R/γ)² and (R/γ′)².]

Proof of Perceptron convergence...

1. Idea of proof: the perceptron weight vector converges to w_opt.
2. Distance between w_opt and the k-th update vector:
   α_k = ‖w_k − (R²/γ) w_opt‖².
3. Quantify the change between α_k and α_{k+1}.
4. Let (x, y) be the example being misclassified.

Proof of Perceptron convergence...

1. Let (x, y) be the example being misclassified (both are constants).
2. w_{k+1} ← w_k + y·x.
3. Then:
   α_{k+1} = ‖w_{k+1} − (R²/γ) w_opt‖² = ‖w_k + y·x − (R²/γ) w_opt‖²
           = ‖(w_k − (R²/γ) w_opt) + y·x‖²
           = ⟨(w_k − (R²/γ) w_opt) + y·x, (w_k − (R²/γ) w_opt) + y·x⟩
           = ⟨w_k − (R²/γ) w_opt, w_k − (R²/γ) w_opt⟩ + 2y⟨w_k − (R²/γ) w_opt, x⟩ + ⟨x, x⟩
           = α_k + 2y⟨w_k − (R²/γ) w_opt, x⟩ + ‖x‖²   (using y² = 1).

Proof of Perceptron convergence...

1. We proved: α_{k+1} = α_k + 2y⟨w_k − (R²/γ) w_opt, x⟩ + ‖x‖².
2. (x, y) is misclassified: sign(⟨w_k, x⟩) ≠ y
3. ⟹ sign(y⟨w_k, x⟩) = −1
4. ⟹ y⟨w_k, x⟩ < 0.
5. ‖x‖ ≤ R ⟹
   α_{k+1} ≤ α_k + R² + 2y⟨w_k, x⟩ − 2y⟨(R²/γ) w_opt, x⟩ ≤ α_k + R² − 2(R²/γ) y⟨w_opt, x⟩,
6. ... since 2y⟨w_k, x⟩ < 0.

Proof of Perceptron convergence...

1. Proved: α_{k+1} ≤ α_k + R² − 2(R²/γ) y⟨w_opt, x⟩.
2. sign(⟨w_opt, x⟩) = y.
3. By the margin assumption: y⟨w_opt, x⟩ ≥ γ for all (x, y) ∈ S.
4. Hence:
   α_{k+1} ≤ α_k + R² − 2(R²/γ) y⟨w_opt, x⟩
           ≤ α_k + R² − 2(R²/γ) γ
           = α_k + R² − 2R²
           = α_k − R².

Proof of Perceptron convergence...

1. We have: α_{k+1} ≤ α_k − R².
2. α_0 = ‖0 − (R²/γ) w_opt‖² = (R⁴/γ²) ‖w_opt‖² = R⁴/γ².
3. α_i ≥ 0 for all i.
4. Q: What is the maximum number of classification errors the algorithm can make?
5. ... it equals the number of updates,
6. ... and the number of updates is at most α_0 / R²...
7. A: at most R²/γ².
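The proof's potential argument can also be checked numerically: track α_k = ‖w_k − (R²/γ) w_opt‖² along a run and verify that each update decreases it by at least R². Again the data are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
w_opt, gamma = np.array([0.6, 0.8]), 0.5
X = rng.uniform(-3, 3, size=(300, 2))
X = X[np.abs(X @ w_opt) >= gamma]
y = np.sign(X @ w_opt)
R = np.max(np.linalg.norm(X, axis=1))

target = (R ** 2 / gamma) * w_opt            # the point w_k is drifting toward
w = np.zeros(2)
alpha = [np.dot(w - target, w - target)]     # alpha_0 = R^4 / gamma^2
updated = True
while updated:
    updated = False
    for xi, yi in zip(X, y):
        if np.sign(np.dot(w, xi)) != yi:
            w = w + yi * xi
            alpha.append(np.dot(w - target, w - target))
            updated = True
print(np.all(np.diff(alpha) <= -R ** 2 + 1e-9))  # each update drops alpha by >= R^2
```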

Concluding comment...

Any linear program can be written as the problem of separating red points from blue points. As such, the perceptron algorithm can be used to solve linear programs.

Learning a circle...

1. Given a set of red points and a set of blue points in the plane, we want to learn a circle σ that contains all the red points and none of the blue points.
2. Q: How do we compute the circle σ?
3. Lifting: ℓ: (x, y) → (x, y, x² + y²).
4. z(P) = { ℓ(x, y) = (x, y, x² + y²) | (x, y) ∈ P }.
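A sketch of the lifting map in Python; the red/blue sample (points inside vs. outside the unit circle) is made up for illustration:

```python
import numpy as np

def lift(P):
    # The lifting map l(x, y) = (x, y, x^2 + y^2), applied to each row of P.
    return np.hstack([P, (P ** 2).sum(axis=1, keepdims=True)])

# Made-up data: red points inside the unit circle, blue points outside it.
rng = np.random.default_rng(1)
red = rng.uniform(-0.7, 0.7, size=(50, 2))
red = red[np.linalg.norm(red, axis=1) < 0.9]
blue = rng.uniform(-3.0, 3.0, size=(200, 2))
blue = blue[np.linalg.norm(blue, axis=1) > 1.1]

X = lift(np.vstack([red, blue]))
y = np.concatenate([np.ones(len(red)), -np.ones(len(blue))])
# Any linear-separation routine (the LP or perceptron sketches above, with a
# bias term) now finds a separating plane in 3-d, i.e., a circle in 2-d.
```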

Learning a circle...

Theorem. Two sets of points R and B are separable by a circle in two dimensions if and only if ℓ(R) and ℓ(B) are separable by a plane in three dimensions.

Proof

1. Let σ ≡ (x − a)² + (y − b)² = r² be a circle containing R, with all points of B outside.
2. ∀(x, y) ∈ R: (x − a)² + (y − b)² ≤ r², and ∀(x, y) ∈ B: (x − a)² + (y − b)² > r².
3. Equivalently: ∀(x, y) ∈ R: −2ax − 2by + (x² + y²) − r² + a² + b² ≤ 0, and ∀(x, y) ∈ B: −2ax − 2by + (x² + y²) − r² + a² + b² > 0.
4. Setting z = z(x, y) = x² + y², let h(x, y, z) = −2ax − 2by + z − r² + a² + b².
5. Then ∀(x, y) ∈ R: h(x, y, z(x, y)) ≤ 0 ⟺ ∀(x, y) ∈ R: h(ℓ(x, y)) ≤ 0, and ∀(x, y) ∈ B: h(ℓ(x, y)) > 0.
6. Also, p ∈ σ ⟺ h(ℓ(p)) ≤ 0.
7. This proves: if the point sets are separable by a circle ⟹ the lifted point sets ℓ(R) and ℓ(B) are separable by a plane.
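The algebraic step from the circle inequality to the linear function h can be verified symbolically, e.g. with sympy:

```python
import sympy as sp

x, y, a, b, r = sp.symbols('x y a b r')
circle = (x - a)**2 + (y - b)**2 - r**2
h = -2*a*x - 2*b*y + (x**2 + y**2) - r**2 + a**2 + b**2
print(sp.simplify(circle - h))  # 0: the two expressions are identical
```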

Proof: Other direction

1. Assume ℓ(R) and ℓ(B) are linearly separable, and let the separating plane be h ≡ ax + by + cz + d = 0.
2. ∀(x, y, x² + y²) ∈ ℓ(R): ax + by + c(x² + y²) + d ≤ 0.
3. ∀(x, y, x² + y²) ∈ ℓ(B): ax + by + c(x² + y²) + d ≥ 0.
4. Let U(h) = { (x, y) | h((x, y, x² + y²)) ≤ 0 }.
5. If U(h) is a disk ⟹ R ⊂ U(h) and B ∩ U(h) = ∅.
6. U(h) ≡ ax + by + c(x² + y²) ≤ −d
7. ⟺ (x² + (a/c) x) + (y² + (b/c) y) ≤ −d/c (dividing by c; assume c > 0)
8. ⟺ (x + a/(2c))² + (y + b/(2c))² ≤ (a² + b²)/(4c²) − d/c.
9. This is a disk in the plane, as claimed.
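The completing-the-square step can be checked the same way (here assuming c > 0, so dividing by c preserves the inequality):

```python
import sympy as sp

x, y, a, b, c, d = sp.symbols('x y a b c d')
lhs = a*x + b*y + c*(x**2 + y**2) + d
rhs = c*((x + a/(2*c))**2 + (y + b/(2*c))**2
         - (a**2 + b**2)/(4*c**2) + d/c)
print(sp.simplify(lhs - rhs))  # 0: completing the square is exact
```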

A closing comment...

Linear separability is a powerful technique that can be used to learn concepts that are considerably more complicated than hyperplane separation. The lifting technique shown above is known as the kernel technique, or linearization.

A Little Bit On VC Dimension

1. Q: How complex is the function we are trying to learn?
2. The VC dimension is one way of capturing this notion (VC = Vapnik and Chervonenkis, 1971).
3. A matter of expressivity. What is harder to learn:
   1. a rectangle in the plane,
   2. a halfplane,
   3. or a convex polygon with k sides?

Thinking about concepts as binary functions...

1. X = {p_1, p_2, ..., p_m}: points in the plane.
2. H: the set of all halfplanes.
3. A halfplane r ∈ H defines a binary vector r(X) = (b_1, ..., b_m), where b_i = 1 if and only if p_i is inside r.
4. The possible binary vectors generated by halfplanes: U(X, H) = { r(X) | r ∈ H }.
5. A set X of m elements is shattered by a set of ranges R if |U(X, R)| = 2^m.
6. What does this mean?
7. The VC dimension of a set of ranges R is the size of the largest set that it can shatter.
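A brute-force way to probe shattering by halfplanes: sweep a threshold along many random directions and collect the induced labelings. This randomized sketch may (rarely) miss a labeling, and the helper names are ours:

```python
import numpy as np

def halfplane_labelings(P, directions=2000, seed=0):
    # Collect binary vectors r(X) induced by halfplanes <u, p> <= t: for many
    # random directions u, sweep the threshold t across the sorted projections.
    rng = np.random.default_rng(seed)
    labelings = set()
    for _ in range(directions):
        u = rng.normal(size=2)
        lab = np.zeros(len(P), dtype=int)
        labelings.add(tuple(lab))          # the empty halfplane
        for i in np.argsort(P @ u):        # grow the halfplane point by point
            lab[i] = 1
            labelings.add(tuple(lab))
    return labelings

def shattered(P):
    return len(halfplane_labelings(P)) == 2 ** len(P)

tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
quad = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(tri), shattered(quad))  # True False: VC dim of halfplanes is 3
```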
