Chapter 22

Learning, Linear Separability and Linear Programming
CS 573: Algorithms, Fall 2013
November 12, 2013

22.1 The Perceptron algorithm

22.1.0.1 Labeling...

(A) Given examples: a database of cars.
(B) We would like to determine which cars are sports cars.
(C) Each car record is interpreted as a point in high dimensions.
(D) Example: a sports car with 4 doors, manufactured in 1997 by Quaky (manufacturer ID 6): (4, 1997, 6). Labeled as a sports car.
(E) A tractor manufactured in 1998 by General Mess (manufacturer ID 3): (0, 1998, 3). Labeled as not a sports car.
(F) Real world: hundreds of attributes. In some cases, even millions of attributes!
(G) Automate this classification process: label sports/regular cars automatically.

22.1.0.2 Automatic classification...

(A) A learning algorithm:
    (A) is given several (or many) classified examples...
    (B) ...develops its own conjecture for the rule of classification.
    (C) ...can then use it for classifying new data.
(B) Learning: training + classifying.
(C) Learn a function f : ℝ^d → {−1, 1}.
(D) Challenge: f might have infinite complexity...
(E) ...a rare situation in the real world. Assume learnable functions.
(F) Example: red and blue points that are linearly separable.
(G) Trying to learn a line ℓ that separates the red points from the blue points.
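To make the setup concrete, here is a minimal sketch of how such labeled examples can be encoded. Python with NumPy is assumed (the lecture does not prescribe a language), and the weight vector w and offset b below are made up purely for illustration, anticipating the linear classifiers defined on the next slides.

```python
import numpy as np

# Each example is a point in R^d together with a label in {-1, +1}.
# Attributes here: (number of doors, year of manufacture, manufacturer ID).
X = np.array([
    [4, 1997, 6],    # sports car by Quaky (ID 6)       -> label +1
    [0, 1998, 3],    # tractor by General Mess (ID 3)   -> label -1
], dtype=float)
y = np.array([+1, -1])

# A learned classifier is a function f : R^d -> {-1, +1}.  For a linear
# classifier, f(x) = sign(<w, x> + b); the values below are hypothetical.
w = np.array([1.0, 0.0, 0.0])
b = -1.0
print(np.sign(X @ w + b))   # once (w, b) is learned correctly, this equals y
```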
22.1.0.3 Linear separability example...

[Figure: red and blue points in the plane, separated by a line ℓ.]

22.1.0.4 Learning linear separation

(A) Given red and blue points – how to compute the separating line ℓ?
(B) A line/plane/hyperplane is the zero set of a linear function.
(C) Form: for all x ∈ ℝ^d,
        f(x) = ⟨a, x⟩ + b,
    where a = (a_1, ..., a_d) ∈ ℝ^d, b ∈ ℝ, and ⟨a, x⟩ = Σ_i a_i x_i is the dot product of a and x.
(D) Classification is done by computing the sign of f(x): sign(f(x)).
(E) If sign(f(x)) is negative: x is not in the class. If positive: x is inside.
(F) A set of training examples:
        S = { (x_1, y_1), ..., (x_n, y_n) },
    where x_i ∈ ℝ^d and y_i ∈ {−1, 1}, for i = 1, ..., n.

22.1.0.5 Classification...

(A) A linear classifier h: a pair (w, b), where w ∈ ℝ^d and b ∈ ℝ.
(B) The classification of x ∈ ℝ^d is sign(⟨w, x⟩ + b).
(C) For a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y.
(D) Assume a linear classifier exists.
(E) Given n labeled examples, how to compute a linear classifier for these examples?
(F) Use linear programming...
(G) Looking for (w, b) such that for every (x_i, y_i) we have sign(⟨w, x_i⟩ + b) = y_i, which is
        ⟨w, x_i⟩ + b ≥ 0   if y_i = 1, and
        ⟨w, x_i⟩ + b ≤ 0   if y_i = −1.
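A minimal sketch of this LP-based approach, assuming SciPy's linprog as the solver (the lecture does not prescribe one). Since an LP solver cannot express strict inequalities, and w = 0, b = 0 trivially satisfies the constraints above, the sketch uses the standard rescaled constraints y_i(⟨w, x_i⟩ + b) ≥ 1, which for strictly separable data are equivalent up to scaling.

```python
import numpy as np
from scipy.optimize import linprog

def learn_linear_classifier(X, y):
    """Find (w, b) with y_i * (<w, x_i> + b) >= 1 for all i, as an LP
    feasibility problem.  X is (n, d); y is (n,) with entries in {-1, +1}."""
    n, d = X.shape
    # Variables z = (w_1, ..., w_d, b).
    # y_i * (x_i . w + b) >= 1   <=>   -y_i * (x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)              # pure feasibility: any feasible point will do
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    if not res.success:
        return None                  # no separating hyperplane found
    return res.x[:d], res.x[d]

# Tiny made-up example: class +1 on the left, class -1 on the right.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [5.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w, b = learn_linear_classifier(X, y)
print(np.sign(X @ w + b))            # should match y
```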
22.1.0.6 Classification...

(A) Or equivalently, let x_i = (x_i^1, ..., x_i^d) ∈ ℝ^d, for i = 1, ..., n, and let w = (w_1, ..., w_d); then we get the linear constraints
        Σ_{k=1}^d w_k x_i^k + b ≥ 0   if y_i = 1, and
        Σ_{k=1}^d w_k x_i^k + b ≤ 0   if y_i = −1.
    Thus, we get a set of linear constraints, one for each training example, and we need to solve the resulting linear program.

22.1.0.7 Linear programming for learning?

(A) Stumbling block: linear programming is very sensitive to noise.
(B) If some points are misclassified (mislabeled) ⟹ no solution.
(C) Instead: use an iterative algorithm that converges to the optimal solution if it exists...

22.1.0.8 Perceptron algorithm...

    perceptron(S: a set of labeled examples)
        w_0 ← 0, k ← 0
        R = max_{(x,y) ∈ S} ‖x‖
        repeat
            for (x, y) ∈ S do
                if sign(⟨w_k, x⟩) ≠ y then
                    w_{k+1} ← w_k + y·x
                    k ← k + 1
        until no mistakes are made in the classification
        return w_k and k

22.1.0.9 Perceptron algorithm

(A) Why does the perceptron algorithm converge?
(B) Assume it made a mistake on a sample (x, y) with y = 1. Then ⟨w_k, x⟩ < 0, and
        ⟨w_{k+1}, x⟩ = ⟨w_k + y·x, x⟩ = ⟨w_k, x⟩ + y⟨x, x⟩ = ⟨w_k, x⟩ + y‖x‖² > ⟨w_k, x⟩.
(C) The algorithm is "walking" in the right direction...
(D) ...the new value assigned to x by w_{k+1} is larger ("more positive") than the old value assigned to x by w_k.
(E) After enough iterations of such fix-ups, the label assigned to x would change...

22.1.0.10 Perceptron algorithm converges

Theorem 22.1.1. Let S be a training set of examples, and let R = max_{(x,y) ∈ S} ‖x‖. Suppose that there exists a vector w_opt such that ‖w_opt‖ = 1, and a number γ > 0, such that
        y ⟨w_opt, x⟩ ≥ γ   for all (x, y) ∈ S.
Then, the number of mistakes made by the online perceptron algorithm on S is at most (R/γ)².
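A direct transcription of the perceptron pseudocode above into Python (a sketch; NumPy is assumed, and, exactly as in the pseudocode, there is no bias term b):

```python
import numpy as np

def perceptron(S):
    """S is a list of (x, y) pairs, with x a NumPy vector and y in {-1, +1}.
    Returns the final weight vector w_k and the number of updates k."""
    w = np.zeros(len(S[0][0]))           # w_0 <- 0
    k = 0                                # update counter
    while True:
        made_mistake = False
        for x, y in S:
            if np.sign(w @ x) != y:      # misclassified (sign(0) = 0 also counts)
                w = w + y * x            # w_{k+1} <- w_k + y * x
                k += 1
                made_mistake = True
        if not made_mistake:             # until no mistakes are made
            return w, k

# Tiny made-up linearly separable instance:
S = [(np.array([ 1.0,  2.0]), +1), (np.array([ 2.0,  1.0]), +1),
     (np.array([-1.0, -2.0]), -1), (np.array([-2.0, -0.5]), -1)]
w, k = perceptron(S)
print(w, k)
```

As in the pseudocode, the separator found here passes through the origin; an offset b can be handled by appending a constant coordinate 1 to every x.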
22.1.0.11 Claim by figure...

[Figure: two instances with the same radius R. "Hard": the separating line ℓ realized by w_opt has a small margin γ; # errors: (R/γ)². "Easy": the margin γ′ is large; # errors: (R/γ′)².]

22.1.0.12 Proof of Perceptron convergence...

(A) Idea of proof: the perceptron weight vector converges to w_opt.
(B) Squared distance between the k-th weight vector and (a scaled copy of) w_opt:
        α_k = ‖ w_k − (R²/γ) w_opt ‖².
(C) Quantify the change between α_k and α_{k+1}.
(D) The example being misclassified is (x, y).
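The potential α_k can also be watched numerically. The sketch below (NumPy assumed; the data set and w_opt are made up so that the margin γ is easy to compute) runs the perceptron updates and prints α_k after every mistake; the derivation on the next slides shows that each mistake decreases α_k by at least R².

```python
import numpy as np

# Separable instance with a known separator: w_opt = (1, 0) has unit norm, so
# the margin gamma = min over S of y * <w_opt, x> is easy to compute.
S = [(np.array([ 2.0,  1.0]), +1), (np.array([ 3.0, -1.0]), +1),
     (np.array([-2.0,  2.0]), -1), (np.array([-3.0, -2.0]), -1)]
w_opt = np.array([1.0, 0.0])
R = max(np.linalg.norm(x) for x, _ in S)
gamma = min(y * (w_opt @ x) for x, y in S)

def alpha(w):
    """alpha_k = || w_k - (R^2/gamma) * w_opt ||^2, the proof's potential."""
    return float(np.linalg.norm(w - (R**2 / gamma) * w_opt) ** 2)

w, k = np.zeros(2), 0
print("alpha_0 =", alpha(w), "  R^4/gamma^2 =", R**4 / gamma**2)
while True:
    mistakes = 0
    for x, y in S:
        if np.sign(w @ x) != y:          # perceptron update on a mistake
            w, k, mistakes = w + y * x, k + 1, mistakes + 1
            print(f"after update {k}: alpha = {alpha(w):.2f} "
                  f"(drop is at least R^2 = {R**2:.2f})")
    if mistakes == 0:
        break
print("total updates:", k, "  bound (R/gamma)^2 =", (R / gamma) ** 2)
```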
22.1.0.13 Proof of Perceptron convergence...

(A) The example being misclassified is (x, y) (both are constants).
(B) w_{k+1} ← w_k + y·x.
(C) α_{k+1} = ‖ w_{k+1} − (R²/γ) w_opt ‖² = ‖ w_k + y·x − (R²/γ) w_opt ‖²
            = ‖ (w_k − (R²/γ) w_opt) + y·x ‖²
            = ⟨ (w_k − (R²/γ) w_opt) + y·x, (w_k − (R²/γ) w_opt) + y·x ⟩
            = ⟨ w_k − (R²/γ) w_opt, w_k − (R²/γ) w_opt ⟩ + 2y ⟨ w_k − (R²/γ) w_opt, x ⟩ + ⟨ x, x ⟩
            = α_k + 2y ⟨ w_k − (R²/γ) w_opt, x ⟩ + ‖x‖².

22.1.0.14 Proof of Perceptron convergence...

(A) We proved: α_{k+1} = α_k + 2y ⟨ w_k − (R²/γ) w_opt, x ⟩ + ‖x‖².
(B) (x, y) is misclassified: sign(⟨w_k, x⟩) ≠ y
(C) ⟹ sign(y ⟨w_k, x⟩) = −1
(D) ⟹ y ⟨w_k, x⟩ < 0.
(E) ‖x‖ ≤ R ⟹
        α_{k+1} ≤ α_k + R² + 2y ⟨w_k, x⟩ − 2y (R²/γ) ⟨w_opt, x⟩ ≤ α_k + R² − (2R²/γ) y ⟨w_opt, x⟩,
(F) ...since 2y ⟨w_k, x⟩ < 0.

22.1.0.15 Proof of Perceptron convergence...

(A) Proved: α_{k+1} ≤ α_k + R² − (2R²/γ) y ⟨w_opt, x⟩.
(B) sign(⟨w_opt, x⟩) = y.
(C) By the margin assumption: y ⟨w_opt, x⟩ ≥ γ, for all (x, y) ∈ S.
(D) α_{k+1} ≤ α_k + R² − (2R²/γ) y ⟨w_opt, x⟩ ≤ α_k + R² − (2R²/γ) γ = α_k + R² − 2R² = α_k − R².

22.1.0.16 Proof of Perceptron convergence...

(A) We have: α_{k+1} ≤ α_k − R².
(B) α_0 = ‖ 0 − (R²/γ) w_opt ‖² = (R⁴/γ²) ‖w_opt‖² = R⁴/γ².
(C) For all i: α_i ≥ 0.
(D) Q: What is the maximum number of classification errors the algorithm can make?
(E) ...it equals the number of updates.
(F) ...the number of updates is ≤ α_0 / R²...
(G) A: ≤ R²/γ².
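For reference, the counting argument of the last slide in one line (a restatement of (A)–(G), nothing new): since each mistake decreases the potential by at least R², after k updates

        0 ≤ α_k ≤ α_0 − k·R² = R⁴/γ² − k·R²   ⟹   k ≤ α_0/R² = R²/γ² = (R/γ)².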
22.1.0.17 Concluding comment...

Any linear program can be written as the problem of separating red points from blue points. As such, the perceptron algorithm can be used to solve linear programs.

22.2 Learning A Circle

22.2.0.18 Learning a circle...

(A) Given a set of red points and a set of blue points in the plane, we want to learn a circle σ that contains all the red points and does not contain the blue points.
(B) Q: How to compute the circle σ?
(C) Lifting: ℓ : (x, y) → (x, y, x² + y²).
(D) z(P) = { ℓ(x, y) | (x, y) ∈ P }.

22.2.0.19 Learning a circle...

Theorem 22.2.1. Two sets of points R and B are separable by a circle in two dimensions, if and only if ℓ(R) and ℓ(B) are separable by a plane in three dimensions.

22.2.0.20 Proof

(A) σ ≡ (x − a)² + (y − b)² = r²: a circle containing R, with all points of B outside.
(B) ∀(x, y) ∈ R: (x − a)² + (y − b)² ≤ r²
    ∀(x, y) ∈ B: (x − a)² + (y − b)² > r².
(C) ∀(x, y) ∈ R: −2ax − 2by + (x² + y²) − r² + a² + b² ≤ 0
    ∀(x, y) ∈ B: −2ax − 2by + (x² + y²) − r² + a² + b² > 0.
(D) Setting z = z(x, y) = x² + y² and h(x, y, z) = −2ax − 2by + z − r² + a² + b²:
    ∀(x, y) ∈ R: h(x, y, z(x, y)) ≤ 0
(E) ⟺ ∀(x, y) ∈ R: h(ℓ(x, y)) ≤ 0
       ∀(x, y) ∈ B: h(ℓ(x, y)) > 0
(F) p ∈ σ (i.e., p lies inside or on the circle) ⟺ h(ℓ(p)) ≤ 0.
(G) Proved: if the point sets are separable by a circle ⟹ the lifted point sets ℓ(R) and ℓ(B) are separable by a plane.

22.2.0.21 Proof: Other direction

(A) Assume ℓ(R) and ℓ(B) are linearly separable. Let the separating plane be h ≡ ax + by + cz + d = 0.
(B) ∀(x, y, x² + y²) ∈ ℓ(R): ax + by + c(x² + y²) + d ≤ 0
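To make the lifting of Theorem 22.2.1 concrete, here is a minimal, self-contained sketch (NumPy assumed; the red/blue data are made up for illustration). It lifts the points via ℓ, finds a separating plane ax + by + cz + d = 0 with the perceptron of Section 22.1 (a constant coordinate is appended in place of a bias term), and reads the circle's center and radius off the plane's coefficients, following the algebra in the proof above.

```python
import numpy as np

def lift(p):
    """The lifting map ell: (x, y) -> (x, y, x^2 + y^2)."""
    x, y = p
    return np.array([x, y, x * x + y * y])

def perceptron(S):
    """Homogeneous perceptron (no bias term); returns a weight vector."""
    w = np.zeros(len(S[0][0]))
    while True:
        mistakes = 0
        for x, y in S:
            if np.sign(w @ x) != y:
                w, mistakes = w + y * x, mistakes + 1
        if mistakes == 0:
            return w

# Made-up data: red points clustered near (1, 1), blue points surrounding them.
red  = [(1.0, 1.0), (1.3, 0.8), (0.7, 1.2)]
blue = [(3.0, 1.0), (1.0, 3.0), (-1.0, 1.0), (1.0, -1.5)]

# Lifted red points get label -1 (the h <= 0 side), lifted blue points +1.
# Appending a constant coordinate 1 lets the homogeneous perceptron return
# the affine separator w = (a, b, c, d) of the plane ax + by + cz + d = 0.
S = [(np.append(lift(p), 1.0), -1) for p in red] + \
    [(np.append(lift(p), 1.0), +1) for p in blue]
a, b, c, d = perceptron(S)

# Read off the circle (x - cx)^2 + (y - cy)^2 = r^2 from
# a x + b y + c (x^2 + y^2) + d <= 0, assuming c > 0 (it is here, since the
# blue points surround the red ones).
cx, cy = -a / (2 * c), -b / (2 * c)
r = np.sqrt(cx * cx + cy * cy - d / c)
print("center:", (cx, cy), "radius:", r)
```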