Online Learning II

Presenter: Adams Wei Yu
Carnegie Mellon University
March 2015
Recap of Online Learning

- The data arrive sequentially.
- No assumption on the data distribution is needed.
- Adversarial setting (worst-case analysis).
- Regret minimization:
  $$R_T = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i \in \{1,\dots,N\}} \sum_{t=1}^{T} L(\hat{y}_{t,i}, y_t)$$
- Several simple algorithms with theoretical guarantees (Halving, Weighted Majority, Randomized Weighted Majority, Exponential Weighted Average).
Weighted Majority Algorithm

Algorithm 1 WEIGHTED-MAJORITY(N)
    for i ← 1 to N do
        w_{1,i} ← 1
    for t ← 1 to T do
        RECEIVE(x_t)
        if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
            ŷ_t ← 1
        else
            ŷ_t ← 0
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            for i ← 1 to N do
                if y_{t,i} ≠ y_t then
                    w_{t+1,i} ← β w_{t,i}
                else
                    w_{t+1,i} ← w_{t,i}
    return w_{T+1}
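As a concrete illustration, here is a minimal Python/NumPy sketch of the listing above; the function name, the (T, N) array `expert_preds` of 0/1 expert predictions, and the default β = 0.5 are my own choices, not from the slides.

```python
import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: expert_preds has shape (T, N) with 0/1 entries,
    labels has shape (T,) with 0/1 entries."""
    T, N = expert_preds.shape
    w = np.ones(N)                       # w_{1,i} = 1
    mistakes = 0
    for t in range(T):
        preds = expert_preds[t]
        # predict 1 iff the weighted vote for 1 is at least the vote for 0
        y_hat = 1 if w[preds == 1].sum() >= w[preds == 0].sum() else 0
        if y_hat != labels[t]:
            mistakes += 1
            # on a mistake, penalize every expert that was wrong this round
            w[preds != labels[t]] *= beta
    return w, mistakes
```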
Randomized Weighted Majority Algorithm

Algorithm 2 RANDOMIZED-WEIGHTED-MAJORITY(N)
    for i ← 1 to N do
        w_{1,i} ← 1;  p_{1,i} ← 1/N
    for t ← 1 to T do
        RECEIVE(x_t)
        p̃_1 ← Σ_{i: y_{t,i}=1} p_{t,i};  p̃_0 ← Σ_{i: y_{t,i}=0} p_{t,i}
        draw u ∼ Uniform(0, 1)
        if u < p̃_1 then
            ŷ_t ← 1
        else
            ŷ_t ← 0
        for i ← 1 to N do
            if l_{t,i} = 1 then
                w_{t+1,i} ← β w_{t,i}
            else
                w_{t+1,i} ← w_{t,i}
        W_{t+1} ← Σ_{i=1}^{N} w_{t+1,i}
        for i ← 1 to N do
            p_{t+1,i} ← w_{t+1,i} / W_{t+1}
    return w_{T+1}
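A matching sketch of the randomized variant, under the same illustrative conventions (0/1 predictions and labels, β supplied by the caller); note that here the weights are updated and renormalized on every round, not only on mistakes.

```python
import numpy as np

def randomized_weighted_majority(expert_preds, labels, beta=0.5, rng=None):
    """Randomized Weighted Majority on 0/1 expert predictions and labels."""
    if rng is None:
        rng = np.random.default_rng()
    T, N = expert_preds.shape
    w = np.ones(N)
    p = w / N
    mistakes = 0
    for t in range(T):
        preds = expert_preds[t]
        p1 = p[preds == 1].sum()            # probability of predicting 1
        y_hat = 1 if rng.uniform() < p1 else 0
        mistakes += int(y_hat != labels[t])
        loss = (preds != labels[t])         # l_{t,i} = 1 iff expert i is wrong
        w = np.where(loss, beta * w, w)     # multiplicative penalty
        p = w / w.sum()                     # renormalize to a distribution
    return w, mistakes
```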
Topics Today

- Perceptron algorithm and its mistake bound.
- Winnow algorithm and its mistake bound.
- Conversion from online to batch algorithms, and its analysis.
Perceptron Algorithm

Algorithm 3 PERCEPTRON(w_0)
    w_1 ← w_0                           ⊲ typically w_0 = 0
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← sgn(w_t · x_t)
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            w_{t+1} ← w_t + y_t x_t     ⊲ more generally, w_t + η y_t x_t
        else
            w_{t+1} ← w_t
    return w_{T+1}

If x_t is misclassified, then y_t (w_t · x_t) is negative. After the update, y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + η ‖x_t‖_2^2, so the term y_t (w_t · x_t) is corrected by η ‖x_t‖_2^2.
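A minimal NumPy sketch of the listing above for labels in {−1, +1}; with eta = 1 it performs exactly the update on the slide (the function and argument names are mine).

```python
import numpy as np

def perceptron(X, y, eta=1.0, w0=None):
    """Online Perceptron: X has shape (T, N), y in {-1, +1}^T."""
    T, N = X.shape
    w = np.zeros(N) if w0 is None else w0.astype(float).copy()
    updates = 0
    for t in range(T):
        y_hat = np.sign(w @ X[t])
        if y_hat != y[t]:                  # mistake: y_t (w_t . x_t) <= 0
            w = w + eta * y[t] * X[t]      # w_{t+1} = w_t + eta y_t x_t
            updates += 1
    return w, updates
```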
Another Point of View: Stochastic Gradient Descent

The Perceptron algorithm can be seen as minimizing the objective function F:
$$F(w) = \frac{1}{T} \sum_{t=1}^{T} \max\big(0, -y_t (w \cdot x_t)\big) = \mathbb{E}_{x \sim \hat{D}}\big[\tilde{F}(w, x)\big],$$
where $\tilde{F}(w, x) = \max(0, -f(x)(w \cdot x))$ with $f(x)$ the label of $x$, and $\hat{D}$ the empirical distribution of the sample $(x_1, \dots, x_T)$.

$F(w)$ is convex in $w$.
Another Point of View: Stochastic Gradient Descent

$$w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w \tilde{F}(w_t, x_t), & \text{if } \tilde{F}(w, x_t) \text{ is differentiable at } w_t \\ w_t, & \text{otherwise.} \end{cases}$$

Note that $\tilde{F}(w, x_t) = \max(0, -y_t (w \cdot x_t))$ and
$$\nabla_w \tilde{F}(w, x_t) = \begin{cases} -y_t x_t, & \text{if } y_t (w \cdot x_t) < 0 \\ 0, & \text{if } y_t (w \cdot x_t) > 0, \end{cases}$$
which gives
$$w_{t+1} \leftarrow \begin{cases} w_t + \eta y_t x_t, & \text{if } y_t (w_t \cdot x_t) < 0 \\ w_t, & \text{if } y_t (w_t \cdot x_t) > 0 \\ w_t, & \text{otherwise.} \end{cases}$$
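The same update, written explicitly as a stochastic subgradient step on $\tilde{F}(w, x) = \max(0, -y (w \cdot x))$; this is only a sketch, and it takes the subgradient at the kink $y (w \cdot x) = 0$ to be 0, matching the "otherwise" case above.

```python
import numpy as np

def perceptron_loss_subgrad(w, x, y):
    """A subgradient of max(0, -y (w . x)) with respect to w."""
    if y * (w @ x) < 0:
        return -y * x                    # loss is active and differentiable
    return np.zeros_like(w)              # loss is 0 (or we sit at the kink)

def sgd_perceptron(X, y, eta=1.0):
    """One pass of SGD on the empirical perceptron loss."""
    w = np.zeros(X.shape[1])
    for t in range(X.shape[0]):
        w = w - eta * perceptron_loss_subgrad(w, X[t], y[t])
    return w
```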
Upper Bound on the Number of Mistakes: Separable Case

Theorem 1. Let $x_1, \dots, x_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$. Assume that there exist $\rho > 0$ and $v \in \mathbb{R}^N$ such that for all $t \in [1, T]$,
$$\rho \leq \frac{y_t (v \cdot x_t)}{\|v\|}.$$
Then the number of updates made by the Perceptron algorithm when processing $x_1, \dots, x_T$ is bounded by $r^2/\rho^2$.
Proof

Let $I$ be the subset of the $T$ rounds at which there is an update, and let $M = |I|$ be the total number of updates. Then
$$\begin{aligned}
M\rho &\leq \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|} && \big(\rho \leq y_t (v \cdot x_t)/\|v\|\big) \\
&\leq \Big\| \sum_{t \in I} y_t x_t \Big\| && \text{(Cauchy-Schwarz inequality)} \\
&= \Big\| \sum_{t \in I} (w_{t+1} - w_t) \Big\| && \text{(definition of updates)} \\
&= \|w_{T+1}\| && \text{(telescoping sum, } w_0 = 0) \\
&= \sqrt{\sum_{t \in I} \big(\|w_{t+1}\|^2 - \|w_t\|^2\big)} && \text{(telescoping sum, } w_0 = 0) \\
&= \sqrt{\sum_{t \in I} \big(\|w_t + y_t x_t\|^2 - \|w_t\|^2\big)} && \text{(definition of updates)} \\
&= \sqrt{\sum_{t \in I} \big(2 y_t (w_t \cdot x_t) + \|x_t\|^2\big)} \leq \sqrt{\sum_{t \in I} \|x_t\|^2} \leq \sqrt{M r^2},
\end{aligned}$$
where the last inequality in the chain uses $y_t (w_t \cdot x_t) \leq 0$ at every update round. Hence $M\rho \leq \sqrt{M}\, r$, i.e. $M \leq r^2/\rho^2$.
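As an informal sanity check of Theorem 1 (not part of the slides), one can run the Perceptron on synthetic separable data and compare the number of updates with $r^2/\rho^2$; the data-generating choices below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 2000
v = rng.normal(size=N); v /= np.linalg.norm(v)

# keep only points whose margin with respect to v is at least rho
rho = 0.2
X = rng.normal(size=(5 * T, N))
X = X[np.abs(X @ v) >= rho][:T]
y = np.sign(X @ v)
r = np.linalg.norm(X, axis=1).max()

w = np.zeros(N); updates = 0
for t in range(len(X)):
    if np.sign(w @ X[t]) != y[t]:
        w += y[t] * X[t]
        updates += 1

print(updates, "updates vs. bound", (r / rho) ** 2)
```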
Remarks

- The Perceptron algorithm is simple.
- The bound on the number of updates depends only on the margin ρ (we may assume r = 1) and is independent of the dimension N.
- The O(1/ρ^2) bound is tight for the Perceptron algorithm.
- The algorithm may be very slow when ρ is small; several passes over the data may be needed.
- It loops forever if the data are not separable.
Upper Bound on the Number of Mistakes: Inseparable Case

Theorem 2. Let $x_1, \dots, x_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|x_t\| \leq r$ for all $t \in [1, T]$, for some $r > 0$. Let $\rho > 0$ and $v \in \mathbb{R}^N$ with $\|v\| = 1$. Define the deviation of $x_t$ by $d_t = \max\{0, \rho - y_t (v \cdot x_t)\}$ and let $\delta = \sqrt{\sum_{t=1}^{T} d_t^2}$. Then the number of updates made by the Perceptron algorithm when processing $x_1, \dots, x_T$ is bounded by $(r + \delta)^2/\rho^2$.

Key idea: construct data points in a higher-dimensional space that are separable and have the same prediction behavior as in the original space.
Proof

We first reduce the problem to the separable case by mapping each data point $x_t \in \mathbb{R}^N$ to a higher-dimensional vector $x'_t \in \mathbb{R}^{N+T}$:
$$x_t = (x_{t,1}, \dots, x_{t,N})^\top \;\to\; x'_t = (x_{t,1}, \dots, x_{t,N}, 0, \dots, \underbrace{\Delta}_{(N+t)\text{-th component}}, \dots, 0)^\top,$$
$$v = (v_1, \dots, v_N)^\top \;\to\; v' = \big(v_1/Z, \dots, v_N/Z,\; y_1 d_1/(\Delta Z), \dots, y_T d_T/(\Delta Z)\big)^\top.$$
To make $\|v'\| = 1$, we take $Z = \sqrt{1 + \delta^2/\Delta^2}$. The predictions made by the Perceptron for $x'_t$, $t \in [1, T]$, then coincide with those made in the original space for $x_t$.
Proof (cont'd)

$$y_t (v' \cdot x'_t) = y_t \Big(\frac{v \cdot x_t}{Z} + \Delta \frac{y_t d_t}{\Delta Z}\Big) = \frac{y_t (v \cdot x_t)}{Z} + \frac{d_t}{Z} \geq \frac{y_t (v \cdot x_t) + \rho - y_t (v \cdot x_t)}{Z} = \frac{\rho}{Z}.$$
So $x'_1, \dots, x'_T$ are linearly separable with margin $\rho/Z$. Noting that $\|x'_t\|^2 \leq r^2 + \Delta^2$ and applying Theorem 1, the number of updates made by the Perceptron algorithm is bounded by
$$\frac{(r^2 + \Delta^2)(1 + \delta^2/\Delta^2)}{\rho^2}.$$
Expanding the numerator gives $r^2 + \delta^2 + \Delta^2 + r^2\delta^2/\Delta^2$, which is minimized at $\Delta^2 = r\delta$; the bound then becomes
$$\frac{(r + \delta)^2}{\rho^2}.$$
Dual Perceptron

For the original Perceptron, we can write the separating hyperplane as
$$w = \sum_{s=1}^{T} \alpha_s y_s x_s,$$
where $\alpha_s$ is incremented by one whenever the prediction on $x_s$ does not match the correct label. The algorithm can then be written as:

Algorithm 4 DUAL PERCEPTRON(α_0)
    α ← α_0                             ⊲ typically α_0 = 0
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← sgn(Σ_{s=1}^{T} α_s y_s (x_s · x_t))
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            α_t ← α_t + 1
        else
            α_t ← α_t
    return α
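A Python sketch of the dual form, assuming the whole sequence X is available so that α can be indexed by example; α_s ends up counting the mistakes made on x_s.

```python
import numpy as np

def dual_perceptron(X, y):
    """Dual Perceptron: X has shape (T, N), y in {-1, +1}^T."""
    T = X.shape[0]
    alpha = np.zeros(T)                  # alpha[s] = mistakes made on example s
    for t in range(T):
        # w = sum_s alpha_s y_s x_s, so w . x_t = sum_s alpha_s y_s (x_s . x_t)
        score = np.sum(alpha * y * (X @ X[t]))
        if np.sign(score) != y[t]:
            alpha[t] += 1
    return alpha
```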
Kernel Perceptron

Algorithm 5 KERNEL PERCEPTRON(α_0)
    α ← α_0                             ⊲ typically α_0 = 0
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← sgn(Σ_{s=1}^{T} α_s y_s K(x_s, x_t))
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            α_t ← α_t + 1
        else
            α_t ← α_t
    return α

Any PDS (positive definite symmetric) kernel can be used.
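The same loop with a kernel evaluation in place of the inner product; the Gaussian (RBF) kernel below is just one admissible PDS choice, and its bandwidth γ = 1 is arbitrary.

```python
import numpy as np

def kernel_perceptron(X, y, kernel):
    """Kernel Perceptron: predicts with sgn(sum_s alpha_s y_s K(x_s, x_t))."""
    T = X.shape[0]
    alpha = np.zeros(T)
    K = np.array([[kernel(xs, xt) for xt in X] for xs in X])  # Gram matrix
    for t in range(T):
        if np.sign(np.sum(alpha * y * K[:, t])) != y[t]:
            alpha[t] += 1
    return alpha

# example kernel: Gaussian (RBF), one possible PDS choice
rbf = lambda a, b, gamma=1.0: np.exp(-gamma * np.sum((a - b) ** 2))
# usage: alpha = kernel_perceptron(X, y, rbf)
```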
Winnow Algorithm

Algorithm 6 WINNOW(η)
    w_1 ← 1/N
    for t ← 1 to T do
        RECEIVE(x_t)
        ŷ_t ← sgn(w_t · x_t)
        RECEIVE(y_t)
        if ŷ_t ≠ y_t then
            Z_t ← Σ_{i=1}^{N} w_{t,i} exp(η y_t x_{t,i})
            for i ← 1 to N do
                w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
        else
            w_{t+1} ← w_t
    return w_{T+1}
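A sketch of the multiplicative update for labels in {−1, +1}; per Theorem 3 on the next slide, a good choice of eta in the separable case is ρ∞/r∞² (all names here are illustrative).

```python
import numpy as np

def winnow(X, y, eta):
    """Winnow: X has shape (T, N), y in {-1, +1}^T; weights stay on the simplex."""
    T, N = X.shape
    w = np.full(N, 1.0 / N)              # w_1 = 1/N
    mistakes = 0
    for t in range(T):
        if np.sign(w @ X[t]) != y[t]:
            mistakes += 1
            w = w * np.exp(eta * y[t] * X[t])
            w = w / w.sum()              # divide by Z_t
        # else: w_{t+1} = w_t
    return w, mistakes
```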
Upper Bound on the Number of Mistakes: Separable Case

Theorem 3. Let $x_1, \dots, x_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|x_t\|_\infty \leq r_\infty$ for all $t \in [1, T]$, for some $r_\infty > 0$. Assume that there exist $\rho_\infty > 0$ and $v \in \mathbb{R}^N$ such that for all $t \in [1, T]$,
$$\rho_\infty \leq \frac{y_t (v \cdot x_t)}{\|v\|_1}.$$
Then, for $\eta = \rho_\infty / r_\infty^2$, the number of updates made by the Winnow algorithm when processing $x_1, \dots, x_T$ is upper bounded by $2 (r_\infty^2/\rho_\infty^2) \log N$.
Proof

Let $I$ be the subset of the $T$ rounds at which there is an update, and let $M = |I|$ be the total number of updates.

The potential function $\Phi_t$ is the relative entropy between the distribution defined by the normalized weights $v_i/\|v\|_1$, $i \in [1, N]$, and the one defined by the components of the weight vector $w_t$:
$$\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{w_{t,i}}.$$