compsci 514: algorithms for data science
Cameron Musco
University of Massachusetts Amherst. Spring 2020.
Lecture 24 (Final Lecture!)

logistics

• Problem Set 4 is due Sunday 5/3 at 8pm.
• Exam is at 2pm on May 6th. Open note, similar to midterm.
• Exam review guide and practice problems have been posted under the schedule tab on the course page.
• I will hold usual office hours today and exam review office hours this Thursday and next Tuesday during the regular class time 11:30am-12:45pm.
• Regular SRTI's are suspended this semester. But I am holding an optional SRTI for this class and would really appreciate your feedback.
• http://owl.umass.edu/partners/courseEvalSurvey/uma/

summary

Last Class:
• Analysis of gradient descent for optimizing convex functions.
• (The same) analysis of projected gradient descent for optimizing under (convex) constraints.
• Convex sets and projection functions.

This Class:
• Online learning, regret, and online gradient descent.
• Application to analysis of stochastic gradient descent (if time).
• Course summary/wrap-up

online gradient descent

In reality many learning problems are online.
• Websites optimize ads or recommendations to show users, given continuous feedback from these users.
• Spam filters are incrementally updated and adapt as they see more examples of spam over time.
• Face recognition systems, other classification systems, learn from mistakes over time.

Want to minimize some global loss $L(\vec{\theta}, X) = \sum_{i=1}^n \ell(\vec{\theta}, \vec{x}_i)$, when data points are presented in an online fashion $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n$ (like in streaming algorithms).

Stochastic gradient descent is a special case: when data points are considered in a random order for computational reasons.

online optimization formal setup

Online Optimization: In place of a single function $f$, we see a different objective function at each step:
$$f_1, \ldots, f_t : \mathbb{R}^d \to \mathbb{R}$$
• At each step, first pick (play) a parameter vector $\vec{\theta}^{(i)}$.
• Then we are told $f_i$ and incur cost $f_i(\vec{\theta}^{(i)})$.
• Goal: Minimize total cost $\sum_{i=1}^t f_i(\vec{\theta}^{(i)})$.

No assumptions on how $f_1, \ldots, f_t$ are related to each other!

online optimization example

UI design via online optimization.
• Parameter vector $\vec{\theta}^{(i)}$: some encoding of the layout at step $i$.
• Functions $f_1, \ldots, f_t$: $f_i(\vec{\theta}^{(i)}) = 1$ if user does not click 'add to cart' and $f_i(\vec{\theta}^{(i)}) = 0$ if they do click.
• Want to maximize number of purchases. I.e., minimize $\sum_{i=1}^t f_i(\vec{\theta}^{(i)})$.

online optimization example

Home pricing tools.
• Parameter vector $\vec{\theta}^{(i)}$: coefficients of linear model at step $i$.
• Functions $f_1, \ldots, f_t$: $f_i(\vec{\theta}^{(i)}) = (\langle \vec{x}_i, \vec{\theta}^{(i)} \rangle - price_i)^2$, revealed when home $i$ is listed or sold.
• Want to minimize total squared error $\sum_{i=1}^t f_i(\vec{\theta}^{(i)})$ (same as classic least squares regression).

regret

In normal optimization, we seek $\hat{\theta}$ satisfying:
$$f(\hat{\theta}) \le \min_{\vec{\theta}} f(\vec{\theta}) + \epsilon.$$
In online optimization we will ask for the same:
$$\sum_{i=1}^t f_i(\vec{\theta}^{(i)}) \le \min_{\vec{\theta}} \sum_{i=1}^t f_i(\vec{\theta}) + \epsilon = \sum_{i=1}^t f_i(\vec{\theta}^{off}) + \epsilon.$$
$\epsilon$ is called the regret.
• This error metric is a bit 'unfair'. Why?
• Comparing online solution to best fixed solution in hindsight. $\epsilon$ can be negative!

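To make the regret definition concrete, here is a minimal sketch that computes it for one-dimensional losses. The loss family $f_i(\theta) = |\theta - c_i|$, the helper names, and the two example players are all assumptions chosen for illustration; they are not from the lecture. For this family a median of the $c_i$ is a best fixed point in hindsight.

```python
from statistics import median


def total_cost(plays, centers):
    """Online cost: sum of f_i(theta_i) = |theta_i - c_i| over all steps."""
    return sum(abs(theta - c) for theta, c in zip(plays, centers))


def regret(plays, centers):
    """Online cost minus the cost of the best fixed point in hindsight.

    Assumes losses of the form |theta - c_i|, for which a median of the
    c_i minimizes the total fixed-point cost.
    """
    theta_off = median(centers)
    offline_cost = sum(abs(theta_off - c) for c in centers)
    return total_cost(plays, centers) - offline_cost


centers = [1.0, 3.0, 2.0, 5.0]
lazy = [0.0] * len(centers)   # hypothetical player that always plays 0
clairvoyant = list(centers)   # hypothetical player that guesses every c_i

print(regret(lazy, centers))         # positive regret
print(regret(clairvoyant, centers))  # negative regret
```

The second player changes its answer at every step, so it beats every fixed solution and its regret is negative, which is the 'unfairness' noted in the bullets.
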
intuition check

What if for $i = 1, \ldots, t$, $f_i(\theta) = |\theta - 1000|$ or $f_i(\theta) = |\theta + 1000|$ in an alternating pattern?

How small can the regret $\epsilon$ be?
$$\sum_{i=1}^t f_i(\vec{\theta}^{(i)}) \le \sum_{i=1}^t f_i(\vec{\theta}^{off}) + \epsilon.$$

What if for $i = 1, \ldots, t$, $f_i(\theta) = |\theta - 1000|$ or $f_i(\theta) = |\theta + 1000|$ in no particular pattern? How can any online learning algorithm hope to achieve small regret?

online gradient descent

Assume that:
• $f_1, \ldots, f_t$ are all convex.
• Each $f_i$ is $G$-Lipschitz (i.e., $\|\vec{\nabla} f_i(\vec{\theta})\|_2 \le G$ for all $\vec{\theta}$.)
• $\|\vec{\theta}^{(1)} - \vec{\theta}^{off}\|_2 \le R$ where $\vec{\theta}^{(1)}$ is the first vector chosen.

Online Gradient Descent
• Set step size $\eta = \frac{R}{G\sqrt{t}}$.
• For $i = 1, \ldots, t$
    • Play $\vec{\theta}^{(i)}$ and incur cost $f_i(\vec{\theta}^{(i)})$.
    • $\vec{\theta}^{(i+1)} = \vec{\theta}^{(i)} - \eta \cdot \vec{\nabla} f_i(\vec{\theta}^{(i)})$

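The update rule above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the course's reference code: the function names, the representation of each $f_i$ by a (sub)gradient callback, and the choice of $R = 1000$ for the demo are assumptions. The demo uses alternating absolute-value losses $|\theta - 1000|$ and $|\theta + 1000|$, which are convex and 1-Lipschitz.

```python
import math


def online_gradient_descent(grads, theta1, R, G):
    """Run OGD. grads[i](theta) returns a (sub)gradient of f_i at theta.

    Returns the list of played vectors theta^(1), ..., theta^(t).
    """
    t = len(grads)
    eta = R / (G * math.sqrt(t))  # step size from the slide
    theta = list(theta1)
    plays = []
    for grad in grads:
        plays.append(list(theta))  # play theta^(i) before f_i is revealed
        g = grad(theta)            # f_i is revealed; take a gradient step
        theta = [x - eta * gx for x, gx in zip(theta, g)]
    return plays


def abs_loss_grad(c):
    """(Sub)gradient of the 1-D loss f(theta) = |theta[0] - c|."""
    def grad(theta):
        d = theta[0] - c
        return [(d > 0) - (d < 0)]  # sign of d, 0 at the kink
    return grad


t = 100
centers = [1000.0 if i % 2 == 0 else -1000.0 for i in range(t)]
plays = online_gradient_descent(
    [abs_loss_grad(c) for c in centers], theta1=[0.0], R=1000.0, G=1.0
)

online_cost = sum(abs(p[0] - c) for p, c in zip(plays, centers))
# For this alternating sequence theta = 0 is one best fixed point in hindsight.
offline_cost = sum(abs(0.0 - c) for c in centers)
ogd_regret = online_cost - offline_cost
bound = 1000.0 * 1.0 * math.sqrt(t)  # R * G * sqrt(t)
print(ogd_regret, bound)
```

In this run the iterate oscillates between 0 and 100, so each pair of steps costs 100 more than the fixed solution and the regret stays below $RG\sqrt{t}$.
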
online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For convex $G$-Lipschitz $f_1, \ldots, f_t$, OGD initialized with starting point $\theta^{(1)}$ within radius $R$ of $\theta^{off}$, using step size $\eta = \frac{R}{G\sqrt{t}}$, has regret bounded by:
$$\left[\sum_{i=1}^t f_i(\theta^{(i)}) - \sum_{i=1}^t f_i(\theta^{off})\right] \le RG\sqrt{t}$$
Average regret goes to 0 as $t \to \infty$. No assumptions on $f_1, \ldots, f_t$!

Step 1.1: For all $i$, $\nabla f_i(\theta^{(i)})^T(\theta^{(i)} - \theta^{off}) \le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.

Convexity $\implies$ Step 1: For all $i$, $f_i(\theta^{(i)}) - f_i(\theta^{off}) \le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{\eta G^2}{2}$.

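online gradient descent analysis

Theorem – OGD on Convex Lipschitz Functions: For convex $G$-Lipschitz $f_1, \ldots, f_t$, OGD initialized with starting point $\theta^{(1)}$ within radius $R$ of $\theta^{off}$, using step size $\eta = \frac{R}{G\sqrt{t}}$, has regret bounded by:
$$\left[\sum_{i=1}^t f_i(\theta^{(i)}) - \sum_{i=1}^t f_i(\theta^{off})\right] \le RG\sqrt{t}$$

Step 1: For all $i$, $f_i(\theta^{(i)}) - f_i(\theta^{off}) \le \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{\eta G^2}{2} \implies$
$$\left[\sum_{i=1}^t f_i(\theta^{(i)}) - \sum_{i=1}^t f_i(\theta^{off})\right] \le \sum_{i=1}^t \frac{\|\theta^{(i)} - \theta^{off}\|_2^2 - \|\theta^{(i+1)} - \theta^{off}\|_2^2}{2\eta} + \frac{t \cdot \eta G^2}{2}.$$
The sum telescopes to at most $\frac{\|\theta^{(1)} - \theta^{off}\|_2^2}{2\eta} \le \frac{R^2}{2\eta}$, and plugging in $\eta = \frac{R}{G\sqrt{t}}$ gives $\frac{RG\sqrt{t}}{2} + \frac{RG\sqrt{t}}{2} = RG\sqrt{t}$.
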
stochastic gradient descent

Stochastic gradient descent is an efficient offline optimization method, seeking $\hat{\theta}$ with
$$f(\hat{\theta}) \le \min_{\vec{\theta}} f(\vec{\theta}) + \epsilon = f(\vec{\theta}^*) + \epsilon.$$
• The most popular optimization method in modern machine learning.
• Easily analyzed as a special case of online gradient descent!

stochastic gradient descent

Assume that:
• $f$ is convex and decomposable as $f(\vec{\theta}) = \sum_{j=1}^n f_j(\vec{\theta})$.
    • E.g., $L(\vec{\theta}, X) = \sum_{j=1}^n \ell(\vec{\theta}, \vec{x}_j)$.
• Each $f_j$ is $\frac{G}{n}$-Lipschitz (i.e., $\|\vec{\nabla} f_j(\vec{\theta})\|_2 \le \frac{G}{n}$ for all $\vec{\theta}$.)
    • What does this imply about how Lipschitz $f$ is?
• Initialize with $\theta^{(1)}$ satisfying $\|\vec{\theta}^{(1)} - \vec{\theta}^*\|_2 \le R$.

Stochastic Gradient Descent
• Set step size $\eta = \frac{R}{G\sqrt{t}}$.
• For $i = 1, \ldots, t$
    • Pick random $j_i \in 1, \ldots, n$.
    • $\vec{\theta}^{(i+1)} = \vec{\theta}^{(i)} - \eta \cdot \vec{\nabla} f_{j_i}(\vec{\theta}^{(i)})$
• Return $\hat{\theta} = \frac{1}{t} \sum_{i=1}^t \vec{\theta}^{(i)}$.

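The procedure above can be sketched for a one-dimensional problem as follows. Everything concrete here is an assumption made for illustration: the function names, the fixed random seed, and the example objective $f(\theta) = \sum_j \frac{1}{n}|\theta - c_j|$, whose components are $\frac{1}{n}$-Lipschitz (so $G = 1$) and whose minimizer is a median of the $c_j$.

```python
import math
import random


def sgd(grad_fj, n, theta1, R, G, t, seed=0):
    """1-D SGD. grad_fj(j, theta) returns a (sub)gradient of f_j at theta.

    Returns the average of the iterates theta^(1), ..., theta^(t).
    """
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    eta = R / (G * math.sqrt(t))   # step size from the slide
    theta = theta1
    running_sum = 0.0
    for _ in range(t):
        running_sum += theta       # accumulate theta^(i) before updating it
        j = rng.randrange(n)       # pick j_i uniformly from the n components
        theta -= eta * grad_fj(j, theta)
    return running_sum / t


def abs_dev_grad(centers):
    """(Sub)gradients of the components f_j(theta) = |theta - c_j| / n."""
    n = len(centers)

    def grad_fj(j, theta):
        d = theta - centers[j]
        return ((d > 0) - (d < 0)) / n  # sign(d) / n, 0 at the kink

    return grad_fj


def objective(theta, centers):
    """f(theta) = sum_j |theta - c_j| / n."""
    return sum(abs(theta - c) for c in centers) / len(centers)


centers = [1.0, 2.0, 3.0, 4.0, 10.0]  # median 3.0 minimizes the objective
theta_hat = sgd(abs_dev_grad(centers), n=len(centers),
                theta1=0.0, R=3.0, G=1.0, t=10000)
print(theta_hat, objective(theta_hat, centers) - objective(3.0, centers))
```

The printed gap depends on the random seed; the guarantee for SGD is only in expectation, so a single run is an illustration rather than a check of the bound.
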
stochastic gradient descent

$$\vec{\theta}^{(i+1)} = \vec{\theta}^{(i)} - \eta \cdot \vec{\nabla} f_{j_i}(\vec{\theta}^{(i)}) \quad \text{vs.} \quad \vec{\theta}^{(i+1)} = \vec{\theta}^{(i)} - \eta \cdot \vec{\nabla} f(\vec{\theta}^{(i)})$$

Note that: $\mathbb{E}[\vec{\nabla} f_{j_i}(\vec{\theta}^{(i)})] = \frac{1}{n} \vec{\nabla} f(\vec{\theta}^{(i)})$.

Analysis extends to any algorithm that takes the gradient step in expectation (batch GD, randomly quantized, measurement noise, differentially private, etc.)

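The identity $\mathbb{E}[\vec{\nabla} f_{j_i}(\vec{\theta})] = \frac{1}{n}\vec{\nabla} f(\vec{\theta})$ can be checked by enumerating all $n$ equally likely choices of $j_i$. The sketch below assumes hypothetical one-dimensional components $f_j(\theta) = (\theta - c_j)^2$, picked only because their gradients are easy to write down.

```python
def grad_fj(theta, c):
    """Gradient of the hypothetical component f_j(theta) = (theta - c)^2."""
    return 2.0 * (theta - c)


centers = [1.0, 4.0, 7.0]
theta = 2.0
n = len(centers)

# Full gradient of f(theta) = sum_j (theta - c_j)^2, written in closed form.
full_grad = 2.0 * n * theta - 2.0 * sum(centers)

# Expectation over a uniformly random index: each j has probability 1/n.
expected_stochastic_grad = sum(grad_fj(theta, c) for c in centers) / n

print(full_grad, expected_stochastic_grad)
```

So one SGD step moves, in expectation, in the direction of the full gradient step, scaled by $\frac{1}{n}$.
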
test of intuition

What does $f_1(\theta) + f_2(\theta) + f_3(\theta)$ look like?

[Figure: plot of three functions $f_1$, $f_2$, $f_3$ for $\theta$ from -10 to 10, with values ranging from 0 to 12000.]

A sum of convex functions is always convex (good exercise).

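stochastic gradient descent analysis

Theorem – SGD on Convex Lipschitz Functions: SGD run with $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations, $\eta = \frac{R}{G\sqrt{t}}$, and starting point within radius $R$ of $\theta^*$, outputs $\hat{\theta}$ satisfying:
$$\mathbb{E}[f(\hat{\theta})] \le f(\theta^*) + \epsilon.$$

Step 1: $f(\hat{\theta}) - f(\theta^*) \le \frac{1}{t} \sum_{i=1}^t [f(\theta^{(i)}) - f(\theta^*)]$

Step 2: $\mathbb{E}[f(\hat{\theta}) - f(\theta^*)] \le \frac{n}{t} \cdot \mathbb{E}\left[\sum_{i=1}^t [f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta^*)]\right]$.

Step 3: $\mathbb{E}[f(\hat{\theta}) - f(\theta^*)] \le \frac{n}{t} \cdot \mathbb{E}\left[\sum_{i=1}^t [f_{j_i}(\theta^{(i)}) - f_{j_i}(\theta^{off})]\right]$.

Step 4: $\mathbb{E}[f(\hat{\theta}) - f(\theta^*)] \le \frac{n}{t} \cdot \underbrace{R \cdot \frac{G}{n} \cdot \sqrt{t}}_{\text{OGD bound}} = \frac{RG}{\sqrt{t}}$, which is at most $\epsilon$ once $t \ge \frac{R^2 G^2}{\epsilon^2}$.
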
sgd vs. gd

Stochastic gradient descent generally takes more iterations than gradient descent.

Each iteration is much cheaper (by a factor of $n$).
$$\vec{\nabla} \sum_{j=1}^n f_j(\vec{\theta}) \quad \text{vs.} \quad \vec{\nabla} f_j(\vec{\theta})$$

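sgd vs. gd

When $f(\vec{\theta}) = \sum_{j=1}^n f_j(\vec{\theta})$ and $\|\vec{\nabla} f_j(\vec{\theta})\|_2 \le \frac{G}{n}$:

Theorem – SGD: After $t \ge \frac{R^2 G^2}{\epsilon^2}$ iterations outputs $\hat{\theta}$ satisfying:
$$\mathbb{E}[f(\hat{\theta})] \le f(\theta^*) + \epsilon.$$

When $\|\vec{\nabla} f(\vec{\theta})\|_2 \le \bar{G}$:

Theorem – GD: After $t \ge \frac{R^2 \bar{G}^2}{\epsilon^2}$ iterations outputs $\hat{\theta}$ satisfying:
$$f(\hat{\theta}) \le f(\theta^*) + \epsilon.$$

$$\|\vec{\nabla} f(\vec{\theta})\|_2 = \|\vec{\nabla} f_1(\vec{\theta}) + \ldots + \vec{\nabla} f_n(\vec{\theta})\|_2 \le \sum_{j=1}^n \|\vec{\nabla} f_j(\vec{\theta})\|_2 \le n \cdot \frac{G}{n} = G.$$

When would this bound be tight?
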
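As a small numeric illustration of the triangle-inequality step above, the sketch below compares $\|\sum_j \vec{\nabla} f_j\|_2$ with $\sum_j \|\vec{\nabla} f_j\|_2$ for two made-up sets of 2-D gradient vectors. The vectors and helper names are assumptions chosen for illustration, not data from the lecture.

```python
import math


def norm(v):
    """Euclidean norm of a 2-D vector."""
    return math.hypot(v[0], v[1])


def norm_of_sum(grads):
    """|| g_1 + ... + g_n ||_2"""
    return norm([sum(coords) for coords in zip(*grads)])


def sum_of_norms(grads):
    """||g_1||_2 + ... + ||g_n||_2"""
    return sum(norm(g) for g in grads)


aligned = [(3.0, 4.0)] * 4                                      # same direction
opposing = [(3.0, 4.0), (-3.0, -4.0), (3.0, 4.0), (-3.0, -4.0)]  # cancel out

print(norm_of_sum(aligned), sum_of_norms(aligned))
print(norm_of_sum(opposing), sum_of_norms(opposing))
```

When the component gradients all point the same way the two quantities agree, so $\bar{G}$ can be as large as $G$; when they cancel, $\bar{G}$ can be much smaller than $G$ and the GD bound is correspondingly better.
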