compsci 514: algorithms for data science


compsci 514: algorithms for data science. Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 21. Topics: finishing the discussion of SGD; understanding gradient descent and SGD as applied to least squares regression; connections to more advanced techniques (accelerated and adaptive gradient methods).


  1. compsci 514: algorithms for data science. Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 21.

  2–3. summary
Last Class:
• Stochastic gradient descent (SGD).
• Online optimization and online gradient descent (OGD).
• Analysis of SGD as a special case of online gradient descent.
This Class:
• Finish discussion of SGD.
• Understanding gradient descent and SGD as applied to least squares regression.
• Connections to more advanced techniques: accelerated methods and adaptive gradient methods.

  4. logistics
This class wraps up the optimization unit. Three remaining classes after break. Give your feedback on Piazza about what you’d like to see:
• High dimensional geometry and connections to random projection.
• Randomized methods for fast approximate SVD, eigendecomposition, regression.
• Fourier methods, compressed sensing, sparse recovery.
• More advanced optimization methods (alternating minimization, k-means clustering, ...).
• Fairness and differential privacy.

  5–8. quick review
Gradient Descent:
• Applies to: any differentiable f : ℝ^d → ℝ.
• Goal: Find θ̂ ∈ ℝ^d with f(θ̂) ≤ min_{θ ∈ ℝ^d} f(θ) + ϵ.
• Update Step: θ^(i+1) = θ^(i) − η·∇f(θ^(i)).
Online Gradient Descent:
• Applies to: f_1, f_2, ..., f_t : ℝ^d → ℝ presented online.
• Goal: Pick θ^(1), ..., θ^(t) ∈ ℝ^d in an online fashion with ∑_{i=1}^t f_i(θ^(i)) ≤ min_{θ ∈ ℝ^d} ∑_{i=1}^t f_i(θ) + ϵ (i.e., achieve regret ≤ ϵ).
• Update Step: θ^(i+1) = θ^(i) − η·∇f_i(θ^(i)).
Stochastic Gradient Descent:
• Applies to: f : ℝ^d → ℝ that can be written as f(θ) = ∑_{i=1}^n f_i(θ).
• Goal: Find θ̂ ∈ ℝ^d with f(θ̂) ≤ min_{θ ∈ ℝ^d} f(θ) + ϵ.
• Update Step: θ^(i+1) = θ^(i) − η·∇f_{j_i}(θ^(i)), where j_i is chosen uniformly at random from 1, ..., n.
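To make the update rules above concrete, here is a minimal Python/numpy sketch (not from the slides; the gradient oracles, step size eta, and iteration count t are illustrative assumptions). SGD is the same loop as GD except that each step queries the gradient of one randomly chosen f_j, and the analysis applies to the average iterate.

import numpy as np

def gradient_descent(grad_f, theta0, eta, t):
    # GD update: theta^(i+1) = theta^(i) - eta * grad f(theta^(i)).
    theta = np.array(theta0, dtype=float)
    for _ in range(t):
        theta = theta - eta * grad_f(theta)
    return theta

def stochastic_gradient_descent(grad_fj, n, theta0, eta, t, seed=0):
    # SGD update: theta^(i+1) = theta^(i) - eta * grad f_{j_i}(theta^(i)),
    # with j_i drawn uniformly from {0, ..., n-1}. Running OGD on the randomly
    # drawn functions f_{j_1}, ..., f_{j_t} gives exactly this loop.
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    theta_hat = np.zeros_like(theta)
    for _ in range(t):
        theta_hat += theta / t       # average iterate: theta_hat = (1/t) * sum_i theta^(i)
        j = rng.integers(n)
        theta = theta - eta * grad_fj(j, theta)
    return theta_hat                 # the analysis bounds E[f(theta_hat)]

Returning the average iterate rather than the final one matches the convexity step in the analysis recapped below.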

  9–11. stochastic gradient analysis recap
Minimizing a finite sum function: f(θ) = ∑_{i=1}^n f_i(θ).
• Stochastic gradient descent is identical to online gradient descent run on the sequence of t functions f_{j_1}, f_{j_2}, ..., f_{j_t}.
• These functions are picked uniformly at random, so in expectation, E[∑_{i=1}^t f_{j_i}(θ^(i))] = (1/n)·E[∑_{i=1}^t f(θ^(i))].
• By convexity, θ̂ = (1/t)·∑_{i=1}^t θ^(i) gives only a better solution. I.e., E[f(θ̂)] ≤ E[(1/t)·∑_{i=1}^t f(θ^(i))].
• Quality directly bounded by the regret analysis for online gradient descent!
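Putting these pieces together as a worked chain of inequalities (a sketch, not spelled out on the slide; it assumes the OGD regret bound RG'√t from last class, applied with the per-function gradient bound G' = G/n used on the next slide):

\[
\frac{1}{n}\,\mathbb{E}\big[f(\hat{\theta})\big]
\;\le\; \mathbb{E}\Big[\frac{1}{t}\sum_{i=1}^{t} \tfrac{1}{n} f(\theta^{(i)})\Big]
\;=\; \mathbb{E}\Big[\frac{1}{t}\sum_{i=1}^{t} f_{j_i}(\theta^{(i)})\Big]
\;\le\; \mathbb{E}\Big[\min_{\theta}\frac{1}{t}\sum_{i=1}^{t} f_{j_i}(\theta)\Big] + \frac{RG}{n\sqrt{t}}
\;\le\; \frac{1}{n}\,f(\theta^{*}) + \frac{RG}{n\sqrt{t}}.
\]

Multiplying through by n gives \(\mathbb{E}[f(\hat{\theta})] \le f(\theta^{*}) + RG/\sqrt{t}\), which is at most \(f(\theta^{*}) + \epsilon\) once \(t \ge R^2 G^2 / \epsilon^2\), matching the SGD theorem on the next slide.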

  12. sgd vs. gd
Stochastic gradient descent generally makes more iterations than gradient descent. Each iteration is much cheaper (by a factor of n):
∇f(θ) = ∑_{j=1}^n ∇f_j(θ)   vs.   ∇f_j(θ).
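For example, for the least squares objective f(θ) = ‖Aθ − b‖₂² = ∑_j (a_jᵀθ − b_j)² discussed in this unit, the full gradient touches all n rows of A while a single f_j gradient touches one row. A minimal numpy sketch (illustrative, not from the slides):

import numpy as np

def full_gradient(A, b, theta):
    # grad f(theta) = 2 * A^T (A theta - b): sums all n per-example gradients, O(nd) time.
    return 2 * A.T @ (A @ theta - b)

def single_gradient(A, b, theta, j):
    # grad f_j(theta) = 2 * (a_j^T theta - b_j) * a_j: uses only row j, O(d) time.
    return 2 * (A[j] @ theta - b[j]) * A[j]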

  13–15. sgd vs. gd
Consider f(θ) = ∑_{j=1}^n f_j(θ) with each f_j convex.
Theorem – SGD: If ‖∇f_j(θ)‖₂ ≤ G/n for all θ, then after t ≥ R²G²/ϵ² iterations SGD outputs θ̂ satisfying E[f(θ̂)] ≤ f(θ*) + ϵ.
Theorem – GD: If ‖∇f(θ)‖₂ ≤ Ḡ for all θ, then after t ≥ R²Ḡ²/ϵ² iterations GD outputs θ̂ satisfying f(θ̂) ≤ f(θ*) + ϵ.
‖∇f(θ)‖₂ = ‖∇f_1(θ) + ... + ∇f_n(θ)‖₂ ≤ ∑_{j=1}^n ‖∇f_j(θ)‖₂ ≤ n·(G/n) = G.
When would this bound be tight? I.e., when does SGD take the same number of iterations as GD? When is it loose? I.e., when does SGD perform very poorly compared to GD?
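One way to see the tight vs. loose cases (an illustration, not the slides' answer): the triangle inequality above is an equality when all the ∇f_j(θ) point in the same direction, and very loose when they mostly cancel. For instance, if f_1 = f_2 = ... = f_n, then

\[
\|\nabla f(\theta)\|_2 \;=\; n\,\|\nabla f_j(\theta)\|_2 \;\le\; n \cdot \frac{G}{n} \;=\; G,
\]

so Ḡ can be as large as G and the two iteration bounds match, while each SGD iteration is still a factor of n cheaper. If instead the ∇f_j(θ) largely cancel, ‖∇f(θ)‖₂ can be far smaller than G, so Ḡ ≪ G and GD's iteration bound is much smaller than SGD's.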
