COMPSCI 514: Algorithms for Data Science

  1. COMPSCI 514: Algorithms for Data Science. Cameron Musco, University of Massachusetts Amherst. Fall 2019. Lecture 19.

  2. Logistics.
      • Problem Set 3 on Spectral Methods due this Friday at 8pm.
      • Can turn in without penalty until Sunday at 11:59pm.

  3-4. Summary.
      Last Class:
      • Intro to continuous optimization.
      • Multivariable calculus review.
      • Intro to gradient descent.
      This Class:
      • Analysis of gradient descent for optimizing convex functions.
      • Analysis of projected gradient descent for optimizing under constraints.

  5-6. Gradient descent motivation.
      Greedy motivation: at each step, make a small change to θ^(i−1) to give θ^(i), with minimum value of f(θ^(i)). When the step size is small, this is approximately optimized by stepping in the opposite direction of the gradient.
      Gradient descent step: θ^(i) = θ^(i−1) − η · ∇f(θ^(i−1)).
      Pseudocode (a Python sketch follows below):
      • Choose some initialization θ^(0).
      • For i = 1, ..., t: θ^(i) = θ^(i−1) − η · ∇f(θ^(i−1)).
      • Return θ^(t), as an approximate minimizer of f(θ).
      Step size η is chosen ahead of time or adapted during the algorithm.
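
A minimal Python sketch of this pseudocode (illustrative, not from the lecture; the function names and the quadratic example are assumptions):

    import numpy as np

    def gradient_descent(grad_f, theta0, eta, t):
        """Run t gradient descent steps with a fixed step size eta."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(t):
            # Step in the direction opposite the gradient.
            theta = theta - eta * grad_f(theta)
        return theta

    # Example: minimize f(theta) = ||theta - c||_2^2, whose gradient is 2(theta - c).
    c = np.array([1.0, -2.0])
    grad_f = lambda theta: 2.0 * (theta - c)
    theta_hat = gradient_descent(grad_f, theta0=np.zeros(2), eta=0.1, t=100)
    print(theta_hat)  # approaches c = [1, -2]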

  7. Gradient Descent Update: θ^(i) = θ^(i−1) − η · ∇f(θ^(i−1)).

  8-9. Convexity.
      Definition – Convex Function: A function f : ℝ^d → ℝ is convex if and only if, for any θ_1, θ_2 ∈ ℝ^d and λ ∈ [0, 1]:
      (1 − λ) · f(θ_1) + λ · f(θ_2) ≥ f((1 − λ) · θ_1 + λ · θ_2).
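
A quick check of this definition on a simple example (not on the slides): for the one-dimensional function f(θ) = θ²,
\[
(1-\lambda)\,\theta_1^2 + \lambda\,\theta_2^2 - \big((1-\lambda)\theta_1 + \lambda\theta_2\big)^2 = \lambda(1-\lambda)(\theta_1 - \theta_2)^2 \ge 0,
\]
so the defining inequality holds and f is convex.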

  10. Convexity.
      Corollary – Convex Function: A function f : ℝ^d → ℝ is convex if and only if, for any θ_1, θ_2 ∈ ℝ^d:
      f(θ_2) − f(θ_1) ≥ ∇f(θ_1)ᵀ(θ_2 − θ_1).
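
Checking the corollary on a similar example (again not on the slides): for f(θ) = ∥θ∥₂², with ∇f(θ_1) = 2θ_1,
\[
f(\theta_2) - f(\theta_1) - \nabla f(\theta_1)^\top(\theta_2 - \theta_1)
= \|\theta_2\|_2^2 - \|\theta_1\|_2^2 - 2\theta_1^\top(\theta_2 - \theta_1)
= \|\theta_1 - \theta_2\|_2^2 \ge 0.
\]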

  11-12. Other assumptions.
      We will also assume that f(·) is ‘well-behaved’ in some way.
      • Lipschitz (size of gradient is bounded): for all θ and some G, ∥∇f(θ)∥₂ ≤ G.
      • Smooth (direction/size of gradient is not changing too quickly): for all θ_1, θ_2 and some β, ∥∇f(θ_1) − ∇f(θ_2)∥₂ ≤ β · ∥θ_1 − θ_2∥₂.
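
Two concrete examples of these conditions (added for illustration, not from the slides):
\[
f(\theta) = \|\theta\|_2: \quad \|\nabla f(\theta)\|_2 = \left\|\tfrac{\theta}{\|\theta\|_2}\right\|_2 = 1 \text{ for } \theta \ne 0, \text{ so } f \text{ is } G\text{-Lipschitz with } G = 1.
\]
\[
f(\theta) = \|\theta\|_2^2: \quad \|\nabla f(\theta_1) - \nabla f(\theta_2)\|_2 = \|2\theta_1 - 2\theta_2\|_2 = 2\|\theta_1 - \theta_2\|_2, \text{ so } f \text{ is } \beta\text{-smooth with } \beta = 2,
\]
but the latter is not G-Lipschitz for any fixed G, since ∥∇f(θ)∥₂ = 2∥θ∥₂ grows without bound.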

  13. Lipschitz assumption (illustration on the slide).

  14. GD analysis – convex functions.
      Assume that:
      • f is convex.
      • f is G-Lipschitz (i.e., ∥∇f(θ)∥₂ ≤ G for all θ).
      • ∥θ_0 − θ*∥₂ ≤ R, where θ_0 is the initialization point.
      Gradient Descent (a Python sketch follows below):
      • Choose some initialization θ_0 and set η = R/(G√t).
      • For i = 1, ..., t: θ_i = θ_{i−1} − η · ∇f(θ_{i−1}).
      • Return θ̂ = arg min_{θ ∈ {θ_0, ..., θ_t}} f(θ).
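
A sketch of this specific variant in Python (illustrative, assuming the step size η = R/(G√t) and best-iterate return described above); f(θ) = ∥θ − c∥₂ serves as a concrete convex, 1-Lipschitz test function:

    import numpy as np

    def gd_convex_lipschitz(f, grad_f, theta0, G, R, t):
        """GD with eta = R / (G * sqrt(t)); returns the best iterate seen."""
        eta = R / (G * np.sqrt(t))
        theta = np.asarray(theta0, dtype=float)
        best_theta, best_val = theta.copy(), f(theta)
        for _ in range(t):
            theta = theta - eta * grad_f(theta)
            val = f(theta)
            if val < best_val:  # track theta_hat = argmin over all iterates
                best_theta, best_val = theta.copy(), val
        return best_theta

    # f(theta) = ||theta - c||_2 is convex and 1-Lipschitz (G = 1); its minimum is 0 at theta* = c.
    c = np.array([3.0, 4.0])
    f = lambda th: np.linalg.norm(th - c)
    grad_f = lambda th: (th - c) / max(np.linalg.norm(th - c), 1e-12)
    theta0 = np.zeros(2)
    R = np.linalg.norm(theta0 - c)  # distance from the initialization to theta*
    theta_hat = gd_convex_lipschitz(f, grad_f, theta0, G=1.0, R=R, t=2500)
    print(f(theta_hat))  # roughly at most R*G/sqrt(t) = 0.1 above the optimal value 0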

  15-17. GD analysis proof.
      Theorem – GD on Convex Lipschitz Functions: For a convex G-Lipschitz function f, GD run with t ≥ R²G²/ε² iterations, η = R/(G√t), and a starting point within radius R of θ*, outputs θ̂ satisfying:
      f(θ̂) ≤ f(θ*) + ε.
      Step 1: For all i, f(θ_i) − f(θ*) ≤ (∥θ_i − θ*∥₂² − ∥θ_{i+1} − θ*∥₂²)/(2η) + ηG²/2.
      (Shown first visually on the slides, then formally.)

  18-19. GD analysis proof (continued).
      Step 1.1: ∇f(θ_i)ᵀ(θ_i − θ*) ≤ (∥θ_i − θ*∥₂² − ∥θ_{i+1} − θ*∥₂²)/(2η) + ηG²/2  ⟹  Step 1.
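
Filling in the algebra behind Step 1.1 (a standard argument, stated here for completeness): expand the squared distance after one GD step θ_{i+1} = θ_i − η∇f(θ_i),
\[
\|\theta_{i+1} - \theta^*\|_2^2
= \|\theta_i - \eta \nabla f(\theta_i) - \theta^*\|_2^2
= \|\theta_i - \theta^*\|_2^2 - 2\eta\,\nabla f(\theta_i)^\top(\theta_i - \theta^*) + \eta^2 \|\nabla f(\theta_i)\|_2^2 .
\]
Bounding ∥∇f(θ_i)∥₂² ≤ G² (Lipschitz assumption) and rearranging gives Step 1.1:
\[
\nabla f(\theta_i)^\top(\theta_i - \theta^*) \le \frac{\|\theta_i - \theta^*\|_2^2 - \|\theta_{i+1} - \theta^*\|_2^2}{2\eta} + \frac{\eta G^2}{2}.
\]
Step 1 then follows from the convexity corollary (slide 10) with θ_2 = θ*: f(θ_i) − f(θ*) ≤ ∇f(θ_i)ᵀ(θ_i − θ*).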

  20-22. GD analysis proof (continued).
      Step 2: (1/t) · Σ_{i=1}^t [f(θ_i) − f(θ*)] ≤ R²/(2η·t) + ηG²/2.
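
To fill in how Step 2 and the theorem follow (not spelled out on these slides): summing Step 1 over the t gradient steps, the squared-distance terms telescope, and the initialization satisfies ∥θ_0 − θ*∥₂ ≤ R:
\[
\sum_i \big(f(\theta_i) - f(\theta^*)\big) \le \frac{\|\theta_0 - \theta^*\|_2^2 - \|\theta_t - \theta^*\|_2^2}{2\eta} + \frac{t\,\eta G^2}{2} \le \frac{R^2}{2\eta} + \frac{t\,\eta G^2}{2}.
\]
Dividing by t gives Step 2. Since θ̂ is the iterate with the smallest function value, f(θ̂) − f(θ*) is at most this average, and plugging in η = R/(G√t):
\[
f(\hat{\theta}) - f(\theta^*) \le \frac{R^2}{2\eta t} + \frac{\eta G^2}{2} = \frac{RG}{\sqrt{t}} \le \epsilon \quad \text{once } t \ge \frac{R^2 G^2}{\epsilon^2}.
\]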

  23-25. Constrained convex optimization.
      Often we want to perform convex optimization with convex constraints: θ* = arg min_{θ ∈ S} f(θ), where S is a convex set.
      Definition – Convex Set: A set S ⊆ ℝ^d is convex if and only if, for any θ_1, θ_2 ∈ S and λ ∈ [0, 1]: (1 − λ) · θ_1 + λ · θ_2 ∈ S.
      E.g., S = {θ ∈ ℝ^d : ∥θ∥₂ ≤ 1}.
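
A quick check (added for completeness) that this example set is convex: for ∥θ_1∥₂ ≤ 1, ∥θ_2∥₂ ≤ 1, and λ ∈ [0, 1], the triangle inequality gives
\[
\|(1-\lambda)\theta_1 + \lambda\theta_2\|_2 \le (1-\lambda)\|\theta_1\|_2 + \lambda\|\theta_2\|_2 \le (1-\lambda) + \lambda = 1,
\]
so the convex combination also lies in S.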

  26. Projected gradient descent.
      For any convex set S, let P_S(·) denote the projection function onto S: P_S(y) = arg min_{θ ∈ S} ∥θ − y∥₂.
      • For S = {θ ∈ ℝ^d : ∥θ∥₂ ≤ 1}, what is P_S(y)?
      • For S being a k-dimensional subspace of ℝ^d, what is P_S(y)?
      Projected Gradient Descent (a Python sketch follows below):
      • Choose some initialization θ_0 and set η = R/(G√t).
      • For i = 1, ..., t:
        • θ_out^(i) = θ_{i−1} − η · ∇f(θ_{i−1})
        • θ_i = P_S(θ_out^(i))
      • Return θ̂ = arg min_{θ_i} f(θ_i).
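
A minimal Python sketch of projected gradient descent (illustrative, not the course's reference code). For the unit ℓ₂ ball, the projection simply rescales any point outside the ball onto its surface, P_S(y) = y / max(1, ∥y∥₂); for a k-dimensional subspace with orthonormal basis matrix V, P_S(y) = V Vᵀ y (standard facts, stated here for completeness):

    import numpy as np

    def project_unit_ball(y):
        # Projection onto S = {theta : ||theta||_2 <= 1}: rescale if outside the ball.
        norm = np.linalg.norm(y)
        return y if norm <= 1.0 else y / norm

    def projected_gradient_descent(f, grad_f, project, theta0, eta, t):
        """Gradient step, then project back onto the constraint set; return best iterate."""
        theta = np.asarray(theta0, dtype=float)
        best_theta, best_val = theta.copy(), f(theta)
        for _ in range(t):
            theta_out = theta - eta * grad_f(theta)   # unconstrained gradient step
            theta = project(theta_out)                # project back onto S
            val = f(theta)
            if val < best_val:
                best_theta, best_val = theta.copy(), val
        return best_theta

    # Example: minimize f(theta) = ||theta - c||_2^2 over the unit ball, with c outside the ball.
    c = np.array([2.0, 2.0])
    f = lambda th: np.sum((th - c) ** 2)
    grad_f = lambda th: 2.0 * (th - c)
    theta_hat = projected_gradient_descent(f, grad_f, project_unit_ball,
                                           theta0=np.zeros(2), eta=0.05, t=500)
    print(theta_hat)  # approaches c/||c||_2, the closest point to c in the unit ball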
