2. Elements of convex optimization


Introduction to Machine Learning, CentraleSupélec, Paris, Fall 2017. 2. Elements of convex optimization. Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech. chloe-agathe.azencott@mines-paristech.fr. Why talk about convex optimization?


  1. Gradient descent ● Start from a random point u. ● How do I get closer to the solution? ● Follow the opposite of the gradient: the gradient indicates the direction of steepest increase. [Figure: one step from u to u⁺ in the direction −∇f(u), with f(u⁺) < f(u).]

  2–4. Gradient descent algorithm ● Choose an initial point ● Repeat for k=1, 2, 3, … (step size) ● Stop at some point (stopping criterion) – Usually: stop when the gradient norm is small enough, ‖∇f(uₖ)‖ ≤ ε.
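
A minimal Python sketch of this loop, assuming a fixed step size alpha and a gradient-norm stopping criterion; the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def gradient_descent(grad_f, u0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Minimize f by following -grad_f with a fixed step size alpha.

    grad_f: callable returning the gradient of f at a point.
    u0: initial point.
    Stops when the gradient norm falls below tol.
    """
    u = np.asarray(u0, dtype=float)
    for k in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) <= tol:   # stopping criterion
            break
        u = u - alpha * g              # u_{k+1} = u_k - alpha * grad f(u_k)
    return u

# Example: minimize f(u) = ||u - 1||^2, whose gradient is 2 (u - 1).
u_star = gradient_descent(lambda u: 2 * (u - np.ones(3)), np.zeros(3))
```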

  5–10. Gradient descent algorithm ● Choose an initial point ● Repeat for k=1, 2, 3, … – How to set the step size? – If the step size is too big, the search might diverge. – If the step size is too small, the search might take a very long time. – Backtracking line search makes it possible to choose the step size adaptively.

  11–15. BLS: shrinking needed. The step size is too big and we are overshooting our goal: f(u − α∇f(u)) > f(u) − (α/2) ∇f(u)ᵀ∇f(u). [Figure: f evaluated at u, at u − α∇f(u), and at the shrunk step u − (α/2)∇f(u), compared with the threshold f(u) − (α/2) ∇f(u)ᵀ∇f(u).]

  16. BLS: no shrinking needed. The step size is small enough: f(u − α∇f(u)) ≤ f(u) − (α/2) ∇f(u)ᵀ∇f(u). [Figure: f evaluated at u and at u − α∇f(u), below the threshold f(u) − (α/2) ∇f(u)ᵀ∇f(u).]

  17. Backtracking line search ● Shrinking parameter, initial step size ● Choose an initial point ● Repeat for k=1, 2, 3, … – If shrink the step size: – Else: – Update: ● Stop when
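
A sketch of one backtracking step in Python, assuming a shrinking parameter beta in (0, 1), an initial step size alpha0, and the sufficient-decrease test from the previous slides; all names are illustrative:

```python
import numpy as np

def backtracking_step(f, grad_f, u, alpha0=1.0, beta=0.5):
    """One gradient step with a step size chosen by backtracking line search.

    Shrink alpha by the factor beta until
    f(u - alpha * g) <= f(u) - (alpha / 2) * ||g||^2.
    """
    g = grad_f(u)
    alpha = alpha0
    while f(u - alpha * g) > f(u) - (alpha / 2) * g.dot(g):
        alpha *= beta            # step too big: shrink
    return u - alpha * g         # accept the step

# Inside gradient descent, this replaces the fixed-step update
# u_{k+1} = u_k - alpha * grad f(u_k).
```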

  18–20. Newton's method ● Suppose f is twice differentiable ● Second-order Taylor expansion of f around u ● Minimize in v instead of in u
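
A sketch of the standard second-order argument, writing ∇²f(u) for the Hessian:

```latex
\[
  f(v) \;\approx\; f(u) + \nabla f(u)^\top (v - u)
        + \tfrac{1}{2}\,(v - u)^\top \nabla^2 f(u)\,(v - u)
\]
Setting the gradient of the right-hand side with respect to $v$ to zero gives
\[
  \nabla f(u) + \nabla^2 f(u)\,(v - u) = 0
  \quad\Longrightarrow\quad
  v = u - \bigl[\nabla^2 f(u)\bigr]^{-1} \nabla f(u),
\]
i.e.\ the Newton update $u_{k+1} = u_k - \bigl[\nabla^2 f(u_k)\bigr]^{-1}\,\nabla f(u_k)$.
```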

  21–26. Newton CG (conjugate gradient) ● Computing the inverse of the Hessian is computationally intensive. ● Instead, compute the gradient and the Hessian at the current point, and solve the corresponding linear system for the Newton step (new update rule). ● This is a problem of the form Av = b, with A symmetric positive semi-definite (second-order characterization of convex functions). ● Solve using the conjugate gradient method.

  27. Conjugate gradient method Solve Av = b ● Idea: build a set of A-conjugate vectors (basis of ℝⁿ) – Initialisation: – At step t: ● Update rule: ● residual ● ensures – Convergence: hence
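
A compact sketch of the conjugate gradient iteration described on this slide, assuming A is symmetric positive definite; variable names (gamma for the step length, p for the search directions) are illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A v = b by conjugate gradients.

    Builds A-conjugate search directions p_t; in exact arithmetic the
    residual r_t = b - A v_t reaches zero in at most n steps.
    """
    n = b.shape[0]
    v = np.zeros(n)            # initialisation v_0 = 0
    r = b - A @ v              # residual r_0 = b - A v_0
    p = r.copy()               # first search direction
    rs_old = r @ r
    for _ in range(n):
        Ap = A @ p
        gamma = rs_old / (p @ Ap)       # step length along p
        v = v + gamma * p               # update rule
        r = r - gamma * Ap              # new residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:       # converged
            break
        p = r + (rs_new / rs_old) * p   # next A-conjugate direction
        rs_old = rs_new
    return v
```

Within Newton CG, A is the Hessian at the current iterate and b is minus the gradient, so the returned v is the Newton step.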

  28. Conjugate gradient method Prove Given – Initialisation: – At step t: ● Update rule: ● residual ● and assuming

  29. Prove Given – Initialisation: – Update rule: – residual – and assuming

  30. Conjugate gradient method Prove and conclude the proof Given – Initialisation: – At step t: ● Update rule: ● residual ●

  31. Prove Given – Initialisation: – Update rule: – residual –

  32. Quasi-Newton methods ● What if the Hessian is unavailable / expensive to compute at each iteration? ● Approximate the inverse Hessian: update iteratively ● Conditions: – 1st-order Taylor expansion applied to ∇f – Secant equation: ⇒ ● Initialization: Identity

  33. Quasi-Newton methods ● What if the Hessian is unavailable / expensive to compute at each iteration? ● Approximate the inverse Hessian: update iteratively ● Conditions: ● BFGS: Broyden-Fletcher-Goldfarb-Shanno – Secant equation: the mean value G of the Hessian between u and v verifies – ⇒ ● L-BFGS: limited-memory variant. Do not store the full matrix Wₖ.
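
A sketch of the BFGS update of the inverse-Hessian approximation, assuming the curvature condition yᵀs > 0 holds; W plays the role of the matrix Wₖ mentioned on the slide, the other names are illustrative:

```python
import numpy as np

def bfgs_update(W, s, y):
    """BFGS update of the inverse-Hessian approximation W.

    s = u_{k+1} - u_k,  y = grad f(u_{k+1}) - grad f(u_k).
    The returned matrix satisfies the secant equation W_new @ y = s.
    """
    rho = 1.0 / (y @ s)                      # assumes y @ s > 0
    I = np.eye(W.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ W @ V.T + rho * np.outer(s, s)

# Initialization: W_0 = identity; the quasi-Newton step is then -W @ grad_f(u).
```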

  34. Stochastic gradient descent ● For f of the form f(u) = Σᵢ₌₁ᵐ fᵢ(u) ● Gradient descent: ● Stochastic gradient descent: – Cyclic: cycle over 1, 2, …, m, 1, 2, …, m, … – Randomized: choose iₖ uniformly at random in {1, 2, …, m}.
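
A minimal sketch, assuming f(u) = Σᵢ fᵢ(u) with i = 1, …, m and a fixed step size; grad_fi and the other names are illustrative:

```python
import numpy as np

def sgd(grad_fi, m, u0, alpha=0.01, n_epochs=10, randomized=True, seed=0):
    """Stochastic gradient descent for f(u) = sum_{i=1}^m f_i(u).

    grad_fi(u, i): gradient of the i-th term f_i at u.
    randomized=True draws i_k uniformly at random; otherwise cycle over 0..m-1.
    """
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float)
    for _ in range(n_epochs):
        order = rng.integers(0, m, size=m) if randomized else range(m)
        for i in order:
            u = u - alpha * grad_fi(u, i)   # step using a single term f_i
    return u
```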

  35–37. Coordinate Descent ● For f(u) = g(u) + Σᵢ hᵢ(uᵢ) – g: convex and differentiable – hᵢ: convex ⇒ the non-smooth part of f is separable. ● Minimize coordinate by coordinate: – Initialisation: – For k=1, 2, …: ● Variants: – re-order the coordinates randomly – proceed by blocks of coordinates (2 or more at a time)
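
A generic sketch of the coordinate-by-coordinate sweep, using a one-dimensional numerical minimizer for each coordinate; in practice the coordinate-wise update often has a closed form (e.g. soft-thresholding for the Lasso). All names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, u0, n_sweeps=50):
    """Minimize f(u) = g(u) + sum_i h_i(u_i) coordinate by coordinate."""
    u = np.asarray(u0, dtype=float)
    for _ in range(n_sweeps):                # for k = 1, 2, ...
        for i in range(u.shape[0]):          # one pass over the coordinates
            def f_i(t, i=i):                 # f as a function of u_i only
                v = u.copy()
                v[i] = t
                return f(v)
            u[i] = minimize_scalar(f_i).x    # exact 1-D minimization
    return u
```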

  38. Summary: Unconstrained convex optimization ● If f is differentiable – Set its gradient to zero – If hard to solve: gradient descent. Setting the learning rate: Backtracking Line Search (adapt heuristically to avoid “overshooting”) ● Newton's method: suppose f twice differentiable – If the Hessian is hard to invert, compute the Newton step by solving the corresponding linear system with the conjugate gradient method – If the Hessian is hard to compute, approximate the inverse Hessian with a quasi-Newton method such as BFGS (L-BFGS: less memory) – If f is separable: stochastic gradient descent – If the non-smooth part of f is separable: coordinate descent.

  39. Constrained convex optimization

  40. Constrained convex optimization ● Convex optimization program/problem: – f is convex – the inequality constraints are convex – the equality constraints are affine – The feasible set is convex
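
In standard form, writing the inequality constraints as gᵢ and the equality constraints as hⱼ (notation assumed here; the slide leaves the symbols implicit):

```latex
\begin{aligned}
\min_{u} \quad & f(u) && f \text{ convex} \\
\text{s.t.} \quad & g_i(u) \le 0, \quad i = 1, \dots, m && g_i \text{ convex} \\
                  & h_j(u) = 0,   \quad j = 1, \dots, p && h_j \text{ affine}
\end{aligned}
```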

  41. Lagrangian ● Lagrangian: L(u, α, β) ● α, β = Lagrange multipliers = dual variables

  42. Lagrange dual function ● Lagrangian: L(u, α, β) ● Lagrange dual function: Q(α, β) = infimum over u of L(u, α, β) – Infimum = the greatest value x such that x ≤ L(u, α, β) for all u ● Q is concave (independently of the convexity of f)
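
With the constraint notation assumed above (gᵢ for the inequalities, hⱼ for the equalities), the usual definitions read:

```latex
\begin{aligned}
L(u, \alpha, \beta) &= f(u) + \sum_{i=1}^{m} \alpha_i\, g_i(u)
                            + \sum_{j=1}^{p} \beta_j\, h_j(u),
   \qquad \alpha_i \ge 0, \\
Q(\alpha, \beta) &= \inf_{u} \; L(u, \alpha, \beta).
\end{aligned}
```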

  43. Lagrange dual function ● Q is concave (independently of the convexity of f): it is a pointwise infimum of functions that are affine in (α, β).

  44. Lagrange dual function ● The dual function gives a lower bound on our solution: let u be in the feasible set; then Q(α, β) ≤ f(u) for any α ≥ 0 and any β.
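
The bound follows in one line (same assumed notation):

```latex
\begin{aligned}
& \text{For any feasible } u \ \bigl(g_i(u) \le 0,\ h_j(u) = 0\bigr)
  \text{ and any } \alpha \ge 0: \\
& L(u, \alpha, \beta)
   = f(u) + \sum_{i} \alpha_i\, g_i(u) + \sum_{j} \beta_j\, h_j(u)
   \;\le\; f(u), \\
& \text{hence } Q(\alpha, \beta) = \inf_{u'} L(u', \alpha, \beta)
   \;\le\; L(u, \alpha, \beta) \;\le\; f(u)
   \quad \Longrightarrow \quad Q(\alpha, \beta) \le p^{*}.
\end{aligned}
```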

  45. Weak duality ● Q(α, β) ≤ p* for any α ≥ 0, β ● What is the best lower bound on p* we can get?
