Conjugate Gradient (CG), Majid Lesani and Alireza Masoum


  1. Conjugate Gradient (CG). Majid Lesani, Alireza Masoum

  2. Overview
     • Backpropagation
     • Gradient Descent
     • Quadratic Forms
     • Gradient Descent in Quadratic Forms
     • Eigenvectors and Eigenvalues
     • Gradient Descent Convergence
     • Conjugate Gradient

  3. Backpropagation
     • Abstraction
     • Generalization problem
         • Heuristic features
         • Small networks
         • Early stopping
         • Regularization
     • Search
     • Convergence problem

  4. Gradient Descent
     • Also called Steepest Descent
     • [Figure: the partial derivatives $\partial f(x, y) / \partial x$ and $\partial f(x, y) / \partial y$]

  5. Faster Training
     • Gradient descent modifications
         • Gradient descent BP with momentum
         • Variable learning rate BP
     • Numerical optimization techniques
         • Conjugate Gradient BP
         • Quasi-Newton BP

  6. Gradient Descent
     • The problem is choosing the step size

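  7. Gradient Descent: Choosing the Best Step Size
     • Choose $\alpha_i$ so that $f(x_{i+1})$ is minimal, where $x_{i+1} = x_i + \alpha_i r_i$:
       $$\frac{\partial f(x_{i+1})}{\partial \alpha_i} = 0$$
     • By the chain rule:
       $$\frac{\partial f(x_i + \alpha_i r_i)}{\partial \alpha_i} = \nabla f(x_{i+1})^T r_i = 0 \;\Rightarrow\; r_{i+1}^T r_i = 0$$
       since $r_{i+1} = -\nabla f(x_{i+1})$.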

  8. Gradient Descent Choosing Best Step Size

  9. Quadratic Forms
     • Our goal is to minimize the quadratic function:
       $$f(x) = \frac{1}{2} x^T A x - b^T x + c$$

  10. Positive definite: $v^T A v > 0$ for every nonzero vector $v$.

  11. Quadratic Forms
      • If $A$ is a symmetric positive-definite matrix, $f$ has a global minimum where the gradient is zero:
        $$f(x) = \frac{1}{2} x^T A x - b^T x + c, \qquad \nabla f(x) = Ax - b = 0$$
      • Solving the equation $Ax = b$ is therefore equivalent to minimizing $f$.
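
The equivalence between solving $Ax = b$ and minimizing $f$ can be checked numerically. The following is a minimal sketch in plain Python; the 2x2 matrix, the vectors, and the helper names (`matvec`, `dot`, `f`, `grad_f`) are illustrative assumptions, not taken from the slides.

```python
def matvec(A, x):
    """Dense matrix-vector product A @ x using plain lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def f(A, b, c, x):
    """Quadratic form f(x) = 1/2 x^T A x - b^T x + c."""
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x) + c

def grad_f(A, b, x):
    """Gradient of f for symmetric A: A x - b."""
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]

# Hypothetical symmetric positive-definite example.
A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
c = 0.0
x_star = [2.0, -2.0]  # satisfies A x = b

gradient_at_solution = grad_f(A, b, x_star)  # zero vector
f_at_solution = f(A, b, c, x_star)
f_nearby = f(A, b, c, [2.1, -2.0])           # larger than f_at_solution
```

Because $A$ is positive definite, moving away from `x_star` in any direction increases `f`, which is what the last two values illustrate.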

  12. Gradient Descent for Quadratic Forms

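  13. Steepest descent for the quadratic form is:
      $$r_i = b - A x_i, \qquad \alpha_i = \frac{r_i^T r_i}{r_i^T A r_i}, \qquad x_{i+1} = x_i + \alpha_i r_i$$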
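
A minimal sketch of steepest descent on a quadratic form, assuming the standard update with residual $r_i = b - A x_i$ and exact step size $\alpha_i = r_i^T r_i / (r_i^T A r_i)$. The function name, tolerance, and test matrix are illustrative assumptions.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def steepest_descent(A, b, x, tol=1e-10, max_iter=10_000):
    """Minimize 1/2 x^T A x - b^T x for symmetric positive-definite A.

    Returns the final iterate and the number of steps taken.
    """
    for k in range(max_iter):
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]  # residual = -gradient
        rr = dot(r, r)
        if rr ** 0.5 < tol:
            return x, k
        alpha = rr / dot(r, matvec(A, r))                 # exact line search
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x, max_iter

# Hypothetical example: the solution of A x = b is (2, -2).
A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
x, steps = steepest_descent(A, b, [-2.0, -2.0])
```

Recomputing the residual from scratch each iteration costs one extra matrix-vector product but avoids accumulated rounding error, which keeps the sketch simple.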

  14. Eigenvectors and Eigenvalues
      • An eigenvector of a matrix $A$ is a nonzero vector that does not rotate when $A$ is applied to it; it is only scaled by a constant (its eigenvalue).
      • Every symmetric $n \times n$ matrix has $n$ orthogonal eigenvectors, each with its associated eigenvalue.

  15. Using Eigenvectors
      • Think of a vector as a sum of other vectors (the eigenvectors) whose behavior is understood.

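  16. Using Eigenvectors
      • A symmetric matrix is positive definite exactly when all of its eigenvalues are positive.
      • The eigenvectors are the axes of the rotated elliptical contours of $f$, and the radius along each axis is determined by the corresponding eigenvalue (proportional to $1/\sqrt{\lambda}$).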
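
As a small illustration of these eigenvector properties, the sketch below computes the eigenpairs of a symmetric 2x2 matrix in closed form. The helper `sym2x2_eigen` and the example matrix are assumptions made for illustration, and the formula assumes a nonzero off-diagonal entry.

```python
import math

def sym2x2_eigen(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (A - lam * I) v = 0 is satisfied by v = (b, lam - a) when b != 0.
        pairs.append((lam, [b, lam - a]))
    return pairs

# Hypothetical example matrix [[3, 2], [2, 6]].
pairs = sym2x2_eigen(3.0, 2.0, 6.0)
(lam1, v1), (lam2, v2) = pairs

# Both eigenvalues are positive, so the matrix is positive definite,
# and the two eigenvectors are orthogonal because the matrix is symmetric.
is_positive_definite = lam1 > 0 and lam2 > 0
eigvec_dot = v1[0] * v2[0] + v1[1] * v2[1]
```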

  17. General Convergence of Steepest Descent
      Convergence depends on:
      • The relation between the eigenvalues of $A$
      • The eigenvector components of the error

  18. Fast Convergence
      • When all eigenvalues are equal, convergence is fast.

  19. Poor Convergence
      • Convergence is poor when the eigenvalues differ widely and the error has components along the eigenvectors of the smaller eigenvalues.
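
The contrast between fast and poor convergence can be reproduced with a small experiment. This is a sketch under assumed inputs: the two diagonal matrices and starting points are hypothetical, chosen so that one case has equal eigenvalues and the other has eigenvalues 1 and 100 with most of the error along the eigenvector of the smaller eigenvalue.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def steepest_descent_steps(A, b, x, tol=1e-8, max_iter=10_000):
    """Count steepest-descent steps until the residual norm drops below tol."""
    for k in range(max_iter):
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]
        rr = dot(r, r)
        if rr ** 0.5 < tol:
            return k
        alpha = rr / dot(r, matvec(A, r))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return max_iter

# Equal eigenvalues (A = 2I): circular contours, the gradient points at the minimum.
steps_equal = steepest_descent_steps([[2.0, 0.0], [0.0, 2.0]], [2.0, -8.0], [-2.0, -2.0])

# Eigenvalues 1 and 100: elongated contours and a zigzag path.
steps_ill = steepest_descent_steps([[1.0, 0.0], [0.0, 100.0]], [0.0, 0.0], [100.0, 1.0])
```

With equal eigenvalues a single step reaches the minimum, while the ill-conditioned case needs many more steps to reach the same tolerance.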

  20. Conjugate Gradient Overview
      • Orthogonal Directions
      • Conjugate Vectors
      • Conjugate Directions
      • Gram-Schmidt Algorithm
      • Gradient and Error Optimality
      • Conjugate Gradient

  21. Orthogonal Directions
      • Steepest descent often steps in the same direction many times.
      • If we have $n$ orthogonal search directions and take the best step along each one, then after $n$ steps we are at the goal!

  22. Orthogonal Directions
      • After every step, the error must be orthogonal to the direction just searched.

  23. Conjugate vectors

  24. Conjugate vectors
      • Two vectors $d_i$ and $d_j$ are A-orthogonal (or conjugate) if $d_i^T A d_j = 0$.
      • Vectors that are conjugate in the scaled (elliptical) space are orthogonal in the unscaled space, where the contours are circular.
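
A short check of the usual definition of conjugacy, $u^T A v = 0$, may help here. The matrix, the two vectors, and the helper `is_conjugate` are hypothetical choices for illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def is_conjugate(A, u, v, tol=1e-12):
    """True if u and v are A-orthogonal, i.e. u^T A v is (numerically) zero."""
    return abs(dot(u, matvec(A, v))) < tol

A = [[3.0, 2.0], [2.0, 6.0]]
u = [1.0, 0.0]
v = [2.0, -3.0]

conjugate = is_conjugate(A, u, v)  # A-orthogonal
plain_dot = dot(u, v)              # but not orthogonal in the usual sense
```

The pair is conjugate with respect to `A` even though its ordinary dot product is nonzero, so conjugacy and orthogonality are different conditions unless `A` is a multiple of the identity.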

  25. Conjugate Directions
      • If we have $n$ conjugate search directions and, as with orthogonal directions, take the best step along each one, then after $n$ steps we are at the goal!

  26. Conjugate Directions

  27. Orthogonal Directions

  28. Conjugate Directions
      • After every step, the error must be A-orthogonal to the direction just searched.

  29. Conjugate Directions
      $$e_i = x_i - x$$
      $$A e_i = A x_i - A x = A x_i - b = -r_i$$
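
The relation $A e_i = -r_i$ is what makes the method computable without knowing the solution: requiring the next error to be A-orthogonal to the current direction $d_i$ gives the step $\alpha_i = d_i^T r_i / (d_i^T A d_i)$. The sketch below assumes that standard step formula and a hand-picked pair of conjugate directions; all names and values are illustrative.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_directions(A, b, x, directions):
    """Take one exact step along each of the given A-orthogonal directions."""
    for d in directions:
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]  # r = b - A x = -A e
        alpha = dot(d, r) / dot(d, matvec(A, d))
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x

A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
# Hypothetical directions, conjugate with respect to A: d0^T A d1 = 0.
directions = [[1.0, 0.0], [2.0, -3.0]]

x = conjugate_directions(A, b, [0.0, 0.0], directions)  # reaches (2, -2) in n = 2 steps
```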

  30. Gram-Schmidt algorithm
      • It only remains to find $n$ conjugate directions.
      • The Gram-Schmidt algorithm does this: given $n$ linearly independent vectors, it produces $n$ conjugate directions.
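
A minimal sketch of Gram-Schmidt conjugation, assuming the usual form in which each new vector has its A-projections onto the earlier directions subtracted. The function name, the 3x3 matrix, and the choice of the coordinate axes as input vectors are assumptions for illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gram_schmidt(A, U):
    """Turn linearly independent vectors U into mutually A-orthogonal directions."""
    D = []
    for u in U:
        d = list(u)
        for dk in D:
            Adk = matvec(A, dk)
            beta = -dot(u, Adk) / dot(dk, Adk)  # removes the component conjugate to dk
            d = [di + beta * dki for di, dki in zip(d, dk)]
        D.append(d)
    return D

# Hypothetical symmetric positive-definite matrix and the coordinate axes as input.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
U = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
D = conjugate_gram_schmidt(A, U)
```

Every new direction needs all previous directions, so all of them must be stored and the work per direction grows with the number already built.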

  31. Gram-Schmidt algorithm

  32. Gram-Schmidt algorithm

  33. Conjugate Directions
      • So the algorithm is complete
      • But it is expensive!
      • We already had the Gaussian elimination algorithm for this

  34. Conjugate Directions with axial unit vectors

  35. Gradient and error optimality
      • For every [equation not recovered]
      • We have [equation not recovered]
      • It means [equation not recovered]

  36. Conjugate Gradient
      • Use the residuals $r_i$ as the independent vectors for Gram-Schmidt conjugation
      • This makes the equations very simple
      • Complexity per iteration drops from $O(n^2)$ to $O(m)$, where $m$ is the number of nonzero entries of $A$
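
A minimal sketch of the conjugate gradient iteration, using the standard textbook recurrences (the slides do not spell them out, so treat the exact form as an assumption). To reflect the $O(m)$ cost per iteration, the matrix is stored as a hypothetical sparse row format, a list of `{column: value}` dictionaries, so each matrix-vector product touches only the nonzero entries.

```python
def sparse_matvec(rows, x):
    """Multiply a matrix stored as a list of {column: value} dicts by x: O(m) work."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(rows, b, x, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A in at most n steps."""
    r = [bi - ax for bi, ax in zip(b, sparse_matvec(rows, x))]
    d = list(r)                      # first direction is the residual
    rr = dot(r, r)
    for _ in range(len(b)):
        if rr ** 0.5 < tol:
            break
        Ad = sparse_matvec(rows, d)  # the only matrix-vector product per step
        alpha = rr / dot(d, Ad)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        beta = rr_new / rr           # only the previous direction is needed
        d = [ri + beta * di for ri, di in zip(r, d)]
        rr = rr_new
    return x

# Hypothetical sparse symmetric positive-definite system.
rows = [{0: 4.0, 1: 1.0}, {0: 1.0, 1: 3.0, 2: 1.0}, {1: 1.0, 2: 2.0}]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(rows, b, [0.0, 0.0, 0.0])
residual = [bi - ax for bi, ax in zip(b, sparse_matvec(rows, x))]
```

Unlike general Gram-Schmidt conjugation, each new direction here depends only on the current residual and the previous direction, so no history of directions has to be kept.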

  37. Line Search
      • Find the best step size:
        $$\alpha_i \in \arg\min_{\alpha \ge 0} f(x_i + \alpha \, d_i)$$
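
When $f$ is not quadratic, the minimizing step size generally has no closed form and is found numerically. The sketch below uses golden-section search, which is one possible choice rather than the method used in the slides. It assumes the one-dimensional function $\phi(\alpha) = f(x_i + \alpha d_i)$ is unimodal on a bracket $[0, \alpha_{max}]$; the bracket and tolerance are illustrative assumptions.

```python
import math

def golden_section_line_search(phi, alpha_max=1.0, tol=1e-7):
    """Approximate argmin of a unimodal phi on [0, alpha_max]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, alpha_max
    left = hi - ratio * (hi - lo)
    right = lo + ratio * (hi - lo)
    f_left, f_right = phi(left), phi(right)
    while hi - lo > tol:
        if f_left < f_right:
            # Minimum lies in [lo, right]; reuse the old left point.
            hi, right, f_right = right, left, f_left
            left = hi - ratio * (hi - lo)
            f_left = phi(left)
        else:
            # Minimum lies in [left, hi]; reuse the old right point.
            lo, left, f_left = left, right, f_right
            right = lo + ratio * (hi - lo)
            f_right = phi(right)
    return 0.5 * (lo + hi)

# Hypothetical test on a quadratic, where the exact step is known.
A = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]
x = [0.0, 0.0]
d = [2.0, -8.0]  # residual at x, used as the search direction

def phi(alpha):
    p = [xi + alpha * di for xi, di in zip(x, d)]
    Ap = [sum(a * pj for a, pj in zip(row, p)) for row in A]
    return 0.5 * sum(pi * api for pi, api in zip(p, Ap)) - sum(bi * pi for bi, pi in zip(b, p))

alpha = golden_section_line_search(phi)
alpha_exact = 68.0 / 332.0  # d^T d / (d^T A d) for this example
```

For a quadratic the closed-form step is cheaper and exact; the search is shown only to illustrate the general $\arg\min$ formulation.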

  38. End
      • Thanks for your patience!
