Conjugate Gradient (CG)
Majid Lesani, Alireza Masoum
Overview
• Backpropagation
• Gradient Descent
• Quadratic Forms
• Gradient Descent in Quadratic Forms
• Eigenvectors and Eigenvalues
• Gradient Descent Convergence
• Conjugate Gradient
Backpropagation
• Abstraction
• Generalization problem
  • Heuristic features
  • Small networks
  • Early stopping
  • Regularization
• Search
• Convergence problem
Gradient Descent
• Also called Steepest Descent
• Moves along the negative gradient ∇f(x, y) = ( ∂f(x, y)/∂x , ∂f(x, y)/∂y )
Faster Training
• Gradient descent modifications
  • Gradient descent BP with momentum
  • Variable learning rate BP
• Numerical optimization techniques
  • Conjugate gradient BP
  • Quasi-Newton BP
Gradient Descent
• The problem is choosing the step size
Gradient Descent: Choosing the Best Step Size α
• Choose α_i where f(x_{i+1}) is minimized: ∂f(x_{i+1})/∂α_i = 0
• By the chain rule: ∂f(x_i + α_i r_i)/∂α_i = ∇f(x_{i+1})^T r_i = 0
• ⇒ r_{i+1}^T r_i = 0
Gradient Descent: Choosing the Best Step Size
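As a concrete illustration of the step-size choice, the sketch below minimizes f along the current search direction with a standard 1-D minimizer; the function, direction, and bounds are hypothetical examples, not taken from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_step_size(f, x, d):
    """Return a step size alpha >= 0 that minimizes f(x + alpha * d)."""
    res = minimize_scalar(lambda a: f(x + a * d), bounds=(0.0, 10.0), method="bounded")
    return res.x

# Hypothetical example: a simple quadratic bowl and the steepest-descent direction.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([1.0, 1.0])
d = -grad(x)                       # steepest-descent direction r = -grad f(x)
alpha = best_step_size(f, x, d)
print(alpha, f(x + alpha * d))     # f decreases along d at the chosen alpha
```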
Quadratic Forms
• Our goal is to minimize the quadratic function: f(x) = (1/2) x^T A x − b^T x + c
• Positive definite: v^T A v > 0 for every nonzero vector v
Quadratic Forms
• If A is a symmetric positive-definite matrix, f has a global minimum where the gradient is zero:
  f(x) = (1/2) x^T A x − b^T x + c
  ∇f(x) = Ax − b = 0
• Solving Ax = b is therefore equivalent to minimizing f
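A minimal numerical check of this equivalence (the matrix and vector values below are made up for illustration): the gradient of f is Ax − b, and it vanishes at the solution of Ax = b.

```python
import numpy as np

# Hypothetical symmetric positive-definite A and right-hand side b.
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])
c = 0.0

f = lambda x: 0.5 * x @ A @ x - b @ x + c
grad = lambda x: A @ x - b          # gradient of the quadratic form (A symmetric)

x_star = np.linalg.solve(A, b)      # solution of Ax = b
print(grad(x_star))                 # ~ [0, 0]: the gradient vanishes at x*
print(f(x_star) <= f(x_star + np.array([0.1, -0.2])))  # True: x* is the minimizer
```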
Gradient Descent for Quadratic Forms
• Steepest descent for the quadratic form: r_i = b − A x_i = −∇f(x_i), then x_{i+1} = x_i + α_i r_i with α_i = r_i^T r_i / (r_i^T A r_i)
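A sketch of the resulting iteration, assuming the closed-form step size above; the test matrix, tolerance, and iteration cap are arbitrary illustrative choices.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                      # residual = negative gradient
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ A @ r)      # exact line search: alpha_i = r^T r / (r^T A r)
        x = x + alpha * r                  # step along the steepest-descent direction
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x = steepest_descent(A, b, np.zeros(2))
print(x, np.linalg.solve(A, b))            # the two should agree closely
```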
Eigenvectors and Eigenvalues
• An eigenvector of a matrix A is a nonzero vector that does not rotate when A is applied to it; it is only scaled by a constant (its eigenvalue)
• Every symmetric n×n matrix has n orthogonal eigenvectors, each with its corresponding eigenvalue
Using Eigenvectors
• Think of a vector as a sum of other vectors whose behavior is understood
Using Eigenvectors
• A positive-definite matrix is one whose eigenvalues are all positive
• The eigenvectors are the axes of the rotated elliptical contours of f, and each radius relates to the corresponding eigenvalue
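A short check of these statements with a hypothetical symmetric matrix: numpy's eigh returns real eigenvalues and orthonormal eigenvectors, and positive definiteness shows up as all eigenvalues being positive.

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])                 # hypothetical symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)       # eigh is for symmetric (Hermitian) matrices
print(eigvals)                             # all positive => A is positive definite

v = eigvecs[:, 0]                          # first eigenvector (columns of eigvecs)
print(np.allclose(A @ v, eigvals[0] * v))  # True: A only scales v, it does not rotate it
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))  # True: the eigenvectors are orthonormal
```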
General Convergence of Steepest Descent
• Depends on the relation (ratio) between the eigenvalues of A
• Depends on the eigenvector components of the error
Fast Convergence
• When the eigenvalues are all the same, convergence is fast: the contours are spherical and the negative gradient points straight at the minimum
Poor Convergence
• Convergence is poor when the eigenvalues differ widely and the error has large components along the eigenvectors with small eigenvalues
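The standard bound that makes this precise (a supplementary note; it is not written out on the slides): with condition number κ = λ_max/λ_min, steepest descent shrinks the error in the energy norm by at most a κ-dependent factor per step.

```latex
\[
\|e_i\|_A \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^{i} \|e_0\|_A,
\qquad \kappa = \frac{\lambda_{\max}}{\lambda_{\min}},
\qquad \|e\|_A = \sqrt{e^{T} A e}
\]
```

A large κ (a long, thin ellipse) makes the factor close to 1, which matches the zig-zagging behavior described above.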
Conjugate Gradient Overview
• Orthogonal directions
• Conjugate vectors
• Conjugate directions
• Gram-Schmidt algorithm
• Gradient and error optimality
• Conjugate Gradient
Orthogonal Directions
• Steepest descent often steps in the same direction many times
• If instead we had n orthogonal search directions and took the best step along each, we would reach the goal after n steps
Orthogonal Directions
• We need the error after each step to be orthogonal to the previous search direction
• Enforcing this would require knowing the error itself, which is exactly what we do not have
Conjugate vectors
Conjugate Vectors
• Two vectors u and v are A-orthogonal (or conjugate) if u^T A v = 0
• Vectors that are conjugate in the original space are orthogonal in the space stretched by A, where the elliptical contours become circles
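One way to see the "stretched space" picture numerically (matrix and vectors below are hypothetical): if A = L L^T is a Cholesky factorization, then u^T A v = (L^T u)^T (L^T v), so A-orthogonality of u and v is ordinary orthogonality of L^T u and L^T v.

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])            # hypothetical symmetric positive-definite matrix
L = np.linalg.cholesky(A)             # A = L @ L.T

u = np.array([1.0, 0.0])
# Build v to be A-orthogonal to u by removing u's A-component from a second vector.
w = np.array([0.0, 1.0])
v = w - (u @ A @ w) / (u @ A @ u) * u

print(np.isclose(u @ A @ v, 0.0))              # True: u and v are conjugate
print(np.isclose((L.T @ u) @ (L.T @ v), 0.0))  # True: orthogonal in the stretched space
```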
Conjugate Directions
• If we have n conjugate (A-orthogonal) search directions and, as with orthogonal directions, take the best step along each, we reach the goal after n steps
Conjugate Directions
Orthogonal Directions
Conjugate Directions
• We need the error after each step to be A-orthogonal to the previous search direction
• Unlike plain orthogonality, this condition can be enforced without knowing the error, because A e_i is computable (next slide)
Conjugate Directions
• Error: e_i = x_i − x  (x is the exact solution)
• A e_i = A x_i − A x = A x_i − b = −r_i, so the residual r_i tells us A e_i
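Making the requirement concrete (a supplementary derivation, not spelled out on the slides): asking that the new error be A-orthogonal to the current direction d_i fixes the step size.

```latex
\[
\begin{aligned}
e_{i+1} &= e_i + \alpha_i d_i \\
0 = d_i^{T} A e_{i+1} &= d_i^{T} A e_i + \alpha_i\, d_i^{T} A d_i
                       = -d_i^{T} r_i + \alpha_i\, d_i^{T} A d_i \\
\Rightarrow \quad \alpha_i &= \frac{d_i^{T} r_i}{d_i^{T} A d_i}
\end{aligned}
\]
```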
Gram-Schmidt Algorithm
• So all that remains is to find n conjugate directions
• The Gram-Schmidt (conjugation) process does this: given n linearly independent vectors, it produces n conjugate directions
Gram-Schmidt algorithm
Gram-Schmidt algorithm
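A sketch of Gram-Schmidt conjugation, assuming the usual formulation: each new direction is the next independent vector minus its A-projections onto the directions already built (the matrix and starting vectors are hypothetical).

```python
import numpy as np

def gram_schmidt_conjugation(A, U):
    """Turn linearly independent columns of U into pairwise A-orthogonal directions."""
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for i in range(n):
        d = U[:, i].astype(float)
        for j in range(i):
            # Subtract the A-projection of u_i onto the already-built direction d_j.
            beta = (U[:, i] @ A @ D[:, j]) / (D[:, j] @ A @ D[:, j])
            d -= beta * D[:, j]
        D[:, i] = d
    return D

A = np.array([[3.0, 2.0], [2.0, 6.0]])
U = np.eye(2)                     # axial unit vectors as the independent starting set
D = gram_schmidt_conjugation(A, U)
print(np.isclose(D[:, 0] @ A @ D[:, 1], 0.0))   # True: the directions are conjugate
```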
Conjugate Directions
• So the algorithm is complete
• But it is expensive: Gram-Schmidt conjugation keeps and reuses all previous directions, costing O(n^2) storage and O(n^3) work overall
• We already had an algorithm in that cost range: Gaussian elimination
Conjugate Directions with axial unit vectors
Gradient and Error Optimality
• For every earlier search direction d_j (j < i) we have d_j^T A e_i = 0, i.e. d_j^T r_i = 0
• It means the residual (negative gradient) at step i is orthogonal to all previous search directions, so the error is already optimal over the subspace searched so far
Conjugate Gradient
• Use the residuals r_i as the independent vectors in the Gram-Schmidt conjugation
• This makes the equations very simple: all but one of the Gram-Schmidt coefficients vanish
• The cost per iteration drops from O(n^2) to O(m), where m is the number of nonzero entries of A
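A compact sketch of the resulting method, assuming the standard conjugate gradient recurrences (β computed from successive residual norms); the test matrix is hypothetical.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve Ax = b (A symmetric positive definite) by the conjugate gradient method."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                     # initial residual
    d = r.copy()                      # first search direction is the residual
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d                    # the single matrix-vector product per iteration
        alpha = (r @ r) / (d @ Ad)    # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad        # updated residual without recomputing b - A x
        beta = (r_new @ r_new) / (r @ r)   # the one Gram-Schmidt coefficient that survives
        d = r_new + beta * d          # new direction, conjugate to the previous ones
        r = r_new
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b))       # matches np.linalg.solve(A, b)
```

In exact arithmetic this terminates after at most n steps, in line with the conjugate-directions argument above.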
Line Search
• Finding the step size: compute the best step size α_i ∈ arg min_{α ≥ 0} f(x_i + α d_i)
End
Thanks for your patience!