Learning: Gradient Descent (back to minimizing)

Problem: how do we compute the gradient

$\nabla_\theta \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$ ?

Large $n$ -> Stochastic Gradient Descent
Complicated function (neural net) -> BackProp
Learning: Stochastic Gradient Descent (killing $n$)

$\nabla_\theta \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$   (the gradient of the average)

$= \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ell(f(x_i; \theta), y_i)$   (the average of the gradients)

$\approx \nabla_\theta \ell(f(x_j; \theta), y_j)$   (in expectation, for uniform $j$)
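As a sanity check on this identity, here is a minimal sketch (assuming a linear model with squared loss; the array names are made up for illustration) comparing the gradient of the average loss, the average of the per-example gradients, and a single randomly chosen per-example gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))      # inputs x_i
y = rng.normal(size=n)           # targets y_i
theta = rng.normal(size=d)       # parameters

# Per-example gradient of l(f(x_i; theta), y_i) = (x_i . theta - y_i)^2 / 2
def grad_i(i):
    return (X[i] @ theta - y[i]) * X[i]

full_grad = X.T @ (X @ theta - y) / n                      # gradient of the average
avg_grad = np.mean([grad_i(i) for i in range(n)], axis=0)  # average of the gradients
one_grad = grad_i(rng.integers(n))                         # single-example estimate

print(np.allclose(full_grad, avg_grad))  # True: the two are identical
print(one_grad)                          # noisy, but unbiased over uniform j
```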
Learning: Stochastic Gradient Descent (killing $n$)

For some number of iterations:
  pick a random example $(x_j, y_j)$
  gradient step: $\theta_{n+1} \leftarrow \theta_n - \eta \, \nabla_\theta \ell(f(x_j; \theta_n), y_j)$, where $\eta$ is the learning rate.
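A minimal sketch of this loop (the function and data names are hypothetical; `grad` stands in for $\nabla_\theta \ell(f(x_j; \theta), y_j)$ on one example):

```python
import numpy as np

def sgd(theta, X, y, grad, lr=0.01, num_iters=1000, seed=0):
    """Stochastic gradient descent: one uniformly random example per step."""
    rng = np.random.default_rng(seed)
    for _ in range(num_iters):
        j = rng.integers(len(X))                       # pick a random example (x_j, y_j)
        theta = theta - lr * grad(theta, X[j], y[j])   # theta_{n+1} <- theta_n - eta * grad
    return theta

# Example usage with the squared loss from the previous sketch:
# theta = sgd(theta, X, y, grad=lambda th, x, t: (x @ th - t) * x)
```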
Learning: Back Propagation (computing the gradient)

Problem: how do we compute the gradient

$\nabla_\theta \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$

when $f$ is a complicated function (a neural net)? -> BackProp
Learning: BackProp (computing the gradient)

Hidden layer $i$: $f_i(x) = \sigma(A_i x + b_i)$

Complete neural network: $f = f_h(f_{h-1}(f_{h-2}(\dots))) = (f_h \circ f_{h-1} \circ \dots \circ f_1)(x)$

How do we compute $\nabla_\theta \ell(f(x_i; \theta), y_i)$?
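A minimal sketch of this composition, assuming each layer is the $\sigma(A_i x + b_i)$ form above with a sigmoid nonlinearity (the list-of-pairs layer representation is an illustrative choice, not prescribed by the slides):

```python
import numpy as np

def sigma(z):
    """Elementwise sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(layers, x):
    """Apply f = f_h o ... o f_1 to x; layers is a list of (A_i, b_i) pairs."""
    ys = [x]
    for A, b in layers:
        ys.append(sigma(A @ ys[-1] + b))   # layer i: f_i(y_{i-1}) = sigma(A_i y_{i-1} + b_i)
    return ys                              # keep every intermediate y_i; backprop reuses them
```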
Learning: BackProp (computing the gradient)

Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial y} \frac{\partial y}{\partial x}$
Learning: BackProp (computing the gradient)

$x \to y_1(x) \to y_2(y_1) \to \dots \to y_{h-2}(y_{h-3}) \to y_{h-1}(y_{h-2}) \to y_h(y_{h-1}) \to \ell(y_h)$

with parameters $\theta_1, \theta_2, \dots, \theta_{h-1}, \theta_h$ attached to the corresponding layers.
Learning: BackProp (computing the gradient)

Gradient with respect to the last layer's parameters, $\nabla_{\theta_h} \ell(y_h)$, by the chain rule:

$\frac{\partial \ell(y_h)}{\partial \theta_{h,i}} = \frac{\partial \ell(y_h)}{\partial y_h} \, \frac{\partial y_h}{\partial \theta_{h,i}}$

The first factor doesn't depend on the current layer; it only depends on $\ell$.
The second factor only depends on the current layer.
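For concreteness, a worked instance of this factorization, assuming the sigmoid layer $y_h = \sigma(z_h)$ with $z_h = A_h y_{h-1} + b_h$ and a squared-error loss $\ell(y_h) = \tfrac{1}{2}\|y_h - y\|^2$ (this particular choice of $\ell$ is an illustrative assumption):

$\frac{\partial \ell}{\partial y_h} = y_h - y$, so $\nabla_{b_h} \ell = (y_h - y) \odot \sigma'(z_h)$ and $\nabla_{A_h} \ell = \big[(y_h - y) \odot \sigma'(z_h)\big] \, y_{h-1}^\top$.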
Learning: BackProp (computing the gradient)

For the layer below, $\nabla_{\theta_{h-1}} \ell(y_h) = \Phi_{h-1}\big(\theta_{h-1}, \nabla_{y_{h-1}} \ell(y_h)\big)$, where

  $\Phi_{h-1}$ depends on the current layer's structure,
  $\theta_{h-1}$ is known,
  $\nabla_{y_{h-1}} \ell(y_h)$ has already been computed.
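A minimal sketch of this recursion for the sigmoid layers above, again assuming the squared-error loss from the worked example (illustrative names; `ys` are the activations returned by the `forward` sketch):

```python
import numpy as np

def backward(layers, ys, target):
    """Backprop: walk the layers in reverse, reusing the already-computed grad_y."""
    grads = []
    grad_y = ys[-1] - target                       # dl/dy_h for the squared-error loss
    for (A, b), y_in, y_out in zip(reversed(layers), reversed(ys[:-1]), reversed(ys[1:])):
        delta = grad_y * y_out * (1.0 - y_out)     # dl/dz_i, using sigma'(z) = y (1 - y)
        grads.append((np.outer(delta, y_in),       # dl/dA_i: the Phi_i(theta_i, grad_y) step
                      delta))                      # dl/db_i
        grad_y = A.T @ delta                       # dl/dy_{i-1}, handed to the layer below
    return list(reversed(grads))                   # gradients in layer order 1..h
```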
Learning: BackProp

It's backwards! Start at the loss: compute $\nabla_{y_h} \ell(y_h)$ and $\nabla_{\theta_h} \ell(y_h)$ first, then work back towards the input layer.
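Putting the sketches together, one stochastic gradient step on a single example might look like this (shapes, seed, and learning rate are arbitrary illustrations; it relies on the `forward` and `backward` sketches above):

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),   # theta_1 = (A_1, b_1)
          (rng.normal(size=(2, 4)), np.zeros(2))]   # theta_2 = (A_2, b_2)
x, target = rng.normal(size=3), rng.normal(size=2)  # one example (x_j, y_j)

ys = forward(layers, x)                   # forward pass, storing every y_i
grads = backward(layers, ys, target)      # backward pass, one gradient pair per layer
eta = 0.1
layers = [(A - eta * gA, b - eta * gb)    # gradient step on every theta_i
          for (A, b), (gA, gb) in zip(layers, grads)]
```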