Learning From Data, Lecture 21
Neural Networks: Backpropagation

Forward propagation: algorithmic computation of $h(\mathbf{x})$
Backpropagation: algorithmic computation of $\dfrac{\partial e(\mathbf{x})}{\partial\,\text{weights}}$

M. Magdon-Ismail, CSCI 4100/6100
Recap: The Neural Network

[Figure: the neural network model, biology translated to engineering. Inputs $x_1, \dots, x_d$ feed the input layer ($\ell = 0$); signals pass through $\theta$ nodes in the hidden layers ($0 < \ell < L$); the output layer ($\ell = L$) produces $h(\mathbf{x})$.]
Zooming into a Hidden Node

[Figure: a node in layer $\ell$, receiving the signal $s^{(\ell)}$ through the "weights in" $W^{(\ell)}$ from layer $\ell-1$, producing the output $x^{(\ell)}$ via $\theta$, and feeding layer $\ell+1$ through the "weights out" $W^{(\ell+1)}$.]

Layers $\ell = 0, 1, 2, \dots, L$. Layer $\ell$ has "dimension" $d^{(\ell)}$, hence $d^{(\ell)} + 1$ nodes (including the bias node).

Layer $\ell$ parameters:
$s^{(\ell)}$ (signals in): $d^{(\ell)}$-dimensional input vector
$x^{(\ell)}$ (outputs): $(d^{(\ell)} + 1)$-dimensional output vector
$W^{(\ell)}$ (weights in): $(d^{(\ell-1)} + 1) \times d^{(\ell)}$ matrix, $W^{(\ell)} = \big[\, \mathbf{w}^{(\ell)}_1 \ \mathbf{w}^{(\ell)}_2 \ \cdots \ \mathbf{w}^{(\ell)}_{d^{(\ell)}} \,\big]$
$W^{(\ell+1)}$ (weights out): $(d^{(\ell)} + 1) \times d^{(\ell+1)}$ matrix

$\mathrm{W} = \{W^{(1)}, W^{(2)}, \dots, W^{(L)}\}$ specifies the network.
The Linear Signal

The input $s^{(\ell)}$ to layer $\ell$ is a linear combination (using the weights) of the outputs $x^{(\ell-1)}$ of the previous layer:

$$s^{(\ell)} = (W^{(\ell)})^{\mathrm{t}}\, x^{(\ell-1)}$$

Componentwise,

$$s^{(\ell)}_j = (\mathbf{w}^{(\ell)}_j)^{\mathrm{t}}\, x^{(\ell-1)}, \qquad j = 1, \dots, d^{(\ell)}$$

(recall the linear signal $s = \mathbf{w}^{\mathrm{t}}\mathbf{x}$). The layer then applies the transformation $\theta$: $s^{(\ell)} \xrightarrow{\ \theta\ } x^{(\ell)}$.
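As a concrete sketch (not from the slides), here is one layer's linear signal and output in numpy; the dimensions and numbers are made up for illustration, and the bias convention follows the slide ($x^{(\ell-1)}_0 = 1$, with a bias prepended to the output).

```python
import numpy as np

# One layer with d^(l-1) = 2 inputs and d^(l) = 3 outputs.
# W^(l) has shape (d^(l-1)+1) x d^(l): one row per node of layer l-1, including the bias row.
W = np.array([[ 0.1, -0.2,  0.3],   # bias row (multiplies x_0 = 1)
              [ 0.5,  0.4, -0.1],   # weights from x_1
              [-0.3,  0.2,  0.6]])  # weights from x_2

x_prev = np.array([1.0, 0.7, -1.2])            # x^(l-1), bias 1 prepended
s = W.T @ x_prev                               # s^(l) = (W^(l))^t x^(l-1)
x_next = np.concatenate(([1.0], np.tanh(s)))   # x^(l) = [1; theta(s^(l))]
print(s)        # the three signals s^(l)_j = (w^(l)_j)^t x^(l-1)
print(x_next)
```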
Forward Propagation: Computing h(x)

$$\mathbf{x} = x^{(0)} \xrightarrow{W^{(1)}} s^{(1)} \xrightarrow{\ \theta\ } x^{(1)} \xrightarrow{W^{(2)}} s^{(2)} \xrightarrow{\ \theta\ } x^{(2)} \ \cdots\ \xrightarrow{W^{(L)}} s^{(L)} \xrightarrow{\ \theta\ } x^{(L)} = h(\mathbf{x})$$

Forward propagation to compute $h(\mathbf{x})$:
1: $x^{(0)} \leftarrow \mathbf{x}$  [Initialization]
2: for $\ell = 1$ to $L$ do  [Forward Propagation]
3:   $s^{(\ell)} \leftarrow (W^{(\ell)})^{\mathrm{t}}\, x^{(\ell-1)}$
4:   $x^{(\ell)} \leftarrow \begin{bmatrix} 1 \\ \theta(s^{(\ell)}) \end{bmatrix}$
5: end for
6: $h(\mathbf{x}) = x^{(L)}$  [Output]
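A minimal numpy sketch of this loop, under some assumptions (not the lecture's code): `forward_propagate` is a made-up name, tanh is used for $\theta$, and the output layer is taken as $\theta(s^{(L)})$ without an extra bias component, so that $h(\mathbf{x})$ is directly the value used in the squared error later.

```python
import numpy as np

def forward_propagate(x, weights, theta=np.tanh):
    """Forward propagation for weights = [W1, ..., WL], W_l of shape (d^(l-1)+1, d^(l)).

    Returns the signals ss[l] = s^(l) and outputs xs[l] = x^(l) for every layer.
    """
    L = len(weights)
    xs = [np.concatenate(([1.0], x))]       # x^(0) = [1; x]
    ss = [None]                             # dummy entry so that ss[l] is s^(l)
    for l, W in enumerate(weights, start=1):
        s = W.T @ xs[-1]                    # s^(l) = (W^(l))^t x^(l-1)
        ss.append(s)
        if l < L:
            xs.append(np.concatenate(([1.0], theta(s))))   # hidden layer: prepend the bias
        else:
            xs.append(theta(s))             # output layer: h(x) = theta(s^(L))
    return ss, xs

# Example: a 2-3-1 network with small random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(3, 3)),   # W^(1): (d^(0)+1) x d^(1)
           rng.normal(scale=0.1, size=(4, 1))]   # W^(2): (d^(1)+1) x d^(2)
ss, xs = forward_propagate(np.array([0.5, -1.0]), weights)
h = xs[-1]                                  # h(x) = x^(L)
```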
Minimizing E_in

$$E_{\text{in}}(h) = E_{\text{in}}(\mathrm{W}) = \frac{1}{N}\sum_{n=1}^{N} \big(h(\mathbf{x}_n) - y_n\big)^2, \qquad \mathrm{W} = \{W^{(1)}, W^{(2)}, \dots, W^{(L)}\}$$

[Figure: left, the transformation functions sign, tanh, and linear; right, $E_{\text{in}}$ versus a weight $w$ for the sign and tanh networks.]

Using $\theta = \tanh$ makes $E_{\text{in}}$ differentiable, so we can use gradient descent to reach a local minimum.
Gradient Descent

$$\mathrm{W}(t+1) = \mathrm{W}(t) - \eta\, \nabla E_{\text{in}}\big(\mathrm{W}(t)\big)$$
Gradient of E_in

$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} \underbrace{e\big(h(\mathbf{x}_n), y_n\big)}_{e_n}$$

$$\frac{\partial E_{\text{in}}(\mathbf{w})}{\partial W^{(\ell)}} = \frac{1}{N}\sum_{n=1}^{N} \frac{\partial e_n}{\partial W^{(\ell)}}$$

We need $\dfrac{\partial e(\mathbf{x})}{\partial W^{(\ell)}}$.
Numerical Approach

$$\frac{\partial e(\mathbf{x})}{\partial W^{(\ell)}_{ij}} \approx \frac{e\big(\mathbf{x} \,\big|\, W^{(\ell)}_{ij} + \Delta\big) - e\big(\mathbf{x} \,\big|\, W^{(\ell)}_{ij} - \Delta\big)}{2\Delta}$$

Approximate, and inefficient: every single weight requires its own pair of perturbed evaluations of $e(\mathbf{x})$.
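A sketch of this finite-difference estimate (useful mainly as a check on backpropagation); it reuses the hypothetical `forward_propagate` from the earlier sketch and assumes the pointwise squared error $e(\mathbf{x}) = (h(\mathbf{x}) - y)^2$.

```python
import numpy as np

def pointwise_error(x, y, weights):
    """e(x) = (h(x) - y)^2, using the hypothetical forward_propagate sketched above."""
    _, xs = forward_propagate(x, weights)
    return float((xs[-1][0] - y) ** 2)

def numerical_gradient(x, y, weights, delta=1e-5):
    """Central-difference estimate of d e(x) / d W^(l)_ij for every weight."""
    grads = [np.zeros_like(W) for W in weights]
    for l, W in enumerate(weights):
        for (i, j), w_ij in np.ndenumerate(W):
            W[i, j] = w_ij + delta
            e_plus = pointwise_error(x, y, weights)
            W[i, j] = w_ij - delta
            e_minus = pointwise_error(x, y, weights)
            W[i, j] = w_ij                              # restore the weight
            grads[l][i, j] = (e_plus - e_minus) / (2 * delta)
    return grads
```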
Algorithmic Approach

$e(\mathbf{x})$ depends on $W^{(\ell)}$ only through $s^{(\ell)}$, and $s^{(\ell)} = (W^{(\ell)})^{\mathrm{t}}\, x^{(\ell-1)}$, so

$$\frac{\partial e}{\partial W^{(\ell)}} = \left[\frac{\partial s^{(\ell)}}{\partial W^{(\ell)}}\right]^{\mathrm{t}} \cdot \frac{\partial e}{\partial s^{(\ell)}} = x^{(\ell-1)}\big(\delta^{(\ell)}\big)^{\mathrm{t}} \qquad \text{(chain rule)}$$

where the sensitivity is $\delta^{(\ell)} = \dfrac{\partial e}{\partial s^{(\ell)}}$.
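Written out entry by entry (a short check consistent with the slide, not shown on it): since $s^{(\ell)}_k = \sum_i W^{(\ell)}_{ik}\, x^{(\ell-1)}_i$, differentiating with respect to $W^{(\ell)}_{ij}$ leaves only the $k = j$ term,

$$\frac{\partial e}{\partial W^{(\ell)}_{ij}} = \sum_{k=1}^{d^{(\ell)}} \frac{\partial e}{\partial s^{(\ell)}_k}\,\frac{\partial s^{(\ell)}_k}{\partial W^{(\ell)}_{ij}} = \delta^{(\ell)}_j\, x^{(\ell-1)}_i,$$

which is exactly the $(i,j)$ entry of $x^{(\ell-1)}\big(\delta^{(\ell)}\big)^{\mathrm{t}}$.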
Computing δ(ℓ) Using the Chain Rule

$$\delta^{(1)} \longleftarrow \delta^{(2)} \longleftarrow \cdots \longleftarrow \delta^{(L-1)} \longleftarrow \delta^{(L)}$$

Multiple applications of the chain rule: a perturbation propagates forward as

$$\Delta s^{(\ell)} \xrightarrow{\ \theta\ } \Delta x^{(\ell)} \xrightarrow{\ W^{(\ell+1)}\ } \Delta s^{(\ell+1)} \longrightarrow \cdots \longrightarrow \Delta e(\mathbf{x}),$$

which gives the backward recursion

$$\delta^{(\ell)} = \theta'(s^{(\ell)}) \otimes \big[W^{(\ell+1)} \delta^{(\ell+1)}\big]^{d^{(\ell)}}_{1},$$

where $\otimes$ is componentwise multiplication and $[\,\cdot\,]^{d^{(\ell)}}_{1}$ means the 0th (bias) component is not used.

[Figure: layers ℓ and ℓ+1: δ(ℓ+1) is fed back through W(ℓ+1) and multiplied componentwise by θ'(s(ℓ)) to give δ(ℓ).]
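For a single component (a derivation consistent with the recursion above, not shown on the slide): $e$ depends on $s^{(\ell)}_j$ only through $x^{(\ell)}_j = \theta(s^{(\ell)}_j)$, which in turn feeds every signal of layer $\ell+1$, so

$$\delta^{(\ell)}_j = \frac{\partial e}{\partial s^{(\ell)}_j} = \sum_{k=1}^{d^{(\ell+1)}} \frac{\partial e}{\partial s^{(\ell+1)}_k}\, \frac{\partial s^{(\ell+1)}_k}{\partial x^{(\ell)}_j}\, \frac{\partial x^{(\ell)}_j}{\partial s^{(\ell)}_j} = \theta'(s^{(\ell)}_j) \sum_{k=1}^{d^{(\ell+1)}} W^{(\ell+1)}_{jk}\, \delta^{(\ell+1)}_k = \theta'(s^{(\ell)}_j)\, \big[W^{(\ell+1)}\delta^{(\ell+1)}\big]_j .$$

The bias $x^{(\ell)}_0 = 1$ does not depend on $s^{(\ell)}$, which is why the 0th component of $W^{(\ell+1)}\delta^{(\ell+1)}$ is discarded.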
The Backpropagation Algorithm

$$\delta^{(1)} \longleftarrow \delta^{(2)} \longleftarrow \cdots \longleftarrow \delta^{(L-1)} \longleftarrow \delta^{(L)}$$

Backpropagation to compute the sensitivities $\delta^{(\ell)}$ (assume $s^{(\ell)}$ and $x^{(\ell)}$ have been computed for all $\ell$):
1: $\delta^{(L)} \leftarrow 2\big(x^{(L)} - y\big)\, \theta'(s^{(L)})$  [Initialization]
2: for $\ell = L-1$ down to $1$ do  [Back-Propagation]
3:   Compute (for tanh hidden nodes): $\theta'(s^{(\ell)}) = \big[1 - x^{(\ell)} \otimes x^{(\ell)}\big]^{d^{(\ell)}}_{1}$
4:   $\delta^{(\ell)} \leftarrow \theta'(s^{(\ell)}) \otimes \big[W^{(\ell+1)}\delta^{(\ell+1)}\big]^{d^{(\ell)}}_{1}$   ($\otimes$: componentwise multiplication)
5: end for
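A numpy sketch of this loop, again reusing the hypothetical `forward_propagate` and assuming tanh units in every layer (including the output), so that $\theta'(s) = 1 - \theta(s)^2$ can be read off the stored outputs.

```python
def backpropagate(y, weights, ss, xs):
    """Sensitivities delta^(l) = de/ds^(l), l = 1..L, for tanh units everywhere.

    ss, xs are the per-layer signals and outputs from forward_propagate.
    Note: weights[l] is W^(l+1), because the Python list is 0-indexed.
    """
    L = len(weights)
    deltas = [None] * (L + 1)                          # deltas[l] holds delta^(l)
    out = xs[L]                                        # x^(L) = theta(s^(L)), no bias component
    deltas[L] = 2.0 * (out - y) * (1.0 - out ** 2)     # delta^(L) = 2(x^(L) - y) theta'(s^(L))
    for l in range(L - 1, 0, -1):
        theta_prime = 1.0 - xs[l][1:] ** 2             # theta'(s^(l)) = 1 - x^(l)^2, bias dropped
        back = (weights[l] @ deltas[l + 1])[1:]        # [W^(l+1) delta^(l+1)], bias component dropped
        deltas[l] = theta_prime * back                 # componentwise multiplication
    return deltas
```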
Algorithm for Gradient Descent on E_in

Algorithm to compute $E_{\text{in}}(\mathbf{w})$ and $\mathbf{g} = \nabla E_{\text{in}}(\mathbf{w})$:
Input: weights $\mathbf{w} = \{W^{(1)}, \dots, W^{(L)}\}$; data $\mathcal{D}$.
Output: error $E_{\text{in}}(\mathbf{w})$ and gradient $\mathbf{g} = \{G^{(1)}, \dots, G^{(L)}\}$.
1: Initialize: $E_{\text{in}} = 0$; for $\ell = 1, \dots, L$, $G^{(\ell)} = 0 \cdot W^{(\ell)}$.
2: for each data point $\mathbf{x}_n$ ($n = 1, \dots, N$) do
3:   Compute $x^{(\ell)}$ for $\ell = 0, \dots, L$.  [forward propagation]
4:   Compute $\delta^{(\ell)}$ for $\ell = 1, \dots, L$.  [backpropagation]
5:   $E_{\text{in}} \leftarrow E_{\text{in}} + \frac{1}{N}\big(x^{(L)} - y_n\big)^2$
6:   for $\ell = 1, \dots, L$ do
7:     $G^{(\ell)}(\mathbf{x}_n) = x^{(\ell-1)}\big(\delta^{(\ell)}\big)^{\mathrm{t}}$
8:     $G^{(\ell)} \leftarrow G^{(\ell)} + \frac{1}{N}\, G^{(\ell)}(\mathbf{x}_n)$
9:   end for
10: end for

Can do the batch version or the sequential version (SGD).
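Putting the sketches together, under the same assumptions and with the hypothetical routines above: a batch computation of $E_{\text{in}}$ and $\mathbf{g}$, followed by the gradient-descent update $\mathrm{W}(t+1) = \mathrm{W}(t) - \eta\,\nabla E_{\text{in}}(\mathrm{W}(t))$ from the earlier slide. The toy data, learning rate, and initialization are placeholders.

```python
import numpy as np

def error_and_gradient(X, Y, weights):
    """Batch E_in(w) and gradient g = {G^(1), ..., G^(L)} over the data set."""
    N = len(X)
    E_in = 0.0
    G = [np.zeros_like(W) for W in weights]
    for x, y in zip(X, Y):
        ss, xs = forward_propagate(x, weights)        # forward propagation
        deltas = backpropagate(y, weights, ss, xs)    # backpropagation
        E_in += float((xs[-1][0] - y) ** 2) / N
        for l in range(1, len(weights) + 1):
            G[l - 1] += np.outer(xs[l - 1], deltas[l]) / N   # G^(l) += (1/N) x^(l-1) (delta^(l))^t
    return E_in, G

# Gradient descent on E_in with toy placeholders throughout.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(3, 3)),
           rng.normal(scale=0.1, size=(4, 1))]
X = [np.array([0.5, -1.0]), np.array([-0.2, 0.3])]
Y = [1.0, -1.0]
eta = 0.1
for t in range(1000):
    E_in, G = error_and_gradient(X, Y, weights)
    weights = [W - eta * Gl for W, Gl in zip(weights, G)]   # W(t+1) = W(t) - eta * grad
```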
Digits Data

[Figure: left, $\log_{10}(\text{error})$ versus $\log_{10}(\text{iteration})$ for gradient descent and SGD on the digits data; right, the digits data in the (average intensity, symmetry) feature plane.]