Back-Propagation 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University
Back to the… World's Smallest Perceptron! A single weight $w$ maps input $x$ to output: $y = wx$ (a.k.a. the line equation, linear regression). A function of ONE parameter!
Training the world's smallest perceptron. This is just gradient descent, which means the update term should be the gradient of the loss function. Where does that gradient come from?
The derivative $\frac{dL}{dw}$ is the rate at which the loss function $L = \frac{1}{2}(y - \hat{y})^2$ changes per unit change of the weight parameter in $\hat{y} = wx$. Let's compute the derivative…
Compute the derivative:
$$\frac{dL}{dw} = \frac{d}{dw}\left\{ \frac{1}{2}(y - \hat{y})^2 \right\} = -(y - \hat{y})\,\frac{d(wx)}{dw} = -(y - \hat{y})\,x = \nabla w \quad \text{(just shorthand)}$$
That means the weight update for gradient descent is:
$$w = w - \nabla w = w + (y - \hat{y})\,x \quad \text{(move in the direction of the negative gradient)}$$
Gradient Descent (world's smallest perceptron). For each sample $\{x_i, y_i\}$:
1. Predict
   a. Forward pass: $\hat{y} = w x_i$
   b. Compute loss: $L_i = \frac{1}{2}(y_i - \hat{y})^2$
2. Update
   a. Back-propagation: $\frac{dL_i}{dw} = -(y_i - \hat{y})\,x_i = \nabla w$
   b. Gradient update: $w = w - \nabla w$
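As a sketch (not part of the slides), the loop above fits in a few lines of Python. The training data, initial weight, and an added step size are illustrative assumptions:

```python
# Train the one-parameter perceptron y_hat = w * x with gradient descent.
# Data generated from a known slope so we can watch w converge (illustrative values).
samples = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # true w = 3.0

w = 0.0    # initial weight (assumed)
eta = 0.1  # step size, added here for stability (the slide uses a full step)
for epoch in range(100):
    for x, y in samples:
        y_hat = w * x                   # 1a. forward pass
        loss = 0.5 * (y - y_hat) ** 2   # 1b. compute loss
        grad_w = -(y - y_hat) * x       # 2a. back-propagation: dL/dw
        w = w - eta * grad_w            # 2b. gradient update

print(w)
```

On this noiseless data the weight converges to the true slope of 3.0.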
Training the world’s smallest perceptron
World's (second) smallest perceptron! $\hat{y} = w_1 x_1 + w_2 x_2$: a function of two parameters!
Gradient Descent. For each sample $\{x_i, y_i\}$: 1. Predict (a. forward pass, b. compute loss); 2. Update (a. back-propagation, b. gradient update). We just need to compute the partial derivatives for this network.
Back-Propagation:
$$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1}\left\{ \frac{1}{2}(y - \hat{y})^2 \right\} = -(y - \hat{y})\frac{\partial \hat{y}}{\partial w_1} = -(y - \hat{y})\frac{\partial \sum_i w_i x_i}{\partial w_1} = -(y - \hat{y})\frac{\partial w_1 x_1}{\partial w_1} = -(y - \hat{y})\,x_1 = \nabla w_1$$
and, by the same steps, $\frac{\partial L}{\partial w_2} = -(y - \hat{y})\,x_2 = \nabla w_2$. Why do we have partial derivatives now? Because the loss is now a function of two parameters.
Gradient Update:
$$w_1 = w_1 - \eta \nabla w_1 = w_1 + \eta (y - \hat{y})\,x_1 \qquad w_2 = w_2 - \eta \nabla w_2 = w_2 + \eta (y - \hat{y})\,x_2$$
Stochastic Gradient Descent (gradients approximated from a stochastic sample). For each sample $\{x_i, y_i\}$:
1. Predict
   a. Forward pass: $\hat{y} = f_{\text{MLP}}(x_i; \theta)$
   b. Compute loss: $L_i = \frac{1}{2}(y_i - \hat{y})^2$
2. Update
   a. Back-propagation (two BP lines now): $\nabla w_1 = -(y_i - \hat{y})\,x_{1i}$, $\nabla w_2 = -(y_i - \hat{y})\,x_{2i}$
   b. Gradient update: $w_1 = w_1 + \eta (y_i - \hat{y})\,x_{1i}$, $w_2 = w_2 + \eta (y_i - \hat{y})\,x_{2i}$ ($\eta$ is an adjustable step size)
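A minimal SGD sketch for the two-weight perceptron; the data grid, true weights, seed, and learning rate below are all illustrative assumptions:

```python
import random

# SGD for y_hat = w1*x1 + w2*x2 on noiseless data with true weights (2.0, -1.0).
data = [((x1, x2), 2.0 * x1 - 1.0 * x2)
        for x1 in [0.0, 0.5, 1.0] for x2 in [0.0, 0.5, 1.0]]

w1, w2 = 0.0, 0.0
eta = 0.1
random.seed(0)
for step in range(2000):
    (x1, x2), y = random.choice(data)   # stochastic sample
    y_hat = w1 * x1 + w2 * x2           # forward pass
    grad_w1 = -(y - y_hat) * x1         # back-propagation (two BP lines now)
    grad_w2 = -(y - y_hat) * x2
    w1 -= eta * grad_w1                 # gradient update
    w2 -= eta * grad_w2

print(w1, w2)
```

Since the data is exactly realizable by the model, the weights converge to the generating values.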
We haven’t seen a lot of ‘propagation’ yet because our perceptrons only had one layer…
Multi-layer perceptron: $x \to h_1 \to h_2 \to y$, with weights $w_1, w_2, w_3$ and bias $b_1$. A function of FOUR parameters and FOUR layers!
[Diagram: input $x$, weight $w_1$ and bias $b_1$, sum $a_1$, activation $f_1$; weight $w_2$, sum $a_2$, activation $f_2$; weight $w_3$, sum $a_3$, activation $f_3$, output $y$. Layers 1–4: input, hidden, hidden, output.]
$$a_1 = w_1 \cdot x + b_1$$
$$a_2 = w_2 \cdot f_1(w_1 \cdot x + b_1)$$
$$a_3 = w_3 \cdot f_2(w_2 \cdot f_1(w_1 \cdot x + b_1))$$
$$y = f_3(w_3 \cdot f_2(w_2 \cdot f_1(w_1 \cdot x + b_1)))$$
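To make the forward pass concrete, here is a sketch of the four-parameter network in Python. The sigmoid activations and the specific parameter values are assumptions for illustration (the slides only introduce the sigmoid later, for $f_3$):

```python
import math

def sigmoid(a):
    """Sigmoid activation s(a) = 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2, w3, b1):
    # Layer-by-layer forward pass of the four-parameter MLP.
    a1 = w1 * x + b1   # sum (layer 1 -> 2)
    f1 = sigmoid(a1)   # activation
    a2 = w2 * f1       # sum (layer 2 -> 3)
    f2 = sigmoid(a2)   # activation
    a3 = w3 * f2       # sum (layer 3 -> 4)
    return sigmoid(a3) # f3: output activation = y

# Arbitrary example values, chosen only to exercise the code.
y = forward(0.5, w1=1.0, w2=-2.0, w3=0.5, b1=0.1)
print(y)
```

Because the output activation is a sigmoid, the result always lies in $(0, 1)$.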
The entire network can be written out as one long equation: $y = f_3(w_3 \cdot f_2(w_2 \cdot f_1(w_1 \cdot x + b_1)))$. We need to train the network: what is known? What is unknown?
The inputs $x$ and labels $y$ are known (they come from the training data). The weights and bias are unknown, and the activation functions sometimes have parameters too, making them also unknown. Training the network means estimating these unknowns.
Learning an MLP: given a set of samples $\{x_i, y_i\}$ and an MLP $y = f_{\text{MLP}}(x; \theta)$, estimate the parameters of the MLP, $\theta = \{f, w, b\}$.
Stochastic Gradient Descent. For each random sample $\{x_i, y_i\}$:
1. Predict
   a. Forward pass: $\hat{y} = f_{\text{MLP}}(x_i; \theta)$
   b. Compute loss
2. Update
   a. Back-propagation: $\frac{\partial L}{\partial \theta}$ (vector of parameter partial derivatives)
   b. Gradient update: $\theta \leftarrow \theta - \eta \nabla \theta$ (vector of parameter update equations)
So we need to compute the partial derivatives
$$\frac{\partial L}{\partial \theta} = \left[ \frac{\partial L}{\partial w_3} \;\; \frac{\partial L}{\partial w_2} \;\; \frac{\partial L}{\partial w_1} \;\; \frac{\partial L}{\partial b} \right]$$
Remember, the partial derivative $\frac{\partial L}{\partial w_1}$ describes how changing the weight $w_1$ (near the input) affects the loss (at the loss layer, all the way at the end of the network). So, how do you compute it?
The Chain Rule
According to the chain rule,
$$\frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial f_3} \frac{\partial f_3}{\partial a_3} \frac{\partial a_3}{\partial w_3}$$
Intuitively, the effect of the weight $w_3$ on the loss decomposes along the path from $w_3$ to $L(y, \hat{y})$: the loss depends on $f_3$ (through $\frac{\partial L}{\partial f_3}$), $f_3$ depends on $a_3$ (through $\frac{\partial f_3}{\partial a_3}$), and $a_3$ depends on $w_3$ (through $\frac{\partial a_3}{\partial w_3}$). Chain rule!
Working through the factors:
$$\frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial f_3} \frac{\partial f_3}{\partial a_3} \frac{\partial a_3}{\partial w_3} = -(y - \hat{y}) \frac{\partial f_3}{\partial a_3} \frac{\partial a_3}{\partial w_3}$$
where the first factor is just the partial derivative of the L2 loss. Let's use a sigmoid activation, whose derivative is $\frac{ds(x)}{dx} = s(x)(1 - s(x))$:
$$\frac{\partial L}{\partial w_3} = -(y - \hat{y})\, f_3 (1 - f_3) \frac{\partial a_3}{\partial w_3} = -(y - \hat{y})\, f_3 (1 - f_3)\, f_2$$
since $a_3 = w_3 \cdot f_2$ implies $\frac{\partial a_3}{\partial w_3} = f_2$.
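The closed-form gradient $\frac{\partial L}{\partial w_3} = -(y - \hat{y})\, f_3(1 - f_3)\, f_2$ can be sanity-checked against a finite-difference estimate; the numerical values below are arbitrary:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(w3, f2, y):
    # L = (1/2)(y - y_hat)^2 with y_hat = f3 = sigmoid(w3 * f2)
    return 0.5 * (y - sigmoid(w3 * f2)) ** 2

# Arbitrary example values.
w3, f2, y = 0.7, 0.4, 1.0
f3 = sigmoid(w3 * f2)                             # y_hat
grad_analytic = -(y - f3) * f3 * (1 - f3) * f2    # the chain-rule expression

# Central finite-difference estimate of dL/dw3.
eps = 1e-6
grad_numeric = (loss(w3 + eps, f2, y) - loss(w3 - eps, f2, y)) / (2 * eps)
```

The two estimates agree to many decimal places, confirming the derivation.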
One layer back, the chain gets longer:
$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial f_3} \frac{\partial f_3}{\partial a_3} \frac{\partial a_3}{\partial f_2} \frac{\partial f_2}{\partial a_2} \frac{\partial a_2}{\partial w_2}$$
The leading factors were already computed for $\frac{\partial L}{\partial w_3}$: re-use them (propagate)!
The chain rule says each quantity depends on the one before it along the path from $w_1$ to the loss:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial f_3} \frac{\partial f_3}{\partial a_3} \frac{\partial a_3}{\partial f_2} \frac{\partial f_2}{\partial a_2} \frac{\partial a_2}{\partial f_1} \frac{\partial f_1}{\partial a_1} \frac{\partial a_1}{\partial w_1}$$
Again, the leading factors were already computed for the deeper weights: re-use them (propagate)!
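Putting the whole backward pass together: this sketch (assumed sigmoid activations and arbitrary example values) computes all four gradients, with each per-layer error term re-using the previous one, which is exactly the "propagation" in back-propagation:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Forward pass, keeping every intermediate value (arbitrary example values).
x, y = 0.5, 1.0
w1, w2, w3, b1 = 0.3, -0.4, 0.8, 0.1
a1 = w1 * x + b1; f1 = sigmoid(a1)
a2 = w2 * f1;     f2 = sigmoid(a2)
a3 = w3 * f2;     f3 = sigmoid(a3)   # f3 is y_hat

# Backward pass: each error term re-uses (propagates) the previous one.
d3 = -(y - f3) * f3 * (1 - f3)       # dL/da3
grad_w3 = d3 * f2
d2 = d3 * w3 * f2 * (1 - f2)         # dL/da2, re-uses d3
grad_w2 = d2 * f1
d1 = d2 * w2 * f1 * (1 - f1)         # dL/da1, re-uses d2
grad_w1 = d1 * x
grad_b1 = d1

# Sanity check grad_w1 against a central finite difference on the full network.
def loss_at(w1_):
    f1_ = sigmoid(w1_ * x + b1)
    f2_ = sigmoid(w2 * f1_)
    f3_ = sigmoid(w3 * f2_)
    return 0.5 * (y - f3_) ** 2

eps = 1e-6
grad_w1_numeric = (loss_at(w1 + eps) - loss_at(w1 - eps)) / (2 * eps)
```

Note that $\frac{\partial L}{\partial f_3}\frac{\partial f_3}{\partial a_3}$ is computed once as `d3` and then flows backward into `d2` and `d1`, rather than being recomputed per weight.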