Neural Networks: Backpropagation
Machine Learning
Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others

This lecture
• What is a neural network?
• Predicting with a neural network
• Training neural networks
  – Backpropagation
• Practical concerns

Training a neural network
• Given
  – A network architecture (layout of neurons, their connectivity and activations)
  – A dataset of labeled examples
    • $S = \{(\mathbf{x}_i, y_i)\}$
• The goal: Learn the weights of the neural network
• Remember: For a fixed architecture, a neural network is a function parameterized by its weights
  – Prediction: $y = NN(\mathbf{x}, \mathbf{w})$

Recall: Learning as loss minimization
We have a classifier NN that is completely defined by its weights.
Learn the weights by minimizing a loss $L$:
$\min_{\mathbf{w}} \sum_i L(NN(\mathbf{x}_i, \mathbf{w}), y_i)$ (perhaps with a regularizer)
So far, we saw that this strategy worked for:
1. Logistic Regression
2. Support Vector Machines
3. Perceptron
4. LMS regression
Each minimizes a different loss function.
All of these are linear models. The same idea applies to non-linear models too!

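To make "each minimizes a different loss function" concrete, here is a minimal sketch of the per-example losses usually associated with the four linear models listed above. It assumes labels $y \in \{-1, +1\}$ for the first three, writes `margin` for $y \cdot (\mathbf{w} \cdot \mathbf{x})$, and uses function names that are purely illustrative (they are not from the slides).

```python
import math

def logistic_loss(margin):
    """Logistic regression: log(1 + exp(-margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten to avoid overflow for very negative margins.
    return -margin + math.log1p(math.exp(margin))

def hinge_loss(margin):
    """Support vector machine: max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)

def perceptron_loss(margin):
    """Perceptron: max(0, -margin); zero whenever the example is classified correctly."""
    return max(0.0, -margin)

def squared_loss(prediction, target):
    """LMS regression: half the squared error on a real-valued target."""
    return 0.5 * (prediction - target) ** 2

# The three classification losses evaluated at a few margins.
for m in (-1.0, 0.0, 2.0):
    print(m, logistic_loss(m), hinge_loss(m), perceptron_loss(m))
```

All four take the output of the same kind of linear scorer; only the penalty applied to that output differs.
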
Back to our running example
Given an input $\mathbf{x}$, how is the output predicted?
output $y = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2$
$z_2 = \sigma(w^h_{02} + w^h_{12} x_1 + w^h_{22} x_2)$
$z_1 = \sigma(w^h_{01} + w^h_{11} x_1 + w^h_{21} x_2)$
Suppose the true label for this example is a number $y_i$.
We can write the square loss for this example as:
$L = \frac{1}{2}(y - y_i)^2$

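The prediction and loss for the running example can be sketched in a few lines of Python. This is an illustration under stated assumptions: $\sigma$ is taken to be the logistic sigmoid, weights are stored in dictionaries keyed by the $(i, j)$ subscripts used above, and the numeric weight values are made up for the demonstration.

```python
import math

def sigmoid(a):
    """Logistic sigmoid (assumed to be the activation sigma in the example)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x1, x2, w_h, w_o):
    """Forward pass of the two-input, two-hidden-unit, one-output network."""
    z1 = sigmoid(w_h[0, 1] + w_h[1, 1] * x1 + w_h[2, 1] * x2)
    z2 = sigmoid(w_h[0, 2] + w_h[1, 2] * x1 + w_h[2, 2] * x2)
    y = w_o[0, 1] + w_o[1, 1] * z1 + w_o[2, 1] * z2
    return y, z1, z2

def square_loss(y, y_true):
    """Square loss for a single example."""
    return 0.5 * (y - y_true) ** 2

# Hypothetical weights: all hidden weights zero, so z1 = z2 = sigmoid(0) = 0.5.
w_h = {(0, 1): 0.0, (1, 1): 0.0, (2, 1): 0.0,
       (0, 2): 0.0, (1, 2): 0.0, (2, 2): 0.0}
w_o = {(0, 1): 1.0, (1, 1): 2.0, (2, 1): 4.0}

y, z1, z2 = forward(3.0, -1.0, w_h, w_o)
loss_value = square_loss(y, 3.0)
print(y, loss_value)  # prints 4.0 0.5
```
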
Learning as loss minimization
We have a classifier NN that is completely defined by its weights.
Learn the weights by minimizing a loss $L$:
$\min_{\mathbf{w}} \sum_i L(NN(\mathbf{x}_i, \mathbf{w}), y_i)$ (perhaps with a regularizer)
How do we solve the optimization problem?

Stochastic gradient descent
$\min_{\mathbf{w}} \sum_i L(NN(\mathbf{x}_i, \mathbf{w}), y_i)$
Given a training set $S = \{(\mathbf{x}_i, y_i)\}$, $\mathbf{x} \in \Re^d$
1. Initialize parameters $\mathbf{w}$
   – The objective is not convex. Initialization can be important
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example $(\mathbf{x}_i, y_i) \in S$:
      • Treat this example as the entire dataset
      • Compute the gradient of the loss $\nabla L(NN(\mathbf{x}_i, \mathbf{w}), y_i)$
      • Update: $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t \nabla L(NN(\mathbf{x}_i, \mathbf{w}), y_i)$
      • $\gamma_t$: learning rate, many tweaks possible
3. Return $\mathbf{w}$
Have we solved everything?

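The loop above can be sketched generically in Python. Since computing the gradient for a neural network is exactly the open question at this point, the sketch plugs in the gradient of the LMS (squared) loss for a linear model as a stand-in. The names `sgd` and `lms_gradient`, the constant learning rate (in place of a schedule $\gamma_t$), and the toy data are all assumptions made for illustration.

```python
import random

def sgd(examples, w, grad_fn, epochs=200, learning_rate=0.05, seed=0):
    """Stochastic gradient descent with a constant learning rate.

    grad_fn(w, x, y) must return the gradient of the per-example loss.
    """
    rng = random.Random(seed)
    w = list(w)
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)                 # step 2.1: shuffle the training set
        for x, y in examples:                 # step 2.2: one example at a time
            g = grad_fn(w, x, y)
            w = [wj - learning_rate * gj for wj, gj in zip(w, g)]
    return w

def lms_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w (linear model stand-in)."""
    error = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [error * xj for xj in x]

# Toy data generated from y = 1 + 2 * x; the first feature is a constant bias of 1.
data = [((1.0, x), 1.0 + 2.0 * x) for x in (-1.0, 0.0, 1.0, 2.0)]
w_learned = sgd(data, [0.0, 0.0], lms_gradient)
print(w_learned)  # should be close to [1.0, 2.0]
```

Passing the gradient in as a function keeps the loop unchanged when the model changes; for a neural network, `grad_fn` is where backpropagation would go.
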
The derivative of the loss function?
$\nabla L(NN(\mathbf{x}_i, \mathbf{w}), y_i)$
If the neural network is a differentiable function, we can find the gradient
– Or maybe its sub-gradient
– This is decided by the activation functions and the loss function
It was easy for SVMs and logistic regression
– Only one layer
But how do we find the sub-gradient of a more complex function?
– E.g., a recent paper used a ~150 layer neural network for image classification!
We need an efficient algorithm: Backpropagation

Checkpoint: Where are we?
• If we have a neural network (structure, activations and weights), we can make a prediction for an input
• If we had the true label of the input, then we can define the loss for that example
• If we can take the derivative of the loss with respect to each of the weights, we can take a gradient step in SGD
Questions?

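Reminder: Chain rule for derivatives
– If $z$ is a function of $y$ and $y$ is a function of $x$
  • Then $z$ is a function of $x$, as well
– Question: how to find $\frac{\partial z}{\partial x}$?
  • $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}$
Slide courtesy Richard Socher

Reminder: Chain rule for derivatives
– If $z$ = (a function of $y_1$) + (a function of $y_2$), and the $y_i$'s are functions of $x$
  • Then $z$ is a function of $x$, as well
– Question: how to find $\frac{\partial z}{\partial x}$?
  • $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1} \cdot \frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2} \cdot \frac{\partial y_2}{\partial x}$

Reminder: Chain rule for derivatives
– If $z$ is a sum of functions of $y_i$'s, and the $y_i$'s are functions of $x$
  • Then $z$ is a function of $x$, as well
– Question: how to find $\frac{\partial z}{\partial x}$?
  • $\frac{\partial z}{\partial x} = \sum_i \frac{\partial z}{\partial y_i} \cdot \frac{\partial y_i}{\partial x}$
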
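A quick numerical sanity check of the sum form of the chain rule, as a sketch. The particular functions ($y_1 = 3x$, $y_2 = x^2$, $z = y_1^2 + \sin y_2$) are arbitrary choices made for this illustration and do not come from the slides; the chain-rule value is compared against a central finite difference.

```python
import math

def y1(x):
    return 3.0 * x

def y2(x):
    return x ** 2

def z(x):
    # z is a sum of a function of y1 and a function of y2.
    return y1(x) ** 2 + math.sin(y2(x))

def dz_dx_chain(x):
    """dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx."""
    dz_dy1, dy1_dx = 2.0 * y1(x), 3.0
    dz_dy2, dy2_dx = math.cos(y2(x)), 2.0 * x
    return dz_dy1 * dy1_dx + dz_dy2 * dy2_dx

def dz_dx_numeric(x, eps=1e-6):
    """Central finite-difference approximation of dz/dx."""
    return (z(x + eps) - z(x - eps)) / (2.0 * eps)

x0 = 0.5
print(dz_dx_chain(x0), dz_dx_numeric(x0))  # the two values should agree closely
```
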
Backpropagation
Applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up)
$L = \frac{1}{2}(y - y^*)^2$
output $y = w^o_{01} + w^o_{11} z_1 + w^o_{21} z_2$
$z_2 = \sigma(w^h_{02} + w^h_{12} x_1 + w^h_{22} x_2)$
$z_1 = \sigma(w^h_{01} + w^h_{11} x_1 + w^h_{21} x_2)$
We want to compute $\frac{\partial L}{\partial w^o_{ij}}$ and $\frac{\partial L}{\partial w^h_{ij}}$

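One way these gradients might be computed for the running example is sketched below. It assumes $\sigma$ is the logistic sigmoid (so $\sigma'(a) = z(1 - z)$), stores weights in dictionaries keyed by the $(i, j)$ subscripts, and uses made-up weight and input values. The shared factor $\partial L / \partial y$ is computed once and reused, which is the "remembering partial computations" idea. A finite-difference check is included to guard against derivation mistakes.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x1, x2, w_h, w_o):
    z1 = sigmoid(w_h[0, 1] + w_h[1, 1] * x1 + w_h[2, 1] * x2)
    z2 = sigmoid(w_h[0, 2] + w_h[1, 2] * x1 + w_h[2, 2] * x2)
    y = w_o[0, 1] + w_o[1, 1] * z1 + w_o[2, 1] * z2
    return y, z1, z2

def loss(x1, x2, y_star, w_h, w_o):
    y, _, _ = forward(x1, x2, w_h, w_o)
    return 0.5 * (y - y_star) ** 2

def backprop(x1, x2, y_star, w_h, w_o):
    """Gradients of L with respect to every weight, via the chain rule."""
    y, z1, z2 = forward(x1, x2, w_h, w_o)
    dL_dy = y - y_star                              # computed once, reused below
    grad_o = {(0, 1): dL_dy, (1, 1): dL_dy * z1, (2, 1): dL_dy * z2}
    # Gradient with respect to each hidden unit's pre-activation.
    dL_da1 = dL_dy * w_o[1, 1] * z1 * (1.0 - z1)
    dL_da2 = dL_dy * w_o[2, 1] * z2 * (1.0 - z2)
    grad_h = {(0, 1): dL_da1, (1, 1): dL_da1 * x1, (2, 1): dL_da1 * x2,
              (0, 2): dL_da2, (1, 2): dL_da2 * x1, (2, 2): dL_da2 * x2}
    return grad_h, grad_o

def numeric_grad(x1, x2, y_star, w_h, w_o, eps=1e-6):
    """Central finite differences: two forward passes per weight."""
    result = []
    for weights in (w_h, w_o):
        g = {}
        for key in weights:
            original = weights[key]
            weights[key] = original + eps
            up = loss(x1, x2, y_star, w_h, w_o)
            weights[key] = original - eps
            down = loss(x1, x2, y_star, w_h, w_o)
            weights[key] = original                 # restore the weight
            g[key] = (up - down) / (2.0 * eps)
        result.append(g)
    return result[0], result[1]

# Hypothetical weights and example, chosen only for the demonstration.
w_h = {(0, 1): 0.1, (1, 1): -0.2, (2, 1): 0.3,
       (0, 2): -0.1, (1, 2): 0.4, (2, 2): 0.2}
w_o = {(0, 1): 0.05, (1, 1): 0.7, (2, 1): -0.5}
x1, x2, y_star = 1.0, 2.0, 1.0

grad_h, grad_o = backprop(x1, x2, y_star, w_h, w_o)
num_h, num_o = numeric_grad(x1, x2, y_star, w_h, w_o)
max_diff = max(max(abs(grad_h[k] - num_h[k]) for k in grad_h),
               max(abs(grad_o[k] - num_o[k]) for k in grad_o))
print(max_diff)  # should be tiny if the derivation is right
```

The finite-difference version needs two forward passes per weight, while `backprop` gets all nine derivatives from a single forward pass plus a handful of multiplications.
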