Forward Pass in Python

Example code for a forward pass for a 3-layer network in Python: can be implemented efficiently using matrix operations.

In the example above, W1 is a matrix of size 4 × 3 and W2 is 4 × 4. What about the biases and W3?

[http://cs231n.github.io/neural-networks-1/]

Urtasun, Zemel, Fidler (UofT) CSC 411: 10-Neural Networks I Feb 10, 2016
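The code on the slide was an image; a sketch in the spirit of the cited cs231n notes is below. The random weights and the layer sizes (matching the stated W1 of size 4 × 3 and W2 of 4 × 4) are illustrative, not from the slide:

```python
import numpy as np

# Sigmoid nonlinearity, applied elementwise
f = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 inputs, two hidden layers of 4 units, 1 output
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))                      # input vector (3x1)
W1 = rng.standard_normal((4, 3)); b1 = np.zeros((4, 1))
W2 = rng.standard_normal((4, 4)); b2 = np.zeros((4, 1))
W3 = rng.standard_normal((1, 4)); b3 = np.zeros((1, 1))

h1 = f(W1 @ x + b1)    # first hidden layer activations (4x1)
h2 = f(W2 @ h1 + b2)   # second hidden layer activations (4x1)
out = W3 @ h2 + b3     # output neuron (1x1)
```

Each layer is a single matrix-vector product plus a bias, which is what makes the forward pass efficient.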
Special Case

What is a single-layer (no hiddens) network with a sigmoid activation function?

Network:
$$o_k(\mathbf{x}) = \frac{1}{1 + \exp(-z_k)}, \qquad z_k = w_{k0} + \sum_{j=1}^{J} x_j w_{kj}$$

Logistic regression!
Example Application

Classify an image of a handwritten digit (32x32 pixels): 4 vs non-4.

How would you build your network?

For example, use one hidden layer and the sigmoid activation function:
$$o_k(\mathbf{x}) = \frac{1}{1 + \exp(-z_k)}, \qquad z_k = w_{k0} + \sum_{j=1}^{J} h_j(\mathbf{x}) w_{kj}$$

How can we train the network, that is, adjust all the parameters w?
Training Neural Networks

Find weights:
$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \sum_{n=1}^{N} \mathrm{loss}(\mathbf{o}^{(n)}, \mathbf{t}^{(n)})$$
where $\mathbf{o} = f(\mathbf{x}; \mathbf{w})$ is the output of the neural network.

Define a loss function, e.g.:
◮ Squared loss: $\frac{1}{2} \sum_k \big(o_k^{(n)} - t_k^{(n)}\big)^2$
◮ Cross-entropy loss: $-\sum_k t_k^{(n)} \log o_k^{(n)}$

Gradient descent:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \frac{\partial E}{\partial \mathbf{w}^t}$$
where η is the learning rate (and E is the error/loss).
Useful Derivatives

name      function                                              derivative
Sigmoid   σ(z) = 1/(1 + exp(−z))                                σ(z)(1 − σ(z))
Tanh      tanh(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z))       1/cosh²(z)
ReLU      ReLU(z) = max(0, z)                                   1 if z > 0; 0 if z ≤ 0
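The table entries are easy to verify numerically. This sketch (not from the slides) checks each analytic derivative against a centered finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma(z) * (1 - sigma(z))

def dtanh(z):
    return 1.0 / np.cosh(z) ** 2  # 1 / cosh^2(z)

def drelu(z):
    return (z > 0).astype(float)  # 1 if z > 0, else 0

# Compare against centered finite differences on a few points
z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
num_dsig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_dtanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
assert np.allclose(dsigmoid(z), num_dsig, atol=1e-6)
assert np.allclose(dtanh(z), num_dtanh, atol=1e-6)
```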
Training Neural Networks: Back-propagation

Back-propagation: an efficient method for computing the gradients needed to perform gradient-based optimization of the weights in a multi-layer network.

Training neural nets — loop until convergence:
◮ for each example n:
  1. Given input x^(n), propagate activity forward (x^(n) → h^(n) → o^(n)) (forward pass)
  2. Propagate gradients backward (backward pass)
  3. Update each weight (via gradient descent)

Given any error function E and activation functions g(·) and f(·), we just need to derive the gradients.
Key Idea behind Backpropagation

We don't have targets for a hidden unit, but we can compute how fast the error changes as we change its activity:
◮ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
◮ Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
◮ We can compute error derivatives for all the hidden units efficiently.
◮ Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit.

This is just the chain rule!
Computing Gradients: Single Layer Network

Let's take a single layer network and draw it a bit differently.
Computing Gradients: Single Layer Network

Error gradients for a single layer network, by the chain rule:
$$\frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial z_k} \frac{\partial z_k}{\partial w_{ki}}$$

The error gradient is computable for any continuous activation function g(·) and any continuous error function.

Defining $\delta^o_k = \frac{\partial E}{\partial o_k}$ and $\delta^z_k = \delta^o_k \cdot \frac{\partial o_k}{\partial z_k}$, we get
$$\frac{\partial E}{\partial w_{ki}} = \delta^z_k \frac{\partial z_k}{\partial w_{ki}} = \delta^z_k \, x_i$$
Gradient Descent for Single Layer Network

Assuming the error function is mean-squared error (MSE), on a single training example n we have
$$\frac{\partial E}{\partial o_k^{(n)}} = o_k^{(n)} - t_k^{(n)} =: \delta^o_k$$

Using logistic activation functions:
$$o_k^{(n)} = g\big(z_k^{(n)}\big) = \big(1 + \exp(-z_k^{(n)})\big)^{-1}, \qquad \frac{\partial o_k^{(n)}}{\partial z_k^{(n)}} = o_k^{(n)} \big(1 - o_k^{(n)}\big)$$

The error gradient is then:
$$\frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^{N} \frac{\partial E}{\partial o_k^{(n)}} \frac{\partial o_k^{(n)}}{\partial z_k^{(n)}} \frac{\partial z_k^{(n)}}{\partial w_{ki}} = \sum_{n=1}^{N} \big(o_k^{(n)} - t_k^{(n)}\big) \, o_k^{(n)} \big(1 - o_k^{(n)}\big) \, x_i^{(n)}$$

The gradient descent update rule is given by:
$$w_{ki} \leftarrow w_{ki} - \eta \frac{\partial E}{\partial w_{ki}} = w_{ki} - \eta \sum_{n=1}^{N} \big(o_k^{(n)} - t_k^{(n)}\big) \, o_k^{(n)} \big(1 - o_k^{(n)}\big) \, x_i^{(n)}$$
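The batch update above can be sketched in NumPy. The toy data, sizes, and learning rate here are illustrative assumptions, and the step is scaled by 1/N for stability (the slides use the raw sum):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 5, 3                            # examples, inputs, outputs (arbitrary)
X = rng.standard_normal((N, D))
T = (rng.random((N, K)) > 0.5).astype(float)   # random binary targets

W = 0.01 * rng.standard_normal((D, K))         # weights w_ki (bias omitted for brevity)
eta = 0.5

def forward(X, W):
    return 1.0 / (1.0 + np.exp(-(X @ W)))      # o = g(z), logistic activation

losses = []
for step in range(200):
    O = forward(X, W)
    losses.append(0.5 * np.sum((O - T) ** 2))  # MSE over the batch
    # dE/dw_ki = sum_n (o - t) * o * (1 - o) * x_i, written as one matrix product
    dW = X.T @ ((O - T) * O * (1.0 - O))
    W -= (eta / N) * dW                        # gradient descent step
```

The whole batch gradient is a single matrix product, mirroring the sum over n in the formula.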
Multi-layer Neural Network
Back-propagation: Sketch on One Training Case

Convert the discrepancy between each output and its target value into an error derivative:
$$E = \frac{1}{2} \sum_k (o_k - t_k)^2; \qquad \frac{\partial E}{\partial o_k} = o_k - t_k$$

Compute error derivatives in each hidden layer from the error derivatives in the layer above. [Assign blame for the error at k to each unit j according to its influence on k (depends on w_{kj}).]

Use error derivatives w.r.t. activities to get error derivatives w.r.t. the weights.
Gradient Descent for Multi-layer Network

The output weight gradients for a multi-layer network are the same as for a single layer network:
$$\frac{\partial E}{\partial w_{kj}} = \sum_{n=1}^{N} \frac{\partial E}{\partial o_k^{(n)}} \frac{\partial o_k^{(n)}}{\partial z_k^{(n)}} \frac{\partial z_k^{(n)}}{\partial w_{kj}} = \sum_{n=1}^{N} \delta_k^{z,(n)} h_j^{(n)}$$
where $\delta_k$ is the error w.r.t. the net input for unit k.

Hidden weight gradients are then computed via back-prop:
$$\frac{\partial E}{\partial h_j^{(n)}} = \sum_k \frac{\partial E}{\partial o_k^{(n)}} \frac{\partial o_k^{(n)}}{\partial z_k^{(n)}} \frac{\partial z_k^{(n)}}{\partial h_j^{(n)}} = \sum_k \delta_k^{z,(n)} w_{kj} =: \delta_j^{h,(n)}$$

$$\frac{\partial E}{\partial v_{ji}} = \sum_{n=1}^{N} \frac{\partial E}{\partial h_j^{(n)}} \frac{\partial h_j^{(n)}}{\partial u_j^{(n)}} \frac{\partial u_j^{(n)}}{\partial v_{ji}} = \sum_{n=1}^{N} \delta_j^{h,(n)} f'\big(u_j^{(n)}\big) \frac{\partial u_j^{(n)}}{\partial v_{ji}} = \sum_{n=1}^{N} \delta_j^{u,(n)} x_i^{(n)}$$
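These three equations translate directly into code. A minimal sketch for one hidden layer with logistic units for both f and g (sizes and data are illustrative assumptions), with a finite-difference check of one gradient entry:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, J, K = 20, 4, 6, 3                 # examples, inputs, hidden units, outputs
X = rng.standard_normal((N, D))
T = rng.random((N, K))

V = 0.1 * rng.standard_normal((D, J))    # input-to-hidden weights v_ji
W = 0.1 * rng.standard_normal((J, K))    # hidden-to-output weights w_kj

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(V, W):
    U = X @ V                            # net inputs u_j
    H = sig(U)                           # hidden activities h_j = f(u_j)
    Z = H @ W                            # net inputs z_k
    O = sig(Z)                           # outputs o_k = g(z_k)
    E = 0.5 * np.sum((O - T) ** 2)       # MSE
    d_z = (O - T) * O * (1 - O)          # delta^z_k
    dW = H.T @ d_z                       # dE/dw_kj = sum_n delta^z_k h_j
    d_h = d_z @ W.T                      # delta^h_j = sum_k delta^z_k w_kj
    d_u = d_h * H * (1 - H)              # delta^u_j = delta^h_j f'(u_j)
    dV = X.T @ d_u                       # dE/dv_ji = sum_n delta^u_j x_i
    return E, dV, dW

E, dV, dW = loss_and_grads(V, W)

# Numerical check of one entry of dV via a centered finite difference
eps = 1e-5
Vp, Vm = V.copy(), V.copy()
Vp[0, 0] += eps; Vm[0, 0] -= eps
num = (loss_and_grads(Vp, W)[0] - loss_and_grads(Vm, W)[0]) / (2 * eps)
```

The backward pass reuses quantities (H, O) computed in the forward pass, which is where the efficiency comes from.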
Choosing Activation and Loss Functions

When using a neural network for regression, sigmoid activation and MSE as the loss function work well.

For classification, if it is a binary (2-class) problem, then the cross-entropy error function often does better (as we saw with logistic regression):
$$E = -\sum_{n=1}^{N} \Big[ t^{(n)} \log o^{(n)} + \big(1 - t^{(n)}\big) \log\big(1 - o^{(n)}\big) \Big]$$
$$o^{(n)} = \big(1 + \exp(-z^{(n)})\big)^{-1}$$

We can then compute via the chain rule:
$$\frac{\partial E}{\partial o} = \frac{o - t}{o(1 - o)}, \qquad \frac{\partial o}{\partial z} = o(1 - o), \qquad \frac{\partial E}{\partial z} = \frac{\partial E}{\partial o}\frac{\partial o}{\partial z} = o - t$$
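The clean cancellation $\partial E / \partial z = o - t$ is easy to confirm numerically; this small check (not from the slides, values arbitrary) compares it against a finite difference:

```python
import numpy as np

def bce_loss(z, t):
    """Binary cross-entropy of a logistic output o = sigmoid(z)."""
    o = 1.0 / (1.0 + np.exp(-z))
    return -(t * np.log(o) + (1 - t) * np.log(1 - o))

z, t = 0.7, 1.0                  # arbitrary net input and target
o = 1.0 / (1.0 + np.exp(-z))

analytic = o - t                 # dE/dz = o - t, from the chain rule above

# Centered finite difference of E w.r.t. z
eps = 1e-6
numeric = (bce_loss(z + eps, t) - bce_loss(z - eps, t)) / (2 * eps)
```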
Multi-class Classification

For multi-class classification problems, use cross-entropy as the loss and the softmax activation function:
$$E = -\sum_n \sum_k t_k^{(n)} \log o_k^{(n)}, \qquad o_k^{(n)} = \frac{\exp\big(z_k^{(n)}\big)}{\sum_j \exp\big(z_j^{(n)}\big)}$$

And the derivatives become:
$$\frac{\partial o_k}{\partial z_k} = o_k (1 - o_k), \qquad \frac{\partial E}{\partial z_k} = \sum_j \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial z_k} = o_k - t_k$$
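Softmax plus cross-entropy again gives the simple gradient $o_k - t_k$; this sketch (values illustrative) verifies it against finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def ce_loss(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])
t = np.array([0.0, 1.0, 0.0])    # one-hot target

analytic = softmax(z) - t        # dE/dz_k = o_k - t_k

# Finite-difference check of each component of the gradient
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps; zm[k] -= eps
    numeric[k] = (ce_loss(zp, t) - ce_loss(zm, t)) / (2 * eps)
```

Note that the full chain rule sums over all outputs j, because every o_j depends on every z_k through the softmax denominator; the cross terms cancel to leave o_k − t_k.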
Example Application

Now trying to classify images of handwritten digits: 32x32 pixels, 10 output units, 1 per digit.

Use the softmax function:
$$o_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}, \qquad z_k = w_{k0} + \sum_{j=1}^{J} h_j(\mathbf{x}) w_{kj}$$

What is J?
Ways to Use Weight Derivatives

How often to update:
◮ after a full sweep through the training data (batch gradient descent)
$$w_{ki} \leftarrow w_{ki} - \eta \frac{\partial E}{\partial w_{ki}} = w_{ki} - \eta \sum_{n=1}^{N} \frac{\partial E\big(\mathbf{o}^{(n)}, \mathbf{t}^{(n)}; \mathbf{w}\big)}{\partial w_{ki}}$$
◮ after each training case (stochastic gradient descent)
◮ after a mini-batch of training cases

How much to update:
◮ Use a fixed learning rate
◮ Adapt the learning rate
◮ Add momentum:
$$v \leftarrow \gamma v + \eta \frac{\partial E}{\partial w_{ki}}, \qquad w_{ki} \leftarrow w_{ki} - v$$
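The momentum update can be sketched on a toy problem. Here a simple quadratic stands in for a network's loss (an assumption for illustration), and eta and gamma are arbitrary choices:

```python
import numpy as np

# Minimize E(w) = 0.5 * ||w - w_star||^2 with momentum updates;
# the quadratic is a stand-in for a network's loss surface.
w_star = np.array([3.0, -2.0])

def grad(w):
    return w - w_star            # dE/dw for the quadratic

w = np.zeros(2)                  # initial weights
v = np.zeros(2)                  # velocity
eta, gamma = 0.1, 0.9            # learning rate and momentum coefficient

for step in range(200):
    v = gamma * v + eta * grad(w)   # v <- gamma * v + eta * dE/dw
    w = w - v                       # w <- w - v
```

The velocity v accumulates a running average of past gradients, which damps oscillations and speeds progress along consistent directions.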
Comparing Optimization Methods

[http://cs231n.github.io/neural-networks-3/, Alec Radford]
Monitor Loss During Training

Check how your loss behaves during training, to spot wrong hyperparameters, bugs, etc.

Figure: Left: good vs. bad parameter choices. Right: how a real loss curve might look during training. What are the bumps caused by? How could we get a smoother loss?
Monitor Accuracy on Train/Validation During Training

Check how your desired performance metric behaves during training.

[http://cs231n.github.io/neural-networks-3/]
Why ”Deep”?

Supervised Learning: example — classification (“dog”).

Supervised Deep Learning: classification (“dog”).

[Picture from M. Ranzato]
Neural Networks

Deep learning uses composites of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions.

Note: a composite of linear functions is linear!

Example: 2-hidden-layer NNet (now in matrix and vector form!) with ReLU as the nonlinearity:
$$\mathbf{h}^1 = \max\big(0, W_1^T \mathbf{x} + \mathbf{b}_1\big), \qquad \mathbf{h}^2 = \max\big(0, W_2^T \mathbf{h}^1 + \mathbf{b}_2\big), \qquad \mathbf{y} = W_3^T \mathbf{h}^2 + \mathbf{b}_3$$

◮ x is the input
◮ y is the output (what we want to predict)
◮ h^i is the i-th hidden layer
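The matrix-vector form maps directly to code. A minimal sketch with illustrative layer sizes and random weights (not from the slides):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

rng = np.random.default_rng(2)
D, H1, H2, K = 8, 16, 16, 4          # input, hidden, and output sizes (arbitrary)
x = rng.standard_normal(D)

W1, b1 = rng.standard_normal((D, H1)), np.zeros(H1)
W2, b2 = rng.standard_normal((H1, H2)), np.zeros(H2)
W3, b3 = rng.standard_normal((H2, K)), np.zeros(K)

h1 = relu(W1.T @ x + b1)    # h^1 = max(0, W1^T x + b1)
h2 = relu(W2.T @ h1 + b2)   # h^2 = max(0, W2^T h1 + b2)
y = W3.T @ h2 + b3          # y = W3^T h2 + b3 (linear output layer)
```

Without the ReLU between layers, the three matrix products would collapse into one linear map, illustrating why the nonlinearity is essential.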