Neural Net Backpropagation 3/20/17
Recall: Limitations of Perceptrons
• AND and OR are linearly separable; XOR isn't.
What is the output of the network? It depends on the node's activation function f:
• Step: f(x) = 0 if x < 0; 1 if x ≥ 0
• Sigmoid: f(x) = 1 / (1 + e^{−x})
• ReLU: f(x) = 0 if x < 0; x if x ≥ 0
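A minimal sketch of computing one node's output under each of these activation functions. The inputs and weights here are hypothetical values for illustration; they are not the network from the slide's figure.

```python
import math

def step(x):
    return 0.0 if x < 0 else 1.0        # threshold: 0 for x < 0, 1 for x >= 0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))   # smooth squash into (0, 1)

def relu(x):
    return 0.0 if x < 0 else x          # 0 for x < 0, identity for x >= 0

# Hypothetical inputs and weights (not from the slides); x[0] = 1 acts as a bias input.
x = [1.0, 2.0, 0.5]
w = [0.5, -1.0, 2.0]

net = sum(wi * xi for wi, xi in zip(w, x))   # weighted sum w . x
for f in (step, sigmoid, relu):
    print(f.__name__, f(net))
```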
How can we train these networks?
Two reasons the perceptron algorithm won't work:
1. Non-threshold activation functions.
2. Multiple layers (what's the correction for hidden nodes?).
Key idea: stochastic gradient descent (SGD), sketched below.
• Compute the error on a random training example.
• Compute the derivative of the error with respect to each weight.
• Update weights in the direction that reduces error.
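A minimal sketch of the SGD loop just described. The function `loss_gradient(w, x, t)` is a hypothetical placeholder that returns ∂E/∂w for one example; the following slides work out what that gradient actually is.

```python
import random

def sgd(weights, data, loss_gradient, alpha=0.5, training_runs=10):
    """Generic SGD: update the weights one random example at a time."""
    for _ in range(training_runs):
        random.shuffle(data)
        for x, t in data:                          # one (input, target) pair
            grad = loss_gradient(weights, x, t)    # dE/dw_i for this example
            for i in range(len(weights)):
                weights[i] -= alpha * grad[i]      # step against the gradient
    return weights
```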
Problem: SGD on threshold functions
• The derivative of the threshold function is 0 everywhere it is defined (and undefined at 0).
• We can't "move in the direction of the gradient".
Better Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^{−x})
• Tanh: tanh(x) = (1 − e^{−2x}) / (1 + e^{−2x})
• ReLU: ReLU(x) = 0 if x < 0; x if x ≥ 0
Derivatives of Activation Functions
• Sigmoid: σ(x) = 1 / (1 + e^{−x}), so dσ(x)/dx = σ(x)(1 − σ(x))
• Tanh: tanh(x) = (1 − e^{−2x}) / (1 + e^{−2x}), so d tanh(x)/dx = 1 − tanh²(x)
• ReLU: ReLU(x) = 0 if x < 0; x if x ≥ 0, so d ReLU(x)/dx = 0 if x ≤ 0; 1 if x > 0
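A short sketch of these activations and their derivatives, matching the formulas above. The convention that dReLU/dx = 0 at x = 0 follows the slide.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # sigma(x) * (1 - sigma(x))

def tanh(x):
    return (1.0 - math.exp(-2 * x)) / (1.0 + math.exp(-2 * x))

def d_tanh(x):
    return 1.0 - tanh(x) ** 2            # 1 - tanh^2(x)

def relu(x):
    return 0.0 if x < 0 else x

def d_relu(x):
    return 0.0 if x <= 0 else 1.0        # slide's convention: derivative 0 at x = 0
```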
Error Gradient
• Define the training error on one example as the squared difference between the node's output o and the target t:
E(w, x) = (t − o)²
• Compute the gradient of the error with respect to the weights. For a sigmoid output node, o = 1 / (1 + e^{−w·x}) with w·x = Σ_i w_i x_i. Working through the chain rule (shown below; the constant factor 2 gets absorbed into the learning rate):
∂E/∂w_i = −o(1 − o)(t − o) x_i
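The "algebra ensues" step written out. This is the standard chain-rule derivation; the factor of 2 is dropped at the end since it can be folded into the learning rate α.

```latex
\begin{align*}
E &= (t - o)^2, \qquad o = \sigma(\mathbf{w}\cdot\mathbf{x}), \qquad
   \mathbf{w}\cdot\mathbf{x} = \sum_i w_i x_i \\
\frac{\partial E}{\partial w_i}
  &= 2(t - o)\,\frac{\partial (t - o)}{\partial w_i}
   = -2(t - o)\,\frac{\partial o}{\partial w_i} \\
  &= -2(t - o)\,\sigma'(\mathbf{w}\cdot\mathbf{x})\,
     \frac{\partial(\mathbf{w}\cdot\mathbf{x})}{\partial w_i}
   = -2(t - o)\,o(1 - o)\,x_i \\
  &\propto -\,o(1 - o)(t - o)\,x_i
\end{align*}
```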
Output Node Gradient Descent Step (sigmoid node, α = 0.5)
w_i += −α ∂E/∂w_i, i.e. w_i += α o(1 − o)(t − o) x_i
With o = 0.7, t = 0.9, x_0 = 2, x_1 = 1.2:
w_0 += 0.5 · 0.7(1 − 0.7)(0.9 − 0.7) · 2 = 0.042 → w_0 = 1.04
w_1 += 0.5 · 0.7(1 − 0.7)(0.9 − 0.7) · 1.2 = 0.025 → w_1 = −0.97
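A quick numeric check of this update. The starting weights w_0 = 1.0 and w_1 = −1.0 are an assumption consistent with the slide's results; the actual values come from a figure that isn't in the text.

```python
alpha, o, t = 0.5, 0.7, 0.9
x = [2.0, 1.2]
w = [1.0, -1.0]     # assumed starting weights (from the slide's figure, not the text)

delta = o * (1 - o) * (t - o)            # output-node error term
for i in range(len(w)):
    w[i] += alpha * delta * x[i]         # w_i += alpha * o(1-o)(t-o) * x_i

print([round(wi, 2) for wi in w])        # [1.04, -0.97], matching the slide
```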
What about hidden layers?
• Use the chain rule to compute error derivatives for previous layers.
• This turns out to be much easier than it sounds.
Let δ_k be the error term we computed for output node k (sigmoid): δ_k = o_k(1 − o_k)(t_k − o_k)
The error for hidden node h comes from summing its contributions to the errors of the output nodes: Σ_{k ∈ output} w_hk δ_k
Hidden Node Gradient Descent Step
• Compute the contribution to next-layer errors:
δ_h = o_h(1 − o_h) Σ_{k ∈ next layer} w_hk δ_k
• Update incoming weights using δ_h as the error term:
w_i += α δ_h x_i
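A minimal sketch of this step for a single hidden node. The node output, inputs, weights, and next-layer deltas below are hypothetical values chosen only to illustrate the two formulas.

```python
alpha = 0.5
o_h = 0.6                      # hidden node's output (hypothetical)
x = [1.0, 0.4]                 # inputs feeding the hidden node (hypothetical)
w_in = [0.2, -0.3]             # incoming weights to the hidden node (hypothetical)
w_hk = [0.8, -0.5]             # weights from hidden node h to output nodes k (hypothetical)
delta_k = [0.042, -0.01]       # error terms already computed for the output nodes

# delta_h = o_h (1 - o_h) * sum_k w_hk * delta_k
delta_h = o_h * (1 - o_h) * sum(w * d for w, d in zip(w_hk, delta_k))

# Update each incoming weight: w_i += alpha * delta_h * x_i
for i in range(len(w_in)):
    w_in[i] += alpha * delta_h * x[i]
```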
Backpropagation Algorithm

for run in 1:training_runs:
    for example in shuffled training data:
        run example through network
        compute error for each output node
        for each layer (starting from output):
            for each node in layer:
                gradient descent update on incoming weights
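A compact, runnable sketch of this algorithm for a network with one hidden layer of sigmoid nodes. The layer sizes, learning rate, random initialization, and XOR-style training data are assumptions for illustration, not from the slides.

```python
import random
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(data, n_in, n_hidden, n_out, alpha=0.5, training_runs=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Weight matrices; the extra column in each holds a bias weight.
    W_h = rng.uniform(-1, 1, (n_hidden, n_in + 1))    # input  -> hidden
    W_o = rng.uniform(-1, 1, (n_out, n_hidden + 1))   # hidden -> output

    for _ in range(training_runs):
        random.shuffle(data)                          # shuffled training data
        for x, t in data:
            # Run the example through the network (forward pass).
            x_b = np.append(x, 1.0)                   # append bias input
            o_h = sigmoid(W_h @ x_b)                  # hidden-layer outputs
            o_h_b = np.append(o_h, 1.0)               # append bias for next layer
            o = sigmoid(W_o @ o_h_b)                  # output-layer outputs

            # Error term for each output node: o(1-o)(t-o).
            delta_o = o * (1 - o) * (t - o)
            # Error term for each hidden node: o_h(1-o_h) * sum_k w_hk * delta_k.
            delta_h = o_h * (1 - o_h) * (W_o[:, :-1].T @ delta_o)

            # Gradient descent update on incoming weights, layer by layer.
            W_o += alpha * np.outer(delta_o, o_h_b)   # w_i += alpha * delta * x_i
            W_h += alpha * np.outer(delta_h, x_b)
    return W_h, W_o

# Hypothetical usage: learn XOR (data and sizes are not from the slides).
xor_data = [(np.array([0., 0.]), np.array([0.])),
            (np.array([0., 1.]), np.array([1.])),
            (np.array([1., 0.]), np.array([1.])),
            (np.array([1., 1.]), np.array([0.]))]
W_h, W_o = train(xor_data, n_in=2, n_hidden=2, n_out=1)
```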
Example Backpropagation Update
σ(x) = 1 / (1 + e^{−x})
Output node: w_i += α o(1 − o)(t − o) x_i
Hidden node: δ_h = o_h(1 − o_h) Σ_{k ∈ next layer} w_hk δ_k