Neural networks
Chapter 20

Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer networks
♦ Applications of neural networks

Brains
$10^{11}$ neurons of > 20 types, $10^{14}$ synapses, 1ms–10ms cycle time
Signals are noisy “spike trains” of electrical potential
[Figure: a neuron, labelled with axonal arborization, axon from another cell, synapse, dendrite, axon, nucleus, synapses, cell body or soma]

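McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
$$a_i \leftarrow g(in_i) = g\left(\sum_j W_{j,i}\, a_j\right)$$
[Figure: a single unit. Input links carry activations $a_j$ with weights $W_{j,i}$, including the bias weight $W_{0,i}$ on the fixed input $a_0 = -1$; the input function $\Sigma$ computes $in_i$; the activation function $g$ produces the output $a_i = g(in_i)$, which is passed along the output links.]
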
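As a minimal sketch of the unit described above, the following Python computes $a_i = g(in_i)$ for one unit. The function names and the example numbers are illustrative assumptions, and the sigmoid is just one possible choice of $g$.

```python
import math

def sigmoid(z):
    """Logistic squashing function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, activations, g=sigmoid):
    """Return a_i = g(in_i), where in_i = sum_j W_{j,i} * a_j.

    weights[0] is the bias weight W_{0,i}; activations[0] must be the
    fixed bias input a_0 = -1.
    """
    in_i = sum(w * a for w, a in zip(weights, activations))
    return g(in_i)

# Hypothetical numbers: bias weight 0.5, input weights 1.0 and -2.0, inputs (1, 0).
# in_i = 0.5 * (-1) + 1.0 * 1 + (-2.0) * 0 = 0.5
a_i = unit_output([0.5, 1.0, -2.0], [-1.0, 1.0, 0.0])
```
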
Activation functions
[Figure: two plots of $g(in_i)$ against $in_i$, each rising to $+1$, labelled (a) and (b)]
(a) is a step function or threshold function
(b) is a sigmoid function $1/(1 + e^{-x})$
Changing the bias weight $W_{0,i}$ moves the threshold location

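To illustrate how the bias weight moves the threshold, here is a small sketch with a single-input unit. It assumes the step function fires for strictly positive input (the slide does not say what happens at exactly zero), and the helper name `unit` is hypothetical.

```python
import math

def step(in_i):
    """Threshold function: 1 if the weighted input is positive, else 0."""
    return 1.0 if in_i > 0 else 0.0

def sigmoid(in_i):
    return 1.0 / (1.0 + math.exp(-in_i))

def unit(x, w1, w0, g):
    """Single-input unit with bias input a_0 = -1, so in = w1 * x - w0."""
    return g(w1 * x - w0)

# With w1 = 1, the step unit switches on once x exceeds the bias weight w0.
low_threshold = [unit(x, 1.0, 0.5, step) for x in (0.0, 1.0, 2.0)]   # threshold at 0.5
high_threshold = [unit(x, 1.0, 1.5, step) for x in (0.0, 1.0, 2.0)]  # threshold at 1.5

# The sigmoid unit passes through 0.5 at the same location, x = w0 / w1.
midpoint = unit(0.5, 1.0, 0.5, sigmoid)
```
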
Implementing logical functions
McCulloch and Pitts: every Boolean function can be implemented (with large enough network)
AND? OR? NOT? MAJORITY?
AND: $W_0 = 1.5$, $W_1 = 1$, $W_2 = 1$
OR: $W_0 = 0.5$, $W_1 = 1$, $W_2 = 1$
NOT: $W_0 = -0.5$, $W_1 = -1$

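The weights listed for AND, OR and NOT can be checked with a short sketch of a threshold unit, using the bias input $a_0 = -1$ from the unit definition. The MAJORITY weights (all input weights 1, $W_0 = n/2$) are an assumption added here for illustration; they are not given on the slide.

```python
def step(in_i):
    return 1 if in_i > 0 else 0

def threshold_unit(weights, inputs):
    """weights = [W_0, W_1, ...]; the bias input a_0 = -1 is prepended here."""
    activations = [-1] + list(inputs)
    return step(sum(w * a for w, a in zip(weights, activations)))

def AND(x1, x2):
    return threshold_unit([1.5, 1, 1], [x1, x2])

def OR(x1, x2):
    return threshold_unit([0.5, 1, 1], [x1, x2])

def NOT(x1):
    return threshold_unit([-0.5, -1], [x1])

def MAJORITY(*xs):
    # Assumption (not on the slide): unit input weights, bias weight W_0 = n / 2.
    return threshold_unit([len(xs) / 2] + [1] * len(xs), xs)
```
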
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer networks
Feed-forward networks implement functions, have no internal state
Recurrent networks:
– Hopfield networks have symmetric weights ($W_{i,j} = W_{j,i}$), $g(x) = \mathrm{sign}(x)$, $a_i = \pm 1$; holographic associative memory
– Boltzmann machines use stochastic activation functions, ≈ MCMC in Bayesian networks (BNs)
– recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate etc.

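Feed-forward example
[Figure: input units 1 and 2 feed hidden units 3 and 4 through weights $W_{1,3}$, $W_{1,4}$, $W_{2,3}$, $W_{2,4}$; hidden units 3 and 4 feed output unit 5 through weights $W_{3,5}$ and $W_{4,5}$]
Feed-forward network = a parameterized family of nonlinear functions:
$$a_5 = g(W_{3,5} \cdot a_3 + W_{4,5} \cdot a_4) = g(W_{3,5} \cdot g(W_{1,3} \cdot a_1 + W_{2,3} \cdot a_2) + W_{4,5} \cdot g(W_{1,4} \cdot a_1 + W_{2,4} \cdot a_2))$$
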
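The nested expression for $a_5$ can be written out directly. This sketch assumes a sigmoid for $g$ and stores the weights in a dictionary keyed by (source, target) unit numbers; both choices, and the example weight values, are illustrative.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def a5(a1, a2, W):
    """W maps (source, target) unit pairs to weights, e.g. W[(1, 3)] is W_{1,3}."""
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    return g(W[(3, 5)] * a3 + W[(4, 5)] * a4)

# Hypothetical weights: changing any of them selects a different function
# from the same parameterized family.
W = {(1, 3): 1.0, (2, 3): -1.0, (1, 4): 0.5, (2, 4): 0.5, (3, 5): 2.0, (4, 5): -2.0}
output = a5(1.0, 0.0, W)
```
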
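Perceptrons
[Figure: a single-layer perceptron network, with input units connected directly to output units by weights $W_{j,i}$, and a surface plot of the perceptron output (0 to 1) over inputs $x_1$ and $x_2$, each from -4 to 4]
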
Expressiveness of perceptrons
Consider a perceptron with $g$ = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
$$\sum_j W_j x_j > 0 \quad \text{or} \quad \mathbf{W} \cdot \mathbf{x} > 0$$
[Figure: the input space ($I_1$, $I_2$) for (a) $I_1$ and $I_2$, (b) $I_1$ or $I_2$, (c) $I_1$ xor $I_2$; panel (c) is marked “?”]

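The "?" for XOR can be illustrated with a brute-force search over a small grid of weights. This is only a sketch: a finite grid proves nothing by itself, although XOR is in fact not linearly separable for any real-valued weights. The grid range and the function name `find_separator` are assumptions.

```python
from itertools import product

def step(z):
    return 1 if z > 0 else 0

def find_separator(target, grid):
    """Search the grid for (W0, W1, W2) such that
    step(W1*x1 + W2*x2 - W0) == target(x1, x2) on all four Boolean inputs.
    Returns the first match found, or None."""
    for w0, w1, w2 in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 - w0) == target(x1, x2)
               for x1, x2 in product((0, 1), repeat=2)):
            return (w0, w1, w2)
    return None

grid = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
and_weights = find_separator(lambda a, b: a & b, grid)
or_weights = find_separator(lambda a, b: a | b, grid)
xor_weights = find_separator(lambda a, b: a ^ b, grid)  # no separator exists
```
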
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input $\mathbf{x}$ and true output $y$ is
$$E = \frac{1}{2} Err^2 \equiv \frac{1}{2}\,(y - h_{\mathbf{W}}(\mathbf{x}))^2$$
Perform optimization search by gradient descent:
$$\frac{\partial E}{\partial W_j} = Err \times \frac{\partial Err}{\partial W_j} = Err \times \frac{\partial}{\partial W_j}\left(y - g\left(\sum_{j=0}^{n} W_j x_j\right)\right) = -Err \times g'(in) \times x_j$$
Simple weight update rule:
$$W_j \leftarrow W_j + \alpha \times Err \times g'(in) \times x_j$$
E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on -ve inputs

Perceptron learning
W = random initial values
for iter = 1 to T
    for i = 1 to N (all examples)
        x = input for example i
        y = output for example i
        W_old = W
        Err = y − g(W_old · x)
        for j = 1 to M (all weights)
            W_j = W_j + α · Err · g′(W_old · x) · x_j

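The loop above might be written in Python roughly as follows. This is a sketch under stated assumptions: $g$ is the sigmoid, each input vector carries the bias input $-1$ in position 0, and the learning rate, epoch count, seed and OR training set are arbitrary illustrative choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train_perceptron(examples, alpha=1.0, epochs=5000, seed=0):
    """examples: list of (x, y) pairs; x[0] must be the fixed bias input -1."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in examples[0][0]]
    for _ in range(epochs):
        for x, y in examples:
            out = predict(w, x)          # g(W_old . x)
            err = y - out                # Err
            slope = out * (1.0 - out)    # g'(in) for the sigmoid
            # All weights are updated from the same W_old, as in the pseudocode.
            w = [wj + alpha * err * slope * xj for wj, xj in zip(w, x)]
    return w

# Illustrative training set: the OR function, with bias input -1 first.
or_examples = [([-1.0, 0.0, 0.0], 0.0), ([-1.0, 0.0, 1.0], 1.0),
               ([-1.0, 1.0, 0.0], 1.0), ([-1.0, 1.0, 1.0], 1.0)]
or_weights = train_perceptron(or_examples)
```

Rounding `predict(or_weights, x)` to the nearest integer gives the learned classification for each example.
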
Perceptron learning contd.
Derivative of sigmoid $g(x)$ can be written in simple form:
$$g(x) = \frac{1}{1 + e^{-x}}$$
$$g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = e^{-x}\, g(x)^2$$
Also,
$$g(x) = \frac{1}{1 + e^{-x}} \Rightarrow g(x) + e^{-x} g(x) = 1 \Rightarrow e^{-x} = \frac{1 - g(x)}{g(x)}$$
So
$$g'(x) = \frac{1 - g(x)}{g(x)}\, g(x)^2 = (1 - g(x))\, g(x)$$

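A quick numerical sanity check of the identity $g'(x) = (1 - g(x))\,g(x)$, comparing it with a central-difference estimate at a few arbitrarily chosen points. The step size and sample points are assumptions.

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(x):
    """Closed form derived above: g'(x) = (1 - g(x)) g(x)."""
    return g(x) * (1.0 - g(x))

def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# Largest disagreement between the closed form and the numerical estimate.
max_gap = max(abs(g_prime(x) - central_difference(g, x))
              for x in (-4.0, -1.0, 0.0, 0.5, 3.0))
```
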
Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: two learning curves, proportion correct on test set (0.4 to 1) against training set size (0 to 100), each comparing a perceptron with a decision tree. Left panel: MAJORITY on 11 inputs. Right panel: RESTAURANT data.]

Multilayer networks
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: a three-layer network. Input units $a_k$ connect through weights $W_{k,j}$ to hidden units $a_j$, which connect through weights $W_{j,i}$ to output units $a_i$.]

Expressiveness of MLPs
All continuous functions w/ 1 hidden layer, all functions w/ 2 hidden layers
[Figure: two surface plots of $h_{\mathbf{W}}(x_1, x_2)$ (0 to 1) over $x_1$ and $x_2$, each from -4 to 4]

Training an MLP
In general have $n$ output nodes,
$$E \equiv \frac{1}{2} \sum_i Err_i^2,$$
where $Err_i = (y_i - a_i)$ and $i$ runs over all nodes in the output layer.
Need to calculate $\frac{\partial E}{\partial W_{ij}}$ for any $W_{ij}$.

Training an MLP contd.
Can approximate derivatives by:
$$f'(x) \approx \frac{f(x + h) - f(x)}{h}$$
$$\frac{\partial E}{\partial W_{ij}}(\mathbf{W}) \approx \frac{E(\mathbf{W} + (0, \ldots, h, \ldots, 0)) - E(\mathbf{W})}{h}$$
What would this entail for a network with $n$ weights?
- one iteration would take $O(n^2)$ time ($n$ derivatives, each needing an $O(n)$ evaluation of $E$)
Complicated networks have tens of thousands of weights, so $O(n^2)$ time per iteration is impractical.
Back-propagation is a recursive method of calculating all of these derivatives in $O(n)$ time.

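The finite-difference approach can be sketched for any error function of a flat weight list. The toy error function below is a hypothetical stand-in for a network's $E(\mathbf{W})$, chosen because its exact gradient is known; the counter shows that $n$ weights cost $n + 1$ evaluations of $E$.

```python
def numerical_gradient(E, W, h=1e-6):
    """Approximate dE/dW_k for every weight by a forward difference.

    E is any function of a flat weight list; it is called len(W) + 1 times.
    """
    base = E(W)
    grad = []
    for k in range(len(W)):
        shifted = list(W)
        shifted[k] += h          # W + (0, ..., h, ..., 0)
        grad.append((E(shifted) - base) / h)
    return grad

# Hypothetical error function: E(W) = 1/2 * sum_k W_k^2, so dE/dW_k = W_k.
calls = [0]

def toy_error(W):
    calls[0] += 1
    return 0.5 * sum(w * w for w in W)

weights = [1.0, -2.0, 0.5]
grad = numerical_gradient(toy_error, weights)
```

For a real network each call to `E` is itself a full forward pass over all $n$ weights, which is where the quadratic cost comes from.
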
Back-propagation learning
In general have $n$ output nodes,
$$E \equiv \frac{1}{2} \sum_i Err_i^2,$$
where $Err_i = (y_i - a_i)$ and $i$ runs over all nodes in the output layer.
Output layer: same as for single-layer perceptron,
$$W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i$$
where $\Delta_i = Err_i \times g'(in_i)$
Hidden layers: back-propagate the error from the output layer:
$$\Delta_j = g'(in_j) \sum_i W_{j,i} \Delta_i.$$
Update rule for weights in hidden layers:
$$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j.$$

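A minimal sketch of one back-propagation update for a network with a single hidden layer, following the two update rules above. It assumes sigmoid units and, for brevity, no bias inputs (a fixed $-1$ entry could be appended to `x` to add them); the nested-list weight layout and all names are illustrative.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """w_hidden[k][j] is W_{k,j} (input k to hidden j);
    w_out[j][i] is W_{j,i} (hidden j to output i)."""
    a_hidden = [g(sum(w_hidden[k][j] * x[k] for k in range(len(x))))
                for j in range(len(w_out))]
    a_out = [g(sum(w_out[j][i] * a_hidden[j] for j in range(len(w_out))))
             for i in range(len(w_out[0]))]
    return a_hidden, a_out

def error(x, y, w_hidden, w_out):
    _, a_out = forward(x, w_hidden, w_out)
    return 0.5 * sum((yi - ai) ** 2 for yi, ai in zip(y, a_out))

def backprop_update(x, y, w_hidden, w_out, alpha):
    """Return new (w_hidden, w_out) after one update on example (x, y)."""
    a_hidden, a_out = forward(x, w_hidden, w_out)
    n_hidden, n_out = len(a_hidden), len(a_out)
    # Output layer: Delta_i = Err_i * g'(in_i), with g' = a (1 - a) for the sigmoid.
    delta_out = [(y[i] - a_out[i]) * a_out[i] * (1.0 - a_out[i])
                 for i in range(n_out)]
    # Hidden layer: Delta_j = g'(in_j) * sum_i W_{j,i} Delta_i (old W_{j,i}).
    delta_hidden = [a_hidden[j] * (1.0 - a_hidden[j]) *
                    sum(w_out[j][i] * delta_out[i] for i in range(n_out))
                    for j in range(n_hidden)]
    new_w_out = [[w_out[j][i] + alpha * a_hidden[j] * delta_out[i]
                  for i in range(n_out)] for j in range(n_hidden)]
    new_w_hidden = [[w_hidden[k][j] + alpha * x[k] * delta_hidden[j]
                     for j in range(n_hidden)] for k in range(len(x))]
    return new_w_hidden, new_w_out
```

With a small learning rate, a single update on one example should reduce that example's error, which is an easy property to check.
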
Back-propagation derivation
For a node $i$ in the output layer:
$$\frac{\partial E}{\partial W_{j,i}} = -(y_i - a_i) \frac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i) \frac{\partial g(in_i)}{\partial W_{j,i}}$$
$$= -(y_i - a_i)\, g'(in_i) \frac{\partial in_i}{\partial W_{j,i}} = -(y_i - a_i)\, g'(in_i) \frac{\partial}{\partial W_{j,i}} \left(\sum_k W_{k,i}\, a_k\right)$$
$$= -(y_i - a_i)\, g'(in_i)\, a_j = -a_j \Delta_i$$
where $\Delta_i = (y_i - a_i)\, g'(in_i)$

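The result $\partial E / \partial W_{j,i} = -a_j \Delta_i$ can be checked numerically for a single output node. This sketch assumes a sigmoid $g$ and treats the incoming activations $a_j$ as fixed numbers; the function names are hypothetical.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_error(w, a, y):
    """E = 1/2 (y - g(sum_k w_k a_k))^2 for a single output node i."""
    a_i = g(sum(wk * ak for wk, ak in zip(w, a)))
    return 0.5 * (y - a_i) ** 2

def analytic_gradient(w, a, y):
    """dE/dW_{j,i} = -a_j * Delta_i, with Delta_i = (y_i - a_i) g'(in_i)."""
    a_i = g(sum(wk * ak for wk, ak in zip(w, a)))
    delta_i = (y - a_i) * a_i * (1.0 - a_i)
    return [-a_j * delta_i for a_j in a]

def finite_difference_gradient(w, a, y, h=1e-6):
    grads = []
    for j in range(len(w)):
        up = list(w)
        up[j] += h
        down = list(w)
        down[j] -= h
        grads.append((output_error(up, a, y) - output_error(down, a, y)) / (2 * h))
    return grads
```
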
Back-propagation derivation: hidden layer
For a node $j$ in a hidden layer:
$$\frac{\partial E}{\partial W_{k,j}} = \;?$$

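“Reminder”: Chain rule for partial derivatives
For $f(x, y)$, with $f$ differentiable wrt $x$ and $y$, and $x$ and $y$ differentiable wrt $u$ and $v$:
$$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$
and
$$\frac{\partial f}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v}$$
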
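A small worked check of the chain rule, using the hypothetical functions $f(x, y) = x^2 y$, $x = u + v$, $y = uv$ (chosen only for illustration). The chain-rule value of $\partial f / \partial u$ is compared with a central-difference estimate.

```python
def x_of(u, v):
    return u + v

def y_of(u, v):
    return u * v

def f(x, y):
    return x * x * y

def df_du_chain(u, v):
    """df/du = (df/dx)(dx/du) + (df/dy)(dy/du)."""
    x, y = x_of(u, v), y_of(u, v)
    df_dx, df_dy = 2 * x * y, x * x
    dx_du, dy_du = 1.0, v
    return df_dx * dx_du + df_dy * dy_du

def df_du_numeric(u, v, h=1e-6):
    upper = f(x_of(u + h, v), y_of(u + h, v))
    lower = f(x_of(u - h, v), y_of(u - h, v))
    return (upper - lower) / (2 * h)
```
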