Neural networks
Chapter 20, Section 5
Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1ms–10ms cycle time
Signals are noisy "spike trains" of electrical potential
[Figure: neuron anatomy, showing axonal arborization, axon from another cell, synapse, dendrite, axon, nucleus, synapses, and cell body or soma]
McCulloch–Pitts "unit"
Output is a "squashed" linear function of the inputs:
    a_i ← g(in_i) = g(Σ_j W_{j,i} a_j)
[Figure: unit diagram, showing input links with weights W_{j,i}, a bias weight W_{0,i} on the fixed input a_0 = −1, the summed input in_i = Σ_j W_{j,i} a_j, the activation function g, the output a_i = g(in_i), and output links]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do
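The unit above can be sketched in a few lines of Python; function names here are my own, and the sigmoid is just one choice of g:

```python
import math

def unit_output(weights, inputs, g):
    """A McCulloch-Pitts unit: a_i = g(sum_j W_ji * a_j).
    By convention, inputs[0] is the fixed bias input a_0 = -1."""
    in_i = sum(w * a for w, a in zip(weights, inputs))
    return g(in_i)

def sigmoid(x):
    """Sigmoid activation 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

# A unit with bias weight 0.5 and a single input weight 1.0:
# in = 0.5 * (-1) + 1.0 * 1.0 = 0.5, so output is g(0.5)
out = unit_output([0.5, 1.0], [-1.0, 1.0], sigmoid)
```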
Activation functions
[Figure: two activation functions g(in_i), each saturating at +1: (a) a step, (b) a smooth sigmoid]
(a) is a step function or threshold function
(b) is a sigmoid function 1/(1 + e^−x)
Changing the bias weight W_{0,i} moves the threshold location
Implementing logical functions
[Figure: three threshold units]
AND: W_0 = 1.5, W_1 = 1, W_2 = 1
OR:  W_0 = 0.5, W_1 = 1, W_2 = 1
NOT: W_0 = −0.5, W_1 = −1
McCulloch and Pitts: every Boolean function can be implemented
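A minimal sketch of these three units, using exactly the weights above with a step activation:

```python
def threshold_unit(W0, weights, inputs):
    """Fire (output 1) iff sum_j W_j x_j > W_0; equivalent to a unit
    with a fixed bias input a_0 = -1 carrying weight W_0."""
    total = sum(w * x for w, x in zip(weights, inputs)) - W0
    return 1 if total > 0 else 0

def AND(x1, x2): return threshold_unit(1.5, [1, 1], [x1, x2])
def OR(x1, x2):  return threshold_unit(0.5, [1, 1], [x1, x2])
def NOT(x1):     return threshold_unit(-0.5, [-1], [x1])
```

Checking the truth tables confirms the weights: for AND, x1 + x2 exceeds 1.5 only when both inputs are 1.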
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons
Feed-forward networks implement functions, have no internal state
Recurrent networks:
– Hopfield networks have symmetric weights (W_{i,j} = W_{j,i}); g(x) = sign(x), a_i = ±1; holographic associative memory
– Boltzmann machines use stochastic activation functions, ≈ MCMC in Bayes nets
– recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate etc.
Feed-forward example
[Figure: network with input units 1, 2, hidden units 3, 4, and output unit 5, connected by weights W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}, W_{3,5}, W_{4,5}]
Feed-forward network = a parameterized family of nonlinear functions:
    a_5 = g(W_{3,5} · a_3 + W_{4,5} · a_4)
        = g(W_{3,5} · g(W_{1,3} · a_1 + W_{2,3} · a_2) + W_{4,5} · g(W_{1,4} · a_1 + W_{2,4} · a_2))
Adjusting weights changes the function: do learning this way!
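The nested expression for a_5 can be evaluated directly; a sketch assuming sigmoid activation (the slide leaves g unspecified) and a dict of weights keyed by (source, destination):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(a1, a2, W, g=sigmoid):
    """Evaluate the 2-input, 2-hidden-unit, 1-output network:
    a_5 = g(W_35 * g(W_13 a_1 + W_23 a_2) + W_45 * g(W_14 a_1 + W_24 a_2))."""
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    return g(W[(3, 5)] * a3 + W[(4, 5)] * a4)

# With all weights zero, every unit outputs g(0) = 0.5
zeros = {(1, 3): 0, (2, 3): 0, (1, 4): 0, (2, 4): 0, (3, 5): 0, (4, 5): 0}
a5 = forward(1.0, 1.0, zeros)  # 0.5
```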
Single-layer perceptrons
[Figure: input units connected by weights W_{j,i} directly to output units; surface plot of perceptron output over inputs x_1, x_2, showing a sigmoid "cliff"]
Output units all operate separately; no shared weights
Adjusting weights moves the location, orientation, and steepness of cliff
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:
    Σ_j W_j x_j > 0, i.e., W · x > 0
[Figure: x_1 vs. x_2 plots of (a) x_1 and x_2, (b) x_1 or x_2, each linearly separable, and (c) x_1 xor x_2, for which no separating line exists]
Minsky & Papert (1969) pricked the neural network balloon
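The separability claim can be illustrated by brute force: search a coarse grid of candidate weights for a line W·x > W_0 that classifies every example correctly. This is an illustration of the claim over a finite grid, not a general decision procedure (AND happens to be separable by weights on the grid, while XOR is separable by no weights at all):

```python
from itertools import product

def separable(examples, grid):
    """Search a grid of (W0, W1, W2) for a linear separator
    W1*x1 + W2*x2 > W0 consistent with all labeled examples."""
    for W0, W1, W2 in product(grid, repeat=3):
        if all((W1 * x1 + W2 * x2 > W0) == (y == 1)
               for (x1, x2), y in examples):
            return True
    return False

grid = [i / 2 for i in range(-4, 5)]  # weights -2.0 .. 2.0 in steps of 0.5
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```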
Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input x and true output y is
    E = ½ Err² ≡ ½ (y − h_W(x))²
Perform optimization search by gradient descent:
    ∂E/∂W_j = Err × ∂Err/∂W_j = Err × ∂/∂W_j (y − g(Σ_{j=0}^n W_j x_j))
            = −Err × g′(in) × x_j
Simple weight update rule:
    W_j ← W_j + α × Err × g′(in) × x_j
E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on −ve inputs
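A minimal sketch of this update rule, assuming a sigmoid g (so that g′ exists) and using the linearly separable OR function as illustrative training data:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_perceptron(examples, alpha=0.5, epochs=2000):
    """Gradient-descent perceptron learning:
    W_j <- W_j + alpha * Err * g'(in) * x_j, with g = sigmoid.
    Each example is (inputs, y); a fixed bias input -1 is prepended."""
    n = len(examples[0][0]) + 1
    W = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            xs = [-1.0] + list(x)
            in_ = sum(w * xi for w, xi in zip(W, xs))
            a = sigmoid(in_)
            err = y - a
            gprime = a * (1 - a)  # derivative of the sigmoid at in
            W = [w + alpha * err * gprime * xi for w, xi in zip(W, xs)]
    return W

def predict(W, x):
    xs = [-1.0] + list(x)
    return sigmoid(sum(w * xi for w, xi in zip(W, xs)))

# OR is linearly separable, so the rule converges to a consistent fit
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
W = train_perceptron(OR_DATA)
```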
Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: proportion correct on test set vs. training set size (0–100) for the perceptron and decision-tree learning, on (left) MAJORITY on 11 inputs and (right) the RESTAURANT data]
Perceptron learns majority function easily, DTL is hopeless
DTL learns restaurant function easily, perceptron cannot represent it
Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: output units a_i, connected by weights W_{j,i} to hidden units a_j, connected by weights W_{k,j} to input units a_k]
Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: surface plots of h_W(x_1, x_2) showing a ridge and a bump built from sigmoid units]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units (cf DTL proof)
Back-propagation learning
Output layer: same as for single-layer perceptron,
    W_{j,i} ← W_{j,i} + α × a_j × Δ_i
where Δ_i = Err_i × g′(in_i)
Hidden layer: back-propagate the error from the output layer:
    Δ_j = g′(in_j) Σ_i W_{j,i} Δ_i
Update rule for weights in hidden layer:
    W_{k,j} ← W_{k,j} + α × a_k × Δ_j
(Most neuroscientists deny that back-propagation occurs in the brain)
Back-propagation derivation
The squared error on a single example is defined as
    E = ½ Σ_i (y_i − a_i)²,
where the sum is over the nodes in the output layer.
    ∂E/∂W_{j,i} = −(y_i − a_i) ∂a_i/∂W_{j,i} = −(y_i − a_i) ∂g(in_i)/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) ∂in_i/∂W_{j,i}
                = −(y_i − a_i) g′(in_i) ∂/∂W_{j,i} (Σ_j W_{j,i} a_j)
                = −(y_i − a_i) g′(in_i) a_j = −a_j Δ_i
Back-propagation derivation contd.
    ∂E/∂W_{k,j} = −Σ_i (y_i − a_i) ∂a_i/∂W_{k,j} = −Σ_i (y_i − a_i) ∂g(in_i)/∂W_{k,j}
                = −Σ_i (y_i − a_i) g′(in_i) ∂in_i/∂W_{k,j} = −Σ_i Δ_i ∂/∂W_{k,j} (Σ_j W_{j,i} a_j)
                = −Σ_i Δ_i W_{j,i} ∂a_j/∂W_{k,j} = −Σ_i Δ_i W_{j,i} ∂g(in_j)/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) ∂in_j/∂W_{k,j}
                = −Σ_i Δ_i W_{j,i} g′(in_j) ∂/∂W_{k,j} (Σ_k W_{k,j} a_k)
                = −Σ_i Δ_i W_{j,i} g′(in_j) a_k = −a_k Δ_j
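The derivation can be checked numerically. This sketch, assuming a 2–2–1 sigmoid network with bias inputs of −1 (function names are mine), computes the Δ-based gradients ∂E/∂W = −a·Δ and compares them against finite differences of E:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(Wh, Wo, x1, x2):
    """2-2-1 network; Wh[j] = [bias, w1, w2] for hidden unit j,
    Wo = [bias, w_from_h0, w_from_h1] for the output unit."""
    xs = [-1.0, x1, x2]
    a = [sigmoid(sum(w * xi for w, xi in zip(Wj, xs))) for Wj in Wh]
    hs = [-1.0] + a
    return sigmoid(sum(w * h for w, h in zip(Wo, hs)))

def backprop_grads(Wh, Wo, x1, x2, y):
    """Gradients of E = 1/2 (y - a_out)^2 via the derivation:
    dE/dW_{j,i} = -a_j * Delta_i and dE/dW_{k,j} = -a_k * Delta_j."""
    xs = [-1.0, x1, x2]
    a = [sigmoid(sum(w * xi for w, xi in zip(Wj, xs))) for Wj in Wh]
    hs = [-1.0] + a
    out = sigmoid(sum(w * h for w, h in zip(Wo, hs)))
    delta_o = (y - out) * out * (1 - out)  # Err_i * g'(in_i)
    delta_h = [a[j] * (1 - a[j]) * Wo[j + 1] * delta_o for j in range(2)]
    gWo = [-h * delta_o for h in hs]
    gWh = [[-xi * delta_h[j] for xi in xs] for j in range(2)]
    return gWh, gWo
```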
Back-propagation learning contd.
At each epoch, sum gradient updates for all examples and apply
Training curve for 100 restaurant examples: finds exact fit
[Figure: total error on training set vs. number of epochs (0–400)]
Typical problems: slow convergence, local minima
Back-propagation learning contd.
Learning curve for MLP with 4 hidden units:
[Figure: proportion correct on test set vs. training set size (0–100) on the RESTAURANT data, for the multilayer network and decision-tree learning]
MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily
Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) ≈ 0.6% error
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, fraud detection, etc.
Engineering, cognitive modelling, and neural system modelling subfields have largely diverged