Artificial neural networks
Chapter 18, Section 7

Artificial Intelligence, spring 2013, Peter Ljunglöf; based on AIMA slides © Stuart Russell and Peter Norvig, 2004
Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1ms–10ms cycle time
Signals are noisy "spike trains" of electrical potential
[Figure: a biological neuron: cell body (soma) with nucleus, dendrites, axon with axonal arborization, and synapses connecting to axons from other cells]
McCulloch–Pitts simplified neuron
Output is a "squashed" linear function of the inputs:
a_i = g(in_i) = g(w_i · a) = g(Σ_j w_{j,i} a_j)
[Figure: a single unit: input links with weights W_{j,i}, the bias input a_0 = −1 with weight W_{0,i}, the input function Σ producing in_i, the activation function g, and output links carrying a_i = g(in_i)]
Note that a_0 = −1 is a constant input, and w_{0,i} is the bias weight
This is a gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do
Activation functions
[Figure: two activation functions g(in_i), both saturating at +1: (a) a step function, (b) a sigmoid]
(a) is a step function or threshold function, g(x) = 1 if x ≥ 0, else 0
(b) is a sigmoid function, g(x) = 1 / (1 + e^{−x})
Changing the bias weight w_{0,i} moves the threshold location
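Not part of the original slides: a minimal Python sketch of these two activation functions and of a single unit computing a_i = g(Σ_j w_{j,i} a_j); the function names and the example weights are my own.

```python
import math

def step(x):
    """Threshold activation: 1 if x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Logistic activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g=sigmoid):
    """Compute a_i = g(sum_j w_{j,i} * a_j).

    weights[0] is the bias weight w_{0,i}; the corresponding constant
    input a_0 = -1 is prepended automatically.
    """
    activations = [-1.0] + list(inputs)                      # a_0 = -1
    in_i = sum(w * a for w, a in zip(weights, activations))  # weighted sum
    return g(in_i)

# Illustrative weights: bias weight 0.5, input weights 1.0 and 1.0
print(unit_output([0.5, 1.0, 1.0], [0.3, 0.9], g=step))      # -> 1.0
print(unit_output([0.5, 1.0, 1.0], [0.3, 0.9], g=sigmoid))   # -> ~0.67
```

With the step activation the unit fires exactly when the weighted input exceeds the threshold set by the bias weight; with the sigmoid the output changes smoothly around that threshold.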
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer networks
Feed-forward networks implement functions and have no internal state
Recurrent networks have directed cycles with delays
⇒ they have internal state (like flip-flops), can oscillate, etc.
Feed-forward example
[Figure: input units 1 and 2 feed hidden units 3 and 4 via weights W_{1,3}, W_{2,3}, W_{1,4}, W_{2,4}; units 3 and 4 feed output unit 5 via weights W_{3,5}, W_{4,5}]
Feed-forward network = a parameterized family of nonlinear functions:
a_5 = g(w_{3,5} · a_3 + w_{4,5} · a_4)
    = g(w_{3,5} · g(w_{1,3} · a_1 + w_{2,3} · a_2) + w_{4,5} · g(w_{1,4} · a_1 + w_{2,4} · a_2))
Adjusting the weights changes the function ⇒ this is how we do learning!
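As an illustration of the formula above, here is a hedged Python sketch of this five-unit network with the sigmoid activation; the weight values are made up for the example.

```python
import math

def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(a1, a2, w):
    """Compute a_5 = g(w35*a3 + w45*a4), where a_3 and a_4 are the
    hidden-unit outputs; w maps (from, to) unit pairs to weights."""
    a3 = g(w[(1, 3)] * a1 + w[(2, 3)] * a2)
    a4 = g(w[(1, 4)] * a1 + w[(2, 4)] * a2)
    return g(w[(3, 5)] * a3 + w[(4, 5)] * a4)

# Made-up weights, just to show the nesting of the two layers
w = {(1, 3): 0.5, (2, 3): -0.4, (1, 4): 0.3, (2, 4): 0.8,
     (3, 5): 1.0, (4, 5): -1.0}
print(feed_forward(1.0, 0.0, w))   # one point of the function the network computes
```

Learning then amounts to searching this weight space for values that make the resulting function fit the training data.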
Single-layer perceptrons
[Figure: (left) input units connected directly to output units by weights W_{j,i}; (right) the output of a two-input sigmoid unit plotted over the (x_1, x_2) plane, forming a soft "cliff"]
Output units all operate separately:
– there are no shared weights
– each output unit corresponds to a separate function
Adjusting the weights moves the location, orientation, and steepness of the cliff
Expressiveness of perceptrons
Consider a perceptron with g = the step function
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space: h_w(x) = 1 iff Σ_j w_j x_j > 0, i.e. w · x > 0
[Figure: the four points of the (x_1, x_2) unit square with the positive examples marked: (a) x_1 and x_2, (b) x_1 or x_2 are linearly separable; (c) x_1 xor x_2 is not]
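To make this concrete, a sketch (hand-picked weights, not from the slides) of step-function perceptrons for AND, OR and NOT, written with an explicit bias term rather than the a_0 = −1 convention:

```python
def perceptron(weights, bias, inputs):
    """Step-function perceptron: 1 if w·x + bias > 0, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

AND = lambda x1, x2: perceptron([1, 1], -1.5, [x1, x2])
OR  = lambda x1, x2: perceptron([1, 1], -0.5, [x1, x2])
NOT = lambda x1:     perceptron([-1],    0.5, [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))

# XOR is 1 exactly when the inputs differ; no single line w·x + bias = 0
# can put (0,1) and (1,0) on one side and (0,0) and (1,1) on the other,
# so no choice of weights makes this perceptron compute XOR.
```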
Perceptron learning
Learn by adjusting weights to reduce the error on the training set
The perceptron learning rule:
w_j ← w_j + α (y − h) x_j
where h = h_w(x) ∈ {0, 1} is the calculated hypothesis, y ∈ {0, 1} is the desired value, and 0 < α < 1 is the learning rate.
In other words:
• if y = 1 and h = 0, add α x_j to w_j
• if y = 0 and h = 1, subtract α x_j from w_j
• otherwise y = h, and nothing is done
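A minimal sketch of the rule as stated, with the bias handled as w_0 on a constant input x_0 = −1 as in the earlier slides; the tiny data set and the value of α are illustrative only.

```python
def predict(w, x):
    """h_w(x) with the step activation; x includes the constant x_0 = -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

def perceptron_update(w, x, y, alpha=0.1):
    """One application of w_j <- w_j + alpha * (y - h) * x_j."""
    h = predict(w, x)
    return [wj + alpha * (y - h) * xj for wj, xj in zip(w, x)]

# One pass over a tiny OR data set (x_0 = -1 prepended to every example)
data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
for x, y in data:
    w = perceptron_update(w, x, y)
print(w)
```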
Perceptrons = linear classifiers
The perceptron learning rule converges to a consistent function for any linearly separable data set
But what if the data set is not linearly separable?
Data that are not linearly separable
The perceptron learning rule converges to a consistent function for any linearly separable data set
But what can we do if the data set is not linearly separable?
• Stop after a fixed number of iterations
• Stop when the total error does not change between iterations
• Let α decrease between iterations
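A sketch (mine, not from the slides) of a training loop combining the three ideas above; the decay schedule α = 1000 / (1000 + t) is just one common choice.

```python
def step_predict(w, x):
    """h_w(x) with the step activation; x includes the constant x_0 = -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

def train(data, n_weights, max_epochs=100):
    """Perceptron training that also terminates on non-separable data."""
    w = [0.0] * n_weights
    prev_error = None
    for t in range(1, max_epochs + 1):      # stop after a fixed number of iterations
        alpha = 1000.0 / (1000.0 + t)       # let alpha decrease between iterations
        total_error = 0
        for x, y in data:
            h = step_predict(w, x)
            total_error += abs(y - h)
            w = [wj + alpha * (y - h) * xj for wj, xj in zip(w, x)]
        if total_error == prev_error:       # total error no longer changing
            break
        prev_error = total_error
    return w
```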
Perceptrons vs decision trees
The perceptron learns the majority function easily, while DTL is hopeless
DTL learns the restaurant function easily, while the perceptron cannot represent it
[Figure: learning curves, proportion correct on the test set vs training set size (0–100 examples), for the perceptron and the decision-tree learner on (left) MAJORITY on 11 inputs and (right) the RESTAURANT data]
Multilayer perceptrons
Layers are usually fully connected; the number of hidden units is typically chosen by hand
[Figure: output units a_i, connected by weights W_{j,i} to hidden units a_j, connected by weights W_{k,j} to input units a_k]
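A hedged sketch of a forward pass through two such fully connected layers, with a_0 = −1 bias inputs; the layer sizes and the random weights are only an illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """One fully connected layer: unit i computes g(sum_j W[j][i] * a_j),
    where a_0 = -1 is the bias input and weights[0] holds the bias weights."""
    acts = [-1.0] + list(inputs)
    n_out = len(weights[0])
    return [sigmoid(sum(acts[j] * weights[j][i] for j in range(len(acts))))
            for i in range(n_out)]

def mlp_forward(x, hidden_w, output_w):
    """Input units -> hidden units -> output units."""
    return layer(layer(x, hidden_w), output_w)

# Example: 2 input units, 3 hidden units, 1 output unit (random weights)
random.seed(0)
hidden_w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # (bias + 2 inputs) x 3
output_w = [[random.uniform(-1, 1)] for _ in range(4)]                    # (bias + 3 hidden) x 1
print(mlp_forward([0.5, -0.2], hidden_w, output_w))
```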
Expressiveness of MLPs
What functions can be described by MLPs?
– with 2 hidden layers: all continuous functions
– with 3 hidden layers: all functions
[Figure: two surfaces h_W(x_1, x_2) over the (x_1, x_2) plane: a ridge made from two opposite-facing soft thresholds, and a bump made from two perpendicular ridges]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
The proof requires exponentially many hidden units
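A small numeric sketch (mine, not from the slides) of the ridge construction: two opposite-facing sigmoids with shifted thresholds sum to a function that is near 1 between the thresholds and near 0 outside.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ridge(x, lo=-2.0, hi=2.0, steepness=5.0):
    """Sum of two opposite-facing soft thresholds at lo and hi, shifted down
    by 1 so the result is ~1 for lo < x < hi and ~0 elsewhere."""
    return sigmoid(steepness * (x - lo)) + sigmoid(-steepness * (x - hi)) - 1.0

for x in (-4, -2, 0, 2, 4):
    print(x, round(ridge(x), 3))
# -4 -> ~0.0, -2 -> 0.5, 0 -> ~1.0, 2 -> 0.5, 4 -> ~0.0
```

A bump over the plane is built the same way by combining a ridge in x_1 with a perpendicular ridge in x_2.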
Example: Handwritten digit recognition
MLPs are quite good for complex pattern recognition tasks (but the resulting hypotheses cannot be understood easily)
3-nearest-neighbor classifier = 2.4% error
MLP (400 inputs, 300 hidden units, 10 outputs) = 1.6% error
LeNet, an MLP specialized for image analysis = 0.9% error
SVM, without any domain knowledge = 1.1% error