
Artificial Neural Networks: Threshold Units and Gradient Descent



  1. Artificial Neural Networks
     • Threshold units
     • Gradient descent
     • Multilayer networks
     • Backpropagation
     • Hidden layer representations
     • Example: Face Recognition
     • Advanced topics

  2. Connectionist Models
     Consider humans:
     • Neuron switching time ≈ 0.001 second
     • Number of neurons ≈ 10^10
     • Connections per neuron ≈ 10^4 to 10^5
     • Scene recognition time ≈ 0.1 second
     • 100 inference steps doesn't seem like enough → much parallel computation
     Properties of artificial neural nets (ANNs):
     • Many neuron-like threshold switching units
     • Many weighted interconnections among units
     • Highly parallel, distributed processing
     • Emphasis on tuning weights automatically

  3. When to Consider Neural Networks
     • Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
     • Output is discrete or real-valued
     • Output is a vector of values
     • Possibly noisy data
     • Form of target function is unknown
     • Human readability of result is unimportant
     Examples:
     • Speech phoneme recognition [Waibel]
     • Image classification [Kanade, Baluja, Rowley]
     • Financial prediction

  4. ALVINN drives 70 mph on highways

  5. Perceptron
     [Figure: a perceptron unit — inputs x_1 … x_n with weights w_1 … w_n, plus a fixed input x_0 = 1 carrying the bias weight w_0, feed a sum net = Σ_{i=0}^{n} w_i x_i, which is passed through a threshold.]
     o(x_1, …, x_n) = 1 if w_0 + w_1 x_1 + ··· + w_n x_n > 0, −1 otherwise.
     Sometimes we'll use simpler vector notation:
     o(x) = 1 if w · x > 0, −1 otherwise.
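The threshold rule above can be sketched in a few lines of Python (NumPy and the function name are my additions, not part of the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """o(x) = 1 if w . x > 0 else -1, with x augmented by x0 = 1."""
    x = np.concatenate(([1.0], x))  # prepend x0 = 1 so w[0] acts as the bias
    return 1 if np.dot(w, x) > 0 else -1
```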

  6. Decision Surface of a Perceptron
     [Figure: (a) a linearly separable set of + and − points in the (x_1, x_2) plane, split by a line; (b) a set of + and − points that no single line can split.]
     Represents some useful functions
     • What weights represent g(x_1, x_2) = AND(x_1, x_2)?
     But some functions are not representable
     • e.g., not linearly separable
     • Therefore, we'll want networks of these...
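One answer to the AND question can be checked directly; the particular weights below are illustrative (many choices work):

```python
import numpy as np

# One possible choice: w0 = -1.5, w1 = w2 = 1 gives AND(x1, x2) for
# x1, x2 in {0, 1} -- the line x1 + x2 = 1.5 separates (1, 1) from
# the other three points.
w = np.array([-1.5, 1.0, 1.0])

def threshold_unit(x1, x2):
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1

results = {(x1, x2): threshold_unit(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```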

  7. Perceptron Training Rule
     w_i ← w_i + Δw_i, where Δw_i = η (t − o) x_i
     Where:
     • t = c(x) is the target value
     • o is the perceptron output
     • η is a small constant (e.g., 0.1) called the learning rate

  8. Perceptron Training Rule
     Can prove it will converge
     • If the training data is linearly separable
     • and η is sufficiently small
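The rule from the previous slide can be run to convergence on separable data; the following is a minimal sketch (the epoch cap and function name are mine):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until nothing is misclassified."""
    X = np.hstack([np.ones((len(X), 1)), X])  # x0 = 1 column for the bias weight
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != target:
                w += eta * (target - o) * x
                changed = True
        if not changed:  # converged: every example classified correctly
            break
    return w
```

On a linearly separable set such as AND the loop stops after a handful of epochs, as the convergence result promises.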

  9. Gradient Descent
     To understand, consider the simpler linear unit, where
     o = w_0 + w_1 x_1 + ··· + w_n x_n
     Let's learn w_i's that minimize the squared error
     E[w] ≡ (1/2) Σ_{d ∈ D} (t_d − o_d)²
     where D is the set of training examples
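The error measure can be written out directly (a sketch; representing examples as (x, t) pairs is my convention, not the slides'):

```python
import numpy as np

def squared_error(w, examples):
    """E[w] = 1/2 * sum_{d in D} (t_d - o_d)^2 for the linear unit above."""
    total = 0.0
    for x, t in examples:
        o = w[0] + np.dot(w[1:], x)  # o = w0 + w1*x1 + ... + wn*xn
        total += (t - o) ** 2
    return 0.5 * total
```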

  10. Gradient Descent
      [Figure: the error surface E[w] plotted over (w_0, w_1) — a bowl with a single global minimum.]
      Gradient:
      ∇E[w] ≡ [∂E/∂w_0, ∂E/∂w_1, ···, ∂E/∂w_n]
      Training rule:
      Δw = −η ∇E[w]
      i.e., Δw_i = −η ∂E/∂w_i

  11. Gradient Descent
      ∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d − o_d)²
              = (1/2) Σ_d ∂/∂w_i (t_d − o_d)²
              = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
              = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w · x_d)
      ∂E/∂w_i = Σ_d (t_d − o_d)(−x_{i,d})
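The final line can be sanity-checked against a finite-difference gradient; the random data and names below are mine:

```python
import numpy as np

def grad_E(w, X, t):
    """dE/dw_i = sum_d (t_d - o_d)(-x_{i,d}), i.e. grad E = X^T (X w - t).

    Rows of X are the x_d, with x_{0,d} = 1 as the bias input.
    """
    return X.T @ (X @ w - t)

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
t = rng.normal(size=5)
w = rng.normal(size=3)

def E(w):
    return 0.5 * np.sum((t - X @ w) ** 2)

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
# numeric should agree with grad_E(w, X, t) to within finite-difference error
```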

  12. Gradient Descent
      Gradient-Descent(training_examples, η)
      Each training example is a pair ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
      • Initialize each w_i to some small random value
      • Until the termination condition is met, Do
        – Initialize each Δw_i to zero.
        – For each ⟨x, t⟩ in training_examples, Do
          ∗ Input the instance x to the unit and compute the output o
          ∗ For each linear unit weight w_i, Do
            Δw_i ← Δw_i + η (t − o) x_i
        – For each linear unit weight w_i, Do
          w_i ← w_i + Δw_i
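A direct translation of this pseudocode might look as follows (the fixed epoch count stands in for "until the termination condition is met", and the names are mine):

```python
import numpy as np

def gradient_descent(examples, eta=0.05, epochs=1000):
    """Batch delta rule for a linear unit, following the pseudocode above."""
    X = np.hstack([np.ones((len(examples), 1)),          # x0 = 1 bias column
                   np.array([x for x, _ in examples])])
    t = np.array([target for _, target in examples])
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, X.shape[1])             # small random init
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x, target in zip(X, t):
            o = np.dot(w, x)                             # linear unit output
            delta_w += eta * (target - o) * x            # accumulate over batch
        w += delta_w                                     # one update per epoch
    return w
```

On noise-free data generated by t = 2x + 1, this sketch recovers w ≈ (1, 2).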

  13. Summary
      Perceptron training rule guaranteed to succeed if
      • Training examples are linearly separable
      • Learning rate η is sufficiently small
      Linear unit training rule uses gradient descent
      • Guaranteed to converge to the hypothesis with minimum squared error
      • Given a sufficiently small learning rate η
      • Even when training data contains noise
      • Even when training data is not separable by H

  14. Incremental (Stochastic) Gradient Descent
      Batch mode Gradient Descent: Do until satisfied
      1. Compute the gradient ∇E_D[w]
      2. w ← w − η ∇E_D[w]
      Incremental mode Gradient Descent: Do until satisfied
      • For each training example d in D
        1. Compute the gradient ∇E_d[w]
        2. w ← w − η ∇E_d[w]
      E_D[w] ≡ (1/2) Σ_{d ∈ D} (t_d − o_d)²
      E_d[w] ≡ (1/2) (t_d − o_d)²
      Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough
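The incremental variant differs from the batch sketch only in updating the weights after every single example; for instance (names and epoch count are mine):

```python
import numpy as np

def incremental_gradient_descent(examples, eta=0.05, epochs=1000):
    """Incremental delta rule: step along -grad E_d[w] after each example."""
    w = np.zeros(1 + len(examples[0][0]))
    for _ in range(epochs):
        for x, t in examples:
            x = np.concatenate(([1.0], x))  # x0 = 1 for the bias weight
            o = np.dot(w, x)                # linear unit output
            w += eta * (t - o) * x          # update immediately, per example
    return w
```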

  15. Multilayer Networks of Sigmoid Units

  16. Learning Complex Concepts with Gradient Descent
      • Threshold units
        – Complex decision surfaces
        – But cannot differentiate the threshold rule
      • Linear units
        – Differentiable
        – But networks of them can only learn linear functions
      Need a non-linear, differentiable threshold function

  17. Sigmoid Unit
      [Figure: like the perceptron, but the sum net = Σ_{i=0}^{n} w_i x_i is passed through a smooth squashing function: o = σ(net) = 1 / (1 + e^{−net}).]
      σ(x) is the sigmoid function
      σ(x) = 1 / (1 + e^{−x})
      Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
      We can derive gradient descent rules to train
      • One sigmoid unit
      • Multilayer networks of sigmoid units → Backpropagation
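The "nice property" is easy to verify numerically; a small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# dsigma/dx = sigma(x) * (1 - sigma(x)): compare the analytic form
# against a central finite difference at an arbitrary point.
x, eps = 0.5, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
# numeric and analytic should agree to many decimal places
```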

  18. Error Gradient for a Sigmoid Unit
      ∂E/∂w_i = ∂/∂w_i (1/2) Σ_{d ∈ D} (t_d − o_d)²
              = (1/2) Σ_d ∂/∂w_i (t_d − o_d)²
              = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
              = Σ_d (t_d − o_d) (−∂o_d/∂w_i)
              = −Σ_d (t_d − o_d) (∂o_d/∂net_d)(∂net_d/∂w_i)
      But we know:
      ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d)
      ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}
      So:
      ∂E/∂w_i = −Σ_{d ∈ D} (t_d − o_d) o_d (1 − o_d) x_{i,d}
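As with the linear unit, the final expression can be checked against a finite-difference gradient (random data and names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_E_sigmoid(w, X, t):
    """dE/dw_i = -sum_d (t_d - o_d) o_d (1 - o_d) x_{i,d} for a sigmoid unit."""
    o = sigmoid(X @ w)
    return -X.T @ ((t - o) * o * (1 - o))

rng = np.random.default_rng(1)
X = np.hstack([np.ones((4, 1)), rng.normal(size=(4, 2))])
t = rng.uniform(size=4)
w = rng.normal(size=3)

def E(w):
    return 0.5 * np.sum((t - sigmoid(X @ w)) ** 2)

eps = 1e-6
numeric = np.array([(E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
# numeric should agree with grad_E_sigmoid(w, X, t) up to finite-difference error
```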

  19. Backpropagation Algorithm
      Initialize all weights to small random numbers.
      Until satisfied, Do
      • For each training example, Do
        1. Input the training example to the network and compute the network outputs
        2. For each output unit k
           δ_k ← o_k (1 − o_k)(t_k − o_k)
        3. For each hidden unit h
           δ_h ← o_h (1 − o_h) Σ_{k ∈ outputs} w_{h,k} δ_k
        4. Update each network weight w_{i,j}
           w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}
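Specialized to one hidden layer, the algorithm above can be sketched as follows; the matrix layout, names, and hyperparameter defaults are mine, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, T, n_hidden=2, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for a single-hidden-layer sigmoid network."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1] + 1, n_hidden))  # +1 row: bias
    W2 = rng.uniform(-0.5, 0.5, (n_hidden + 1, T.shape[1]))
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.concatenate(([1.0], x))
            h = sigmoid(x1 @ W1)                        # hidden activations
            h1 = np.concatenate(([1.0], h))
            o = sigmoid(h1 @ W2)                        # network outputs
            delta_o = o * (1 - o) * (t - o)             # output-unit errors
            delta_h = h * (1 - h) * (W2[1:] @ delta_o)  # backpropagated errors
            W2 += eta * np.outer(h1, delta_o)           # w <- w + eta*delta*x
            W1 += eta * np.outer(x1, delta_h)
    return W1, W2

def predict(W1, W2, x):
    h = sigmoid(np.concatenate(([1.0], x)) @ W1)
    return sigmoid(np.concatenate(([1.0], h)) @ W2)
```

With two hidden units and a few thousand passes, this sketch reliably learns simple Boolean targets such as OR.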

  20. More on Backpropagation
      • Gradient descent over the entire network weight vector
      • Easily generalized to arbitrary directed graphs
      • Will find a local, not necessarily global, error minimum
        – In practice, often works well (can run multiple times)
      • Often include weight momentum α:
        Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)
      • Minimizes error over training examples
        – Will it generalize well to subsequent examples?
      • Training can take thousands of iterations → slow!
      • Using the network after training is very fast
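The momentum term can be illustrated in isolation; α = 0.9 below is just a common illustrative value:

```python
import numpy as np

def momentum_step(grad_step, prev_step, alpha=0.9):
    """Delta_w(n) = (eta * delta * x) + alpha * Delta_w(n-1).

    grad_step stands for the plain gradient contribution eta * delta * x.
    """
    return grad_step + alpha * prev_step

# With a constant gradient the step grows toward grad_step / (1 - alpha),
# which is how momentum accelerates travel along long, shallow slopes.
step = np.zeros(2)
for _ in range(3):
    step = momentum_step(np.array([1.0, -1.0]), step)
```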

  21. Learning Hidden Layer Representations
      [Figure: a network with eight input units, a small hidden layer, and eight output units.]
      A target function (the 8-bit identity function):
      Input → Output
      10000000 → 10000000
      01000000 → 01000000
      00100000 → 00100000
      00010000 → 00010000
      00001000 → 00001000
      00000100 → 00000100
      00000010 → 00000010
      00000001 → 00000001
      Can this be learned??

  22. Learning Hidden Layer Representations
      Learned hidden layer representation:
      Input → Hidden Values → Output
      10000000 → .89 .04 .08 → 10000000
      01000000 → .01 .11 .88 → 01000000
      00100000 → .01 .97 .27 → 00100000
      00010000 → .99 .97 .71 → 00010000
      00001000 → .03 .05 .02 → 00001000
      00000100 → .22 .99 .99 → 00000100
      00000010 → .80 .01 .98 → 00000010
      00000001 → .60 .94 .01 → 00000001

  23. Training
      [Plot: sum of squared errors for each output unit over 2500 training iterations.]

  24. Training
      [Plot: the hidden unit encoding for input 01000000 over 2500 training iterations.]

  25. Training
      [Plot: weights from the inputs to one hidden unit over 2500 training iterations.]

  26. Convergence of Backpropagation
      Gradient descent to some local minimum
      • Perhaps not the global minimum...
      • Add momentum
      • Use stochastic gradient descent
      • Train multiple nets with different initial weights
      Nature of convergence
      • Initialize weights near zero
      • Therefore, initial networks are near-linear
      • Increasingly non-linear functions become possible as training progresses

  27. Expressive Capabilities of ANNs
      Boolean functions:
      • Every Boolean function can be represented by a network with a single hidden layer
      • but it might require a number of hidden units exponential in the number of inputs
      Continuous functions:
      • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
      • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
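As a concrete instance of the Boolean claim, here is a hand-set network with one hidden layer of threshold units computing XOR, which no single perceptron can represent (the particular weights are illustrative; many choices work):

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

# Hidden unit 1 computes OR, hidden unit 2 computes AND; the output
# unit fires for "OR and not AND", which is exactly XOR.
W1 = np.array([[-0.5, -1.5],   # bias weights for the two hidden units
               [ 1.0,  1.0],
               [ 1.0,  1.0]])
w2 = np.array([-0.5, 1.0, -2.0])  # bias, weight from OR, weight from AND

def xor_net(x1, x2):
    h = step(np.array([1, x1, x2]) @ W1)
    return int(step(np.array([1, h[0], h[1]]) @ w2))
```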

  28. Overfitting in ANNs
      [Plot: error versus number of weight updates (example 1), showing training set error and validation set error over 20000 updates.]
      [Plot: error versus number of weight updates (example 2), showing training set error and validation set error over 6000 updates.]

  29. Neural Nets for Face Recognition
      [Figure: a network with 30×32 image inputs and four outputs (left, straight, right, up), alongside typical input images.]
      90% accurate at learning head pose, and recognizing 1-of-20 faces
