Supervised Learning in Neural Networks


  1. Supervised Learning in Neural Networks
Keith L. Downing
The Norwegian University of Science and Technology (NTNU), Trondheim, Norway
keithd@idi.ntnu.no
March 7, 2011

  2. Supervised Learning
Constant feedback from an instructor, indicating not only right/wrong but also the correct answer for each training case.
Many cases (i.e., input-output pairs) to be learned.
Weights are modified by a complex procedure (back-propagation) based on the output error.
Feed-forward networks with back-propagation learning are the standard implementation; 99% of neural network applications use this.
Typical usage: problems with (a) lots of input-output training data and (b) the goal of learning a mapping (function) from inputs to outputs.
Not biologically plausible, although the cerebellum appears to exhibit some aspects. Still, the result of backprop, a trained ANN that performs some function, can be very useful to neuroscientists as a sufficiency proof.

  3. Backpropagation Overview
Training/test cases: {(d1, r1), (d2, r2), (d3, r3), ...}, where each d is an input and each r is the desired response.
[Diagram: an encoder-decoder network maps input d3 to actual output r*; the error E = r3 − r* is computed and ∂E/∂W is fed back to the weights.]
Feed-forward phase: inputs are sent through the ANN to compute outputs.
Feedback phase: the error is passed back from the output layer toward the input layer and used to update the weights along the way.

  4. Training -vs- Testing Cases
[Diagram: training cases are presented to the neural net N times, with learning; test cases are presented once, without learning.]
Generalization: correctly handling test cases (that the ANN has not been trained on).
Over-training: the weights become so fine-tuned to the training cases that generalization suffers, i.e., failure on many test cases.

  5. Widrow-Hoff (a.k.a. Delta) Rule
[Diagram: a single node N receives input X through weight w and produces output Y; T is the target output value and δ the error.]
δ = T − Y
Δw = η δ X
Delta (δ) = error; eta (η) = learning rate.
Goal: change w so as to reduce |δ|.
Intuition: if δ > 0, we want to decrease it, so we must increase Y. Thus we must increase the sum of weighted inputs to N, which we do by increasing w if X is positive (and decreasing w if X is negative). The reasoning for δ < 0 is symmetric.
This assumes the derivative of N's transfer function is everywhere non-negative.
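To make the rule concrete, here is a minimal Python sketch of the Widrow-Hoff update for a single linear node. The function name, learning rate, and example numbers are illustrative choices, not part of the slide.

```python
import numpy as np

def delta_rule_step(w, x, t, eta=0.1):
    """One Widrow-Hoff update for a single linear node.

    w: weight vector, x: input vector, t: target output, eta: learning rate.
    Returns the updated weight vector.
    """
    y = np.dot(w, x)             # node output Y = sum of weighted inputs
    delta = t - y                # error delta = T - Y
    return w + eta * delta * x   # increase w where x > 0, decrease where x < 0

# Example: repeatedly pushing the output toward a target of 1.0
w = np.array([0.2, -0.4])
x = np.array([1.0, 0.5])
for _ in range(20):
    w = delta_rule_step(w, x, 1.0)
print(np.dot(w, x))  # approaches 1.0
```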

  6. Gradient Descent
Goal: minimize the total error across all output nodes.
Method: modify weights throughout the network (i.e., at all levels) to follow the route of steepest descent in error space.
[Plot: error E as a function of the weight vector W, with the minimum marked.]
Δw_ij = −η ∂E_i/∂w_ij
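A tiny sketch may help fix the idea: gradient descent on a one-weight error surface E(w) = (w − 3)², where the update Δw = −η dE/dw walks downhill toward the minimum. The surface and all constants are my own illustrative choices.

```python
# Gradient descent on a simple one-weight error surface E(w) = (w - 3)^2.
eta = 0.1
w = 0.0
for _ in range(50):
    dE_dw = 2 * (w - 3)   # analytic gradient of E(w)
    w -= eta * dE_dw      # delta w = -eta * dE/dw
print(w)  # close to 3, the minimizer of E
```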

  7. Computing ∂E_i/∂w_ij
[Diagram: node i with transfer function f_T receives inputs x_1d ... x_nd through weights w_i1 ... w_in, computes sum_id and output o_id; t_id is the target and E_id the error.]
Sum of Squared Errors (SSE):
E_i = (1/2) Σ_{d∈D} (t_id − o_id)²
∂E_i/∂w_ij = (1/2) Σ_{d∈D} 2 (t_id − o_id) ∂(t_id − o_id)/∂w_ij = Σ_{d∈D} (t_id − o_id) ∂(−o_id)/∂w_ij
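A short sketch of the SSE for a single output node over a dataset, assuming NumPy; the function name and sample numbers are illustrative.

```python
import numpy as np

def sse(targets, outputs):
    """E_i = 1/2 * sum over cases d of (t_id - o_id)^2."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return 0.5 * np.sum((targets - outputs) ** 2)

print(sse([1.0, 0.0, 1.0], [0.8, 0.3, 0.9]))  # 0.5*(0.04 + 0.09 + 0.01) = 0.07
```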

  8. Computing ∂(−o_id)/∂w_ij
[Diagram: same single-node picture as the previous slide.]
Since the output is f_T applied to the sum of weighted inputs:
∂E_i/∂w_ij = Σ_{d∈D} (t_id − o_id) ∂(−f_T(sum_id))/∂w_ij
where
sum_id = Σ_{k=1..n} w_ik x_kd
Using the chain rule, ∂f(g(x))/∂x = (∂f/∂g(x)) × (∂g(x)/∂x):
∂f_T(sum_id)/∂w_ij = (∂f_T(sum_id)/∂sum_id) × (∂sum_id/∂w_ij) = (∂f_T(sum_id)/∂sum_id) × x_jd

  9. Computing ∂sum_id/∂w_ij - Easy!!
∂sum_id/∂w_ij = ∂(Σ_{k=1..n} w_ik x_kd)/∂w_ij
              = ∂(w_i1 x_1d + w_i2 x_2d + ... + w_ij x_jd + ... + w_in x_nd)/∂w_ij
              = ∂(w_i1 x_1d)/∂w_ij + ∂(w_i2 x_2d)/∂w_ij + ... + ∂(w_ij x_jd)/∂w_ij + ... + ∂(w_in x_nd)/∂w_ij
              = 0 + 0 + ... + x_jd + ... + 0
              = x_jd

  10. Computing ∂f_T(sum_id)/∂sum_id - Harder for some f_T
f_T = identity function: f_T(sum_id) = sum_id
    ∂f_T(sum_id)/∂sum_id = 1
    Thus: ∂f_T(sum_id)/∂w_ij = (∂f_T(sum_id)/∂sum_id) × (∂sum_id/∂w_ij) = 1 × x_jd = x_jd
f_T = sigmoid: f_T(sum_id) = 1 / (1 + e^(−sum_id))
    ∂f_T(sum_id)/∂sum_id = o_id (1 − o_id)
    Thus: ∂f_T(sum_id)/∂w_ij = (∂f_T(sum_id)/∂sum_id) × (∂sum_id/∂w_ij) = o_id (1 − o_id) x_jd
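The two transfer functions and their derivatives translate directly to code. A minimal sketch follows; the finite-difference check at the end is just an illustrative sanity test of the sigmoid derivative, not something on the slide.

```python
import numpy as np

def identity(s):
    return s

def identity_deriv(s):
    return np.ones_like(s)            # d f_T / d sum = 1

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_deriv(s):
    o = sigmoid(s)
    return o * (1.0 - o)              # d f_T / d sum = o(1 - o)

# Finite-difference check of the sigmoid derivative at sum = 0.5
s, h = 0.5, 1e-6
print(sigmoid_deriv(s), (sigmoid(s + h) - sigmoid(s - h)) / (2 * h))
```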

  11. The only non-trivial calculation
∂f_T(sum_id)/∂sum_id = ∂(1 + e^(−sum_id))^(−1)/∂sum_id
                     = (−1)(1 + e^(−sum_id))^(−2) × ∂(1 + e^(−sum_id))/∂sum_id
                     = (−1)(−1) e^(−sum_id) (1 + e^(−sum_id))^(−2)
                     = e^(−sum_id) / (1 + e^(−sum_id))²
But notice that:
e^(−sum_id) / (1 + e^(−sum_id))² = f_T(sum_id)(1 − f_T(sum_id)) = o_id (1 − o_id)

  12. Putting it all together
∂E_i/∂w_ij = Σ_{d∈D} (t_id − o_id) ∂(−f_T(sum_id))/∂w_ij = −Σ_{d∈D} (t_id − o_id) (∂f_T(sum_id)/∂sum_id) × (∂sum_id/∂w_ij)
So for f_T = identity:
∂E_i/∂w_ij = −Σ_{d∈D} (t_id − o_id) x_jd
and for f_T = sigmoid:
∂E_i/∂w_ij = −Σ_{d∈D} (t_id − o_id) o_id (1 − o_id) x_jd
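A sketch of the full gradient for a single sigmoid output node, summing over all cases exactly as in the formula above. The function name and the toy data are assumptions made for illustration.

```python
import numpy as np

def grad_Ei_wij(x, t, w, j):
    """dE_i/dw_ij for a single sigmoid output node i.

    x: (num_cases, num_inputs) inputs, t: (num_cases,) targets,
    w: (num_inputs,) weights into node i, j: index of the incoming weight.
    """
    s = x @ w                      # sum_id for every case d
    o = 1.0 / (1.0 + np.exp(-s))   # o_id = sigmoid(sum_id)
    return -np.sum((t - o) * o * (1.0 - o) * x[:, j])

x = np.array([[0.0, 1.0], [1.0, 1.0]])
t = np.array([0.0, 1.0])
w = np.array([0.5, -0.5])
print(grad_Ei_wij(x, t, w, 0))
```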

  13. Weight Updates (f_T = Sigmoid)
Batch: update weights after each training epoch.
Δw_ij = −η ∂E_i/∂w_ij = η Σ_{d∈D} (t_id − o_id) o_id (1 − o_id) x_jd
The weight changes are actually computed after each training case, but w_ij is not updated until the epoch's end.
Incremental: update weights after each training case.
Δw_ij = −η ∂E_i/∂w_ij = η (t_id − o_id) o_id (1 − o_id) x_jd
A lower learning rate (η) is recommended here than for the batch method. The result can depend on the case-presentation order, so randomly reorder the cases after each epoch.
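The two update schedules can be contrasted in a few lines; here is a minimal sketch for a single sigmoid node, assuming NumPy. The helper names, the learning rate, and shuffling via a NumPy random generator are my choices, not the slide's.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def batch_epoch(w, X, T, eta):
    """Accumulate the weight changes over every case, apply once at epoch's end."""
    O = sigmoid(X @ w)
    dw = eta * np.sum(((T - O) * O * (1 - O))[:, None] * X, axis=0)
    return w + dw

def incremental_epoch(w, X, T, eta, rng):
    """Update w after each case, in a freshly shuffled order."""
    for d in rng.permutation(len(X)):
        o = sigmoid(X[d] @ w)
        w = w + eta * (T[d] - o) * o * (1 - o) * X[d]
    return w

rng = np.random.default_rng(0)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([1.0, 1.0, 0.0])
w = rng.uniform(-1, 1, 2)
for _ in range(100):
    w = incremental_epoch(w, X, T, 0.5, rng)
```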

  14. Backpropagation in Multi-Layered Neural Networks
[Diagram: node j with output o_jd feeds downstream nodes 1..n through weights w_1j ... w_nj; each downstream node k contributes ∂E_d/∂sum_kd × ∂sum_kd/∂o_jd to the error gradient.]
For each node j and each training case d, backpropagation computes an error term:
δ_jd = −∂E_d/∂sum_jd
by calculating the influence of sum_jd along each connection from node j to the next downstream layer.

  15. Computing δ_jd
[Diagram: same network fragment as on the previous slide, with the partial derivatives labelled along each path from node j to downstream nodes 1..n.]
Along the upper path (through downstream node 1), the contribution to ∂E_d/∂sum_jd is:
(∂o_jd/∂sum_jd) × (∂sum_1d/∂o_jd) × (∂E_d/∂sum_1d)
So, summing along all paths:
∂E_d/∂sum_jd = (∂o_jd/∂sum_jd) Σ_{k=1..n} (∂sum_kd/∂o_jd) × (∂E_d/∂sum_kd)

  16. Computing δ_jd (continued)
Just as before, most terms in the derivative of the sum are 0, so:
∂sum_kd/∂o_jd = w_kj
Assuming f_T is a sigmoid:
∂o_jd/∂sum_jd = ∂f_T(sum_jd)/∂sum_jd = o_jd (1 − o_jd)
Thus:
δ_jd = −∂E_d/∂sum_jd = −(∂o_jd/∂sum_jd) Σ_{k=1..n} (∂sum_kd/∂o_jd) × (∂E_d/∂sum_kd)
     = −o_jd (1 − o_jd) Σ_{k=1..n} w_kj (−δ_kd)
     = o_jd (1 − o_jd) Σ_{k=1..n} w_kj δ_kd

  17. Computing δ_jd (continued)
Note that δ_jd is defined recursively in terms of the δ values in the next downstream layer:
δ_jd = o_jd (1 − o_jd) Σ_{k=1..n} w_kj δ_kd
So all δ values in the network can be computed by moving backwards, one layer at a time.
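A sketch of the backward sweep implied by this recursion, computing δ for every sigmoid layer given the output-layer δ values. The data layout (a list of activation vectors and a list of weight matrices) is an assumed convention, not the slide's.

```python
import numpy as np

def backward_deltas(activations, weights, delta_out):
    """Propagate delta backwards through sigmoid layers.

    activations: list of per-layer output vectors o (input layer first),
    weights:     list of matrices, weights[l][i, j] = w_ij from node j in
                 layer l to node i in layer l+1,
    delta_out:   delta vector for the output layer.
    Returns a list of delta vectors, one per non-input layer.
    """
    deltas = [delta_out]
    # walk backwards over the hidden layers
    for l in range(len(weights) - 1, 0, -1):
        o = activations[l]                                  # outputs of hidden layer l
        downstream = deltas[0]                              # delta of layer l+1
        delta = o * (1 - o) * (weights[l].T @ downstream)   # o(1-o) * sum_k w_kj delta_k
        deltas.insert(0, delta)
    return deltas
```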

  18. Computing ∂E_d/∂w_ij from δ_jd - Easy!!
[Diagram: node j's output o_jd feeds node i through weight w_ij, contributing to sum_id.]
The only effect of w_ij upon the error is via its effect upon sum_id, which is:
∂sum_id/∂w_ij = o_jd
So:
∂E_d/∂w_ij = (∂sum_id/∂w_ij) × (∂E_d/∂sum_id) = (∂sum_id/∂w_ij) × (−δ_id) = −o_jd δ_id

  19. Computing Δw_ij
Given an error term δ_id (for node i on training case d), the update of w_ij for all nodes j that feed into i is:
Δw_ij = −η ∂E_d/∂w_ij = −η (−o_jd δ_id) = η δ_id o_jd
So given δ_id, you can easily calculate Δw_ij for all incoming arcs.
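With the δ values of a layer in hand, the updates for all of its incoming arcs are just an outer product with the previous layer's outputs. A minimal sketch; the function name and example values are illustrative.

```python
import numpy as np

def weight_updates(delta_i, o_prev, eta=0.5):
    """delta_w_ij = eta * delta_i * o_j for every incoming arc j of every node i.

    delta_i: delta vector for a layer, o_prev: outputs of the previous layer.
    Returns a matrix shaped like the layer's weight matrix.
    """
    return eta * np.outer(delta_i, o_prev)

print(weight_updates(np.array([0.2, -0.1]), np.array([1.0, 0.5, 0.0])))
```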

  20. Learning XOR
[Plots: sum-squared error vs. epoch (0 to 1000) for three separate training runs.]
Epoch = all 4 entries of the XOR truth table.
2 (inputs) × 2 (hidden) × 1 (output) network.
Random initialization of all weights in [−1, 1].
Not linearly separable, so it takes a while!
Each run is different due to the random weight initialization.
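A self-contained sketch of the whole procedure on XOR: a 2-2-1 sigmoid network trained incrementally with weights initialized in [−1, 1]. Bias weights, the learning rate of 0.5, the epoch count, and the random seed are my additions, not the slide's; as the slide notes, runs differ, and some initializations converge slowly or get stuck.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])              # XOR truth table

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# 2-2-1 network; weights (and bias weights, an assumption beyond the slide)
# initialized uniformly in [-1, 1]
W1 = rng.uniform(-1, 1, (2, 2)); b1 = rng.uniform(-1, 1, 2)
W2 = rng.uniform(-1, 1, (1, 2)); b2 = rng.uniform(-1, 1, 1)
eta = 0.5

for epoch in range(5000):
    sse = 0.0
    for d in rng.permutation(4):                # incremental, shuffled order
        h = sigmoid(W1 @ X[d] + b1)             # hidden-layer outputs
        o = sigmoid(W2 @ h + b2)                # output-layer output
        sse += 0.5 * np.sum((T[d] - o) ** 2)
        delta_o = (T[d] - o) * o * (1 - o)          # output delta
        delta_h = h * (1 - h) * (W2.T @ delta_o)    # hidden delta (backpropagated)
        W2 += eta * np.outer(delta_o, h); b2 += eta * delta_o
        W1 += eta * np.outer(delta_h, X[d]); b1 += eta * delta_h
    if epoch % 1000 == 0:
        print(epoch, sse)

print(sigmoid(W2 @ sigmoid(W1 @ X.T + b1[:, None]) + b2))  # roughly [0 1 1 0] on most runs
```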

  21. Learning to Classify Wines
Class   Properties
1       14.23   1.71   2.43   15.6   127   2.8    ...
1       13.2    1.78   2.14   11.2   100   2.65   ...
2       13.11   1.01   1.7    15     78    2.98   ...
3       13.17   2.59   2.37   20     120   1.65   ...
...
[Diagram: the 13 wine properties feed a network with 13 input nodes, a hidden layer of 5 nodes, and 1 output node giving the wine's class.]
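A sketch of the 13-5-1 network on the standard wine data, assuming the slide's dataset is the UCI wine set that ships with scikit-learn. Scaling the 13 properties to [0, 1] and encoding the three classes as 0, 0.5, and 1 for the single sigmoid output are illustrative choices, not the slide's.

```python
import numpy as np
from sklearn.datasets import load_wine   # 178 wines, 13 properties, 3 classes

data = load_wine()
lo, hi = data.data.min(axis=0), data.data.max(axis=0)
X = (data.data - lo) / (hi - lo)          # scale each property to [0, 1]
T = data.target / 2.0                     # classes 0, 1, 2 mapped to 0, 0.5, 1

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, (5, 13)); b1 = rng.uniform(-1, 1, 5)   # 13 inputs -> 5 hidden
W2 = rng.uniform(-1, 1, (1, 5));  b2 = rng.uniform(-1, 1, 1)   # 5 hidden -> 1 output
eta = 0.1

for epoch in range(200):
    for d in rng.permutation(len(X)):     # incremental updates, shuffled each epoch
        h = sigmoid(W1 @ X[d] + b1)
        o = sigmoid(W2 @ h + b2)
        delta_o = (T[d] - o) * o * (1 - o)
        delta_h = h * (1 - h) * (W2.T @ delta_o)
        W2 += eta * np.outer(delta_o, h); b2 += eta * delta_o
        W1 += eta * np.outer(delta_h, X[d]); b1 += eta * delta_h

# Fraction of training cases recovered by this crude single-output encoding
H = sigmoid(W1 @ X.T + b1[:, None])
O = sigmoid(W2 @ H + b2[:, None])
pred = np.rint(O.ravel() * 2)
print("training accuracy:", np.mean(pred == data.target))
```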
