Supervised Learning in Neural Networks

Keith L. Downing
The Norwegian University of Science and Technology (NTNU)
Trondheim, Norway
keithd@idi.ntnu.no

March 7, 2011
Supervised Learning

- Constant feedback from an instructor, indicating not only right/wrong, but also the correct answer for each training case.
- Many cases (i.e., input-output pairs) to be learned.
- Weights are modified by a complex procedure (back-propagation) based on output error.
- Feed-forward networks with back-propagation learning are the standard implementation; 99% of neural network applications use this.
- Typical usage: problems with (a) lots of input-output training data, and (b) the goal of a mapping (function) from inputs to outputs.
- Not biologically plausible, although the cerebellum appears to exhibit some aspects. However, the result of backprop (a trained ANN that performs some function) can be very useful to neuroscientists as a sufficiency proof.
Backpropagation Overview

Training/test cases: {(d1, r1), (d2, r2), (d3, r3), ...}

[Figure: training case (d3, r3): d3 is encoded as network input, the decoded network output is r*, the error is E = r3 - r*, and the gradient dE/dW drives the weight updates.]

- Feed-forward phase: inputs are sent through the ANN to compute outputs.
- Feedback phase: error is passed back from the output to the input layers and used to update weights along the way.
Training -vs- Testing Cases

[Figure: training cases are presented N times, with learning; test cases are presented 1 time, without learning.]

- Generalization: correctly handling test cases (that the ANN has not been trained on).
- Over-training: weights become so fine-tuned to the training cases that generalization suffers, i.e., failure on many test cases.
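To make the protocol concrete, here is a minimal Python sketch (the toy dataset, the 30/10 split, and the single linear node trained with the delta rule are all illustrative choices, not from the slides): training cases are presented for many epochs with learning, then the held-out test cases are run once with learning switched off.

import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: the target is a noisy linear function of two inputs.
X = rng.uniform(-1, 1, size=(40, 2))
T = X @ np.array([0.7, -0.3]) + rng.normal(0, 0.05, size=40)
train_X, train_T = X[:30], T[:30]     # training cases: presented N times, with learning
test_X, test_T = X[30:], T[30:]       # test cases: presented one time, without learning

w, eta = np.zeros(2), 0.1
for epoch in range(50):               # N = 50 training epochs
    for x, t in zip(train_X, train_T):
        y = w @ x                     # single linear node
        w += eta * (t - y) * x        # learning is on (delta rule, next slide)

test_sse = 0.5 * np.sum((test_T - test_X @ w) ** 2)   # no weight changes during testing
print("test SSE:", test_sse)          # a low value indicates good generalization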
Widrow-Hoff (a.k.a. Delta) Rule

[Figure: input X feeds node N through weight w; N computes the sum S of its weighted inputs and produces output Y via its transfer function; T is the target output value.]

\[ \delta = T - Y \]
\[ \Delta w = \eta\,\delta\,X \]
Delta ($\delta$) = error; eta ($\eta$) = learning rate.

- Goal: change w so as to reduce $|\delta|$.
- Intuition: if $\delta > 0$, we want to decrease it, so we must increase Y. Thus, we must increase the sum of weighted inputs to N, and we do that by increasing w if X is positive (decreasing w if X is negative). Similarly for $\delta < 0$.
- Assumes the derivative of N's transfer function is everywhere non-negative.
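A minimal Python sketch of the delta rule for a single node with an identity transfer function (the names w, x, target, and eta are illustrative):

import numpy as np

def delta_rule_step(w, x, target, eta=0.1):
    """One delta-rule update: w <- w + eta * (T - Y) * X."""
    y = np.dot(w, x)          # node output Y = sum of weighted inputs
    delta = target - y        # error term delta = T - Y
    return w + eta * delta * x

# Example: learn weights that map x = [1, 2] to the target 3.
w = np.zeros(2)
for _ in range(100):
    w = delta_rule_step(w, np.array([1.0, 2.0]), 3.0)
print(w, np.dot(w, [1.0, 2.0]))   # the output approaches the target 3.0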
Gradient Descent

- Goal: minimize the total error across all output nodes.
- Method: modify weights throughout the network (i.e., at all levels) so as to follow the route of steepest descent in error space.

[Figure: error E plotted against the weight vector W; gradient descent steps toward the minimum.]

\[ \Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}} \]
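A minimal sketch of gradient descent on a single weight, assuming a squared-error surface E(w) = (t - w*x)^2 / 2 for one training pair; the numbers are illustrative:

eta = 0.1
x, t = 2.0, 1.0
w = 5.0                       # arbitrary starting weight
for _ in range(50):
    grad = -(t - w * x) * x   # dE/dw for the squared error above
    w -= eta * grad           # steepest-descent step: delta_w = -eta * dE/dw
print(w, w * x)               # w*x converges toward the target t = 1.0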
Computing $\frac{\partial E_i}{\partial w_{ij}}$

[Figure: output node i receives inputs $x_{1d}, \dots, x_{nd}$ through weights $w_{i1}, \dots, w_{in}$, forming $sum_{id}$; the transfer function $f_T$ gives the output $o_{id}$, which is compared to the target $t_{id}$ to give the error $E_{id}$.]

Sum of Squared Errors (SSE):
\[ E_i = \frac{1}{2} \sum_{d \in D} (t_{id} - o_{id})^2 \]
\[ \frac{\partial E_i}{\partial w_{ij}} = \frac{1}{2} \sum_{d \in D} 2\,(t_{id} - o_{id}) \frac{\partial (t_{id} - o_{id})}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id}) \frac{\partial (-o_{id})}{\partial w_{ij}} \]
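A one-function sketch of the SSE measure (the argument names are illustrative):

import numpy as np

def sse(targets, outputs):
    """E_i = 1/2 * sum over cases d of (t_id - o_id)^2."""
    targets, outputs = np.asarray(targets), np.asarray(outputs)
    return 0.5 * np.sum((targets - outputs) ** 2)

print(sse([1.0, 0.0, 1.0], [0.8, 0.2, 0.9]))   # 0.5 * (0.04 + 0.04 + 0.01) = 0.045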
Computing $\frac{\partial (-o_{id})}{\partial w_{ij}}$

[Figure: the same output node i as on the previous slide.]

Since output = f(sum of weighted inputs):
\[ \frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id}) \frac{\partial (-f_T(sum_{id}))}{\partial w_{ij}} \]
where
\[ sum_{id} = \sum_{k=1}^{n} w_{ik}\,x_{kd} \]
Using the chain rule, $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g(x)} \times \frac{\partial g(x)}{\partial x}$:
\[ \frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times x_{jd} \]
Computing $\frac{\partial sum_{id}}{\partial w_{ij}}$ - Easy!!

\[ \frac{\partial sum_{id}}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k=1}^{n} w_{ik}\,x_{kd} = \frac{\partial}{\partial w_{ij}} \left( w_{i1} x_{1d} + w_{i2} x_{2d} + \dots + w_{ij} x_{jd} + \dots + w_{in} x_{nd} \right) \]
\[ = \frac{\partial (w_{i1} x_{1d})}{\partial w_{ij}} + \frac{\partial (w_{i2} x_{2d})}{\partial w_{ij}} + \dots + \frac{\partial (w_{ij} x_{jd})}{\partial w_{ij}} + \dots + \frac{\partial (w_{in} x_{nd})}{\partial w_{ij}} \]
\[ = 0 + 0 + \dots + x_{jd} + \dots + 0 = x_{jd} \]
Computing $\frac{\partial f_T(sum_{id})}{\partial sum_{id}}$ - Harder for some $f_T$

$f_T$ = Identity function: $f_T(sum_{id}) = sum_{id}$
\[ \frac{\partial f_T(sum_{id})}{\partial sum_{id}} = 1 \]
Thus:
\[ \frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} = 1 \times x_{jd} = x_{jd} \]

$f_T$ = Sigmoid: $f_T(sum_{id}) = \frac{1}{1 + e^{-sum_{id}}}$
\[ \frac{\partial f_T(sum_{id})}{\partial sum_{id}} = o_{id}(1 - o_{id}) \]
Thus:
\[ \frac{\partial f_T(sum_{id})}{\partial w_{ij}} = \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} = o_{id}(1 - o_{id}) \times x_{jd} = o_{id}(1 - o_{id})\,x_{jd} \]
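A short sketch of the sigmoid and its derivative expressed in terms of the node output, with a finite-difference check (names are illustrative):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_deriv_from_output(o):
    """d f_T / d sum = o * (1 - o), where o = sigmoid(sum)."""
    return o * (1.0 - o)

s = 0.3
o = sigmoid(s)
analytic = sigmoid_deriv_from_output(o)
numeric = (sigmoid(s + 1e-6) - sigmoid(s - 1e-6)) / 2e-6   # finite-difference check
print(analytic, numeric)   # the two values agree to several decimal places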
The only non-trivial calculation

\[ \frac{\partial f_T(sum_{id})}{\partial sum_{id}} = \frac{\partial (1 + e^{-sum_{id}})^{-1}}{\partial sum_{id}} = (-1)(1 + e^{-sum_{id}})^{-2}\,\frac{\partial (1 + e^{-sum_{id}})}{\partial sum_{id}} \]
\[ = (-1)(-1)\,e^{-sum_{id}}\,(1 + e^{-sum_{id}})^{-2} = \frac{e^{-sum_{id}}}{(1 + e^{-sum_{id}})^2} \]
But notice that:
\[ \frac{e^{-sum_{id}}}{(1 + e^{-sum_{id}})^2} = f_T(sum_{id})\,(1 - f_T(sum_{id})) = o_{id}(1 - o_{id}) \]
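The last identity, stated but not derived on the slide, follows by factoring the fraction (writing $s$ for $sum_{id}$):
\[ f_T(s)\,(1 - f_T(s)) = \frac{1}{1+e^{-s}}\left(1 - \frac{1}{1+e^{-s}}\right) = \frac{1}{1+e^{-s}} \cdot \frac{e^{-s}}{1+e^{-s}} = \frac{e^{-s}}{(1+e^{-s})^2}. \]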
Putting it all together

\[ \frac{\partial E_i}{\partial w_{ij}} = \sum_{d \in D} (t_{id} - o_{id}) \frac{\partial (-f_T(sum_{id}))}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id}) \frac{\partial f_T(sum_{id})}{\partial sum_{id}} \times \frac{\partial sum_{id}}{\partial w_{ij}} \]
So for $f_T$ = Identity:
\[ \frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,x_{jd} \]
and for $f_T$ = Sigmoid:
\[ \frac{\partial E_i}{\partial w_{ij}} = -\sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd} \]
Weight Updates ($f_T$ = Sigmoid)

Batch: update weights after each training epoch
\[ \Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}} = \eta \sum_{d \in D} (t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd} \]
The weight changes are actually computed after each training case, but $w_{ij}$ is not updated until the end of the epoch.

Incremental: update weights after each training case
\[ \Delta w_{ij} = -\eta \frac{\partial E_i}{\partial w_{ij}} = \eta\,(t_{id} - o_{id})\,o_{id}(1 - o_{id})\,x_{jd} \]
- A lower learning rate ($\eta$) is recommended here than for the batch method.
- Results can depend on case-presentation order, so randomly reorder the cases after each epoch.
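A minimal sketch contrasting the two update schemes for a single sigmoid output node (the toy dataset and names are illustrative):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Toy dataset D: each row of X is a case x_d, each entry of T its target t_d.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([1.0, 0.0, 1.0])

def batch_epoch(w, eta=0.5):
    """Accumulate the weight changes over all cases, apply them at the epoch's end."""
    dw = np.zeros_like(w)
    for x, t in zip(X, T):
        o = sigmoid(np.dot(w, x))
        dw += eta * (t - o) * o * (1 - o) * x
    return w + dw

def incremental_epoch(w, eta=0.1, rng=np.random.default_rng(0)):
    """Update w after every case, with the cases in a random order each epoch."""
    for idx in rng.permutation(len(X)):
        x, t = X[idx], T[idx]
        o = sigmoid(np.dot(w, x))
        w = w + eta * (t - o) * o * (1 - o) * x
    return w

w_batch, w_incr = np.zeros(2), np.zeros(2)
for _ in range(200):
    w_batch = batch_epoch(w_batch)
    w_incr = incremental_epoch(w_incr)
print("batch:", w_batch, "incremental:", w_incr)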
Backpropagation in Multi-Layered Neural Networks

[Figure: node j, with input $sum_{jd}$ and output $o_{jd}$, feeds downstream nodes 1..n via weights $w_{1j}, \dots, w_{nj}$; the error $E_d$ depends on $o_{jd}$ through each downstream $sum_{kd}$.]

For each node j and each training case d, backpropagation computes an error term:
\[ \delta_{jd} = -\frac{\partial E_d}{\partial sum_{jd}} \]
by calculating the influence of $sum_{jd}$ along each connection from node j to the next downstream layer.
Computing $\delta_{jd}$

[Figure: the same network fragment as on the previous slide.]

Along the upper path (through downstream node 1), the contribution to $\frac{\partial E_d}{\partial sum_{jd}}$ is:
\[ \frac{\partial o_{jd}}{\partial sum_{jd}} \times \frac{\partial sum_{1d}}{\partial o_{jd}} \times \frac{\partial E_d}{\partial sum_{1d}} \]
So, summing along all paths:
\[ \frac{\partial E_d}{\partial sum_{jd}} = \frac{\partial o_{jd}}{\partial sum_{jd}} \sum_{k=1}^{n} \frac{\partial sum_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial sum_{kd}} \]
Computing $\delta_{jd}$

Just as before, most terms in the derivative of the sum are 0, so:
\[ \frac{\partial sum_{kd}}{\partial o_{jd}} = w_{kj} \]
Assuming $f_T$ = a sigmoid:
\[ \frac{\partial o_{jd}}{\partial sum_{jd}} = \frac{\partial f_T(sum_{jd})}{\partial sum_{jd}} = o_{jd}(1 - o_{jd}) \]
Thus:
\[ \delta_{jd} = -\frac{\partial E_d}{\partial sum_{jd}} = -\frac{\partial o_{jd}}{\partial sum_{jd}} \sum_{k=1}^{n} \frac{\partial sum_{kd}}{\partial o_{jd}}\,\frac{\partial E_d}{\partial sum_{kd}} = -o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}(-\delta_{kd}) = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd} \]
Computing $\delta_{jd}$

Note that $\delta_{jd}$ is defined recursively in terms of the $\delta$ values in the next downstream layer:
\[ \delta_{jd} = o_{jd}(1 - o_{jd}) \sum_{k=1}^{n} w_{kj}\,\delta_{kd} \]
So all $\delta$ values in the network can be computed by moving backwards, one layer at a time.
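A minimal sketch of this recursion for one hidden layer of sigmoid nodes (shapes and names are illustrative; W_down[k, j] holds the weight w_kj from hidden node j to downstream node k):

import numpy as np

def hidden_deltas(hidden_outputs, W_down, downstream_deltas):
    """delta_j = o_j * (1 - o_j) * sum_k w_kj * delta_k, for one training case."""
    backprop_sum = W_down.T @ downstream_deltas          # sum_k w_kj * delta_k for each j
    return hidden_outputs * (1.0 - hidden_outputs) * backprop_sum

# Example: 2 hidden nodes feed 1 output node whose delta has already been computed.
o_hidden = np.array([0.6, 0.3])
W_down = np.array([[0.5, -0.4]])        # 1 downstream node x 2 hidden nodes
delta_out = np.array([0.12])            # e.g. (t - o) * o * (1 - o) at the output
print(hidden_deltas(o_hidden, W_down, delta_out))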
Computing $\frac{\partial E_d}{\partial w_{ij}}$ from $\delta_{jd}$ - Easy!!

[Figure: node j's output $o_{jd}$ feeds node i through weight $w_{ij}$, contributing to $sum_{id}$.]

The only effect of $w_{ij}$ upon the error is via its effect upon $sum_{id}$, which is:
\[ \frac{\partial sum_{id}}{\partial w_{ij}} = o_{jd} \]
So:
\[ \frac{\partial E_d}{\partial w_{ij}} = \frac{\partial sum_{id}}{\partial w_{ij}} \times \frac{\partial E_d}{\partial sum_{id}} = \frac{\partial sum_{id}}{\partial w_{ij}} \times (-\delta_{id}) = -o_{jd}\,\delta_{id} \]
Computing $\Delta w_{ij}$

Given an error term $\delta_{id}$ (for node i on training case d), the update of $w_{ij}$ for all nodes j that feed into i is:
\[ \Delta w_{ij} = -\eta \frac{\partial E_d}{\partial w_{ij}} = -\eta\,(-o_{jd}\,\delta_{id}) = \eta\,\delta_{id}\,o_{jd} \]
So given $\delta_i$, you can easily calculate $\Delta w_{ij}$ for all incoming arcs.
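A minimal sketch of this update for all incoming arcs at once, using an outer product (names are illustrative):

import numpy as np

def weight_updates(deltas_i, outputs_j, eta=0.1):
    """Return the matrix of delta_w[i, j] = eta * deltas_i[i] * outputs_j[j]."""
    return eta * np.outer(deltas_i, outputs_j)

deltas_i = np.array([0.12, -0.05])       # error terms of the receiving layer
outputs_j = np.array([0.6, 0.3, 1.0])    # outputs of the feeding layer
W = np.zeros((2, 3))
W += weight_updates(deltas_i, outputs_j)
print(W)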
Learning XOR

[Figure: three training runs, each plotting sum-squared error (0.0 to 1.2) against epoch (0 to 1000).]

- Epoch = all 4 entries of the XOR truth table.
- 2 (inputs) x 2 (hidden) x 1 (output) network.
- Random initialization of all weights in [-1, 1].
- Not linearly separable, so it takes a while!
- Each run is different due to the random weight initialization.
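A minimal end-to-end sketch of incremental backpropagation on XOR with a 2-2-1 sigmoid network; the bias inputs, learning rate, and epoch count are my additions, so this is not the exact code behind the plots on the slide:

import numpy as np

rng = np.random.default_rng()
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# The four XOR cases; one epoch = one pass through all of them.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

# Random weights in [-1, 1]; the extra column in each matrix holds bias weights.
W_hid = rng.uniform(-1, 1, size=(2, 3))   # 2 hidden nodes, 2 inputs + bias
W_out = rng.uniform(-1, 1, size=(1, 3))   # 1 output node, 2 hidden + bias
eta = 0.5

for epoch in range(5000):
    sse = 0.0
    for idx in rng.permutation(4):        # randomly reorder the cases each epoch
        x = np.append(X[idx], 1.0)        # inputs plus bias
        o_hid = sigmoid(W_hid @ x)
        h = np.append(o_hid, 1.0)         # hidden outputs plus bias
        o_out = sigmoid(W_out @ h)
        delta_out = (T[idx] - o_out) * o_out * (1 - o_out)              # output-layer delta
        delta_hid = o_hid * (1 - o_hid) * (W_out[:, :2].T @ delta_out)  # hidden deltas
        W_out += eta * np.outer(delta_out, h)    # delta_w = eta * delta_i * o_j
        W_hid += eta * np.outer(delta_hid, x)
        sse += 0.5 * np.sum((T[idx] - o_out) ** 2)
    if epoch % 1000 == 0:
        print(epoch, sse)                 # the quantity plotted against epoch in the figure

for x, t in zip(X, T):
    out = sigmoid(W_out @ np.append(sigmoid(W_hid @ np.append(x, 1.0)), 1.0))
    print(x, t, out)   # in most runs the outputs approach 0 or 1; some runs get stuck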
Learning to Classify Wines

Class   Properties
1       14.23  1.71  2.43  15.6  127  2.8   ...
1       13.2   1.78  2.14  11.2  100  2.65  ...
2       13.11  1.01  1.7   15    78   2.98  ...
3       13.17  2.59  2.37  20    120  1.65  ...
...

[Figure: a network with 13 input nodes (one per wine property), a 5-node hidden layer, and 1 output node giving the wine class.]
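A minimal sketch of preparing the wine data for such a 13-5-1 network. The slides do not say how the data were preprocessed; this assumes the UCI Wine dataset (whose values match the table above) as provided by sklearn.datasets.load_wine, min-max scaling of the 13 properties, and the three classes mapped onto a single sigmoid output as 0.0 / 0.5 / 1.0:

import numpy as np
from sklearn.datasets import load_wine

data = load_wine()
X = data.data                                                # 178 cases x 13 properties
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))    # scale each property to [0, 1]
T = data.target / 2.0                                        # classes 0, 1, 2 -> targets 0.0, 0.5, 1.0

# Random initial weights for a 13-5-1 network (the bias terms are my addition).
rng = np.random.default_rng()
W_hid = rng.uniform(-1, 1, size=(5, 14))                     # 5 hidden nodes, 13 inputs + bias
W_out = rng.uniform(-1, 1, size=(1, 6))                      # 1 output node, 5 hidden + bias
# Training would then proceed exactly as in the XOR sketch above.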