  1. Feedforward Networks: Gradient Descent Learning and Backpropagation. Christian Jacob, CPSC 565, Winter 2003, Department of Computer Science, University of Calgary, Canada.

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn mappings (for $\mu = 1, \ldots, P$) between

• an input pattern $x^\mu = (x_1^\mu, \ldots, x_N^\mu)$ and
• an associated target pattern $T^\mu$.

  2. Figure 1. Perceptron.

The output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

$$O_i^\mu = \sum_k w_{ki}\, x_k^\mu \quad (1)$$

The goal of the learning procedure is that eventually the output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

$$O_i^\mu = \sum_k w_{ki}\, x_k^\mu \overset{!}{=} T_i^\mu \quad (2)$$

Explicit Solution (Linear Network)*

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

$$w_{ik} = \frac{1}{P} \sum_{\mu, \lambda} T_i^\mu \, (Q^{-1})_{\mu\lambda} \, x_k^\lambda \quad (3)$$

$$Q_{\mu\lambda} = \frac{1}{P} \sum_k x_k^\mu \, x_k^\lambda \quad (4)$$

  3. Correlation Matrix

Here $Q_{\mu\lambda}$ is a component of the correlation matrix $Q$ of the input patterns:

$$Q = \frac{1}{P}\begin{pmatrix} \sum_k x_k^1 x_k^1 & \sum_k x_k^1 x_k^2 & \cdots & \sum_k x_k^1 x_k^P \\ \vdots & \vdots & & \vdots \\ \sum_k x_k^P x_k^1 & \sum_k x_k^P x_k^2 & \cdots & \sum_k x_k^P x_k^P \end{pmatrix} \quad (5)$$

You can check that this is indeed a solution by verifying

$$\sum_k w_{ik}\, x_k^\mu = T_i^\mu. \quad (6)$$

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means: if there are coefficients $a_\mu$, not all zero, such that for all $k = 1, \ldots, N$

$$a_1 x_k^1 + a_2 x_k^2 + \ldots + a_P x_k^P = 0, \quad (7)$$

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.
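A minimal Mathematica sketch of Equations (3) through (6) on a small hypothetical example; the names xPat, tPat, corrQ, wExplicit and the particular patterns are arbitrary illustration choices, not part of the original notes:

(* Hypothetical example data: P = 2 patterns with N = 3 components each,
   and M = 2 target values per pattern. *)
nP = 2;
xPat = {{1, 0, 1}, {0, 1, 1}};    (* xPat[[mu, k]] = x_k^mu *)
tPat = {{1, -1}, {-1, 1}};        (* tPat[[mu, i]] = T_i^mu *)

(* Correlation matrix, Equation (4): Q_{mu,lambda} = (1/P) Sum_k x_k^mu x_k^lambda *)
corrQ = (1/nP) xPat . Transpose[xPat];

(* Explicit weights, Equation (3):
   w_{ik} = (1/P) Sum_{mu,lambda} T_i^mu (Q^-1)_{mu,lambda} x_k^lambda *)
wExplicit = (1/nP) Transpose[tPat] . Inverse[corrQ] . xPat;

(* Verification, Equation (6): the network outputs reproduce the targets exactly. *)
xPat . Transpose[wExplicit] == tPat    (* True *)

Because the two example patterns are linearly independent, corrQ is invertible and the check returns True exactly (the data are integers, so the arithmetic stays exact).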

  4. Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $w^0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(w)$:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - \sum_k w_{ki}\, x_k^\mu \right)^2 \quad (8)$$

$E(w) \ge 0$ approaches zero as $w = \{ w_{ki} \}$ comes to satisfy Equation (2). This cost function is a quadratic function in weight space.
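As a sketch, Equation (8) can be written down directly in Mathematica; errorE and the example data are the hypothetical values introduced above:

(* Quadratic error of Equation (8) for a weight matrix wMat[[i, k]],
   input patterns xPat[[mu, k]], and targets tPat[[mu, i]]. *)
errorE[wMat_, xPat_, tPat_] :=
  (1/2) Total[(tPat - xPat . Transpose[wMat])^2, 2]

(* A random initial weight setting w0 (M = 2 outputs, N = 3 inputs), as in the text. *)
w0 = RandomReal[{-1, 1}, {2, 3}];
errorE[w0, {{1, 0, 1}, {0, 1, 1}}, {{1, -1}, {-1, 1}}]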

  5. Paraboloid

Therefore, $E(w)$ is a paraboloid with a single global minimum.

<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

[3D plot of the paraboloid x^2 + y^2.]
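The same paraboloid shape appears for the actual error of Equation (8). A sketch with a hypothetical one-output, two-weight example (the names xs, ts, errE and the data are arbitrary):

(* Error surface E(w1, w2) of Equation (8) for a single output cell with
   two weights and two patterns (N = 2, M = 1, P = 2). *)
xs = {{1, 0}, {0, 1}};   (* input patterns *)
ts = {1, -1};            (* target values  *)
errE[w1_, w2_] := (1/2) Total[(ts - xs . {w1, w2})^2]
Plot3D[errE[w1, w2], {w1, -3, 3}, {w2, -3, 3}]
(* paraboloid with its minimum E = 0 at (w1, w2) = (1, -1) *)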

  6. ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

[Contour plot of the paraboloid x^2 + y^2.]

If the pattern vectors are linearly independent (i.e., a solution for Equation (2) exists), the minimum is at $E = 0$.

Finding the Minimum: Following the Gradient

We can find the minimum of $E(w)$ in weight space by following the negative gradient

$$-\frac{\partial E(w)}{\partial w}. \quad (9)$$

We can implement this gradient strategy as follows:

  7. Changing a Weight

Each weight $w_{ki} \in w$ is changed by $\Delta w_{ki}$, proportionate to the gradient of $E$ at the current weight position (i.e., the current settings of all the weights):

$$\Delta w_{ki} = -\eta\, \frac{\partial E(w)}{\partial w_{ki}} \quad (10)$$

Steps Towards the Solution

$$\Delta w_{ki} = -\eta\, \frac{\partial}{\partial w_{ki}}\, \frac{1}{2} \sum_{j=1}^{M} \sum_{\mu=1}^{P} \Big( T_j^\mu - \sum_n w_{nj}\, x_n^\mu \Big)^2$$

$$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{\mu=1}^{P} 2 \Big( T_i^\mu - \sum_n w_{ni}\, x_n^\mu \Big) \big( -x_k^\mu \big) \quad (11)$$

Weight Adaptation Rule

$$\Delta w_{ki} = \eta \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right) x_k^\mu \quad (12)$$

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptation of the weights is accumulated over all patterns.
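A minimal sketch of batch gradient descent with the accumulated update of Equation (12); the data, the learning rate eta = 0.1, and the iteration count are arbitrary example choices:

(* Hypothetical example data and learning rate. *)
xPat = {{1, 0, 1}, {0, 1, 1}};  tPat = {{1, -1}, {-1, 1}};  eta = 0.1;

(* One batch update, Equation (12): Dw_{ki} = eta Sum_mu (T_i^mu - O_i^mu) x_k^mu *)
batchStep[wMat_] := wMat + eta Transpose[tPat - xPat . Transpose[wMat]] . xPat

SeedRandom[1];
w0   = RandomReal[{-1, 1}, {2, 3}];   (* random initial weight setting, M = 2, N = 3 *)
wSeq = NestList[batchStep, w0, 50];

(* The quadratic error of Equation (8) decreases along the iteration. *)
(1/2) Total[(tPat - xPat . Transpose[#])^2, 2] & /@ wSeq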

  8. Delta / LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

$$\Delta w_{ki} = \eta \left( T_i^\mu - O_i^\mu \right) x_k^\mu \quad (13)$$

or

$$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu \quad (14)$$

with

$$\delta_i^\mu = T_i^\mu - O_i^\mu. \quad (15)$$

This learning rule has several names:

• Delta rule
• Adaline rule
• Widrow-Hoff rule
• LMS (least mean square) rule.

A sketch of this per-pattern update is given after this slide.

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.

• The input function is denoted by $h(x)$.
• The output function $g(h(x))$ is assumed to be differentiable in $x$.
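The per-pattern delta-rule update referenced above, as a minimal sketch; deltaStep, the pattern order, and the learning rate are hypothetical choices:

(* One online update for a single pattern pair (x^mu, T^mu), Equations (13)-(15). *)
deltaStep[wMat_, x_, t_, eta_] :=
  Module[{o = wMat . x, delta},
    delta = t - o;                        (* delta_i^mu = T_i^mu - O_i^mu, Eq. (15) *)
    wMat + eta Outer[Times, delta, x]     (* Dw_{ki} = eta delta_i^mu x_k^mu, Eq. (14) *)
  ]

(* Sweep repeatedly through the two hypothetical patterns, one update per presentation. *)
xPat = {{1, 0, 1}, {0, 1, 1}};  tPat = {{1, -1}, {-1, 1}};
SeedRandom[1];
wLin = RandomReal[{-1, 1}, {2, 3}];
Do[wLin = deltaStep[wLin, xPat[[mu]], tPat[[mu]], 0.1], {epoch, 100}, {mu, 2}];
xPat . Transpose[wLin]   (* the outputs are now close to the targets tPat *)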

  9. Rewriting the Error Function

The definition of the error function (Equation (8)) can simply be rewritten as follows:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - g\Big( \sum_k w_{ki}\, x_k^\mu \Big) \right)^2 \quad (16)$$

Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

$$\frac{\partial E(w)}{\partial w_{ki}} = -\sum_{\mu=1}^{P} \left( T_i^\mu - g(h_i^\mu) \right) g'(h_i^\mu)\, x_k^\mu \quad (17)$$

From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

$$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu \quad (18)$$

where

$$\delta_i^\mu = \left( T_i^\mu - O_i^\mu \right) g'(h_i^\mu). \quad (19)$$
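A sketch of the corresponding per-pattern update for nonlinear cells, Equations (18) and (19), with g = tanh (β = 1) as introduced on the next slide; the names, targets, learning rate, and epoch count are hypothetical example choices:

(* Online update for a layer of nonlinear cells, Equations (18)-(19):
   h_i = Sum_k w_ki x_k,  O_i = g(h_i),  delta_i = (T_i - O_i) g'(h_i). *)
g[h_]  := Tanh[h]
gp[h_] := 1 - Tanh[h]^2          (* g'(h) for beta = 1, see Equation (20) *)

nonlinStep[wMat_, x_, t_, eta_] :=
  Module[{h = wMat . x, delta},
    delta = (t - g[h]) gp[h];
    wMat + eta Outer[Times, delta, x]
  ]

(* Hypothetical data: the targets lie inside (-1, 1), the range of tanh. *)
xPat = {{1, 0, 1}, {0, 1, 1}};  tPat = {{0.8, -0.8}, {-0.8, 0.8}};
SeedRandom[1];
wNl = RandomReal[{-0.5, 0.5}, {2, 3}];
Do[wNl = nonlinStep[wNl, xPat[[mu]], tPat[[mu]], 0.2], {epoch, 500}, {mu, 2}];
g[wNl . Transpose[xPat]]   (* column mu approaches the target pattern T^mu *)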

  10. Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions.

Hyperbolic Tangent

$$g(x) = \tanh(\beta x), \qquad g'(x) = \beta \left( 1 - g^2(x) \right) \quad (20)$$

Hyperbolic tangent plot:

Plot[Tanh[x], {x, -5, 5}];

[Plot of tanh(x) on the interval [-5, 5].]
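Besides the graphical checks on the following slides, Equation (20) can also be verified symbolically; a one-line sketch:

(* Symbolic check of Equation (20): d/dx tanh(b x) = b (1 - tanh^2(b x)). *)
FullSimplify[D[Tanh[b x], x] == b (1 - Tanh[b x]^2)]   (* should return True *)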

  11. Plot of the first derivative:

Plot[Tanh'[x], {x, -5, 5}];

[Plot of tanh'(x) = sech^2(x) on [-5, 5].]

Check for equality with $1 - \tanh^2 x$:

Plot[1 - Tanh[x]^2, {x, -5, 5}];

[Plot of 1 - tanh^2(x); identical to the previous plot.]

Influence of the β parameter:

p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]

  12. Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

[Grid of plots: tanh(βx) and its derivative for β = 1, ..., 5.]

  13. Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];

[Grid of plots: tanh(βx) and its derivative for β = 0.1, 0.2, ..., 1.0.]

  14. [Continuation of the plot grid for β = 0.1, ..., 1.0.]

  15. Sigmoid

$$g(x) = \frac{1}{1 + e^{-2\beta x}}, \qquad g'(x) = 2\beta\, g(x) \left( 1 - g(x) \right) \quad (21)$$

Sigmoid plot:

sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];

[Plot of the sigmoid for β = 1.]

Plot of the first derivative:

D[sigmoid[x, b], x]

(2 b E^(-2 b x))/(1 + E^(-2 b x))^2

  16. Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];

[Plot of the sigmoid's first derivative for β = 1.]

Check for equality with $2\, g\, (1 - g)$:

Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];

[Identical to the previous plot.]
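The analogous symbolic check for the sigmoid derivative of Equation (21), as a short sketch (the sigmoid definition is repeated so the block stands on its own):

(* Symbolic check of Equation (21): g'(x) = 2 b g(x) (1 - g(x)). *)
sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
FullSimplify[D[sigmoid[x, b], x] == 2 b sigmoid[x, b] (1 - sigmoid[x, b])]   (* should return True *)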
