  1. Feedforward Networks: Gradient Descent Learning and Backpropagation. Christian Jacob, CPSC 565, Winter 2003, Department of Computer Science, University of Calgary, Canada.

Learning by Gradient Descent

Definition of the Learning Problem

Let us start with the simple case of linear cells, which we have introduced as perceptron units. The linear network should learn mappings (for $\mu = 1, \ldots, P$) between

• an input pattern $x^\mu = (x_1^\mu, \ldots, x_N^\mu)$ and
• an associated target pattern $T^\mu$.

  2. Figure 1. Perceptron.

The output $O_i^\mu$ of cell $i$ for the input pattern $x^\mu$ is calculated as

$$O_i^\mu = \sum_k w_{ki}\, x_k^\mu \quad (1)$$

The goal of the learning procedure is that eventually the output $O_i^\mu$ for input pattern $x^\mu$ corresponds to the desired output $T_i^\mu$:

$$O_i^\mu = \sum_k w_{ki}\, x_k^\mu \overset{!}{=} T_i^\mu \quad (2)$$

Explicit Solution (Linear Network)*

For a linear network, the weights that satisfy Equation (2) can be calculated explicitly using the pseudo-inverse:

$$w_{ik} = \frac{1}{P} \sum_{\mu, \lambda} T_i^\mu \, (Q^{-1})_{\mu\lambda} \, x_k^\lambda \quad (3)$$

$$Q_{\mu\lambda} = \frac{1}{P} \sum_k x_k^\mu \, x_k^\lambda \quad (4)$$

  3. Correlation Matrix

Here $Q_{\mu\lambda}$ is a component of the correlation matrix $Q$ of the input patterns:

$$Q = \frac{1}{P}\begin{pmatrix} \sum_k x_k^1 x_k^1 & \sum_k x_k^1 x_k^2 & \cdots & \sum_k x_k^1 x_k^P \\ \vdots & \vdots & & \vdots \\ \sum_k x_k^P x_k^1 & \sum_k x_k^P x_k^2 & \cdots & \sum_k x_k^P x_k^P \end{pmatrix} \quad (5)$$

You can check that this is indeed a solution by verifying

$$\sum_k w_{ik}\, x_k^\mu = T_i^\mu. \quad (6)$$

Caveat

Note that $Q^{-1}$ only exists for linearly independent input patterns. That means: if there are coefficients $a_\mu$, not all zero, such that for all $k = 1, \ldots, N$

$$a_1 x_k^1 + a_2 x_k^2 + \ldots + a_P x_k^P = 0, \quad (7)$$

then the outputs $O_i^\mu$ cannot be selected independently from each other, and the problem is NOT solvable.
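A minimal Mathematica sketch of Equations (3) through (6) on a small hypothetical example; the names xPat, tPat, corrQ, wExplicit and the particular patterns are arbitrary illustration choices, not part of the original notes:

(* Hypothetical example data: P = 2 patterns with N = 3 components each,
   and M = 2 target values per pattern. *)
nP = 2;
xPat = {{1, 0, 1}, {0, 1, 1}};    (* xPat[[mu, k]] = x_k^mu *)
tPat = {{1, -1}, {-1, 1}};        (* tPat[[mu, i]] = T_i^mu *)

(* Correlation matrix, Equation (4): Q_{mu,lambda} = (1/P) Sum_k x_k^mu x_k^lambda *)
corrQ = (1/nP) xPat . Transpose[xPat];

(* Explicit weights, Equation (3):
   w_{ik} = (1/P) Sum_{mu,lambda} T_i^mu (Q^-1)_{mu,lambda} x_k^lambda *)
wExplicit = (1/nP) Transpose[tPat] . Inverse[corrQ] . xPat;

(* Verification, Equation (6): the network outputs reproduce the targets exactly. *)
xPat . Transpose[wExplicit] == tPat    (* True *)

Because the two example patterns are linearly independent, corrQ is invertible and the check returns True exactly (the data are integers, so the arithmetic stays exact).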

  4. Learning by Gradient Descent (Linear Network)

Let us now try to find a learning rule for a linear network with $M$ output units. Starting from a random initial weight setting $w^0$, the learning procedure should find a solution weight matrix for Equation (2).

Error Function

For this purpose, we define a cost or error function $E(w)$:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - \sum_k w_{ki}\, x_k^\mu \right)^2 \quad (8)$$

$E(w) \ge 0$ approaches zero as $w = \{ w_{ki} \}$ comes to satisfy Equation (2). This cost function is a quadratic function in weight space.
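As a sketch, Equation (8) can be written down directly in Mathematica; errorE and the example data are the hypothetical values introduced above:

(* Quadratic error of Equation (8) for a weight matrix wMat[[i, k]],
   input patterns xPat[[mu, k]], and targets tPat[[mu, i]]. *)
errorE[wMat_, xPat_, tPat_] :=
  (1/2) Total[(tPat - xPat . Transpose[wMat])^2, 2]

(* A random initial weight setting w0 (M = 2 outputs, N = 3 inputs), as in the text. *)
w0 = RandomReal[{-1, 1}, {2, 3}];
errorE[w0, {{1, 0, 1}, {0, 1, 1}}, {{1, -1}, {-1, 1}}]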

  5. Paraboloid

Therefore, $E(w)$ is a paraboloid with a single global minimum.

<< RealTime3D`
Plot3D[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

[3D plot of the paraboloid x^2 + y^2.]
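The same paraboloid shape appears for the actual error of Equation (8). A sketch with a hypothetical one-output, two-weight example (the names xs, ts, errE and the data are arbitrary):

(* Error surface E(w1, w2) of Equation (8) for a single output cell with
   two weights and two patterns (N = 2, M = 1, P = 2). *)
xs = {{1, 0}, {0, 1}};   (* input patterns *)
ts = {1, -1};            (* target values  *)
errE[w1_, w2_] := (1/2) Total[(ts - xs . {w1, w2})^2]
Plot3D[errE[w1, w2], {w1, -3, 3}, {w2, -3, 3}]
(* paraboloid with its minimum E = 0 at (w1, w2) = (1, -1) *)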

  6. ContourPlot[x^2 + y^2, {x, -5, 5}, {y, -5, 5}];

[Contour plot of the paraboloid x^2 + y^2.]

If the pattern vectors are linearly independent (i.e., a solution for Equation (2) exists), the minimum is at $E = 0$.

Finding the Minimum: Following the Gradient

We can find the minimum of $E(w)$ in weight space by following the negative gradient

$$-\frac{\partial E(w)}{\partial w}. \quad (9)$$

We can implement this gradient strategy as follows:

  7. Changing a Weight

Each weight $w_{ki} \in w$ is changed by $\Delta w_{ki}$, proportionate to the gradient of $E$ at the current weight position (i.e., the current settings of all the weights):

$$\Delta w_{ki} = -\eta\, \frac{\partial E(w)}{\partial w_{ki}} \quad (10)$$

Steps Towards the Solution

$$\Delta w_{ki} = -\eta\, \frac{\partial}{\partial w_{ki}}\, \frac{1}{2} \sum_{j=1}^{M} \sum_{\mu=1}^{P} \Big( T_j^\mu - \sum_n w_{nj}\, x_n^\mu \Big)^2$$

$$\Delta w_{ki} = -\eta\, \frac{1}{2} \sum_{\mu=1}^{P} 2 \Big( T_i^\mu - \sum_n w_{ni}\, x_n^\mu \Big) \big( -x_k^\mu \big) \quad (11)$$

Weight Adaptation Rule

$$\Delta w_{ki} = \eta \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right) x_k^\mu \quad (12)$$

The parameter $\eta$ is usually referred to as the learning rate. In this formula, the adaptation of the weights is accumulated over all patterns.
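A minimal sketch of batch gradient descent with the accumulated update of Equation (12); the data, the learning rate eta = 0.1, and the iteration count are arbitrary example choices:

(* Hypothetical example data and learning rate. *)
xPat = {{1, 0, 1}, {0, 1, 1}};  tPat = {{1, -1}, {-1, 1}};  eta = 0.1;

(* One batch update, Equation (12): Dw_{ki} = eta Sum_mu (T_i^mu - O_i^mu) x_k^mu *)
batchStep[wMat_] := wMat + eta Transpose[tPat - xPat . Transpose[wMat]] . xPat

SeedRandom[1];
w0   = RandomReal[{-1, 1}, {2, 3}];   (* random initial weight setting, M = 2, N = 3 *)
wSeq = NestList[batchStep, w0, 50];

(* The quadratic error of Equation (8) decreases along the iteration. *)
(1/2) Total[(tPat - xPat . Transpose[#])^2, 2] & /@ wSeq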

  8. Delta / LMS Learning

If we change the weights after each presentation of an input pattern to the network, we get a simpler form for the weight update term:

$$\Delta w_{ki} = \eta \left( T_i^\mu - O_i^\mu \right) x_k^\mu \quad (13)$$

or

$$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu \quad (14)$$

with

$$\delta_i^\mu = T_i^\mu - O_i^\mu. \quad (15)$$

This learning rule has several names:

• Delta rule
• Adaline rule
• Widrow-Hoff rule
• LMS (least mean square) rule.

A sketch of this per-pattern update is given after this slide.

Gradient Descent Learning with Nonlinear Cells

We will now extend the gradient descent technique to the case of nonlinear cells, that is, where the activation/output function is a general nonlinear function $g(x)$.

• The input function is denoted by $h(x)$.
• The output function $g(h(x))$ is assumed to be differentiable in $x$.
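The per-pattern delta-rule update referenced above, as a minimal sketch; deltaStep, the pattern order, and the learning rate are hypothetical choices:

(* One online update for a single pattern pair (x^mu, T^mu), Equations (13)-(15). *)
deltaStep[wMat_, x_, t_, eta_] :=
  Module[{o = wMat . x, delta},
    delta = t - o;                        (* delta_i^mu = T_i^mu - O_i^mu, Eq. (15) *)
    wMat + eta Outer[Times, delta, x]     (* Dw_{ki} = eta delta_i^mu x_k^mu, Eq. (14) *)
  ]

(* Sweep repeatedly through the two hypothetical patterns, one update per presentation. *)
xPat = {{1, 0, 1}, {0, 1, 1}};  tPat = {{1, -1}, {-1, 1}};
SeedRandom[1];
wLin = RandomReal[{-1, 1}, {2, 3}];
Do[wLin = deltaStep[wLin, xPat[[mu]], tPat[[mu]], 0.1], {epoch, 100}, {mu, 2}];
xPat . Transpose[wLin]   (* the outputs are now close to the targets tPat *)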

  9. Rewriting the Error Function

The definition of the error function (Equation (8)) can simply be rewritten as follows:

$$E(w) = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - O_i^\mu \right)^2 = \frac{1}{2} \sum_{i=1}^{M} \sum_{\mu=1}^{P} \left( T_i^\mu - g\Big( \sum_k w_{ki}\, x_k^\mu \Big) \right)^2 \quad (16)$$

Weight Gradients

Consequently, we can compute the $w_{ki}$ gradients:

$$\frac{\partial E(w)}{\partial w_{ki}} = -\sum_{\mu=1}^{P} \left( T_i^\mu - g(h_i^\mu) \right) g'(h_i^\mu)\, x_k^\mu \quad (17)$$

From Weight Gradients to the Learning Rule

This eventually (after some more calculations) shows us that the adaptation term $\Delta w_{ki}$ for $w_{ki}$ has the same form as in Equations (10), (13), and (14), namely:

$$\Delta w_{ki} = \eta\, \delta_i^\mu\, x_k^\mu \quad (18)$$

where

$$\delta_i^\mu = \left( T_i^\mu - O_i^\mu \right) g'(h_i^\mu). \quad (19)$$
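A sketch of the corresponding per-pattern update for nonlinear cells, Equations (18) and (19), with g = tanh (β = 1) as introduced on the next slide; the names, targets, learning rate, and epoch count are hypothetical example choices:

(* Online update for a layer of nonlinear cells, Equations (18)-(19):
   h_i = Sum_k w_ki x_k,  O_i = g(h_i),  delta_i = (T_i - O_i) g'(h_i). *)
g[h_]  := Tanh[h]
gp[h_] := 1 - Tanh[h]^2          (* g'(h) for beta = 1, see Equation (20) *)

nonlinStep[wMat_, x_, t_, eta_] :=
  Module[{h = wMat . x, delta},
    delta = (t - g[h]) gp[h];
    wMat + eta Outer[Times, delta, x]
  ]

(* Hypothetical data: the targets lie inside (-1, 1), the range of tanh. *)
xPat = {{1, 0, 1}, {0, 1, 1}};  tPat = {{0.8, -0.8}, {-0.8, 0.8}};
SeedRandom[1];
wNl = RandomReal[{-0.5, 0.5}, {2, 3}];
Do[wNl = nonlinStep[wNl, xPat[[mu]], tPat[[mu]], 0.2], {epoch, 500}, {mu, 2}];
g[wNl . Transpose[xPat]]   (* column mu approaches the target pattern T^mu *)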

  10. Suitable Activation Functions

The calculation of the above $\delta$ terms is easy for the following functions $g$, which are commonly used as activation functions.

Hyperbolic Tangent

$$g(x) = \tanh(\beta x), \qquad g'(x) = \beta \left( 1 - g^2(x) \right) \quad (20)$$

Hyperbolic tangent plot:

Plot[Tanh[x], {x, -5, 5}];

[Plot of tanh(x) on the interval [-5, 5].]
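Besides the graphical checks on the following slides, Equation (20) can also be verified symbolically; a one-line sketch:

(* Symbolic check of Equation (20): d/dx tanh(b x) = b (1 - tanh^2(b x)). *)
FullSimplify[D[Tanh[b x], x] == b (1 - Tanh[b x]^2)]   (* should return True *)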

  11. Plot of the first derivative:

Plot[Tanh'[x], {x, -5, 5}];

[Plot of tanh'(x) = sech^2(x) on [-5, 5].]

Check for equality with $1 - \tanh^2 x$:

Plot[1 - Tanh[x]^2, {x, -5, 5}];

[Plot of 1 - tanh^2(x); identical to the previous plot.]

Influence of the β parameter:

p1[b_] := Plot[Tanh[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]
p2[b_] := Plot[Tanh'[b x], {x, -5, 5}, PlotRange -> All, DisplayFunction -> Identity]

  12. Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 1, 5}];

[Grid of plots: tanh(βx) and its derivative for β = 1, ..., 5.]

  13. Table[Show[GraphicsArray[{p1[b], p2[b]}]], {b, 0.1, 1, 0.1}];

[Grid of plots: tanh(βx) and its derivative for β = 0.1, 0.2, ..., 1.0.]

  14. [Continuation of the plot grid for β = 0.1, ..., 1.0.]

  15. Sigmoid

$$g(x) = \frac{1}{1 + e^{-2\beta x}}, \qquad g'(x) = 2\beta\, g(x) \left( 1 - g(x) \right) \quad (21)$$

Sigmoid plot:

sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
Plot[sigmoid[x, 1], {x, -5, 5}];

[Plot of the sigmoid for β = 1.]

Plot of the first derivative:

D[sigmoid[x, b], x]

(2 b E^(-2 b x))/(1 + E^(-2 b x))^2

  16. Plot[D[sigmoid[x, 1], x] // Evaluate, {x, -5, 5}];

[Plot of the sigmoid's first derivative for β = 1.]

Check for equality with $2\, g\, (1 - g)$:

Plot[2 sigmoid[x, 1] (1 - sigmoid[x, 1]), {x, -5, 5}];

[Identical to the previous plot.]
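The analogous symbolic check for the sigmoid derivative of Equation (21), as a short sketch (the sigmoid definition is repeated so the block stands on its own):

(* Symbolic check of Equation (21): g'(x) = 2 b g(x) (1 - g(x)). *)
sigmoid[x_, b_] := 1/(1 + E^(-2 b x))
FullSimplify[D[sigmoid[x, b], x] == 2 b sigmoid[x, b] (1 - sigmoid[x, b])]   (* should return True *)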
