CS7015 (Deep Learning) : Lecture 5
Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD, AdaGrad, RMSProp, Adam
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras


  1. CS7015 (Deep Learning) : Lecture 5. Gradient Descent (GD), Momentum Based GD, Nesterov Accelerated GD, Stochastic GD, AdaGrad, RMSProp, Adam. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

  2. Acknowledgements: For most of the lecture, I have borrowed ideas from the videos by Ryan Harris on “visualize backpropagation” (available on YouTube). Some content is based on the course CS231n (http://cs231n.stanford.edu/2016/) by Andrej Karpathy and others.

  3. Module 5.1: Learning Parameters : Infeasible (Guess Work)

  4. A single sigmoid neuron takes an input x and produces an output $y = f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$. The input for training is $\{x_i, y_i\}_{i=1}^{N}$, i.e., N pairs of (x, y). Training objective: find w and b that minimize $L(w, b) = \sum_{i=1}^{N} (y_i - f(x_i))^2$. What does it mean to train the network? Suppose we train the network with (x, y) = (0.5, 0.2) and (2.5, 0.9). At the end of training we expect to find w*, b* such that f(0.5) → 0.2 and f(2.5) → 0.9.

  5. In other words, we hope to find a sigmoid function $f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$ such that the points (0.5, 0.2) and (2.5, 0.9) lie on this sigmoid.

  6. Let us see this in more detail....

  7. Can we try to find such a w*, b* manually? Let us try a random guess (say, w = 0.5, b = 0). Clearly not good, but how bad is it? Let us revisit L(w, b) to see how bad it is:
$L(w, b) = \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2 = \frac{1}{2}\left((y_1 - f(x_1))^2 + (y_2 - f(x_2))^2\right) = \frac{1}{2}\left((0.9 - f(2.5))^2 + (0.2 - f(0.5))^2\right) = 0.073$
We want L(w, b) to be as close to 0 as possible.
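A quick numeric check of this computation (a minimal sketch, not from the slides; the function and variable names are my own):

```python
import math

def f(x, w, b):
    # Sigmoid neuron: f(x) = 1 / (1 + e^{-(w*x + b)})
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, data):
    # Squared-error loss with the 1/2 factor used on this slide
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]
print(round(loss(0.5, 0.0, data), 3))  # prints 0.073, matching the slide
```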

  8. Let us try some other values of w, b:

       w        b       L(w, b)
       0.50     0.00    0.0730
      -0.10     0.00    0.1481
       0.94    -0.94    0.0214
       1.42    -1.73    0.0028
       1.65    -2.08    0.0003
       1.78    -2.27    0.0000

Oops!! The second guess (w = -0.10) made things even worse... Perhaps it would help to push w and b in the other direction, as the remaining rows show: the loss then keeps dropping.

  9. Let us look at something better than our “guess work” algorithm....

  10. Since we have only 2 points and 2 parameters (w, b) we can easily plot L(w, b) for different values of (w, b) and pick the one where L(w, b) is minimum. But of course this becomes intractable once you have many more data points and many more parameters!! Further, even here we have plotted the error surface only for a small range of (w, b) [from (−6, 6) and not from (−∞, ∞)].
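A minimal sketch of this brute-force search (my own code; the grid resolution is an arbitrary choice): evaluate L(w, b) on a grid over [−6, 6] × [−6, 6] and pick the minimum.

```python
import math

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, data):
    return 0.5 * sum((y - f(x, w, b)) ** 2 for x, y in data)

data = [(0.5, 0.2), (2.5, 0.9)]

# Brute force: evaluate the error surface on a coarse grid over [-6, 6] x [-6, 6]
grid = [i * 0.1 - 6.0 for i in range(121)]
best_loss, best_w, best_b = min((loss(w, b, data), w, b) for w in grid for b in grid)
print(best_loss, best_w, best_b)  # smallest loss found and the (w, b) that achieves it
```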

  11. Let us look at the geometric interpretation of our “guess work” algorithm in terms of this error surface

  12.–18. (Figure slides) Snapshots of the “guess work” traversal plotted on the error surface of L(w, b).

  19. Module 5.2: Learning Parameters : Gradient Descent

  20. Now let’s see if there is a more efficient and principled way of doing this

  21. Goal: Find a better way of traversing the error surface so that we can reach the minimum value quickly without resorting to brute force search!

  22. Let θ = [w, b] be the vector of parameters, say, randomly initialized, and let Δθ = [Δw, Δb] denote the change in the values of w, b. Suppose we move in the direction of Δθ to reach θ_new. Let us be a bit conservative and move only by a small amount η: $\theta_{new} = \theta + \eta \cdot \Delta\theta$. Question: What is the right Δθ to use? The answer comes from the Taylor series.

  23. For ease of notation, let Δθ = u. Then from the Taylor series we have
$L(\theta + \eta u) = L(\theta) + \eta \, u^T \nabla L(\theta) + \frac{\eta^2}{2!} u^T \nabla^2 L(\theta) u + \frac{\eta^3}{3!}(\cdots) + \frac{\eta^4}{4!}(\cdots)$
$\approx L(\theta) + \eta \, u^T \nabla L(\theta)$   [η is typically small, so the terms in $\eta^2, \eta^3, \ldots$ vanish]
Note that the move ηu would be favorable only if $L(\theta + \eta u) - L(\theta) < 0$, i.e., if the new loss is less than the previous loss. This implies $u^T \nabla L(\theta) < 0$.
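A small numeric illustration of this condition (my own sketch; the 1-D quadratic here is just an example loss, not the network's): a step u with $u \cdot \nabla L(\theta) < 0$ lowers the loss, while a step with $u \cdot \nabla L(\theta) > 0$ raises it.

```python
def L(theta):
    # Toy 1-D loss, used only to illustrate the sign condition
    return theta ** 2

def grad_L(theta):
    return 2.0 * theta

theta, eta = 2.0, 0.1
g = grad_L(theta)                      # gradient at theta = 2 is 4
for u in (-g, +g):                     # first opposite to the gradient, then along it
    change = L(theta + eta * u) - L(theta)
    print(u * g, change)               # u*g < 0 gives change < 0 (loss decreases)
```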

  24. Okay, so we have $u^T \nabla L(\theta) < 0$. But what is the range of $u^T \nabla L(\theta)$? Let's see. Let β be the angle between u and ∇L(θ). Then we know that
$-1 \leq \cos(\beta) = \frac{u^T \nabla L(\theta)}{||u|| \cdot ||\nabla L(\theta)||} \leq 1$
Multiplying throughout by $k = ||u|| \cdot ||\nabla L(\theta)||$,
$-k \leq k \cos(\beta) = u^T \nabla L(\theta) \leq k$
Thus, $L(\theta + \eta u) - L(\theta) \approx \eta \, u^T \nabla L(\theta) = \eta k \cos(\beta)$ will be most negative when $\cos(\beta) = -1$, i.e., when β is 180°.

  25. Gradient Descent Rule: The direction u that we intend to move in should be at 180° w.r.t. the gradient. In other words, move in a direction opposite to the gradient.
Parameter update equations:
$w_{t+1} = w_t - \eta \nabla w_t$
$b_{t+1} = b_t - \eta \nabla b_t$
where $\nabla w_t = \frac{\partial L(w, b)}{\partial w}$ at $w = w_t, b = b_t$, and $\nabla b_t = \frac{\partial L(w, b)}{\partial b}$ at $w = w_t, b = b_t$.
So we now have a more principled way of moving in the w-b plane than our “guess work” algorithm.

  26. Let's create an algorithm from this rule...

Algorithm 1: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    while t < max_iterations do
        w_{t+1} ← w_t − η∇w_t;
        b_{t+1} ← b_t − η∇b_t;
        t ← t + 1;
    end

To see this algorithm in practice let us first derive ∇w and ∇b for our toy neural network.
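A direct Python transcription of Algorithm 1 (a sketch with my own naming; grad_w and grad_b are placeholders for the gradient functions derived on the next few slides):

```python
def gradient_descent(w, b, grad_w, grad_b, eta=1.0, max_iterations=1000):
    # Repeatedly step opposite to the gradient, as in Algorithm 1
    for t in range(max_iterations):
        dw, db = grad_w(w, b), grad_b(w, b)
        w, b = w - eta * dw, b - eta * db
    return w, b
```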

  27. Back to the sigmoid neuron $f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$. Let's assume there is only 1 point (x, y) to fit, so
$L(w, b) = \frac{1}{2}(f(x) - y)^2$
$\nabla w = \frac{\partial L(w, b)}{\partial w} = \frac{\partial}{\partial w}\left[\frac{1}{2}(f(x) - y)^2\right]$

  28. Expanding the derivative,
$\nabla w = \frac{\partial}{\partial w}\left[\frac{1}{2}(f(x) - y)^2\right] = \frac{1}{2} \cdot 2(f(x) - y) \cdot \frac{\partial}{\partial w}(f(x) - y) = (f(x) - y) \cdot \frac{\partial}{\partial w} f(x)$
where
$\frac{\partial}{\partial w} f(x) = \frac{\partial}{\partial w}\left[\frac{1}{1 + e^{-(wx + b)}}\right] = \frac{-1}{(1 + e^{-(wx + b)})^2} \cdot \frac{\partial}{\partial w}\left(e^{-(wx + b)}\right)$
$= \frac{-1}{(1 + e^{-(wx + b)})^2} \cdot e^{-(wx + b)} \cdot \frac{\partial}{\partial w}\left(-(wx + b)\right) = \frac{-1}{(1 + e^{-(wx + b)})^2} \cdot e^{-(wx + b)} \cdot (-x)$
$= \frac{1}{1 + e^{-(wx + b)}} \cdot \frac{e^{-(wx + b)}}{1 + e^{-(wx + b)}} \cdot x = f(x) \cdot (1 - f(x)) \cdot x$
Hence,
$\nabla w = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \cdot x$

  29. So if there is only 1 point (x, y), we have
$\nabla w = (f(x) - y) \cdot f(x) \cdot (1 - f(x)) \cdot x$
For two points,
$\nabla w = \sum_{i=1}^{2} (f(x_i) - y_i) \cdot f(x_i) \cdot (1 - f(x_i)) \cdot x_i$
$\nabla b = \sum_{i=1}^{2} (f(x_i) - y_i) \cdot f(x_i) \cdot (1 - f(x_i))$
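Putting these gradients together with Algorithm 1 (my own sketch; the starting point (w, b) = (−2, −2), the learning rate eta = 1.0, and the iteration count are illustrative assumptions, not values from the slides):

```python
import math

data = [(0.5, 0.2), (2.5, 0.9)]           # the two training points from the slides

def f(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(w, b):
    # dL/dw = sum_i (f(x_i) - y_i) * f(x_i) * (1 - f(x_i)) * x_i
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) * x for x, y in data)

def grad_b(w, b):
    # dL/db = sum_i (f(x_i) - y_i) * f(x_i) * (1 - f(x_i))
    return sum((f(x, w, b) - y) * f(x, w, b) * (1 - f(x, w, b)) for x, y in data)

w, b, eta = -2.0, -2.0, 1.0
for t in range(1000):
    w, b = w - eta * grad_w(w, b), b - eta * grad_b(w, b)
print(w, b)  # should drift toward roughly w ~ 1.8, b ~ -2.3 (cf. the guess-work table)
```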

  30. (Figure) Gradient descent in action: the trajectory of (w, b) traced on the error surface of L(w, b).

  31. (Figure: plot of $f(x) = x^2 + 1$ with a slope triangle $\Delta y_1 / \Delta x_1$ on a steep part of the curve and $\Delta y_2 / \Delta x_2$ on a gentle part.) When the curve is steep the gradient ($\Delta y_1 / \Delta x_1$) is large. When the curve is gentle the gradient ($\Delta y_2 / \Delta x_2$) is small. Recall that our weight updates are proportional to the gradient: $w = w - \eta \nabla w$. Hence in the areas where the curve is gentle the updates are small, whereas in the areas where the curve is steep the updates are large.
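A quick numeric illustration (my own example values): for $f(x) = x^2 + 1$ the slope is 2x, so a gradient-style update x ← x − η f'(x) moves far in steep regions and barely at all in gentle ones.

```python
eta = 0.1

def slope(x):
    # Derivative of f(x) = x^2 + 1
    return 2.0 * x

for x in (3.0, 0.25):            # a steep point and a gentle point of the curve
    print(x, eta * slope(x))     # update sizes: 0.6 vs 0.05
```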

  32. Let's see what happens when we start from a different point

  33. Irrespective of where we start from, once we hit a surface which has a gentle slope, the progress slows down

  34. Module 5.3 : Contours

  35. Visualizing things in 3D can sometimes become a bit cumbersome. Can we do a 2D visualization of this traversal along the error surface? Yes, let's take a look at something known as contours.
