Nesterov's Accelerated Gradient • Change the order of operations • At any iteration, to compute the current step: – First extend the previous step – Then compute the gradient step at the resultant position – Add the two to obtain the final step
Nesterov's Accelerated Gradient • Nesterov's method:
$$\Delta W^{(k)} = \beta\,\Delta W^{(k-1)} - \eta\,\nabla_W Err\!\left(W^{(k-1)} + \beta\,\Delta W^{(k-1)}\right)$$
$$W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$
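A minimal NumPy-style sketch of one such update, assuming a caller-supplied gradient function `grad_err` (hypothetical, not from the slides) that returns $\nabla_W Err$ at a given weight vector:

```python
def nesterov_step(W, dW_prev, grad_err, eta=0.01, beta=0.9):
    """One Nesterov accelerated gradient step (sketch).

    W        : current weights (numpy array)
    dW_prev  : previous step, Delta W^(k-1)
    grad_err : callable returning the gradient of the error at given weights
    """
    lookahead = W + beta * dW_prev                    # first extend the previous step
    dW = beta * dW_prev - eta * grad_err(lookahead)   # gradient step at the resultant position
    return W + dW, dW                                 # add the two; keep dW for the next iteration
```

The returned `dW` must be carried across iterations, since the next step extends it.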
Nesterov's Accelerated Gradient • Comparison with momentum (example from Hinton) • Converges much faster
Moving on: Topics for the day • Incremental updates • Revisiting “trend” algorithms • Generalization • Tricks of the trade – Divergences.. – Activations – Normalizations
The training formulation • (Figure: output y plotted against input X) • Given input–output pairs at a number of locations, estimate the entire function
Gradient descent • Start with an initial function • Adjust its value at all points to make the outputs closer to the required value – Gradient descent adjusts parameters to adjust the function value at all points – Repeat this iteratively until we get arbitrarily close to the target function at the training points
Effect of number of samples • Problem with conventional gradient descent: we try to simultaneously adjust the function at all training points – We must process all training points before making a single adjustment – “Batch” update
Alternative: Incremental update • Alternative: adjust the function at one training point at a time – Keep adjustments small – Eventually, when we have processed all the training points, we will have adjusted the entire function • With greater overall adjustment than we would if we made a single “Batch” update
Incremental Update: Stochastic Gradient Descent
• Given training pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$
• Do:
  – For all $t = 1:T$
    • For every layer $k$:
      – Compute $\nabla_{W_k} Div(Y_t, d_t)$
      – Update $W_k = W_k - \eta\,\nabla_{W_k} Div(Y_t, d_t)$
• Until $Err$ has converged
Caveats: order of presentation • If we loop through the samples in the same order, we may get cyclic behavior
Caveats: order of presentation • If we loop through the samples in the same order, we may get cyclic behavior • We must go through them randomly to get more convergent behavior
Caveats: learning rate • (Figure: output y vs. input X) • Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances – Correcting the function for individual instances will lead to never-ending, non-convergent updates – We must shrink the learning rate with iterations to prevent this • Correction for individual instances with the eventual minuscule learning rates will not modify the function
Incremental Update: Stochastic Gradient Descent
• Given training pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$
• Do:
  – Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$  (randomize input order)
  – For all $t = 1:T$
    • $j = j + 1$
    • For every layer $k$:
      – Compute $\nabla_{W_k} Div(Y_t, d_t)$
      – Update $W_k = W_k - \eta_j\,\nabla_{W_k} Div(Y_t, d_t)$  (learning rate $\eta_j$ reduces with $j$)
• Until $Err$ has converged
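A sketch of this loop for a generic model, assuming a hypothetical per-sample gradient helper `grad_div(W, x, d)` and a base rate `eta0`; the $1/j$ schedule follows the slide:

```python
import numpy as np

def sgd(W, X, D, grad_div, eta0=0.1, epochs=10):
    """SGD with random presentation order and a step size shrinking as 1/j."""
    j = 0
    for _ in range(epochs):
        order = np.random.permutation(len(X))        # randomize input order every pass
        for t in order:
            j += 1
            eta_j = eta0 / j                          # learning rate reduces with j
            W = W - eta_j * grad_div(W, X[t], D[t])   # update from a single instance
    return W
```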
Stochastic Gradient Descent • The iterations can make multiple passes over the data • A single pass through the entire training data is called an "epoch" – An epoch over a training set with $T$ samples results in $T$ updates of parameters
When does SGD work
• SGD converges "almost surely" to a global or local minimum for most functions
  – Sufficient condition: the step sizes satisfy
    • $\sum_k \eta_k = \infty$  (eventually the entire parameter space can be searched)
    • $\sum_k \eta_k^2 < \infty$  (the steps shrink)
  – The fastest converging series that satisfies both requirements is $\eta_k \propto \frac{1}{k}$
    • This is the optimal rate of shrinking the step size for strongly convex functions
  – More generally, the learning rates are optimally determined
• If the loss is convex, SGD converges to the optimal solution
• For non-convex losses SGD converges to a local minimum
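For instance, the $1/k$ schedule satisfies both conditions:
$$\sum_k \frac{\eta_0}{k} = \infty \quad\text{(the harmonic series diverges)}, \qquad \sum_k \frac{\eta_0^2}{k^2} = \frac{\pi^2\eta_0^2}{6} < \infty.$$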
Batch gradient convergence • In contrast, using the batch update method, for strongly convex functions, $\left\|W^{(k)} - W^*\right\| < c^k \left\|W^{(0)} - W^*\right\|$ – Giving us the iterations to $\varepsilon$ convergence as $O\!\left(\log \frac{1}{\varepsilon}\right)$ • For generic convex functions, the $\varepsilon$ convergence is $O\!\left(\frac{1}{\varepsilon}\right)$ • Batch gradients converge "faster" – But SGD performs $T$ updates for every batch update
SGD convergence • We will define convergence in terms of the number of iterations taken to get within $\varepsilon$ of the optimal solution: $\left|f\!\left(W^{(k)}\right) - f\!\left(W^*\right)\right| < \varepsilon$ – Note: $f(W)$ here is the error on the entire training data, although SGD itself updates after every training instance • Using the optimal learning rate $1/k$, for strongly convex functions, $\left\|W^{(k)} - W^*\right\| < \frac{1}{k}\left\|W^{(0)} - W^*\right\|$ – Giving us the iterations to $\varepsilon$ convergence as $O\!\left(\frac{1}{\varepsilon}\right)$ • For generically convex (but not strongly convex) functions, various proofs report an $\varepsilon$ convergence of $O\!\left(\frac{1}{\sqrt{k}}\right)$ using a learning rate of $\frac{1}{\sqrt{k}}$.
SGD Convergence: Loss value • If: – $f$ is $\lambda$-strongly convex, and – at step $t$ we have a noisy estimate of the subgradient, $\hat{g}_t$, with $\mathbb{E}\!\left[\|\hat{g}_t\|^2\right] \le G^2$ for all $t$, – and we use step size $\eta_t = 1/(\lambda t)$ • Then for any $T > 1$:
$$\mathbb{E}\!\left[f(w_T) - f(w^*)\right] \le \frac{17\,G^2\,(1 + \log T)}{\lambda T}$$
SGD Convergence • We can bound the expected difference between the loss over our data using the optimal weights, $w^*$, and the weights at any single iteration, $w_T$, to $\mathcal{O}\!\left(\frac{\log T}{T}\right)$ for strongly convex loss, or $\mathcal{O}\!\left(\frac{\log T}{\sqrt{T}}\right)$ for convex loss • Averaging schemes can improve the bound to $\mathcal{O}\!\left(\frac{1}{T}\right)$ and $\mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)$ respectively • Smoothness of the loss is not required
SGD example • A simpler problem: K-means • Note: SGD converges slower • Also note the rather large variation between runs – Let's try to understand these results…
Recall: Modelling a function $g(X)$ • $Y = f(X; W)$ • To learn a network $f(X; W)$ to model a function $g(X)$, we minimize the expected divergence
$$\widehat{W} = \arg\min_W \int_X div\!\left(f(X;W), g(X)\right) P(X)\,dX = \arg\min_W E\!\left[div\!\left(f(X;W), g(X)\right)\right]$$
Recall: The Empirical risk • (Figure: training samples $(X_i, d_i)$) • In practice, we minimize the empirical error
$$Err\!\left(f(X;W), g(X)\right) = \frac{1}{N}\sum_{i=1}^{N} div\!\left(f(X_i;W), d_i\right), \qquad \widehat{W} = \arg\min_W Err\!\left(f(X;W), g(X)\right)$$
• The expected value of the empirical error is actually the expected divergence:
$$E\!\left[Err\!\left(f(X;W), g(X)\right)\right] = E\!\left[div\!\left(f(X;W), g(X)\right)\right]$$
– The empirical error is an unbiased estimate of the expected error – Though there is no guarantee that minimizing it will minimize the expected error • The variance of the empirical error: $var(Err) = \frac{1}{N}\,var(div)$, i.e. the variance of the estimator is proportional to $\frac{1}{N}$ – The larger this variance, the greater the likelihood that the $W$ that minimizes the empirical error will differ significantly from the $W$ that minimizes the expected error
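For completeness, the variance callout follows directly if the $N$ training instances are drawn i.i.d., so the per-sample divergences are independent with common variance $var(div)$:
$$var(Err) = var\!\left(\frac{1}{N}\sum_{i=1}^{N} div\!\left(f(X_i;W), d_i\right)\right) = \frac{1}{N^2}\sum_{i=1}^{N} var(div) = \frac{1}{N}\,var(div).$$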
SGD • (Figure: training samples $(X_i, d_i)$) • At each iteration, SGD focuses on the divergence of a single sample: $div\!\left(f(X_t;W), d_t\right)$ • The expected value of the sample error is still the expected divergence: $E\!\left[div\!\left(f(X;W), g(X)\right)\right]$ – The sample error is also an unbiased estimate of the expected error • The variance of the sample error is the variance of the divergence itself: $var(div)$ – This is $N$ times the variance of the empirical average minimized by the batch update
Explaining the variance • (Figure: the target function in blue and the model's approximation at the current $W$ in red, plotted against $x$) • The blue curve is the function being approximated • The red curve is the approximation by the model at a given $W$ • The heights of the shaded regions represent the point-by-point error – The divergence is a function of the error – We want to find the $W$ that minimizes the average divergence
Explaining the variance • (Figure: as above, with sample points marked) • The sample estimate approximates the shaded area with the average length of the lines between the two curves at the sampled points • Variance: the spread between the estimates obtained from different sets of samples is the variance
Explaining the variance • (Figure: as above, with a different set of sample points) • The sample estimate approximates the shaded area with the average length of the lines • This average length will change with the position of the samples
Explaining the variance • (Figure: as above, with many sample points) • Having more samples makes the estimate more robust to changes in the position of samples – The variance of the estimate is smaller
Explaining the variance • (Figure: as above, with only one sample) • Having very few samples makes the estimate swing wildly with the sample position – Since our estimator learns the $W$ to minimize this estimate, the learned $W$ too can swing wildly
SGD example • A simpler problem: K-means • Note: SGD converges slower • Also has large variation between runs
SGD vs batch • SGD uses the gradient from only one sample at a time, and consequently has high variance • But also provides significantly quicker updates than batch • Is there a good medium?
Alternative: Mini-batch update • Alternative: adjust the function at a small, randomly chosen subset of points – Keep adjustments small – If the subsets cover the training set, we will have adjusted the entire function • As before, vary the subsets randomly in different passes through the training data
Incremental Update: Mini-batch update
• Given training pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$
• Initialize all weights $W_1, W_2, \ldots, W_K$; $j = 0$
• Do:
  – Randomly permute $(X_1, d_1), (X_2, d_2), \ldots, (X_T, d_T)$  (randomize input order)
  – For $t = 1:b:T$  ($b$ is the mini-batch size)
    • $j = j + 1$
    • For every layer $k$: $\Delta W_k = 0$
    • For $t' = t : t + b - 1$
      – For every layer $k$:
        » Compute $\nabla_{W_k} Div(Y_{t'}, d_{t'})$
        » $\Delta W_k = \Delta W_k + \nabla_{W_k} Div(Y_{t'}, d_{t'})$
    • Update – For every layer $k$: $W_k = W_k - \eta_j\,\Delta W_k$  (shrinking step size $\eta_j$)
• Until $Err$ has converged
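A NumPy sketch of the same loop, reusing the hypothetical `grad_div(W, x, d)` helper assumed in the SGD sketch above; the batch size `b` and the shrinking rate mirror the pseudocode:

```python
import numpy as np

def minibatch_sgd(W, X, D, grad_div, b=32, eta0=0.1, epochs=10):
    """Mini-batch SGD: accumulate gradients over b samples, then take one step."""
    j = 0
    T = len(X)
    for _ in range(epochs):
        order = np.random.permutation(T)               # randomize input order
        for start in range(0, T, b):
            j += 1
            batch = order[start:start + b]             # next mini-batch of (up to) b samples
            dW = sum(grad_div(W, X[t], D[t]) for t in batch)
            W = W - (eta0 / j) * dW                    # shrinking step size
    return W
```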
Mini Batches • (Figure: training samples $(X_i, d_i)$) • Mini-batch updates compute and minimize an empirical batch error
$$BatchErr\!\left(f(X;W), g(X)\right) = \frac{1}{b}\sum_{i=1}^{b} div\!\left(f(X_i;W), d_i\right)$$
• The expected value of the batch error is also the expected divergence:
$$E\!\left[BatchErr\!\left(f(X;W), g(X)\right)\right] = E\!\left[div\!\left(f(X;W), g(X)\right)\right]$$
– The batch error is also an unbiased estimate of the expected error • The variance of the batch error: $var(BatchErr) = \frac{1}{b}\,var(div)$ – This will be much smaller than the variance of the sample error in SGD
Minibatch convergence • For convex functions, the convergence rate for SGD is $\mathcal{O}\!\left(\frac{1}{\sqrt{k}}\right)$ • For mini-batch updates with batches of size $b$, the convergence rate is $\mathcal{O}\!\left(\frac{1}{\sqrt{bk}} + \frac{1}{k}\right)$ – Apparently an improvement of $\sqrt{b}$ over SGD – But since the batch size is $b$, we perform $b$ times as many computations per iteration as SGD – We actually get a degradation of $\sqrt{b}$ • However, in practice – The objectives are generally not convex; mini-batches are more effective with the right learning rates – We also get additional benefits of vector processing
SGD example • Mini-batch performs comparably to batch training on this simple problem – But converges orders of magnitude faster
Measuring Error • Convergence is generally defined in terms of the overall training error – Not the sample or batch error • It is infeasible to actually measure the overall training error after each iteration • More typically, we estimate it as – The divergence or classification error on a held-out set – The average sample/batch error over the past $N$ samples/batches
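One cheap proxy is an exponential running average of recent batch errors; a small sketch (the smoothing constant is an illustrative assumption):

```python
def update_running_error(running_err, batch_err, alpha=0.99):
    """Exponentially weighted average of recent batch errors,
    used as a proxy for the overall training error."""
    return alpha * running_err + (1.0 - alpha) * batch_err
```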
Training and minibatches • In practice, training is usually performed using mini-batches – The mini-batch size is a hyperparameter to be optimized • Convergence depends on the learning rate – Simple technique: fix the learning rate until the error plateaus, then reduce the learning rate by a fixed factor (e.g. 10) – Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
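A sketch of the simple plateau heuristic, assuming an error history recorded once per epoch (e.g. on a held-out set); the patience and the factor of 10 are illustrative choices:

```python
def maybe_reduce_lr(eta, history, patience=3, factor=10.0, tol=1e-4):
    """Divide the learning rate by `factor` if the error has not improved
    by more than `tol` over the last `patience` epochs."""
    if len(history) > patience and \
       min(history[-patience:]) > min(history[:-patience]) - tol:
        eta = eta / factor
    return eta
```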
Recall: Momentum • The momentum method:
$$\Delta W^{(k)} = \beta\,\Delta W^{(k-1)} - \eta\,\nabla_W Err\!\left(W^{(k-1)}\right)$$
• Updates using a running average of the gradient
Momentum and incremental updates • The momentum method:
$$\Delta W^{(k)} = \beta\,\Delta W^{(k-1)} - \eta\,\nabla_W Err\!\left(W^{(k-1)}\right)$$
• Incremental SGD and mini-batch gradients tend to have high variance • Momentum smooths out the variations – Smoother and faster convergence
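A minimal sketch of one momentum step on a mini-batch gradient, assuming a hypothetical `grad_batch` callable that returns the current mini-batch gradient:

```python
def momentum_step(W, dW_prev, grad_batch, eta=0.01, beta=0.9):
    """Classical momentum: the step is a running average of past (noisy) gradients."""
    dW = beta * dW_prev - eta * grad_batch(W)   # gradient taken at the current position
    return W + dW, dW
```

Compare with the Nesterov sketch earlier: the only difference is where the gradient is evaluated.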
Nesterov's Accelerated Gradient • At any iteration, to compute the current step: – First extend the previous step – Then compute the gradient at the resultant position – Add the two to obtain the final step • This also applies directly to incremental update methods – The accelerated gradient smooths out the variance in the gradients
More recent methods • Several newer methods have been proposed that follow the general pattern of enhancing long- term trends to smooth out the variations of the mini-batch gradient – RMS Prop – ADAM: very popular in practice – Adagrad – AdaDelta – … • All roughly equivalent in performance
Variance-normalized step • In the recent past – Total movement in the Y component of the updates is high – Movement in the X component is lower • In the current update, modify the usual gradient-based update: – Scale down the Y component – Scale up the X component • A variety of algorithms have been proposed on this premise – We will see a popular example
RMS Prop • Notation: – Updates are by parameter – The derivative of the divergence w.r.t. any individual parameter $w$ is shown as $\partial_w D$ – The squared derivative is $\partial_w^2 D = \left(\partial_w D\right)^2$ – The mean squared derivative is a running estimate of the average squared derivative; we will show this as $E\!\left[\partial_w^2 D\right]$ • Modified update rule: we want to – scale down updates with large mean squared derivatives – scale up updates with small mean squared derivatives
RMS Prop • This is a variant on the basic mini-batch SGD algorithm • Procedure: – Maintain a running estimate of the mean squared value of derivatives for each parameter – Scale the update of the parameter by the inverse of the root mean squared derivative
$$E\!\left[\partial_w^2 D\right]_k = \gamma\,E\!\left[\partial_w^2 D\right]_{k-1} + (1-\gamma)\left(\partial_w^2 D\right)_k$$
$$w_{k+1} = w_k - \frac{\eta}{\sqrt{E\!\left[\partial_w^2 D\right]_k + \epsilon}}\;\partial_w D$$
RMS Prop (updates are for each weight of each layer)
• Do:
  – Randomly shuffle inputs to change their order
  – Initialize: $k = 1$; for all weights $w$ in all layers, $E\!\left[\partial_w^2 D\right]_k = 0$
  – For all $t = 1:B:T$  (incrementing in blocks of $B$ inputs)
    • For all weights in all layers initialize $\left(\partial_w D\right)_k = 0$
    • For $b = 0:B-1$
      – Compute
        » Output $Y(X_{t+b})$
        » Compute gradient $\dfrac{d\,Div\!\left(Y(X_{t+b}), d_{t+b}\right)}{dw}$
        » Compute $\left(\partial_w D\right)_k \mathrel{+}= \dfrac{d\,Div\!\left(Y(X_{t+b}), d_{t+b}\right)}{dw}$
    • Update:
      $E\!\left[\partial_w^2 D\right]_k = \gamma\,E\!\left[\partial_w^2 D\right]_{k-1} + (1-\gamma)\left(\partial_w^2 D\right)_k$
      $w_{k+1} = w_k - \dfrac{\eta}{\sqrt{E\!\left[\partial_w^2 D\right]_k + \epsilon}}\;\partial_w D$
    • $k = k + 1$
• Until $Err(W_1, W_2, \ldots, W_K)$ has converged
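A per-parameter NumPy sketch of the scaling step above, assuming the mini-batch gradient `g` has already been accumulated; the constants are common illustrative defaults, not values from the slides:

```python
import numpy as np

def rmsprop_step(w, g, mean_sq, eta=0.001, gamma=0.9, eps=1e-8):
    """RMS Prop: keep a running mean of squared derivatives per parameter and
    scale each update by the inverse root-mean-squared derivative."""
    mean_sq = gamma * mean_sq + (1.0 - gamma) * g ** 2   # running E[(dD/dw)^2]
    w = w - eta * g / np.sqrt(mean_sq + eps)             # variance-normalized step
    return w, mean_sq
```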
Visualizing the optimizers: Beale's Function • http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Long Valley • http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Saddle Point • http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Story so far • Gradient descent can be sped up by incremental updates – Convergence is guaranteed under most conditions – Stochastic gradient descent: update after each observation. Can be much faster than batch learning – Mini-batch updates: update after batches. Can be more efficient than SGD • Convergence can be improved using smoothed updates – RMSprop and more advanced techniques
Topics for the day • Incremental updates • Revisiting “trend” algorithms • Generalization • Tricks of the trade – Divergences.. – Activations – Normalizations
Tricks of the trade.. • To make the network converge better – The Divergence – Dropout – Batch normalization – Other tricks • Gradient clipping • Data augmentation • Other hacks..
Training Neural Nets by Gradient Descent: The Divergence • Total training error:
$$Err = \frac{1}{T}\sum_t Div\!\left(Y_t, d_t;\, W_1, W_2, \ldots, W_K\right)$$
• The convergence of the gradient descent depends on the divergence – Ideally, it must have a shape that results in a significant gradient in the right direction outside the optimum • To "guide" the algorithm to the right solution