Neural Networks: Optimization Part 2

Intro to Deep Learning, Fall 2017


  1. Nesterov's Accelerated Gradient • Change the order of operations • At any iteration, to compute the current step: – First extend the previous step – Then compute the gradient step at the resultant position – Add the two to obtain the final step

  2. Nesterov's Accelerated Gradient • Nesterov's method: ΔW^(k) = β ΔW^(k−1) − η ∇_W Err(W^(k−1) + β ΔW^(k−1)), W^(k) = W^(k−1) + ΔW^(k)
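A minimal sketch of one such step, assuming a user-supplied grad_fn that returns ∇_W Err at a given point; the toy quadratic in the usage lines and the hyperparameter values are illustrative, not from the slides:

```python
import numpy as np

def nesterov_step(W, dW_prev, grad_fn, lr=0.1, beta=0.9):
    """ΔW = β·ΔW_prev − η·∇_W Err(W + β·ΔW_prev); W_new = W + ΔW."""
    lookahead = W + beta * dW_prev                  # first extend the previous step
    dW = beta * dW_prev - lr * grad_fn(lookahead)   # gradient step at the resultant position
    return W + dW, dW                               # add the two to obtain the final step

# Usage on a toy quadratic Err(W) = ||W - 3||^2, so grad_fn(W) = 2(W - 3)
W, dW = np.zeros(2), np.zeros(2)
for _ in range(50):
    W, dW = nesterov_step(W, dW, lambda w: 2 * (w - 3))
```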

  3. Nesterov's Accelerated Gradient • Comparison with momentum (example from Hinton) • Converges much faster

  4. Moving on: Topics for the day • Incremental updates • Revisiting "trend" algorithms • Generalization • Tricks of the trade – Divergences.. – Activations – Normalizations

  5. The training formulation • (Figure: output y plotted against input X) • Given input-output pairs at a number of locations, estimate the entire function

  6. Gradient descent • Start with an initial function • Adjust its value at all points to make the outputs closer to the required value – Gradient descent adjusts parameters to adjust the function value at all points – Repeat this iteratively until we get arbitrarily close to the target function at the training points

  10. Effect of number of samples • Problem with conventional gradient descent: we try to simultaneously adjust the function at all training points – We must process all training points before making a single adjustment – "Batch" update

  11. Alternative: Incremental update • Alternative: adjust the function at one training point at a time – Keep adjustments small – Eventually, when we have processed all the training points, we will have adjusted the entire function • With greater overall adjustment than we would if we made a single "batch" update

  16. Incremental Update: Stochastic Gradient Descent • Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T) • Initialize all weights W_1, W_2, …, W_K • Do: – For all t = 1:T • For every layer k: – Compute ∇_{W_k} Div(Y_t, d_t) – Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t) • Until Err has converged

  17. Caveats: order of presentation • If we loop through the samples in the same order, we may get cyclic behavior

  18. Caveats: order of presentation • If we loop through the samples in the same order, we may get cyclic behavior • We must go through them randomly

  22. Caveats: order of presentation • If we loop through the samples in the same order, we may get cyclic behavior • We must go through them randomly to get more convergent behavior

  23. Caveats: learning rate • (Figure: output y plotted against input X) • Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances – Correcting the function for individual instances will lead to never-ending, non-convergent updates – We must shrink the learning rate with iterations to prevent this • Correction for individual instances with the eventual minuscule learning rates will not modify the function

  24. Incremental Update: Stochastic Gradient Descent • Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T) • Initialize all weights W_1, W_2, …, W_K; j = 0 • Do: – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T) – For all t = 1:T • j = j + 1 • For every layer k: – Compute ∇_{W_k} Div(Y_t, d_t) – Update W_k = W_k − η_j ∇_{W_k} Div(Y_t, d_t) • Until Err has converged

  25. Incremental Update: Stochastic Gradient Descent • Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T) • Initialize all weights W_1, W_2, …, W_K; j = 0 • Do: – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T) (randomize input order) – For all t = 1:T • j = j + 1 • For every layer k: – Compute ∇_{W_k} Div(Y_t, d_t) – Update W_k = W_k − η_j ∇_{W_k} Div(Y_t, d_t) (the learning rate η_j reduces with j) • Until Err has converged
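A minimal numpy sketch of this loop; the collapse of the per-layer structure into a single weight vector, the assumed helper grad_div(W, x, d) returning the per-sample gradient, and the η_j = η_0/j schedule are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sgd(X, d, grad_div, W0, eta0=0.1, epochs=10):
    """Per-sample SGD with random presentation order and a shrinking step size."""
    W, j = W0.copy(), 0
    for _ in range(epochs):
        order = np.random.permutation(len(X))          # randomize input order
        for t in order:
            j += 1
            W -= (eta0 / j) * grad_div(W, X[t], d[t])  # learning rate reduces with j
    return W
```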

  26. Stochastic Gradient Descent • The iterations can make multiple passes over the data • A single pass through the entire training data is called an "epoch" – An epoch over a training set with T samples results in T updates of parameters

  27. When does SGD work • SGD converges "almost surely" to a global or local minimum for most functions – Sufficient condition: the step sizes satisfy Σ_k η_k = ∞ (eventually the entire parameter space can be searched) and Σ_k η_k² < ∞ (the steps shrink) – The fastest converging series that satisfies both requirements is η_k ∝ 1/k • This is the optimal rate of shrinking the step size for strongly convex functions – More generally, the learning rates are optimally determined • If the loss is convex, SGD converges to the optimal solution • For non-convex losses SGD converges to a local minimum
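A quick numerical illustration of the two step-size conditions for the η_k ∝ 1/k schedule; the constant 0.1 and the truncation at 10⁶ terms are arbitrary choices for the demo:

```python
import numpy as np

k = np.arange(1, 1_000_001)
eta = 0.1 / k                 # eta_k ∝ 1/k
print(eta.sum())              # partial sum ≈ 0.1·ln(10^6) and keeps growing with k: Σ eta_k = ∞
print((eta ** 2).sum())       # ≈ 0.1²·π²/6 and stays bounded: Σ eta_k² < ∞
```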

  28. Batch gradient convergence • In contrast, using the batch update method, for strongly convex functions, |W^(k) − W*| < c^k |W^(0) − W*| – Giving us the iterations to ε convergence as O(log(1/ε)) • For generic convex functions, the ε convergence is O(1/ε) • Batch gradients converge "faster" – But SGD performs T updates for every batch update

  29. SGD convergence • We will define convergence in terms of the number of iterations taken to get within ε of the optimal solution: f(W^(k)) − f(W*) < ε – Note: f(W) here is the error on the entire training data, although SGD itself updates after every training instance • Using the optimal learning rate 1/k, for strongly convex functions, |W^(k) − W*| < (1/k) |W^(0) − W*| – Giving us the iterations to ε convergence as O(1/ε) • For generically convex (but not strongly convex) functions, various proofs report an ε-convergence of O(1/√k) using a learning rate of 1/√k

  30. SGD Convergence: Loss value • If: – f is λ-strongly convex, and – at step t we have a noisy estimate of the subgradient ĝ_t with E[‖ĝ_t‖²] ≤ G² for all t, – and we use step size η_t = 1/(λt) • Then for any T > 1: E[f(w_T) − f(w*)] ≤ 17G²(1 + log T) / (λT)

  31. SGD Convergence • We can bound the expected difference between the loss over our data using the optimal weights w* and the weights at any single iteration w_T: O(log(T)/T) for strongly convex loss, or O(log(T)/√T) for convex loss • Averaging schemes can improve the bound to O(1/T) and O(1/√T) • Smoothness of the loss is not required

  32. SGD example • A simpler problem: K-means • Note: SGD converges slower • Also note the rather large variation between runs – Let's try to understand these results..

  33. Recall: Modelling a function • (Figure: target g(X); network output Y = f(X; W)) • To learn a network f(X; W) to model a function g(X), we minimize the expected divergence: Ŵ = argmin_W ∫_X div(f(X; W), g(X)) P(X) dX = argmin_W E[div(f(X; W), g(X))]

  34. Recall: The Empirical risk • (Figure: training samples (X_i, d_i)) • In practice, we minimize the empirical error: Err(f(X; W), g(X)) = (1/N) Σ_{i=1}^{N} div(f(X_i; W), d_i), with Ŵ = argmin_W Err(f(X; W), g(X)) • The expected value of the empirical error is actually the expected divergence: E[Err(f(X; W), g(X))] = E[div(f(X; W), g(X))]

  35. Recap: The Empirical risk • In practice, we minimize the empirical error: Err(f(X; W), g(X)) = (1/N) Σ_{i=1}^{N} div(f(X_i; W), d_i), with Ŵ = argmin_W Err(f(X; W), g(X)) – The empirical error is an unbiased estimate of the expected error – Though there is no guarantee that minimizing it will minimize the expected error • The expected value of the empirical error is actually the expected error: E[Err(f(X; W), g(X))] = E[div(f(X; W), g(X))]

  36. Recap: The Empirical risk • In practice, we minimize the empirical error: Err(f(X; W), g(X)) = (1/N) Σ_{i=1}^{N} div(f(X_i; W), d_i), with Ŵ = argmin_W Err(f(X; W), g(X)) – The empirical error is an unbiased estimate of the expected error, though there is no guarantee that minimizing it will minimize the expected error – The variance of the empirical error: var(Err) = (1/N) var(div), i.e. the variance of the estimator is proportional to 1/N – The larger this variance, the greater the likelihood that the W that minimizes the empirical error will differ significantly from the W that minimizes the expected error • The expected value of the empirical error is actually the expected error: E[Err(f(X; W), g(X))] = E[div(f(X; W), g(X))]

  37. SGD • At each iteration, SGD focuses on the divergence of a single sample: div(f(X_i; W), d_i) • The expected value of the sample error is still the expected divergence E[div(f(X; W), g(X))]

  38. SGD • At each iteration, SGD focuses on the divergence of a single sample: div(f(X_i; W), d_i) – The sample error is also an unbiased estimate of the expected error • The expected value of the sample error is still the expected divergence E[div(f(X; W), g(X))]

  39. SGD • At each iteration, SGD focuses on the divergence of a single sample: div(f(X_i; W), d_i) – The variance of the sample error is the variance of the divergence itself: var(div) – This is N times the variance of the empirical average minimized by the batch update • The expected value of the sample error is still the expected divergence E[div(f(X; W), g(X))]

  40. Explaining the variance • (Figure: target function in blue, model approximation at a given W in red) • The blue curve is the function being approximated • The red curve is the approximation by the model at a given W • The heights of the shaded regions represent the point-by-point error – The divergence is a function of the error – We want to find the W that minimizes the average divergence

  41. Explaining the variance • The sample estimate approximates the shaded area with the average length of the error lines at the sampled points • Variance: the spread between the estimates obtained from different sample draws is the variance

  42. Explaining the variance • The sample estimate approximates the shaded area with the average length of the error lines • This average length will change with the position of the samples

  43. Explaining the variance • Having more samples makes the estimate more robust to changes in the position of samples – The variance of the estimate is smaller

  44. Explaining the variance • (Figure: estimate with only one sample) • Having very few samples makes the estimate swing wildly with the sample position – Since our estimator learns the W that minimizes this estimate, the learned W too can swing wildly

  47. SGD example • A simpler problem: K-means • Note: SGD converges slower • Also has large variation between runs

  48. SGD vs batch • SGD uses the gradient from only one sample at a time, and is consequently high variance • But it also provides significantly quicker updates than batch • Is there a good medium?

  49. Alternative: Mini-batch update • Alternative: adjust the function at a small, randomly chosen subset of points – Keep adjustments small – If the subsets cover the training set, we will have adjusted the entire function • As before, vary the subsets randomly in different passes through the training data

  50. Incremental Update: Mini-batch update • Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T) • Initialize all weights W_1, W_2, …, W_K; j = 0 • Do: – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T) – For t = 1:b:T • j = j + 1 • For every layer k: – ΔW_k = 0 • For t' = t : t+b−1 – For every layer k: » Compute ∇_{W_k} Div(Y_{t'}, d_{t'}) » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t'}, d_{t'}) • Update – For every layer k: W_k = W_k − η_j ΔW_k • Until Err has converged

  51. Incremental Update: Mini-batch update • Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T) • Initialize all weights W_1, W_2, …, W_K; j = 0 • Do: – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T) – For t = 1:b:T (b is the mini-batch size) • j = j + 1 • For every layer k: – ΔW_k = 0 • For t' = t : t+b−1 – For every layer k: » Compute ∇_{W_k} Div(Y_{t'}, d_{t'}) » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t'}, d_{t'}) • Update – For every layer k: W_k = W_k − η_j ΔW_k (η_j is the shrinking step size) • Until Err has converged
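A hedged numpy sketch of the same loop with the layer structure again collapsed into a single weight vector; grad_div(W, x, d) is an assumed per-sample gradient helper and the hyperparameter values are placeholders:

```python
import numpy as np

def minibatch_sgd(X, d, grad_div, W0, b=32, eta0=0.1, epochs=10):
    """Shuffle, accumulate per-sample gradients over each batch of size b,
    then take one update with a shrinking step size."""
    W, j, T = W0.copy(), 0, len(X)
    for _ in range(epochs):
        order = np.random.permutation(T)
        for t in range(0, T, b):                                # for t = 1:b:T
            j += 1
            batch = order[t:t + b]
            dW = sum(grad_div(W, X[i], d[i]) for i in batch)    # accumulate ΔW over the batch
            W -= (eta0 / j) * dW                                # W = W − η_j ΔW
    return W
```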

  52. Mini Batches • Mini-batch updates compute and minimize a batch error: BatchErr(f(X; W), g(X)) = (1/b) Σ_{i=1}^{b} div(f(X_i; W), d_i) • The expected value of the batch error is also the expected divergence: E[BatchErr(f(X; W), g(X))] = E[div(f(X; W), g(X))]

  53. Mini Batches • Mini-batch updates compute an empirical batch error: BatchErr(f(X; W), g(X)) = (1/b) Σ_{i=1}^{b} div(f(X_i; W), d_i) – The batch error is also an unbiased estimate of the expected error • The expected value of the batch error is also the expected divergence: E[BatchErr(f(X; W), g(X))] = E[div(f(X; W), g(X))]

  54. Mini Batches • Mini-batch updates compute an empirical batch error: BatchErr(f(X; W), g(X)) = (1/b) Σ_{i=1}^{b} div(f(X_i; W), d_i) – The batch error is also an unbiased estimate of the expected error – The variance of the batch error: var(Err) = (1/b) var(div); this will be much smaller than the variance of the sample error in SGD • The expected value of the batch error is also the expected divergence: E[BatchErr(f(X; W), g(X))] = E[div(f(X; W), g(X))]

  55. Minibatch convergence • For convex functions, the convergence rate for SGD is O(1/√k) • For mini-batch updates with batches of size b, the convergence rate is O(1/√(bk) + 1/k) – Apparently an improvement of √b over SGD – But since the batch size is b, we perform b times as many computations per iteration as SGD – We actually get a degradation of √b • However, in practice – The objectives are generally not convex; mini-batches are more effective with the right learning rates – We also get the additional benefits of vector processing

  56. SGD example • Mini-batch performs comparably to batch training on this simple problem – But converges orders of magnitude faster

  57. Measuring Error • Convergence is generally defined in terms of the overall training error – Not the sample or batch error • It is infeasible to actually measure the overall training error after each iteration • More typically, we estimate it as – Divergence or classification error on a held-out set – Average sample/batch error over the past N samples/batches
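A hedged sketch of the second option, a running average of recent batch errors over a sliding window (the window length of 100 is an arbitrary choice):

```python
from collections import deque

class RunningError:
    """Track the average batch error over the past N batches."""
    def __init__(self, window=100):
        self.errors = deque(maxlen=window)

    def update(self, batch_err):
        self.errors.append(batch_err)
        return sum(self.errors) / len(self.errors)   # current estimate of the training error
```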

  58. Training and minibatches • In practice, training is usually performed using mini-batches – The mini-batch size is a hyperparameter to be optimized • Convergence depends on the learning rate – Simple technique: fix the learning rate until the error plateaus, then reduce the learning rate by a fixed factor (e.g. 10) – Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
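A sketch of the simple technique just described: hold the learning rate fixed until the monitored error plateaus, then cut it by a fixed factor (the patience of 3 checks and the factor of 10 are assumptions for illustration):

```python
class StepDecayOnPlateau:
    """Reduce the learning rate by `factor` when the monitored error has not
    improved for `patience` consecutive checks."""
    def __init__(self, lr, factor=10.0, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_checks = float("inf"), 0

    def step(self, err):
        if err < self.best:
            self.best, self.bad_checks = err, 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.lr /= self.factor        # e.g. 0.1 -> 0.01
                self.bad_checks = 0
        return self.lr
```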

  60. Recall: Momentum • The momentum method: ΔW^(k) = β ΔW^(k−1) − η ∇_W Err(W^(k−1)) • Updates using a running average of the gradient

  61. Momentum and incremental updates • The momentum method: ΔW^(k) = β ΔW^(k−1) − η ∇_W Err(W^(k−1)) • Incremental SGD and mini-batch gradients tend to have high variance • Momentum smooths out the variations – Smoother and faster convergence
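A minimal sketch of one momentum step applied to a noisy mini-batch gradient estimate; the hyperparameter values are illustrative assumptions:

```python
def momentum_step(W, dW_prev, grad, lr=0.1, beta=0.9):
    """ΔW = β·ΔW_prev − η·grad, where grad estimates ∇_W Err; the running
    combination of past steps smooths out the variance of the estimates."""
    dW = beta * dW_prev - lr * grad
    return W + dW, dW     # W_new = W + ΔW
```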

  62. Nesterov's Accelerated Gradient • At any iteration, to compute the current step: – First extend the previous step – Then compute the gradient at the resultant position – Add the two to obtain the final step • This also applies directly to incremental update methods – The accelerated gradient smooths out the variance in the gradients

  63. More recent methods • Several newer methods have been proposed that follow the general pattern of enhancing long-term trends to smooth out the variations of the mini-batch gradient – RMS Prop – ADAM: very popular in practice – Adagrad – AdaDelta – … • All roughly equivalent in performance

  64. Variance-normalized step • In the recent past – Total movement in the Y component of updates is high – Movement in the X component is lower • For the current update, modify the usual gradient-based update: – Scale down the Y component – Scale up the X component • A variety of algorithms have been proposed on this premise – We will see a popular example

  65. RMS Prop • Notation: – Updates are by parameter – The derivative of the divergence w.r.t. any individual parameter w is shown as ∂_w D – The squared derivative is ∂²_w D = (∂_w D)² – The mean squared derivative is a running estimate of the average squared derivative; we will show this as E[∂²_w D] • Modified update rule: we want to – scale down updates with large mean squared derivatives – scale up updates with small mean squared derivatives

  66. RMS Prop • This is a variant on the basic mini-batch SGD algorithm • Procedure: – Maintain a running estimate of the mean squared value of derivatives for each parameter: E[∂²_w D]_k = γ E[∂²_w D]_{k−1} + (1 − γ) (∂²_w D)_k – Scale the update of the parameter by the inverse of the root mean squared derivative: w_{k+1} = w_k − (η / √(E[∂²_w D]_k + ε)) ∂_w D

  67. RMS Prop (updates are for each weight of each layer) • Do: – Randomly shuffle inputs to change their order – Initialize: k = 1; for all weights w in all layers, E[∂²_w D]_k = 0 – For all t = 1:B:T (incrementing in blocks of B inputs) • For all weights in all layers initialize (∂_w D)_k = 0 • For b = 0:B−1 – Compute » Output Y(X_{t+b}) » Gradient dDiv(Y(X_{t+b}), d_{t+b})/dw » (∂_w D)_k += dDiv(Y(X_{t+b}), d_{t+b})/dw • Update: E[∂²_w D]_k = γ E[∂²_w D]_{k−1} + (1 − γ) (∂²_w D)_k, w_{k+1} = w_k − (η / √(E[∂²_w D]_k + ε)) ∂_w D • k = k + 1 • Until Err(W_1, W_2, …, W_K) has converged
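A hedged numpy sketch of the same procedure with all parameters flattened into a single vector; grad_div(W, x, d) returning the per-sample gradient is an assumed helper, and the hyperparameter values are typical defaults rather than values from the slides:

```python
import numpy as np

def rmsprop(X, d, grad_div, W0, lr=0.001, gamma=0.9, eps=1e-8, B=32, epochs=10):
    """Mini-batch RMS Prop: keep a running mean of squared derivatives per
    parameter and scale each update by the inverse root mean squared derivative."""
    W = W0.copy()
    mean_sq = np.zeros_like(W)                       # running estimate E[∂²_w D]
    T = len(X)
    for _ in range(epochs):
        order = np.random.permutation(T)             # shuffle input order
        for t in range(0, T, B):
            batch = order[t:t + B]
            g = sum(grad_div(W, X[i], d[i]) for i in batch)   # accumulated block gradient
            mean_sq = gamma * mean_sq + (1 - gamma) * g ** 2
            W -= lr * g / np.sqrt(mean_sq + eps)     # scale by inverse RMS derivative
    return W
```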

  68. Visualizing the optimizers: Beale's Function • http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

  69. Visualizing the optimizers: Long Valley • http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

  70. Visualizing the optimizers: Saddle Point • http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

  71. Story so far • Gradient descent can be sped up by incremental updates – Convergence is guaranteed under most conditions – Stochastic gradient descent: update after each observation; can be much faster than batch learning – Mini-batch updates: update after batches; can be more efficient than SGD • Convergence can be improved using smoothed updates – RMSprop and more advanced techniques

  72. Topics for the day • Incremental updates • Revisiting "trend" algorithms • Generalization • Tricks of the trade – Divergences.. – Activations – Normalizations

  73. Tricks of the trade.. • To make the network converge better – The divergence – Dropout – Batch normalization – Other tricks • Gradient clipping • Data augmentation • Other hacks..

  74. Training Neural Nets by Gradient Descent: The Divergence • Total training error: Err = (1/T) Σ_t Div(Y_t, d_t; W_1, W_2, …, W_K) • The convergence of the gradient descent depends on the divergence – Ideally, it must have a shape that results in a significant gradient in the right direction outside the optimum • To "guide" the algorithm to the right solution
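As a hedged illustration of the total-training-error formula with a concrete divergence (cross-entropy here is just one common choice; the slides do not commit to a specific Div at this point):

```python
import numpy as np

def cross_entropy_div(y, d, eps=1e-12):
    """Per-sample divergence Div(Y_t, d_t) for a one-hot target d and softmax output y."""
    return -np.sum(d * np.log(y + eps))

def total_training_error(Y, D):
    """Err = (1/T) Σ_t Div(Y_t, d_t) over the whole training set."""
    return np.mean([cross_entropy_div(y, d) for y, d in zip(Y, D)])
```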
