Optimization for Machine Learning
Tom Schaul (schaul@cims.nyu.edu)
Recap: Learning Machines
• Learning machines (neural networks, etc.)
• Forward passes produce a function of the input
• Trainable parameters (aka weights, biases, etc.)
• Backward passes compute gradients of the loss (w.r.t. the parameters)
• Modular structure: gradients via the chain rule (aka Backprop)
• Loss function (aka energy, cost, error)
  • an expectation over samples from the dataset
• Today: algorithms for minimizing the loss
Flattening Parameters
• Parameter space: collect all weights and biases into a single vector θ ∈ R^N
• Gradient from backprop: ∇L(θ) ∈ R^N
• Element-wise correspondence between parameters and gradient components
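A minimal sketch of this flattening, assuming the parameters live in a list of NumPy arrays; the helper names are illustrative, not from the lecture:

```python
import numpy as np

def flatten_params(param_list):
    """Concatenate a list of parameter arrays into one flat vector theta."""
    return np.concatenate([p.ravel() for p in param_list])

def unflatten_params(theta, param_list):
    """Split a flat vector back into arrays with the original shapes."""
    out, i = [], 0
    for p in param_list:
        out.append(theta[i:i + p.size].reshape(p.shape))
        i += p.size
    return out

# Example: a weight matrix and a bias vector become one vector in R^8
params = [np.random.randn(3, 2), np.zeros(2)]
theta = flatten_params(params)             # shape (8,)
restored = unflatten_params(theta, params)
```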
Energy Surfaces
• We can visualize the loss as a function of the parameters
• Properties:
  • Local optima
  • Saddle points
  • Steep cliffs
  • Narrow, bent valleys
  • Flat areas
• Only convex in the simplest cases
  • Convex optimization tools are of limited use
Sample Variance
• Every sample has a contribution to the loss
• Sample distributions are complex
• Sample gradients can have high variance
Optimization Types
• First-order methods, aka gradient descent
  • use gradients
  • incremental steps downhill on the surface
• Second-order methods
  • use second derivatives (curvature)
  • attempt large jumps (into the bottom of the valley)
• Zeroth-order methods, aka black-box
  • use only values of the loss function
  • somewhat random jumps
Batch vs. Stochastic
• Batch methods are based on the true loss
  • Reliable gradients, large updates
• Stochastic methods use sample gradients
  • Many more updates, smaller steps
• Minibatch methods interpolate in between
  • Gradients are averaged over n samples
Gradient Descent
• Step in the direction of steepest descent
• The gradient comes from backprop
• How to choose the step size?
  • Line search (extra evaluations)
  • A fixed learning rate
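A minimal gradient-descent sketch with a fixed step size; the toy quadratic loss and the function names are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def loss_and_grad(theta):
    """Toy quadratic loss 0.5 * ||theta||^2 with gradient theta."""
    return 0.5 * float(theta @ theta), theta

def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        _, g = loss_and_grad(theta)   # gradient would come from backprop
        theta = theta - lr * g        # step in the direction of steepest descent
    return theta

theta = gradient_descent(np.array([1.0, -2.0]))
```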
Convergence of GD (1D)
• Iteratively approach the optimum
Optimal Learning Rate (1D)
• Weight change: Δw = −η dE/dw
• With a quadratic loss (curvature E'')
• Optimal learning rate: η_opt = 1/E''
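A worked version of the standard 1D argument, assuming the quadratic model above (a sketch, not the slide's own derivation):

```latex
\Delta w = -\eta \frac{dE}{dw}, \qquad
E(w) \approx E(w^*) + \tfrac{1}{2} E''(w^*)\,(w - w^*)^2
\;\Rightarrow\;
w_{t+1} - w^* = \bigl(1 - \eta\,E''(w^*)\bigr)\,(w_t - w^*),
\qquad \eta_{\text{opt}} = \frac{1}{E''(w^*)}
```

With η = η_opt the error factor is zero (one-step convergence); for η > 2 η_opt the factor exceeds 1 in magnitude and the iteration diverges.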
Convergence of GD (N-Dim)
• Assumption: smooth loss function
• Quadratic approximation around the optimum with Hessian matrix H
• Convergence condition: the iteration matrix (I − ηH) must shrink any vector
Convergence of GD (N-Dim)
• Change of coordinates such that H becomes diagonal (rotate into its eigenbasis)
• Then each coordinate evolves independently
• Intuition: GD in N dimensions is equivalent to N 1D descents along the eigenvectors of H
• Convergence if η < 2/λ_max (the largest eigenvalue of H)
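A small illustrative sketch of this condition on a toy quadratic of my own choosing (not from the slides): GD converges when η < 2/λ_max and diverges otherwise.

```python
import numpy as np

H = np.diag([1.0, 10.0])        # Hessian with eigenvalues 1 and 10
lam_max = 10.0

def run_gd(eta, steps=100):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - eta * (H @ theta)   # gradient of 0.5 * theta^T H theta
    return np.linalg.norm(theta)

print(run_gd(0.19))   # eta < 2/lam_max = 0.2  -> norm shrinks toward 0
print(run_gd(0.21))   # eta > 2/lam_max        -> norm blows up
```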
GD Convergence: Example
• Batch GD
• Small learning rate
• Convergence
GD Convergence: Example
• Batch GD
• Large learning rate
• Divergence
GD Convergence: Example
• Stochastic GD
• Large learning rate
• Fast convergence
Convergence Speed
• With the optimal fixed learning rate
  • One-step convergence in the matching eigendirection
  • Slower in all others
• Total number of iterations is proportional to the condition number of the Hessian (λ_max/λ_min)
Optimal LR Estimation
• A cheap way of finding λ_max (without computing H first)
• Part 1: cheap Hessian-vector products
• Part 2: the power method
Hessian-vector Products
• Based on the finite-difference approximation
  H v ≈ (∇L(θ + ε v) − ∇L(θ)) / ε, for small ε
• We obtain H v from just one additional forward/backward pass (after perturbing the parameters)
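A sketch of this finite-difference Hessian-vector product, assuming a `grad(theta)` function that stands in for one backward pass (the toy quadratic is only for illustration):

```python
import numpy as np

def grad(theta):
    """Stand-in for backprop: gradient of a toy quadratic loss."""
    H = np.diag([1.0, 10.0])
    return H @ theta

def hessian_vector_product(theta, v, eps=1e-4):
    """H v ~= (grad(theta + eps*v) - grad(theta)) / eps: one extra fwd/bwd pass."""
    return (grad(theta + eps * v) - grad(theta)) / eps

theta = np.array([1.0, -1.0])
v = np.array([0.0, 1.0])
print(hessian_vector_product(theta, v))   # approximately H @ v = [0, 10]
```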
Power Method
• We know that iterating v ← H v / ‖H v‖ converges to the principal eigenvector, with ‖H v‖ → λ_max
• With sample estimates (on-line) we introduce some robustness by averaging the successive estimates
  (good enough after 10-100 samples)
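A sketch of the power iteration on top of Hessian-vector products; the toy Hessian, iteration count, and helper names are assumptions for illustration:

```python
import numpy as np

H = np.diag([1.0, 10.0])            # toy Hessian; in practice replace hvp with
                                    # the finite-difference product above
def hvp(v):
    return H @ v

def estimate_lambda_max(dim, iters=50):
    v = np.random.randn(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(v)
        v = hv / np.linalg.norm(hv)   # converges to the principal eigenvector
    return float(v @ hvp(v))          # Rayleigh quotient ~ largest eigenvalue

lam = estimate_lambda_max(2)
eta = 1.0 / lam                       # a safe step size, cf. eta < 2 / lam_max
```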
Optimal LR Estimation
Conditioning of H
• Some parameters are more sensitive than others
  • Very different scales
• Illustration
• Solution 1 (model/data)
• Solution 2 (algorithm)
H-eigenvalues in Neural Nets (1)
• Few large ones
• Many medium ones
• Spanning orders of magnitude
H-eigenvalues in Neural Nets (2)
• Differences by layer
• Steeper gradients on biases
H-Conditioning: Solution 1
• Normalize the data
  • Always useful, rarely sufficient
• How?
  • Subtract the mean from the inputs
  • If possible: decorrelate the inputs
  • Divide by the standard deviation on each input (all unit variance)
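A hedged sketch of this normalization recipe (mean subtraction, optional decorrelation via the covariance eigenbasis, unit variance); the function and variable names are mine:

```python
import numpy as np

def normalize_inputs(X, decorrelate=False, eps=1e-8):
    """X: (n_samples, n_features). Zero mean, optionally decorrelated, unit variance."""
    X = X - X.mean(axis=0)                    # subtract the mean from the inputs
    if decorrelate:
        cov = np.cov(X, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)
        X = X @ eigvecs                       # rotate onto principal axes
    return X / (X.std(axis=0) + eps)          # unit variance per input

X = np.random.randn(100, 3) * np.array([1.0, 10.0, 0.1]) + 5.0
Xn = normalize_inputs(X, decorrelate=True)
```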
H-Conditioning: Solution 1
• Normalize the data
• Structural choices
  • Non-linearities with zero-mean, unit-variance activations
  • Explicit normalization layers
• Weight initialization
  • such that all hidden activations have approximately zero mean and unit variance
H-Conditioning: Solution 2
• Algorithmic solution:
  • Take smaller steps in sensitive directions
• One learning rate per parameter
  • Estimate the diagonal of the Hessian
  • Small constant for stability
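A sketch of the per-parameter update, assuming a diagonal-curvature estimate `h_diag` is available (e.g. from the BBprop slides below); `mu` is the small stabilizing constant:

```python
import numpy as np

def per_parameter_update(theta, grad, h_diag, eta=1.0, mu=1e-2):
    """Scale the step by 1 / (|diagonal curvature| + mu), parameter-wise."""
    return theta - eta * grad / (np.abs(h_diag) + mu)

theta = np.array([1.0, 1.0])
grad = np.array([1.0, 10.0])      # the steeper direction...
h_diag = np.array([1.0, 10.0])    # ...also has higher curvature
theta = per_parameter_update(theta, grad, h_diag)   # comparable steps in both
```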
Hessian Estimation
• Approximate the full Hessian
  • Finite-difference approximation of the k-th row
  • One forward/backward pass for each parameter (perturbed slightly)
  • Concatenate all the rows
  • Symmetrize the resulting matrix
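A sketch of this finite-difference construction, one perturbed gradient evaluation per parameter, followed by symmetrization (the toy gradient function is an assumption):

```python
import numpy as np

def grad(theta):
    """Stand-in for backprop on a toy quadratic loss."""
    H_true = np.array([[2.0, 1.0], [1.0, 3.0]])
    return H_true @ theta

def estimate_hessian(theta, eps=1e-5):
    n = theta.size
    g0 = grad(theta)
    H = np.zeros((n, n))
    for k in range(n):                        # one extra fwd/bwd pass per parameter
        e = np.zeros(n)
        e[k] = eps
        H[k] = (grad(theta + e) - g0) / eps   # finite-difference k-th row
    return 0.5 * (H + H.T)                    # symmetrize

print(estimate_hessian(np.array([0.5, -0.5])))
```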
BBprop (1)
• Cheaply approximate the Hessian in a modular architecture
• For each module (input x, output y, weights w):
  • Assume we have the curvature w.r.t. the output, ∂²E/∂y²
  • Find ∂²E/∂w² and ∂²E/∂x²
  • Apply the chain rule (second order)
• Positive-definite approximation
BBprop (2)
• Keep just the diagonal terms
• Take an exponential moving average of the estimates
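A hedged sketch of what the diagonal rule looks like for a single linear layer y = W x under the positive-definite approximation, plus the moving average; the lecture's exact module interface may differ:

```python
import numpy as np

def bbprop_linear(W, x, d2E_dy2):
    """Propagate diagonal curvature through y = W @ x (positive-definite approx.)."""
    d2E_dW2 = np.outer(d2E_dy2, x ** 2)   # curvature estimate for each weight
    d2E_dx2 = (W ** 2).T @ d2E_dy2        # curvature passed to the layer below
    return d2E_dW2, d2E_dx2

def ema(prev, new, gamma=0.05):
    """Exponential moving average of successive curvature estimates."""
    return (1.0 - gamma) * prev + gamma * new

W, x = np.random.randn(4, 3), np.random.randn(3)
dW2, dx2 = bbprop_linear(W, x, np.ones(4))
```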
Batch vs. Stochastic
• Batch methods
  • True loss, reliable gradients, large updates
  • But:
    • Expensive on large datasets
    • Slowed by redundant samples
• Stochastic methods
  • Many more updates, smaller steps
• Minibatch methods
  • Gradients are averaged over n samples
Batch vs. Stochastic
• Batch methods
• Stochastic methods (SGD)
  • Many more updates, smaller steps
  • More aggressive
  • Also works online (e.g. streaming data)
  • Cooling schedule on the learning rate (guaranteed to converge)
• Minibatch methods
Batch vs. Stochastic
• Batch methods
• Stochastic methods
• Minibatch methods
  • Stochastic updates, but more accurate gradients based on a small number of samples
  • In between SGD and batch GD
  • Not usually faster, but much easier to parallelize
  • Samples in a minibatch should be diverse
    • Don't forget to shuffle the dataset!
    • Stratified sampling
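A small sketch of minibatch construction with shuffling, plus a simple stratified variant that draws from every class; the helper names and batch sizes are mine:

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the dataset, then yield minibatches of size batch_size."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

def stratified_batch(X, y, per_class, rng):
    """Draw per_class samples from every class so the batch is diverse."""
    picks = [rng.choice(np.where(y == c)[0], per_class, replace=False)
             for c in np.unique(y)]
    b = np.concatenate(picks)
    return X[b], y[b]

rng = np.random.default_rng(0)
X, y = np.random.randn(100, 5), np.repeat([0, 1], 50)
for xb, yb in minibatches(X, y, 10, rng):
    pass                                   # average the gradient over xb, yb
xb, yb = stratified_batch(X, y, per_class=5, rng=rng)
```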
Variance-normalization
• The SGD learning rate depends on the Hessian, but also on the sample variance
• Intuition: parameters whose gradients vary wildly across samples should be updated with smaller learning rates than stable ones
• Variance-scaled rates: multiply each parameter's rate by (mean gradient)² / (mean squared gradient)
  • Both quantities are tracked as exponential moving averages
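A hedged sketch of this idea: keep per-parameter exponential moving averages of the gradient and of its square, and shrink the rate where the gradient is noisy. The class name and constants are illustrative choices, not the lecture's implementation.

```python
import numpy as np

class VarianceNormalizedSGD:
    """Per-parameter rates scaled by g_bar^2 / g2_bar (EMAs of grad and grad^2)."""

    def __init__(self, dim, eta=1.0, gamma=0.05, eps=1e-8):
        self.g_bar = np.zeros(dim)        # EMA of the gradient
        self.g2_bar = np.full(dim, eps)   # EMA of the squared gradient
        self.eta, self.gamma, self.eps = eta, gamma, eps

    def step(self, theta, grad):
        self.g_bar = (1 - self.gamma) * self.g_bar + self.gamma * grad
        self.g2_bar = (1 - self.gamma) * self.g2_bar + self.gamma * grad ** 2
        scale = self.g_bar ** 2 / (self.g2_bar + self.eps)   # in [0, 1]
        return theta - self.eta * scale * grad
```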
Variance-normalization
• This scheme is adaptive: no need for tuning
Optimization Types
• First-order methods, aka gradient descent
  • use gradients
  • incremental steps downhill on the energy surface
• Second-order methods
  • use second derivatives (curvature)
  • attempt large jumps (into the bottom of the valley)
• Zeroth-order methods, aka black-box
  • use only values of the loss function
  • somewhat random jumps
Second-order Optimization
• Newton's method
• Quasi-Newton (BFGS)
• Conjugate gradients
• Gauss-Newton (Levenberg-Marquardt)
• Many more:
  • Momentum
  • Nesterov gradient
  • Natural gradient descent
Newton's Method
• Locally quadratic loss: L(θ + Δθ) ≈ L(θ) + ∇L(θ)ᵀ Δθ + ½ Δθᵀ H Δθ
• Minimize w.r.t. the weight change: Δθ = −H⁻¹ ∇L(θ)
• Jumps to the center of the quadratic bowl
• Optimal (single step) if the quadratic approximation holds, no guarantees otherwise
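A sketch of a single Newton step on a toy quadratic of my own; in practice the Hessian solve is the expensive part and the loss is only locally quadratic:

```python
import numpy as np

def newton_step(theta, grad, hessian):
    """Jump to the minimum of the local quadratic model: delta = -H^{-1} g."""
    return theta - np.linalg.solve(hessian, grad)

H = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([3.0, -2.0])
g = H @ theta                       # gradient of 0.5 * theta^T H theta
theta = newton_step(theta, g, H)    # lands at the optimum [0, 0] in one step
```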
Quasi-Newton / BFGS
• Keep an estimate M of the inverse Hessian
• Premultiply the gradient by M
• M is always positive-definite
• Line search along the resulting direction
• Update M incrementally
  • e.g. the BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno)
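A quick usage sketch with SciPy's BFGS implementation on a toy loss; the lecture describes the algorithm itself, not this library:

```python
import numpy as np
from scipy.optimize import minimize

def loss(theta):
    return 0.5 * theta[0] ** 2 + 2.0 * theta[1] ** 2

def grad(theta):
    return np.array([theta[0], 4.0 * theta[1]])

# BFGS builds its inverse-Hessian estimate M from successive gradient differences
result = minimize(loss, x0=np.array([3.0, -1.0]), jac=grad, method="BFGS")
print(result.x)    # approximately [0, 0]
```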