Learning From Data – Lecture 22
Neural Networks and Overfitting

Approximation vs. Generalization
Regularization and Early Stopping
Minimizing E_in More Efficiently

M. Magdon-Ismail, CSCI 4100/6100
Recap: Neural Networks and Fitting the Data

Forward propagation (compute h and E_in):

    x = x^{(0)} \xrightarrow{W^{(1)}} s^{(1)} \xrightarrow{\theta} x^{(1)} \xrightarrow{W^{(2)}} s^{(2)} \;\cdots\; \xrightarrow{W^{(L)}} s^{(L)} \xrightarrow{\theta} x^{(L)} = h(x)

    s^{(\ell)} = (W^{(\ell)})^{\mathsf{T}} x^{(\ell-1)}, \qquad x^{(\ell)} = \theta(s^{(\ell)})

Choose W = \{W^{(1)}, W^{(2)}, \ldots, W^{(L)}\} to minimize E_in.

Gradient descent:

    W(t+1) \leftarrow W(t) - \eta \nabla E_{in}(W(t))

Computing the gradient requires the layer sensitivities \delta^{(\ell)} = \partial e / \partial s^{(\ell)}:

    \frac{\partial e}{\partial W^{(\ell)}} = x^{(\ell-1)} (\delta^{(\ell)})^{\mathsf{T}}

Backpropagation computes the sensitivities backwards, \delta^{(L)} \to \delta^{(L-1)} \to \cdots \to \delta^{(1)}:

    \delta^{(\ell)} = \theta'(s^{(\ell)}) \otimes \big[ W^{(\ell+1)} \delta^{(\ell+1)} \big]_{1}^{d^{(\ell)}}

(Figures: log10(error) vs. log10(iteration) for gradient descent and SGD; the resulting decision boundary on the digits data, average intensity vs. symmetry.)
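As a companion to the recap, here is a minimal Python sketch of one forward and one backward pass. It assumes tanh activations at every layer (including the output), squared pointwise error e = (x^(L) - y)^2, weight matrices that include a bias row, and made-up dimensions; it illustrates the equations above and is not the lecture's own code.

```python
import numpy as np

def forward(x, Ws):
    """Forward propagation with tanh units: returns the lists of s^(l) and x^(l)."""
    xs = [np.vstack(([1.0], x))]              # x^(0): input with a bias coordinate
    ss = []
    for l, W in enumerate(Ws, start=1):
        s = W.T @ xs[-1]                      # s^(l) = (W^(l))^T x^(l-1)
        ss.append(s)
        x_l = np.tanh(s)                      # x^(l) = theta(s^(l))
        if l < len(Ws):
            x_l = np.vstack(([1.0], x_l))     # hidden layers get a bias coordinate
        xs.append(x_l)
    return ss, xs

def backward(y, ss, xs, Ws):
    """Backpropagation for e = (h(x) - y)^2: returns de/dW^(l) for every layer."""
    L = len(Ws)
    # delta^(L) = de/ds^(L) = 2 (x^(L) - y) theta'(s^(L)), with theta' = 1 - tanh^2
    delta = 2.0 * (xs[-1] - y) * (1.0 - np.tanh(ss[-1]) ** 2)
    grads = [None] * L
    grads[L - 1] = xs[L - 1] @ delta.T        # de/dW^(L) = x^(L-1) (delta^(L))^T
    for l in range(L - 1, 0, -1):
        # delta^(l) = theta'(s^(l)) (x) [W^(l+1) delta^(l+1)] with the bias row dropped
        delta = (1.0 - np.tanh(ss[l - 1]) ** 2) * (Ws[l] @ delta)[1:]
        grads[l - 1] = xs[l - 1] @ delta.T    # de/dW^(l) = x^(l-1) (delta^(l))^T
    return grads

# toy dimensions: 2 inputs -> 3 hidden units -> 1 output (weights are arbitrary)
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
ss, xs = forward(np.array([[0.5], [-1.0]]), Ws)
grads = backward(1.0, ss, xs, Ws)
print([g.shape for g in grads])               # gradient shapes match W^(1), W^(2)
```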
2-Layer Neural Network

    h(x) = \theta\Big( w_0 + \sum_{j=1}^{m} w_j \, \theta(v_j^{\mathsf{T}} x) \Big)

(Figure: network diagram with the input x feeding m hidden units through weights v_1, ..., v_m, combined by output weights w_0, w_1, ..., w_m.)
The Neural Network has a Tunable Transform

Nonlinear transform:   h(x) = \theta\Big( w_0 + \sum_{j=1}^{\tilde d} w_j \, \Phi_j(x) \Big)

Neural network:        h(x) = \theta\Big( w_0 + \sum_{j=1}^{m} w_j \, \theta(v_j^{\mathsf{T}} x) \Big)

k-RBF-network:         h(x) = \theta\Big( w_0 + \sum_{j=1}^{k} w_j \, \phi(\|x - \mu_j\|) \Big)

Approximation: with m tunable hidden units, E_in = O(1/m).
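A hedged sketch of the three forms side by side, assuming \theta = tanh, a Gaussian bump for \phi, and arbitrary illustrative weights, features, and centers: all three models are \theta(w_0 + \sum_j w_j \cdot feature_j(x)); they differ only in whether the features are fixed (\Phi_j), learned directions (v_j), or learned centers (\mu_j).

```python
import numpy as np

theta = np.tanh                                     # output nonlinearity (assumed tanh here)

def nonlinear_transform(x, w, Phi):
    """Fixed features Phi_j(x)."""
    return theta(w[0] + w[1:] @ np.array([phi(x) for phi in Phi]))

def neural_network(x, w, V):
    """Tunable features theta(v_j^T x): the rows of V are learned."""
    return theta(w[0] + w[1:] @ theta(V @ x))

def rbf_network(x, w, Mu, phi=lambda r: np.exp(-r ** 2)):
    """Tunable features phi(||x - mu_j||): the centers mu_j are learned."""
    return theta(w[0] + w[1:] @ phi(np.linalg.norm(x - Mu, axis=1)))

x = np.array([0.5, -1.0])
w = np.array([0.1, 0.8, -0.4, 0.6])                 # w_0, ..., w_3 (shared for illustration)
Phi = [lambda z: z[0], lambda z: z[1], lambda z: z[0] * z[1]]
V = np.array([[1.0, -1.0], [0.5, 2.0], [-1.5, 0.3]])
Mu = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])
print(nonlinear_transform(x, w, Phi), neural_network(x, w, V), rbf_network(x, w, Mu))
```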
Generalization

MLP (hard-threshold units):   d_vc = O(m d \log(m d))
tanh units:                   d_vc = O(m d (m + d))

Choosing m = \sqrt{N} gives convergence to the optimal performance for the MLP, just like k-NN.
The model is semi-parametric: you still have to learn the parameters.
Regularization – Weight Decay

    E_{aug}(w) = \frac{1}{N} \sum_{n=1}^{N} \big( h(x_n; w) - y_n \big)^2 + \frac{\lambda}{N} \sum_{\ell,i,j} \big( w_{ij}^{(\ell)} \big)^2

    \frac{\partial E_{aug}(w)}{\partial W^{(\ell)}} = \frac{\partial E_{in}(w)}{\partial W^{(\ell)}} + \frac{2\lambda}{N} W^{(\ell)}

The first term is computed by backpropagation; the decay term is added to each layer's gradient.
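In code, weight decay changes only two things relative to plain backpropagation: the error gains the penalty term, and each layer's gradient gains (2\lambda/N) W^{(\ell)}. A minimal sketch, assuming the per-layer gradients of E_in come from backpropagation as in the recap:

```python
import numpy as np

def augmented_error(E_in, Ws, lam, N):
    """E_aug(w) = E_in(w) + (lam / N) * sum over all weights of (w_ij^(l))^2."""
    return E_in + (lam / N) * sum(np.sum(W ** 2) for W in Ws)

def augmented_gradients(grads_E_in, Ws, lam, N):
    """dE_aug/dW^(l) = dE_in/dW^(l) + (2 lam / N) W^(l), layer by layer."""
    return [g + (2.0 * lam / N) * W for g, W in zip(grads_E_in, Ws)]
```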
Weight Decay with Digits Data

(Figures: decision boundaries on the digits data, average intensity vs. symmetry; left: no weight decay, right: weight decay with λ = 0.01.)
Early Stopping

Gradient descent with step size η takes steps of length η:

    w_1 = w_0 - \eta \, \frac{g_0}{\|g_0\|}

So after one step the weights lie within η of w_0, after two steps within η of w_1, and so on:

    H_1 = \{ w : \|w - w_0\| \le \eta \}
    H_2 = H_1 \cup \{ w : \|w - w_1\| \le \eta \}
    H_3 = H_2 \cup \{ w : \|w - w_2\| \le \eta \}

Each iteration explores a larger hypothesis set:  H_1 \subset H_2 \subset H_3 \subset H_4 \subset \cdots

(Figures: E_in(w_t) keeps decreasing with t while the penalty Ω(d_vc(H_t)) grows, so E_out(w_t) is minimized at an intermediate t*; in weight space, the steps w(0), w_1, w_2, w_3, ..., w(t*) and the growing regions H_1, H_2, H_3 against contours of constant E_in.)
Early Stopping on Digits Data

(Figures: E_in and E_val vs. iteration t on a log scale; E_in keeps decreasing while E_val bottoms out at t*. The resulting decision boundary on the digits data, average intensity vs. symmetry.)

Use a validation set to determine t*.
Output w* = w(t*); do not retrain on all the data up to t*.
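A minimal sketch of the procedure, assuming a generic train_step (one gradient-descent update) and E_val (validation error of a weight vector) supplied by the caller; both names are placeholders for illustration, not functions from the lecture.

```python
def early_stopping(train_step, E_val, w0, max_iters=10**5):
    """Track the validation error along the gradient-descent path and return w(t*).

    train_step(w) -> updated w;  E_val(w) -> validation error of w.
    """
    w, w_star, t_star, best = w0, w0, 0, E_val(w0)
    for t in range(1, max_iters + 1):
        w = train_step(w)
        e = E_val(w)
        if e < best:                       # new minimum of E_val: remember these weights
            best, w_star, t_star = e, w, t
    return w_star, t_star                  # output w* = w(t*); do not retrain on all the data
```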
Minimizing E_in

1. Use regression for classification.
2. Use better algorithms than gradient descent.

(Figure: log10(error) vs. optimization time in seconds; conjugate gradients reaches far lower error than gradient descent in the same time.)
Beefing Up Gradient Descent

Determine the gradient g; the remaining question is how big a step η to take along -g.

Shallow E_in(w): use a large η.        Deep E_in(w): use a small η.

(Figures: two sketches of the in-sample error E_in(w) versus the weights w, one shallow and one deep.)
Variable Learning Rate Gradient Descent

 1: Initialize w(0) and η_0 at t = 0. Set α > 1 and β < 1.
 2: while stopping criterion has not been met do
 3:     Let g(t) = ∇E_in(w(t)), and set v(t) = -g(t).
 4:     if E_in(w(t) + η_t v(t)) < E_in(w(t)) then
 5:         accept: w(t+1) = w(t) + η_t v(t);  increment η: η_{t+1} = α η_t.     [α ∈ [1.05, 1.1]]
 6:     else
 7:         reject: w(t+1) = w(t);  decrease η: η_{t+1} = β η_t.                 [β ∈ [0.7, 0.8]]
 8:     end if
 9:     Iterate to the next step, t ← t + 1.
10: end while
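A direct Python transcription of the algorithm above, as a sketch: Ein and grad_Ein are caller-supplied functions, and the stopping criterion is simplified to a fixed iteration budget.

```python
import numpy as np

def variable_lr_gd(Ein, grad_Ein, w0, eta0=0.1, alpha=1.05, beta=0.7, max_iters=1000):
    """Variable learning rate gradient descent: grow eta on accepted steps, shrink on rejects."""
    w, eta = np.asarray(w0, dtype=float), eta0
    for t in range(max_iters):
        v = -grad_Ein(w)                       # v(t) = -g(t)
        w_new = w + eta * v
        if Ein(w_new) < Ein(w):                # accept the step, increment eta
            w, eta = w_new, alpha * eta
        else:                                  # reject the step, decrease eta
            eta = beta * eta
    return w

# toy usage on Ein(w) = ||w||^2 (a made-up objective, just to exercise the loop)
print(variable_lr_gd(lambda w: w @ w, lambda w: 2 * w, [1.0, -2.0]))
```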
Steepest Descent - Line Search

1: Initialize w(0) and set t = 0.
2: while stopping criterion has not been met do
3:     Let g(t) = ∇E_in(w(t)), and set v(t) = -g(t).
4:     Let η* = argmin_η E_in(w(t) + η v(t)).
5:     w(t+1) = w(t) + η* v(t).
6:     Iterate to the next step, t ← t + 1.
7: end while

How to accomplish the line search (step 4)? Simple bisection (binary search) suffices in practice: keep three points η_1 < η_2 < η_3 with E(η_2) below E(η_1) and E(η_3), and repeatedly bisect the bracket.

(Figures: a steepest-descent step from w(t) to w(t+1) along v(t) against contours of constant E_in; a bracket η_1, η_2, η_3 around the minimizer η̄ of E(η) along the search direction.)
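A sketch of the bisection line search for step 4, assuming E(η) is unimodal on the initial bracket [0, eta_max]; the bracket (a, b, c) keeps E(b) no larger than E(a) and E(c), and the longer half is bisected each round. The toy quadratic at the end is purely illustrative.

```python
import numpy as np

def line_search(E, eta_max=1.0, tol=1e-6):
    """Shrink a bracket (a, b, c) around a minimizer of E(eta) by repeated bisection."""
    a, c = 0.0, eta_max
    b = 0.5 * (a + c)
    while c - a > tol:
        if b - a > c - b:                      # bisect the longer half of the bracket
            m = 0.5 * (a + b)
            if E(m) < E(b):
                b, c = m, b                    # new bracket (a, m, b)
            else:
                a = m                          # new bracket (m, b, c)
        else:
            m = 0.5 * (b + c)
            if E(m) < E(b):
                a, b = b, m                    # new bracket (b, m, c)
            else:
                c = m                          # new bracket (a, b, m)
    return b

# one steepest-descent step on a toy quadratic: minimize E_in(w - eta * g) over eta
w, g = np.array([1.0, -2.0]), np.array([2.0, -4.0])
eta_star = line_search(lambda eta: np.sum((w - eta * g) ** 2))
print(eta_star, w - eta_star * g)              # eta* ~ 0.5, landing near the minimum
```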
Comparison of Optimization Heuristics

(Figure: log10(error) vs. optimization time for gradient descent, variable η, and steepest descent.)

Error reached after the given optimization time:

    Method                         10 sec      1,000 sec    50,000 sec
    Gradient Descent               0.122       0.0214       0.0113
    Stochastic Gradient Descent    0.0203      0.000447     1.6310 × 10^-5
    Variable Learning Rate         0.0432      0.0180       0.000197
    Steepest Descent               0.0497      0.0194       0.000140
Conjugate Gradients

1. Line search, just like steepest descent.
2. Choose a better direction than -g.

(Figures: consecutive conjugate-gradient steps w(t) → w(t+1) against contours of constant E_in; log10(error) vs. optimization time for steepest descent and conjugate gradients.)

Error reached after the given optimization time:

    Method                         10 sec      1,000 sec       50,000 sec
    Stochastic Gradient Descent    0.0203      0.000447        1.6310 × 10^-5
    Steepest Descent               0.0497      0.0194          0.000140
    Conjugate Gradients            0.0200      1.13 × 10^-6    2.73 × 10^-9

There are better algorithms (e.g. Levenberg-Marquardt), but we will stop here.
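To make step 2 concrete, here is a hedged sketch of nonlinear conjugate gradients using the Polak-Ribière direction update. The lecture does not pin down which update rule its experiments used, so this is one common choice; line_search stands for a routine like the bisection sketch above.

```python
import numpy as np

def conjugate_gradients(Ein, grad_Ein, w0, line_search, max_iters=100):
    """Minimize Ein with line searches along conjugate directions (Polak-Ribiere+)."""
    w = np.asarray(w0, dtype=float)
    g = grad_Ein(w)
    v = -g                                                # first direction: steepest descent
    for t in range(max_iters):
        eta = line_search(lambda e: Ein(w + e * v))       # same line search as steepest descent
        w = w + eta * v
        g_new = grad_Ein(w)
        if g_new @ g_new < 1e-20:                         # gradient is (numerically) zero
            break
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))    # Polak-Ribiere+ coefficient
        v = -g_new + beta * v                             # a better direction than -g_new alone
        g = g_new
    return w
```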