Learning From Data, Lecture 12: Regularization
Constraining the Model · Weight Decay · Augmented Error
M. Magdon-Ismail, CSCI 4100/6100
recap: Overfitting

Overfitting is fitting the data more than is warranted.

[Figure: a noisy data set, its target function, and a wildly oscillating fit; legend: Data, Target, Fit.]
recap: Noise is the Part of y We Cannot Model

[Figure, two panels: stochastic noise, $y = f(x) + \text{stoch. noise}$; deterministic noise, $y = h^*(x) + \text{det. noise}$.]

Stochastic and deterministic noise both hurt learning.
Human: good at extracting the simple pattern, ignoring the noise and complications.
Computer: pays equal attention to all pixels; needs help simplifying (features, regularization).
Regularization

What is regularization? A cure for our tendency to fit (get distracted by) the noise, hence improving $E_{\text{out}}$.

How does it work? By constraining the model so that we cannot fit the noise ("putting on the brakes").

Side effects? The medication will have side effects: if we cannot fit the noise, maybe we cannot fit $f$ (the signal)?
Constraining the Model: Does it Help?

[Figure, two panels: the unconstrained fit to the data vs. the fit with the weights constrained to be smaller.]

... and the winner is:
Bias Goes Up a Little; the Variance Drop is Dramatic!

[Figure, two panels: the average fit $\bar g(x)$ against $\sin(x)$, without regularization and with regularization.]

no regularization: bias = 0.21, var = 1.69
regularization: bias = 0.23 (side effect), var = 0.33 (treatment)

(The constant model had bias = 0.5 and var = 0.25.)
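For context, here is a rough sketch of the kind of experiment that produces numbers like these: fit a line $h(x) = w_0 + w_1 x$ to $N = 2$ points of a sine target ($\sin(\pi x)$ assumed here), with and without weight decay, and average over many data sets. The exact settings (target scaling, $\lambda$, number of trials) used in the lecture may differ from the ones assumed below.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
x_test = np.linspace(-1, 1, 1000)

def bias_var(lam, trials=10000):
    """Estimate bias and var of the line fit h(x) = w0 + w1*x under weight decay lam."""
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(-1, 1, 2)                      # N = 2 data points, no noise
        Z = np.column_stack([np.ones(2), x])
        w = np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ f(x))
        preds[t] = w[0] + w[1] * x_test
    g_bar = preds.mean(axis=0)                         # the average hypothesis
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    return bias, var

print(bias_var(lam=0.0))   # no regularization: large variance
print(bias_var(lam=0.1))   # weight decay (lambda assumed): bias up a little, var way down
```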
Regularization in a Nutshell

VC analysis: $E_{\text{out}}(g) \le E_{\text{in}}(g) + \Omega(\mathcal{H})$. If you use a simpler $\mathcal{H}$ and get a good fit, then your $E_{\text{out}}$ is better.

Regularization takes this a step further: if you use a 'simpler' $h$ and get a good fit, then is your $E_{\text{out}}$ better?
Polynomials of Order Q: A Useful Testbed

$\mathcal{H}_Q$: polynomials of order $Q$. We are using linear regression in the transformed space, $h(x) = \mathbf{w}^{\mathrm T}\mathbf{z}(x)$.

Standard polynomial: $\mathbf{z} = (1, x, x^2, \ldots, x^Q)$, so $h(x) = w_0 + w_1 x + \cdots + w_Q x^Q$.

Legendre polynomial: $\mathbf{z} = (1, L_1(x), L_2(x), \ldots, L_Q(x))$, so $h(x) = w_0 + w_1 L_1(x) + \cdots + w_Q L_Q(x)$. The Legendre basis allows us to treat the weights 'independently'.

The first few Legendre polynomials: $L_1 = x$, $L_2 = \tfrac12(3x^2 - 1)$, $L_3 = \tfrac12(5x^3 - 3x)$, $L_4 = \tfrac18(35x^4 - 30x^2 + 3)$, $L_5 = \tfrac18(63x^5 - \cdots)$.
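As a quick illustration (not part of the lecture), here is a minimal sketch of building the Legendre feature vector $\mathbf{z}(x) = (1, L_1(x), \ldots, L_Q(x))$, assuming inputs lie in $[-1, 1]$; the helper name `legendre_features` is mine, and it relies on NumPy's `legvander`.

```python
import numpy as np

def legendre_features(x, Q):
    """Return the N x (Q+1) matrix Z whose columns are L_0(x)=1, L_1(x), ..., L_Q(x)."""
    return np.polynomial.legendre.legvander(np.asarray(x), Q)

Z = legendre_features(np.linspace(-1, 1, 5), Q=5)
print(Z.shape)   # (5, 6): one column per Legendre polynomial L_0 through L_5
```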
recap: Linear Regression

Transform the data: $(x_1, y_1), \ldots, (x_N, y_N) \longrightarrow (\mathbf{z}_1, y_1), \ldots, (\mathbf{z}_N, y_N)$, collected into $\mathrm{Z}$ and $\mathbf{y}$.

Minimize
$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{w}^{\mathrm T}\mathbf{z}_n - y_n)^2 = \frac{1}{N}(\mathrm{Z}\mathbf{w} - \mathbf{y})^{\mathrm T}(\mathrm{Z}\mathbf{w} - \mathbf{y}).$$

The linear regression fit: $\mathbf{w}_{\text{lin}} = (\mathrm{Z}^{\mathrm T}\mathrm{Z})^{-1}\mathrm{Z}^{\mathrm T}\mathbf{y}$.
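A minimal sketch of the unregularized fit $\mathbf{w}_{\text{lin}} = (\mathrm{Z}^{\mathrm T}\mathrm{Z})^{-1}\mathrm{Z}^{\mathrm T}\mathbf{y}$; the synthetic data (a noisy sine target) and the order $Q = 10$ are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)   # illustrative noisy target

Z = np.polynomial.legendre.legvander(x, 10)              # 10th-order Legendre features
w_lin = np.linalg.solve(Z.T @ Z, Z.T @ y)                # least-squares weights
E_in = np.mean((Z @ w_lin - y) ** 2)
print(E_in)
```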
Constraining the Model: $\mathcal{H}_{10}$ vs. $\mathcal{H}_2$

$\mathcal{H}_{10} = \{\, h(x) = w_0 + w_1\Phi_1(x) + w_2\Phi_2(x) + w_3\Phi_3(x) + \cdots + w_{10}\Phi_{10}(x) \,\}$

$\mathcal{H}_2 = \{\, h(x) = w_0 + w_1\Phi_1(x) + w_2\Phi_2(x) + w_3\Phi_3(x) + \cdots + w_{10}\Phi_{10}(x) \text{ such that } w_3 = w_4 = \cdots = w_{10} = 0 \,\}$

This is a 'hard' order constraint that sets some weights exactly to zero: $\mathcal{H}_2 \subset \mathcal{H}_{10}$.
Soft Order Constraint

Don't set weights explicitly to zero (e.g. $w_3 = 0$). Give a budget $C$ for the weights and let the learning choose:
$$\sum_{q=0}^{Q} w_q^2 \le C.$$
The soft order constraint allows 'intermediate' models between $\mathcal{H}_2$ and $\mathcal{H}_{10}$; as $C \to \infty$ we recover $\mathcal{H}_{10}$.
Soft Order Constrained Model $\mathcal{H}_C$

$\mathcal{H}_{10} = \{\, h(x) = w_0 + w_1\Phi_1(x) + w_2\Phi_2(x) + \cdots + w_{10}\Phi_{10}(x) \,\}$

$\mathcal{H}_2 = \{\, h(x) = w_0 + w_1\Phi_1(x) + w_2\Phi_2(x) + \cdots + w_{10}\Phi_{10}(x) \text{ such that } w_3 = w_4 = \cdots = w_{10} = 0 \,\}$

$\mathcal{H}_C = \{\, h(x) = w_0 + w_1\Phi_1(x) + w_2\Phi_2(x) + \cdots + w_{10}\Phi_{10}(x) \text{ such that } \sum_{q=0}^{10} w_q^2 \le C \,\}$, a 'soft' budget constraint on the sum of squared weights.

VC perspective: $\mathcal{H}_C$ is smaller than $\mathcal{H}_{10} \Rightarrow$ better generalization.
Fitting the Data

The optimal (regularized) weights $\mathbf{w}_{\text{reg}} \in \mathcal{H}_C$ should minimize the in-sample error while staying within the budget. $\mathbf{w}_{\text{reg}}$ is a solution to
$$\min_{\mathbf{w}}\ E_{\text{in}}(\mathbf{w}) = \tfrac{1}{N}(\mathrm{Z}\mathbf{w} - \mathbf{y})^{\mathrm T}(\mathrm{Z}\mathbf{w} - \mathbf{y}) \quad \text{subject to } \mathbf{w}^{\mathrm T}\mathbf{w} \le C.$$
Solving for $\mathbf{w}_{\text{reg}}$

$$\min_{\mathbf{w}}\ E_{\text{in}}(\mathbf{w}) = \tfrac{1}{N}(\mathrm{Z}\mathbf{w} - \mathbf{y})^{\mathrm T}(\mathrm{Z}\mathbf{w} - \mathbf{y}) \quad \text{subject to } \mathbf{w}^{\mathrm T}\mathbf{w} \le C.$$

[Figure: contours of constant $E_{\text{in}}$ around $\mathbf{w}_{\text{lin}}$, the constraint surface $\mathbf{w}^{\mathrm T}\mathbf{w} = C$, its normal, and $\nabla E_{\text{in}}$ at the optimum.]

Observations:
1. The optimal $\mathbf{w}$ tries to get as 'close' to $\mathbf{w}_{\text{lin}}$ as possible, so it uses the full budget and lies on the surface $\mathbf{w}^{\mathrm T}\mathbf{w} = C$.
2. At the optimal $\mathbf{w}$, the surface $\mathbf{w}^{\mathrm T}\mathbf{w} = C$ should be perpendicular to $\nabla E_{\text{in}}$; otherwise we could move along the surface and decrease $E_{\text{in}}$.
3. The normal to the surface $\mathbf{w}^{\mathrm T}\mathbf{w} = C$ is the vector $\mathbf{w}$.
4. The surface is perpendicular to $\nabla E_{\text{in}}$ and perpendicular to its normal, so $\nabla E_{\text{in}}$ is parallel to the normal, but in the opposite direction:
$$\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) = -2\lambda_C\,\mathbf{w}_{\text{reg}},$$
where $\lambda_C$, the Lagrange multiplier, is positive; the 2 is for mathematical convenience.

A numerical illustration of this picture is sketched below.
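A small numerical check of the geometric argument (my own sketch, not from the lecture): solve the constrained problem directly with SciPy's SLSQP solver and verify that the optimum uses the full budget and that $\nabla E_{\text{in}}$ points opposite to $\mathbf{w}_{\text{reg}}$. The synthetic data and the budget $C$ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 4))                      # illustrative design matrix
y = Z @ np.array([3.0, -2.0, 1.0, 0.5]) + 0.1 * rng.standard_normal(50)
N, C = len(y), 1.0                                    # budget smaller than ||w_lin||^2

E_in = lambda w: (Z @ w - y) @ (Z @ w - y) / N
grad = lambda w: 2.0 / N * Z.T @ (Z @ w - y)

res = minimize(E_in, np.zeros(4), jac=grad, method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda w: C - w @ w}])
w_reg = res.x

g = grad(w_reg)
print(w_reg @ w_reg)                                               # ~ C: full budget used
print(g @ w_reg / (np.linalg.norm(g) * np.linalg.norm(w_reg)))     # ~ -1: anti-parallel
```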
Solving for $\mathbf{w}_{\text{reg}}$

$E_{\text{in}}(\mathbf{w})$ is minimized, subject to $\mathbf{w}^{\mathrm T}\mathbf{w} \le C$
$\Leftrightarrow\ \nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) + 2\lambda_C \mathbf{w}_{\text{reg}} = 0$
$\Leftrightarrow\ \nabla\left(E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^{\mathrm T}\mathbf{w}\right)\big|_{\mathbf{w} = \mathbf{w}_{\text{reg}}} = 0$
$\Leftrightarrow\ E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^{\mathrm T}\mathbf{w}$ is minimized, unconditionally.

There is a correspondence: as $C \uparrow$, $\lambda_C \downarrow$.
The Augmented Error

Pick a $C$ and minimize $E_{\text{in}}(\mathbf{w})$ subject to $\mathbf{w}^{\mathrm T}\mathbf{w} \le C$
$\Updownarrow$
Pick a $\lambda_C$ and minimize $E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \lambda_C \mathbf{w}^{\mathrm T}\mathbf{w}$ unconditionally.

The term $\lambda_C \mathbf{w}^{\mathrm T}\mathbf{w}$ is a penalty for the 'complexity' of $h$, measured by the size of the weights.

Translation: we can pick any budget $C$ $\leftrightarrow$ we are free to pick any multiplier $\lambda_C$.
What's the right $C$? $\leftrightarrow$ What's the right $\lambda_C$?
Linear Regression with the Soft Order Constraint

$$E_{\text{aug}}(\mathbf{w}) = \tfrac{1}{N}(\mathrm{Z}\mathbf{w} - \mathbf{y})^{\mathrm T}(\mathrm{Z}\mathbf{w} - \mathbf{y}) + \lambda_C\, \mathbf{w}^{\mathrm T}\mathbf{w}.$$
It is convenient to set $\lambda_C = \frac{\lambda}{N}$:
$$E_{\text{aug}}(\mathbf{w}) = \tfrac{1}{N}\left[(\mathrm{Z}\mathbf{w} - \mathbf{y})^{\mathrm T}(\mathrm{Z}\mathbf{w} - \mathbf{y}) + \lambda\, \mathbf{w}^{\mathrm T}\mathbf{w}\right],$$
called 'weight decay' because the penalty encourages smaller weights. Unconditionally minimize $E_{\text{aug}}(\mathbf{w})$.
The Solution for $\mathbf{w}_{\text{reg}}$

$$N\,\nabla E_{\text{aug}}(\mathbf{w}) = 2\mathrm{Z}^{\mathrm T}(\mathrm{Z}\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w} = 2(\mathrm{Z}^{\mathrm T}\mathrm{Z} + \lambda\mathrm{I})\mathbf{w} - 2\mathrm{Z}^{\mathrm T}\mathbf{y}.$$
Setting $\nabla E_{\text{aug}}(\mathbf{w}) = 0$ gives
$$\mathbf{w}_{\text{reg}} = (\mathrm{Z}^{\mathrm T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{\mathrm T}\mathbf{y},$$
where $\lambda$ determines the amount of regularization. Recall the unconstrained solution ($\lambda = 0$): $\mathbf{w}_{\text{lin}} = (\mathrm{Z}^{\mathrm T}\mathrm{Z})^{-1}\mathrm{Z}^{\mathrm T}\mathbf{y}$.
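A minimal sketch of this closed-form weight-decay fit; `weight_decay_fit` is an illustrative helper name and the data below is made up. With `lam = 0` it reduces to $\mathbf{w}_{\text{lin}}$.

```python
import numpy as np

def weight_decay_fit(Z, y, lam):
    """Minimize (Zw - y)^T(Zw - y) + lam * w^T w, i.e. w_reg = (Z^T Z + lam*I)^{-1} Z^T y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 15)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(15)    # illustrative noisy target
Z = np.polynomial.legendre.legvander(x, 10)

w_lin = weight_decay_fit(Z, y, lam=0.0)                   # unregularized fit
w_reg = weight_decay_fit(Z, y, lam=0.1)                   # weight decay shrinks the weights
print(np.sum(w_lin ** 2), np.sum(w_reg ** 2))
```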
A Little Regularization ... Goes a Long Way

Minimizing $E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^{\mathrm T}\mathbf{w}$ with different $\lambda$'s:

[Figure, two panels: $\lambda = 0$ (overfitting) and $\lambda = 0.0001$ (wow!); legend: Data, Target, Fit.]
Don't Overdose

Minimizing $E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^{\mathrm T}\mathbf{w}$ with different $\lambda$'s:

[Figure, four panels: $\lambda = 0$ (overfitting), $\lambda = 0.0001$, $\lambda = 0.01$, $\lambda = 1$ (underfitting); legend: Data, Target, Fit.]
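A rough sketch of the sweep behind these plots: fit the same 10th-order model with several $\lambda$'s and compare $E_{\text{in}}$ to an out-of-sample estimate on a large test set. The data generation, noise level, and $\lambda$ grid are my own illustrative choices, not the lecture's.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n, sigma=0.3):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + sigma * rng.standard_normal(n)

x_tr, y_tr = make_data(15)
x_te, y_te = make_data(5000)
Z_tr = np.polynomial.legendre.legvander(x_tr, 10)
Z_te = np.polynomial.legendre.legvander(x_te, 10)

for lam in [0.0, 1e-4, 1e-2, 1.0, 10.0]:
    w = np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(11), Z_tr.T @ y_tr)
    E_in = np.mean((Z_tr @ w - y_tr) ** 2)
    E_out = np.mean((Z_te @ w - y_te) ** 2)
    print(f"lambda={lam:g}  E_in={E_in:.3f}  E_out={E_out:.3f}")
# typically a tiny lambda already improves E_out a lot, while a large lambda underfits
```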
Overfitting and Underfitting

[Figure: expected $E_{\text{out}}$ versus the regularization parameter $\lambda$; $E_{\text{out}}$ first decreases (overfitting region) and then increases (underfitting region) as $\lambda$ grows.]
More Noise Needs More Medicine

[Figure: expected $E_{\text{out}}$ versus the regularization parameter $\lambda$ for noise levels $\sigma^2 = 0$, $0.25$, and $0.5$; the noisier the data, the larger the optimal $\lambda$.]