

  1. CS/ECE/ISyE 524 Introduction to Optimization, Spring 2017–18
  10. Regularization
  • More on tradeoffs
  • Regularization
  • Effect of using different norms
  • Example: hovercraft revisited
  Laurent Lessard (www.laurentlessard.com)

  2. Review of tradeoffs
  Recap of tradeoffs:
  • We want to make both J_1(x) and J_2(x) small subject to constraints.
  • Choose a parameter λ > 0 and solve
        minimize_x   J_1(x) + λ·J_2(x)
        subject to:  constraints
  • Each λ > 0 yields a solution x̂_λ.
  • We can visualize the tradeoff by plotting J_2(x̂_λ) vs J_1(x̂_λ). This is called the Pareto curve.

  3. Multi-objective tradeoff
  • Similar procedure if we have more than two costs we'd like to make small, e.g. J_1, J_2, J_3.
  • Choose parameters λ > 0 and µ > 0. Then solve:
        minimize_x   J_1(x) + λ·J_2(x) + µ·J_3(x)
        subject to:  constraints
  • Each pair λ > 0, µ > 0 yields a solution x̂_{λ,µ}.
  • We can visualize the tradeoff by plotting J_3(x̂_{λ,µ}) vs J_2(x̂_{λ,µ}) vs J_1(x̂_{λ,µ}) on a 3D plot. You then obtain a Pareto surface.

  4. Minimum-norm as a regularization
  • When Ax = b is underdetermined (A is wide), we can resolve the ambiguity by adding a cost function, e.g. min-norm least squares:
        minimize_x   ‖x‖²
        subject to:  Ax = b
  • Alternative approach: express it as a tradeoff!
        minimize_x   ‖Ax − b‖² + λ‖x‖²
    Tradeoffs of this type are called regularization, and λ is called the regularization parameter or regularization weight.
  • If we let λ → ∞, we just obtain x̂ = 0.
  • If we let λ → 0, we obtain the minimum-norm solution!

  5. Proof of minimum-norm equivalence
        minimize_x   ‖Ax − b‖² + λ‖x‖²
  This is equivalent to the least squares problem:
        minimize_x   ‖ [A; √λ·I]·x − [b; 0] ‖²
  The solution is found via the pseudoinverse (for a tall matrix):
        x̂ = ( [A; √λ·I]ᵀ [A; √λ·I] )⁻¹ [A; √λ·I]ᵀ [b; 0]
           = (AᵀA + λI)⁻¹ Aᵀ b
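  As a quick sanity check of this equivalence, here is a small Julia sketch (random made-up data, not taken from the course notebooks) that compares the stacked least-squares solution against the closed form (AᵀA + λI)⁻¹Aᵀb:

```julia
using LinearAlgebra

# Hypothetical data: a wide A (underdetermined system), just to check the algebra.
m, n, λ = 3, 6, 0.1
A, b = randn(m, n), randn(m)

# Closed form of the regularized problem: (AᵀA + λI)⁻¹ Aᵀ b
x_closed = (A'A + λ*I) \ (A'b)

# Same problem written as an ordinary (tall) least-squares problem with stacked data.
A_stack = [A; sqrt(λ) * Matrix(I, n, n)]   # stack A on top of √λ·I
b_stack = [b; zeros(n)]
x_stacked = A_stack \ b_stack              # backslash solves least squares for a tall matrix

@show norm(x_closed - x_stacked)           # should be ≈ 0 (up to rounding)
```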

  6. Proof of minimum-norm equivalence
  The solution of the 2-norm regularization is:
        x̂ = (AᵀA + λI)⁻¹ Aᵀ b
  • We can't simply set λ → 0, because A is wide and therefore AᵀA will not be invertible.
  • Instead, use the fact that AᵀAAᵀ + λAᵀ can be factored two ways:
        (AᵀA + λI) Aᵀ = AᵀAAᵀ + λAᵀ = Aᵀ (AAᵀ + λI)
  Multiplying (AᵀA + λI)Aᵀ = Aᵀ(AAᵀ + λI) on the left by (AᵀA + λI)⁻¹ and on the right by (AAᵀ + λI)⁻¹ gives:
        Aᵀ (AAᵀ + λI)⁻¹ = (AᵀA + λI)⁻¹ Aᵀ

  7. Proof of minimum-norm equivalence
  The solution of the 2-norm regularization is:
        x̂ = (AᵀA + λI)⁻¹ Aᵀ b
  which is also equal to:
        x̂ = Aᵀ (AAᵀ + λI)⁻¹ b
  • Since AAᵀ is invertible, we can take the limit λ → 0 by just setting λ = 0.
  • In the limit: x̂ = Aᵀ(AAᵀ)⁻¹ b. This is the exact solution to the minimum-norm least squares problem we found before!
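  The limit can also be checked numerically. The sketch below (random, made-up data, for illustration only) shrinks λ and compares the regularized solution, written in the Aᵀ(AAᵀ + λI)⁻¹b form that survives the limit, against the minimum-norm solution Aᵀ(AAᵀ)⁻¹b:

```julia
using LinearAlgebra

# Hypothetical wide system: more unknowns than equations.
m, n = 3, 6
A, b = randn(m, n), randn(m)

x_minnorm = A' * ((A*A') \ b)        # minimum-norm solution Aᵀ(AAᵀ)⁻¹b (same as pinv(A)*b)

for λ in (1.0, 1e-2, 1e-4, 1e-6)
    x_reg = A' * ((A*A' + λ*I) \ b)  # regularized solution in the form Aᵀ(AAᵀ + λI)⁻¹b
    println("λ = ", λ, ":  ‖x_reg − x_minnorm‖ = ", norm(x_reg - x_minnorm))
end
# The difference shrinks toward zero as λ → 0.
```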

  8. Tradeoff visualization
        minimize_x   ‖Ax − b‖² + λ‖x‖²
  [Figure: the Pareto curve, plotting ‖x‖² (vertical axis) against ‖Ax − b‖² (horizontal axis). As λ → 0 the curve approaches the point (0, ‖A†b‖²); as λ → ∞ it approaches (‖b‖², 0).]
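  A Pareto curve like the one sketched above can be traced by sweeping λ and evaluating both objectives at the closed-form solution. A minimal Julia sketch with hypothetical random data (plotting is left out; any plotting package could display the recorded pairs):

```julia
using LinearAlgebra

# Hypothetical wide problem; sweep λ and evaluate both objectives at the closed-form solution.
m, n = 3, 6
A, b = randn(m, n), randn(m)

for λ in 10.0 .^ (-4:4)
    x = A' * ((A*A' + λ*I) \ b)      # regularized solution for this λ
    println("λ = ", λ, ":   ‖Ax−b‖² = ", norm(A*x - b)^2, "   ‖x‖² = ", norm(x)^2)
end
# Plotting ‖x‖² against ‖Ax−b‖² over the sweep traces out the Pareto curve;
# the endpoints approach (0, ‖A†b‖²) as λ → 0 and (‖b‖², 0) as λ → ∞.
```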

  9. Regularization
  Regularization: an additional penalty term added to the cost function to encourage a solution with desirable properties.
  Regularized least squares:
        minimize_x   ‖Ax − b‖² + λ·R(x)
  • R(x) is the regularizer (penalty function)
  • λ is the regularization parameter
  • The model has different names depending on R(x).

  10. Regularization
        minimize_x   ‖Ax − b‖² + λ·R(x)
  1. If R(x) = ‖x‖² = x_1² + x_2² + ⋯ + x_n², it is called L2 regularization, Tikhonov regularization, or ridge regression, depending on the application. It has the effect of smoothing the solution.
  2. If R(x) = ‖x‖₁ = |x_1| + |x_2| + ⋯ + |x_n|, it is called L1 regularization or LASSO. It has the effect of sparsifying the solution (x̂ will have few nonzero entries).
  3. If R(x) = ‖x‖∞ = max{|x_1|, |x_2|, …, |x_n|}, it is called L∞ regularization, and it has the effect of equalizing the solution (makes most components equal).
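  For R(x) = ‖x‖², the solution has the closed form from the previous slides; for ‖x‖₁ and ‖x‖∞ a solver is needed. Below is a minimal, hypothetical JuMP sketch of the LASSO case using the standard trick of bounding each |x_i| with an auxiliary variable; the data, the solver choice (OSQP), the tolerance, and recent JuMP syntax are all assumptions, not taken from the course materials:

```julia
using JuMP, OSQP, LinearAlgebra

# Hypothetical data; any QP-capable solver should work (OSQP is assumed here).
m, n, λ = 20, 50, 1.0
A, b = randn(m, n), randn(m)

model = Model(OSQP.Optimizer)
set_silent(model)
@variable(model, x[1:n])
@variable(model, t[1:n] >= 0)               # t[i] will upper-bound |x[i]|
@constraint(model,  x .<= t)
@constraint(model, -x .<= t)                # together these encode |x[i]| ≤ t[i]
@objective(model, Min, sum((A*x .- b).^2) + λ * sum(t))   # ‖Ax−b‖² + λ‖x‖₁
optimize!(model)

x_hat = value.(x)
println("nonzero entries (|xᵢ| > 1e-4): ", count(abs.(x_hat) .> 1e-4), " out of ", n)
# Replacing λ*sum(t) with λ*sum(x.^2) gives ridge regression: a dense but smaller-norm solution.
```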

  11. Norm balls
  For a norm ‖·‖_p, the norm ball of radius r is the set:
        B_r = { x ∈ Rⁿ | ‖x‖_p ≤ r }
  [Figure: the three unit norm balls in R²: ‖x‖₂ ≤ 1 (the disk x² + y² ≤ 1), ‖x‖₁ ≤ 1 (the diamond |x| + |y| ≤ 1), and ‖x‖∞ ≤ 1 (the square max{|x|, |y|} ≤ 1).]
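  A small Julia snippet (not from the course notebooks; the example point is made up) showing how the three norms are computed and how unit-ball membership can differ for the same point:

```julia
using LinearAlgebra

x = [0.6, -0.7]
println("‖x‖₂ = ", norm(x, 2), "   ‖x‖₁ = ", norm(x, 1), "   ‖x‖∞ = ", norm(x, Inf))

# This x lies inside the 2-norm and ∞-norm unit balls but outside the 1-norm unit ball.
for p in (2, 1, Inf)
    println("‖x‖_", p, " ≤ 1 ?  ", norm(x, p) <= 1)
end
```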

  12. Simple example
  Consider the minimum-norm problem for different norms:
        minimize_x   ‖x‖_p
        subject to:  Ax = b
  • The set of solutions to Ax = b is an affine subspace.
  • The solution is the point of that subspace belonging to the smallest norm ball.
  • For p = 2, this occurs at the perpendicular distance.
  [Figure: a line of solutions to Ax = b in R², touched by the smallest 2-norm ball.]

  13. Simple example
  [Figures: the 1-norm and ∞-norm balls inflated until they first touch the same line of solutions to Ax = b.]
  • For p = 1, this occurs at one of the axes: sparsifying behavior.
  • For p = ∞, this occurs where the coordinates have equal values: equalizing behavior.
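  The sparsifying and equalizing behavior can be reproduced numerically. The following is a hypothetical sketch, with made-up data a and c, that solves min ‖x‖_p subject to aᵀx = c in R²: the 2-norm case has a closed form, and the 1- and ∞-norm cases are posed as small linear programs in JuMP (the HiGHS solver and recent JuMP syntax are assumptions):

```julia
using JuMP, HiGHS, LinearAlgebra

# Hypothetical single constraint aᵀx = c in R²; compare the three minimum-norm solutions.
a, c = [1.0, 3.0], 3.0

# p = 2: closed form, the perpendicular point a·c/(aᵀa)
x2 = a * (c / dot(a, a))

# p = 1 and p = ∞ via small linear programs (epigraph reformulations)
function minnorm(p)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    @variable(model, x[1:2])
    @constraint(model, a' * x == c)
    if p == 1
        @variable(model, t[1:2] >= 0)           # t[i] ≥ |x[i]|
        @constraint(model,  x .<= t)
        @constraint(model, -x .<= t)
        @objective(model, Min, sum(t))          # ‖x‖₁
    else
        @variable(model, s >= 0)                # s ≥ |x[i]| for all i
        @constraint(model,  x .<= s)
        @constraint(model, -x .<= s)
        @objective(model, Min, s)               # ‖x‖∞
    end
    optimize!(model)
    return value.(x)
end

println("p = 2: ", x2)            # perpendicular point, here [0.3, 0.9]
println("p = 1: ", minnorm(1))    # lands on an axis (sparse), here [0.0, 1.0]
println("p = ∞: ", minnorm(Inf))  # components equal in magnitude, here [0.75, 0.75]
```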

  14. Another simple example
  Suppose we have data points {y_1, …, y_m} ⊂ R, and we would like to find the best estimator x for the data, according to different norms. Suppose the data is sorted: y_1 ≤ ⋯ ≤ y_m.
        minimize_x   ‖ [y_1; …; y_m] − [x; …; x] ‖_p
  • p = 2: x̂ = (1/m)(y_1 + ⋯ + y_m). This is the mean of the data.
  • p = 1: x̂ = y_⌈m/2⌉. This is the median of the data.
  • p = ∞: x̂ = (y_1 + y_m)/2. This is the mid-range of the data.
  Julia demo: Data Norm.ipynb
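  Data Norm.ipynb is not reproduced here; a minimal stand-in check in Julia, with made-up data, compares these closed-form answers against a brute-force grid search over candidate values of x:

```julia
using LinearAlgebra, Statistics

# Hypothetical sorted data (a stand-in; not the course's Data Norm.ipynb).
y = [1.0, 2.0, 2.0, 3.0, 7.0, 9.0, 20.0]

estimator(p) = p == 2 ? mean(y) :
               p == 1 ? median(y) :
                        (minimum(y) + maximum(y)) / 2    # mid-range for p = ∞

# Brute-force check: scan candidate values of x and confirm each formula
# (approximately) minimizes ‖y − x·1‖_p.
grid = range(minimum(y), maximum(y), length=10001)
for p in (2, 1, Inf)
    best = grid[argmin([norm(y .- x, p) for x in grid])]
    println("p = ", p, ":  formula = ", estimator(p), "   grid search ≈ ", best)
end
```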

  15. Example: hovercraft revisited
  A one-dimensional version of the hovercraft problem:
  • Start at x_1 = 0 with v_1 = 0 (at rest at position zero)
  • Finish at x_50 = 100 with v_50 = 0 (at rest at position 100)
  • Same simple dynamics as before, for t = 1, 2, …, 49:
        x_{t+1} = x_t + v_t
        v_{t+1} = v_t + u_t
  • Decide thruster inputs u_1, u_2, …, u_49.
  • This time: minimize ‖u‖_p

  16. Example: hovercraft revisited
        minimize_{x_t, v_t, u_t}   ‖u‖_p
        subject to:  x_{t+1} = x_t + v_t   for t = 1, …, 49
                     v_{t+1} = v_t + u_t   for t = 1, …, 49
                     x_1 = 0,  x_50 = 100
                     v_1 = 0,  v_50 = 0
  • This model has about 150 variables, but it is very easy to understand.
  • We can simplify the model considerably...
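  The course notebook Hover 1D.ipynb is not reproduced here, but a direct JuMP formulation of this model might look like the sketch below. It is shown for p = 1, reformulated as a linear program with auxiliary variables; the solver choice (HiGHS) and recent JuMP syntax are assumptions:

```julia
using JuMP, HiGHS

# Sketch of the p = 1 case (minimize ‖u‖₁) as a linear program.
T = 50
model = Model(HiGHS.Optimizer)
set_silent(model)
@variable(model, x[1:T])
@variable(model, v[1:T])
@variable(model, u[1:T-1])
@variable(model, s[1:T-1] >= 0)                    # s[t] ≥ |u[t]|
@constraint(model, [t=1:T-1], x[t+1] == x[t] + v[t])
@constraint(model, [t=1:T-1], v[t+1] == v[t] + u[t])
@constraint(model, x[1] == 0);   @constraint(model, x[T] == 100)
@constraint(model, v[1] == 0);   @constraint(model, v[T] == 0)
@constraint(model,  u .<= s)
@constraint(model, -u .<= s)
@objective(model, Min, sum(s))                     # ‖u‖₁
optimize!(model)

u_hat = value.(u)
println("nonzero thrusts: ", count(abs.(u_hat) .> 1e-6))   # the L1 objective gives a sparse profile
```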

  17. Model simplification
  The dynamics, for t = 1, 2, …, 49:
        x_{t+1} = x_t + v_t
        v_{t+1} = v_t + u_t
  Unrolling the velocity recursion:
        v_50 = v_49 + u_49
             = v_48 + u_48 + u_49
             = …
             = v_1 + (u_1 + u_2 + ⋯ + u_49)

  18. Model simplification
  The dynamics, for t = 1, 2, …, 49:
        x_{t+1} = x_t + v_t
        v_{t+1} = v_t + u_t
  Unrolling the position recursion:
        x_50 = x_49 + v_49
             = x_48 + 2·v_48 + u_48
             = x_47 + 3·v_47 + 2·u_47 + u_48
             = …
             = x_1 + 49·v_1 + (48·u_1 + 47·u_2 + ⋯ + 2·u_47 + u_48)

  19. Model simplification
  The dynamics, for t = 1, 2, …, 49:
        x_{t+1} = x_t + v_t
        v_{t+1} = v_t + u_t
  The constraints can be rewritten as:
        [48 47 ⋯ 2 1 0]   [u_1 ]     [x_50 − x_1 − 49·v_1]
        [ 1  1 ⋯ 1 1 1] · [u_2 ]  =  [v_50 − v_1         ]
                          [ ⋮  ]
                          [u_49]
  so we don't need the intermediate variables x_t and v_t!
  Julia demo: Hover 1D.ipynb
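  A stand-in sketch of the simplified model in plain Julia (again, not the actual Hover 1D.ipynb): for p = 2 the minimum-norm solution has the closed form Aᵀ(AAᵀ)⁻¹·rhs from earlier, so no optimization solver is needed:

```julia
using LinearAlgebra

# Stand-in for the simplified model; p = 2 case only.
T = 50
A = [ [Float64(T - 1 - t) for t in 1:T-1]';    # row 1: 48, 47, …, 1, 0
      ones(1, T - 1) ]                          # row 2: 1, 1, …, 1
rhs = [100.0 - 0.0 - (T - 1) * 0.0,             # x_50 − x_1 − 49·v_1
       0.0 - 0.0]                               # v_50 − v_1

u = A' * ((A * A') \ rhs)       # minimum 2-norm thrust sequence (length 49)
println("‖u‖₂ = ", norm(u), ",  max |u_t| = ", maximum(abs, u))
# This is the smooth thrust profile shown for the p = 2 case on the next slide.
```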

  20. Results
  [Figure: thrust u_t plotted against time (0 to 50) for each of the three objectives.]
  1. Minimizing ‖u‖₂² (smooth): thrust axis roughly ±0.3
  2. Minimizing ‖u‖₁ (sparse): thrust axis roughly ±3
  3. Minimizing ‖u‖∞ (equalized): thrust axis roughly ±0.2

  21. Tradeoff studies
  [Figure: thrust u_t plotted against time (0 to 50) for each of the three combined objectives.]
  1. Minimizing ‖u‖₂² + λ‖u‖₁ (smooth and sparse)
  2. Minimizing ‖u‖∞ + λ‖u‖₁ (equalized and sparse)
  3. Minimizing ‖u‖₂² + λ‖u‖∞ (equalized and smooth)
