  1. Robustness and Regularization: Two sides of the same coin (joint work with Jose Blanchet and Yang Kang). Karthyek Murthy, Columbia University, Jun 28, 2016.

  2. Introduction. Richer data has tempted us to consider more elaborate models, and elaborate models $\Rightarrow$ more factors/variables. Generalization has become a lot more challenging. Regularization has been useful in avoiding overfitting. Goal: a distributionally robust approach for improving generalization.

  3. Motivation for distributionally robust optimization. We want to solve the stochastic optimization problem $\min_\beta E[\mathrm{Loss}(X, \beta)]$. Typically, we have access to the probability distribution of $X$ only via its samples $\{X_1, \ldots, X_n\}$. A common practice is to instead solve $\min_\beta \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$.

  4. $\min_\beta \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$ as a proxy for $\min_\beta E[\mathrm{Loss}(X, \beta)]$.

  5. [Figure: histogram of the $n$ samples.] $\min_\beta \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$ as a proxy for $\min_\beta E[\mathrm{Loss}(X, \beta)]$.

  6. [Figure: two histograms from different sample draws, illustrating sampling variability.] $\min_\beta \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(X_i, \beta)$ as a proxy for $\min_\beta E[\mathrm{Loss}(X, \beta)]$.
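A minimal numerical sketch of this proxy (the quadratic loss, the data model, and all names below are illustrative assumptions, not from the slides): the empirical minimizer fluctuates around the population minimizer from one sample draw to the next.

```python
import numpy as np

# Illustrative loss: Loss(x, beta) = (x - beta)^2, so the population problem
# min_beta E[Loss(X, beta)] is solved by beta* = E[X].
rng = np.random.default_rng(42)
true_mean = 2.0

for draw in range(3):
    samples = rng.normal(loc=true_mean, scale=5.0, size=100)
    # The empirical risk (1/n) sum_i (X_i - beta)^2 is minimized at the sample mean.
    beta_hat = samples.mean()
    print(f"draw {draw}: empirical minimizer {beta_hat:+.3f}  vs  population minimizer {true_mean:+.3f}")
```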

  7. Learning: naturally thought of as finding the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$, where $x_i = (x_1, \ldots, x_d)$ is the vector of predictors and $y_i$ is the corresponding response. [Image source: r-bloggers.com]

  8. Learning: naturally thought of as finding the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$. Empirical loss/risk minimization (ERM): $\frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(f(x_i), y_i)$. [Image source: r-bloggers.com]

  9. Learning: naturally thought of as finding the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$. Empirical loss/risk minimization (ERM): $\frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(f(x_i), y_i) = \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2$ for squared loss. [Image source: r-bloggers.com]
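A minimal sketch of ERM with squared loss for a linear model $f(x) = \beta^T x$ (the synthetic data and names are assumptions made for illustration; ERM here reduces to ordinary least squares):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)   # y_i = f(x_i) + e_i

# ERM with squared loss and f(x) = beta^T x is ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
empirical_risk = np.mean((y - X @ beta_hat) ** 2)
print("beta_hat:", beta_hat, "  empirical risk:", empirical_risk)
```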

  10. Learning: naturally thought of as finding the "best" $f$ such that $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$. But fitting the training data is not enough: find an $f$ that fits well over "future" values as well. [Image source: r-bloggers.com]

  11. Generalization: think of the data $(x_1, y_1), \ldots, (x_n, y_n)$ as samples from a probability distribution $P$. Then "future values" can also be interpreted as samples from $P$.

  12. Generalization: think of the data $(x_1, y_1), \ldots, (x_n, y_n)$ as samples from a probability distribution $P$, so that "future values" are also samples from $P$. Then $\min_f \frac{1}{n} \sum_{i=1}^n \mathrm{Loss}(f(x_i), y_i) \;\longrightarrow\; \min_f E_P[\mathrm{Loss}(f(X), Y)]$. However, access to $P$ is still only via the samples: $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{(x_i, y_i)}$.
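A small illustration of the gap between the objective under $P_n$ and the objective under $P$ (the Gaussian data model and all names below are assumptions made only for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 10                        # few samples relative to the dimension
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + rng.normal(size=n)

# Fit by ERM (least squares) on the n training samples, i.e. under P_n.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
risk_Pn = np.mean((y - X @ beta_hat) ** 2)

# Evaluate on fresh samples from P to approximate the population risk.
X_new = rng.normal(size=(10_000, d))
y_new = X_new @ beta_true + rng.normal(size=10_000)
risk_P = np.mean((y_new - X_new @ beta_hat) ** 2)
print(f"risk under P_n: {risk_Pn:.2f}   risk under fresh samples from P: {risk_P:.2f}")
```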

  13. Want to solve $\min_{f \in \mathcal{F}} E_P[\mathrm{Loss}(f(X), Y)]$; $P$ is unknown.

  14. Know how to solve $\min_{f \in \mathcal{F}} E_{P_n}[\mathrm{Loss}(f(X), Y)]$; access to $P$ is via the training samples, i.e., via $P_n$.

  15. More and more samples give a better approximation to $P$; however, the quality of this approximation depends on the dimension.

  16. We are provided with only limited training data ($n$ samples), sometimes to the extent that $n$ is smaller than the dimension of the parameter of interest.

  17. Instead of finding the best fit with respect to $P_n$, why not find a fit that works over all $Q$ such that $D(Q, P_n) \leq \delta$?

  18. Formally: $\min_{f \in \mathcal{F}} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[\mathrm{Loss}(f(X), Y)]$.
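For a fixed $f$, the inner maximization can be illustrated numerically. The sketch below restricts $Q$ to a finite candidate support and takes $D$ to be an optimal transport cost (defined formally a few slides later), which turns the worst case into a small linear program; the support grid, the cost choice, and all names are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(samples, support, loss, cost, delta):
    """Max of E_Q[loss] over Q with transport cost D_c(Q, P_n) <= delta,
    with Q restricted (for illustration) to the finite candidate `support`."""
    n, m = len(samples), len(support)
    C = np.array([[cost(u, v) for v in support] for u in samples])  # transport costs
    L = np.array([loss(v) for v in support])                        # loss at target points
    obj = -np.tile(L, n)            # maximize sum_ij pi_ij * L_j (variables pi_ij, row-major)
    A_eq = np.zeros((n, n * m))     # first marginal of the plan must equal P_n
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    b_eq = np.full(n, 1.0 / n)
    A_ub = C.ravel()[None, :]       # total transport cost at most delta
    b_ub = np.array([delta])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return -res.fun

# Worst-case E_Q[X^2] around 20 standard-normal samples, radius delta = 0.1.
rng = np.random.default_rng(0)
xs = rng.normal(size=20)
grid = np.linspace(-10, 10, 201)
print(worst_case_expectation(xs, grid, loss=lambda v: v ** 2,
                             cost=lambda u, v: abs(u - v), delta=0.1))
```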

  19. DR Regression: $\min_{f \in \mathcal{F}} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[\mathrm{Loss}(f(X), Y)]$.

  20. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$.

  21. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$.
      I. Are these DR regression problems solvable? If so, how do they compare with known methods for improving generalization? (A sketch follows this list.)
      II. How to beat the curse of dimensionality while choosing $\delta$? (Robust Wasserstein profile function.)
      III. Does the framework scale? (Support vector machines, logistic regression, general sample average approximation.)
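On question I: the talk's title suggests an equivalence between the DR problem above and norm-regularized regression. The sketch below simply optimizes one such square-root-regularized objective, $\min_\beta \big(\sqrt{\mathrm{MSE}_n(\beta)} + \sqrt{\delta}\,\|\beta\|_2\big)^2$; treating this particular form as the exact reformulation of the DR problem is an assumption here (it depends on the transport cost used), and all code names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def sqrt_regularized_regression(X, y, delta):
    """Minimize (sqrt(MSE_n(beta)) + sqrt(delta) * ||beta||_2)^2 over beta.
    Its equivalence with Wasserstein DR regression is assumed, not derived here."""
    def objective(beta):
        rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
        return (rmse + np.sqrt(delta) * np.linalg.norm(beta)) ** 2

    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)   # start from the ERM solution
    return minimize(objective, x0=beta0, method="BFGS").x

# Illustrative usage on synthetic data: larger delta shrinks the coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=50)
for delta in (0.0, 0.05, 0.5):
    print(delta, np.round(sqrt_regularized_regression(X, y, delta), 3))
```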

  22. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. How to quantify the distance $D(P, Q)$?

  23. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. How to quantify the distance $D(P, Q)$? Answer: let $(U, V)$ be two random variables such that $U \sim P$ and $V \sim Q$, and let $\pi$ denote a joint distribution of $(U, V)$. Then $D(P, Q) = \inf_\pi E_\pi \|U - V\|$.

  24. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. [Figure: Monge's transport map $T$ between déblais and remblais, from the book Optimal Transport: Old and New by Cédric Villani.] How to quantify the distance $D(P, Q)$? Answer: let $(U, V)$ be two random variables such that $U \sim P$ and $V \sim Q$, and let $\pi$ denote a joint distribution of $(U, V)$. Then $D(P, Q) = \inf_\pi E_\pi \|U - V\|$.

  25. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. More generally, let $(U, V)$ be random variables with $U \sim P$ and $V \sim Q$, and let $\pi$ denote a joint distribution of $(U, V)$. Then $D_c(P, Q) = \inf_\pi E_\pi[c(U, V)]$. The metric $D_c$ is called an optimal transport metric. When $c(u, v) = \|u - v\|^p$, $D_c^{1/p}$ is the $p$-th order Wasserstein distance.
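When $P$ and $Q$ are empirical distributions, $D_c(P, Q)$ is a finite linear program over couplings. A minimal sketch (the helper name and the choice $c(u, v) = \|u - v\|$, i.e. the first-order Wasserstein distance, are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def optimal_transport_cost(xs, ys, cost):
    """D_c between the empirical distributions of xs (mass 1/n each) and
    ys (mass 1/m each), computed as a linear program over the coupling pi."""
    n, m = len(xs), len(ys)
    C = np.array([[cost(x, y) for y in ys] for x in xs])   # c(x_i, y_j)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # sum_j pi_ij = 1/n  (first marginal)
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # sum_i pi_ij = 1/m  (second marginal)
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Example: W_1 distance between two small point clouds in R^2.
rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))
ys = rng.normal(loc=1.0, size=(7, 2))
print(optimal_transport_cost(xs, ys, cost=lambda u, v: np.linalg.norm(u - v)))
```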

  26. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. Next, how do we choose $\delta$?

  27. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. Next, how do we choose $\delta$? See Fournier and Guillin (2015), Lee and Mehrotra (2013), and Shafieezadeh-Abadeh, Esfahani and Kuhn (2015).

  28. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. The object of interest $\beta_*$ satisfies $E_P[(Y - \beta_*^T X)\, X] = 0$.

  29. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. The object of interest $\beta_*$ satisfies $E_P[(Y - \beta_*^T X)\, X] = 0$. [Figure: $P_n$ together with the set $\{Q : E_Q[(Y - \beta_*^T X)\, X] = 0\}$, which contains $P$.]

  30. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. The object of interest $\beta_*$ satisfies $E_P[(Y - \beta_*^T X)\, X] = 0$. Define $R_n(\beta_*) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta_*^T X)\, X] = 0 \big\}$. [Figure: $P_n$, the $\delta$-ball around it, and the set $\{Q : E_Q[(Y - \beta_*^T X)\, X] = 0\}$.]

  31. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. Theorem 1 [Blanchet, Kang & M.]: if $Y = \beta_*^T X + \epsilon$, then $n R_n(\beta_*) \xrightarrow{D} \mathcal{L}$, where $R_n(\beta_*) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta_*^T X)\, X] = 0 \big\}$.

  32. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. Theorem 1 [Blanchet, Kang & M.]: if $Y = \beta_*^T X + \epsilon$, then $n R_n(\beta_*) \xrightarrow{D} \mathcal{L}$. Choose $\delta = \eta / n$, where $\eta$ is such that $P\{\mathcal{L} \leq \eta\} \geq 0.95$.

  33. DR Linear Regression: $\min_{\beta \in \mathbb{R}^d} \; \max_{Q : D_c(Q, P_n) \leq \delta} E_Q[(Y - \beta^T X)^2]$. Theorem 1 [Blanchet, Kang & M.]: if $Y = \beta_*^T X + \epsilon$, then $n R_n(\beta_*) \xrightarrow{D} \mathcal{L}$. Choose $\delta = \eta_\alpha / n$, where $\eta_\alpha$ is such that $P\{\mathcal{L} \leq \eta_\alpha\} \geq 1 - \alpha$.
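One practical reading of this prescription: estimate a $(1-\alpha)$ quantile of $n R_n(\hat\beta)$ and divide by $n$. The sketch below does this by naive resampling rather than via the limiting law $\mathcal{L}$; the helper `rwp_fn`, assumed to return $R_n(\beta)$ for a given data set, is hypothetical, and the whole procedure is only an illustration of the $\eta_\alpha / n$ scaling.

```python
import numpy as np

def choose_delta(samples, beta_hat, rwp_fn, alpha=0.05, n_boot=500, seed=None):
    """Return delta = eta_alpha / n, with eta_alpha a (1 - alpha) quantile of
    n * R_n(beta_hat) estimated by resampling the data.

    rwp_fn(data, beta) is a hypothetical helper that evaluates the robust
    Wasserstein profile function R_n(beta) on a data set."""
    rng = np.random.default_rng(seed)
    n = len(samples)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(n * rwp_fn(samples[idx], beta_hat))
    eta_alpha = np.quantile(stats, 1 - alpha)
    return eta_alpha / n
```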

  34. Robust Wasserstein profile function: $R_n(\beta) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X)\, X] = 0 \big\}$.

  35. Robust Wasserstein profile function: $R_n(\beta) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X)\, X] = 0 \big\}$. [Figure: the empirical distribution $P_n$ as point masses $p(x, y)$ in the $(x, y)$ plane.]

  36. Robust Wasserstein profile function: $R_n(\beta) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X)\, X] = 0 \big\}$. [Figure: $P_n$ together with a perturbed distribution $\tilde{P}_n$ satisfying the constraint $E_Q[(Y - \beta^T X)\, X] = 0$.]

  37. Robust Wasserstein profile function: $R_n(\beta) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X)\, X] = 0 \big\}$. [Figure: $\tilde{P}_n$ is the closest such distribution to $P_n$, so that $D_c(\tilde{P}_n, P_n) = R_n(\beta)$.]

  38. Robust Wasserstein profile function: $R_n(\beta) = \min\big\{ D_c(Q, P_n) : E_Q[(Y - \beta^T X)\, X] = 0 \big\}$. Basically, $R_n(\beta)$ is a measure of the goodness of $\beta$: $n R_n(\beta) \longrightarrow \mathcal{L}$ if $\beta = \beta_*$, and $n R_n(\beta) \longrightarrow \infty$ if $\beta \neq \beta_*$. This is similar to the empirical likelihood profile function. In the high-dimensional setting, one can instead consider suitable non-asymptotic bounds for $n R_n(\beta)$.
