

  1. MLCC 2017 Regularization Networks I: Linear Models Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017

2. About this class
◮ We introduce a class of learning algorithms based on Tikhonov regularization
◮ We study computational aspects of these algorithms

3. Empirical Risk Minimization (ERM)
◮ Empirical Risk Minimization (ERM): probably the most popular approach to designing learning algorithms.
◮ General idea: consider the empirical error
$$\hat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)),$$
as a proxy for the expected error
$$\mathcal{E}(f) = \mathbb{E}[\ell(y, f(x))] = \int dx\, dy\; p(x, y)\, \ell(y, f(x)).$$
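
To make the idea concrete, here is a minimal sketch (ours, not from the slides) of the empirical error of a linear model, assuming NumPy arrays for the data; the names `empirical_risk` and `square_loss` are ours.

```python
import numpy as np

def empirical_risk(w, X, Y, loss):
    """Empirical error (1/n) * sum_i loss(y_i, f_w(x_i)) for a linear model f_w(x) = x^T w."""
    predictions = X @ w                 # f_w(x_i) for every row x_i of X
    return np.mean(loss(Y, predictions))

# Example: the square loss, used later for regularized least squares.
square_loss = lambda y, f: (y - f) ** 2
```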

4. The Expected Risk is Not Computable
Recall that
◮ $\ell$ measures the price we pay for predicting $f(x)$ when the true label is $y$
◮ $\mathcal{E}(f)$ cannot be directly computed, since $p(x, y)$ is unknown

5. From Theory to Algorithms: The Hypothesis Space
To turn the above idea into an actual algorithm, we:
◮ Fix a suitable hypothesis space $\mathcal{H}$
◮ Minimize $\hat{\mathcal{E}}$ over $\mathcal{H}$
$\mathcal{H}$ should allow feasible computations and be rich, since the complexity of the problem is not known a priori.

6. Example: Space of Linear Functions
The simplest example of $\mathcal{H}$ is the space of linear functions:
$$\mathcal{H} = \{ f : \mathbb{R}^d \to \mathbb{R} \;\mid\; \exists\, w \in \mathbb{R}^d \text{ such that } f(x) = x^T w, \; \forall x \in \mathbb{R}^d \}.$$
◮ Each function $f$ is defined by a vector $w$
◮ $f_w(x) = x^T w$
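
As a quick illustration (ours, not from the slides), evaluating a linear hypothesis on a batch of points; the data below are random and purely hypothetical.

```python
import numpy as np

def f_w(X, w):
    """Linear hypothesis: each prediction is the inner product x_i^T w."""
    return X @ w

# Hypothetical usage: 5 random points in R^3 and an arbitrary weight vector.
X = np.random.randn(5, 3)
w = np.array([0.5, -1.0, 2.0])
print(f_w(X, w))                        # 5 predictions, one per input point
```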

7. Rich $\mathcal{H}$'s May Require Regularization
◮ If $\mathcal{H}$ is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data)
◮ Regularization techniques restore stability and ensure generalization

8. Tikhonov Regularization
Consider the Tikhonov regularization scheme
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2 \qquad (1)$$
It describes a large class of methods sometimes called Regularization Networks.

9. The Regularizer
◮ $\|w\|^2$ is called the regularizer
◮ It controls the stability of the solution and prevents overfitting
◮ $\lambda$ balances the error term and the regularizer

10. Loss Functions
◮ Different loss functions $\ell$ induce different classes of methods
◮ We will see common aspects and differences when considering different loss functions
◮ There exists no general computational scheme to solve Tikhonov regularization
◮ The solution depends on the considered loss function

11. The Regularized Least Squares Algorithm
Regularized Least Squares: Tikhonov regularization
$$\min_{w \in \mathbb{R}^D} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2, \qquad \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \qquad (2)$$
Square loss function:
$$\ell(y, f_w(x)) = (y - f_w(x))^2$$
We then obtain the RLS optimization problem (linear model):
$$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda w^T w, \qquad \lambda \geq 0. \qquad (3)$$
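
A minimal sketch (ours) of the RLS objective in (3), assuming NumPy; the function name `rls_objective` is not from the slides.

```python
import numpy as np

def rls_objective(w, X, Y, lam):
    """RLS objective (3): mean squared error plus the Tikhonov penalty lam * ||w||^2."""
    residuals = Y - X @ w
    return np.mean(residuals ** 2) + lam * np.dot(w, w)
```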

12. Matrix Notation
◮ The $n \times d$ matrix $X_n$, whose rows are the input points
◮ The $n \times 1$ vector $Y_n$, whose entries are the corresponding outputs.
With this notation,
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 = \frac{1}{n} \|Y_n - X_n w\|^2.$$

13. Gradients of the ER and of the Regularizer
By direct computation,
◮ Gradient of the empirical risk w.r.t. $w$:
$$-\frac{2}{n} X_n^T (Y_n - X_n w)$$
◮ Gradient of the regularizer w.r.t. $w$:
$$2w$$

14. The RLS Solution
By setting the gradient to zero, the solution of RLS solves the linear system
$$(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n.$$
$\lambda$ controls the invertibility of $(X_n^T X_n + \lambda n I)$

15. Choosing the Cholesky Solver
◮ Several methods can be used to solve the above linear system
◮ Cholesky decomposition is the method of choice, since $X_n^T X_n + \lambda n I$ is symmetric and positive definite.
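
A possible implementation of the RLS training step (a sketch under our own naming, not the course's reference code), assuming NumPy and SciPy's Cholesky routines cho_factor / cho_solve.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rls_train(X, Y, lam):
    """Solve (X^T X + lam * n * I) w = X^T Y via a Cholesky factorization."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # symmetric and positive definite for lam > 0
    b = X.T @ Y
    c, lower = cho_factor(A)
    return cho_solve((c, lower), b)
```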

16. Time Complexity
Time complexity of the method:
◮ Training: $O(nd^2)$ (assuming $n \gg d$)
◮ Testing: $O(d)$

17. Dealing with an Offset
For linear models, especially in low dimensional spaces, it is useful to consider an offset:
$$w^T x + b$$
How to estimate $b$ from data?

18. Idea: Augmenting the Dimension of the Input Space
◮ Simple idea: augment the dimension of the input space, considering
$$\tilde{x} = (x, 1), \qquad \tilde{w} = (w, b).$$
◮ This is fine if we do not regularize, but if we do then this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes
$$\|\tilde{w}\|^2 = \|w\|^2 + b^2.$$
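
A small sketch of the augmentation trick (ours, not from the slides), assuming NumPy; the helper name `augment` is hypothetical.

```python
import numpy as np

def augment(X):
    """Append a constant feature equal to 1, so that x_tilde^T w_tilde = w^T x + b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Caveat, as noted above: plain RLS on the augmented data penalizes
# ||w_tilde||^2 = ||w||^2 + b^2, i.e. it also shrinks the offset b.
```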

19. Avoiding Penalizing the Solutions with an Offset
We want to regularize considering only $\|w\|^2$, without penalizing the offset. The modified regularized problem becomes:
$$\min_{(w, b) \in \mathbb{R}^{D+1}} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i - b)^2 + \lambda \|w\|^2.$$

20. Solution with Offset: Centering the Data
It can be proved that a solution $(w^*, b^*)$ of the above problem is given by
$$b^* = \bar{y} - \bar{x}^T w^*$$
where
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

21. Solution with Offset: Centering the Data
$w^*$ solves
$$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i^c - w^T x_i^c)^2 + \lambda \|w\|^2,$$
where $y_i^c = y_i - \bar{y}$ and $x_i^c = x_i - \bar{x}$ for all $i = 1, \ldots, n$.
Note: This corresponds to centering the data and then applying the standard RLS algorithm.
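
A minimal sketch of RLS with an offset via centering (ours), assuming NumPy; `np.linalg.solve` stands in here for the Cholesky-based solver discussed earlier.

```python
import numpy as np

def rls_with_offset(X, Y, lam):
    """Center the data, solve standard RLS for w*, then recover b* = y_bar - x_bar^T w*."""
    x_bar, y_bar = X.mean(axis=0), Y.mean()
    Xc, Yc = X - x_bar, Y - y_bar
    n, d = X.shape
    w = np.linalg.solve(Xc.T @ Xc + lam * n * np.eye(d), Xc.T @ Yc)
    b = y_bar - x_bar @ w
    return w, b
```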

22. Introduction: Regularized Logistic Regression
Regularized logistic regression: Tikhonov regularization
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2, \qquad \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \qquad (4)$$
with the logistic loss function:
$$\ell(y, f_w(x)) = \log(1 + e^{-y f_w(x)})$$

23. The Logistic Loss Function
Figure: Plot of the logistic regression loss function

24. Minimization Through Gradient Descent
◮ The logistic loss function is differentiable
◮ The natural candidate to compute a minimizer is the gradient descent (GD) algorithm

25. Regularized Logistic Regression (RLR)
◮ The regularized ERM problem associated with the logistic loss is called regularized logistic regression
◮ Its solution can be computed via gradient descent
◮ Note:
$$\nabla \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \frac{-y_i\, e^{-y_i x_i^T w_{t-1}}}{1 + e^{-y_i x_i^T w_{t-1}}}\, x_i = \frac{1}{n} \sum_{i=1}^{n} \frac{-y_i}{1 + e^{y_i x_i^T w_{t-1}}}\, x_i$$

26. RLR: Gradient Descent Iteration
For $w_0 = 0$, the GD iteration applied to
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2$$
is
$$w_t = w_{t-1} - \gamma \underbrace{\left( \frac{1}{n} \sum_{i=1}^{n} \frac{-y_i}{1 + e^{y_i x_i^T w_{t-1}}}\, x_i + 2 \lambda w_{t-1} \right)}_{a}$$
for $t = 1, \ldots, T$, where $a = \nabla\big(\hat{\mathcal{E}}(f_w) + \lambda \|w\|^2\big)$
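
A sketch of the iteration above (ours, not the course's reference code), assuming NumPy and labels in {-1, +1}; the function name and the fixed step size gamma are our choices.

```python
import numpy as np

def rlr_gradient_descent(X, Y, lam, gamma, T):
    """Gradient descent for regularized logistic regression, started from w_0 = 0."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = Y * (X @ w)                                  # y_i x_i^T w_{t-1}
        grad_risk = -(X.T @ (Y / (1.0 + np.exp(margins)))) / n
        w = w - gamma * (grad_risk + 2.0 * lam * w)            # step along a
    return w
```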

27. Logistic Regression and Confidence Estimation
◮ The solution of logistic regression has a probabilistic interpretation
◮ It can be derived from the following model:
$$p(1 \mid x) = \underbrace{\frac{e^{x^T w}}{1 + e^{x^T w}}}_{h}$$
where $h$ is called the logistic function.
◮ This can be used to compute a confidence for each prediction
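
A small sketch (ours) of the confidence estimate, assuming NumPy; the function name `confidence` is hypothetical.

```python
import numpy as np

def confidence(X, w):
    """Estimate p(1 | x) with the logistic function h(x) = e^{x^T w} / (1 + e^{x^T w})."""
    scores = X @ w
    return 1.0 / (1.0 + np.exp(-scores))   # algebraically equivalent, numerically safer form
```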

28. Support Vector Machines
Formulation in terms of Tikhonov regularization:
$$\min_{w \in \mathbb{R}^d} \hat{\mathcal{E}}(f_w) + \lambda \|w\|^2, \qquad \hat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_w(x_i)) \qquad (5)$$
with the hinge loss function:
$$\ell(y, f_w(x)) = |1 - y f_w(x)|_+$$
Figure: Plot of the hinge loss as a function of $y \cdot f(x)$

29. A more classical formulation (linear case)
$$w^* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$$
with $\lambda = \frac{1}{C}$
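
The slides do not spell out a solver for this unconstrained form; as one possibility, a hedged subgradient-descent sketch for the objective above, assuming NumPy and labels in {-1, +1}; all names and the fixed step size are ours.

```python
import numpy as np

def linear_svm_subgradient(X, Y, lam, gamma, T):
    """Subgradient descent on (1/n) sum_i |1 - y_i w^T x_i|_+ + lam * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = Y * (X @ w)
        active = margins < 1                          # examples with non-zero hinge loss
        subgrad = -(X[active].T @ Y[active]) / n + 2.0 * lam * w
        w = w - gamma * subgrad
    return w
```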

30. A more classical formulation (linear case)
$$w^* = \arg\min_{w \in \mathbb{R}^d,\; \xi_i \geq 0} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i w^\top x_i \geq 1 - \xi_i \;\; \forall i \in \{1, \ldots, n\}$$
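
As an illustration only (not from the slides), the constrained problem above written with the cvxpy modeling library; the function name, variable names, and reliance on cvxpy's default solver are our assumptions, and labels are taken to be in {-1, +1}.

```python
import numpy as np
import cvxpy as cp

def linear_svm_qp(X, Y, C):
    """Solve min ||w||^2 + (C/n) * sum_i xi_i  s.t.  y_i w^T x_i >= 1 - xi_i, xi_i >= 0."""
    n, d = X.shape
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)
    objective = cp.Minimize(cp.sum_squares(w) + (C / n) * cp.sum(xi))
    constraints = [cp.multiply(Y, X @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value
```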

31. A geometric intuition - classification
In general you have many solutions.
Figure: Two-class dataset in the plane, separable by many different lines
What do you select?

32. A geometric intuition - classification
Intuitively I would choose an “equidistant” line.
Figure: Two-class dataset with an “equidistant” separating line

34. Maximum margin classifier
I want the classifier that
◮ classifies the dataset perfectly
◮ maximizes the distance from its closest examples
Figure: Two-class dataset with the maximum margin separating line

35. Point-Hyperplane distance
How do we do this mathematically? Let $w$ define our separating hyperplane. We have
$$x = \alpha w + x_\perp, \qquad \text{with } \alpha = \frac{x^\top w}{\|w\|^2} \text{ and } x_\perp = x - \alpha w.$$
Point-Hyperplane distance: $d(x, w) = \|x_\perp\|$
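
A small sketch (ours) of the decomposition on this slide, assuming NumPy and the normalization $\alpha = x^\top w / \|w\|^2$, which makes $x_\perp$ orthogonal to $w$.

```python
import numpy as np

def decompose(x, w):
    """Split x into a component along w and an orthogonal remainder: x = alpha*w + x_perp."""
    alpha = (x @ w) / (w @ w)           # alpha = x^T w / ||w||^2
    x_perp = x - alpha * w              # orthogonal to w by construction
    return alpha, x_perp

# d(x, w) as defined on the slide is the norm of the orthogonal remainder:
# np.linalg.norm(decompose(x, w)[1]), which by Pythagoras equals
# sqrt(||x||^2 - (x^T w)^2 / ||w||^2), the quantity used two slides below.
```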

36. Margin
A hyperplane $w$ correctly classifies an example $(x_i, y_i)$ if
◮ $y_i = 1$ and $w^\top x_i > 0$, or
◮ $y_i = -1$ and $w^\top x_i < 0$,
therefore $x_i$ is correctly classified iff $y_i w^\top x_i > 0$.
Margin: $m_i = y_i w^\top x_i$
Note that $x_\perp = x_i - \dfrac{y_i m_i}{\|w\|^2}\, w$

37. Maximum margin classifier definition
I want the classifier that
◮ classifies the dataset perfectly
◮ maximizes the distance from its closest examples
$$w^* = \arg\max_{w \in \mathbb{R}^d} \; \min_{1 \leq i \leq n} d(x_i, w)^2 \quad \text{subject to} \quad m_i > 0 \;\; \forall i \in \{1, \ldots, n\}$$
Calling $\mu$ the smallest $m_i$, we have
$$w^* = \arg\max_{w \in \mathbb{R}^d,\; \mu \geq 0} \; \min_{1 \leq i \leq n} \|x_i\|^2 - \frac{(x_i^\top w)^2}{\|w\|^2} \quad \text{subject to} \quad y_i w^\top x_i \geq \mu \;\; \forall i \in \{1, \ldots, n\}$$
that is
