
9.54 class 8 - Supervised learning: Optimization, regularization, kernels



  1. 9.54 class 8: Supervised learning - Optimization, regularization, kernels. Shimon Ullman + Tomaso Poggio. Danny Harari + Daniel Zysman + Darren Seibert. 9.54, fall semester 2014.

  2. The Regularization Kingdom • Loss functions and empirical risk minimization • Basic regularization algorithms

  3. Math

  4. Given a training set $S = (x_1, y_1), \ldots, (x_n, y_n)$, find $f(x) \sim y$.

  5. We need a way to measure errors: a loss function $V(f(x), y)$.

  6. Common loss functions:
     • 0-1 loss: $V(f(x), y) = \Theta(-yf(x))$, where $\Theta$ is the step function
     • square loss (L2): $V(f(x), y) = (f(x) - y)^2 = (1 - yf(x))^2$ for $y \in \{-1, +1\}$
     • absolute value (L1): $V(f(x), y) = |f(x) - y|$
     • Vapnik's $\epsilon$-insensitive loss: $V(f(x), y) = (|f(x) - y| - \epsilon)_+$
     • hinge loss: $V(f(x), y) = (1 - yf(x))_+$
     • logistic loss: $V(f(x), y) = \log(1 + e^{-yf(x)})$ (logistic regression)
     • exponential loss: $V(f(x), y) = e^{-yf(x)}$

  7. Given a loss function $V(f(x), y)$, we can define the empirical error $I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$.
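
As a rough illustration (not from the slides), the losses above and the empirical error $I_S[f]$ might be written in NumPy as follows, assuming binary labels $y \in \{-1, +1\}$ and a vector of predictions $f(x_i)$; the function names are ours.

```python
import numpy as np

# Margin-based losses for binary labels y in {-1, +1}; f is the vector of predictions f(x_i).
def hinge_loss(f, y):        return np.maximum(0.0, 1.0 - y * f)
def logistic_loss(f, y):     return np.log(1.0 + np.exp(-y * f))
def exponential_loss(f, y):  return np.exp(-y * f)
def square_loss(f, y):       return (f - y) ** 2      # equals (1 - y*f)^2 when y = +/-1

def empirical_error(f_vals, y, loss):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i)."""
    return np.mean(loss(f_vals, y))
```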

  8. "Learning processes do not take place in vacuum." (Cucker and Smale, AMS 2001). We need to fix a hypothesis space $H \subset F = \{ f \mid f : X \to Y \}$.

  9.-11. (progressive build) Examples of hypothesis spaces $H \subset F = \{ f \mid f : X \to Y \}$:
     • parametric: linear model $f(x) = \sum_{j=1}^{p} x_j w_j$
     • semi-parametric: generalized linear models $f(x) = \sum_{j=1}^{p} \Phi(x)_j w_j$
     • non-parametric: reproducing kernel Hilbert spaces $f(x) = \sum_{j \geq 1} \Phi(x)_j w_j = \sum_{i \geq 1} K(x, x_i) \alpha_i$, where $K(x, x')$ is a symmetric positive definite function called a reproducing kernel

  12.-14. Empirical Risk Minimization (ERM):
     $\min_{f \in H} I_S[f] = \min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$
     Which is a good solution?
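
A minimal sketch of ERM in the simplest setting (our own illustration, with arbitrary step size and iteration count): gradient descent on the empirical risk over the linear hypothesis space $f(x) = w^T x$ with the square loss.

```python
import numpy as np

def erm_linear_square_loss(X, y, lr=0.1, n_iters=500):
    """Minimize I_S[f] = (1/n) * sum_i (w^T x_i - y_i)^2 over linear f by gradient descent.

    X : (n, p) data matrix, one training point per row; y : (n,) targets.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        residuals = X @ w - y                  # f(x_i) - y_i for every training point
        w -= lr * (2.0 / n) * (X.T @ residuals)
    return w
```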

  15. [Diagram: training set → learning algorithm → hypothesis h; input x (living area of house) → predicted y (predicted price of house).] The training set $S = (x_1, y_1), \ldots, (x_n, y_n)$ is sampled independently and identically (i.i.d.) from a fixed unknown probability distribution $p(x, y) = p(x)\,p(y|x)$.

  16. Learning is an ill-posed problem (Jacques Hadamard). Ill-posed problems often arise if one tries to infer general laws from few data: the hypothesis space is too large, or there are not enough data. In general ERM leads to ill-posed solutions because the solution may be too complex, it may not be unique, or it may change radically when leaving one sample out. Regularization Theory provides results and techniques to restore well-posedness, that is, stability (hence generalization).

  17. Beyond drawings and intuitions there is a deep, rigorous mathematical foundation of regularized learning algorithms (Cucker and Smale, Vapnik and Chervonenkis, ...). The theory of learning is a synthesis of different fields, e.g. Computer Science (Algorithms, Complexity) and Mathematics (Optimization, Probability, Statistics). Central to the theory of machine learning is the problem of understanding the conditions under which ERM can solve $\inf_f E(f)$, where $E(f) = \mathbb{E}_{(x,y)}\, V(y, f(x))$ is the expected risk.

  18. Algorithms: The Regularization Kingdom • loss functions and empirical risk minimization • basic regularization algorithms

  19. (Tikhonov) Regularization:
     $\min_{f \in H} \left\{ \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda R(f) \right\} \to f^{\lambda}_S$
     where $\lambda$ is the regularization parameter and $R(f)$ is the regularizer.
     • The regularizer describes the complexity of the solution: for two candidate functions $f_1$ and $f_2$ of increasing complexity, $R(f_2)$ is bigger than $R(f_1)$.
     • The regularization parameter determines the trade-off between complexity and empirical risk.
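
The Tikhonov objective itself is just the empirical error plus a weighted penalty; a tiny generic sketch (our own naming), where V is any of the losses listed earlier and R_f is the value of the regularizer at the candidate function:

```python
import numpy as np

def tikhonov_objective(f_vals, y, V, R_f, lam):
    """(1/n) * sum_i V(f(x_i), y_i) + lam * R(f)."""
    return np.mean(V(f_vals, y)) + lam * R_f
```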

  20. Stability and (Tikhonov) Regularization. Math: consider $f(x) = w^T x = \sum_{j=1}^{p} w_j x_j$ and $R(f) = w^T w$. Then
     $\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 \;\Rightarrow\; w^T = Y X^T (X X^T)^{-1}$
     $\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2 \;\Rightarrow\; w^T = Y X^T (X X^T + \lambda I)^{-1}$
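
A hedged NumPy version of the regularized solution above, written in the common "one sample per row" convention, in which the closed form reads $w = (X^T X + \lambda I)^{-1} X^T y$ (the transpose of the slide's row-vector form); lam stands for $\lambda$.

```python
import numpy as np

def regularized_least_squares(X, y, lam):
    """Tikhonov-regularized linear least squares (ridge regression).

    X : (n, p) matrix, one training point per row; y : (n,) targets; lam >= 0.
    """
    p = X.shape[1]
    # Closed form: w = (X^T X + lambda * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```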

  21. From Linear to Semi-parametric Models: $f(x) = \sum_{j=1}^{p} x_j w_j$ (linear model) $\Rightarrow$ $f(x) = \sum_{j=1}^{p} \Phi(x)_j w_j$ (generalized linear model). If instead of a linear model we have a generalized linear model, we simply have to consider the feature matrix
     $X_n = \begin{pmatrix} \Phi(x_1)_1 & \ldots & \Phi(x_1)_p \\ \vdots & \ddots & \vdots \\ \Phi(x_n)_1 & \ldots & \Phi(x_n)_p \end{pmatrix}$
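
To make the jump from linear to generalized linear concrete, here is an illustrative feature map (our own choice, not from the slides) and how it would be plugged into the ridge solver sketched above:

```python
import numpy as np

def poly_features(X, degree=2):
    """Illustrative feature map Phi: concatenate x, x^2, ..., x^degree coordinate-wise."""
    return np.hstack([X ** d for d in range(1, degree + 1)])

# The generalized linear model reuses the same closed form, applied to the feature matrix:
#   w = regularized_least_squares(poly_features(X, degree=2), y, lam)
```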

  22. From Parametric to Nonparametric Models. How about nonparametric models? Math: some simple linear algebra shows that
     $w^T = Y X^T (X X^T)^{-1} = Y (X^T X)^{-1} X^T = C X^T$, since $X^T (X X^T)^{-1} = (X^T X)^{-1} X^T$.
     Then $f(x) = w^T x = C X^T x = \sum_{i=1}^{n} c_i \, x_i^T x$.
     We can compute $C_n$ or $w_n$ depending on whether $n \leq p$. The above result is the most basic form of the Representer Theorem.

  23. From Linear to Nonparametric Models. Math: note that
     $f(x) = \sum_{j=1}^{p} w_j x_j = \sum_{i=1}^{n} c_i \underbrace{x_i^T x}_{\sum_{j=1}^{p} x_{ij} x_j}$
     We can now consider a truly nonparametric model
     $f(x) = \sum_{j \geq 1} w_j \Phi(x)_j = \sum_{i=1}^{n} c_i \underbrace{K(x, x_i)}_{\sum_{j \geq 1} \Phi(x_i)_j \Phi(x)_j}$

  24. From Linear to Nonparametric Models (cont.). Math: for the truly nonparametric model
     $f(x) = \sum_{j \geq 1} w_j \Phi(x)_j = \sum_{i=1}^{n} c_i K(x, x_i)$
     we have
     $C_n = (X_n X_n^T + \lambda n I)^{-1} Y_n$, with $(X_n X_n^T)_{i,j} = x_i^T x_j$
     $C_n = (K_n + \lambda n I)^{-1} Y_n$, with $(K_n)_{i,j} = K(x_i, x_j)$
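
A minimal sketch of computing and using the coefficients $C_n = (K_n + \lambda n I)^{-1} Y_n$; the kernel matrix K (and K_new for test points) would be built with any of the kernels listed on the next slide.

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve c = (K_n + lambda * n * I)^(-1) y, where (K_n)_ij = K(x_i, x_j)."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(K_new, c):
    """f(x) = sum_i K(x, x_i) c_i; K_new[a, i] = K(x_new_a, x_i)."""
    return K_new @ c
```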

  25. Kernels
     • Linear kernel: $K(x, x') = x^T x'$
     • Gaussian kernel: $K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}$, $\sigma > 0$
     • Polynomial kernel: $K(x, x') = (x^T x' + 1)^d$, $d \in \mathbb{N}$
     • Inner product kernel/features: $K(x, x') = \sum_{j=1}^{p} \Phi(x)_j \Phi(x')_j$, $\Phi : X \to \mathbb{R}^p$
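
One way the kernels above might be written in NumPy, vectorized over the rows of two data matrices (the defaults for sigma and d are arbitrary):

```python
import numpy as np

def linear_kernel(X1, X2):
    """K(x, x') = x^T x'."""
    return X1 @ X2.T

def gaussian_kernel(X1, X2, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / sigma**2)

def polynomial_kernel(X1, X2, d=3):
    """K(x, x') = (x^T x' + 1)^d."""
    return (X1 @ X2.T + 1.0) ** d
```

These plug directly into the kernel ridge sketch above, e.g. c = kernel_ridge_fit(gaussian_kernel(X, X), y, lam).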

  26. Reproducing Kernel Hilbert Spaces. Math: given $K$, there exists a unique Hilbert space of functions $(H, \langle \cdot, \cdot \rangle)$ such that
     • $K_x := K(x, \cdot) \in H$ for all $x \in X$, and
     • $f(x) = \langle f, K_x \rangle$ for all $x \in X$, $f \in H$.
     The norm of a function $f(x) = \sum_{i=1}^{n} K(x, x_i) c_i$ is given by $\|f\|^2 = \sum_{i,j=1}^{n} K(x_j, x_i) c_i c_j$ and is a natural complexity measure.
     Note: an RKHS is equivalently defined as a Hilbert space where the evaluation functionals are continuous.
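
For a function in the span of the kernel sections, the RKHS norm above reduces to a quadratic form in the coefficients; a one-function sketch:

```python
import numpy as np

def rkhs_norm_sq(K, c):
    """||f||^2 = sum_{i,j} K(x_j, x_i) c_i c_j = c^T K c, for f(x) = sum_i K(x, x_i) c_i."""
    return float(c @ K @ c)
```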

  27. Extensions: Other Loss Functions. For most loss functions the solution of Tikhonov regularization is of the form $f(x) = \sum_{i=1}^{n} K(x, x_i) c_i$.
     • $V(f(x), y) = (f(x) - y)^2$: regularized least squares (RLS)
     • $V(f(x), y) = (|f(x) - y| - \epsilon)_+$: SVM regression
     • $V(f(x), y) = (1 - yf(x))_+$: SVM classification
     • $V(f(x), y) = \log(1 + e^{-yf(x)})$: logistic regression
     • $V(f(x), y) = e^{-yf(x)}$: boosting

  28. Extensions: Other Loss Functions (cont.). By changing the loss function we change the way we compute the coefficients in the expansion $f(x) = \sum_{i=1}^{n} K(x, x_i) c_i$.
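
For losses other than the square loss there is in general no closed form for the coefficients; a hedged sketch (our own, with arbitrary step size and iteration count) of computing c for the logistic loss plus the RKHS penalty $\lambda c^T K c$ by gradient descent:

```python
import numpy as np

def kernel_logistic_coeffs(K, y, lam, lr=0.1, n_iters=1000):
    """Gradient descent on (1/n) * sum_i log(1 + exp(-y_i f(x_i))) + lam * c^T K c,
    where f(x_i) = (K c)_i and y_i in {-1, +1}."""
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(n_iters):
        margins = y * (K @ c)                     # y_i * f(x_i)
        dloss_df = -y / (1.0 + np.exp(margins))   # derivative of the logistic loss wrt f(x_i)
        grad = (K @ dloss_df) / n + 2.0 * lam * (K @ c)
        c -= lr * grad
    return c
```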

  29. Regularization avoids overfitting and ensures stability of the solution and generalization. There are many different instances of regularization beyond Tikhonov, e.g. early stopping.
     $\min_f \; \underbrace{I_S[f]}_{\text{data fit term}} + \lambda \underbrace{R(f)}_{\text{complexity/smoothness term}}$

  30. • Regularization ensures stability of the solution and generalization • There are different instances of regularization beyond Tikhonov, e.g. early stopping

  31. Conclusions • Regularization Theory provides results and techniques to avoid overfitting (stability is key to generalization) • Regularization provides a core set of concepts and techniques to solve a variety of problems • Most algorithms can be seen as a form of regularization

  32. Hebbian mechanisms can be used for biological supervised learning (Knudsen, 1990)
