RegML 2020, Class 2: Tikhonov regularization and kernels
Lorenzo Rosasco (UNIGE-MIT-IIT)

Learning problem

Problem: for $\mathcal{H} \subset \{ f \mid f : X \to Y \}$, solve
$\min_{f \in \mathcal{H}} \mathcal{E}(f), \qquad \mathcal{E}(f) = \int L(f(x), y)\, d\rho(x, y),$
given $S_n = (x_i, y_i)_{i=1}^n$ ($\rho$ fixed, unknown).

Empirical Risk Minimization (ERM)

$\min_{f \in \mathcal{H}} \mathcal{E}(f) \;\mapsto\; \min_{f \in \mathcal{H}} \widehat{\mathcal{E}}(f),$
$\widehat{\mathcal{E}}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i),$
a proxy for $\mathcal{E}$.

From ERM to regularization

ERM can be a bad idea if $n$ is "small" and $\mathcal{H}$ is "big".

Regularization:
$\min_{f \in \mathcal{H}} \widehat{\mathcal{E}}(f) \;\mapsto\; \min_{f \in \mathcal{H}} \widehat{\mathcal{E}}(f) + \lambda R(f),$
with $R$ the regularizer and $\lambda$ the regularization parameter.

Examples of regularizers

Let $f(x) = \sum_{j=1}^p \phi_j(x) w_j$.

◮ $\ell_2$: $R(f) = \|w\|^2 = \sum_{j=1}^p |w_j|^2$
◮ $\ell_1$: $R(f) = \|w\|_1 = \sum_{j=1}^p |w_j|$
◮ Differential operators: $R(f) = \int_X \|\nabla f(x)\|^2 \, d\rho(x)$
◮ ...

From statistics to optimization

Problem: solve
$\min_{w \in \mathbb{R}^p} \widehat{\mathcal{E}}(w) + \lambda \|w\|^2$
with
$\widehat{\mathcal{E}}(w) = \frac{1}{n} \sum_{i=1}^n L(w^\top x_i, y_i).$

Minimization

$\min_w \widehat{\mathcal{E}}(w) + \lambda \|w\|^2$

◮ Strongly convex functional
◮ Computations depend on the considered loss function

Logistic regression

$\widehat{\mathcal{E}}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2$
(smooth and strongly convex)

$\nabla \widehat{\mathcal{E}}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w$

Gradient descent

Let $F : \mathbb{R}^d \to \mathbb{R}$ be differentiable, (strictly) convex and such that
$\|\nabla F(w) - \nabla F(w')\| \le L \|w - w'\|$
(e.g. $\sup_w \|H(w)\| \le L$, with $H$ the Hessian). Then
$w_0 = 0, \qquad w_{t+1} = w_t - \frac{1}{L} \nabla F(w_t)$
converges to the minimizer of $F$.

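As a concrete reference, a minimal NumPy sketch of this iteration, assuming the caller supplies the gradient function grad_F and the Lipschitz constant L (both names are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_F, L, d, T=1000):
    """Run w_{t+1} = w_t - (1/L) * grad_F(w_t), starting from w_0 = 0."""
    w = np.zeros(d)
    for _ in range(T):
        w = w - grad_F(w) / L
    return w
```
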
Gradient descent for LR

$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2$

Consider
$w_{t+1} = w_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right)$

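A possible NumPy sketch of this update; the Lipschitz bound used for the step size is an assumed rough estimate, not prescribed by the slides:

```python
import numpy as np

def gd_logistic(X, y, lam, T=1000):
    """Gradient descent for regularized logistic regression.

    X: (n, p) data matrix, y: (n,) labels in {-1, +1}.
    """
    n, p = X.shape
    # Assumed Lipschitz bound for the gradient (illustrative choice).
    L = 0.25 * np.linalg.norm(X, ord=2) ** 2 / n + 2 * lam
    w = np.zeros(p)
    for _ in range(T):
        margins = y * (X @ w)                       # y_i w^T x_i
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
        w = w - grad / L
    return w
```
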
Complexity

Logistic: $O(ndT)$
($n$ number of examples, $d$ dimensionality, $T$ number of steps)

What if $n \ll d$? Can we get better complexities?

Representer theorems

Idea: show that
$f(x) = w^\top x = \sum_{i=1}^n x_i^\top x \, c_i, \qquad c_i \in \mathbb{R},$
i.e. $w = \sum_{i=1}^n x_i c_i$.
Compute $c = (c_1, \dots, c_n) \in \mathbb{R}^n$ rather than $w \in \mathbb{R}^d$.

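A quick numerical check of this identity on synthetic data (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))   # n = 5 examples, d = 100 features
c = rng.standard_normal(5)          # dual coefficients
w = X.T @ c                         # w = sum_i x_i c_i

x_new = rng.standard_normal(100)
# Primal prediction w^T x equals dual prediction sum_i c_i x_i^T x.
assert np.isclose(w @ x_new, c @ (X @ x_new))
```
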
Representer theorem for GD & LR

By induction
$c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$
with $e_i$ the $i$-th element of the canonical basis and
$f_t(x) = \sum_{i=1}^n x^\top x_i \,(c_t)_i.$

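The same iteration can be sketched in the dual variables over a precomputed Gram matrix; the function name and the step-size constant L supplied by the caller are illustrative assumptions:

```python
import numpy as np

def gd_logistic_dual(X, y, lam, L, T=1000):
    """Dual gradient descent for regularized logistic regression.

    Updates c in R^n; predictions are f(x) = sum_i (x^T x_i) c_i.
    """
    n = X.shape[0]
    K = X @ X.T                     # Gram matrix, computed once in O(n^2 d)
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                   # f_t(x_i) for all training points, O(n^2) per step
        grad = -(y / (1 + np.exp(y * f))) / n + 2 * lam * c
        c = c - grad / L
    return c
```
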
Non-linear features

$f(x) = \sum_{i=1}^d w_i x^i \;\mapsto\; f(x) = \sum_{i=1}^p w_i \phi_i(x)$

Model: $\Phi(x) = (\phi_1(x), \dots, \phi_p(x))$.
Computations: the same, up to the change $x \mapsto \Phi(x)$.

Representer theorem with non-linear features

$f(x) = \sum_{i=1}^n x_i^\top x \, c_i \;\mapsto\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \, c_i$

Rewriting logistic regression and gradient descent

By induction
$c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$
with $e_i$ the $i$-th element of the canonical basis and
$f_t(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \,(c_t)_i.$

Hinge loss and support vector machines

$\widehat{\mathcal{E}}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$
(non-smooth and strongly convex)

Consider the "left" derivative:
$w_{t+1} = w_t - \frac{1}{L\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right)$
$S_i(w) = \begin{cases} -y_i x_i & \text{if } y_i w^\top x_i \le 1 \\ 0 & \text{otherwise} \end{cases}, \qquad L = \sup_{x \in X} \|x\| + 2\lambda.$

Subgradient

Let $F : \mathbb{R}^p \to \mathbb{R}$ be convex. The subgradient $\partial F(w_0)$ is the set of vectors $v \in \mathbb{R}^p$ such that, for every $w \in \mathbb{R}^p$,
$F(w) - F(w_0) \ge (w - w_0)^\top v.$
In one dimension, $\partial F(w_0) = [F'_-(w_0), F'_+(w_0)]$.

Subgradient method

Let $F : \mathbb{R}^p \to \mathbb{R}$ be strictly convex, with bounded subdifferential, and let $\gamma_t = 1/t$. Then
$w_{t+1} = w_t - \gamma_t v_t, \qquad v_t \in \partial F(w_t),$
converges to the minimizer of $F$.

Subgradient method for SVM

$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$

Consider
$w_{t+1} = w_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right)$
$S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \le 1 \\ 0 & \text{otherwise} \end{cases}$

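A minimal NumPy sketch of this subgradient iteration (function name and defaults are illustrative):

```python
import numpy as np

def subgradient_svm(X, y, lam, T=1000):
    """Subgradient method for the regularized hinge loss (primal SVM), step size 1/t."""
    n, p = X.shape
    w = np.zeros(p)
    for t in range(1, T + 1):
        margins = y * (X @ w)
        active = margins <= 1                          # examples in the hinge region
        subgrad = -(X[active] * y[active, None]).sum(axis=0) / n + 2 * lam * w
        w = w - subgrad / t
    return w
```
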
Representer theorem for SVM

By induction
$c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$
with $e_i$ the $i$-th element of the canonical basis,
$f_t(x) = \sum_{i=1}^n x^\top x_i \,(c_t)_i,$
and
$S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise.} \end{cases}$

Rewriting SVM by subgradient

By induction
$c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$
with $e_i$ the $i$-th element of the canonical basis,
$f_t(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \,(c_t)_i,$
and
$S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise.} \end{cases}$

Optimality condition for SVM

Smooth convex: $\nabla F(w_*) = 0$. Non-smooth convex: $0 \in \partial F(w_*)$.

Here
$0 \in \partial\!\left( \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ \right) + 2\lambda w \;\Leftrightarrow\; w \in -\frac{1}{2\lambda}\, \partial\!\left( \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ \right).$

Optimality condition for SVM (cont.)

The optimality condition can be rewritten as
$0 = \frac{1}{n} \sum_{i=1}^n (-y_i x_i c_i) + 2\lambda w \;\Rightarrow\; w = \sum_{i=1}^n x_i \frac{y_i c_i}{2\lambda n},$
where $c_i = c_i(w) \in [V'_-(-y_i w^\top x_i), V'_+(-y_i w^\top x_i)]$, the left and right derivatives of the hinge loss written as $V(a) = |1 + a|_+$. A direct computation gives
$c_i = 1$ if $y_i f(x_i) < 1$
$0 \le c_i \le 1$ if $y_i f(x_i) = 1$
$c_i = 0$ if $y_i f(x_i) > 1$

Support vectors

$c_i = 1$ if $y_i f(x_i) < 1$
$0 \le c_i \le 1$ if $y_i f(x_i) = 1$
$c_i = 0$ if $y_i f(x_i) > 1$

The support vectors are the training points with $c_i \neq 0$.

Complexity

Without representer: Logistic $O(ndT)$, SVM $O(ndT)$.
With representer: Logistic $O(n^2(d + T))$, SVM $O(n^2(d + T))$.

$n$ number of examples, $d$ dimensionality, $T$ number of steps.

Are loss functions all the same?

$\min_w \widehat{\mathcal{E}}(w) + \lambda \|w\|^2$

◮ each loss has a different target function...
◮ ...and different computations

The choice of the loss is problem dependent.

So far

◮ regularization by penalization
◮ iterative optimization
◮ linear/non-linear parametric models

What about nonparametric models?

From features to kernels

$f(x) = \sum_{i=1}^n x_i^\top x \, c_i \;\mapsto\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \, c_i$

Kernels: $\Phi(x)^\top \Phi(x') \;\mapsto\; K(x, x')$
$f(x) = \sum_{i=1}^n K(x_i, x) \, c_i$

LR and SVM with kernels

As before.
LR: $c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$
SVM: $c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$
But now
$f_t(x) = \sum_{i=1}^n K(x, x_i) \,(c_t)_i.$

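A compact sketch of the two kernelized iterations, written over a precomputed n x n kernel matrix K; the function names and the step-size argument L are illustrative assumptions, not part of the slides:

```python
import numpy as np

def kernel_logistic(K, y, lam, L, T=1000):
    """Kernelized gradient descent for logistic regression; K is the n x n kernel matrix."""
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                   # f_t(x_i) for all training points
        c = c - (-(y / (1 + np.exp(y * f))) / n + 2 * lam * c) / L
    return c

def kernel_svm(K, y, lam, T=1000):
    """Kernelized subgradient method for the SVM; step size 1/t."""
    n = K.shape[0]
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = K @ c
        s = np.where(y * f < 1, -y, 0.0)            # subgradient coordinates S_i(c_t)
        c = c - (s / n + 2 * lam * c) / t
    return c
```
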
Examples of kernels

◮ Linear: $K(x, x') = x^\top x'$
◮ Polynomial: $K(x, x') = (1 + x^\top x')^p$, with $p \in \mathbb{N}$
◮ Gaussian: $K(x, x') = e^{-\gamma \|x - x'\|^2}$, with $\gamma > 0$

$f(x) = \sum_{i=1}^n c_i K(x_i, x)$

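For illustration, the three kernels above can be evaluated on data matrices X (n x d) and Z (m x d) as in the sketch below; function names and default parameters are illustrative:

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, p=2):
    return (1 + X @ Z.T) ** p

def gaussian_kernel(X, Z, gamma=1.0):
    # Pairwise squared distances via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z.
    sq = (X ** 2).sum(1)[:, None] + (Z ** 2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)
```
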
Kernel engineering

Kernels for
◮ strings,
◮ graphs,
◮ histograms,
◮ sets,
◮ ...

What is a kernel?

$K(x, x')$
◮ Similarity measure
◮ Inner product
◮ Positive definite function

Positive definite function

$K : X \times X \to \mathbb{R}$ is positive definite when, for any $n \in \mathbb{N}$ and $x_1, \dots, x_n \in X$, the matrix $K_n \in \mathbb{R}^{n \times n}$ with $(K_n)_{ij} = K(x_i, x_j)$ is positive semidefinite (eigenvalues $\ge 0$).

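A simple numerical check of this property: build a kernel matrix on sample points and verify that its eigenvalues are non-negative up to tolerance (names, data, and the Gaussian kernel choice are illustrative):

```python
import numpy as np

def is_positive_semidefinite(Kn, tol=1e-10):
    """Check that a symmetric kernel matrix has eigenvalues >= 0, up to tolerance."""
    eigvals = np.linalg.eigvalsh(Kn)
    return bool(eigvals.min() >= -tol)

X = np.random.default_rng(0).standard_normal((20, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
print(is_positive_semidefinite(np.exp(-0.5 * sq)))      # Gaussian kernel matrix -> True
```
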
PD functions and RKHS

Each PD kernel defines a function space called a reproducing kernel Hilbert space (RKHS):
$\mathcal{H} = \mathrm{span}\{ K(\cdot, x) \mid x \in X \}.$

Nonparametrics and kernels

The number of parameters is automatically determined by the number of points:
$f(x) = \sum_{i=1}^n K(x_i, x) \, c_i.$
Compare to
$f(x) = \sum_{j=1}^p \phi_j(x) \, w_j.$

This class

◮ Learning and regularization: logistic regression and SVM
◮ Optimization with first-order methods
◮ Linear and non-linear parametric models
◮ Non-parametric models and kernels

Next class

Beyond penalization: regularization by
◮ subsampling
◮ stochastic projection