RegML 2020, Class 2: Tikhonov regularization and kernels


1. RegML 2020, Class 2: Tikhonov regularization and kernels. Lorenzo Rosasco, UNIGE-MIT-IIT

2. Learning problem
Problem: for $H \subset \{ f \mid f : X \to Y \}$, solve
$$\min_{f \in H} E(f), \qquad E(f) = \int L(f(x), y)\, d\rho(x, y),$$
given $S_n = (x_i, y_i)_{i=1}^n$ ($\rho$ fixed, unknown).

3. Empirical Risk Minimization (ERM)
$$\min_{f \in H} E(f) \;\mapsto\; \min_{f \in H} \hat{E}(f), \qquad \hat{E}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i),$$
a proxy to $E$.

4. From ERM to regularization
ERM can be a bad idea if $n$ is "small" and $H$ is "big".
Regularization:
$$\min_{f \in H} \hat{E}(f) \;\mapsto\; \min_{f \in H} \hat{E}(f) + \lambda R(f),$$
with $R$ the regularization term and $\lambda$ the regularization parameter.

5. Examples of regularizers
Let $f(x) = \sum_{j=1}^p \varphi_j(x) w_j$.
- $\ell_2$: $R(f) = \|w\|^2 = \sum_{j=1}^p |w_j|^2$
- $\ell_1$: $R(f) = \|w\|_1 = \sum_{j=1}^p |w_j|$
- Differential operators: $R(f) = \int_X \|\nabla f(x)\|^2 \, d\rho(x)$
- ...

6. From statistics to optimization
Problem: solve
$$\min_{w \in \mathbb{R}^p} \hat{E}(w) + \lambda \|w\|^2, \qquad \hat{E}(w) = \frac{1}{n} \sum_{i=1}^n L(w^\top x_i, y_i).$$

7. Minimization
$$\min_w \hat{E}(w) + \lambda \|w\|^2$$
- Strongly convex functional
- Computations depend on the considered loss function

8. Logistic regression
$$\hat{E}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n \log\!\left(1 + e^{-y_i w^\top x_i}\right) + \lambda \|w\|^2 \qquad \text{(smooth and strongly convex)}$$
$$\nabla \hat{E}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w$$

9. Gradient descent
Let $F : \mathbb{R}^d \to \mathbb{R}$ be differentiable, (strictly) convex and such that
$$\|\nabla F(w) - \nabla F(w')\| \le L \|w - w'\|$$
(e.g. $\sup_w \|H(w)\| \le L$, with $H$ the Hessian). Then
$$w_0 = 0, \qquad w_{t+1} = w_t - \frac{1}{L} \nabla F(w_t)$$
converges to the minimizer of $F$.
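A minimal sketch of this fixed-step scheme in Python, assuming the gradient is supplied as a callable (the names grad_F, d, and T are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_F, L, d, T=1000):
    """Fixed-step gradient descent: w_{t+1} = w_t - (1/L) grad_F(w_t), starting from w_0 = 0."""
    w = np.zeros(d)
    for _ in range(T):
        w = w - (1.0 / L) * grad_F(w)
    return w
```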

10. Gradient descent for LR
$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \log\!\left(1 + e^{-y_i w^\top x_i}\right) + \lambda \|w\|^2$$
Consider
$$w_{t+1} = w_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right).$$
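A sketch of this primal iteration, assuming a data matrix X of shape (n, d), labels y in {-1, +1}, and a user-chosen Lipschitz bound L (all names are illustrative):

```python
import numpy as np

def logistic_gd(X, y, lam, L, T=1000):
    """Gradient descent on the l2-regularized logistic loss, primal variables w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                                   # y_i w^T x_i for all i
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + 2 * lam * w
        w = w - grad / L
    return w
```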

11. Complexity
Logistic: $O(ndT)$, with $n$ the number of examples, $d$ the dimensionality, $T$ the number of steps.
What if $n \ll d$? Can we get better complexities?

12. Representer theorems
Idea: show that
$$f(x) = w^\top x = \sum_{i=1}^n x_i^\top x \, c_i, \qquad c_i \in \mathbb{R},$$
i.e. $w = \sum_{i=1}^n x_i c_i$. Compute $c = (c_1, \dots, c_n) \in \mathbb{R}^n$ rather than $w \in \mathbb{R}^d$.

13. Representer theorem for GD & LR
By induction,
$$c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),$$
with $e_i$ the $i$-th element of the canonical basis and
$$f_t(x) = \sum_{i=1}^n x^\top x_i \, (c_t)_i.$$
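A sketch of the same iteration in the coefficients c, assuming a precomputed Gram matrix G with G[i, j] = x_i^T x_j (function and argument names are illustrative):

```python
import numpy as np

def logistic_gd_dual(G, y, lam, L, T=1000):
    """Gradient descent for regularized logistic regression in representer form.

    G is the n x n Gram matrix (later: a kernel matrix)."""
    n = G.shape[0]
    c = np.zeros(n)
    for _ in range(T):
        f = G @ c                                   # f_t(x_i) for all training points
        grad = -(y / (1 + np.exp(y * f))) / n + 2 * lam * c
        c = c - grad / L
    return c
```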

14. Non-linear features
$$f(x) = \sum_{j=1}^d w_j x^j \;\mapsto\; f(x) = \sum_{j=1}^p w_j \varphi_j(x),$$
where $x^j$ denotes the $j$-th component of $x$.
Model: $\Phi(x) = (\varphi_1(x), \dots, \varphi_p(x))$.

15. Non-linear features
$$f(x) = \sum_{j=1}^d w_j x^j \;\mapsto\; f(x) = \sum_{j=1}^p w_j \varphi_j(x), \qquad \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x)).$$
Computations: the same, up to the change $x \mapsto \Phi(x)$.

16. Representer theorem with non-linear features
$$f(x) = \sum_{i=1}^n x_i^\top x \, c_i \;\mapsto\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \, c_i$$

17. Rewriting logistic regression and gradient descent
By induction,
$$c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),$$
with $e_i$ the $i$-th element of the canonical basis and
$$f_t(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \, (c_t)_i.$$

18. Hinge loss and support vector machines
$$\hat{E}_\lambda(w) = \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2 \qquad \text{(non-smooth and strongly convex)}$$
Consider the "left" derivative:
$$w_{t+1} = w_t - \frac{1}{\sqrt{t}\, L} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right),$$
$$S_i(w) = \begin{cases} -y_i x_i & \text{if } y_i w^\top x_i \le 1 \\ 0 & \text{otherwise} \end{cases}, \qquad L = \sup_{x \in X} \|x\| + 2\lambda.$$

19. Subgradient
Let $F : \mathbb{R}^p \to \mathbb{R}$ be convex. The subgradient (subdifferential) $\partial F(w_0)$ is the set of vectors $v \in \mathbb{R}^p$ such that, for every $w \in \mathbb{R}^p$,
$$F(w) - F(w_0) \ge (w - w_0)^\top v.$$
In one dimension, $\partial F(w_0) = [F'_-(w_0), F'_+(w_0)]$.
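As a quick worked instance (not on the original slide), applying the one-dimensional formula to the scalar hinge $V(a) = |1 - a|_+$ gives
$$\partial V(a) = \begin{cases} \{-1\} & a < 1, \\ [-1, 0] & a = 1, \\ \{0\} & a > 1, \end{cases}$$
which is the kind of interval that reappears in the SVM optimality conditions later in the class.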

20. Subgradient method
Let $F : \mathbb{R}^p \to \mathbb{R}$ be strictly convex, with bounded subdifferential, and $\gamma_t = 1/t$; then
$$w_{t+1} = w_t - \gamma_t v_t, \qquad v_t \in \partial F(w_t),$$
converges to the minimizer of $F$.

21. Subgradient method for SVM
$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2$$
Consider
$$w_{t+1} = w_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(w_t) + 2\lambda w_t \right), \qquad S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \le 1 \\ 0 & \text{otherwise} \end{cases}.$$
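A sketch of this primal subgradient iteration, assuming X of shape (n, d) and labels y in {-1, +1}; the step size 1/t follows the slide and the names are illustrative:

```python
import numpy as np

def svm_subgradient(X, y, lam, T=1000):
    """Subgradient method for the l2-regularized hinge loss, primal variables w."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        active = (y * (X @ w) <= 1)                 # examples with y_i w^T x_i <= 1
        subgrad = -(X[active] * y[active][:, None]).sum(axis=0) / n + 2 * lam * w
        w = w - subgrad / t
    return w
```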

22. Representer theorem for SVM
By induction,
$$c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right),$$
with $e_i$ the $i$-th element of the canonical basis,
$$f_t(x) = \sum_{i=1}^n x^\top x_i \, (c_t)_i, \qquad S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise} \end{cases}.$$

23. Rewriting SVM by subgradient
By induction,
$$c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right),$$
with $e_i$ the $i$-th element of the canonical basis,
$$f_t(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \, (c_t)_i, \qquad S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise} \end{cases}.$$
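A sketch of the same iteration in the coefficients c, assuming a precomputed Gram matrix G with G[i, j] = Φ(x_i)^T Φ(x_j) (names are illustrative):

```python
import numpy as np

def svm_subgradient_dual(G, y, lam, T=1000):
    """Subgradient method for the regularized hinge loss in representer form."""
    n = G.shape[0]
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = G @ c                                   # f_t(x_i) for all training points
        active = (y * f < 1)
        subgrad = np.where(active, -y, 0.0) / n + 2 * lam * c
        c = c - subgrad / t
    return c
```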

24. Optimality condition for SVM
Smooth convex: $\nabla F(w^*) = 0$. Non-smooth convex: $0 \in \partial F(w^*)$.
For the SVM objective,
$$0 \in \frac{1}{n} \sum_{i=1}^n \partial_w |1 - y_i w^\top x_i|_+ + 2\lambda w \;\Longleftrightarrow\; w \in -\frac{1}{2\lambda}\, \frac{1}{n} \sum_{i=1}^n \partial_w |1 - y_i w^\top x_i|_+.$$

25. Optimality condition for SVM (cont.)
The optimality condition can be rewritten as
$$0 = \frac{1}{n} \sum_{i=1}^n (-y_i x_i c_i) + 2\lambda w \;\Longrightarrow\; w = \sum_{i=1}^n x_i \left( \frac{y_i c_i}{2\lambda n} \right),$$
where $c_i = c_i(w) \in [V_-(-y_i w^\top x_i), V_+(-y_i w^\top x_i)]$. A direct computation gives
$$c_i = 1 \;\text{ if } y_i f(x_i) < 1, \qquad 0 \le c_i \le 1 \;\text{ if } y_i f(x_i) = 1, \qquad c_i = 0 \;\text{ if } y_i f(x_i) > 1.$$

26. Support vectors
$$c_i = 1 \;\text{ if } y_i f(x_i) < 1, \qquad 0 \le c_i \le 1 \;\text{ if } y_i f(x_i) = 1, \qquad c_i = 0 \;\text{ if } y_i f(x_i) > 1.$$
The training points with $c_i \neq 0$ are the support vectors.

27. Complexity
Without representer theorem: Logistic $O(ndT)$, SVM $O(ndT)$.
With representer theorem: Logistic $O(n^2(d + T))$, SVM $O(n^2(d + T))$.
$n$ number of examples, $d$ dimensionality, $T$ number of steps.

28. Are loss functions all the same?
$$\min_w \hat{E}(w) + \lambda \|w\|^2$$
- Each loss has a different target function...
- ...and different computations.
The choice of the loss is problem dependent.

29. So far
- Regularization by penalization
- Iterative optimization
- Linear/non-linear parametric models
What about nonparametric models?

30. From features to kernels
$$f(x) = \sum_{i=1}^n x_i^\top x \, c_i \;\mapsto\; f(x) = \sum_{i=1}^n \Phi(x_i)^\top \Phi(x) \, c_i$$
Kernels: $\Phi(x)^\top \Phi(x') \mapsto K(x, x')$, so that
$$f(x) = \sum_{i=1}^n K(x_i, x) \, c_i.$$

31. LR and SVM with kernels
As before:
LR: $\displaystyle c_{t+1} = c_t - \frac{1}{L} \left( -\frac{1}{n} \sum_{i=1}^n \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right)$
SVM: $\displaystyle c_{t+1} = c_t - \frac{1}{t} \left( \frac{1}{n} \sum_{i=1}^n S_i(c_t) + 2\lambda c_t \right)$
But now
$$f_t(x) = \sum_{i=1}^n K(x, x_i) \, (c_t)_i.$$
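A hedged sketch of how the kernelized iterations are used end to end, reusing the representer-form solvers sketched earlier with a kernel matrix K in place of the Gram matrix; kernel stands for any function K(x, x') applied pairwise, and all names are illustrative:

```python
import numpy as np

def kernel_matrix(kernel, A, B):
    """Pairwise kernel evaluations: K[i, j] = kernel(A[i], B[j])."""
    return np.array([[kernel(a, b) for b in B] for a in A])

def kernel_predict(kernel, X_train, c, X_test):
    """f(x) = sum_i K(x, x_i) c_i, evaluated at each test point."""
    return kernel_matrix(kernel, X_test, X_train) @ c

# Usage sketch:
# K = kernel_matrix(kernel, X_train, X_train)
# c = svm_subgradient_dual(K, y_train, lam=0.1, T=1000)   # or logistic_gd_dual(...)
# scores = kernel_predict(kernel, X_train, c, X_test)
```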

32. Examples of kernels
- Linear: $K(x, x') = x^\top x'$
- Polynomial: $K(x, x') = (1 + x^\top x')^p$, with $p \in \mathbb{N}$
- Gaussian: $K(x, x') = e^{-\gamma \|x - x'\|^2}$, with $\gamma > 0$
$$f(x) = \sum_{i=1}^n c_i K(x_i, x)$$
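Minimal implementations of the three kernels above, as plain functions of two vectors; the parameter names p and gamma mirror the slide, the default values are only illustrative:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1 + x @ xp) ** p

def gaussian_kernel(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))
```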

33. Kernel engineering
Kernels for
- strings,
- graphs,
- histograms,
- sets,
- ...

34. What is a kernel?
$K(x, x')$ is
- a similarity measure,
- an inner product,
- a positive definite function.

35. Positive definite function
$K : X \times X \to \mathbb{R}$ is positive definite when, for any $n \in \mathbb{N}$ and $x_1, \dots, x_n \in X$, the matrix $K_n \in \mathbb{R}^{n \times n}$ with $(K_n)_{ij} = K(x_i, x_j)$ is positive semidefinite (all eigenvalues $\ge 0$).
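A quick numerical check of this definition on a finite sample (a sketch, not a proof: it assumes a symmetric kernel and uses a small eigenvalue tolerance; names are illustrative):

```python
import numpy as np

def is_positive_semidefinite(kernel, points, tol=1e-10):
    """Build K_n on the given sample and check that all its eigenvalues are >= -tol."""
    K = np.array([[kernel(a, b) for b in points] for a in points])
    eigenvalues = np.linalg.eigvalsh(K)             # kernel assumed symmetric
    return bool(np.all(eigenvalues >= -tol))

# e.g. is_positive_semidefinite(gaussian_kernel, list_of_sample_points)
```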

36. PD functions and RKHS
Each PD kernel defines a function space called a reproducing kernel Hilbert space (RKHS):
$$H = \mathrm{span}\{ K(\cdot, x) \mid x \in X \}.$$

37. Nonparametrics and kernels
The number of parameters is automatically determined by the number of points:
$$f(x) = \sum_{i=1}^n K(x_i, x) \, c_i.$$
Compare to
$$f(x) = \sum_{j=1}^p \varphi_j(x) \, w_j.$$

38. This class
- Learning and regularization: logistic regression and SVM
- Optimization with first-order methods
- Linear and non-linear parametric models
- Non-parametric models and kernels

39. Next class
Beyond penalization: regularization by
- subsampling
- stochastic projection
