Duality in vv-RKHSs with Infinite Dimensional Outputs: Application to Robust Losses
Pierre Laforgue, Alex Lambert, Luc Brogat-Motte, Florence d'Alché-Buc
LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Outline
Motivations
A duality theory for general OVKs
Robust losses as convolutions
Experiments
Conclusion
Motivation 1: structured prediction by surrogate approach
Kernel trick in the input space, kernel trick in the output space [Cortes '05, Geurts '06, Brouard '11, Kadri '13, Brouard '16]: Input Output Kernel Regression (IOKR).
\[
\hat h = \operatorname*{argmin}_{h \in \mathcal{H}_{\mathcal{K}}} \; \frac{1}{2n} \sum_{i=1}^{n} \big\| \varphi(y_i) - h(x_i) \big\|_{\mathcal{F}_{\mathcal{Y}}}^2 + \frac{\Lambda}{2} \| h \|_{\mathcal{H}_{\mathcal{K}}}^2,
\qquad
g(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \; \big\| \varphi(y) - \hat h(x) \big\|_{\mathcal{F}_{\mathcal{Y}}}^2.
\]
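To fix ideas, here is a minimal numpy sketch of ridge-IOKR: train in the output feature space via the kernel trick, then decode with a pre-image search. The Gaussian kernels, the candidate-set decoding, and all names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def gauss(sigma):
    """Illustrative Gaussian kernel (an assumed choice, not from the paper)."""
    return lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)
                               / (2 * sigma ** 2))

def gram(k, U, V):
    return np.array([[k(u, v) for v in V] for u in U])

def fit_iokr(X, lam, k_x=gauss(1.0)):
    """Ridge-IOKR training: h(x) = sum_i c_i(x) phi(y_i), with weights
    c(x) = (K_x + n*lam*I)^{-1} k_x(x, .) -- phi(y_i) is never computed."""
    n = len(X)
    K_x = gram(k_x, X, X)
    return np.linalg.inv(K_x + n * lam * np.eye(n))

def decode(x_new, X, Y, Y_candidates, M, k_x=gauss(1.0), k_y=gauss(1.0)):
    """Pre-image step g(x) = argmin_y ||phi(y) - h(x)||^2, expanded with the
    output kernel trick so that only k_y evaluations are needed."""
    c = M @ gram(k_x, [x_new], X)[0]   # combination weights c(x)
    scores = [k_y(y, y) - 2 * c @ gram(k_y, Y, [y])[:, 0] for y in Y_candidates]
    return Y_candidates[int(np.argmin(scores))]
```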
Motivation 2: function to function regression
[Figure: sample EMG curves (millivolts) and lip acceleration curves (m/s²), both over 0–0.6 seconds.]
\[
\min_{h \in \mathcal{H}_{\mathcal{K}}} \; \frac{1}{2n} \sum_{i=1}^{n} \big\| y_i - h(x_i) \big\|_{L^2}^2 + \frac{\Lambda}{2} \| h \|_{\mathcal{H}_{\mathcal{K}}}^2 \qquad \text{[Kadri et al., 2016]}
\]
And many more! e.g. structured data autoencoding [Laforgue et al., 2019]:
\[
\min_{h_1, h_2 \in \mathcal{H}^1_{\mathcal{K}} \times \mathcal{H}^2_{\mathcal{K}}} \; \frac{1}{2n} \sum_{i=1}^{n} \big\| \varphi(x_i) - h_2 \circ h_1\big(\varphi(x_i)\big) \big\|_{\mathcal{F}_{\mathcal{X}}}^2 + \Lambda \, \mathrm{Reg}(h_1, h_2).
\]
Purpose of this work
Question: is it possible to extend the previous approaches to different (ideally robust) loss functions?
First answer: yes, extensions exist for maximum-margin regression [Brouard et al., 2016] and for ε-insensitive loss functions with matrix-valued kernels [Sangnier et al., 2017].
What about general Operator-Valued Kernels (OVKs)? What about other types of loss functions?
A duality theory for general OVKs
Learning in vector-valued RKHSs (vv-RKHSs)
• K : X × X → L(Y) is an operator-valued kernel: K(x, x') = K(x', x)*, and Σ_{i,j} ⟨y_i, K(x_i, x_j) y_j⟩_Y ≥ 0 for all finite families
• Unique vv-RKHS H_K ⊂ F(X, Y), with H_K = Span{ K(·, x) y : (x, y) ∈ X × Y }
• Ex: decomposable OVK K(x, x') = k(x, x') A, with k a scalar kernel and A p.s.d. on Y
• For {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n with Y a Hilbert space, we want to find
\[
\hat h \in \operatorname*{argmin}_{h \in \mathcal{H}_{\mathcal{K}}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big) + \frac{\Lambda}{2} \| h \|_{\mathcal{H}_{\mathcal{K}}}^2.
\]
Representer Theorem [Micchelli and Pontil, 2005]: there exist (α̂_i)_{i=1}^n ∈ Y^n (infinite dimensional!) such that
\[
\hat h(x) = \sum_{i=1}^{n} \mathcal{K}(x, x_i)\, \hat\alpha_i.
\]
When ℓ(·, ·) = ½‖· − ·‖²_Y and K = k · I_Y: α̂_i = Σ_j A_ij y_j, with A = (K + nΛ I_n)⁻¹ and K the scalar Gram matrix.
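To make the closed form concrete, here is a minimal numpy sketch for this decomposable case K = k · I_Y, assuming finite-dimensional outputs stacked as rows of a matrix (function names are illustrative):

```python
import numpy as np

def fit_vv_ridge(K, Y, lam):
    """Squared loss + K(x, x') = k(x, x') I_Y: the representer coefficients
    are alpha_i = sum_j A_ij y_j with A = (K + n*lam*I_n)^{-1}.
    K: (n, n) scalar Gram matrix; Y: (n, d) outputs stacked row-wise."""
    n = K.shape[0]
    A = np.linalg.inv(K + n * lam * np.eye(n))
    return A @ Y                      # (n, d): the alpha_i, stacked

def predict(k_new, alpha):
    """h(x) = sum_i k(x, x_i) alpha_i, with k_new = (k(x, x_1), ..., k(x, x_n))."""
    return k_new @ alpha
```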
Applying duality
\[
\hat h = \operatorname*{argmin}_{h \in \mathcal{H}_{\mathcal{K}}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_i\big(h(x_i)\big) + \frac{\Lambda}{2} \| h \|_{\mathcal{H}_{\mathcal{K}}}^2
\quad \text{is given by} \quad
\hat h = \frac{1}{\Lambda n} \sum_{i=1}^{n} \mathcal{K}(\cdot, x_i)\, \hat\alpha_i,
\]
with (α̂_i)_{i=1}^n ∈ Y^n the solutions to the dual problem:
\[
\min_{(\alpha_i)_{i=1}^n \in \mathcal{Y}^n} \; \sum_{i=1}^{n} \ell_i^{\star}(-\alpha_i) + \frac{1}{2 \Lambda n} \sum_{i,j=1}^{n} \langle \alpha_i, \mathcal{K}(x_i, x_j)\, \alpha_j \rangle_{\mathcal{Y}},
\]
with f⋆ : α ∈ Y ↦ sup_{y ∈ Y} ⟨α, y⟩_Y − f(y) the Fenchel–Legendre (FL) transform of f.
• 1st limitation: the FL transform ℓ⋆ needs to be computable (→ assumption)
• 2nd limitation: the dual variables (α_i)_{i=1}^n are still infinite dimensional!
If the span of the outputs, Span{y_j : j ≤ n}, is invariant by K, i.e. for all (x, x'), y ∈ Span{y_j} ⇒ K(x, x') y ∈ Span{y_j}, then α̂_i ∈ Span{y_j} → possible reparametrization: α̂_i = Σ_j ω̂_ij y_j.
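A one-line computation (immediate from this reparametrization, spelled out here for clarity) shows why the dual becomes finite dimensional: substituting α_i = Σ_k ω_ik y_k into the quadratic term gives
\[
\sum_{i,j} \langle \alpha_i, \mathcal{K}(x_i, x_j)\, \alpha_j \rangle_{\mathcal{Y}}
= \sum_{i,j,k,l} \omega_{ik}\, \omega_{jl}\, \underbrace{\langle y_k, \mathcal{K}(x_i, x_j)\, y_l \rangle_{\mathcal{Y}}}_{M_{ijkl}},
\]
so the objective only involves the n⁴ real numbers M_ijkl and the n² unknowns ω_ij — exactly what the double representer theorem below formalizes.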
The double representer theorem (1/2)
Assume that the OVK K and the loss ℓ satisfy the appropriate assumptions (see paper for details; verified by standard kernels and losses). Then
\[
\hat h = \operatorname*{argmin}_{h \in \mathcal{H}_{\mathcal{K}}} \; \frac{1}{n} \sum_{i} \ell\big(h(x_i), y_i\big) + \frac{\Lambda}{2} \| h \|_{\mathcal{H}_{\mathcal{K}}}^2
\quad \text{is given by} \quad
\hat h = \frac{1}{\Lambda n} \sum_{i,j=1}^{n} \mathcal{K}(\cdot, x_i)\, \hat\omega_{ij}\, y_j,
\]
with Ω̂ = [ω̂_ij] ∈ R^{n×n} the solution to the finite dimensional problem
\[
\min_{\Omega \in \mathbb{R}^{n \times n}} \; \sum_{i=1}^{n} L_i\big(\Omega_{i:},\, K^Y\big) + \frac{1}{2 \Lambda n} \operatorname{Tr}\big( \tilde M^{\top} (\Omega \otimes \Omega) \big),
\]
with M̃ the n² × n² matrix rewriting of the tensor M s.t. M_ijkl = ⟨y_k, K(x_i, x_j) y_l⟩_Y.
The double representer theorem (2/2)
If K further satisfies K(x, x') = Σ_t k_t(x, x') A_t, then the tensor M simplifies to M_ijkl = Σ_t [K^X_t]_ij [K^Y_t]_kl and the problem rewrites
\[
\min_{\Omega \in \mathbb{R}^{n \times n}} \; \sum_{i=1}^{n} L_i\big(\Omega_{i:},\, K^Y\big) + \frac{1}{2 \Lambda n} \sum_{t=1}^{T} \operatorname{Tr}\big( K^X_t\, \Omega\, K^Y_t\, \Omega^{\top} \big).
\]
Rmk. Only the n⁴ tensor ⟨y_k, K(x_i, x_j) y_l⟩_Y is needed to learn OVK machines.
Rmk. Simplifies to two n × n matrices, M_ijkl = K^X_ij K^Y_kl, if K is decomposable.
How to apply the duality approach?
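As an illustration, for the squared loss ℓ_i = ½‖· − y_i‖²_Y (whose conjugate gives ℓ_i⋆(−α_i) = ½‖α_i‖²_Y − ⟨α_i, y_i⟩_Y, a standard computation) and a decomposable kernel K = k · I_Y, every term of the finite-dimensional objective reduces to Gram-matrix algebra. A minimal numpy sketch, with illustrative names:

```python
import numpy as np

def dual_objective_ridge(Omega, K_X, K_Y, lam):
    """Finite-dimensional dual objective, squared loss + K(x,x') = k(x,x') I_Y.
    With alpha_i = sum_k Omega[i,k] y_k:
      sum_i l_i*(-alpha_i) = 0.5*Tr(Omega K_Y Omega^T) - Tr(Omega K_Y),
      coupling term        = Tr(K_X Omega K_Y Omega^T) / (2*lam*n)."""
    n = K_X.shape[0]
    OKY = Omega @ K_Y
    fenchel = 0.5 * np.trace(OKY @ Omega.T) - np.trace(OKY)
    coupling = np.trace(K_X @ OKY @ Omega.T) / (2 * lam * n)
    return fenchel + coupling
```

Setting the gradient to zero gives Ω̂ = (I_n + K^X/(Λn))⁻¹, so that (1/(Λn)) Ω̂ = (K^X + nΛ I_n)⁻¹, recovering the closed form of the representer-theorem slide.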
Robust losses as convolutions
Infimal convolution and Fenchel–Legendre transforms
Infimal-convolution operator □ between proper lower semicontinuous functions [Bauschke et al., 2011]:
\[
(f \,\square\, g)(x) = \inf_y \; f(y) + g(x - y).
\]
Relation to the FL transform: (f □ g)⋆ = f⋆ + g⋆.
Ex: ε-insensitive losses. Let ℓ : Y → R be a convex loss with unique minimum at 0, and ε > 0. The ε-insensitive version of ℓ, denoted ℓ_ε, is defined by
\[
\ell_{\epsilon}(y) = (\ell \,\square\, \chi_{B_{\epsilon}})(y) =
\begin{cases}
\ell(0) & \text{if } \|y\|_{\mathcal{Y}} \le \epsilon, \\
\inf_{\|d\|_{\mathcal{Y}} \le 1} \ell(y - \epsilon d) & \text{otherwise},
\end{cases}
\]
and has FL transform
\[
\ell_{\epsilon}^{\star}(y) = (\ell \,\square\, \chi_{B_{\epsilon}})^{\star}(y) = \ell^{\star}(y) + \epsilon \|y\|_{\mathcal{Y}}.
\]
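A worked instance (a standard computation, added here to connect with the dual problems below): the ridge loss ℓ = ½‖·‖²_Y is self-conjugate, ℓ⋆ = ½‖·‖²_Y, so
\[
\ell_{\epsilon}^{\star}(\alpha) = \tfrac{1}{2} \|\alpha\|_{\mathcal{Y}}^2 + \epsilon \|\alpha\|_{\mathcal{Y}}.
\]
Summed over the dual variables α_i = Σ_j ω_ij y_j, the terms ε‖α_i‖_Y are exactly what becomes the group penalty ε‖W‖_{2,1} in problem (D1) below.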
Interesting loss functions: sparsity and robustness
• ε-Ridge: ½‖·‖² □ χ_{B_ε} (sparsity)
• ε-SVR: ‖·‖ □ χ_{B_ε} (sparsity, robustness)
• κ-Huber: κ‖·‖ □ ½‖·‖² (robustness)
[Figure: 1-D profiles and 2-D surfaces of the three losses, each compared to its non-smoothed counterpart (½‖x‖², ‖x‖, Huber loss).]
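The three infimal convolutions admit simple closed forms (a quick check from the definitions above; function names are illustrative):

```python
import numpy as np

def eps_ridge(y, eps):
    """(0.5||.||^2 □ χ_Bε)(y) = 0.5 * max(||y|| - eps, 0)^2"""
    return 0.5 * max(np.linalg.norm(y) - eps, 0.0) ** 2

def eps_svr(y, eps):
    """(||.|| □ χ_Bε)(y) = max(||y|| - eps, 0)"""
    return max(np.linalg.norm(y) - eps, 0.0)

def huber(y, kappa):
    """(κ||.|| □ 0.5||.||^2)(y): quadratic near 0, linear tails (robustness)."""
    r = np.linalg.norm(y)
    return 0.5 * r ** 2 if r <= kappa else kappa * r - 0.5 * kappa ** 2
```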
Specific dual problems
For the ε-ridge, ε-SVR and κ-Huber, it holds Ω̂ = Ŵ V⁻¹, with Ŵ the solution to the following finite dimensional dual problems (‖·‖_{2,1} and ‖·‖_{2,∞} denote the row-wise mixed norms):
\[
(D1) \quad \min_{W \in \mathbb{R}^{n \times n}} \; \tfrac{1}{2} \| AW - B \|_{\mathrm{Fro}}^2 + \epsilon \| W \|_{2,1},
\]
\[
(D2) \quad \min_{W \in \mathbb{R}^{n \times n}} \; \tfrac{1}{2} \| AW - B \|_{\mathrm{Fro}}^2 + \epsilon \| W \|_{2,1}, \quad \text{s.t. } \| W \|_{2,\infty} \le 1,
\]
\[
(D3) \quad \min_{W \in \mathbb{R}^{n \times n}} \; \tfrac{1}{2} \| AW - B \|_{\mathrm{Fro}}^2, \quad \text{s.t. } \| W \|_{2,\infty} \le \kappa,
\]
with V, A, B such that: VVᵀ = K^Y, AᵀA = K^X/(Λn) + I_n (or AᵀA = K^X/(Λn) for the ε-SVR), and AᵀB = V.
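These are group-lasso-type problems; (D1), for instance, can be solved by proximal gradient with row-wise soft-thresholding. A minimal sketch, assuming A and B have been precomputed as above (solver choice and names are ours, not from the slides):

```python
import numpy as np

def solve_d1(A, B, eps, n_iter=500):
    """Proximal gradient on (D1): 0.5*||AW - B||_Fro^2 + eps*||W||_{2,1}."""
    W = np.zeros((A.shape[1], B.shape[1]))
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L with L = ||A^T A||_2
    for _ in range(n_iter):
        G = W - step * A.T @ (A @ W - B)            # gradient step
        norms = np.linalg.norm(G, axis=1, keepdims=True)
        W = np.maximum(1.0 - step * eps / np.maximum(norms, 1e-12), 0.0) * G
    return W                                        # row-wise soft-thresholding

# (D2) and (D3) only differ by an additional projection of each row onto the
# ball of radius 1 (resp. κ); for (D3) the shrinkage disappears entirely.
```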
Experiments
Surrogate approaches for structured prediction
• Experiments on the YEAST dataset
• Empirically, ε-SV-IOKR outperforms ridge-IOKR for a wide range of ε
• Promotes sparsity and acts as a regularizer
Figure 1: test MSEs (ε-SVR vs. KRR) and sparsity (% of null components) w.r.t. Λ, for several values of ε.
Robust function-to-function regression
Task from [Kadri et al., 2016]: predict lip acceleration from EMG signals.
• Dataset augmented with outliers; model learned with the Huber loss
• Improvement over ridge regression (κ = +∞) for every output approximation size m (see paper for the approximation)
Figure 2: LOO generalization error w.r.t. κ, for m ∈ {4, 5, 6, 7, 15}.
Conclusion