Localized Structured Prediction
Carlo Ciliberto 1, Francis Bach 2,3, Alessandro Rudi 2,3
1 Department of Electrical and Electronic Engineering, Imperial College London, London
2 Département d'informatique, École normale supérieure, PSL Research University
3 INRIA, Paris, France
Supervised Learning 101
• X input space, Y output space,
• ℓ : Y × Y → R loss function,
• ρ probability on X × Y.
f⋆ = argmin_{f : X→Y} E[ℓ(f(x), y)],
given only the dataset (x_i, y_i)_{i=1}^n sampled independently from ρ.
Structured Prediction
Prototypical Approach: Empirical Risk Minimization
Solve the problem:
f̂ = argmin_{f∈G} (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i) + λ R(f),
where G ⊆ {f : X → Y} (usually a convex function space).
If Y is a vector space:
• G is easy to choose/optimize: (generalized) linear models, kernel methods, neural networks, etc.
If Y is a "structured" space:
• How to choose G? How to optimize over it?
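For concreteness, here is a minimal sketch (my own illustration, not from the slides) of the vector-space case: with G the class of linear maps and the squared loss, the regularized ERM above reduces to ridge regression with a closed-form solution.

```python
import numpy as np

# Minimal sketch (assumed setup): Y = R^k, G = {linear maps x -> W^T x},
# ell = squared loss, R(f) = ||W||_F^2. Then the regularized ERM
#   argmin_W (1/n) ||X W - Y||_F^2 + lam ||W||_F^2
# has the closed form W = (X^T X + lam*n*I)^{-1} X^T Y.
def erm_linear_squared(X, Y, lam):
    n, d = X.shape
    W = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ Y)
    return lambda x: x @ W  # the learned predictor f-hat
```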
State of the art: Structured case
Y arbitrary: how do we parametrize G and learn f̂?
Surrogate approaches
+ Clear theory (e.g. convergence and learning rates)
− Only for special cases (classification, ranking, multi-labeling, etc.) [Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]
Score learning techniques
+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al., 2005])
− Limited theory (no consistency, see e.g. [Bakir et al., 2007])
Is it possible to have the best of both worlds?
General algorithmic framework + clear theory
Table of contents
1. A General Framework for Structured Prediction [Ciliberto et al., 2016]
2. Leveraging Local Structure [This Work]
A General Framework for Structured Prediction
Characterizing the target function
Pointwise characterization in terms of the conditional expectation:
f⋆ = argmin_{f : X→Y} E_{xy}[ℓ(f(x), y)]
f⋆(x) = argmin_{z∈Y} E_y[ℓ(z, y) | x].
Indeed, since the risk decomposes as E_x[E_y[ℓ(f(x), y) | x]], it is minimized by choosing, for each x, the z ∈ Y with smallest conditional expected loss.
Deriving an Estimator
Idea: approximate
f⋆(x) = argmin_{z∈Y} E(z, x),   E(z, x) = E_y[ℓ(z, y) | x]
by means of an estimator Ê(z, x) of the ideal E(z, x):
f̂(x) = argmin_{z∈Y} Ê(z, x),   Ê(z, x) ≈ E(z, x).
Question: How to choose Ê(z, x) given the dataset (x_i, y_i)_{i=1}^n?
Estimating the Conditional Expectation
Idea: for every z, perform "regression" over the ℓ(z, ·):
ĝ_z = argmin_{g : X→R} (1/n) ∑_{i=1}^n L(g(x_i), ℓ(z, y_i)) + λ R(g).
Then we take Ê(z, x) = ĝ_z(x).
Questions:
• Models: How to choose L?
• Computations: Do we need to compute ĝ_z for every z ∈ Y?
• Theory: Does Ê(z, x) → E(z, x)? More generally, does f̂ → f⋆?
Square Loss!
Let L be the square loss. Then:
ĝ_z = argmin_g (1/n) ∑_{i=1}^n (g(x_i) − ℓ(z, y_i))² + λ ‖g‖².
In particular, for linear models g(x) = ϕ(x)ᵀw:
ĝ_z(x) = ϕ(x)ᵀ ŵ_z,   ŵ_z = argmin_w ‖Aw − b‖² + λ ‖w‖²,
with A = [ϕ(x_1), …, ϕ(x_n)]ᵀ and b = [ℓ(z, y_1), …, ℓ(z, y_n)]ᵀ.
Computing the ĝ_z All at Once
Closed-form solution:
ĝ_z(x) = ϕ(x)ᵀ ŵ_z = ϕ(x)ᵀ (AᵀA + λnI)⁻¹ Aᵀ b = α(x)ᵀ b,
with α_i(x) = ϕ(x)ᵀ (AᵀA + λnI)⁻¹ ϕ(x_i).
In particular, we can compute α(x) only once (independently of z). Then, for any z,
ĝ_z(x) = ∑_{i=1}^n α_i(x) b_i = ∑_{i=1}^n α_i(x) ℓ(z, y_i).
Structured Prediction Algorithm
Input: dataset (x_i, y_i)_{i=1}^n.
Training: for i = 1, …, n, compute v_i = (AᵀA + λnI)⁻¹ ϕ(x_i).
Prediction: given a new test point x, compute α_i(x) = ϕ(x)ᵀ v_i. Then,
f̂(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i).
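A minimal sketch of this algorithm (hypothetical names, not the authors' code; it assumes an explicit feature map ϕ and a finite candidate set over which the inference argmin is solved by enumeration):

```python
import numpy as np

def fit(phi, X, lam):
    """Training: v_i = (A^T A + lam*n*I)^{-1} phi(x_i), with A = [phi(x_1), ..., phi(x_n)]^T."""
    A = np.stack([phi(x) for x in X])                        # n x d feature matrix
    n, d = A.shape
    V = np.linalg.solve(A.T @ A + lam * n * np.eye(d), A.T)  # d x n, column i is v_i
    return V

def predict(phi, V, Y_train, loss, candidates, x):
    """Prediction: alpha_i(x) = phi(x)^T v_i, then argmin_z sum_i alpha_i(x) * loss(z, y_i)."""
    alpha = phi(x) @ V                                       # length-n vector of weights
    scores = [sum(a * loss(z, y) for a, y in zip(alpha, Y_train)) for z in candidates]
    return candidates[int(np.argmin(scores))]
```

With Y = {1, …, T} and the 0-1 loss, for instance, `candidates` is just the label set and the rule reduces to a weighted vote over the training labels.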
The Proposed Structured Prediction Algorithm
Questions (answered):
• Models: How to choose L? → Square loss!
• Computations: Do we need to compute ĝ_z for every z ∈ Y? → No need, compute them all at once!
• Theory: Does f̂ → f⋆? → Yes!
Theorem (Rates - [Ciliberto et al., 2016])
Under mild assumptions on ℓ, let λ = n^{−1/2}. Then
E[ℓ(f̂(x), y) − ℓ(f⋆(x), y)] ≤ O(n^{−1/4}),   w.h.p.
A General Framework for Structured Prediction (General Algorithm + Theory)
Is it possible to have the best of both worlds? Yes!
We introduced an algorithmic framework for structured prediction:
• Directly applicable to a wide family of problems (Y, ℓ).
• With strong theoretical guarantees.
• Recovering many existing algorithms (not seen here).
What Am I Hiding?
• Theory. The key assumption to achieve consistency and rates is that ℓ is a Structure Encoding Loss Function (SELF):
ℓ(z, y) = ⟨ψ(z), φ(y)⟩_H   ∀ z, y ∈ Y,
with ψ, φ : Y → H continuous maps into a Hilbert space H.
  • Similar to the characterization of reproducing kernels.
  • In principle hard to verify. However, lots of ML losses satisfy it!
• Computations. We need to solve an optimization problem at prediction time!
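For intuition, one standard example: any loss on a finite output set Y = {1, …, T} is SELF. Take H = R^T, ψ(z) = e_z (the z-th canonical basis vector) and φ(y) = L e_y, where L ∈ R^{T×T} is the loss matrix with entries L_{zy} = ℓ(z, y); then ℓ(z, y) = e_zᵀ L e_y = ⟨ψ(z), φ(y)⟩.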
Prediction: The Inference Problem
Solving an optimization problem at prediction time is standard practice in structured prediction, known as the Inference Problem:
f̂(x) = argmin_{z∈Y} Ê(x, z).
In our case it is reminiscent of a weighted barycenter:
f̂(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i).
It is *very* problem dependent.
Example: Learning to Rank
Goal: given a query x, order a set of documents d_1, …, d_k according to their relevance scores y_1, …, y_k w.r.t. x.
Pairwise loss:
ℓ_rank(f(x), y) = ∑_{i,j=1}^k (y_i − y_j) sign(f(x)_i − f(x)_j).
It can be shown that f̂(x) = argmin_{z∈Y} ∑_{i=1}^n α_i(x) ℓ(z, y_i) is a Minimum Feedback Arc Set problem on DAGs (NP hard!). Still, approximate solutions can improve upon non-consistent approaches.
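As a rough illustration (hypothetical helpers, not the slides' code): the pairwise loss above, and a brute-force version of the resulting inference step that is only feasible for very small k given the NP-hardness. Here the y_i are length-k relevance vectors (numpy arrays).

```python
import numpy as np
from itertools import permutations

def rank_loss(scores, y):
    """ell_rank(f(x), y) = sum_{i,j} (y_i - y_j) * sign(f(x)_i - f(x)_j), as on the slide."""
    return float(np.sum((y[:, None] - y[None, :]) * np.sign(scores[:, None] - scores[None, :])))

def rank_inference(alpha, Y_train, k):
    """Brute force over orderings: argmin_z sum_i alpha_i * rank_loss(z, y_i)."""
    best_z, best_val = None, np.inf
    for perm in permutations(range(k)):
        z = np.empty(k)
        z[list(perm)] = np.arange(k, 0, -1)   # a score vector inducing this ordering
        val = sum(a * rank_loss(z, y) for a, y in zip(alpha, Y_train))
        if val < best_val:
            best_z, best_val = z, val
    return best_z
```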
Additional Work
Case studies:
• Learning to rank [Korba et al., 2018]
• Output Fisher Embeddings [Djerrab et al., 2018]
• Y = manifolds, ℓ = geodesic distance [Rudi et al., 2018]
• Y = probability space, ℓ = Wasserstein distance [Luise et al., 2018]
Refinements of the analysis:
• Alternative derivations [Osokin et al., 2017]
• Discrete losses [Nowak-Vila et al., 2018, Struminsky et al., 2018]
Extensions:
• Application to multi-task learning [Ciliberto et al., 2017]
• Beyond the least-squares surrogate [Nowak-Vila et al., 2019]
• Regularizing with the trace norm [Luise et al., 2019]
Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto '18]
Setting: Y = P(R^d), probability distributions on R^d.
Loss: Wasserstein distance
ℓ(µ, ν) = min_{τ∈Π(µ,ν)} ∫ ‖z − y‖² dτ(z, y).
Application: digit reconstruction.
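As a rough sketch of the loss itself (assuming the POT library, which is not necessarily what the paper uses), the Wasserstein cost above between two discrete measures can be computed as follows:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (an assumed dependency for this illustration)

# Discrete measures mu = sum_i a_i delta_{P[i]} and nu = sum_j b_j delta_{Q[j]} on R^2.
P, Q = np.random.rand(30, 2), np.random.rand(40, 2)
a, b = np.full(30, 1 / 30), np.full(40, 1 / 40)

M = ot.dist(P, Q)        # squared Euclidean ground cost ||z - y||^2 (POT's default metric)
loss = ot.emd2(a, b, M)  # min over couplings tau in Pi(mu, nu) of <tau, M>
print(loss)
```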
Manifold Regression [Rudi, Ciliberto, Marconi, Rosasco '18]
Setting: Y Riemannian manifold.
Loss: (squared) geodesic distance.
Optimization: Riemannian GD.
Applications: fingerprint reconstruction (Y = S¹ sphere), multi-labeling (Y statistical manifold).
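To make the inference step concrete, here is a small sketch (my own illustration, not the paper's implementation) of Riemannian gradient descent on the unit sphere, minimizing the weighted sum of squared geodesic distances that the prediction rule requires:

```python
import numpy as np

# Inference on Y = unit sphere: minimize z -> sum_i alpha_i * d(z, y_i)^2,
# where d is the arc-length (geodesic) distance and grad = -2 sum_i alpha_i log_z(y_i).

def log_map(z, y):
    """Log map on the sphere: tangent vector at z pointing toward y, with length d(z, y)."""
    c = np.clip(z @ y, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(z)
    v = y - c * z
    return theta * v / np.linalg.norm(v)

def exp_map(z, v):
    """Exponential map on the sphere: follow the geodesic from z in direction v."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return z
    return np.cos(t) * z + np.sin(t) * v / t

def sphere_inference(alpha, Ys, steps=100, lr=0.1):
    """Riemannian gradient descent for the weighted geodesic least-squares objective."""
    z = Ys[0] / np.linalg.norm(Ys[0])   # initialize at a training output
    for _ in range(steps):
        grad = -2.0 * sum(a * log_map(z, y) for a, y in zip(alpha, Ys))
        z = exp_map(z, -lr * grad)
    return z
```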
Nonlinear Multi-task Learning [Ciliberto, Rudi, Rosasco, Pontil '17; Luise, Stamos, Pontil, Ciliberto '19]
Idea: instead of solving multiple learning problems (tasks) separately, leverage the potential relations among them.
Previous methods: only imposing/learning linear task relations. Unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).
MTL + Structured Prediction:
− Interpret multiple tasks as separate outputs.
− Impose constraints as structure on the joint output.
Leveraging Local Structure
Local Structure
Motivating Example (Between-Locality)
Super-resolution: learn f : low res → high res.
However...
• Very large output sets (high sample complexity).
• Local info might be sufficient to predict the output.