Structured Prediction via Implicit Embeddings - Alessandro Rudi


  1. Structured Prediction via Implicit Embeddings
      Alessandro Rudi
      Imaging and Machine Learning, April 1st, Paris
      Inria, École normale supérieure
      In collaboration with: Carlo Ciliberto, Lorenzo Rosasco, Francis Bach

  2. Structured Prediction

  3. Structured Prediction

  4. Supervised Learning
      • X input space, Y output space,
      • ℓ : Y × Y → R loss function,
      • ρ probability on X × Y.
      f⋆ = argmin_{f : X → Y} E(f),   E(f) := E[ℓ(y, f(x))],
      given only the dataset (x_i, y_i)_{i=1}^n sampled independently from ρ.

  5. Supervised learning: Goal
      Given the dataset (x_i, y_i)_{i=1}^n sampled independently from ρ, produce f̂_n such that
      • Consistency: lim_{n→∞} E(f̂_n) = E(f⋆), a.s.
      • Learning rates: E(f̂_n) − E(f⋆) ≤ c(n), w.h.p.

  6. State of the art: Vector-valued case
      Y is a vector space.
      • Choose a suitable G ⊆ {f : X → Y} (usually a convex function space).
      • Solve empirical risk minimization:
        f̂ = argmin_{f ∈ G} ∑_{i=1}^n ℓ(f(x_i), y_i) + λ R(f).
      • Well-known methods: linear models, generalized linear models, kernel machines, kernel SVM. Easy to optimize.
      • Consistency and (optimal) learning rates for many losses.
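
(Editorial note, not from the slides.) The simplest instance of this recipe is ridge regression: squared loss, G the linear functions X → Y, and R(f) the squared norm of the weights. A minimal numpy sketch on synthetic data; the names `W_hat`, `f_hat` and all numbers are illustrative placeholders:

```python
# Regularized ERM for a vector-valued output space Y = R^p with the squared
# loss, G = linear maps, R(f) = squared Frobenius norm (ridge regression).
# Synthetic data; hyperparameters chosen arbitrarily for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 10, 3                       # samples, input dim, output dim
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, p))
Y = X @ W_true + 0.1 * rng.normal(size=(n, p))

lam = 1e-2
# Closed-form minimizer of (1/n) sum_i ||W^T x_i - y_i||^2 + lam ||W||_F^2
W_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

f_hat = lambda x: x @ W_hat                # the learned predictor R^d -> R^p
print(np.linalg.norm(f_hat(X) - Y) / np.sqrt(n))
```

The closed-form solve plays the role of the ERM step; kernel machines replace the Gram matrix X.T @ X with a kernel matrix, as used later in the talk.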

  7. State of the art: Structured case
      Y arbitrary: how do we parametrize G and learn f̂?
      Surrogate approaches
      + Clear theory
      − Only for special cases (e.g. classification, ranking, multi-labeling, etc.)
        [Bartlett et al. '06, Duchi et al. '10, Mroueh et al. '12, Gao et al. '13]
      Score learning techniques
      + General algorithmic framework (e.g. StructSVM [Tsochantaridis et al. '05])
      − Limited theory ([McAllester '06])

  8. Supervised learning with structure
      Is it possible to
      (a) have the best of both worlds? (a general algorithmic framework with clear theory)
      (b) learn by leveraging the local structure of the input and the output?
      We will address (a), (b) using implicit embeddings
      (related techniques: Cortes et al. '05; Geurts, Wehenkel, d'Alché-Buc '06; Kadri et al. '13; Brouard, Szafranski, d'Alché-Buc '16)

  9. Table of contents
      1. Structured learning with implicit embeddings
      2. Algorithm and properties
      3. Leveraging local structure

  10. Structured learning with implicit embeddings

  11. Characterizing the target function
      Pointwise characterization
      [figure: f mapping inputs x to outputs y]
      f⋆ = argmin_{f : X → Y} E[ℓ(f(x), y)].

  12. Characterizing the target function
      Pointwise characterization:
      f⋆ = argmin_{f : X → Y} E[ℓ(f(x), y)].
      f⋆(x) = argmin_{y′ ∈ Y} E[ℓ(y′, y) | x]

  13. Characterizing the target function
      Let f̃(x) = argmin_{y′ ∈ Y} E[ℓ(y′, y) | x]. Then
      E[ℓ(f̃(x), y)] = E_x[E[ℓ(f̃(x), y) | x]] = E_x[inf_{y′ ∈ Y} E[ℓ(y′, y) | x]] ≤ E[ℓ(f(x), y)], ∀ f : X → Y.
      Hence E(f̃) = inf_{f : X → Y} E(f)
      (measurability issues solved via the Berge maximum theorem for measurable functions).

  14. Implicit embedding
      A1. There exist a Hilbert space H and ψ, ϕ : Y → H, bounded continuous, such that
          ℓ(y′, y) := ⟨ψ(y′), ϕ(y)⟩.
      Theorem (Ciliberto, Rosasco, Rudi '16). A1 is satisfied
      1. for any loss ℓ when Y is a discrete space
      2. for any smooth loss ℓ when Y ⊂ R^d is compact
      3. for any smooth loss ℓ when Y ⊆ M, with M a compact manifold
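
(Editorial note, not from the slides.) For intuition on case 1: when Y is finite, any loss satisfies A1 with H = R^{|Y|}, taking ϕ(y) the one-hot indicator of y and ψ(y′) the corresponding row of the loss matrix. A quick sanity check in numpy; the loss matrix `L` below is an arbitrary toy choice:

```python
# Verify A1 for a finite output space: any loss on a discrete Y decomposes as
# <psi(y'), phi(y)> with H = R^{|Y|}, phi(y) = one-hot indicator of y,
# psi(y') = row y' of the loss matrix. Toy construction for illustration.
import numpy as np

Y = ["a", "b", "c"]
L = np.array([[0.0, 1.0, 2.0],     # L[i, j] = loss(Y[i], Y[j]), arbitrary values
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])

phi = lambda j: np.eye(len(Y))[j]  # embedding of the observed label
psi = lambda i: L[i]               # embedding of the candidate prediction

for i in range(len(Y)):
    for j in range(len(Y)):
        assert np.isclose(psi(i) @ phi(j), L[i, j])
print("A1 holds for this discrete loss.")
```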

  15. Idea for a unified approach
      When A1 holds:
      f⋆(x) = argmin_{y′ ∈ Y} E[ℓ(y′, y) | x]

  16. Idea for a unified approach
      When A1 holds:
      f⋆(x) = argmin_{y′ ∈ Y} E[⟨ψ(y′), ϕ(y)⟩ | x]

  17. Idea for a unified approach
      When A1 holds:
      f⋆(x) = argmin_{y′ ∈ Y} ⟨ψ(y′), E[ϕ(y) | x]⟩

  18. Idea for a unified approach
      When A1 holds:
      f⋆(x) = argmin_{y′ ∈ Y} ⟨ψ(y′), µ⋆(x)⟩,
      with µ⋆(x) = E[ϕ(y) | x] the conditional expectation of ϕ(y) given x.

  19. The estimator
      Given µ̂ estimating µ⋆, define
      f̂(x) = argmin_{y′ ∈ Y} ⟨ψ(y′), µ̂(x)⟩

  20. How to compute µ̂
      µ⋆ = E[ϕ(y) | x] is characterized by
      µ⋆ = argmin_{µ : X → H} E[∥µ(x) − ϕ(y)∥²]

  21. How to compute µ̂
      µ⋆ = E[ϕ(y) | x] is characterized by
      µ⋆ = argmin_{µ : X → H} E[∥µ(x) − ϕ(y)∥²]
      Use standard techniques for vector-valued problems. Given G a suitable space of functions,
      µ̂ = argmin_{µ ∈ G} ∑_{i=1}^n ∥µ(x_i) − ϕ(y_i)∥² + λ ∥µ∥².

  22. G space of linear functions
      Let X be a vector space and G = X ⊗ H. Then
      µ̂(x) = ∑_{i=1}^n α_i(x) ϕ(y_i),
      where α_i(x) := [(K + λnI)^{−1} v(x)]_i,
      with v(x) = (x⊤x_1, ..., x⊤x_n) ∈ R^n and K ∈ R^{n×n}, K_{i,j} = x_i⊤x_j.

  23. G non-parametric model
      Let k : X × X → R be a kernel on X and denote by F the reproducing kernel Hilbert space induced by k over X. Let G = F ⊗ H. Then
      µ̂(x) = ∑_{i=1}^n α_i(x) ϕ(y_i),
      where α_i(x) := [(K + λnI)^{−1} v(x)]_i,
      with v(x) = (k(x, x_1), ..., k(x, x_n)) ∈ R^n and K ∈ R^{n×n}, K_{i,j} = k(x_i, x_j).
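
(Editorial note, not from the slides.) In code, the weights α_i(x) amount to a standard kernel ridge regression solve. A minimal numpy sketch with a Gaussian kernel; the names `gaussian_kernel`, `alpha`, and the choices of bandwidth, λ and data are illustrative placeholders:

```python
# Compute the weights alpha_i(x) = [(K + lam*n*I)^{-1} v(x)]_i for a Gaussian
# kernel on synthetic inputs. Kernel and hyperparameters are arbitrary choices.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, d = 50, 4
X_train = rng.normal(size=(n, d))
lam = 1e-3

K = gaussian_kernel(X_train, X_train)             # K_{ij} = k(x_i, x_j)

def alpha(x):
    v = gaussian_kernel(x[None, :], X_train)[0]   # v(x) = (k(x, x_1), ..., k(x, x_n))
    return np.linalg.solve(K + lam * n * np.eye(n), v)

print(alpha(rng.normal(size=d))[:5])
```

For many test points it is cheaper to factorize K + λnI once (e.g. a Cholesky factorization) and reuse it for every v(x).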

  24. Algorithm and properties

  25. Explicit representation of f̂
      When µ̂ is a non-parametric model, then
      f̂(x) = argmin_{y′ ∈ Y} ⟨ψ(y′), µ̂(x)⟩

  26. Explicit representation of f̂
      When µ̂ is a non-parametric model, then
      f̂(x) = argmin_{y′ ∈ Y} ⟨ψ(y′), ∑_{i=1}^n α_i(x) ϕ(y_i)⟩

  27. Explicit representation of f̂
      When µ̂ is a non-parametric model, then
      f̂(x) = argmin_{y′ ∈ Y} ∑_{i=1}^n α_i(x) ⟨ψ(y′), ϕ(y_i)⟩

  28. Explicit representation of f̂
      When µ̂ is a non-parametric model, then
      f̂(x) = argmin_{y′ ∈ Y} ∑_{i=1}^n α_i(x) ℓ(y′, y_i).

  29. Explicit representation of f̂
      When µ̂ is a non-parametric model, then
      f̂(x) = argmin_{y′ ∈ Y} ∑_{i=1}^n α_i(x) ℓ(y′, y_i).
      No need to know H, ϕ, ψ to run the algorithm!
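
(Editorial note, not from the slides.) The decoding step only needs loss evaluations against the training labels. A self-contained sketch for a finite candidate set; `decode`, `alpha_x`, `train_labels` are hypothetical names and the weights are made up, standing in for the α_i(x) of the previous slides:

```python
# Decoding step f_hat(x) = argmin_{y' in candidates} sum_i alpha_i(x) * loss(y', y_i),
# shown for a toy discrete output space and 0-1 loss. In the full method the
# weights come from the kernel ridge solve; here they are synthetic.
import numpy as np

train_labels = np.array([0, 2, 1, 0, 2])          # y_1, ..., y_n (synthetic)
alpha_x = np.array([0.4, -0.1, 0.3, 0.2, 0.05])   # alpha_i(x) for one test point (synthetic)
candidates = [0, 1, 2]                            # the output space Y

loss = lambda y_prime, y: float(y_prime != y)     # 0-1 loss; any loss works here

def decode(alpha_x, train_labels, candidates, loss):
    scores = [sum(a * loss(y_prime, y) for a, y in zip(alpha_x, train_labels))
              for y_prime in candidates]
    return candidates[int(np.argmin(scores))]

print(decode(alpha_x, train_labels, candidates, loss))
```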

  30. Recap
      The proposed estimator has the form: given ℓ satisfying A1 and k : X × X → R a kernel on X,
      f̂(x) = argmin_{y′ ∈ Y} ∑_{i=1}^n α_i(x) ℓ(y′, y_i),
      with α_i(x) := [(K + λnI)^{−1} v(x)]_i, v(x) = (k(x, x_1), ..., k(x, x_n)) ∈ R^n, K ∈ R^{n×n}, K_{i,j} = k(x_i, x_j).
      • Applicable to a wide family of problems (no need to know H, ϕ, ψ)
      • Only optimization on Y and not on {f : X → Y}
      • Generalization properties?

  31. Recap
      The proposed estimator has the form: given ℓ satisfying A1 and k : X × X → R a kernel on X,
      f̂(x) = argmin_{y′ ∈ Y} ∑_{i=1}^n α_i(x) ℓ(y′, y_i),
      with α_i(x) := [(K + λnI)^{−1} v(x)]_i, v(x) = (k(x, x_1), ..., k(x, x_n)) ∈ R^n, K ∈ R^{n×n}, K_{i,j} = k(x_i, x_j).
      • Applicable to a wide family of problems (no need to know H, ϕ, ψ)
      • Only optimization on Y and not on {f : X → Y} = Y^X
      • Generalization properties?
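
(Editorial note, not from the slides.) Putting the two steps together, a compact end-to-end toy example with a Gaussian kernel and the 0-1 loss on synthetic blobs; every choice below (kernel, loss, data, λ) is an illustrative placeholder, not the talk's experiments:

```python
# End-to-end sketch of the estimator on a toy multiclass problem:
# kernel ridge weights alpha(x), then loss-based decoding over Y = {0, 1, 2}.
import numpy as np

rng = np.random.default_rng(0)

def k(A, B, sigma=1.0):                     # Gaussian kernel
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

loss = lambda yp, y: float(yp != y)         # 0-1 loss; any loss satisfying A1 works

n_per, d = 30, 2                            # toy data: 3 Gaussian blobs
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.concatenate([c + rng.normal(size=(n_per, d)) for c in centers])
y = np.repeat([0, 1, 2], n_per)
n, lam = len(y), 1e-3

K = k(X, X)

def predict(x):
    v = k(x[None, :], X)[0]
    a = np.linalg.solve(K + lam * n * np.eye(n), v)                       # alpha(x)
    scores = [sum(a[i] * loss(yp, y[i]) for i in range(n)) for yp in (0, 1, 2)]
    return int(np.argmin(scores))                                         # f_hat(x)

print([predict(c) for c in centers])        # predictions at the class centers
```

Note that only k, ℓ and the data enter the computation; H, ϕ, ψ never appear explicitly.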

  32. Properties of f̂
      Theorem (Comparison inequality). Let ℓ satisfy A1. For any µ̂ : X → H,
      E(f̂) − E(f⋆) ≤ 2 c_ψ √(E[∥µ̂(x) − µ⋆(x)∥²]),
      with c_ψ = sup_{y ∈ Y} ∥ψ(y)∥.
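
(Editorial note, not from the slides.) A sketch of the usual argument behind this inequality, reconstructed here with measurability details omitted; it only uses that f̂(x) minimizes ⟨ψ(·), µ̂(x)⟩ over Y while f⋆(x) minimizes ⟨ψ(·), µ⋆(x)⟩:

```latex
% Sketch of the standard argument for the comparison inequality (reconstruction).
\begin{align*}
\mathcal{E}(\hat f) - \mathcal{E}(f^\star)
  &= \mathbb{E}_x\big[\langle \psi(\hat f(x)) - \psi(f^\star(x)),\, \mu^\star(x)\rangle\big] \\
  &= \mathbb{E}_x\big[\langle \psi(\hat f(x)),\, \mu^\star(x) - \hat\mu(x)\rangle
      + \underbrace{\langle \psi(\hat f(x)) - \psi(f^\star(x)),\, \hat\mu(x)\rangle}_{\le\, 0 \ \text{by definition of } \hat f(x)}
      + \langle \psi(f^\star(x)),\, \hat\mu(x) - \mu^\star(x)\rangle\big] \\
  &\le \mathbb{E}_x\big[\langle \psi(\hat f(x)) - \psi(f^\star(x)),\, \mu^\star(x) - \hat\mu(x)\rangle\big] \\
  &\le 2\, c_\psi\, \mathbb{E}_x\big[\|\hat\mu(x) - \mu^\star(x)\|\big]
   \le 2\, c_\psi \sqrt{\mathbb{E}_x\big[\|\hat\mu(x) - \mu^\star(x)\|^2\big]},
\end{align*}
% using Cauchy–Schwarz and then Jensen's inequality in the last line.
```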

  33. Consistency of f̂
      Theorem (Universal consistency, Ciliberto, Rosasco, Rudi '16). Let ℓ satisfy A1 and k be a universal kernel. Let λ = n^{−1/4}. Then
      lim_{n→∞} E(f̂) = E(f⋆), with probability 1.

  34. Learning rates of f̂
      Theorem (Rates, Ciliberto, Rosasco, Rudi '16). Let ℓ satisfy A1 and µ⋆ ∈ G. Let λ = n^{−1/2}. Then
      E(f̂) − E(f⋆) ≤ 2 c_ψ n^{−1/4}, w.h.p.

  35. Check point
      We provide a framework for structured prediction with
      • theoretical guarantees, as for empirical risk minimization
      • an explicit algorithm applicable to a wide family of problems (Y, ℓ)
      • coverage of some important existing algorithms (not seen here)

  36. Case studies:
      • ranking with different losses (Korba, Garcia, d'Alché-Buc '18)
      • Output Fisher Embeddings (Djerrab, Garcia, Sangnier, d'Alché-Buc '18)
      • Y = manifolds, ℓ = geodesic distance (Ciliberto et al. '18)
      • Y = probability space, ℓ = Wasserstein distance (Luise et al. '18)
      Refinements of the analysis:
      • different derivation (Osokin, Bach, Lacoste-Julien '17; Goh '18)
      • determination of the constant c_ψ in terms of log |Y| for discrete sets (Nowak, Bach, Rudi '18; Struminsky et al. '18)
      Extensions:
      • application to multitask learning (Ciliberto, Rosasco, Rudi '17)
      • beyond least squares surrogate (Nowak, Bach, Rudi '19)
      • regularizing with trace norm (Luise, Stamos, Pontil, Ciliberto '19)
      • localized structured prediction (Ciliberto, Bach, Rudi '18)

  37. Leveraging local structure

  38. Local Structure
