Vector-valued Distribution Regression: A Simple and Consistent Approach

Vector-valued Distribution Regression: A Simple and Consistent Approach. Zoltán Szabó. Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU). Statistical Science Seminars, October 9, 2014.


  1. Vector-valued Distribution Regression: A Simple and Consistent Approach. Zoltán Szabó. Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU). Statistical Science Seminars, October 9, 2014. [Slide footer: Zoltán Szabó, Vector-valued Distribution Regression: Simple, Consistent]

  2. Outline: Motivation. Previous work. High-level goal. Definitions, algorithm, error guarantee, consistency. Numerical illustration.

  3. Problem: regression on distributions. Given: samples $\{(x_i, y_i)\}_{i=1}^{l}$. Find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$. Our interest: the $x_i$-s are distributions, but (challenge!) only samples are given from the $x_i$-s: $\{x_{i,n}\}_{n=1}^{N_i}$.

  4. Two-stage sampled setting = bag-of-features. Examples: image = set of patches/visual descriptors; document = bag of words/sentences/paragraphs; molecule = different configurations/shapes; group of people on a social network = bag of friendship graphs; customer = his/her shopping records; user = set of trial time-series.

  5. Distribution regression: wider context. Several problems are covered in machine learning and statistics: multi-instance learning, point estimation tasks without analytical formula.

  6. Existing methods. Idea: (1) estimate distribution similarities, (2) plug them into a learning algorithm. Approaches: (1) parametric approaches: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012]; (2) kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].

  7. Existing methods+. (1) (Positive definite) kernels: [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005]. (2) Divergence measures (KL, ...): [Póczos et al., 2011]. (3) Set metric based algorithms: (a) Hausdorff metric [Edgar, 1995], and (b) its variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

  8. Existing methods: summary. MIL dates back to [Haussler, 1999, Gärtner et al., 2002]. There are several multi-instance methods and applications.

  9. Existing methods: summary. MIL dates back to [Haussler, 1999, Gärtner et al., 2002]. There are several multi-instance methods and applications. One ‘small’ open question: does any of these techniques make sense?

  10. Existing methods: “exceptions”. APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013, Sabato and Tishby, 2012]: $y_i = \max\big(\mathbb{I}_R(x_{i,1}), \ldots, \mathbb{I}_R(x_{i,N})\big) \in \{0, 1\}$, where $R$ = unknown rectangle.
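The APR labelling rule on this slide can be illustrated with a small Python sketch. The rectangle bounds and the two toy bags are made-up values, and `in_rectangle` and `bag_label` are hypothetical helper names chosen for illustration only; in the actual learning problem the rectangle $R$ is unknown.

```python
def in_rectangle(point, lower, upper):
    """Indicator I_R(point): 1 if the point lies in the axis-parallel
    rectangle [lower_1, upper_1] x ... x [lower_d, upper_d], else 0."""
    return int(all(lo <= p <= hi for p, lo, hi in zip(point, lower, upper)))


def bag_label(bag, lower, upper):
    """y = max_n I_R(x_n): a bag is positive iff at least one
    of its instances falls inside the rectangle R."""
    return max(in_rectangle(x, lower, upper) for x in bag)


# Illustrative (assumed) rectangle R = [0, 1] x [0, 1] and two toy bags.
lower, upper = (0.0, 0.0), (1.0, 1.0)
positive_bag = [(2.0, 3.0), (0.5, 0.5)]   # second instance is inside R
negative_bag = [(2.0, 3.0), (-1.0, 0.5)]  # no instance is inside R
```

Calling `bag_label(positive_bag, lower, upper)` and `bag_label(negative_bag, lower, upper)` evaluates the rule on the two toy bags.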

  11. Existing methods: “exceptions”. APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013, Sabato and Tishby, 2012]: $y_i = \max\big(\mathbb{I}_R(x_{i,1}), \ldots, \mathbb{I}_R(x_{i,N})\big) \in \{0, 1\}$, where $R$ = unknown rectangle. Density based approaches, regression: KDE + kernel smoothing [Póczos et al., 2013, Oliva et al., 2014]; densities live on a compact Euclidean domain; density estimation is a nuisance step.

  12. High-level goal: set kernel. Given (2 bags): $B_i := \{x_{i,n}\}_{n=1}^{N_i} \sim x_i$, $B_j := \{x_{j,m}\}_{m=1}^{N_j} \sim x_j$. Similarity of the bags (set/multi-instance/ensemble-, convolution kernel [Haussler, 1999, Gärtner et al., 2002]): $K(B_i, B_j) = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k(x_{i,n}, x_{j,m})$.
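A minimal Python sketch of the set kernel, assuming a Gaussian base kernel $k$ with bandwidth `theta` (the slide does not fix $k$; the function names and the default bandwidth are illustrative choices):

```python
import math


def gaussian_kernel(a, b, theta=1.0):
    """Base kernel k(a, b) = exp(-||a - b||_2^2 / (2 theta^2)) (assumed choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * theta ** 2))


def set_kernel(bag_i, bag_j, k=gaussian_kernel):
    """K(B_i, B_j) = 1 / (N_i N_j) * sum_n sum_m k(x_{i,n}, x_{j,m})."""
    total = sum(k(x, y) for x in bag_i for y in bag_j)
    return total / (len(bag_i) * len(bag_j))


# Two toy bags of one-dimensional points (made-up data).
bag_a = [(0.0,), (1.0,)]
bag_b = [(0.0,)]
similarity = set_kernel(bag_a, bag_b)
```

The bags may have different sizes; the double sum costs $N_i N_j$ evaluations of the base kernel per pair of bags.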

  13. High-level goal: consistency of set kernels. Are set kernels consistent when plugged into some regression scheme? Our focus: ridge regression. Motivation (ridge scheme): (1) simple algorithm, (2) recently proved parallelizations [Zhang et al., 2014].

  14. Story. $\mathcal{H}$: assumed function class to capture the $(x, y)$ relation. $f_\rho$: true regression function (might not be in $\mathcal{H}$). $f_{\mathcal{H}}$: “best” function from $\mathcal{H}$ ($l = \infty$, $N := N_i = \infty$). $\hat{f}$: estimated function from $\mathcal{H}$ based on $\{(\{x_{i,n}\}_{n=1}^{N}, y_i)\}_{i=1}^{l}$. Aim: high-probability error guarantees ($\lambda$: regularization, $\mathcal{E}$: risk): $\mathcal{E}[\hat{f}] - \mathcal{E}[f_{\mathcal{H}}] \le r_1(l, N, \lambda)$ (1), $\|\hat{f} - f_\rho\|_{L^2} \le r_2(l, N, \lambda) + r_3(\text{richness of } \mathcal{H})$ (2). Consistency: $(l, N, \lambda) = ?$ such that $r_i(l, N, \lambda) \to 0$ ($i = 1, 2$).

  15. Distribution regression: definition, solution idea. $z = \{(x_i, y_i)\}_{i=1}^{l}$: $x_i \in \mathcal{M}_1^+(D)$, $y_i \in Y$. $\hat{z} = \{(\{x_{i,n}\}_{n=1}^{N}, y_i)\}_{i=1}^{l}$: $x_{i,1}, \ldots, x_{i,N} \overset{\text{i.i.d.}}{\sim} x_i$. Goal: learn the relation between $x$ and $y$ based on $\hat{z}$. Idea: (1) embed the distributions (using $\mu$ defined by $k$), (2) apply ridge regression (determined by $K$): $\mathcal{M}_1^+(D) \xrightarrow{\mu} X \subseteq H(k) \xrightarrow{f \in \mathcal{H}(K)} Y$.

  16. Kernel part ($k$, $K$): RKHS. $k : D \times D \to \mathbb{R}$ is a kernel on $D$ if $\exists\, \varphi : D \to H$(ilbert space) feature map with $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in D$). Kernel examples for $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$): $k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial; $k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian; $k(a, b) = e^{-\theta \|a - b\|_2}$: Laplacian. In the $H = H(k)$ RKHS ($\exists!$): $\varphi(u) = k(\cdot, u)$.
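The three example kernels can be written down directly. This is a plain-Python sketch for points given as tuples of floats; the function names are illustrative, and the Laplacian uses the Euclidean norm as the slide's formula reads after extraction.

```python
import math


def polynomial_kernel(a, b, theta=1.0, p=2):
    """k(a, b) = (<a, b> + theta)^p."""
    inner = sum(ai * bi for ai, bi in zip(a, b))
    return (inner + theta) ** p


def gaussian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * theta ** 2))


def laplacian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-theta * ||a - b||_2)."""
    dist = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return math.exp(-theta * dist)
```

The Gaussian and Laplacian kernels satisfy $k(u, u) = 1$ for every $u$, so they are bounded; the polynomial kernel is unbounded on $\mathbb{R}^d$.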

  17. Kernel part: example domains ($D$). Euclidean space: $D = \mathbb{R}^d$. Strings, time series, graphs, dynamical systems. Distributions.

  18. Embedding step: $\mathcal{M}_1^+(D) \xrightarrow{\mu} X \subseteq H(k)$. Given: kernel $k : D \times D \to \mathbb{R}$. Mean embedding of a distribution $x \in \mathcal{M}_1^+(D)$: $\mu_x = \int_D k(\cdot, u)\, \mathrm{d}x(u) \in H(k)$. Mean embedding of the empirical distribution $\hat{x}_i = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_{i,n}} \in \mathcal{M}_1^+(D)$: $\mu_{\hat{x}_i} = \int_D k(\cdot, u)\, \mathrm{d}\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}) \in H(k)$.
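As a short worked step connecting this slide to the set kernel of slide 12: the inner product of two empirical mean embeddings in $H(k)$ expands, by bilinearity and the reproducing property $\langle k(\cdot, u), k(\cdot, v) \rangle_{H(k)} = k(u, v)$, into the set kernel of the two bags (written here for bag sizes $N_i$ and $N_j$):

```latex
\begin{align*}
\langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_{H(k)}
  &= \left\langle \frac{1}{N_i} \sum_{n=1}^{N_i} k(\cdot, x_{i,n}),\;
                  \frac{1}{N_j} \sum_{m=1}^{N_j} k(\cdot, x_{j,m}) \right\rangle_{H(k)} \\
  &= \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j}
     \langle k(\cdot, x_{i,n}), k(\cdot, x_{j,m}) \rangle_{H(k)} \\
  &= \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k(x_{i,n}, x_{j,m})
   = K(B_i, B_j).
\end{align*}
```

So the set kernel is the linear kernel on the embedded bags, which is why it fits the embed-then-regress scheme.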

  19. Objective function: $X \xrightarrow{f \in \mathcal{H} = \mathcal{H}(K)} Y$. Optimal ($\mathcal{H}$/measurable) in expected risk ($\mathcal{E}$) sense: $\mathcal{E}[f_{\mathcal{H}}] = \inf_{f \in \mathcal{H}} \mathcal{E}[f] = \inf_{f \in \mathcal{H}} \int_{X \times Y} \|f(\mu_a) - y\|_Y^2 \, \mathrm{d}\rho(\mu_a, y)$, and $f_\rho(\mu_a) = \mathbb{E}[y \mid \mu_a] = \int_Y y \, \mathrm{d}\rho(y \mid \mu_a)$ ($\mu_a \in X$). One-stage ($\to z$), two-stage difficulty ($z \to \hat{z}$): $f_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} \|f(\mu_{x_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2$ (3), $f_{\hat{z}}^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{l} \sum_{i=1}^{l} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal{H}}^2$ (4).

  20. Algorithmically: ridge regression $\Rightarrow$ analytical solution. Given: training sample $\hat{z}$, test distribution $t$. Prediction: $(f_{\hat{z}}^\lambda \circ \mu)(t) = [y_1, \ldots, y_l](\mathbf{K} + l\lambda I_l)^{-1} \mathbf{k}$ (5), $\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathcal{L}(Y)^{l \times l}$ (6), $\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t); \ldots; K(\mu_{\hat{x}_l}, \mu_t)] \in \mathcal{L}(Y)^{l}$ (7). Specially: $Y = \mathbb{R} \Rightarrow \mathcal{L}(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow \mathcal{L}(Y) = \mathbb{R}^{d \times d}$.
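The prediction formula (5) can be sketched in plain Python for the scalar case $Y = \mathbb{R}$. Several choices below are assumptions made for illustration and are not fixed by the slide: $K$ is taken to be the linear kernel on embeddings (which reduces to the set kernel), $k$ is Gaussian, the test distribution $t$ is represented by a bag of samples, and the helper names (`solve`, `predict`) are hypothetical.

```python
import math


def gaussian_kernel(a, b, theta=1.0):
    """Base kernel k (assumed Gaussian)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * theta ** 2))


def set_kernel(bag_i, bag_j, k=gaussian_kernel):
    """Linear K on empirical mean embeddings: <mu_i, mu_j>_{H(k)}."""
    total = sum(k(x, y) for x in bag_i for y in bag_j)
    return total / (len(bag_i) * len(bag_j))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def predict(train_bags, y, test_bag, lam):
    """Scalar-output version of (5): [y_1..y_l] (K + l*lam*I)^{-1} k."""
    l = len(train_bags)
    gram = [[set_kernel(bi, bj) + (l * lam if i == j else 0.0)
             for j, bj in enumerate(train_bags)]
            for i, bi in enumerate(train_bags)]
    k_vec = [set_kernel(bi, test_bag) for bi in train_bags]
    alpha = solve(gram, k_vec)  # (K + l*lam*I)^{-1} k
    return sum(yi * ai for yi, ai in zip(y, alpha))
```

For example, `predict([[(0.0,)], [(10.0,)]], [3.0, -1.0], [(0.0,)], 1e-9)` evaluates the fitted function on a test bag that coincides with the first training bag. For vector-valued $Y$, the entries of $\mathbf{K}$ and $\mathbf{k}$ become operators in $\mathcal{L}(Y)$ and the scalar solver above no longer applies as written.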

  21. Assumption-1. $D$: separable, topological. $Y$: separable Hilbert. $k$: bounded, $\sup_{u \in D} k(u, u) \le B_k \in (0, \infty)$, and continuous. $X = \mu\big(\mathcal{M}_1^+(D)\big) \in \mathcal{B}(H)$.

  22. Assumption-1 (continued). $K$ [$K_{\mu_a} := K(\cdot, \mu_a)$]: (1) bounded: $\|K_{\mu_a}\|_{HS}^2 = \mathrm{Tr}\big(K_{\mu_a}^* K_{\mu_a}\big) \le B_K \in (0, \infty)$ ($\forall \mu_a \in X$); (2) Hölder continuous: $\exists\, L > 0$, $h \in (0, 1]$ such that $\|K_{\mu_a} - K_{\mu_b}\|_{\mathcal{L}(Y, \mathcal{H})} \le L \|\mu_a - \mu_b\|_H^h$, $\forall (\mu_a, \mu_b) \in X \times X$. $y$ is bounded: $\exists\, C < \infty$ such that $\|y\|_Y \le C$ almost surely.

  23. Assumption-1: remarks (before the $\rho$ assumptions). $k$: bounded, continuous $\Rightarrow$ $\mu : (\mathcal{M}_1^+(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ measurable. $\mu$ measurable, $X \in \mathcal{B}(H)$ $\Rightarrow$ $\rho$ on $X \times Y$: well-defined. If (*) := $D$ is compact metric, $k$ is universal, then $\mu$ is continuous and $X \in \mathcal{B}(H)$. If $Y = \mathbb{R}$, we get the traditional boundedness of $K$: $K(\mu_a, \mu_a) \le B_K$ ($\forall \mu_a \in X$).
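As a worked check of the two conditions on $K$, consider the simplest case, chosen here for illustration: $Y = \mathbb{R}$ and the linear kernel $K(\mu_a, \mu_b) = \langle \mu_a, \mu_b \rangle_H$ on embeddings. With $k$ bounded by $B_k$ as in Assumption-1, both conditions hold with $B_K = B_k$, $L = 1$ and $h = 1$:

```latex
\begin{align*}
K(\mu_a, \mu_a) &= \|\mu_a\|_H^2
  = \int_D \int_D k(u, v)\, \mathrm{d}a(u)\, \mathrm{d}a(v)
  \le \sup_{u, v \in D} \sqrt{k(u, u)\, k(v, v)} \le B_k, \\
\|K_{\mu_a} - K_{\mu_b}\|_{\mathcal{H}}^2
  &= K(\mu_a, \mu_a) - 2 K(\mu_a, \mu_b) + K(\mu_b, \mu_b)
  = \|\mu_a - \mu_b\|_H^2 .
\end{align*}
```

The first line uses the Cauchy-Schwarz bound $|k(u, v)| \le \sqrt{k(u, u)\, k(v, v)}$; the second uses the reproducing property in $\mathcal{H}(K)$ and, for $Y = \mathbb{R}$, the fact that the operator norm of $K_{\mu_a} - K_{\mu_b}$ equals its norm as an element of $\mathcal{H}$.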
