Learning with Approximate Kernel Embeddings
Dino Sejdinovic, Department of Statistics, University of Oxford
RegML Workshop, Simula, Oslo, 06/05/2017
Outline
1. Preliminaries on Kernel Embeddings
2. Testing and Learning on Distributions with Symmetric Noise Invariance
Reproducing Kernel Hilbert Spaces
RKHS: a Hilbert space of functions on $\mathcal{X}$ with continuous evaluation $f \mapsto f(x)$, $\forall x \in \mathcal{X}$ (norm convergence implies pointwise convergence).
Each RKHS corresponds to a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that
1. $\forall x \in \mathcal{X}$: $k(\cdot, x) \in \mathcal{H}$, and
2. $\forall x \in \mathcal{X}, \forall f \in \mathcal{H}$: $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$.
The RKHS can be constructed as $\mathcal{H}_k = \overline{\mathrm{span}}\{ k(\cdot, x) : x \in \mathcal{X} \}$ and includes functions $f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$ and their pointwise limits.
[Figure: an example RKHS function $f(x)$]
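A minimal sketch (not from the slides; Gaussian kernel, with centres and weights chosen arbitrarily for illustration) of what a typical element of $\mathcal{H}_k$ looks like: a finite weighted sum of kernel sections $f(x) = \sum_i \alpha_i k(x, x_i)$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# A function in the RKHS span: f(x) = sum_i alpha_i k(x, x_i)
centres = np.array([-2.0, 0.0, 1.5])   # arbitrary centres x_i (hypothetical)
alphas = np.array([0.7, -0.4, 1.0])    # arbitrary weights alpha_i (hypothetical)

def f(x, sigma=1.0):
    return sum(a * gaussian_kernel(np.atleast_1d(x), np.atleast_1d(c), sigma)
               for a, c in zip(alphas, centres))

# By the reproducing property, <f, k(., x)>_H = f(x), since
# <k(., x_i), k(., x)>_H = k(x_i, x) for every centre x_i.
print(f(0.5))
```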
Kernel Trick and Kernel Mean Trick
Kernel trick: the implicit feature map $x \mapsto k(\cdot, x) \in \mathcal{H}_k$ replaces $x \mapsto [\phi_1(x), \ldots, \phi_s(x)] \in \mathbb{R}^s$; inner products are readily available, $\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k} = k(x, y)$.
• nonlinear decision boundaries, nonlinear regression functions, learning on non-Euclidean/structured data [Cortes & Vapnik, 1995; Schölkopf & Smola, 2001]
Kernel mean trick (RKHS embedding): the implicit feature mean $P \mapsto \mu_k(P) = \mathbb{E}_{X \sim P}\, k(\cdot, X) \in \mathcal{H}_k$ replaces $P \mapsto [\mathbb{E}\phi_1(X), \ldots, \mathbb{E}\phi_s(X)] \in \mathbb{R}^s$ [Smola et al, 2007; Sriperumbudur et al, 2010]; inner products are easy to estimate, $\langle \mu_k(P), \mu_k(Q) \rangle_{\mathcal{H}_k} = \mathbb{E}_{X \sim P, Y \sim Q}\, k(X, Y)$.
• nonparametric two-sample, independence, conditional independence and interaction testing, learning on distributions [Gretton et al, 2005; Gretton et al, 2006; Fukumizu et al, 2007; DS et al, 2013; Muandet et al, 2012; Szabo et al, 2015]
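A minimal sketch (assumptions: Gaussian kernel, synthetic samples used only for illustration) of the kernel mean trick in practice: the inner product $\langle \mu_k(P), \mu_k(Q) \rangle_{\mathcal{H}_k}$ is estimated by averaging kernel evaluations across the two samples.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mean_embedding_inner_product(X, Y, sigma=1.0):
    """Estimate <mu_k(P), mu_k(Q)>_H by (1 / (n m)) sum_{i,j} k(x_i, y_j)."""
    return gaussian_gram(X, Y, sigma).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # sample from Q
print(mean_embedding_inner_product(X, Y))
```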
Maximum Mean Discrepancy
Maximum Mean Discrepancy (MMD) [Borgwardt et al, 2006; Gretton et al, 2007] between $P$ and $Q$:
$$\mathrm{MMD}_k(P, Q) = \left\| \mu_k(P) - \mu_k(Q) \right\|_{\mathcal{H}_k} = \sup_{f \in \mathcal{H}_k : \|f\|_{\mathcal{H}_k} \le 1} \left| \mathbb{E} f(X) - \mathbb{E} f(Y) \right|$$
[Figure: two distributions $P$ and $Q$]
Characteristic kernels: $\mathrm{MMD}_k(P, Q) = 0$ iff $P = Q$.
• Gaussian RBF $\exp\!\left(-\frac{1}{2\sigma^2} \|x - x'\|_2^2\right)$, Matérn family, inverse multiquadrics.
For characteristic kernels on LCH $\mathcal{X}$, MMD metrizes the weak* topology on probability measures [Sriperumbudur, 2010]: $\mathrm{MMD}_k(P_n, P) \to 0 \iff P_n \rightsquigarrow P$.
Some uses of MMD
MMD has been applied to:
• two-sample tests and independence tests [Gretton et al, 2009; Gretton et al, 2012]
• model criticism and interpretability [Lloyd & Ghahramani, 2015; Kim, Khanna & Koyejo, 2016]
• analysis of Bayesian quadrature [Briol et al, 2015+]
• ABC summary statistics [Park, Jitkrittum & DS, 2015]
• summarising streaming data [Paige, DS & Wood, 2016]
• traversal of manifolds learned by convolutional nets [Gardner et al, 2015]
• training deep generative models [Dziugaite, Roy & Ghahramani, 2015; Sutherland et al, 2017]
[Figure by Arthur Gretton: $\mathrm{MMD}^2$ as within-sample average similarity ($k(\mathrm{dog}_i, \mathrm{dog}_j)$, $k(\mathrm{fish}_i, \mathrm{fish}_j)$) minus between-sample average similarity ($k(\mathrm{dog}_i, \mathrm{fish}_j)$)]
$$\mathrm{MMD}^2_k(P, Q) = \mathbb{E}_{X, X' \overset{\mathrm{i.i.d.}}{\sim} P}\, k(X, X') + \mathbb{E}_{Y, Y' \overset{\mathrm{i.i.d.}}{\sim} Q}\, k(Y, Y') - 2\, \mathbb{E}_{X \sim P,\, Y \sim Q}\, k(X, Y).$$
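A minimal sketch (assuming a Gaussian kernel; the data are synthetic and only for illustration) of an empirical estimate of $\mathrm{MMD}^2_k(P, Q)$ from the expression above, using the standard unbiased form in which the diagonal terms of the within-sample Gram matrices are excluded.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(X, Y, sigma=1.0):
    return np.exp(-cdist(X, Y, "sqeuclidean") / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between the distributions of X and Y."""
    n, m = len(X), len(Y)
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))   # E k(X, X')
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))   # E k(Y, Y')
    term_xy = Kxy.mean()                                    # E k(X, Y)
    return term_xx + term_yy - 2.0 * term_xy

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(300, 1))   # sample from P
Y = rng.normal(0.2, 1.0, size=(300, 1))   # sample from Q (shifted mean)
print(mmd2_unbiased(X, Y))
```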
Kernel dependence measures
$$\mathrm{HSIC}^2(X, Y; \kappa) = \left\| \mu_\kappa(P_{XY}) - \mu_\kappa(P_X P_Y) \right\|^2_{\mathcal{H}_\kappa}$$
• the Hilbert-Schmidt norm of the feature-space cross-covariance [Gretton et al, 2009]
• the dependence witness is a smooth function in the RKHS $\mathcal{H}_\kappa$ of functions on $\mathcal{X} \times \mathcal{Y}$, with product kernel $\kappa((x, y), (x', y')) = k(x, x')\, l(y, y')$
• an independence testing framework that generalises Distance Correlation (dcor) of [Szekely et al, 2007]: HSIC with Brownian motion covariance kernels [DS et al, 2013]
[Figures by Arthur Gretton: correlation vs. distance correlation on example datasets; dependence witness and sample]
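A minimal sketch (assumptions: Gaussian kernels $k$ on $\mathcal{X}$ and $l$ on $\mathcal{Y}$, so $\kappa = k \times l$; synthetic data) of the standard biased HSIC statistic $\frac{1}{n^2}\,\mathrm{tr}(KHLH)$, where $H$ is the centring matrix, which estimates $\mathrm{HSIC}^2(X, Y; \kappa)$ defined above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(X, sigma=1.0):
    return np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased HSIC estimate: (1/n^2) tr(K H L H), with H = I - (1/n) 11^T."""
    n = len(X)
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
Y = X ** 2 + 0.1 * rng.normal(size=(200, 1))   # nonlinearly dependent on X
print(hsic_biased(X, Y))
```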
Distribution Regression
Supervised learning where labels are available at the group level rather than at the individual level.
[Figures from Flaxman et al, 2015 and Mooij et al, 2014: bags of individual-level observations (e.g. per-region samples of men and women) mapped to mean embeddings $\mu_1, \mu_2, \mu_3$ in feature space, with group-level labels such as % vote for Obama]
• classifying text based on word features [Yoshikawa et al, 2014; Kusner et al, 2015]
• aggregate voting behaviour of demographic groups [Flaxman et al, 2015; 2016]
• image labels based on a distribution of small patches [Szabo et al, 2016]
• "traditional" parametric statistical inference by learning a function from sets of samples to parameters: ABC [Mitrovic et al, 2016], EP [Jitkrittum et al, 2015]
• identifying the cause-effect direction between a pair of variables from a joint sample [Lopez-Paz et al, 2015]
Possible (distributional) covariate shift?
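A minimal sketch (not the method of any particular cited paper; Gaussian kernel, linear second-level kernel, synthetic bags and labels) of the basic distribution-regression pipeline: represent each bag $B_i$ by its empirical mean embedding, build the bag-level Gram matrix of embedding inner products, and fit kernel ridge regression from bags to labels.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(X, Y, sigma=1.0):
    return np.exp(-cdist(X, Y, "sqeuclidean") / (2.0 * sigma ** 2))

def bag_gram(bags, sigma=1.0):
    """G[i, j] = <mu_hat(B_i), mu_hat(B_j)>_H = mean over pairs of k(x, x')."""
    n = len(bags)
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            G[i, j] = G[j, i] = gaussian_gram(bags[i], bags[j], sigma).mean()
    return G

# Synthetic setup: each bag is a sample from N(theta_i, 1); the label is theta_i.
rng = np.random.default_rng(3)
thetas = rng.uniform(-2, 2, size=50)
bags = [rng.normal(t, 1.0, size=(100, 1)) for t in thetas]
y = thetas

# Kernel ridge regression on the bag-level Gram matrix.
lam = 1e-3
G = bag_gram(bags)
alpha = np.linalg.solve(G + lam * np.eye(len(bags)), y)

# Predict the label of a new bag from its cross-Gram row with the training bags.
new_bag = rng.normal(1.0, 1.0, size=(100, 1))
g_new = np.array([gaussian_gram(new_bag, b).mean() for b in bags])
print(g_new @ alpha)   # roughly recovers the mean of the new bag (about 1.0)
```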
Outline
1. Preliminaries on Kernel Embeddings
2. Testing and Learning on Distributions with Symmetric Noise Invariance
All possible differences between generating processes?
Differences discovered by an MMD two-sample test can be due to different types of measurement noise or data collection artefacts.
• With a large sample size, the test uncovers potentially irrelevant sources of variability: slightly different calibration of the data-collecting equipment, different numerical precision, different conventions for dealing with edge cases.
Learning on distributions: each label $y_i$ in supervised learning is associated with a whole bag of observations $B_i = \{X_{ij}\}_{j=1}^{N_i}$, assumed to come from a probability distribution $P_i$.
• Each bag of observations could be impaired by a different measurement noise process.
Distributional covariate shift: different measurement noise on test bags?
Both problems require encoding the distribution with a representation invariant to symmetric noise.
Testing and Learning on Distributions with Symmetric Noise Invariance. Ho Chung Leon Law, Christopher Yau, DS. http://arxiv.org/abs/1703.07596
Random Fourier features: Inverse Kernel Trick
Bochner's representation: assume that $k$ is a positive definite translation-invariant kernel on $\mathbb{R}^p$. Then $k$ can be written as
$$k(x, y) = \int_{\mathbb{R}^p} \exp\!\left(i\,\omega^\top (x - y)\right) d\Lambda(\omega) = \int_{\mathbb{R}^p} \left[ \cos(\omega^\top x)\cos(\omega^\top y) + \sin(\omega^\top x)\sin(\omega^\top y) \right] d\Lambda(\omega)$$
for some positive measure (w.l.o.g. a probability distribution) $\Lambda$.
Sample $m$ frequencies $\Omega = \{\omega_j\}_{j=1}^m \sim \Lambda$ and use a Monte Carlo estimator of the kernel function instead [Rahimi & Recht, 2007]:
$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^m \left[ \cos(\omega_j^\top x)\cos(\omega_j^\top y) + \sin(\omega_j^\top x)\sin(\omega_j^\top y) \right] = \left\langle \xi_\Omega(x), \xi_\Omega(y) \right\rangle_{\mathbb{R}^{2m}},$$
with an explicit set of features $\xi_\Omega : x \mapsto \frac{1}{\sqrt{m}} \left[ \cos(\omega_1^\top x), \sin(\omega_1^\top x), \ldots, \cos(\omega_m^\top x), \sin(\omega_m^\top x) \right]^\top$.
How fast does $m$ need to grow with $n$? It can be sublinear for regression [Bach, 2015].
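A minimal sketch (assuming the Gaussian RBF kernel, whose spectral measure is $\Lambda = \mathcal{N}(0, \sigma^{-2} I)$) of the random Fourier feature construction above: sample frequencies from $\Lambda$, build the cosine/sine feature map, and check that $\langle \xi_\Omega(x), \xi_\Omega(y) \rangle$ approximates $k(x, y)$.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def rff_features(X, omegas):
    """Cosine and sine features scaled by 1/sqrt(m); the cos block is stacked
    before the sin block (ordering differs from the slide, inner product is identical)."""
    m = omegas.shape[0]
    proj = X @ omegas.T                                   # (n, m) matrix of w_j^T x_i
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)   # (n, 2m)

rng = np.random.default_rng(4)
p, m, sigma = 3, 2000, 1.0
omegas = rng.normal(0.0, 1.0 / sigma, size=(m, p))        # frequencies from N(0, sigma^{-2} I)

x, y = rng.normal(size=p), rng.normal(size=p)
phi_x = rff_features(x[None, :], omegas)[0]
phi_y = rff_features(y[None, :], omegas)[0]

print(gaussian_kernel(x, y, sigma))   # exact kernel value
print(phi_x @ phi_y)                  # Monte Carlo approximation, close for large m
```

Once the data are mapped through $\xi_\Omega$, any linear method (ridge regression, logistic regression) on the $2m$-dimensional features approximates the corresponding kernel method, which is the "inverse kernel trick" referred to in the slide title.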