Approximate Kernel Methods and Learning on Aggregates

Dino Sejdinovic
joint work with Leon Law, Seth Flaxman, Dougal Sutherland, Kenji Fukumizu, Ewan Cameron, Tim Lucas, Katherine Battle (and many others)

Department of Statistics, University of Oxford

GPSS Workshop on Advances in Kernel Methods, Sheffield, 06/09/2018
Learning on Aggregates

• Supervised learning: obtaining inputs has a lower cost than obtaining outputs/labels, hence we build a (predictive) functional relationship or a conditional probabilistic model of outputs given inputs.
• Semi-supervised learning: because of the lower cost, there are many more unlabelled than labelled inputs.
• Weakly supervised learning on aggregates: because of the lower cost, inputs are at a much higher resolution than outputs.

Figure: left: malaria incidences reported per administrative unit; centre: land surface temperature at night; right: topographic wetness index.
Outline

1. Preliminaries on Kernels and GPs
2. Bayesian Approaches to Distribution Regression
3. Variational Learning on Aggregates with GPs
Reproducing Kernel Hilbert Space (RKHS)

Definition [Aronszajn, 1950; Berlinet & Thomas-Agnan, 2004]. Let X be a non-empty set and H a Hilbert space of real-valued functions defined on X. A function k : X × X → ℝ is called a reproducing kernel of H if:
1. ∀x ∈ X, k(·, x) ∈ H, and
2. ∀x ∈ X, ∀f ∈ H, ⟨f, k(·, x)⟩_H = f(x).
If H has a reproducing kernel, it is said to be a reproducing kernel Hilbert space.

• Equivalent to the notion of a kernel as an inner product of features: any function k : X × X → ℝ for which there exists a Hilbert space H and a map ϕ : X → H such that k(x, x') = ⟨ϕ(x), ϕ(x')⟩_H for all x, x' ∈ X. In particular, for any x, y ∈ X, k(x, y) = ⟨k(·, y), k(·, x)⟩_H = ⟨k(·, x), k(·, y)⟩_H, so H serves as a canonical feature space with feature map x ↦ k(·, x).
• Equivalently, all evaluation functionals f ↦ f(x) are continuous (norm convergence implies pointwise convergence).
• Moore–Aronszajn theorem: every positive semidefinite k : X × X → ℝ is a reproducing kernel and has a unique RKHS H_k.
• Example: the Gaussian RBF kernel k(x, x') = exp(−‖x − x'‖²₂ / (2γ²)) has an infinite-dimensional H, whose elements are the functions h(x) = ∑_{i=1}^n α_i k(x_i, x) together with their limits, giving the completion with respect to the inner product

⟨∑_{i=1}^n α_i k(x_i, ·), ∑_{j=1}^m β_j k(y_j, ·)⟩ = ∑_{i=1}^n ∑_{j=1}^m α_i β_j k(x_i, y_j).
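A minimal numpy sketch of the two displayed formulas (the helper name rbf_kernel, the bandwidth, and the random centres/coefficients are illustrative choices, not from the slides): it forms the inner product of two finite expansions and checks the reproducing property ⟨h, k(·, x)⟩_H = h(x) at a test point.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian RBF Gram matrix: k(x, x') = exp(-||x - x'||^2 / (2 * gamma^2))
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * gamma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))    # centres x_1, ..., x_n of h
Y = rng.normal(size=(3, 2))    # centres y_1, ..., y_m of g
alpha = rng.normal(size=5)     # coefficients of h = sum_i alpha_i k(x_i, .)
beta = rng.normal(size=3)      # coefficients of g = sum_j beta_j k(y_j, .)

# Inner product of the two finite expansions: <h, g>_H = alpha^T K(X, Y) beta
inner_hg = alpha @ rbf_kernel(X, Y) @ beta

# Reproducing property at a test point x: <h, k(., x)>_H = sum_i alpha_i k(x_i, x) = h(x)
x = rng.normal(size=(1, 2))
h_at_x = (alpha @ rbf_kernel(X, x)).item()
print(inner_hg, h_at_x)
```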
Kernel Trick and Kernel Mean Trick

• Kernel trick: the implicit feature map x ↦ k(·, x) ∈ H_k replaces x ↦ [φ_1(x), ..., φ_s(x)] ∈ ℝ^s; inner products are readily available, ⟨k(·, x), k(·, y)⟩_{H_k} = k(x, y).
• Nonlinear decision boundaries, nonlinear regression functions, learning on non-Euclidean/structured data [Cortes & Vapnik, 1995; Schölkopf & Smola, 2001].
• Kernel mean trick: the RKHS embedding, i.e. the implicit feature mean P ↦ μ_k(P) = E_{X∼P} k(·, X) ∈ H_k, replaces P ↦ [Eφ_1(X), ..., Eφ_s(X)] ∈ ℝ^s; inner products are easy to estimate, ⟨μ_k(P), μ_k(Q)⟩_{H_k} = E_{X∼P, Y∼Q} k(X, Y) [Smola et al, 2007; Sriperumbudur et al, 2010; Muandet et al, 2017].
• Nonparametric two-sample, independence, conditional independence and interaction testing; learning on distributions [Gretton et al, 2005; Gretton et al, 2006; Fukumizu et al, 2007; DS et al, 2013; Muandet et al, 2012; Szabo et al, 2015].
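A quick sketch of the "easy to estimate" claim (the rbf helper, bandwidth, sample sizes and the two Gaussian samples below are my own choices): the inner product of two embeddings is estimated by averaging the cross Gram matrix between the two samples.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Gaussian RBF kernel matrix, k(x, y) = exp(-||x - y||^2 / (2 * gamma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * gamma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(400, 2))   # sample from Q

# <mu_k(P), mu_k(Q)>_{H_k} = E_{X~P, Y~Q} k(X, Y): replace the expectation
# by the average of the n_x-by-n_y cross Gram matrix.
inner_PQ = rbf(X, Y).mean()
print(inner_PQ)
```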
Maximum Mean Discrepancy

Maximum Mean Discrepancy (MMD) [Borgwardt et al, 2006; Gretton et al, 2007] between P and Q:

MMD_k(P, Q) = ‖μ_k(P) − μ_k(Q)‖_{H_k} = sup_{f ∈ H_k : ‖f‖_{H_k} ≤ 1} |E_{X∼P} f(X) − E_{Y∼Q} f(Y)|.

• Characteristic kernels: MMD_k(P, Q) = 0 iff P = Q (MMD also metrizes weak* convergence [Sriperumbudur, 2010]).
• Examples: Gaussian RBF exp(−‖x − x'‖²₂ / (2σ²)), the Matérn family, inverse multiquadrics.
• Can encode structural properties in the data: kernels on non-Euclidean domains, networks, images, text...
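The supremum above is attained, after rescaling to unit RKHS norm, by the witness function f* ∝ μ_k(P) − μ_k(Q). A small sketch of its empirical version for a Gaussian P and a Laplace Q (the rbf helper, bandwidth, sample sizes and evaluation grid are illustrative choices):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Gaussian RBF kernel matrix, exp(-||x - y||^2 / (2 * sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(1000, 1))     # sample from P (Gaussian)
Y = rng.laplace(0.0, 1.0, size=(1000, 1))    # sample from Q (Laplace)
t = np.linspace(-6.0, 6.0, 200)[:, None]     # evaluation grid

# Empirical witness: hat{mu}_k(P)(t) - hat{mu}_k(Q)(t); its RKHS-normalised
# version attains the supremum in the MMD definition.
witness = rbf(t, X).mean(axis=1) - rbf(t, Y).mean(axis=1)
print(t[np.argmax(np.abs(witness)), 0])      # where the two samples differ most
```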
GPs and RKHSs: shared mathematical foundations

• The same notion of a (positive definite) kernel, but conceptual gaps between the communities.
• Orthogonal projection in RKHSs ⇔ conditioning in GPs.
• Beware of 0/1 laws: GP sample paths with an (infinite-dimensional) covariance kernel k almost surely fall outside of H_k. But the space of sample paths is only slightly larger than H_k (an "outer shell"), and it is typically also an RKHS (with another kernel).
• Worst case in RKHSs ⇔ average case in GPs:

MMD²(P, Q; H_k) = ( sup_{‖f‖_{H_k} ≤ 1} (Pf − Qf) )² = E_{f ∼ GP(0,k)} [ (Pf − Qf)² ].

Radford Neal, 1998: "prior beliefs regarding the true function being modeled and expectations regarding the properties of the best predictor for this function [...] need not be at all similar."

Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences. M. Kanagawa, P. Hennig, DS, and B. K. Sriperumbudur. arXiv:1807.02582, https://arxiv.org/abs/1807.02582
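A finite-domain sanity check of the worst-case/average-case identity above (the grid, the two discrete distributions and the kernel bandwidth are illustrative choices): on a grid, a GP draw is just a multivariate normal vector with covariance K, MMD² has the closed form (p − q)ᵀK(p − q), and a Monte Carlo average of (Pf − Qf)² over GP draws should match it.

```python
import numpy as np

# Finite domain: points on a grid, so functions are vectors and GP(0, k)
# restricted to the grid is a multivariate normal with covariance K.
x = np.linspace(-3.0, 3.0, 50)[:, None]
K = np.exp(-((x - x.T) ** 2) / (2 * 0.5 ** 2))   # Gaussian RBF kernel matrix

# Two discrete distributions p, q on the grid (illustrative choices).
p = np.exp(-0.5 * (x[:, 0] - 1.0) ** 2); p /= p.sum()
q = np.exp(-0.5 * (x[:, 0] + 1.0) ** 2); q /= q.sum()

# Worst case / closed form: MMD^2 = (p - q)^T K (p - q) = ||mu_P - mu_Q||^2_{H_k}.
d = p - q
mmd2 = d @ K @ d

# Average case: draw f ~ GP(0, k) on the grid and average (Pf - Qf)^2.
rng = np.random.default_rng(2)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # small jitter for stability
f = L @ rng.normal(size=(len(x), 20000))           # columns are GP draws
avg_case = np.mean((p @ f - q @ f) ** 2)

print(mmd2, avg_case)                              # the two numbers should be close
```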
Some uses of MMD

MMD has been applied to:
• two-sample tests and independence tests (on graphs, text, audio...) [Gretton et al, 2009; Gretton et al, 2012]
• model criticism and interpretability [Lloyd & Ghahramani, 2015; Kim, Khanna & Koyejo, 2016]
• analysis of Bayesian quadrature [Briol et al, 2018]
• ABC summary statistics [Park, Jitkrittum & DS, 2015; Mitrovic, DS & Teh, 2016]
• summarising streaming data [Paige, DS & Wood, 2016]
• traversal of manifolds learned by convolutional nets [Gardner et al, 2015]
• MMD-GAN: training deep generative models [Dziugaite, Roy & Ghahramani, 2015; Sutherland et al, 2017; Li et al, 2017]

Figure (by Arthur Gretton): within-sample average similarity, k(dog_i, dog_j) and k(fish_i, fish_j), versus between-sample average similarity, k(dog_i, fish_j).

MMD²_k(P, Q) = E_{X,X' ∼ P} k(X, X') + E_{Y,Y' ∼ Q} k(Y, Y') − 2 E_{X∼P, Y∼Q} k(X, Y), with X, X' and Y, Y' i.i.d.
From samples X_1, ..., X_{n_x} ∼ P and Y_1, ..., Y_{n_y} ∼ Q, the corresponding unbiased estimate is

MMD²_k(P, Q) ≈ 1/(n_x(n_x − 1)) ∑_{i ≠ j} k(X_i, X_j) + 1/(n_y(n_y − 1)) ∑_{i ≠ j} k(Y_i, Y_j) − 2/(n_x n_y) ∑_{i,j} k(X_i, Y_j).
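A direct numpy transcription of this estimator (the rbf helper, bandwidth and the two Gaussian samples are illustrative choices, not from the slides):

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Gaussian RBF kernel matrix, k(x, y) = exp(-||x - y||^2 / (2 * gamma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * gamma ** 2))

def mmd2_unbiased(X, Y, gamma=1.0):
    # Unbiased estimate of MMD_k^2(P, Q) from samples X ~ P and Y ~ Q:
    # drop the diagonal (i = j) terms of the within-sample Gram matrices.
    nx, ny = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(300, 2))   # sample from P
Y = rng.normal(0.3, 1.0, size=(300, 2))   # sample from Q
print(mmd2_unbiased(X, Y, gamma=1.0))     # positive in expectation since P != Q
```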