Prediction in kernelized output spaces: output kernel trees and ensemble methods

Pierre Geurts (Department of EE and CS, University of Liège, Belgium)
Florence d'Alché-Buc (IBISC CNRS, Université d'Evry, GENOPOLE, Evry, France)

January 25, 2007
Motivation

In many domains (e.g. text, computational biology), we want to predict complex or structured outputs: graphs, time series, classes with hierarchical relations, positions in a graph, images...

The main goal of our research team is to develop machine learning tools to extract structures. We address this issue in several ways, through text and systems biology applications:
- learning the structure of BNs (Bayesian networks) and DBNs (dynamic Bayesian networks): unsupervised approaches
- learning interactions as a classification concept: supervised and semi-supervised approaches
- learning a mapping between structures when input and output are strongly dependent: supervised approaches
- learning a mapping between input feature vectors and structured outputs (this talk)
Supervised learning with structured outputs

- Example 1: image reconstruction
- Example 2: find the position of a gene/protein/enzyme in a biological network from various biological descriptors (function of the protein, localization, expression data)

Very few solutions exist for these tasks (one precursor: KDE, Kernel Dependency Estimation), and none are explanatory.

We present a set of methods for handling complex outputs that have some explanatory power and illustrate them on these two problems, with a main focus on biological network completion:
- Output Kernel Tree: an extension of regression trees to kernelized output spaces
- Ensemble methods devoted to regressors in kernelized output spaces
Outline

1. Motivation
2. Supervised learning in kernelized output spaces
3. Output Kernel Tree
4. Ensemble methods
   - Parallel ensemble methods
   - Gradient boosting
5. Experiments
   - Image reconstruction
   - Completion of biological networks
   - Boosting
6. Conclusion and future works
Supervised learning with complex outputs

- Suppose we have a sample of objects $\{o_i, i = 1, \dots, N\}$ drawn from a fixed but unknown probability distribution.
- Suppose we have two representations of the objects:
  - an input feature vector representation: $x_i = x(o_i) \in \mathcal{X}$
  - an output representation: $y_i = y(o_i) \in \mathcal{Y}$, where $\mathcal{Y}$ is not necessarily a vector space (it can be a finite set with complex relations between elements)
- From a learning sample $\{(x_i, y_i) \mid i = 1, \dots, N\}$ with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, find a function $h: \mathcal{X} \to \mathcal{Y}$ that minimizes the expectation of some loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ over the joint distribution of input/output pairs: $E_{x,y}\{\ell(h(x), y)\}$
- Complex outputs: no constraint (for the moment) on the nature of $\mathcal{Y}$
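In practice, the expected loss above is not available and is replaced by its empirical counterpart over the learning sample (a standard surrogate, stated here for completeness; the class of candidate functions $h$ is left implicit):

$$\hat{h} \in \arg\min_{h} \frac{1}{N} \sum_{i=1}^{N} \ell(h(x_i), y_i)$$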
General approach

[figure]
Use the kernel trick for the outputs

Additional information to the training set: a Gram matrix $K = (k_{ij})$, with $k_{ij} = k(y_i, y_j)$ and $k$ a Mercer kernel with corresponding mapping $\phi$ such that $k(y, y') = \langle \phi(y), \phi(y') \rangle$.

Approach:
1. Approximate the feature map $\phi$ with a function $h_\phi: \mathcal{X} \to \mathcal{H}$ defined on the input space
2. Get a prediction in the original output space by approximating the function $\phi^{-1}$ (pre-image problem)
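A minimal sketch of this two-step scheme in Python. The output Gram matrix computation is standard; the regressor-fitting and pre-image steps are only indicated through placeholder names (`fit_feature_space_regressor`, `preimage`), since the next slides instantiate them with output kernel trees:

```python
import numpy as np

def output_gram_matrix(Y, kernel):
    """Gram matrix K with K[i, j] = k(y_i, y_j) over the training outputs."""
    N = len(Y)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(Y[i], Y[j])
    return K

# Two-step approach (placeholder names, for illustration only):
# 1. learn h_phi : X -> H from the inputs X and the output Gram matrix K,
#    e.g. h_phi = fit_feature_space_regressor(X, K)
# 2. for a new input x, map the feature-space prediction back to Y by
#    (approximately) solving the pre-image problem,
#    e.g. y_hat = preimage(h_phi(x), Y, kernel)
```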
Possible applications

- Learning a mapping from an input vector into a structured output (graphs, sequences, trees, time series...)
- Learning with alternative loss functions (hierarchical classification, for instance; see the note below)
- Learning a kernel as a function of some inputs
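For the second application, note that the output kernel implicitly defines a loss on $\mathcal{Y}$, namely the squared distance in the output feature space, which is computable from kernel values only:

$$\ell(y, y') = \|\phi(y) - \phi(y')\|^2 = k(y, y) - 2\,k(y, y') + k(y', y')$$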
Learning a kernel as a function of some inputs

- In some applications, we want to learn a relationship between objects rather than an output (e.g. the network completion problem)
- Learning data set: $\{x_i \mid i = 1, \dots, N\}$ together with the Gram matrix $K = (k_{ij})$, $i, j = 1, \dots, N$
- In this case, we can make kernel predictions from predictions in $\mathcal{H}$ (without needing pre-images), as in the sketch below: $g(x, x') = \langle h_\phi(x), h_\phi(x') \rangle$
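A minimal sketch of this kernel prediction, under the assumption (true for the tree-based models introduced next, whose predictions are leaf averages of the $\phi(y_i)$) that $h_\phi(x) = \sum_i w_i(x)\,\phi(y_i)$ for some weights over the training outputs; the predicted kernel then needs only the output Gram matrix:

```python
import numpy as np

def predict_output_kernel(w_x, w_xp, K):
    """Predicted output kernel g(x, x') = <h_phi(x), h_phi(x')>.

    Assumes h_phi(x) = sum_i w_x[i] * phi(y_i), so that
    <h_phi(x), h_phi(x')> = w_x^T K w_xp, with K the output Gram matrix.
    """
    return float(w_x @ K @ w_xp)
```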
Standard regression trees

- A learning algorithm that solves the regression problem ($\mathcal{Y} = \mathbb{R}$ and $\ell(y_1, y_2) = (y_1 - y_2)^2$) with a tree-structured model
- Basic idea of the learning procedure:
  - Recursively split the learning sample with tests based on the inputs, trying to reduce as much as possible the variance of the output
  - Stop when the output is constant in the leaf (or some stopping criterion is met)
Focus on regression trees with multiple outputs

$\mathcal{Y} = \mathbb{R}^n$ and $\ell(y_1, y_2) = \|y_1 - y_2\|^2$

The algorithm is the same, but the best split is the one that maximizes the variance reduction:

$$\mathrm{Score}_R(\mathrm{Test}, S) = \mathrm{var}\{y \mid S\} - \frac{N_l}{N}\,\mathrm{var}\{y \mid S_l\} - \frac{N_r}{N}\,\mathrm{var}\{y \mid S_r\},$$

where $N$ is the size of $S$, $N_l$ (resp. $N_r$) the size of $S_l$ (resp. $S_r$), and $\mathrm{var}\{y \mid S\}$ denotes the variance of the output $y$ in the subset $S$:

$$\mathrm{var}\{y \mid S\} = \frac{1}{N}\sum_{i=1}^{N} \|y_i - \bar{y}\|^2 \quad \text{with} \quad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i,$$

which is the average squared distance to the center of mass.
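A minimal sketch of these two formulas for vector-valued outputs (rows of a numpy array); the candidate split is represented here simply by a boolean mask, an illustrative choice rather than the authors' implementation:

```python
import numpy as np

def output_variance(Y):
    """Mean squared distance of the outputs (rows of Y) to their center of mass."""
    center = Y.mean(axis=0)
    return np.mean(np.sum((Y - center) ** 2, axis=1))

def split_score(Y, left_mask):
    """Variance reduction Score_R(Test, S) for a candidate binary split.

    Assumes both children are non-empty.
    """
    N = len(Y)
    Yl, Yr = Y[left_mask], Y[~left_mask]
    return (output_variance(Y)
            - len(Yl) / N * output_variance(Yl)
            - len(Yr) / N * output_variance(Yr))
```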
Regression trees in the output feature space

- Suppose we have access to an output Gram matrix $k(y_i, y_j)$, with $k$ a kernel defined on $\mathcal{Y} \times \mathcal{Y}$ (with corresponding feature map $\phi: \mathcal{Y} \to \mathcal{H}$ such that $k(y_i, y_j) = \langle \phi(y_i), \phi(y_j) \rangle$)
- The idea is to grow a multiple-output regression tree in the output feature space:
  - The variance becomes
    $$\mathrm{var}\{\phi(y) \mid S\} = \frac{1}{N}\sum_{i=1}^{N} \Big\|\phi(y_i) - \frac{1}{N}\sum_{j=1}^{N} \phi(y_j)\Big\|^2$$
  - Predictions at leaf nodes become pre-images of the centers of mass:
    $$\hat{y}_L = \phi^{-1}\Big(\frac{1}{N_L}\sum_{i=1}^{N_L} \phi(y_i)\Big)$$
- We need to express everything in terms of kernel values only and return to the original output space $\mathcal{Y}$
Kernelization

The variance may be written:

$$\begin{aligned}
\mathrm{var}\{\phi(y) \mid S\} &= \frac{1}{N}\sum_{i=1}^{N}\Big\|\phi(y_i) - \frac{1}{N}\sum_{j=1}^{N}\phi(y_j)\Big\|^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\langle \phi(y_i), \phi(y_i)\rangle - \frac{1}{N^2}\sum_{i,j=1}^{N}\langle \phi(y_i), \phi(y_j)\rangle,
\end{aligned}$$

which makes use only of dot products between vectors in the output feature space. We can use the kernel trick and replace these dot products by kernel values:

$$\mathrm{var}\{\phi(y) \mid S\} = \frac{1}{N}\sum_{i=1}^{N} k(y_i, y_i) - \frac{1}{N^2}\sum_{i,j=1}^{N} k(y_i, y_j)$$

From kernel values only, we can thus grow a regression tree that minimizes the output feature space variance.
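A minimal sketch of this computation from a Gram submatrix, written directly from the formula above (an illustration, not the authors' code):

```python
import numpy as np

def kernel_variance(K_S):
    """Output feature space variance of a subset S from its Gram submatrix.

    K_S[i, j] = k(y_i, y_j) for the outputs y_i in S; implements
    (1/N) * sum_i k(y_i, y_i) - (1/N^2) * sum_{i,j} k(y_i, y_j).
    """
    N = K_S.shape[0]
    return np.trace(K_S) / N - K_S.sum() / N ** 2
```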
Prediction in the original output space

Each leaf is associated with a subset of outputs from the learning sample:

$$\hat{y}_L = \phi^{-1}\Big(\frac{1}{N_L}\sum_{i=1}^{N_L}\phi(y_i)\Big)$$

Generic proposal for the pre-image problem: find the output in the leaf closest to the center of mass:

$$\hat{y}_L = \arg\min_{y' \in \{y_1, \dots, y_{N_L}\}} \Big\|\phi(y') - \frac{1}{N_L}\sum_{i=1}^{N_L}\phi(y_i)\Big\|^2 = \arg\min_{y' \in \{y_1, \dots, y_{N_L}\}} \Big( k(y', y') - \frac{2}{N_L}\sum_{i=1}^{N_L} k(y_i, y') \Big)$$
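A minimal sketch of this pre-image selection from the leaf's Gram submatrix (illustrative code; the constant term of the squared distance is dropped, as in the right-hand side above):

```python
import numpy as np

def leaf_preimage_index(K_L):
    """Index, within the leaf, of the output closest to the leaf's center of mass.

    K_L is the Gram submatrix of the outputs reaching the leaf; the candidate
    minimizing k(y', y') - (2/N_L) * sum_i k(y_i, y') is returned.
    """
    N_L = K_L.shape[0]
    scores = np.diag(K_L) - 2.0 / N_L * K_L.sum(axis=0)
    return int(np.argmin(scores))
```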
Ensemble methods

Parallel ensemble methods based on randomization:
- Grow several models in parallel and average their predictions
- Greatly improve the accuracy of single regressors by reducing their variance
- Usually, they can be applied directly (e.g., bagging, random forests, extra-trees)

Boosting algorithms:
- Grow the models in sequence by focusing on "difficult" examples
- Need to be extended to regressors with kernelized outputs
- We propose a kernelization of gradient boosting approaches (Friedman, 2001)
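A minimal sketch of the averaging step for output kernel trees, under the assumption that each tree's feature-space prediction is a weight vector over the training outputs (its leaf center of mass); the ensemble prediction is then the averaged weight vector, and kernel predictions or pre-images can be computed from it exactly as for a single tree:

```python
import numpy as np

def average_tree_predictions(per_tree_weights):
    """Ensemble prediction in the output feature space.

    per_tree_weights: list of weight vectors w_t such that
    h_phi^t(x) = sum_i w_t[i] * phi(y_i) (leaf center of mass of tree t).
    The averaged vector represents (1/T) * sum_t h_phi^t(x).
    """
    return np.mean(np.stack(per_tree_weights, axis=0), axis=0)
```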