Reproducing Kernel Hilbert Spaces for Classification

Katarina Domijan and Simon P. Wilson
Department of Statistics, University of Dublin, Trinity College, Ireland

Working Group on Statistical Learning, November 1, 2005
General problem

• Regression problem.
• Data are available: $(X_1, Y_1), \ldots, (X_n, Y_n)$; $X_i \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$.
• The aim is to find $f(X)$ for predicting $Y$ given the values of $X$.
• Linear model: $Y = f(X) + \epsilon$, where $E(\epsilon) = 0$ and $\epsilon$ is independent of $X$; $f(X) = X^T \beta$ for a set of parameters $\beta$.
• Another approach is to use linear basis expansions.
• Replace $X$ with a transformation of it, and then use a linear model in the new space of input features.
General problem cont’d

• Let $h_m(X) : \mathbb{R}^p \mapsto \mathbb{R}$ be the $m$th transformation of $X$.
• Then $f(X) = \sum_{m=1}^{M} h_m(X) \beta_m$.
• Examples of $h_m(X)$ are polynomial and trigonometric expansions, e.g. $X_1^3$, $X_1 X_2$, $\sin(X_1)$, etc.
• Classical solution: use least squares to estimate $\beta$ in $f(X)$: $\hat{\beta} = (H^T H)^{-1} H^T y$ (a code sketch follows this slide).
• Bayesian solution: place a multivariate normal prior on the $\beta$'s. The likelihood is
  $$f(Y \mid X, \beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(y_i - f(x_i))^2}.$$
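A minimal sketch (in Python/NumPy) of the classical least squares estimate with a basis expansion. The particular basis used here, $(1, x, x^2, \sin x)$, and the simulated data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Minimal sketch of least squares with a linear basis expansion,
# f(X) = sum_m h_m(X) beta_m.  The basis (1, x, x^2, sin x) is illustrative only.
def basis(x):
    """h_1(x), ..., h_M(x) for a scalar input x."""
    return np.array([1.0, x, x**2, np.sin(x)])

def fit_beta(x, y):
    """Classical solution: beta_hat = (H^T H)^{-1} H^T y, via a stable solver."""
    H = np.vstack([basis(xi) for xi in x])            # n x M design matrix
    beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
    return beta_hat

# Toy usage with simulated data
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 50)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
beta_hat = fit_beta(x, y)
```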
Example: a cubic spline

• Assume $X$ is one-dimensional.
• Divide the domain of $X$ into contiguous intervals.
• $f$ is represented by a separate polynomial in each interval.
• The basis functions are:
  $$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = X^2, \quad h_4(X) = X^3, \quad h_5(X) = (X - \psi_1)^3_+, \quad h_6(X) = (X - \psi_2)^3_+.$$
• $\psi_1$ and $\psi_2$ are knots.
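A short sketch of these six truncated-power basis functions. The knot locations used as defaults here are placeholders, not values from the slides.

```python
import numpy as np

# Truncated power basis for a cubic spline with two knots psi_1 and psi_2.
# The default knot locations are hypothetical, for illustration only.
def cubic_spline_basis(x, knots=(3.0, 5.0)):
    """Return (h_1(x), ..., h_6(x)) = (1, x, x^2, x^3, (x - psi_1)^3_+, (x - psi_2)^3_+)."""
    psi1, psi2 = knots
    return np.array([
        1.0,
        x,
        x**2,
        x**3,
        max(x - psi1, 0.0) ** 3,   # (x - psi_1)^3_+
        max(x - psi2, 0.0) ** 3,   # (x - psi_2)^3_+
    ])
```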
Example: a cubic spline

[Figure: a cubic spline $f(x)$ plotted against $x$, with the knots $\psi_1$ and $\psi_2$ marked.]
Use in classification

• Let the outputs $Y$ take values in a discrete set.
• We want to divide the input space into a collection of regions labelled according to the classification.
• For $Y \in \{0, 1\}$, the model is:
  $$\log \frac{P(Y=1 \mid X=x)}{P(Y=0 \mid X=x)} = f(x).$$
  Hence:
  $$P(Y=1 \mid X=x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.$$
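A minimal sketch of this inverse-logit map from $f(x)$ to $P(Y=1 \mid X=x)$, written in a numerically stable form.

```python
import numpy as np

def p_y1_given_x(f_x):
    """P(Y = 1 | X = x) = e^{f(x)} / (1 + e^{f(x)}), computed stably as 1 / (1 + e^{-f(x)})."""
    return 1.0 / (1.0 + np.exp(-np.asarray(f_x)))

# A point x would be assigned to class 1 whenever this probability exceeds 0.5,
# i.e. whenever f(x) > 0, so f(x) = 0 defines the decision boundary.
```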
Regularisation

• Let’s move from cubic splines to consider all $f$ that are twice continuously differentiable.
• Many such $f$ will have $\sum_{i=1}^{n} (y_i - f(x_i))^2 = 0$.
• So we look at the penalized RSS (approximated numerically in the sketch after this slide):
  $$RSS(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 \, dt.$$
• The second term encourages splines with a slowly changing slope.
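A hedged sketch of how $RSS(f, \lambda)$ could be evaluated numerically for a given candidate $f$: the curvature integral is approximated by second differences on a fine grid. The grid and quadrature rule are assumptions for illustration, not part of the slides.

```python
import numpy as np

# Sketch: RSS(f, lambda) = sum_i (y_i - f(x_i))^2 + lambda * integral (f''(t))^2 dt,
# with the integral approximated by a second-difference quadrature on a uniform grid.
def penalized_rss(f, x, y, lam, grid):
    resid = np.sum((y - f(x)) ** 2)                 # data-fit term (f is vectorised)
    h = grid[1] - grid[0]                           # grid spacing (assumed uniform)
    second_deriv = np.diff(f(grid), 2) / h**2       # finite-difference f''(t)
    penalty = np.sum(second_deriv**2) * h           # approximate integral of (f'')^2
    return resid + lam * penalty
```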
Regularisation cont’d

• When $\lambda = 0$, $f$ can be any function that interpolates the data.
• When $\lambda = \infty$, $f$ is the least squares line fit.
• Note that the problem is defined on an infinite-dimensional function space.
• However, the solution is finite-dimensional and unique:
  $$f(x) = \sum_{j=1}^{n} D_j(x) \beta_j,$$
  where the $D_j(x)$ are an $n$-dimensional set of basis functions representing a family of natural splines.
• Natural splines have additional constraints that force the function to be linear beyond the boundary knots.
Regularisation cont’d

• Clearly, all inference about $f$ is inference about $\beta = (\beta_0, \beta_1, \ldots, \beta_n)$.
• The least squares solution can be shown to be:
  $$\hat{\beta} = (D^T D + \lambda \Phi_D)^{-1} D^T y,$$
  where $D$ and $\Phi_D$ are matrices with elements
  $$\{D\}_{i,j} = D_j(x_i) \quad \text{and} \quad \{\Phi_D\}_{j,k} = \int D_j''(t) D_k''(t) \, dt,$$
  respectively.
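A minimal sketch of this closed-form estimate, assuming the basis matrix $D$ and penalty matrix $\Phi_D$ have already been computed by some natural-spline routine (not shown here).

```python
import numpy as np

# beta_hat = (D^T D + lambda * Phi_D)^{-1} D^T y, solved without forming an explicit inverse.
# D has entries {D}_{i,j} = D_j(x_i); Phi_D has entries integral D_j''(t) D_k''(t) dt.
def smoothing_spline_coefficients(D, Phi_D, y, lam):
    A = D.T @ D + lam * Phi_D
    return np.linalg.solve(A, D.T @ y)
```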
Generalisation

• We can generalise this to higher dimensions.
• Suppose $X \in \mathbb{R}^2$ and we solve
  $$\min_f \left[ \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda J(f) \right],$$
• $J(f)$ is the penalty term; an example of it is
  $$J(f) = \int\!\!\int_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f(x)}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f(x)}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2.$$
Generalisation cont’d

• Optimizing with this penalty leads to a thin plate spline.
• The solution can be written as a linear expansion of basis functions:
  $$f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{n} \alpha_j h_j(x),$$
  where the $h_j$ are radial basis functions:
  $$h_j(x) = \|x - x_j\|^2 \log(\|x - x_j\|).$$
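A small sketch of this radial basis function, using the usual convention that $h_j(x_j) = 0$ (since $r^2 \log r \to 0$ as $r \to 0$); that convention is an implementation assumption, not stated on the slide.

```python
import numpy as np

# Thin-plate-spline radial basis h_j(x) = ||x - x_j||^2 log(||x - x_j||).
def tps_basis(x, x_j):
    r = np.linalg.norm(np.asarray(x) - np.asarray(x_j))
    return 0.0 if r == 0.0 else r**2 * np.log(r)

# The fitted surface would then be evaluated as
# f(x) = beta_0 + beta @ x + sum_j alpha_j * tps_basis(x, x_j).
```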
Most general case

• The general class of problems can be represented as:
  $$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda J(f) \right], \qquad (1)$$
• $L(y_i, f(x_i))$ is a loss function, e.g. $(y_i - f(x_i))^2$,
• $J(f)$ is the penalty term,
• $\mathcal{H}$ is the space on which $J(f)$ is defined.
• A general functional form can be used for $J(f)$. See Girosi et al. (1995).
• The solution can be written in terms of a finite number of coefficients.
Reproducing Kernel Hilbert Spaces (RKHS)

• This is a subclass of the problems on the previous slide.
• Let $\phi_1, \phi_2, \ldots$ be an infinite sequence of basis functions.
• $\mathcal{H}_K$ is defined to be the space of $f$'s such that:
  $$\mathcal{H}_K = \left\{ f(x) \,\middle|\, f(x) = \sum_{i=1}^{\infty} c_i \phi_i(x) \right\}.$$
• Let $K$ be a positive definite kernel with an eigen-expansion:
  $$K(x_1, x_2) = \sum_{i=1}^{\infty} \gamma_i \phi_i(x_1) \phi_i(x_2), \qquad (2)$$
  where $\gamma_i \geq 0$ and $\sum_{i=1}^{\infty} \gamma_i^2 < \infty$.
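As an empirical analogue of the eigen-expansion (2), the $n \times n$ Gram matrix of a positive definite kernel evaluated at the data has non-negative eigenvalues. A sketch using a Gaussian kernel; the kernel choice, bandwidth, and simulated inputs are illustrative assumptions.

```python
import numpy as np

# Empirical analogue of (2): eigendecomposition of the Gram matrix of a
# positive definite kernel at the observed inputs.  The Gaussian kernel and
# its bandwidth theta are illustrative choices.
def gram_matrix(X, theta=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / theta)

X = np.random.default_rng(1).normal(size=(20, 2))
K = gram_matrix(X)
gamma, phi = np.linalg.eigh(K)   # eigenvalues gamma_i >= 0 (up to rounding error)
```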
RKHS cont’d

• Define $J(f)$ to be:
  $$J(f) = \|f\|^2_{\mathcal{H}_K} = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i} < \infty.$$
• $J(f)$ penalizes functions with large coefficients on eigenfunctions with small eigenvalues in the expansion (2).
• Wahba (1990) shows that (1) with these $f$ and $J$ has a finite-dimensional solution given by:
  $$f(x) = \sum_{i=1}^{n} \beta_i K(x, x_i).$$
RKHS cont’d

• Given this, the problem in (1) reduces to a finite-dimensional optimization:
  $$\min_{\beta} \left[ L(y, K\beta) + \lambda \beta^T K \beta \right],$$
  where $K$ is an $n \times n$ matrix with elements $\{K\}_{i,j} = K(x_i, x_j)$.
• Hence, the problem is defined entirely in terms of $L$ and $K$!
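For the squared-error loss $L(y, K\beta) = \|y - K\beta\|^2$ this finite-dimensional problem has a closed form: setting the gradient to zero gives $K(K\beta + \lambda\beta - y) = 0$, so $\hat{\beta} = (K + \lambda I)^{-1} y$ when $K$ is invertible. A minimal sketch of that special case; the squared loss is only one instance of the general loss on this slide.

```python
import numpy as np

# Squared-loss special case of min_beta L(y, K beta) + lambda * beta^T K beta:
# beta_hat = (K + lambda I)^{-1} y, with K the n x n kernel matrix {K}_{ij} = K(x_i, x_j).
def rkhs_squared_loss_solution(K, y, lam):
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict(K_new, beta):
    """Evaluate f(x) = sum_i beta_i K(x, x_i); K_new holds K(x_new, x_i) row-wise."""
    return K_new @ beta
```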
Bayesian RKHS for classification

• Mallick et al. (2005): molecular classification of two types of tumour using cDNA microarrays.
• The data have undergone within- and between-slide normalization.
• There are $p$ genes and $n$ tumour samples, so $x_{i,j}$ is a measurement of the expression level of the $j$th gene for the $i$th sample.
• They wish to model $p(y \mid x)$ and use it to predict future observations.
• Assume latent variables $z_i$ such that:
  $$p(y \mid z) = \prod_{i=1}^{n} p(y_i \mid z_i),$$
  and
  $$z_i = f(x_i) + \epsilon_i, \quad i = 1, \ldots, n, \quad \epsilon_i \sim \text{i.i.d. } N(0, \sigma^2).$$
Bayesian RKHS for classification

• To develop the complete model, they need to specify $p(y \mid z)$ and $f$.
• $f(x)$ is modeled by the RKHS approach.
• Their kernel choices are Gaussian and polynomial.
• Both kernels contain only one parameter $\theta$, e.g. the Gaussian:
  $$K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \theta).$$
• Hence, the random variable $z_i$ is modeled by:
  $$z_i = f(x_i) + \epsilon_i = \beta_0 + \sum_{j=1}^{n} \beta_j K(x_i, x_j \mid \theta) + \epsilon_i, \quad i = 1, \ldots, n, \quad \epsilon_i \sim \text{i.i.d. } N(0, \sigma^2).$$
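A hedged sketch of generating the latent $z_i$ from given values of $\beta_0$, $\beta$, $\theta$, $\sigma$. All inputs are placeholders supplied by the caller, not estimates from the paper.

```python
import numpy as np

# Gaussian kernel K(x_i, x_j | theta) = exp(-||x_i - x_j||^2 / theta) and the
# latent model z_i = beta_0 + sum_j beta_j K(x_i, x_j | theta) + eps_i.
def gaussian_kernel_matrix(X, theta):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / theta)

def latent_z(X, beta0, beta, theta, sigma, rng):
    K = gaussian_kernel_matrix(X, theta)              # n x n kernel matrix
    f = beta0 + K @ beta                              # f(x_i) for each tumour sample
    return f + rng.normal(scale=sigma, size=f.shape)  # add i.i.d. N(0, sigma^2) noise
```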
Bayesian RKHS for classification

• The Bayesian formulation requires priors to be assigned to $\beta$, $\theta$, and $\sigma^2$.
• The model is specified as:
  $$z_i \mid \beta, \theta, \sigma^2 \sim N(z_i \mid K_i' \beta, \sigma^2),$$
  $$\beta, \sigma^2 \sim N(\beta \mid 0, \sigma^2 M^{-1}) \, IG(\sigma^2 \mid \gamma_1, \gamma_2),$$
  $$\theta \sim \prod_{q=1}^{p} U(a_{1q}, a_{2q}),$$
  where $K_i' = (1, K(x_i, x_1 \mid \theta), \ldots, K(x_i, x_n \mid \theta))$ and $M$ is a diagonal matrix with elements $\xi = (\xi_1, \ldots, \xi_{n+1})$.
• Jeffreys’ independence prior $p(\xi) \propto \prod_{i=1}^{n+1} \xi_i^{-1}$ promotes sparseness (Figueiredo, 2002).
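A sketch of a single draw from these priors. The inverse-gamma is obtained by inverting a gamma draw in the shape/scale parameterization; that parameterization, the hyperparameter values, and the exact form used by Mallick et al. are assumptions here, not facts from the slides.

```python
import numpy as np

# One prior draw: sigma^2 ~ IG(gamma1, gamma2),
# beta | sigma^2 ~ N(0, sigma^2 M^{-1}) with M = diag(xi), theta_q ~ U(a1q, a2q).
def draw_prior(xi, gamma1, gamma2, a1, a2, rng):
    sigma2 = 1.0 / rng.gamma(gamma1, 1.0 / gamma2)       # inverse-gamma via 1 / gamma draw
    cov = sigma2 * np.diag(1.0 / np.asarray(xi))         # sigma^2 * M^{-1}
    beta = rng.multivariate_normal(np.zeros(len(xi)), cov)
    theta = rng.uniform(np.asarray(a1), np.asarray(a2))  # one uniform per component of theta
    return beta, sigma2, theta
```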
Bayesian RKHS for classification

• $p(y \mid z)$ is modeled on the basis of a loss function.
• The two models considered in the paper are logistic regression and the SVM.
• The logistic regression approach:
  $$p(y_i \mid z_i) = [p_i(z_i)]^{y_i} [1 - p_i(z_i)]^{(1 - y_i)}, \quad p_i(z_i) = \frac{e^{z_i}}{1 + e^{z_i}}.$$
• It follows that the log-likelihood is equal to:
  $$\sum_{i=1}^{n} y_i z_i - \sum_{i=1}^{n} \log(1 + e^{z_i}).$$
• So the loss function is given by $L(y_i, z_i) = y_i z_i - \log(1 + e^{z_i})$.
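A minimal sketch of this log-likelihood, using a numerically stable form of $\log(1 + e^{z})$.

```python
import numpy as np

# Logistic log-likelihood: sum_i [ y_i z_i - log(1 + e^{z_i}) ] for y_i in {0, 1}.
def logistic_log_likelihood(y, z):
    y, z = np.asarray(y), np.asarray(z)
    return np.sum(y * z - np.logaddexp(0.0, z))   # logaddexp(0, z) = log(1 + e^z)
```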
Bayesian RKHS for classification

• MCMC is used to sample from the posterior $p(\beta, \theta, z, \lambda, \sigma^2 \mid y)$.
• Proposed work:
  - variable selection:
    - kernel selection (which $\beta_i = 0$?)
    - regressor selection (which $x_i$ to ignore?)
  - more than two classes (multivariate logistic regression).