Reproducing Kernel Hilbert Spaces for Classification


1. Reproducing Kernel Hilbert Spaces for Classification
Katarina Domijan and Simon P. Wilson
Department of Statistics, University of Dublin, Trinity College, Ireland
November 1, 2005, Working Group on Statistical Learning

2. General problem
• Regression problem.
• Data are available $(X_1, Y_1), \dots, (X_n, Y_n)$; $X_i \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$.
• The aim is to find $f(X)$ for predicting $Y$ given the values of $X$.
• Linear model: $Y = f(X) + \epsilon$, where $E(\epsilon) = 0$ and $\epsilon$ is independent of $X$; $f(X) = X^T \beta$ for a set of parameters $\beta$.
• Another approach is to use linear basis expansions.
• Replace $X$ with a transformation of it, and subsequently use a linear model in the new space of input features.

3. General problem cont'd
• Let $h_m(X) : \mathbb{R}^p \mapsto \mathbb{R}$ be the $m$th transformation of $X$.
• Then $f(X) = \sum_{m=1}^M h_m(X)\, \beta_m$.
• Examples of $h_m(X)$ are polynomial and trigonometric expansions, e.g. $X_1^3$, $X_1 X_2$, $\sin(X_1)$, etc.
• Classical solution: use least squares to estimate $\beta$ in $f(X)$: $\hat{\beta} = (H^T H)^{-1} H^T y$.
• Bayesian solution: place a multivariate normal prior on the $\beta$'s. The likelihood is given by
$$f(Y \mid X, \beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - f(x_i))^2}.$$
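A minimal numerical sketch of the classical solution on this slide (not part of the original deck): build a small basis matrix $H$ from polynomial and trigonometric transformations and solve the least-squares problem. The toy data and the particular choice of basis functions are assumptions for illustration only.

```python
# Sketch: linear basis expansion fitted by least squares, beta_hat = (H^T H)^{-1} H^T y.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = np.sin(x) + 0.3 * x**2 + rng.normal(scale=0.1, size=50)   # illustrative toy data

# Columns of H are the basis functions h_m evaluated at the inputs.
H = np.column_stack([np.ones_like(x), x, x**2, x**3, np.sin(x)])

# Least-squares coefficients (solved via lstsq rather than an explicit inverse, for stability).
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
y_fit = H @ beta_hat
print(np.round(beta_hat, 3))
```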

4. Example: a cubic spline
• Assume $X$ is one-dimensional.
• Divide the domain of $X$ into contiguous intervals.
• $f$ is represented by a separate polynomial in each interval.
• The basis functions are:
$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = X^2, \quad h_4(X) = X^3, \quad h_5(X) = (X - \psi_1)_+^3, \quad h_6(X) = (X - \psi_2)_+^3.$$
• $\psi_1$ and $\psi_2$ are knots.
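A short sketch of this truncated-power cubic spline basis (not from the deck); the knot locations and evaluation points are illustrative assumptions.

```python
# Sketch: design matrix for the cubic spline basis h_1,...,h_6 with two knots.
import numpy as np

def cubic_spline_basis(x, knots=(3.0, 5.0)):
    """Columns: 1, X, X^2, X^3, and (X - psi_k)_+^3 for each knot psi_k."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    for psi in knots:
        cols.append(np.clip(x - psi, 0.0, None) ** 3)   # truncated power term (X - psi)_+^3
    return np.column_stack(cols)

H = cubic_spline_basis(np.linspace(1, 7, 13))
print(H.shape)   # (13, 6): six basis functions evaluated at 13 points
```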

5. Example: a cubic spline
[Figure: a cubic spline $f(x)$ plotted against $x$, with the knots $\psi_1$ and $\psi_2$ marked.]

6. Use in classification
• Let the outputs $Y$ take values in a discrete set.
• We want to divide the input space into a collection of regions labelled according to the classification.
• For $Y \in \{0, 1\}$, the model is
$$\log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = f(x), \quad \text{hence} \quad P(Y = 1 \mid X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.$$
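A one-function sketch of the mapping from the log-odds $f(x)$ to a class probability (the function name and test values are illustrative, not from the slides).

```python
# Sketch: P(Y=1 | X=x) = exp(f(x)) / (1 + exp(f(x))), i.e. the logistic function of f(x).
import numpy as np

def prob_class_one(f_x):
    """Class-1 probability from the log-odds f(x)."""
    f_x = np.asarray(f_x, dtype=float)
    return 1.0 / (1.0 + np.exp(-f_x))   # algebraically equal to exp(f)/(1+exp(f))

print(prob_class_one([-2.0, 0.0, 2.0]))   # approximately [0.119, 0.5, 0.881]
```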

7. Regularisation
• Let's move from cubic splines to consider all $f$ that are twice continuously differentiable.
• Many such $f$ will have $\sum_{i=1}^n (y_i - f(x_i))^2 = 0$.
• So we look at the penalized RSS:
$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt.$$
• The second term encourages splines with a slowly changing slope.

8. Regularisation cont'd
• $\lambda = 0$: $f$ can be any function that interpolates the data.
• $\lambda = \infty$: $f$ is the least squares line fit.
• Note that this problem is defined on an infinite-dimensional function space.
• However, the solution is finite-dimensional and unique:
$$f(x) = \sum_{j=1}^n D_j(x)\, \beta_j,$$
where the $D_j(x)$ are an $n$-dimensional set of basis functions representing a family of natural splines.
• Natural splines have additional constraints that force the function to be linear beyond the boundary knots.

9. Regularisation cont'd
• Clearly, all inference about $f$ is inference about $\beta = (\beta_0, \beta_1, \dots, \beta_n)$.
• The LS solution can be shown to be
$$\hat{\beta} = (D^T D + \lambda \Phi_D)^{-1} D^T y,$$
where $D$ and $\Phi_D$ are matrices with elements $\{D\}_{i,j} = D_j(x_i)$ and $\{\Phi_D\}_{j,k} = \int D_j''(t)\, D_k''(t)\, dt$, respectively.
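A minimal numeric sketch of this penalized least-squares solution (not from the deck). It uses an illustrative polynomial basis rather than the natural spline basis of the slides, and approximates the penalty integral on a grid; data, basis, and $\lambda$ are assumptions.

```python
# Sketch: beta_hat = (D^T D + lambda * Phi_D)^{-1} D^T y with a stand-in basis and penalty.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)

def basis(t):
    """Columns D_j(t) = t^(j-1), j = 1..6 (illustrative, not natural splines)."""
    return np.column_stack([t**j for j in range(6)])

def basis_dd(t):
    """Second derivatives D_j''(t) of the columns above."""
    return np.column_stack([j * (j - 1) * t**(j - 2) if j >= 2 else np.zeros_like(t)
                            for j in range(6)])

D = basis(x)
grid = np.linspace(0, 1, 401)
B = basis_dd(grid)
Phi = (B.T @ B) * (grid[1] - grid[0])   # {Phi}_{j,k} ~ integral of D_j''(t) D_k''(t) dt

lam = 1e-4
beta_hat = np.linalg.solve(D.T @ D + lam * Phi, D.T @ y)
print(np.round(beta_hat, 3))
```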

10. Generalisation
• We can generalise this to higher dimensions. Suppose $X \in \mathbb{R}^2$ and we solve
$$\min_f \left[ \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda J(f) \right].$$
• $J(f)$ is the penalty term; an example of it is
$$J(f) = \int\!\!\int_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f(x)}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f(x)}{\partial x_2^2} \right)^2 \right] dx_1\, dx_2.$$

11. Generalisation cont'd
• Optimizing with this penalty leads to a thin plate spline.
• The solution can be written as a linear expansion of basis functions:
$$f(x) = \beta_0 + \beta^T x + \sum_{j=1}^n \alpha_j h_j(x),$$
where the $h_j$ are radial basis functions: $h_j(x) = \|x - x_j\|^2 \log(\|x - x_j\|)$.
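A short sketch of evaluating the thin-plate radial basis functions against a set of centres $x_j$ (the training inputs). The data sizes are illustrative; only the formula $h_j(x) = \|x - x_j\|^2 \log \|x - x_j\|$ comes from the slide.

```python
# Sketch: design matrix of thin-plate radial basis functions h_j(x) = ||x - x_j||^2 log ||x - x_j||.
import numpy as np

def thin_plate_design(x, centres):
    """Rows: evaluation points; columns: h_j evaluated at each centre x_j."""
    d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)   # pairwise distances
    with np.errstate(divide="ignore", invalid="ignore"):
        h = d**2 * np.log(d)
    return np.nan_to_num(h)    # r^2 log r -> 0 as r -> 0, so zero distances map to 0

rng = np.random.default_rng(2)
X = rng.uniform(size=(20, 2))          # 20 points in R^2
H = thin_plate_design(X, X)
print(H.shape)                         # (20, 20)
```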

12. Most general case
• The general class of problems can be represented as
$$\min_{f \in \mathcal{H}} \left[ \sum_{i=1}^n L(y_i, f(x_i)) + \lambda J(f) \right], \qquad (1)$$
where
• $L(y_i, f(x_i))$ is a loss function, e.g. $(y_i - f(x_i))^2$,
• $J(f)$ is the penalty term,
• $\mathcal{H}$ is the space on which $J(f)$ is defined.
• A general functional form can be used for $J(f)$; see Girosi et al. (1995).
• The solution can be written in terms of a finite number of coefficients.

13. Reproducing Kernel Hilbert Spaces (RKHS)
• This is a subclass of the problems on the previous slide.
• Let $\phi_1, \phi_2, \dots$ be an infinite sequence of basis functions.
• $\mathcal{H}_K$ is defined to be the space of $f$'s such that
$$\mathcal{H}_K = \left\{ f(x) \,\middle|\, f(x) = \sum_{i=1}^\infty c_i \phi_i(x) \right\}.$$
• Let $K$ be a positive definite kernel with an eigen-expansion
$$K(x_1, x_2) = \sum_{i=1}^\infty \gamma_i \phi_i(x_1) \phi_i(x_2), \qquad (2)$$
where $\gamma_i \ge 0$ and $\sum_{i=1}^\infty \gamma_i^2 < \infty$.
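A small numerical illustration of the eigen-expansion idea in (2), not from the deck: for a positive definite kernel, the Gram matrix on a finite sample is symmetric positive semi-definite, and its eigenvalues (the empirical counterpart of the $\gamma_i$) are non-negative and typically decay quickly. The Gaussian kernel, sample, and bandwidth are illustrative assumptions.

```python
# Sketch: eigenvalues of a Gaussian-kernel Gram matrix are non-negative and decay rapidly.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / 2.0)                 # Gram matrix of a Gaussian kernel

eigvals = np.linalg.eigvalsh(K)             # all >= 0 up to numerical error
print(np.round(eigvals[::-1][:5], 3))       # the five largest eigenvalues
```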

14. RKHS cont'd
• Define $J(f)$ to be
$$J(f) = \|f\|^2_{\mathcal{H}_K} = \sum_{i=1}^\infty \frac{c_i^2}{\gamma_i} < \infty.$$
• $J(f)$ penalizes functions with small eigenvalues in the expansion (2).
• Wahba (1990) shows that (1) with these $f$ and $J$ has a finite-dimensional solution, given by
$$f(x) = \sum_{i=1}^n \beta_i K(x, x_i).$$

15. RKHS cont'd
• Given this, the problem in (1) reduces to a finite-dimensional optimization:
$$\min_{\beta} \left[ L(y, K\beta) + \lambda\, \beta^T K \beta \right],$$
where $K$ is an $n \times n$ matrix with elements $\{K\}_{i,j} = K(x_i, x_j)$.
• Hence, the problem is defined in terms of $L$ and $K$!
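A sketch of this finite-dimensional problem for the squared-error loss (not from the deck): minimising $\|y - K\beta\|^2 + \lambda \beta^T K \beta$ gives $(K + \lambda I)\beta = y$ when $K$ is positive definite. The data, kernel, and $\lambda$ below are illustrative assumptions.

```python
# Sketch: kernel solution for the squared-error loss, beta = (K + lambda I)^{-1} y.
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=30)

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / 0.5)                       # {K}_{i,j} = K(x_i, x_j), Gaussian kernel

lam = 1e-2
beta = np.linalg.solve(K + lam * np.eye(len(y)), y)
f_hat = K @ beta                                  # fitted values f(x_i) = sum_j beta_j K(x_i, x_j)
print(np.round(f_hat[:5], 3))
```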

16. Bayesian RKHS for classification
• Mallick et al. (2005): molecular classification of two types of tumour using cDNA microarrays.
• The data have undergone within- and between-slide normalization.
• There are $p$ genes and $n$ tumour samples, so $x_{i,j}$ is a measurement of the expression level of the $j$th gene for the $i$th sample.
• They wish to model $p(y \mid x)$ and use it to predict future observations.
• Assume latent variables $z_i$ such that
$$p(y \mid z) = \prod_{i=1}^n p(y_i \mid z_i), \qquad z_i = f(x_i) + \epsilon_i, \quad i = 1, \dots, n, \quad \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$

17. Bayesian RKHS for classification
• To develop the complete model, they need to specify $p(y \mid z)$ and $f$.
• $f(x)$ is modeled by the RKHS approach.
• Their kernel choices are Gaussian and polynomial.
• Both kernels contain only one parameter $\theta$; e.g. the Gaussian kernel is $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \theta)$.
• Hence, the random variable $z_i$ is modeled by
$$z_i = f(x_i) + \epsilon_i = \beta_0 + \sum_{j=1}^n \beta_j K(x_i, x_j \mid \theta) + \epsilon_i, \quad i = 1, \dots, n, \quad \epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2).$$
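A small sketch of the single-parameter Gaussian kernel used here, $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\theta)$. The simulated "expression" matrix and the value of $\theta$ are illustrative assumptions.

```python
# Sketch: n x n Gram matrix of the Gaussian kernel with a single scale parameter theta.
import numpy as np

def gaussian_kernel(X, theta):
    """Entries exp(-||x_i - x_j||^2 / theta) for the rows x_i of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / theta)

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 200))       # 10 samples, 200 "genes" (illustrative sizes)
K = gaussian_kernel(X, theta=200.0)
print(K.shape, K[0, 0])              # (10, 10) 1.0
```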

18. Bayesian RKHS for classification
• The Bayesian formulation requires priors to be assigned to $\beta$, $\theta$, and $\sigma^2$.
• The model is specified as
$$z_i \mid \beta, \theta, \sigma^2 \sim N(z_i \mid K_i' \beta, \sigma^2), \qquad \beta, \sigma^2 \sim N(\beta \mid 0, \sigma^2 M^{-1})\, IG(\sigma^2 \mid \gamma_1, \gamma_2), \qquad \theta \sim \prod_{q=1}^p U(a_{1q}, a_{2q}),$$
where $K_i' = (1, K(x_i, x_1 \mid \theta), \dots, K(x_i, x_n \mid \theta))$ and $M$ is a diagonal matrix with elements $\xi = (\xi_1, \dots, \xi_{n+1})$.
• The Jeffreys independence prior $p(\xi) \propto \prod_{i=1}^{n+1} \xi_i^{-1}$ promotes sparseness (Figueiredo, 2002).
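A heavily simplified prior-simulation sketch of this hierarchy, under assumptions that are not in the slides: $M$ is taken to be the identity (so the Jeffreys prior on $\xi$ is not simulated), $\theta$ is a single uniform scale, the inverse-gamma hyperparameters are arbitrary illustrative values, and one common IG parameterisation is used.

```python
# Sketch: draw (sigma^2, theta, beta) from simplified priors, then simulate z | beta, theta, sigma^2.
import numpy as np

rng = np.random.default_rng(6)
n = 10
X = rng.normal(size=(n, 5))                                   # illustrative inputs

gamma1, gamma2 = 2.0, 1.0
sigma2 = 1.0 / rng.gamma(shape=gamma1, scale=1.0 / gamma2)    # sigma^2 ~ IG(gamma1, gamma2) (shape/rate form)
theta = rng.uniform(1.0, 10.0)                                # theta ~ U(a1, a2), illustrative bounds
beta = rng.normal(scale=np.sqrt(sigma2), size=n + 1)          # beta | sigma^2 ~ N(0, sigma^2 M^{-1}), M = I assumed

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / theta)
K_design = np.column_stack([np.ones(n), K])                   # rows are K_i' = (1, K(x_i, x_1), ..., K(x_i, x_n))

z = K_design @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
print(np.round(z, 2))
```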

19. Bayesian RKHS for classification
• $p(y \mid z)$ is modeled on the basis of a loss function.
• Two models are considered in the paper: logistic regression and the SVM.
• The logistic regression approach:
$$p(y_i \mid z_i) = [p_i(z_i)]^{y_i} [1 - p_i(z_i)]^{(1 - y_i)}, \qquad p_i(z_i) = \frac{e^{z_i}}{1 + e^{z_i}}.$$
• It follows that the log-likelihood is equal to
$$\sum_{i=1}^n y_i z_i - \sum_{i=1}^n \log(1 + e^{z_i}).$$
• So the loss function is given by $L(y_i, z_i) = y_i z_i - \log(1 + e^{z_i})$.
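A short sketch of this log-likelihood (not from the deck), with $\log(1 + e^{z})$ computed in a numerically stable way; the test values are illustrative.

```python
# Sketch: logistic log-likelihood sum_i [ y_i z_i - log(1 + exp(z_i)) ].
import numpy as np

def logistic_log_lik(y, z):
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    return np.sum(y * z - np.logaddexp(0.0, z))   # log(1 + e^z) = logaddexp(0, z)

y = np.array([1, 0, 1, 1, 0])
z = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
print(round(float(logistic_log_lik(y, z)), 4))
```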

20. Bayesian RKHS for classification
• MCMC is used to sample from the posterior $p(\beta, \theta, z, \lambda, \sigma^2 \mid y)$.
• Proposed work:
  - variable selection:
    - kernel selection (which $\beta_i = 0$?)
    - regressor selection (which $x_i$ to ignore?)
  - more than two classes (multivariate logistic regression).
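The deck does not spell out the sampler, so the following is only a toy random-walk Metropolis sketch, not the scheme of Mallick et al.: it targets the posterior of $\beta$ alone under the logistic likelihood with $z = K'\beta$ (the latent noise $\epsilon_i$ is dropped), a fixed $N(0, \tau^2 I)$ prior on $\beta$, and a fixed kernel parameter $\theta$. All of these simplifications, and the tuning constants, are assumptions for illustration.

```python
# Sketch: random-walk Metropolis for beta under a simplified Bayesian kernel logistic model.
import numpy as np

rng = np.random.default_rng(7)
n = 20
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(float)   # toy labels

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
Kd = np.column_stack([np.ones(n), np.exp(-sq / 3.0)])        # rows K_i' with theta = 3 held fixed

def log_post(beta, tau2=10.0):
    z = Kd @ beta
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))           # logistic log-likelihood
    log_prior = -0.5 * np.sum(beta**2) / tau2                # N(0, tau^2 I) prior, up to a constant
    return log_lik + log_prior

beta = np.zeros(n + 1)
samples, current = [], log_post(beta)
for _ in range(5000):
    prop = beta + rng.normal(scale=0.05, size=n + 1)         # random-walk proposal
    cand = log_post(prop)
    if np.log(rng.uniform()) < cand - current:               # Metropolis accept/reject step
        beta, current = prop, cand
    samples.append(beta.copy())

print(np.round(np.mean(samples[2500:], axis=0)[:5], 3))      # posterior means of the first coefficients
```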
