Application: Discrete Events

Simple Data
  Discrete random variables (e.g. tossing a die).

  Outcome        1     2     3     4     5     6
  Counts         3     6     2     1     4     4
  Probabilities  0.15  0.30  0.10  0.05  0.20  0.20

Maximum Likelihood Solution
  Count the number of occurrences of each outcome and use its relative frequency as the estimate of the probability:
    p_emp(x) = #x / m

Problems
  Bad idea if we have few data.
  Bad idea if we have continuous random variables.
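As a quick illustration (not part of the original slides), the empirical estimate above can be computed directly from the counts; the variable names are ours:

    import numpy as np

    counts = np.array([3, 6, 2, 1, 4, 4])   # counts for outcomes 1..6
    m = counts.sum()                         # total number of tosses, m = 20
    p_emp = counts / m                       # relative frequencies
    print(p_emp)                             # [0.15 0.3  0.1  0.05 0.2  0.2 ]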
Tossing a die (figure)
Fisher Information and Efficiency

Fisher Score
  V_θ(x) := ∂_θ log p(x; θ)
  This tells us the influence of x on estimating θ. Its expected value vanishes, since
    E[∂_θ log p(X; θ)] = ∫ p(X; θ) ∂_θ log p(X; θ) dX = ∂_θ ∫ p(X; θ) dX = 0.

Fisher Information Matrix
  It is the covariance matrix of the Fisher scores, that is
    I := Cov[V_θ(x)]
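A quick numerical sanity check (ours, not from the slides): for a Bernoulli(p) model the score has zero mean and its variance equals the Fisher information 1/(p(1-p)):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3
    x = rng.binomial(1, p, size=200_000)

    # Fisher score of a Bernoulli(p): d/dp log p(x; p) = x/p - (1-x)/(1-p)
    score = x / p - (1 - x) / (1 - p)

    print(score.mean())   # close to 0: the score has zero expectation
    print(score.var())    # close to 1 / (p * (1 - p)) = 4.76: the Fisher information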
Cramér-Rao Theorem

Efficiency
  Covariance of the estimator θ̂(X), rescaled by I:
    1/e := det( Cov[θ̂(X)] Cov[∂_θ log p(X; θ)] )

Theorem
  The efficiency of unbiased estimators is never better (i.e. larger) than 1. Equality is achieved by maximum likelihood estimators.

Proof (scalar case only)
  By the Cauchy-Schwarz inequality we have
    ( E_θ[ (V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)]) ] )²
      ≤ E_θ[ (V_θ(X) − E_θ[V_θ(X)])² ] E_θ[ (θ̂(X) − E_θ[θ̂(X)])² ] = I·B,
  where B := Var_θ[θ̂(X)].
Cramér-Rao Theorem

Proof (continued)
  At the same time, E_θ[V_θ(X)] = 0 implies that
    E_θ[ (V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)]) ]
      = E_θ[ V_θ(X) θ̂(X) ]
      = ∫ p(X|θ) ∂_θ log p(X|θ) θ̂(X) dX
      = ∂_θ ∫ p(X|θ) θ̂(X) dX
      = ∂_θ E_θ[θ̂(X)] = ∂_θ θ = 1   (using that θ̂ is unbiased).

Cautionary Note
  This does not mean that a biased estimator cannot have lower variance.
Fisher and Exponential Families

Fisher Score
  V_θ(x) = ∂_θ log p(x; θ) = φ(x) − ∂_θ g(θ)

Fisher Information
  I = Cov[V_θ(x)] = Cov[φ(x) − ∂_θ g(θ)] = ∂²_θ g(θ)
  The efficiency of an estimator can be obtained directly from the log-partition function.

Outer Product Matrix
  It is given (up to an offset) by ⟨φ(x), φ(x′)⟩. This leads to Kernel PCA ...
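A small check of I = ∂²_θ g(θ) (our example, not from the slides): for the Poisson family in canonical form, φ(x) = x and g(θ) = exp(θ), so the Fisher information exp(θ) should match the variance of the sufficient statistic:

    import numpy as np

    # Poisson in canonical form: p(x; theta) = exp(theta*x - exp(theta)) / x!
    # so phi(x) = x and g(theta) = exp(theta).
    theta = 0.7
    rng = np.random.default_rng(1)
    x = rng.poisson(np.exp(theta), size=200_000)

    second_deriv_g = np.exp(theta)     # d^2/dtheta^2 exp(theta) = exp(theta)
    print(x.var(), second_deriv_g)     # both approximately exp(0.7) = 2.01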
Priors

Problems with Maximum Likelihood
  With too little data, parameter estimates will be bad.

Prior to the rescue
  Often we know roughly where the solution should lie, and we encode this knowledge by means of a prior p(θ).

Normal Prior
  Simply set p(θ) ∝ exp( −‖θ‖²/(2σ²) ).

Posterior
  p(θ|X) ∝ exp( Σ_{i=1}^m [ ⟨φ(x_i), θ⟩ − g(θ) ] − ‖θ‖²/(2σ²) )
Tossing a die with priors (figure)
Conjugate Priors

Problem with the Normal Prior
  The posterior looks different from the likelihood, so many of the maximum likelihood optimization algorithms may no longer apply ...

Idea
  What if we had a prior which looked like additional data, that is
    p(θ|X) ∼ p(X|θ)?
  For exponential families this is easy. Simply set
    p(θ|a) ∝ exp( ⟨θ, m_0 a⟩ − m_0 g(θ) )

Posterior
  p(θ|X) ∝ exp( (m + m_0) [ ⟨ (mμ + m_0 a)/(m + m_0), θ ⟩ − g(θ) ] ),
  where μ = (1/m) Σ_{i=1}^m φ(x_i) is the empirical mean of the sufficient statistics.
Example: Multinomial Distribution

Laplace Rule
  A conjugate prior with parameters (a, m_0) in the multinomial family is obtained by setting a = (1/n, 1/n, ..., 1/n). This is often also called the Dirichlet prior. It leads to
    p(x) = (#x + m_0/n) / (m + m_0)   instead of   p(x) = #x / m

Example
  Outcome          1     2     3     4     5     6
  Counts           3     6     2     1     4     4
  MLE              0.15  0.30  0.10  0.05  0.20  0.20
  MAP (m_0 = 6)    0.15  0.27  0.12  0.08  0.19  0.19
  MAP (m_0 = 100)  0.16  0.19  0.16  0.15  0.17  0.17
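The MAP rows of the table can be reproduced with a few lines of code; this is our illustration of the Laplace/Dirichlet smoothing rule above:

    import numpy as np

    counts = np.array([3, 6, 2, 1, 4, 4])
    m, n = counts.sum(), counts.size

    def map_estimate(counts, m0, n):
        # Laplace/Dirichlet smoothing: p(x) = (#x + m0/n) / (m + m0)
        return (counts + m0 / n) / (counts.sum() + m0)

    print(np.round(counts / m, 2))                    # MLE:           [0.15 0.3  0.1  0.05 0.2  0.2 ]
    print(np.round(map_estimate(counts, 6, n), 2))    # MAP, m0 = 6:   [0.15 0.27 0.12 0.08 0.19 0.19]
    print(np.round(map_estimate(counts, 100, n), 2))  # MAP, m0 = 100: [0.16 0.19 0.16 0.15 0.17 0.17]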
Optimization Problems

Maximum Likelihood
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ]   ⇒   ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(x_i)

Normal Prior
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ] + ‖θ‖²/(2σ²)

Conjugate Prior
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ] + m_0 g(θ) − m_0 ⟨μ̃, θ⟩
  equivalently solve  ∂_θ g(θ) = (1/(m + m_0)) Σ_{i=1}^m φ(x_i) + (m_0/(m + m_0)) μ̃
Summary

Model
  Log-partition function
  Expectations and derivatives
  Maximum entropy formulation
  A zoo of densities

Estimation
  Maximum likelihood estimator
  Fisher information matrix and Cramér-Rao theorem
  Normal priors and conjugate priors
  Fisher information and log-partition function
Exponential Families and Kernels, Lecture 2
Alexander J. Smola (Alex.Smola@nicta.com.au)
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline

Exponential Families
  Maximum likelihood and Fisher information
  Priors (conjugate and normal)

Conditioning and Feature Spaces
  Conditional distributions and inner products
  Hammersley-Clifford decomposition

Applications
  Classification and novelty detection
  Regression

Applications
  Conditional random fields
  Intractable models and semidefinite approximations
Lecture 2

Hammersley-Clifford Theorem and Graphical Models
  Decomposition results
  Key connection

Conditional Distributions
  Log-partition function
  Expectations and derivatives
  Inner product formulation and kernels
  Gaussian processes

Applications
  Classification and regression
  Conditional random fields
  Spatial Poisson models
Graphical Model

Conditional Independence
  x and x′ are conditionally independent given c if
    p(x, x′ | c) = p(x | c) p(x′ | c)
  Distributions can be simplified greatly by conditional independence assumptions.

Markov Network
  Given a graph G(V, E) with vertices V and edges E, associate a random variable x ∈ ℝ^{|V|} with G.
  Subsets of random variables x_S and x_{S′} are conditionally independent given x_C if removing the vertices C from G(V, E) decomposes the graph into disjoint components containing S and S′.
Conditional Independence (figure)
Cliques

Definition
  A subset of the graph which is fully connected.
  Maximal cliques (they define the graph).

Advantage
  Easy to specify dependencies between variables.
  Use graph algorithms for inference.
Hammersley-Clifford Theorem

Problem
  Specify p(x) with given conditional independence properties.

Theorem
  p(x) = (1/Z) exp( Σ_{c ∈ C} ψ_c(x_c) )
  whenever p(x) is nonzero on the entire domain (C is the set of maximal cliques).

Application
  Apply the decomposition to exponential families, where p(x) = exp( ⟨φ(x), θ⟩ − g(θ) ).

Corollary
  The sufficient statistics φ(x) decompose according to
    φ(x) = (..., φ_c(x_c), ...)   ⇒   ⟨φ(x), φ(x′)⟩ = Σ_{c ∈ C} ⟨φ_c(x_c), φ_c(x′_c)⟩
Proof

Step 1: Obtain a linear functional
  Combining the exponential family setting with the Hammersley-Clifford theorem:
    ⟨Φ(x), θ⟩ = Σ_{c ∈ C} ψ_c(x_c) − log Z + g(θ)   for all x, θ.

Step 2: Orthonormal basis in θ
  Pick an orthonormal basis and absorb Z and g. This gives
    ⟨Φ(x), e_i⟩ = Σ_{c ∈ C} η^i_c(x_c)   for some functions η^i_c(x_c).

Step 3: Reconstruct the sufficient statistics
  Set Φ_c(x_c) := (η^1_c(x_c), η^2_c(x_c), ...), which allows us to compute
    ⟨Φ(x), θ⟩ = Σ_{c ∈ C} Σ_i θ_i Φ^i_c(x_c).
Example: Normal Distributions

Sufficient Statistics
  Recall that for normal distributions φ(x) = (x, xx⊤).

Hammersley-Clifford Application
  φ(x) must decompose into subsets involving only variables from each maximal clique.
  The linear term x is fine by default.
  The only nonzero terms coupling x_i and x_j are those corresponding to an edge in the graph G(V, E).

Inverse Covariance Matrix
  The natural parameter aligned with xx⊤ is the inverse covariance matrix. Its sparsity pattern mirrors G(V, E). Hence a sparse inverse kernel matrix corresponds to a graphical model!
Example: Normal Distributions

Density
  p(x|θ) = exp( Σ_{i=1}^n x_i θ^1_i + Σ_{i,j=1}^n x_i x_j θ^2_{ij} − g(θ) )
  Here θ^2 = Σ^{-1} is the inverse covariance matrix. We have (Σ^{-1})_{ij} ≠ 0 only if (i, j) share an edge.
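A tiny numerical illustration of the last point (ours): for a Gaussian whose graph is a chain x1 - x2 - x3 - x4, the inverse covariance is tridiagonal even though the covariance itself is dense.

    import numpy as np

    # Precision (inverse covariance) of a Gaussian Markov chain x1 - x2 - x3 - x4:
    # only neighbouring variables are coupled, so the matrix is tridiagonal.
    precision = np.array([[ 2., -1.,  0.,  0.],
                          [-1.,  2., -1.,  0.],
                          [ 0., -1.,  2., -1.],
                          [ 0.,  0., -1.,  2.]])

    covariance = np.linalg.inv(precision)
    print(np.round(covariance, 2))                  # dense: every pair of variables is correlated
    print(np.round(np.linalg.inv(covariance), 2))   # recovers the sparse, graph-structured precision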
Conditional Distributions

Conditional Density
  p(x|θ) = exp( ⟨φ(x), θ⟩ − g(θ) )
  p(y|x, θ) = exp( ⟨φ(x, y), θ⟩ − g(θ|x) )

Log-partition function
  g(θ|x) = log ∫_Y exp( ⟨φ(x, y), θ⟩ ) dy

Sufficient Criterion
  p(x, y|θ) is itself a member of the exponential family.

Key Idea
  Avoid computing φ(x, y) directly; only evaluate inner products via
    k((x, y), (x′, y′)) := ⟨φ(x, y), φ(x′, y′)⟩
Conditional Distributions

Maximum a Posteriori Estimation
  −log p(θ|X) = Σ_{i=1}^m −⟨φ(x_i), θ⟩ + m g(θ) + ‖θ‖²/(2σ²) + c
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Solving the Problem
  The problem is strictly convex in θ.
  A direct solution is impossible if we cannot compute φ(x, y) explicitly.
  Instead, expand θ as a linear combination of the φ(x_i, y) and solve the convex problem in the expansion coefficients.
Joint Feature Map (figure)
Representer Theorem

Objective Function
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Decomposition
  Decompose θ into θ = θ∥ + θ⊥, where
    θ∥ ∈ span{ φ(x_i, y) : 1 ≤ i ≤ m, y ∈ Y }
  Both g(θ|x_i) and ⟨φ(x_i, y_i), θ⟩ are independent of θ⊥.

Theorem
  −log p(θ|X, Y) is minimized for θ⊥ = 0, hence θ = θ∥.

Consequence
  If span{ φ(x_i, y) : 1 ≤ i ≤ m, y ∈ Y } is finite dimensional, we have a parametric optimization problem.
Using It

Expansion
  θ = Σ_{i=1}^m Σ_{y ∈ Y} α_{iy} φ(x_i, y)

Inner Product
  ⟨φ(x, y), θ⟩ = Σ_{i=1}^m Σ_{ỹ ∈ Y} α_{iỹ} k((x, y), (x_i, ỹ))

Norm
  ‖θ‖² = Σ_{i,j=1}^m Σ_{y, y′ ∈ Y} α_{iy} α_{jy′} k((x_i, y), (x_j, y′))

Log-partition function
  g(θ|x) = log Σ_{y ∈ Y} exp( ⟨φ(x, y), θ⟩ )
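A minimal sketch (ours) of these formulas in code, assuming the product kernel k((x, y), (x′, y′)) = k(x, x′)·[y = y′] introduced later in the lecture; the data and coefficients are made up for illustration:

    import numpy as np
    from scipy.special import logsumexp

    def rbf(a, b, gamma=1.0):
        # Base kernel on the inputs; the joint kernel is k(x, x') * [y == y'].
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def t_of_x(x, X, alpha, gamma=1.0):
        # t(x, y) = <phi(x, y), theta> for every class y, with
        # theta = sum_{i, y} alpha[i, y] phi(x_i, y).
        k = np.array([rbf(x, xi, gamma) for xi in X])   # vector of k(x, x_i)
        return alpha.T @ k                               # entry y is sum_i alpha[i, y] k(x, x_i)

    def log_partition(x, X, alpha, gamma=1.0):
        # g(theta | x) = log sum_y exp(t(x, y))
        return logsumexp(t_of_x(x, X, alpha, gamma))

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])        # toy inputs
    alpha = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])  # alpha[i, y] for two classes
    x_new = np.array([0.5, 0.5])
    print(t_of_x(x_new, X, alpha), log_partition(x_new, X, alpha))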
The Gaussian Process Link

Normal Prior on θ ...
  θ ∼ N(0, σ² 1)
... yields a Normal Prior on t(x, y) = ⟨φ(x, y), θ⟩
  The distribution of a projected Gaussian is Gaussian.
  The mean vanishes: E_θ[t(x, y)] = ⟨φ(x, y), E_θ[θ]⟩ = 0
  The covariance yields
    Cov[t(x, y), t(x′, y′)] = E_θ[ ⟨φ(x, y), θ⟩ ⟨θ, φ(x′, y′)⟩ ] = σ² ⟨φ(x, y), φ(x′, y′)⟩ =: k((x, y), (x′, y′))
... so we have a Gaussian process on x, with kernel k((x, y), (x′, y′)) = σ² ⟨φ(x, y), φ(x′, y′)⟩.
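The figure slides that follow show draws from such Gaussian process priors. A small sketch (ours) of how such sample paths can be generated for a Gaussian RBF covariance; the length scale and grid are our choices:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)

    # Gaussian RBF covariance k(x, x') = exp(-|x - x'|^2 / (2 * ell^2))
    ell = 0.5
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))

    # Draw a few functions from the GP prior N(0, K); jitter keeps the Cholesky stable.
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
    samples = L @ rng.standard_normal((len(x), 3))   # three sample paths, one per column
    # import matplotlib.pyplot as plt; plt.plot(x, samples); plt.show()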
Linear Covariance (figure)
Laplacian Covariance (figure)
Gaussian Covariance (figure)
Polynomial (Order 3) (figure)
B3-Spline Covariance (figure)
Sample from Gaussian RBF (sequence of figure slides showing several draws)
Sample from linear kernel (sequence of figure slides showing several draws)
General Strategy

Choose a suitable sufficient statistic φ(x, y)
  A conditionally multinomial distribution leads to a Gaussian process multiclass estimator: a distribution over n classes which depends on x.
  A conditionally Gaussian distribution leads to Gaussian process regression: a normal distribution over a random variable which depends on the location. Note: we estimate mean and variance.
  Conditionally Poisson distributions yield locally varying Poisson processes. This has no name yet ...

Solve the optimization problem
  This is typically convex.

The bottom line
  Instead of choosing k(x, x′), choose k((x, y), (x′, y′)).
Example: GP Classification

Sufficient Statistic
  We pick φ(x, y) = φ(x) ⊗ e_y, that is
    k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}

Kernel Expansion
  By the representer theorem we get
    θ = Σ_{i=1}^m Σ_y α_{iy} φ(x_i, y)

Optimization Problem
  A big mess ... but convex.
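One way (ours, purely illustrative) to realise k((x, y), (x′, y′)) = k(x, x′) δ_{yy′} in code is as a Kronecker product of the input Gram matrix with an identity over the classes:

    import numpy as np

    def joint_kernel_matrix(K_x, n_classes):
        # Joint kernel k((x, y), (x', y')) = k(x, x') * [y == y'] arranged as a block
        # matrix: Kronecker product of the input Gram matrix with the class identity.
        return np.kron(K_x, np.eye(n_classes))

    X = np.array([[0.0], [1.0], [2.0]])
    K_x = np.exp(-(X - X.T) ** 2)          # Gaussian RBF Gram matrix on the inputs
    K_joint = joint_kernel_matrix(K_x, n_classes=3)
    print(K_joint.shape)                   # (9, 9): m * n rows and columns, indexed by (i, y)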
A Toy Example (figure)
Noisy Data (figure)
Summary

Hammersley-Clifford Theorem and Graphical Models
  Decomposition results
  Key connection
  Normal distribution

Conditional Distributions
  Log-partition function
  Expectations and derivatives
  Inner product formulation and kernels
  Gaussian processes

Applications
  Generalized kernel trick
  Conditioning recovers existing estimation methods
Exponential Families and Kernels, Lecture 3
Alexander J. Smola (Alex.Smola@nicta.com.au)
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline

Exponential Families
  Maximum likelihood and Fisher information
  Priors (conjugate and normal)

Conditioning and Feature Spaces
  Conditional distributions and inner products
  Hammersley-Clifford decomposition

Applications
  Classification and novelty detection
  Regression

Applications
  Conditional random fields
  Intractable models and semidefinite approximations
Lecture 3

Novelty Detection
  Density estimation
  Thresholding and likelihood ratio

Classification
  Log-partition function
  Optimization problem
  Examples
  Clustering and transduction

Regression
  Conditional normal distribution
  Estimating the covariance
  Heteroscedastic estimators
Density Estimation

Maximum a Posteriori
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ] + ‖θ‖²/(2σ²)

Advantages
  Convex optimization problem.
  Concentration of measure.

Problems
  The normalization g(θ) may be painful to compute.
  For density estimation we do not actually need a normalized p(x|θ).
  There is no need to perform particularly well in high density regions.
Novelty Detection (figure)
Novelty Detection

Optimization Problem
  MAP:      Σ_{i=1}^m −log p(x_i|θ) + ‖θ‖²/(2σ²)
  Novelty:  Σ_{i=1}^m max( −log [ p(x_i|θ) / exp(ρ − g(θ)) ], 0 ) + ‖θ‖²/2
          = Σ_{i=1}^m max( ρ − ⟨φ(x_i), θ⟩, 0 ) + ‖θ‖²/2

Advantages
  No normalization g(θ) needed.
  No need to perform particularly well in high density regions (the estimator focuses on low-density regions).
  Quadratic program.
Geometric Interpretation

Idea
  Find the hyperplane that has maximum distance from the origin, yet is still closer to the origin than the observations.

Hard Margin
  minimize  ‖θ‖²/2
  subject to  ⟨θ, x_i⟩ ≥ 1

Soft Margin
  minimize  ‖θ‖²/2 + C Σ_{i=1}^m ξ_i
  subject to  ⟨θ, x_i⟩ ≥ 1 − ξ_i  and  ξ_i ≥ 0
Dual Optimization Problem

Primal Problem
  minimize  ‖θ‖²/2 + C Σ_{i=1}^m ξ_i
  subject to  ⟨θ, x_i⟩ − 1 + ξ_i ≥ 0  and  ξ_i ≥ 0

Lagrange Function
  We construct a Lagrange function L by subtracting the constraints, multiplied by Lagrange multipliers (α_i and η_i), from the primal objective function. L has a saddle point at the optimal solution.
    L = ‖θ‖²/2 + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i ( ⟨θ, x_i⟩ − 1 + ξ_i ) − Σ_{i=1}^m η_i ξ_i
  where α_i, η_i ≥ 0. For instance, if ξ_i < 0 we could increase L without bound via η_i.
Dual Problem, Part II

Optimality Conditions
  ∂_θ L = θ − Σ_{i=1}^m α_i x_i = 0   ⇒   θ = Σ_{i=1}^m α_i x_i
  ∂_{ξ_i} L = C − α_i − η_i = 0   ⇒   α_i ∈ [0, C]
  Now we substitute the two optimality conditions back into L and eliminate the primal variables.

Dual Problem
  minimize  (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^m α_i
  subject to  α_i ∈ [0, C]
  Convexity ensures uniqueness of the optimum.
The ν-Trick

Problem
  Depending on how we choose C, the number of points lying on the "wrong" side of the hyperplane H := {x | ⟨θ, x⟩ = 1} will vary.
  We would like to specify a certain fraction ν beforehand.
  We want to make the setting more adaptive to the data.

Solution
  Use an adaptive hyperplane that separates the data from the origin, i.e. find H := {x | ⟨θ, x⟩ = ρ}, where the threshold ρ is adaptive.
The ν-Trick

Primal Problem
  minimize  ‖θ‖²/2 + Σ_{i=1}^m ξ_i − mνρ
  subject to  ⟨θ, x_i⟩ − ρ + ξ_i ≥ 0  and  ξ_i ≥ 0

Dual Problem
  minimize  (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩
  subject to  α_i ∈ [0, 1]  and  Σ_{i=1}^m α_i = νm

Difference to before
  The Σ_i α_i term vanishes from the objective function, but we get one more constraint, namely Σ_i α_i = νm.
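In practice the ν-parameterised novelty detector is available off the shelf; a minimal sketch using scikit-learn's OneClassSVM, with data and settings chosen by us for illustration:

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                   # "normal" data
    X_test = np.array([[0.0, 0.0], [4.0, 4.0]])     # one typical point, one outlier

    # nu upper-bounds the fraction of training points on the wrong side of the margin.
    clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5).fit(X)
    print(clf.predict(X_test))                      # expected roughly [ 1 -1 ]: inlier, outlier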
The ν-Property

Optimization Problem
  minimize  ‖θ‖²/2 + Σ_{i=1}^m ξ_i − mνρ
  subject to  ⟨θ, x_i⟩ − ρ + ξ_i ≥ 0  and  ξ_i ≥ 0

Theorem
  At most a fraction ν of the points lie on the "wrong" side of the margin, i.e. y_i f(x_i) < 1.
  At most a fraction 1 − ν of the points lie on the "right" side of the margin, i.e. y_i f(x_i) > 1.
  In the limit, those fractions become exact.

Proof Idea
  At the optimum, shift ρ slightly: only the active constraints have an influence on the objective function.
Classification

Maximum a Posteriori Estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Domain
  Finite set of labels Y = {1, ..., n}.
  The log-partition function g(θ|x) is easy to compute.
  Optional centering φ(x, y) → φ(x, y) + c leaves p(y|x, θ) unchanged (it offsets both terms).

Gaussian Process Connection
  The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process, so this is the same setting as in the literature.
Classification

Sufficient Statistic
  We pick φ(x, y) = φ(x) ⊗ e_y, that is
    k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}

Kernel Expansion
  By the representer theorem we get
    θ = Σ_{i=1}^m Σ_y α_{iy} φ(x_i, y)

Optimization Problem
  A big mess ... but convex. Solve by Newton or block-Jacobi methods.
A Toy Example (figure)
Noisy Data (figure)
SVM Connection

Problems with GP Classification
  We optimize even where the classification is already good.
  Only the sign of the classification is needed.
  Only the "strongest" wrong class matters.
  We want to classify with a margin.

Optimization Problem
  MAP:  Σ_{i=1}^m −log p(y_i|x_i, θ) + ‖θ‖²/(2σ²)
  SVM:  Σ_{i=1}^m max( ρ − log [ p(y_i|x_i, θ) / max_{y ≠ y_i} p(y|x_i, θ) ], 0 ) + ‖θ‖²/2
      = Σ_{i=1}^m max( ρ − ⟨φ(x_i, y_i), θ⟩ + max_{y ≠ y_i} ⟨φ(x_i, y), θ⟩, 0 ) + ‖θ‖²/2
Binary Classification

Sufficient Statistics
  The offset in φ(x, y) can be arbitrary.
  Pick it such that φ(x, y) = y φ(x) where y ∈ {±1}.
  The kernel matrix becomes
    K_{ij} = k((x_i, y_i), (x_j, y_j)) = y_i y_j k(x_i, x_j)

Optimization Problem
  The max over the other class becomes
    max_{y ≠ y_i} ⟨φ(x_i, y), θ⟩ = −y_i ⟨φ(x_i), θ⟩
  Overall problem:
    Σ_{i=1}^m max( ρ − 2 y_i ⟨φ(x_i), θ⟩, 0 ) + ‖θ‖²/2
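A small sketch (ours) of the label-modulated kernel matrix K_{ij} = y_i y_j k(x_i, x_j) used above, with toy data:

    import numpy as np

    X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]])
    y = np.array([1, -1, 1])

    # Base RBF Gram matrix k(x_i, x_j)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_x = np.exp(-sq_dists)

    # Joint kernel for binary labels: K_ij = y_i * y_j * k(x_i, x_j)
    K = np.outer(y, y) * K_x
    print(K)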
Geometrical Interpretation

  minimize  ‖θ‖²/2
  subject to  y_i ( ⟨θ, x_i⟩ + b ) ≥ 1  for all i
Optimization Problem

Linear Function
  f(x) = ⟨θ, x⟩ + b

Mathematical Programming Setting
  If we require error-free classification with a margin, i.e. y f(x) ≥ 1, we obtain:
    minimize  ‖θ‖²/2
    subject to  y_i ( ⟨θ, x_i⟩ + b ) − 1 ≥ 0  for all 1 ≤ i ≤ m

Result
  The dual of the optimization problem is a simple quadratic program (more later ...).

Connection back to conditional probabilities
  The offset b takes care of a bias towards one of the classes.
Regression

Maximum a Posteriori Estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Domain
  Continuous domain of observations Y = ℝ.
  The log-partition function g(θ|x) is easy to compute in closed form, since the conditional is a normal distribution.

Gaussian Process Connection
  The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process; in particular this also gives rescaled mean and covariance.
Regression

Sufficient Statistic (Standard Model)
  We pick φ(x, y) = (y φ(x), y²), that is
    k((x, y), (x′, y′)) = k(x, x′) y y′ + y² y′²   where y, y′ ∈ ℝ
  Traditionally the variance is fixed, that is θ_2 = const.

Sufficient Statistic (Fancy Model)
  We pick φ(x, y) = (y φ_1(x), y² φ_2(x)), that is
    k((x, y), (x′, y′)) = k_1(x, x′) y y′ + k_2(x, x′) y² y′²   where y, y′ ∈ ℝ
  Here we estimate mean and variance simultaneously.

Kernel Expansion
  By the representer theorem (and more algebra) we get
    θ = ( Σ_{i=1}^m α_{i1} φ_1(x_i), Σ_{i=1}^m α_{i2} φ_2(x_i) )
Training Data (figure)
Mean: k(x)⊤ (K + σ² 1)^{-1} y (figure)
Variance: k(x, x) + σ² − k(x)⊤ (K + σ² 1)^{-1} k(x) (figure)
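A compact sketch (ours) of the two formulas shown on the preceding slides, i.e. the GP regression mean k(x)⊤(K + σ²1)^{-1}y and variance k(x, x) + σ² − k(x)⊤(K + σ²1)^{-1}k(x); the data, kernel length scale, and noise level are illustrative:

    import numpy as np

    def rbf(A, B, ell=1.0):
        # Gaussian RBF Gram matrix between the rows of A and B.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * ell ** 2))

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 20)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)   # noisy training targets
    X_star = np.array([[0.5]])                             # test input
    sigma2 = 0.01                                          # noise variance

    K = rbf(X, X)
    k_star = rbf(X, X_star)[:, 0]                          # vector k(x) of kernel values
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)
    beta = np.linalg.solve(K + sigma2 * np.eye(len(X)), k_star)

    mean = k_star @ alpha                                  # k(x)^T (K + sigma^2 I)^{-1} y
    var = rbf(X_star, X_star)[0, 0] + sigma2 - k_star @ beta
    print(mean, var)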
Putting everything together ... (figure)
Another Example (figure)