Application: Discrete Events

Simple Data
  Discrete random variables (e.g. tossing a die).

  Outcome        1     2     3     4     5     6
  Counts         3     6     2     1     4     4
  Probabilities  0.15  0.30  0.10  0.05  0.20  0.20

Maximum Likelihood Solution
  Count the number of occurrences of each outcome and use its relative frequency as the estimate of the probability:
    p_emp(x) = #x / m

Problems
  Bad idea if we have few data.
  Bad idea if we have continuous random variables.
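As a quick illustration (not part of the original slides), the empirical estimate above can be computed directly from the counts; the variable names are ours:

    import numpy as np

    counts = np.array([3, 6, 2, 1, 4, 4])   # counts for outcomes 1..6
    m = counts.sum()                         # total number of tosses, m = 20
    p_emp = counts / m                       # relative frequencies
    print(p_emp)                             # [0.15 0.3  0.1  0.05 0.2  0.2 ]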
Tossing a die (figure)
Fisher Information and Efficiency

Fisher Score
  V_θ(x) := ∂_θ log p(x; θ)
  This tells us the influence of x on estimating θ. Its expected value vanishes, since
    E[∂_θ log p(X; θ)] = ∫ p(X; θ) ∂_θ log p(X; θ) dX = ∂_θ ∫ p(X; θ) dX = 0.

Fisher Information Matrix
  It is the covariance matrix of the Fisher scores, that is
    I := Cov[V_θ(x)]
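A quick numerical sanity check (ours, not from the slides): for a Bernoulli(p) model the score has zero mean and its variance equals the Fisher information 1/(p(1-p)):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3
    x = rng.binomial(1, p, size=200_000)

    # Fisher score of a Bernoulli(p): d/dp log p(x; p) = x/p - (1-x)/(1-p)
    score = x / p - (1 - x) / (1 - p)

    print(score.mean())   # close to 0: the score has zero expectation
    print(score.var())    # close to 1 / (p * (1 - p)) = 4.76: the Fisher information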
Cramér-Rao Theorem

Efficiency
  Covariance of the estimator θ̂(X), rescaled by I:
    1/e := det( Cov[θ̂(X)] Cov[∂_θ log p(X; θ)] )

Theorem
  The efficiency of unbiased estimators is never better (i.e. larger) than 1. Equality is achieved by maximum likelihood estimators.

Proof (scalar case only)
  By the Cauchy-Schwarz inequality we have
    ( E_θ[ (V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)]) ] )²
      ≤ E_θ[ (V_θ(X) − E_θ[V_θ(X)])² ] E_θ[ (θ̂(X) − E_θ[θ̂(X)])² ] = I·B,
  where B := Var_θ[θ̂(X)].
Cramér-Rao Theorem

Proof (continued)
  At the same time, E_θ[V_θ(X)] = 0 implies that
    E_θ[ (V_θ(X) − E_θ[V_θ(X)]) (θ̂(X) − E_θ[θ̂(X)]) ]
      = E_θ[ V_θ(X) θ̂(X) ]
      = ∫ p(X|θ) ∂_θ log p(X|θ) θ̂(X) dX
      = ∂_θ ∫ p(X|θ) θ̂(X) dX
      = ∂_θ E_θ[θ̂(X)] = ∂_θ θ = 1   (using that θ̂ is unbiased).

Cautionary Note
  This does not mean that a biased estimator cannot have lower variance.
Fisher and Exponential Families

Fisher Score
  V_θ(x) = ∂_θ log p(x; θ) = φ(x) − ∂_θ g(θ)

Fisher Information
  I = Cov[V_θ(x)] = Cov[φ(x) − ∂_θ g(θ)] = ∂²_θ g(θ)
  The efficiency of an estimator can be obtained directly from the log-partition function.

Outer Product Matrix
  It is given (up to an offset) by ⟨φ(x), φ(x′)⟩. This leads to Kernel PCA ...
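A small check of I = ∂²_θ g(θ) (our example, not from the slides): for the Poisson family in canonical form, φ(x) = x and g(θ) = exp(θ), so the Fisher information exp(θ) should match the variance of the sufficient statistic:

    import numpy as np

    # Poisson in canonical form: p(x; theta) = exp(theta*x - exp(theta)) / x!
    # so phi(x) = x and g(theta) = exp(theta).
    theta = 0.7
    rng = np.random.default_rng(1)
    x = rng.poisson(np.exp(theta), size=200_000)

    second_deriv_g = np.exp(theta)     # d^2/dtheta^2 exp(theta) = exp(theta)
    print(x.var(), second_deriv_g)     # both approximately exp(0.7) = 2.01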
Priors

Problems with Maximum Likelihood
  With too little data, parameter estimates will be bad.

Prior to the rescue
  Often we know roughly where the solution should lie, and we encode this knowledge by means of a prior p(θ).

Normal Prior
  Simply set p(θ) ∝ exp( −‖θ‖²/(2σ²) ).

Posterior
  p(θ|X) ∝ exp( Σ_{i=1}^m [ ⟨φ(x_i), θ⟩ − g(θ) ] − ‖θ‖²/(2σ²) )
Tossing a die with priors (figure)
Conjugate Priors

Problem with the Normal Prior
  The posterior looks different from the likelihood, so many of the maximum likelihood optimization algorithms may no longer apply ...

Idea
  What if we had a prior which looked like additional data, that is
    p(θ|X) ∼ p(X|θ)?
  For exponential families this is easy. Simply set
    p(θ|a) ∝ exp( ⟨θ, m_0 a⟩ − m_0 g(θ) )

Posterior
  p(θ|X) ∝ exp( (m + m_0) [ ⟨ (mμ + m_0 a)/(m + m_0), θ ⟩ − g(θ) ] ),
  where μ = (1/m) Σ_{i=1}^m φ(x_i) is the empirical mean of the sufficient statistics.
Example: Multinomial Distribution

Laplace Rule
  A conjugate prior with parameters (a, m_0) in the multinomial family is obtained by setting a = (1/n, 1/n, ..., 1/n). This is often also called the Dirichlet prior. It leads to
    p(x) = (#x + m_0/n) / (m + m_0)   instead of   p(x) = #x / m

Example
  Outcome          1     2     3     4     5     6
  Counts           3     6     2     1     4     4
  MLE              0.15  0.30  0.10  0.05  0.20  0.20
  MAP (m_0 = 6)    0.15  0.27  0.12  0.08  0.19  0.19
  MAP (m_0 = 100)  0.16  0.19  0.16  0.15  0.17  0.17
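The MAP rows of the table can be reproduced with a few lines of code; this is our illustration of the Laplace/Dirichlet smoothing rule above:

    import numpy as np

    counts = np.array([3, 6, 2, 1, 4, 4])
    m, n = counts.sum(), counts.size

    def map_estimate(counts, m0, n):
        # Laplace/Dirichlet smoothing: p(x) = (#x + m0/n) / (m + m0)
        return (counts + m0 / n) / (counts.sum() + m0)

    print(np.round(counts / m, 2))                    # MLE:           [0.15 0.3  0.1  0.05 0.2  0.2 ]
    print(np.round(map_estimate(counts, 6, n), 2))    # MAP, m0 = 6:   [0.15 0.27 0.12 0.08 0.19 0.19]
    print(np.round(map_estimate(counts, 100, n), 2))  # MAP, m0 = 100: [0.16 0.19 0.16 0.15 0.17 0.17]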
Optimization Problems

Maximum Likelihood
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ]   ⇒   ∂_θ g(θ) = (1/m) Σ_{i=1}^m φ(x_i)

Normal Prior
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ] + ‖θ‖²/(2σ²)

Conjugate Prior
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ] + m_0 g(θ) − m_0 ⟨μ̃, θ⟩
  equivalently solve  ∂_θ g(θ) = (1/(m + m_0)) Σ_{i=1}^m φ(x_i) + (m_0/(m + m_0)) μ̃
Summary

Model
  Log-partition function
  Expectations and derivatives
  Maximum entropy formulation
  A zoo of densities

Estimation
  Maximum likelihood estimator
  Fisher information matrix and Cramér-Rao theorem
  Normal priors and conjugate priors
  Fisher information and log-partition function
Exponential Families and Kernels, Lecture 2
Alexander J. Smola (Alex.Smola@nicta.com.au)
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline

Exponential Families
  Maximum likelihood and Fisher information
  Priors (conjugate and normal)

Conditioning and Feature Spaces
  Conditional distributions and inner products
  Hammersley-Clifford decomposition

Applications
  Classification and novelty detection
  Regression

Applications
  Conditional random fields
  Intractable models and semidefinite approximations
Lecture 2

Hammersley-Clifford Theorem and Graphical Models
  Decomposition results
  Key connection

Conditional Distributions
  Log-partition function
  Expectations and derivatives
  Inner product formulation and kernels
  Gaussian processes

Applications
  Classification and regression
  Conditional random fields
  Spatial Poisson models
Graphical Model

Conditional Independence
  x and x′ are conditionally independent given c if
    p(x, x′ | c) = p(x | c) p(x′ | c)
  Distributions can be simplified greatly by conditional independence assumptions.

Markov Network
  Given a graph G(V, E) with vertices V and edges E, associate a random variable x ∈ ℝ^{|V|} with G.
  Subsets of random variables x_S and x_{S′} are conditionally independent given x_C if removing the vertices C from G(V, E) decomposes the graph into disjoint components containing S and S′.
Conditional Independence (figure)
Cliques

Definition
  A subset of the graph which is fully connected.
  Maximal cliques (they define the graph).

Advantage
  Easy to specify dependencies between variables.
  Use graph algorithms for inference.
Hammersley-Clifford Theorem

Problem
  Specify p(x) with given conditional independence properties.

Theorem
  p(x) = (1/Z) exp( Σ_{c ∈ C} ψ_c(x_c) )
  whenever p(x) is nonzero on the entire domain (C is the set of maximal cliques).

Application
  Apply the decomposition to exponential families, where p(x) = exp( ⟨φ(x), θ⟩ − g(θ) ).

Corollary
  The sufficient statistics φ(x) decompose according to
    φ(x) = (..., φ_c(x_c), ...)   ⇒   ⟨φ(x), φ(x′)⟩ = Σ_{c ∈ C} ⟨φ_c(x_c), φ_c(x′_c)⟩
Proof

Step 1: Obtain a linear functional
  Combining the exponential family setting with the Hammersley-Clifford theorem:
    ⟨Φ(x), θ⟩ = Σ_{c ∈ C} ψ_c(x_c) − log Z + g(θ)   for all x, θ.

Step 2: Orthonormal basis in θ
  Pick an orthonormal basis and absorb Z and g. This gives
    ⟨Φ(x), e_i⟩ = Σ_{c ∈ C} η^i_c(x_c)   for some functions η^i_c(x_c).

Step 3: Reconstruct the sufficient statistics
  Set Φ_c(x_c) := (η^1_c(x_c), η^2_c(x_c), ...), which allows us to compute
    ⟨Φ(x), θ⟩ = Σ_{c ∈ C} Σ_i θ_i Φ^i_c(x_c).
Example: Normal Distributions

Sufficient Statistics
  Recall that for normal distributions φ(x) = (x, xx⊤).

Hammersley-Clifford Application
  φ(x) must decompose into subsets involving only variables from each maximal clique.
  The linear term x is fine by default.
  The only nonzero terms coupling x_i and x_j are those corresponding to an edge in the graph G(V, E).

Inverse Covariance Matrix
  The natural parameter aligned with xx⊤ is the inverse covariance matrix. Its sparsity pattern mirrors G(V, E). Hence a sparse inverse kernel matrix corresponds to a graphical model!
Example: Normal Distributions

Density
  p(x|θ) = exp( Σ_{i=1}^n x_i θ^1_i + Σ_{i,j=1}^n x_i x_j θ^2_{ij} − g(θ) )
  Here θ^2 = Σ^{-1} is the inverse covariance matrix. We have (Σ^{-1})_{ij} ≠ 0 only if (i, j) share an edge.
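A tiny numerical illustration of the last point (ours): for a Gaussian whose graph is a chain x1 - x2 - x3 - x4, the inverse covariance is tridiagonal even though the covariance itself is dense.

    import numpy as np

    # Precision (inverse covariance) of a Gaussian Markov chain x1 - x2 - x3 - x4:
    # only neighbouring variables are coupled, so the matrix is tridiagonal.
    precision = np.array([[ 2., -1.,  0.,  0.],
                          [-1.,  2., -1.,  0.],
                          [ 0., -1.,  2., -1.],
                          [ 0.,  0., -1.,  2.]])

    covariance = np.linalg.inv(precision)
    print(np.round(covariance, 2))                  # dense: every pair of variables is correlated
    print(np.round(np.linalg.inv(covariance), 2))   # recovers the sparse, graph-structured precision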
Conditional Distributions

Conditional Density
  p(x|θ) = exp( ⟨φ(x), θ⟩ − g(θ) )
  p(y|x, θ) = exp( ⟨φ(x, y), θ⟩ − g(θ|x) )

Log-partition function
  g(θ|x) = log ∫_Y exp( ⟨φ(x, y), θ⟩ ) dy

Sufficient Criterion
  p(x, y|θ) is itself a member of the exponential family.

Key Idea
  Avoid computing φ(x, y) directly; only evaluate inner products via
    k((x, y), (x′, y′)) := ⟨φ(x, y), φ(x′, y′)⟩
Conditional Distributions

Maximum a Posteriori Estimation
  −log p(θ|X) = Σ_{i=1}^m −⟨φ(x_i), θ⟩ + m g(θ) + ‖θ‖²/(2σ²) + c
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Solving the Problem
  The problem is strictly convex in θ.
  A direct solution is impossible if we cannot compute φ(x, y) explicitly.
  Instead, expand θ as a linear combination of the φ(x_i, y) and solve the convex problem in the expansion coefficients.
Joint Feature Map (figure)
Representer Theorem

Objective Function
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Decomposition
  Decompose θ into θ = θ∥ + θ⊥, where
    θ∥ ∈ span{ φ(x_i, y) : 1 ≤ i ≤ m, y ∈ Y }
  Both g(θ|x_i) and ⟨φ(x_i, y_i), θ⟩ are independent of θ⊥.

Theorem
  −log p(θ|X, Y) is minimized for θ⊥ = 0, hence θ = θ∥.

Consequence
  If span{ φ(x_i, y) : 1 ≤ i ≤ m, y ∈ Y } is finite dimensional, we have a parametric optimization problem.
Using It

Expansion
  θ = Σ_{i=1}^m Σ_{y ∈ Y} α_{iy} φ(x_i, y)

Inner Product
  ⟨φ(x, y), θ⟩ = Σ_{i=1}^m Σ_{ỹ ∈ Y} α_{iỹ} k((x, y), (x_i, ỹ))

Norm
  ‖θ‖² = Σ_{i,j=1}^m Σ_{y, y′ ∈ Y} α_{iy} α_{jy′} k((x_i, y), (x_j, y′))

Log-partition function
  g(θ|x) = log Σ_{y ∈ Y} exp( ⟨φ(x, y), θ⟩ )
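A minimal sketch (ours) of these formulas in code, assuming the product kernel k((x, y), (x′, y′)) = k(x, x′)·[y = y′] introduced later in the lecture; the data and coefficients are made up for illustration:

    import numpy as np
    from scipy.special import logsumexp

    def rbf(a, b, gamma=1.0):
        # Base kernel on the inputs; the joint kernel is k(x, x') * [y == y'].
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def t_of_x(x, X, alpha, gamma=1.0):
        # t(x, y) = <phi(x, y), theta> for every class y, with
        # theta = sum_{i, y} alpha[i, y] phi(x_i, y).
        k = np.array([rbf(x, xi, gamma) for xi in X])   # vector of k(x, x_i)
        return alpha.T @ k                               # entry y is sum_i alpha[i, y] k(x, x_i)

    def log_partition(x, X, alpha, gamma=1.0):
        # g(theta | x) = log sum_y exp(t(x, y))
        return logsumexp(t_of_x(x, X, alpha, gamma))

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])        # toy inputs
    alpha = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]])  # alpha[i, y] for two classes
    x_new = np.array([0.5, 0.5])
    print(t_of_x(x_new, X, alpha), log_partition(x_new, X, alpha))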
The Gaussian Process Link

Normal Prior on θ ...
  θ ∼ N(0, σ² 1)
... yields a Normal Prior on t(x, y) = ⟨φ(x, y), θ⟩
  The distribution of a projected Gaussian is Gaussian.
  The mean vanishes: E_θ[t(x, y)] = ⟨φ(x, y), E_θ[θ]⟩ = 0
  The covariance yields
    Cov[t(x, y), t(x′, y′)] = E_θ[ ⟨φ(x, y), θ⟩ ⟨θ, φ(x′, y′)⟩ ] = σ² ⟨φ(x, y), φ(x′, y′)⟩ =: k((x, y), (x′, y′))
... so we have a Gaussian process on x, with kernel k((x, y), (x′, y′)) = σ² ⟨φ(x, y), φ(x′, y′)⟩.
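The figure slides that follow show draws from such Gaussian process priors. A small sketch (ours) of how such sample paths can be generated for a Gaussian RBF covariance; the length scale and grid are our choices:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)

    # Gaussian RBF covariance k(x, x') = exp(-|x - x'|^2 / (2 * ell^2))
    ell = 0.5
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))

    # Draw a few functions from the GP prior N(0, K); jitter keeps the Cholesky stable.
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
    samples = L @ rng.standard_normal((len(x), 3))   # three sample paths, one per column
    # import matplotlib.pyplot as plt; plt.plot(x, samples); plt.show()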
Linear Covariance (figure)
Laplacian Covariance (figure)
Gaussian Covariance (figure)
Polynomial (Order 3) (figure)
B3-Spline Covariance (figure)
Sample from Gaussian RBF (sequence of figure slides showing several draws)
Sample from linear kernel (sequence of figure slides showing several draws)
General Strategy

Choose a suitable sufficient statistic φ(x, y)
  A conditionally multinomial distribution leads to a Gaussian process multiclass estimator: a distribution over n classes which depends on x.
  A conditionally Gaussian distribution leads to Gaussian process regression: a normal distribution over a random variable which depends on the location. Note: we estimate mean and variance.
  Conditionally Poisson distributions yield locally varying Poisson processes. This has no name yet ...

Solve the optimization problem
  This is typically convex.

The bottom line
  Instead of choosing k(x, x′), choose k((x, y), (x′, y′)).
Example: GP Classification

Sufficient Statistic
  We pick φ(x, y) = φ(x) ⊗ e_y, that is
    k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}

Kernel Expansion
  By the representer theorem we get
    θ = Σ_{i=1}^m Σ_y α_{iy} φ(x_i, y)

Optimization Problem
  A big mess ... but convex.
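One way (ours, purely illustrative) to realise k((x, y), (x′, y′)) = k(x, x′) δ_{yy′} in code is as a Kronecker product of the input Gram matrix with an identity over the classes:

    import numpy as np

    def joint_kernel_matrix(K_x, n_classes):
        # Joint kernel k((x, y), (x', y')) = k(x, x') * [y == y'] arranged as a block
        # matrix: Kronecker product of the input Gram matrix with the class identity.
        return np.kron(K_x, np.eye(n_classes))

    X = np.array([[0.0], [1.0], [2.0]])
    K_x = np.exp(-(X - X.T) ** 2)          # Gaussian RBF Gram matrix on the inputs
    K_joint = joint_kernel_matrix(K_x, n_classes=3)
    print(K_joint.shape)                   # (9, 9): m * n rows and columns, indexed by (i, y)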
A Toy Example (figure)
Noisy Data (figure)
Summary

Hammersley-Clifford Theorem and Graphical Models
  Decomposition results
  Key connection
  Normal distribution

Conditional Distributions
  Log-partition function
  Expectations and derivatives
  Inner product formulation and kernels
  Gaussian processes

Applications
  Generalized kernel trick
  Conditioning recovers existing estimation methods
Exponential Families and Kernels, Lecture 3
Alexander J. Smola (Alex.Smola@nicta.com.au)
Machine Learning Program, National ICT Australia
RSISE, The Australian National University
Outline

Exponential Families
  Maximum likelihood and Fisher information
  Priors (conjugate and normal)

Conditioning and Feature Spaces
  Conditional distributions and inner products
  Hammersley-Clifford decomposition

Applications
  Classification and novelty detection
  Regression

Applications
  Conditional random fields
  Intractable models and semidefinite approximations
Lecture 3

Novelty Detection
  Density estimation
  Thresholding and likelihood ratio

Classification
  Log-partition function
  Optimization problem
  Examples
  Clustering and transduction

Regression
  Conditional normal distribution
  Estimating the covariance
  Heteroscedastic estimators
Density Estimation

Maximum a Posteriori
  minimize_θ  Σ_{i=1}^m [ g(θ) − ⟨φ(x_i), θ⟩ ] + ‖θ‖²/(2σ²)

Advantages
  Convex optimization problem.
  Concentration of measure.

Problems
  The normalization g(θ) may be painful to compute.
  For density estimation we do not actually need a normalized p(x|θ).
  There is no need to perform particularly well in high density regions.
Novelty Detection (figure)
Novelty Detection

Optimization Problem
  MAP:      Σ_{i=1}^m −log p(x_i|θ) + ‖θ‖²/(2σ²)
  Novelty:  Σ_{i=1}^m max( −log [ p(x_i|θ) / exp(ρ − g(θ)) ], 0 ) + ‖θ‖²/2
          = Σ_{i=1}^m max( ρ − ⟨φ(x_i), θ⟩, 0 ) + ‖θ‖²/2

Advantages
  No normalization g(θ) needed.
  No need to perform particularly well in high density regions (the estimator focuses on low-density regions).
  Quadratic program.
Geometric Interpretation

Idea
  Find the hyperplane that has maximum distance from the origin, yet is still closer to the origin than the observations.

Hard Margin
  minimize  ‖θ‖²/2
  subject to  ⟨θ, x_i⟩ ≥ 1

Soft Margin
  minimize  ‖θ‖²/2 + C Σ_{i=1}^m ξ_i
  subject to  ⟨θ, x_i⟩ ≥ 1 − ξ_i  and  ξ_i ≥ 0
Dual Optimization Problem

Primal Problem
  minimize  ‖θ‖²/2 + C Σ_{i=1}^m ξ_i
  subject to  ⟨θ, x_i⟩ − 1 + ξ_i ≥ 0  and  ξ_i ≥ 0

Lagrange Function
  We construct a Lagrange function L by subtracting the constraints, multiplied by Lagrange multipliers (α_i and η_i), from the primal objective function. L has a saddle point at the optimal solution.
    L = ‖θ‖²/2 + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i ( ⟨θ, x_i⟩ − 1 + ξ_i ) − Σ_{i=1}^m η_i ξ_i
  where α_i, η_i ≥ 0. For instance, if ξ_i < 0 we could increase L without bound via η_i.
Dual Problem, Part II

Optimality Conditions
  ∂_θ L = θ − Σ_{i=1}^m α_i x_i = 0   ⇒   θ = Σ_{i=1}^m α_i x_i
  ∂_{ξ_i} L = C − α_i − η_i = 0   ⇒   α_i ∈ [0, C]
  Now we substitute the two optimality conditions back into L and eliminate the primal variables.

Dual Problem
  minimize  (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩ − Σ_{i=1}^m α_i
  subject to  α_i ∈ [0, C]
  Convexity ensures uniqueness of the optimum.
The ν-Trick

Problem
  Depending on how we choose C, the number of points lying on the "wrong" side of the hyperplane H := {x | ⟨θ, x⟩ = 1} will vary.
  We would like to specify a certain fraction ν beforehand.
  We want to make the setting more adaptive to the data.

Solution
  Use an adaptive hyperplane that separates the data from the origin, i.e. find H := {x | ⟨θ, x⟩ = ρ}, where the threshold ρ is adaptive.
The ν-Trick

Primal Problem
  minimize  ‖θ‖²/2 + Σ_{i=1}^m ξ_i − mνρ
  subject to  ⟨θ, x_i⟩ − ρ + ξ_i ≥ 0  and  ξ_i ≥ 0

Dual Problem
  minimize  (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩
  subject to  α_i ∈ [0, 1]  and  Σ_{i=1}^m α_i = νm

Difference to before
  The Σ_i α_i term vanishes from the objective function, but we get one more constraint, namely Σ_i α_i = νm.
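In practice the ν-parameterised novelty detector is available off the shelf; a minimal sketch using scikit-learn's OneClassSVM, with data and settings chosen by us for illustration:

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                   # "normal" data
    X_test = np.array([[0.0, 0.0], [4.0, 4.0]])     # one typical point, one outlier

    # nu upper-bounds the fraction of training points on the wrong side of the margin.
    clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5).fit(X)
    print(clf.predict(X_test))                      # expected roughly [ 1 -1 ]: inlier, outlier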
The ν-Property

Optimization Problem
  minimize  ‖θ‖²/2 + Σ_{i=1}^m ξ_i − mνρ
  subject to  ⟨θ, x_i⟩ − ρ + ξ_i ≥ 0  and  ξ_i ≥ 0

Theorem
  At most a fraction ν of the points lie on the "wrong" side of the margin, i.e. y_i f(x_i) < 1.
  At most a fraction 1 − ν of the points lie on the "right" side of the margin, i.e. y_i f(x_i) > 1.
  In the limit, those fractions become exact.

Proof Idea
  At the optimum, shift ρ slightly: only the active constraints have an influence on the objective function.
Classification

Maximum a Posteriori Estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Domain
  Finite set of labels Y = {1, ..., n}.
  The log-partition function g(θ|x) is easy to compute.
  Optional centering φ(x, y) → φ(x, y) + c leaves p(y|x, θ) unchanged (it offsets both terms).

Gaussian Process Connection
  The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process, so this is the same setting as in the literature.
Classification

Sufficient Statistic
  We pick φ(x, y) = φ(x) ⊗ e_y, that is
    k((x, y), (x′, y′)) = k(x, x′) δ_{yy′}   where y, y′ ∈ {1, ..., n}

Kernel Expansion
  By the representer theorem we get
    θ = Σ_{i=1}^m Σ_y α_{iy} φ(x_i, y)

Optimization Problem
  A big mess ... but convex. Solve by Newton or block-Jacobi methods.
A Toy Example (figure)
Noisy Data (figure)
SVM Connection

Problems with GP Classification
  We optimize even where the classification is already good.
  Only the sign of the classification is needed.
  Only the "strongest" wrong class matters.
  We want to classify with a margin.

Optimization Problem
  MAP:  Σ_{i=1}^m −log p(y_i|x_i, θ) + ‖θ‖²/(2σ²)
  SVM:  Σ_{i=1}^m max( ρ − log [ p(y_i|x_i, θ) / max_{y ≠ y_i} p(y|x_i, θ) ], 0 ) + ‖θ‖²/2
      = Σ_{i=1}^m max( ρ − ⟨φ(x_i, y_i), θ⟩ + max_{y ≠ y_i} ⟨φ(x_i, y), θ⟩, 0 ) + ‖θ‖²/2
Binary Classification

Sufficient Statistics
  The offset in φ(x, y) can be arbitrary.
  Pick it such that φ(x, y) = y φ(x) where y ∈ {±1}.
  The kernel matrix becomes
    K_{ij} = k((x_i, y_i), (x_j, y_j)) = y_i y_j k(x_i, x_j)

Optimization Problem
  The max over the other class becomes
    max_{y ≠ y_i} ⟨φ(x_i, y), θ⟩ = −y_i ⟨φ(x_i), θ⟩
  Overall problem:
    Σ_{i=1}^m max( ρ − 2 y_i ⟨φ(x_i), θ⟩, 0 ) + ‖θ‖²/2
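A small sketch (ours) of the label-modulated kernel matrix K_{ij} = y_i y_j k(x_i, x_j) used above, with toy data:

    import numpy as np

    X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]])
    y = np.array([1, -1, 1])

    # Base RBF Gram matrix k(x_i, x_j)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_x = np.exp(-sq_dists)

    # Joint kernel for binary labels: K_ij = y_i * y_j * k(x_i, x_j)
    K = np.outer(y, y) * K_x
    print(K)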
Geometrical Interpretation

  minimize  ‖θ‖²/2
  subject to  y_i ( ⟨θ, x_i⟩ + b ) ≥ 1  for all i
Optimization Problem

Linear Function
  f(x) = ⟨θ, x⟩ + b

Mathematical Programming Setting
  If we require error-free classification with a margin, i.e. y f(x) ≥ 1, we obtain:
    minimize  ‖θ‖²/2
    subject to  y_i ( ⟨θ, x_i⟩ + b ) − 1 ≥ 0  for all 1 ≤ i ≤ m

Result
  The dual of the optimization problem is a simple quadratic program (more later ...).

Connection back to conditional probabilities
  The offset b takes care of a bias towards one of the classes.
Regression

Maximum a Posteriori Estimation
  −log p(θ|X, Y) = Σ_{i=1}^m [ −⟨φ(x_i, y_i), θ⟩ + g(θ|x_i) ] + ‖θ‖²/(2σ²) + c

Domain
  Continuous domain of observations Y = ℝ.
  The log-partition function g(θ|x) is easy to compute in closed form, since the conditional is a normal distribution.

Gaussian Process Connection
  The inner product t(x, y) = ⟨φ(x, y), θ⟩ is drawn from a Gaussian process; in particular this also gives rescaled mean and covariance.
Regression

Sufficient Statistic (Standard Model)
  We pick φ(x, y) = (y φ(x), y²), that is
    k((x, y), (x′, y′)) = k(x, x′) y y′ + y² y′²   where y, y′ ∈ ℝ
  Traditionally the variance is fixed, that is θ_2 = const.

Sufficient Statistic (Fancy Model)
  We pick φ(x, y) = (y φ_1(x), y² φ_2(x)), that is
    k((x, y), (x′, y′)) = k_1(x, x′) y y′ + k_2(x, x′) y² y′²   where y, y′ ∈ ℝ
  Here we estimate mean and variance simultaneously.

Kernel Expansion
  By the representer theorem (and more algebra) we get
    θ = ( Σ_{i=1}^m α_{i1} φ_1(x_i), Σ_{i=1}^m α_{i2} φ_2(x_i) )
Training Data (figure)
Mean: k(x)⊤ (K + σ² 1)^{-1} y (figure)
Variance: k(x, x) + σ² − k(x)⊤ (K + σ² 1)^{-1} k(x) (figure)
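A compact sketch (ours) of the two formulas shown on the preceding slides, i.e. the GP regression mean k(x)⊤(K + σ²1)^{-1}y and variance k(x, x) + σ² − k(x)⊤(K + σ²1)^{-1}k(x); the data, kernel length scale, and noise level are illustrative:

    import numpy as np

    def rbf(A, B, ell=1.0):
        # Gaussian RBF Gram matrix between the rows of A and B.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * ell ** 2))

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 20)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)   # noisy training targets
    X_star = np.array([[0.5]])                             # test input
    sigma2 = 0.01                                          # noise variance

    K = rbf(X, X)
    k_star = rbf(X, X_star)[:, 0]                          # vector k(x) of kernel values
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)
    beta = np.linalg.solve(K + sigma2 * np.eye(len(X)), k_star)

    mean = k_star @ alpha                                  # k(x)^T (K + sigma^2 I)^{-1} y
    var = rbf(X_star, X_star)[0, 0] + sigma2 - k_star @ beta
    print(mean, var)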
Putting everything together ... (figure)
Another Example (figure)