Intractability

The M-step for a graphical model is usually (relatively) easy.

[Figure: a five-node directed graphical model on $A,\dots,E$ and the equivalent factor graph]

$$P(A,B,C,D,E) = P(A)\,P(B)\,\underbrace{P(C|A,B)}_{f_1(A,B,C)}\,\underbrace{P(D|B,C)}_{f_2(B,C,D)}\,\underbrace{P(E|C,D)}_{f_3(C,D,E)}$$

◮ Need expected sufficient stats from marginal posteriors on each factor group.
◮ Then (at least for a DAG) we can optimise each factor's parameter vector separately.
◮ Intractability in EM comes from the difficulty of computing marginal posteriors in graphs with large tree-width or non-linear/non-conjugate conditionals.
◮ [For non-DAG models, the partition function (normalising constant) may also be intractable.]
Free-energy-based variational approximation

What if finding expected sufficient stats under $P(\mathcal{Y}|\mathcal{X},\theta)$ is computationally intractable?

For the generalised EM algorithm, we argued that intractable maximisations could be replaced by gradient M-steps.
◮ Each step increases the likelihood.
◮ A fixed point of the gradient M-step must be at a mode of the expected log-joint.

For the E-step we could:
◮ Parameterise $q = q_\rho(\mathcal{Y})$ and take a gradient step in $\rho$.
◮ Assume some simplified form for $q$, usually factored: $q = \prod_i q_i(\mathcal{Y}_i)$, where the $\mathcal{Y}_i$ partition $\mathcal{Y}$, and maximise within this form.

In either case, we choose $q$ from within a limited set $\mathcal{Q}$:

VE step: maximise $F(q,\theta)$ wrt the latent distribution, constrained to $\mathcal{Q}$, given parameters:
$$q^{(k)}(\mathcal{Y}) := \operatorname*{argmax}_{q(\mathcal{Y})\in\mathcal{Q}} F\big(q(\mathcal{Y}),\,\theta^{(k-1)}\big).$$

M step: unchanged:
$$\theta^{(k)} := \operatorname*{argmax}_{\theta} F\big(q^{(k)}(\mathcal{Y}),\,\theta\big) = \operatorname*{argmax}_{\theta} \int q^{(k)}(\mathcal{Y}) \log p(\mathcal{Y},\mathcal{X}|\theta)\, d\mathcal{Y}.$$

Unlike in GEM, the fixed point may not be at an unconstrained optimum of $F$.
What do we lose?

What does restricting $q$ to $\mathcal{Q}$ cost us?

◮ Recall that the free energy is bounded above, by Jensen: $F(q,\theta) \le \ell(\theta_{\text{ML}})$. Thus, as long as every step increases $F$, convergence is still guaranteed.
◮ But, since $P(\mathcal{Y}|\mathcal{X},\theta^{(k)})$ may not lie in $\mathcal{Q}$, we no longer saturate the bound after the E-step. Thus, the likelihood may not increase on each full EM step:
$$\ell\big(\theta^{(k-1)}\big) \;\underset{\text{E step}}{\ge}\; F\big(q^{(k)},\theta^{(k-1)}\big) \;\underset{\text{M step}}{\le}\; F\big(q^{(k)},\theta^{(k)}\big) \;\underset{\text{Jensen}}{\le}\; \ell\big(\theta^{(k)}\big),$$
where the first relation is no longer an equality, so $\ell(\theta^{(k-1)})$ and $\ell(\theta^{(k)})$ are not ordered.
◮ This means we may not converge to a maximum of $\ell$.

The hope is that by increasing a lower bound on $\ell$ we will find a decent solution.

[Note that if $P(\mathcal{Y}|\mathcal{X},\theta_{\text{ML}}) \in \mathcal{Q}$, then $\theta_{\text{ML}}$ is a fixed point of the variational algorithm.]
KL divergence

Recall that
$$\begin{aligned}
F(q,\theta) &= \big\langle \log P(\mathcal{X},\mathcal{Y}|\theta) \big\rangle_{q(\mathcal{Y})} + H[q] \\
&= \big\langle \log P(\mathcal{X}|\theta) + \log P(\mathcal{Y}|\mathcal{X},\theta) \big\rangle_{q(\mathcal{Y})} - \big\langle \log q(\mathcal{Y}) \big\rangle_{q(\mathcal{Y})} \\
&= \log P(\mathcal{X}|\theta) - \mathrm{KL}\big[q \,\big\|\, P(\mathcal{Y}|\mathcal{X},\theta)\big].
\end{aligned}$$

Thus, the E step
$$q^{(k)}(\mathcal{Y}) := \operatorname*{argmax}_{q(\mathcal{Y})\in\mathcal{Q}} F\big(q(\mathcal{Y}),\,\theta^{(k-1)}\big)$$
(maximise $F(q,\theta)$ wrt the distribution over latents, given parameters) is equivalent to:

E step: minimise $\mathrm{KL}\big[q \,\|\, p(\mathcal{Y}|\mathcal{X},\theta)\big]$ wrt the distribution over latents, given parameters:
$$q^{(k)}(\mathcal{Y}) := \operatorname*{argmin}_{q(\mathcal{Y})\in\mathcal{Q}} \int q(\mathcal{Y}) \log \frac{q(\mathcal{Y})}{p(\mathcal{Y}|\mathcal{X},\theta^{(k-1)})}\, d\mathcal{Y}.$$

So, in each E step, the algorithm is trying to find the best approximation to $P(\mathcal{Y}|\mathcal{X})$ in $\mathcal{Q}$, in a KL sense. This is related to ideas in information geometry. It also suggests generalisations to other distance measures.
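This identity is easy to verify numerically. Below is a quick sketch (not from the slides) on a toy model with one observed value and a single binary latent; the probability table is made up for illustration.

```python
import numpy as np

# p_joint holds p(x, y) at the observed x, for y in {0, 1}.
p_joint = np.array([0.3, 0.1])          # illustrative numbers
p_x = p_joint.sum()                     # marginal likelihood P(x)
post = p_joint / p_x                    # exact posterior P(y|x)

q = np.array([0.6, 0.4])                # an arbitrary approximation
F = q @ np.log(p_joint) - q @ np.log(q) # <log p(x,y)>_q + H[q]
kl = q @ np.log(q / post)               # KL[q || P(y|x)]

assert np.isclose(F, np.log(p_x) - kl)  # F = log P(x) - KL
```

Since $\log P(x)$ is fixed, maximising $F$ over $q$ is exactly minimising the KL term.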
Factored Variational E-step

The most common form of variational approximation partitions $\mathcal{Y}$ into disjoint sets $\mathcal{Y}_i$, with
$$\mathcal{Q} = \Big\{\, q : q(\mathcal{Y}) = \prod_i q_i(\mathcal{Y}_i) \,\Big\}.$$

In this case the E-step is itself iterative:

(Factored VE step)$_i$: maximise $F(q,\theta)$ wrt $q_i(\mathcal{Y}_i)$, given the other $q_j$ and the parameters:
$$q_i^{(k)}(\mathcal{Y}_i) := \operatorname*{argmax}_{q_i(\mathcal{Y}_i)} F\Big( q_i(\mathcal{Y}_i) \prod_{j\ne i} q_j(\mathcal{Y}_j),\; \theta^{(k-1)} \Big).$$

◮ The $q_i$ updates are iterated to convergence to "complete" the VE-step.
◮ In fact, every (VE)$_i$-step separately increases $F$, so any schedule of (VE)$_i$- and M-steps will converge. The choice can be dictated by practical issues (it is rarely efficient to fully converge the E-step before updating the parameters).
Factored Variational E-step

The Factored Variational E-step has a general form. The free energy is:
$$\begin{aligned}
F\Big(\prod_j q_j(\mathcal{Y}_j),\, \theta^{(k-1)}\Big)
&= \Big\langle \log P(\mathcal{X},\mathcal{Y}|\theta^{(k-1)}) \Big\rangle_{\prod_j q_j(\mathcal{Y}_j)} + \sum_j H\big[q_j(\mathcal{Y}_j)\big] \\
&= \int d\mathcal{Y}_i\, q_i(\mathcal{Y}_i)\, \Big\langle \log P(\mathcal{X},\mathcal{Y}|\theta^{(k-1)}) \Big\rangle_{\prod_{j\ne i} q_j(\mathcal{Y}_j)} + H[q_i] + \sum_{j\ne i} H[q_j].
\end{aligned}$$

Now, taking the variational derivative of the Lagrangian (enforcing normalisation of $q_i$):
$$\frac{\delta}{\delta q_i}\bigg( F + \lambda\Big(\int q_i - 1\Big) \bigg)
= \Big\langle \log P(\mathcal{X},\mathcal{Y}|\theta^{(k-1)}) \Big\rangle_{\prod_{j\ne i} q_j(\mathcal{Y}_j)} - \log q_i(\mathcal{Y}_i) - 1 + \lambda$$
$$(=0) \;\Rightarrow\; q_i(\mathcal{Y}_i) \propto \exp\Big\langle \log P(\mathcal{X},\mathcal{Y}|\theta^{(k-1)}) \Big\rangle_{\prod_{j\ne i} q_j(\mathcal{Y}_j)}.$$

In general, this depends only on the expected sufficient statistics under the $q_j$. Thus, again, we don't actually need the entire distributions, just the relevant expectations (now for approximate inference as well as learning).
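The update can be seen in action on a toy posterior. Here is a minimal sketch (not from the slides) of coordinate ascent on a fully-factored $q(y_1)\,q(y_2)$ for a two-variable discrete joint; the table entries are arbitrary.

```python
import numpy as np

# Unnormalised joint p(y1, y2) at the observed data
# (rows index y1, columns index y2; illustrative numbers).
log_p = np.log(np.array([[0.30, 0.05],
                         [0.15, 0.50]]))

q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
for _ in range(100):
    # q1(y1) ∝ exp( Σ_{y2} q2(y2) log p(y1, y2) )
    q1 = np.exp(log_p @ q2); q1 /= q1.sum()
    # q2(y2) ∝ exp( Σ_{y1} q1(y1) log p(y1, y2) )
    q2 = np.exp(q1 @ log_p); q2 /= q2.sum()
```

Each update increases $F$, so the pair $(q_1, q_2)$ converges to a (possibly local) optimum of the factored free energy.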
Mean-field approximations

If $\mathcal{Y}_i = y_i$ (i.e., $q$ is factored over all latent variables) then the variational technique is often called a "mean-field" approximation.

◮ Suppose $P(\mathcal{X},\mathcal{Y})$ has sufficient statistics that are separable in the latent variables, e.g. the Boltzmann machine
$$P(\mathcal{X},\mathcal{Y}) = \frac{1}{Z} \exp\Big( \sum_{ij} W_{ij} s_i s_j + \sum_i b_i s_i \Big)$$
with some $s_i \in \mathcal{Y}$ and the others observed.
◮ Expectations wrt a fully-factored $q$ distribute over all $s_i \in \mathcal{Y}$ (up to the constant $-\log Z$):
$$\big\langle \log P(\mathcal{X},\mathcal{Y}) \big\rangle_{\prod_i q_i} = \sum_{ij} W_{ij} \langle s_i \rangle_{q_i} \langle s_j \rangle_{q_j} + \sum_i b_i \langle s_i \rangle_{q_i} - \log Z$$
(where $q_i$ for $s_i \in \mathcal{X}$ is a delta function on the observed value).
◮ Thus, we can update each $q_i$ in turn given the means (or, in general, the mean sufficient statistics) of the others.
◮ Each variable sees the mean field imposed by its neighbours, and we update these fields until they all agree.
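A minimal sketch of these updates (not from the slides), assuming binary units $s_i \in \{0,1\}$ and a weight matrix with zero diagonal; all names are illustrative.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mean_field_boltzmann(W, b, observed, n_sweeps=50):
    """Coordinate-ascent mean field for a Boltzmann machine with
    P ∝ exp(Σ_ij W_ij s_i s_j + Σ_i b_i s_i) over ordered pairs (i, j),
    zero diagonal.  `observed` maps unit index -> clamped value, so
    q_i for an observed unit is a delta function."""
    n = len(b)
    mu = np.full(n, 0.5)             # initial mean field, mu[i] = <s_i>
    for i, v in observed.items():
        mu[i] = float(v)
    total_W = W + W.T                # total coefficient coupling s_i, s_j
    for _ in range(n_sweeps):        # each sweep increases F
        for i in range(n):
            if i in observed:
                continue
            # q_i(s_i) ∝ exp(s_i * field)  =>  q_i(1) = sigmoid(field)
            mu[i] = sigmoid(total_W[i] @ mu + b[i])
    return mu                        # mu[i] = q_i(s_i = 1)
```

Each single-site update is exactly $q_i \propto \exp\langle \log P \rangle_{q_{\neg i}}$; sweeping until the $\mu_i$ stop changing gives the self-consistent mean fields.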
Mean-field FHMM

[Figure: factorial HMM with chains $s^{(m)}_{1:T}$, $m = 1,2,3$, and observations $x_{1:T}$]

Factor the posterior over every chain-time pair:
$$q(s^{1:M}_{1:T}) = \prod_{m,t} q^m_t(s^m_t)$$

$$\begin{aligned}
q^m_t(s^m_t) &\propto \exp\Big\langle \log P(s^{1:M}_{1:T}, x_{1:T}) \Big\rangle_{\prod_{\neg(m,t)} q^{m'}_{t'}(s^{m'}_{t'})} \\
&= \exp\Big\langle \sum_{\mu,\tau} \log P(s^\mu_\tau | s^\mu_{\tau-1}) + \sum_\tau \log P(x_\tau | s^{1:M}_\tau) \Big\rangle_{\prod_{\neg(m,t)} q^{m'}_{t'}} \\
&\propto \exp\Big\{ \big\langle \log P(s^m_t | s^m_{t-1}) \big\rangle_{q^m_{t-1}} + \big\langle \log P(x_t | s^{1:M}_t) \big\rangle_{q^{\neg m}_t} + \big\langle \log P(s^m_{t+1} | s^m_t) \big\rangle_{q^m_{t+1}} \Big\}
\end{aligned}$$

so that $q^m_t(i) \propto \alpha^m_t(i)\, \beta^m_t(i)$, with
$$\alpha^m_t(i) \propto e^{\sum_j q^m_{t-1}(j) \log \Phi^m_{ji}} \cdot e^{\langle \log A_i(x_t) \rangle_{q^{\neg m}_t}}, \qquad
\beta^m_t(i) \propto e^{\sum_j \log \Phi^m_{ij}\, q^m_{t+1}(j)}.$$

Cf. forward-backward:
$$\alpha_t(i) \propto \sum_j \alpha_{t-1}(j)\, \Phi_{ji} \cdot A_i(x_t), \qquad
\beta_t(i) \propto \sum_j \Phi_{ij}\, A_j(x_{t+1})\, \beta_{t+1}(j).$$

◮ Yields a message-passing algorithm like forward-backward
◮ Updates depend only on immediate neighbours in the chain
◮ Chains couple only through the joint output
◮ Multiple passes; messages depend on (approximate) marginals
◮ Evidence does not appear explicitly in the backward message (cf. Kalman smoothing)
Structured variational approximation

◮ $q(\mathcal{Y})$ need not be completely factorised.
◮ For example, suppose $\mathcal{Y}$ can be partitioned into sets $\mathcal{Y}_1$ and $\mathcal{Y}_2$ such that computing the expected sufficient statistics under $P(\mathcal{Y}_1|\mathcal{Y}_2,\mathcal{X})$ and $P(\mathcal{Y}_2|\mathcal{Y}_1,\mathcal{X})$ would be tractable. ⇒ Then the factored approximation $q(\mathcal{Y}) = q(\mathcal{Y}_1)\, q(\mathcal{Y}_2)$ is tractable.
◮ In particular, any factorisation of $q(\mathcal{Y})$ into a product of distributions on trees yields a tractable approximation.

[Figure: coupled chains $A_t, B_t, C_t, D_t$ unrolled over time]
Structured FHMM

[Figure: factorial HMM as before]

For the FHMM we can factor the posterior over chains, keeping each chain intact:
$$q(s^{1:M}_{1:T}) = \prod_m q^m(s^m_{1:T})$$

$$\begin{aligned}
q^m(s^m_{1:T}) &\propto \exp\Big\langle \log P(s^{1:M}_{1:T}, x_{1:T}) \Big\rangle_{\prod_{\neg m} q^{m'}(s^{m'}_{1:T})} \\
&= \exp\Big\langle \sum_{\mu,t} \log P(s^\mu_t | s^\mu_{t-1}) + \sum_t \log P(x_t | s^{1:M}_t) \Big\rangle_{\prod_{\neg m} q^{m'}} \\
&\propto \exp\Big\{ \sum_t \log P(s^m_t | s^m_{t-1}) + \sum_t \big\langle \log P(x_t | s^{1:M}_t) \big\rangle_{\prod_{\neg m} q^{m'}(s^{m'}_t)} \Big\} \\
&= \prod_t P(s^m_t | s^m_{t-1})\, e^{\langle \log P(x_t | s^{1:M}_t) \rangle_{\prod_{\neg m} q^{m'}(s^{m'}_t)}}
\end{aligned}$$

This looks like a standard HMM joint, with a modified likelihood term ⇒ cycle through multiple forward-backward passes, updating the likelihood terms each time.
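A minimal sketch of that outer loop (not from the slides), assuming Gaussian FHMM emissions $x_t \sim \mathcal{N}\big(\sum_m W^{(m)}_{:,s^m_t},\, I\big)$ with identity covariance for brevity; all names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_pi, log_Phi, log_lik):
    """Standard HMM smoother.  log_Phi[i, j] = log P(s_t=j | s_{t-1}=i);
    log_lik[t, i] is here the *expected* log-emission term -- the only
    change relative to exact forward-backward."""
    T, K = log_lik.shape
    la, lb = np.zeros((T, K)), np.zeros((T, K))
    la[0] = log_pi + log_lik[0]
    for t in range(1, T):
        la[t] = log_lik[t] + logsumexp(la[t-1][:, None] + log_Phi, axis=0)
    for t in range(T - 2, -1, -1):
        lb[t] = logsumexp(log_Phi + (log_lik[t+1] + lb[t+1])[None, :], axis=1)
    post = la + lb
    return np.exp(post - logsumexp(post, axis=1, keepdims=True))

def structured_ve_step(X, W, log_pi, log_Phi, n_sweeps=10):
    """Structured VE step for the FHMM.  W[m] is D x K (one output mean
    per state of chain m); log_pi[m], log_Phi[m] are chain m's initial
    and transition log-probabilities.  Returns marginals q[m][t, i]."""
    M, (T, D) = len(W), X.shape
    K = W[0].shape[1]
    q = [np.full((T, K), 1.0 / K) for _ in range(M)]
    for _ in range(n_sweeps):
        for m in range(M):
            # mean contribution of the other chains at each time step
            other = sum(q[n] @ W[n].T for n in range(M) if n != m)  # T x D
            # <log N(x_t; w_i + other_t, I)> up to i-independent constants
            log_lik = (X - other) @ W[m] - 0.5 * np.sum(W[m] ** 2, axis=0)
            q[m] = forward_backward(log_pi[m], log_Phi[m], log_lik)
    return q
```

Each pass re-runs exact forward-backward on one chain, with the likelihood terms replaced by their expectations under the other chains' current posteriors, so every sweep increases $F$.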
Messages on an arbitrary graph

Consider a DAG:
$$P(\mathcal{X},\mathcal{Y}) = \prod_k P(Z_k | \mathrm{pa}(Z_k))$$
and let $q(\mathcal{Y}) = \prod_i q_i(\mathcal{Y}_i)$ for disjoint sets $\{\mathcal{Y}_i\}$.

[Figure: the five-node DAG on $A,\dots,E$ from before]

We have that the VE update for $q_i$ is given by
$$q_i^*(\mathcal{Y}_i) \propto \exp \big\langle \log p(\mathcal{Y},\mathcal{X}) \big\rangle_{q_{\neg i}(\mathcal{Y})},$$
where $\langle \cdot \rangle_{q_{\neg i}(\mathcal{Y})}$ denotes averaging with respect to $q_j(\mathcal{Y}_j)$ for all $j \ne i$. Then:
$$\begin{aligned}
\log q_i^*(\mathcal{Y}_i) &= \Big\langle \sum_k \log P(Z_k | \mathrm{pa}(Z_k)) \Big\rangle_{q_{\neg i}(\mathcal{Y})} + \text{const} \\
&= \sum_{j \in \mathcal{Y}_i} \big\langle \log P(Y_j | \mathrm{pa}(Y_j)) \big\rangle_{q_{\neg i}(\mathcal{Y})} + \sum_{j \in \mathrm{ch}(\mathcal{Y}_i)} \big\langle \log P(Z_j | \mathrm{pa}(Z_j)) \big\rangle_{q_{\neg i}(\mathcal{Y})} + \text{const}.
\end{aligned}$$

This defines messages that are passed between nodes in the graph. Each node receives messages from its Markov boundary: parents, children and parents of children (all neighbours in the corresponding factor graph).
Non-factored variational methods

The term variational approximation is used whenever a bound on the likelihood (or on another estimation cost function) is optimised, but does not necessarily become tight. Many further variational approximations have been developed, including:
◮ parametric forms (e.g. Gaussian) for non-linear models
◮ non-free-energy-based bounds (both upper and lower) on the likelihood.

We can also see MAP- or zero-temperature EM and recognition models as parametric forms of variational inference.

Variational methods can also be used to find an approximate posterior on the parameters.
Variational Bayes

So far, we have applied Jensen's bound and factorisations to help with integrals over latent variables. We can do the same for integrals over parameters, in order to bound the log marginal likelihood or evidence:
$$\begin{aligned}
\log P(\mathcal{X}|\mathcal{M}) &= \log \iint d\mathcal{Y}\, d\theta\; P(\mathcal{X},\mathcal{Y}|\theta,\mathcal{M})\, P(\theta|\mathcal{M}) \\
&= \max_{Q} \iint d\mathcal{Y}\, d\theta\; Q(\mathcal{Y},\theta) \log \frac{P(\mathcal{X},\mathcal{Y},\theta|\mathcal{M})}{Q(\mathcal{Y},\theta)} \\
&\ge \max_{Q_\mathcal{Y},\, Q_\theta} \iint d\mathcal{Y}\, d\theta\; Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta) \log \frac{P(\mathcal{X},\mathcal{Y},\theta|\mathcal{M})}{Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta)}.
\end{aligned}$$

The constraint that the distribution $Q$ must factor into the product $Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta)$ leads to the variational Bayesian EM algorithm, or just "Variational Bayes".
Variational Bayesian EM

Coordinate maximisation of the VB free-energy lower bound
$$F(Q_\mathcal{Y}, Q_\theta) = \iint d\mathcal{Y}\, d\theta\; Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta) \log \frac{p(\mathcal{X},\mathcal{Y},\theta|\mathcal{M})}{Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta)}$$
leads to EM-like updates:
$$Q^*_\mathcal{Y}(\mathcal{Y}) \propto \exp \big\langle \log P(\mathcal{Y},\mathcal{X}|\theta) \big\rangle_{Q_\theta(\theta)} \qquad \text{(E-like step)}$$
$$Q^*_\theta(\theta) \propto P(\theta) \exp \big\langle \log P(\mathcal{Y},\mathcal{X}|\theta) \big\rangle_{Q_\mathcal{Y}(\mathcal{Y})} \qquad \text{(M-like step)}$$

Maximising $F$ is equivalent to minimising the KL divergence between the approximate posterior, $Q_\theta(\theta)\, Q_\mathcal{Y}(\mathcal{Y})$, and the true posterior, $P(\theta,\mathcal{Y}|\mathcal{X})$:
$$\log P(\mathcal{X}) - F(Q_\mathcal{Y}, Q_\theta)
= \log P(\mathcal{X}) - \iint d\mathcal{Y}\, d\theta\; Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta) \log \frac{P(\mathcal{X},\mathcal{Y},\theta)}{Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta)}
= \iint d\mathcal{Y}\, d\theta\; Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta) \log \frac{Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta)}{P(\mathcal{Y},\theta|\mathcal{X})}
= \mathrm{KL}(Q \,\|\, P).$$
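A minimal sketch of these coupled updates (not from the slides): VB-EM for a mixture of $K$ unit-variance Gaussians with unknown means, $\mathcal{N}(m_0, 1/\lambda_0)$ priors on each mean, and fixed equal mixing weights, so that both steps are available in closed form. All names are illustrative.

```python
import numpy as np

def vb_gmm_means(x, K=2, m0=0.0, lam0=1e-2, n_iters=100, seed=0):
    """VB-EM for a K-component mixture of unit-variance Gaussians.
    Q(mu_k) = N(m[k], 1/lam[k]);  Q(z_i) given by responsibilities r."""
    rng = np.random.default_rng(seed)
    m = rng.normal(size=K)                 # posterior means of mu_k
    lam = np.ones(K)                       # posterior precisions of mu_k
    for _ in range(n_iters):
        # E-like step: responsibilities from *expected* log-likelihoods,
        # <log N(x; mu_k, 1)> = -0.5 [(x - m_k)^2 + 1/lam_k] + const
        log_r = -0.5 * ((x[:, None] - m) ** 2 + 1.0 / lam)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-like step: conjugate Gaussian update for each mean
        lam = lam0 + r.sum(axis=0)
        m = (lam0 * m0 + r.T @ x) / lam
    return m, lam, r
```

Note the only changes from ordinary EM: the E-like step uses the expected log-likelihood under $Q_\theta$ (the extra $1/\lambda_k$ variance term), and the M-like step returns a distribution over the means rather than a point estimate.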
Conjugate-Exponential models

Let's focus on conjugate-exponential (CE) latent-variable models:

◮ Condition (1). The joint probability over variables is in the exponential family:
$$P(\mathcal{Y},\mathcal{X}|\theta) = f(\mathcal{Y},\mathcal{X})\, g(\theta) \exp\big( \phi(\theta)^\mathsf{T} T(\mathcal{Y},\mathcal{X}) \big)$$
where $\phi(\theta)$ is the vector of natural parameters and $T$ are the sufficient statistics.
◮ Condition (2). The prior over parameters is conjugate to this joint probability:
$$P(\theta|\nu,\tau) = h(\nu,\tau)\, g(\theta)^\nu \exp\big( \phi(\theta)^\mathsf{T} \tau \big)$$
where $\nu$ and $\tau$ are hyperparameters of the prior.

Conjugate priors are computationally convenient and have an intuitive interpretation:
◮ $\nu$: number of pseudo-observations
◮ $\tau$: values of pseudo-observations
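As a worked instance of the two conditions (standard, though not spelled out on the slide), take a Bernoulli observation $x \in \{0,1\}$:
$$P(x|\theta) = \theta^x (1-\theta)^{1-x} = \underbrace{(1-\theta)}_{g(\theta)} \exp\Big( \underbrace{\log\tfrac{\theta}{1-\theta}}_{\phi(\theta)}\; \underbrace{x}_{T(x)} \Big), \qquad f(x) = 1,$$
so the conjugate prior of condition (2) is
$$P(\theta|\nu,\tau) \propto g(\theta)^\nu e^{\phi(\theta)\,\tau} = \theta^\tau (1-\theta)^{\nu-\tau},$$
i.e. a Beta$(\tau+1,\, \nu-\tau+1)$ distribution: $\nu$ pseudo-observations, of which $\tau$ were ones.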
Conjugate-Exponential examples

In the CE family:
◮ Gaussian mixtures
◮ factor analysis, probabilistic PCA
◮ hidden Markov models and factorial HMMs
◮ linear dynamical systems and switching models
◮ discrete-variable belief networks

Other as-yet-undreamt-of models: combinations of Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others.

Not in the CE family:
◮ Boltzmann machines, MRFs (no simple conjugacy)
◮ logistic regression (no simple conjugacy)
◮ sigmoid belief networks (not exponential)
◮ independent components analysis (not exponential)

Note: one can often approximate such models with a suitable choice from the CE family.
Conjugate-exponential VB

Given an iid data set $\mathcal{D} = (x_1, \dots, x_n)$, if the model is CE then:

◮ $Q_\theta(\theta)$ is also conjugate, i.e.
$$\begin{aligned}
Q_\theta(\theta) &\propto P(\theta) \exp\Big\langle \sum_i \log P(y_i, x_i|\theta) \Big\rangle_{Q_\mathcal{Y}} \\
&= h(\nu,\tau)\, g(\theta)^\nu e^{\phi(\theta)^\mathsf{T}\tau} \cdot g(\theta)^n\, e^{\langle \log f(\mathcal{Y},\mathcal{X}) \rangle_{Q_\mathcal{Y}}}\, e^{\phi(\theta)^\mathsf{T} \langle \sum_i T(y_i, x_i) \rangle_{Q_\mathcal{Y}}} \\
&\propto h(\tilde\nu,\tilde\tau)\, g(\theta)^{\tilde\nu}\, e^{\phi(\theta)^\mathsf{T}\tilde\tau}
\end{aligned}$$
with $\tilde\nu = \nu + n$ and $\tilde\tau = \tau + \sum_i \langle T(y_i, x_i) \rangle_{Q_\mathcal{Y}}$ ⇒ we only need to track $\tilde\nu, \tilde\tau$.
◮ $Q_\mathcal{Y}(\mathcal{Y}) = \prod_{i=1}^n Q_{y_i}(y_i)$ takes the same form as in the E-step of regular EM:
$$Q_{y_i}(y_i) \propto \exp \big\langle \log P(y_i, x_i|\theta) \big\rangle_{Q_\theta} \propto f(y_i, x_i)\, e^{\bar\phi(\theta)^\mathsf{T} T(y_i, x_i)} = P\big(y_i | x_i, \bar\phi(\theta)\big)$$
with natural parameters $\bar\phi(\theta) = \langle \phi(\theta) \rangle_{Q_\theta}$ ⇒ inference is unchanged from regular EM.
The Variational Bayesian EM algorithm

EM for MAP estimation:
Goal: maximise $P(\theta|\mathcal{X}, m)$ wrt $\theta$.
E step: compute $Q_\mathcal{Y}(\mathcal{Y}) \leftarrow p(\mathcal{Y}|\mathcal{X}, \theta)$.
M step: $\theta \leftarrow \operatorname*{argmax}_\theta \int d\mathcal{Y}\, Q_\mathcal{Y}(\mathcal{Y}) \log P(\mathcal{Y}, \mathcal{X}, \theta)$.

Variational Bayesian EM:
Goal: maximise bound on $P(\mathcal{X}|m)$ wrt $Q_\theta$.
VB-E step: compute $Q_\mathcal{Y}(\mathcal{Y}) \leftarrow p(\mathcal{Y}|\mathcal{X}, \bar\phi)$.
VB-M step: $Q_\theta(\theta) \leftarrow \exp \int d\mathcal{Y}\, Q_\mathcal{Y}(\mathcal{Y}) \log P(\mathcal{Y}, \mathcal{X}, \theta)$.

Properties:
◮ Reduces to the EM algorithm if $Q_\theta(\theta) = \delta(\theta - \theta^*)$.
◮ $F_m$ increases monotonically, and incorporates the model complexity penalty.
◮ Analytical parameter distributions (but not constrained to be Gaussian).
◮ The VB-E step has the same complexity as the corresponding E step.
◮ We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VB-E step of VB-EM, but using expected natural parameters, $\bar\phi$.
VB and model selection

◮ Variational Bayesian EM yields an approximate posterior $Q_\theta$ over model parameters.
◮ It also yields an optimised lower bound on the log model evidence:
$$\max F_\mathcal{M}(Q_\mathcal{Y}, Q_\theta) \le \log P(\mathcal{D}|\mathcal{M})$$
◮ These lower bounds can be compared amongst models to learn the right (structure, connectivity, ... of the) model.
◮ If a continuous domain of models is specified by a hyperparameter $\eta$, then the VB free energy depends on that parameter:
$$F(Q_\mathcal{Y}, Q_\theta, \eta) = \iint d\mathcal{Y}\, d\theta\; Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta) \log \frac{P(\mathcal{X},\mathcal{Y},\theta|\eta)}{Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta)} \le \log P(\mathcal{X}|\eta)$$
A hyper-M step maximises the current bound wrt $\eta$:
$$\eta \leftarrow \operatorname*{argmax}_\eta \iint d\mathcal{Y}\, d\theta\; Q_\mathcal{Y}(\mathcal{Y})\, Q_\theta(\theta) \log P(\mathcal{X},\mathcal{Y},\theta|\eta)$$
ARD for unsupervised learning

Recall that ARD (automatic relevance determination) was a hyperparameter method to select relevant or useful inputs in regression. A similar idea used with variational Bayesian methods can learn a latent dimensionality.

◮ Consider factor analysis:
$$y \sim \mathcal{N}(0, I) \qquad x \sim \mathcal{N}(\Lambda y, \Psi) \qquad \text{with a column-wise prior } \Lambda_{:i} \sim \mathcal{N}\big(0, \alpha_i^{-1} I\big)$$
◮ The VB free energy is
$$F\big(Q_\mathcal{Y}(\mathcal{Y}), Q_\Lambda(\Lambda), \Psi, \alpha\big) = \big\langle \log P(\mathcal{X},\mathcal{Y}|\Lambda,\Psi) + \log P(\Lambda|\alpha) + \log P(\Psi) \big\rangle_{Q_\mathcal{Y} Q_\Lambda} + \dots$$
and so hyperparameter optimisation requires
$$\alpha \leftarrow \operatorname*{argmax}_\alpha \big\langle \log P(\Lambda|\alpha) \big\rangle_{Q_\Lambda}$$
◮ Now $Q_\Lambda$ is Gaussian, with the same form as in linear regression, but with expected moments of $y$ appearing in place of the inputs.
◮ Optimisation wrt the distributions, $\Psi$ and $\alpha$ in turn causes some $\alpha_i$ to diverge, as in regression ARD.
◮ In this case, these parameters select "relevant" latent dimensions, effectively learning the dimensionality of $y$.
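For concreteness (a standard computation, not spelled out on the slide): with the column-wise prior $\Lambda_{:i} \sim \mathcal{N}(0, \alpha_i^{-1} I)$ over the $D$ entries of each column,
$$\big\langle \log P(\Lambda|\alpha) \big\rangle_{Q_\Lambda} = \sum_i \Big[ \frac{D}{2}\log\alpha_i - \frac{\alpha_i}{2}\, \big\langle \Lambda_{:i}^\mathsf{T}\Lambda_{:i} \big\rangle_{Q_\Lambda} \Big] + \text{const},$$
so the hyper-M step has the closed-form update
$$\alpha_i \leftarrow \frac{D}{\big\langle \Lambda_{:i}^\mathsf{T}\Lambda_{:i} \big\rangle_{Q_\Lambda}}.$$
As the expected squared norm of a column shrinks, $\alpha_i$ grows, which shrinks $Q_\Lambda$ on that column further; this mutual reinforcement is the divergence that prunes the corresponding latent dimension.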