Exact inference and learning for cumulative distribution functions on loopy graphs
Jim C. Huang, Nebojsa Jojic and Christopher Meek
NIPS 2010
Presented by Jenny Lam
Previous work
◮ Cumulative distribution networks and the derivative-sum-product algorithm. Huang and Frey, UAI 2008.
◮ Cumulative distribution networks: Inference, estimation and applications of graphical models for cumulative distribution functions. Huang, Ph.D. thesis, 2009.
◮ Maximum-likelihood learning of cumulative distribution functions on graphs. Huang and Jojic, Journal of Machine Learning Research, 2010.
Cumulative Distribution Network: definition
A CDN $G$ is a bipartite graph $(V, S, E)$ where
◮ $V$ is the set of variable nodes,
◮ $S$ is the set of function nodes, each $\phi \in S$ being a CDF $\phi : \mathbb{R}^{|N(\phi)|} \to [0, 1]$,
◮ $E$ is the set of edges, connecting functions to their variables.
[Figure: example CDN with variable nodes and function nodes]
The joint CDF of this CDN is $F(\mathbf{x}) = \prod_{\phi \in S} \phi(\mathbf{x}_{N(\phi)})$.
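A minimal sympy sketch of this construction, assuming a chain-structured CDN with bivariate Gumbel-copula function nodes (an illustrative choice, not necessarily the paper's parameterization):

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
theta = sp.Symbol('theta', positive=True)  # theta >= 1 for a valid copula

# One possible function node: a bivariate Gumbel copula over Gumbel
# marginals, which is itself a CDF mapping R^2 -> [0, 1].
def phi(u, v):
    return sp.exp(-(sp.exp(-theta * u) + sp.exp(-theta * v)) ** (1 / theta))

# Chain CDN  x1 -- phi_a -- x2 -- phi_b -- x3:
# the joint CDF is simply the product of the function nodes.
F = phi(x1, x2) * phi(x2, x3)
```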
CDNs: what are they for?
◮ PDF models must enforce a normalization constraint.
◮ PDFs are made more tractable by restricting to, e.g., Gaussians.
◮ Many non-Gaussian distributions are conveniently parametrized as CDFs.
◮ CDNs can be used to model heavy-tailed distributions, which are important in climatology and epidemiology.
Inference from joint CDF
Conditional CDF:
$$F(\mathbf{x}_B \mid \mathbf{x}_A) = \frac{\partial_{\mathbf{x}_A} F(\mathbf{x}_A, \mathbf{x}_B)}{\partial_{\mathbf{x}_A} F(\mathbf{x}_A)}$$
Likelihood:
$$P(\mathbf{x} \mid \theta) = \partial_{\mathbf{x}} F(\mathbf{x} \mid \theta)$$
For MLE, we need the gradient of the log-likelihood:
$$\nabla_\theta \log P(\mathbf{x} \mid \theta) = \frac{1}{P(\mathbf{x} \mid \theta)} \, \nabla_\theta P(\mathbf{x} \mid \theta)$$
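Both operations are just differentiation of the joint CDF; a toy sympy sketch (the independent-logistic joint CDF is an illustrative assumption):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# Toy joint CDF: independent logistic marginals (illustrative only).
F = 1 / (1 + sp.exp(-x1)) * 1 / (1 + sp.exp(-x2))

# Likelihood: the mixed derivative of F w.r.t. all variables.
p = sp.diff(F, x1, x2)

# Conditional CDF F(x2 | x1): differentiate w.r.t. the conditioning
# variable, then normalize by the derivative of the marginal CDF.
F_x1 = sp.limit(F, x2, sp.oo)                            # marginal F(x1)
F_cond = sp.simplify(sp.diff(F, x1) / sp.diff(F_x1, x1))
# For this independent toy model F_cond recovers the marginal CDF of x2.
```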
Mixed derivative of a product
$$\partial_{\mathbf{x}}[f \cdot g] = \sum_{U \subseteq \mathbf{x}} \partial_{\mathbf{x}_U} f \cdot \partial_{\mathbf{x}_{\mathbf{x} \setminus U}} g,$$
which has $2^{|\mathbf{x}|}$ terms. More generally,
$$\partial_{\mathbf{x}} \prod_{i=1}^k f_i = \sum_{U_1, \ldots, U_k} \prod_{i=1}^k \partial_{\mathbf{x}_{U_i}} f_i,$$
where we sum over all partitions $U_1, \ldots, U_k$ of $\mathbf{x}$ into $k$ (possibly empty) subsets. There are $k^{|\mathbf{x}|}$ terms in this sum.
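The $2^{|\mathbf{x}|}$-term expansion is easy to spot-check symbolically; a minimal sympy sketch (the particular $f$ and $g$ are arbitrary smooth choices):

```python
import sympy as sp
from itertools import combinations

x = list(sp.symbols('x1 x2 x3'))
f = sp.exp(-sp.exp(-x[0] - x[1]))        # any smooth functions will do
g = 1 / (1 + sp.exp(-(x[1] + x[2])))

def dmix(expr, vs):
    # mixed partial derivative w.r.t. the variables in vs (identity if empty)
    return sp.diff(expr, *vs) if vs else expr

# Right-hand side: sum over all 2^{|x|} subsets U of x.
rhs = sum(dmix(f, U) * dmix(g, [v for v in x if v not in U])
          for r in range(len(x) + 1) for U in combinations(x, r))

lhs = sp.diff(f * g, *x)                 # mixed derivative of the product
assert sp.simplify(lhs - rhs) == 0
```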
Mixed derivative over a separation
Partition the functions of a CDN into $M_1$ and $M_2$
◮ with variable sets $C_1$ and $C_2$, and separator $S_{1,2} = C_1 \cap C_2$,
◮ and $G_1$ and $G_2$ the products of the functions in $M_1$ and $M_2$.
Then
$$\partial_{\mathbf{x}}[G_1 G_2] = \partial_{\mathbf{x}_{C_1 \setminus S_{1,2}}} \Big[ \sum_{A \subseteq S_{1,2}} \partial_{\mathbf{x}_A} G_1 \cdot \partial_{\mathbf{x}_{C_2 \setminus S_{1,2}}} \partial_{\mathbf{x}_{S_{1,2} \setminus A}} G_2 \Big]$$
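As a worked instance of the identity: with $C_1 = \{x_1, x_2\}$, $C_2 = \{x_2, x_3\}$ and separator $S_{1,2} = \{x_2\}$,
$$\partial_{x_1 x_2 x_3}[G_1 G_2] = \partial_{x_1}\big[\, G_1 \cdot \partial_{x_2}\partial_{x_3} G_2 \;+\; \partial_{x_2} G_1 \cdot \partial_{x_3} G_2 \,\big],$$
so the sum has only $2^{|S_{1,2}|} = 2$ terms rather than the $2^3 = 8$ terms of the naive product expansion.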
Junction Tree: definition
Let $G = (V, S, E)$ be a CDN. A tree $T = (\mathcal{C}, \mathcal{E})$ is a junction tree for $G$ if
1. $\mathcal{C}$ is a cover for $V$: each $C_j \in \mathcal{C}$ is a subset of $V$ and $\bigcup_j C_j = V$,
2. family preservation holds: for each $\phi \in S$, there is a $C_j \in \mathcal{C}$ such that $\mathrm{scope}(\phi) \subseteq C_j$,
3. the running intersection property holds: if $C_i \in \mathcal{C}$ is on the path between $C_j$ and $C_k$, then $C_j \cap C_k \subseteq C_i$.
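These conditions are mechanical to verify; a minimal Python sketch checking the running intersection property (`tree_path` and `running_intersection_holds` are hypothetical helpers, not from the paper):

```python
from itertools import combinations

def tree_path(adj, i, j):
    """Return the unique path from node i to node j in a tree (DFS)."""
    stack = [(i, [i])]
    while stack:
        node, path = stack.pop()
        if node == j:
            return path
        stack.extend((nb, path + [nb]) for nb in adj[node] if nb not in path)
    raise ValueError("nodes are not connected")

def running_intersection_holds(clusters, adj):
    """clusters: {node: set of variables}; adj: {node: list of neighbours}."""
    for j, k in combinations(clusters, 2):
        sep = clusters[j] & clusters[k]
        if not all(sep <= clusters[i] for i in tree_path(adj, j, k)):
            return False
    return True

# Toy chain junction tree: {x1,x2} - {x2,x3} - {x3,x4}
clusters = {0: {'x1', 'x2'}, 1: {'x2', 'x3'}, 2: {'x3', 'x4'}}
adj = {0: [1], 1: [0, 2], 2: [1]}
assert running_intersection_holds(clusters, adj)
```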
Junction Tree: example
[Figure (b): junction tree constructed for the example CDN]
Construction of the junction tree
In the implementation, the authors
◮ greedily eliminate variables with the min-fill heuristic,
◮ construct the elimination subsets for the junction-tree nodes using the MATLAB Bayes Net Toolbox (Murphy, 2001).
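For intuition, a minimal pure-Python sketch of the greedy min-fill heuristic (the paper uses the MATLAB Bayes Net Toolbox; `min_fill_order` is an illustrative stand-in, operating on the graph in which two variables are adjacent iff they share a function node):

```python
from itertools import combinations

def min_fill_order(neighbors):
    """Greedy min-fill elimination.
    neighbors: {variable: set of adjacent variables}.
    Returns the elimination order and the elimination cliques,
    which become the clusters of the junction tree."""
    nb = {v: set(s) for v, s in neighbors.items()}
    order, cliques = [], []
    while nb:
        # fill-in cost of v: edges to add among its remaining neighbours
        def fill(v):
            return sum(1 for a, b in combinations(nb[v], 2) if b not in nb[a])
        v = min(nb, key=fill)
        cliques.append(nb[v] | {v})
        for a, b in combinations(nb[v], 2):   # connect v's neighbours
            nb[a].add(b)
            nb[b].add(a)
        for u in nb[v]:                       # remove v from the graph
            nb[u].discard(v)
        del nb[v]
        order.append(v)
    return order, cliques

# 4-cycle CDN: functions over {x1,x2}, {x2,x3}, {x3,x4}, {x4,x1}
g = {'x1': {'x2', 'x4'}, 'x2': {'x1', 'x3'},
     'x3': {'x2', 'x4'}, 'x4': {'x3', 'x1'}}
order, cliques = min_fill_order(g)
```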
Decomposition of the joint CDF
Partitioning the functions of $S$ into sets $M_j$, one per cluster, the joint CDF is
$$F(\mathbf{x}) = \prod_{C_j \in \mathcal{C}} \psi_j(\mathbf{x}_{C_j}), \qquad \text{where } \psi_j \equiv \prod_{\phi \in M_j} \phi.$$
Let $r$ be a chosen root of the junction tree. Then
$$F(\mathbf{x}) = \psi_r(\mathbf{x}_{C_r}) \prod_{k \in \mathcal{E}_r} T^r_k(\mathbf{x}), \qquad \text{where } T^r_k(\mathbf{x}) = \prod_{j \in \tau^r_k} \psi_j(\mathbf{x}_{C_j})$$
and $\tau^r_k$ is the subtree rooted at $k$.
Derivative of the joint CDF
$$\begin{aligned}
\partial_{\mathbf{x}} F(\mathbf{x}) &= \partial_{\mathbf{x}} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in \mathcal{E}_r} T^r_k(\mathbf{x}) \Big] \\
&= \partial_{\mathbf{x}_{C_r}} \partial_{\mathbf{x} \setminus C_r} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in \mathcal{E}_r} T^r_k(\mathbf{x}) \Big] \\
&= \partial_{\mathbf{x}_{C_r}} \Big[ \psi_r(\mathbf{x}_{C_r}) \, \partial_{\mathbf{x} \setminus C_r} \prod_{k \in \mathcal{E}_r} T^r_k(\mathbf{x}) \Big] \\
&= \partial_{\mathbf{x}_{C_r}} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in \mathcal{E}_r} \partial_{\mathbf{x}_{\tau^r_k \setminus C_r}} T^r_k(\mathbf{x}) \Big]
\end{aligned}$$
The last equality follows from the running intersection property: each variable outside $C_r$ appears in exactly one subtree $\tau^r_k$.
Messages to the root of the junction tree
Message from child $k$ to root $r$, for $A \subseteq C_r$:
$$m_{k \to r}(A) \equiv \partial_{\mathbf{x}_A} \partial_{\mathbf{x}_{\tau^r_k \setminus C_r}} T^r_k(\mathbf{x})$$
In particular,
$$m_{k \to r}(\emptyset) = \partial_{\mathbf{x}_{\tau^r_k \setminus C_r}} T^r_k(\mathbf{x})$$
At the root, for $U_r \subseteq \mathcal{E}_r$ and $A \subseteq C_r$:
$$m_r(A, U_r) \equiv \partial_{\mathbf{x}_A} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in U_r} m_{k \to r}(\emptyset) \Big]$$
Messages in the rest of the junction tree
$$m_i(A, U_i) \equiv \partial_{\mathbf{x}_A} \Big[ \psi_i(\mathbf{x}_{C_i}) \prod_{j \in U_i} m_{j \to i}(\emptyset) \Big],$$
where $A \subseteq C_i$ and $U_i \subseteq \mathcal{E}_i$, and
$$m_{j \to i}(A) \equiv \partial_{\mathbf{x}_A} \partial_{\mathbf{x}_{\tau^i_j \setminus S_{i,j}}} T^i_j(\mathbf{x}),$$
where $A \subseteq S_{i,j}$.
Messages in the rest of the junction tree
In terms of messages, for any $k \in U_i$:
$$m_i(A, U_i) = \partial_{\mathbf{x}_A} \Big[ \psi_i(\mathbf{x}_{C_i}) \, m_{k \to i}(\emptyset) \prod_{j \in U_i \setminus \{k\}} m_{j \to i}(\emptyset) \Big] = \sum_{B \subseteq A \cap S_{i,k}} m_{k \to i}(B) \; m_i(A \setminus B, \, U_i \setminus \{k\})$$
$$m_{j \to i}(A) = \partial_{\mathbf{x}_{A \cup (C_j \setminus S_{i,j})}} \Big[ \psi_j(\mathbf{x}_{C_j}) \prod_{l \in \mathcal{E}_j \setminus \{i\}} T^j_l(\mathbf{x}) \Big] = m_j\big(A \cup (C_j \setminus S_{i,j}), \; \mathcal{E}_j \setminus \{i\}\big)$$
Gradient of the likelihood
Likelihood:
$$P(\mathbf{x} \mid \theta) = \partial_{\mathbf{x}} F(\mathbf{x} \mid \theta) = m_r(C_r, \mathcal{E}_r)$$
The gradient $\nabla_\theta m_r(C_r, \mathcal{E}_r)$ decomposes over the junction tree in the same way as $m_r(C_r, \mathcal{E}_r)$, with
◮ $g_i \equiv \nabla_\theta m_i$
◮ $g_{j \to i} \equiv \nabla_\theta m_{j \to i}$
JDiff algorithm: outline
For each cluster, from the leaves to the root:
1. compute derivatives within the cluster
2. combine messages from the children
3. send messages to the parent
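A minimal runnable sketch of this recursion for the special case of a chain-shaped junction tree $C_1 - C_2 - \cdots - C_n$, using sympy for the per-cluster derivatives (general trees combine messages from several children as on the previous slides; `jdiff_chain`, `dmix` and `subsets` are hypothetical names, and consecutive separators are assumed disjoint):

```python
import sympy as sp
from itertools import combinations

def subsets(s):
    s = sorted(s, key=str)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def dmix(expr, vs):
    # mixed partial derivative w.r.t. the variables in vs (identity if empty)
    return sp.diff(expr, *sorted(vs, key=str)) if vs else expr

def jdiff_chain(psis, clusters, seps):
    """JDiff on a chain junction tree.
    psis[i]     : sympy expression for the cluster potential psi_i
    clusters[i] : set of variables C_i
    seps[i]     : separator C_i ∩ C_{i+1} (consecutive separators disjoint)
    Returns P(x) = ∂_x F(x) for F = prod(psis)."""
    # leaf message: m(A) = ∂_{x_A} ∂_{C_1 \ S_12} psi_1 for each A ⊆ S_12
    msg = {A: dmix(psis[0], (clusters[0] - seps[0]) | A)
           for A in subsets(seps[0])}
    # interior clusters: absorb the incoming message, emit a new one
    for j in range(1, len(psis) - 1):
        msg = {A: sum(msg[B] * dmix(psis[j], ((clusters[j] - seps[j]) - B) | A)
                      for B in subsets(seps[j - 1]))
               for A in subsets(seps[j])}
    # root: P(x) = sum_B m(B) * ∂_{C_n \ B} psi_n
    n = len(psis) - 1
    return sum(msg[B] * dmix(psis[n], clusters[n] - B)
               for B in subsets(seps[n - 1]))

# Sanity check against brute-force differentiation on a 2-cluster chain.
x1, x2, x3 = sp.symbols('x1 x2 x3')
p1 = sp.exp(-sp.exp(-x1) - sp.exp(-x2))
p2 = sp.exp(-sp.exp(-x2) - sp.exp(-x3))
P = jdiff_chain([p1, p2], [{x1, x2}, {x2, x3}], [{x2}])
assert sp.simplify(P - sp.diff(p1 * p2, x1, x2, x3)) == 0
```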
Complexity of JDiff
O-notation for the number of steps/terms in each inner loop, for fixed $j$:
1. $\displaystyle\sum_{k=1}^{|C_j|} \binom{|C_j|}{k} |M_j|^k = (|M_j| + 1)^{|C_j|} - 1$
2. $\displaystyle(|\mathcal{E}_j| - 1) \max_{k \in \mathcal{E}_j} 2^{|C_j \setminus S_{j,k}|} \sum_{l=0}^{|S_{j,k}|} \binom{|S_{j,k}|}{l} 2^l$
3. $2^{|S_{j,k}|}$
Total: exponential in the treewidth of the graph,
$$O\Big( \max_j \, (|M_j| + 1)^{|C_j|} + \max_{(j,k) \in \mathcal{E}} \, (|\mathcal{E}_j| - 1) \, 2^{|C_j \setminus S_{j,k}|} \, 3^{|S_{j,k}|} \Big)$$
Application: symbolic differentiation on graphs
Computation of $\partial_{\mathbf{x}} F(\mathbf{x})$ on CDNs:
◮ Grids: 3×3 to 9×9
◮ Cycles: 10 to 20 nodes
[Figure: runtime comparison for computing the mixed derivative]
Application: modeling heavy-tailed data
◮ Rainfall: 61 daily measurements of rainfall at 22 sites in China
◮ H1N1: 29 weekly mortality rates in 11 cities in the Northeastern US during the 2008-2009 epidemic
[Figure panels (b)-(d)]
Application: modeling heavy-tailed data
Average test log-likelihoods under leave-one-out cross-validation
[Figure: average test log-likelihoods for the rainfall data and the H1N1 mortality data]
Future work
◮ Develop compact (bounded-treewidth) models for applications in other areas (e.g., seismology)
◮ Study the connection between CDNs and other copula-based algorithms
◮ Develop faster approximate algorithms