Exact inference and learning for cumulative distribution functions on loopy graphs


  1. Exact inference and learning for cumulative distribution functions on loopy graphs. Jim C. Huang, Nebojsa Jojic and Christopher Meek. NIPS 2010. Presented by Jenny Lam.

  2. Previous work
◮ Cumulative distribution networks and the derivative-sum-product algorithm. Huang and Frey, 2008. UAI.
◮ Cumulative distribution networks: Inference, estimation and applications of graphical models for cumulative distribution functions. Huang, 2009. Ph.D. thesis.
◮ Maximum-likelihood learning of cumulative distribution functions on graphs. Huang and Jojic, 2010. Journal of Machine Learning Research.

  3. Cumulative Distribution Network: definition
A CDN G is a bipartite graph (V, S, E) where
◮ V is the set of variable nodes,
◮ S is the set of function nodes, where each φ ∈ S, φ : R^{|N(φ)|} → [0, 1], is itself a CDF over its neighboring variables N(φ),
◮ E is the set of edges, connecting functions to their variables.
[Figure: an example CDN graph]
The joint CDF of this CDN is
\[ F(\mathbf{x}) = \prod_{\phi \in S} \phi(\mathbf{x}_{N(\phi)}) \]
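To make the definition concrete, here is a minimal numerical sketch of a three-variable chain CDN. The bivariate Gumbel-type φ is an illustrative assumption, not the paper's model:

```python
import numpy as np

# A toy bivariate CDF used as a CDN function node (illustrative assumption;
# any function that is a valid CDF in its arguments would do).
def phi(a, b):
    return np.exp(-np.exp(-a) - np.exp(-b) - np.exp(-a - b))

# Chain CDN x1 - phi - x2 - phi - x3: the joint CDF is the product of the
# function nodes, each evaluated on its neighbouring variables.
def joint_cdf(x1, x2, x3):
    return phi(x1, x2) * phi(x2, x3)

print(joint_cdf(0.0, 1.0, -0.5))    # a value in [0, 1]
print(joint_cdf(50.0, 50.0, 50.0))  # tends to 1 as all arguments grow
```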

  4. CDNs: what are they for?
◮ PDF models must enforce a normalization constraint.
◮ PDFs are usually made tractable by restricting them to, e.g., Gaussians.
◮ Many non-Gaussian distributions are conveniently parametrized as CDFs.
◮ CDNs can be used to model heavy-tailed distributions, which are important in climatology and epidemiology.

  5. Inference from joint CDF
Conditional CDF:
\[ F(x_B \mid x_A) = \frac{\partial_{x_A} F(x_A, x_B)}{\partial_{x_A} F(x_A)} \]
Likelihood:
\[ P(\mathbf{x} \mid \theta) = \partial_{\mathbf{x}} F(\mathbf{x} \mid \theta) \]
For MLE we need the gradient of the log-likelihood:
\[ \nabla_\theta \log P(\mathbf{x} \mid \theta) = \frac{1}{P(\mathbf{x} \mid \theta)} \nabla_\theta P(\mathbf{x} \mid \theta) \]
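These quantities are easy to check symbolically on a toy bivariate CDF. The sketch below uses a Gumbel-Hougaard-type CDF, which is an illustrative assumption rather than anything from the paper:

```python
import sympy as sp

x_a, x_b = sp.symbols('x_a x_b', real=True)
theta = sp.Symbol('theta', positive=True)

# Toy bivariate CDF (Gumbel-Hougaard form); any valid parametric CDF works.
F = sp.exp(-(sp.exp(-theta * x_a) + sp.exp(-theta * x_b)) ** (1 / theta))

# Marginal CDF of x_a: send x_b to +infinity.
F_a = sp.limit(F, x_b, sp.oo)

# Conditional CDF F(x_b | x_a) = d_{x_a} F(x_a, x_b) / d_{x_a} F(x_a).
cond = sp.simplify(sp.diff(F, x_a) / sp.diff(F_a, x_a))

# Likelihood P(x | theta): the mixed derivative of F.
P = sp.diff(F, x_a, x_b)
print(cond)
```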

  6. Mixed derivative of a product
\[ \partial_{\mathbf{x}} [f \cdot g] = \sum_{U \subseteq \mathbf{x}} \partial_{\mathbf{x}_U} f \cdot \partial_{\mathbf{x} \setminus U} g \]
which has 2^{|x|} terms. More generally,
\[ \partial_{\mathbf{x}} \prod_{i=1}^{k} f_i = \sum_{U_1, \ldots, U_k} \prod_{i=1}^{k} \partial_{\mathbf{x}_{U_i}} f_i \]
where we sum over all partitions U_1, ..., U_k of x into k (possibly empty) subsets. There are k^{|x|} terms in this sum.
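The general formula amounts to assigning each variable to exactly one factor and summing over all k^{|x|} assignments. A small SymPy sketch (the helper name is mine) verifies this against direct differentiation:

```python
import sympy as sp
from itertools import product as assignments

x, y, z = sp.symbols('x y z')
variables = [x, y, z]

# Generic smooth factors standing in for CDN functions.
factors = [sp.Function(name)(x, y, z) for name in ('f1', 'f2', 'f3')]

def mixed_derivative_of_product(factors, variables):
    """Sum over all k^{|x|} assignments of variables to factors."""
    total = sp.S.Zero
    for assign in assignments(range(len(factors)), repeat=len(variables)):
        term = sp.S.One
        for i, f in enumerate(factors):
            mine = [v for v, a in zip(variables, assign) if a == i]
            term *= sp.diff(f, *mine) if mine else f
        total += term
    return total

direct = sp.diff(sp.Mul(*factors), *variables)
assert sp.expand(direct - mixed_derivative_of_product(factors, variables)) == 0
```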

  7. Mixed derivative over a separation
Partition the functions of a CDN into M_1 and M_2,
◮ with variable sets C_1 and C_2 and separator S_{1,2} = C_1 ∩ C_2,
◮ and G_1 and G_2 the products of the functions in M_1 and M_2.
Then
\[ \partial_{\mathbf{x}} [G_1 G_2] = \sum_{A \subseteq S_{1,2}} \Big( \partial_{\mathbf{x}_{C_1 \setminus S_{1,2}}} \partial_{\mathbf{x}_A} G_1 \Big) \Big( \partial_{\mathbf{x}_{C_2 \setminus S_{1,2}}} \partial_{\mathbf{x}_{S_{1,2} \setminus A}} G_2 \Big) \]
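On the smallest interesting case, C_1 = {x1, x2}, C_2 = {x2, x3} and S_{1,2} = {x2}, the sum has just two terms; a quick symbolic check:

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')

# G1 covers C1 = {x1, x2}; G2 covers C2 = {x2, x3}; separator S = {x2}.
G1 = sp.Function('G1')(x1, x2)
G2 = sp.Function('G2')(x2, x3)

lhs = sp.diff(G1 * G2, x1, x2, x3)

# A = {} assigns x2's derivative to G2; A = {x2} assigns it to G1.
rhs = (sp.diff(G1, x1) * sp.diff(G2, x2, x3)
       + sp.diff(G1, x1, x2) * sp.diff(G2, x3))

assert sp.expand(lhs - rhs) == 0
```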

  8. Junction Tree: definition
Let G = (V, S, E) be a CDN. A tree T = (C, E_T) is a junction tree for G if
1. C is a cover of V: each C_j ∈ C is a subset of V and ⋃_j C_j = V
2. family preservation holds: for each φ ∈ S, there is a C_j ∈ C such that scope(φ) ⊆ C_j
3. the running intersection property holds: if C_i ∈ C is on the path between C_j and C_k, then C_j ∩ C_k ⊆ C_i
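A small sketch (the data layout and names are mine) of checking the running intersection property on a junction tree stored as an adjacency map:

```python
from itertools import combinations

def tree_path(adj, start, goal):
    """The unique path between two nodes of a tree, found by DFS."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        stack.extend((n, path + [n]) for n in adj[node] if n not in path)
    raise ValueError("nodes are not connected")

def has_running_intersection(clusters, adj):
    """Check C_j ∩ C_k ⊆ C_i for every C_i on the path from C_j to C_k."""
    for j, k in combinations(clusters, 2):
        shared = clusters[j] & clusters[k]
        if any(not shared <= clusters[i] for i in tree_path(adj, j, k)):
            return False
    return True

# Chain junction tree for a 3-variable CDN.
clusters = {'C1': {'x1', 'x2'}, 'C2': {'x2', 'x3'}}
adj = {'C1': ['C2'], 'C2': ['C1']}
assert has_running_intersection(clusters, adj)
```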

  9. Junction Tree: example
[Figure: an example CDN and a junction tree constructed from it]

  10. Construction of the junction tree
In the implementation:
◮ greedily eliminate variables using the minimal fill-in heuristic (sketched below)
◮ construct elimination subsets for the junction-tree nodes using the MATLAB Bayes Net Toolbox (Murphy, 2001)
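The paper's implementation relies on the MATLAB Bayes Net Toolbox; the following is only a minimal Python sketch of the greedy min-fill elimination step, with names of my own choosing:

```python
def min_fill_cliques(adj):
    """Greedy min-fill elimination; returns the elimination cliques, whose
    maximal members can serve as junction-tree clusters."""
    adj = {v: set(ns) for v, ns in adj.items()}

    def fill_in(v):
        # Edges that eliminating v would add among its neighbours.
        nbrs = list(adj[v])
        return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                   if nbrs[j] not in adj[nbrs[i]])

    cliques = []
    while adj:
        v = min(adj, key=fill_in)      # vertex with the fewest fill-in edges
        nbrs = adj[v]
        cliques.append({v} | nbrs)
        for u in nbrs:                 # connect the neighbours pairwise
            adj[u] |= nbrs - {u}
            adj[u].discard(v)
        del adj[v]
    return cliques

# A 4-cycle x1-x2-x3-x4-x1, the simplest loopy variable graph.
cycle = {'x1': {'x2', 'x4'}, 'x2': {'x1', 'x3'},
         'x3': {'x2', 'x4'}, 'x4': {'x1', 'x3'}}
print(min_fill_cliques(cycle))  # e.g. [{'x1','x2','x4'}, {'x2','x3','x4'}, ...]
```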

  11. Decomposition of the joint CDF
Partitioning the functions S into sets M_j, one per cluster, the joint CDF is
\[ F(\mathbf{x}) = \prod_{C_j \in C} \psi_j(\mathbf{x}_{C_j}), \quad \text{where } \psi_j \equiv \prod_{\phi \in M_j} \phi \]
Let r be a chosen root of the junction tree. Then
\[ F(\mathbf{x}) = \psi_r(\mathbf{x}_{C_r}) \prod_{k \in E_r} T_k^r(\mathbf{x}) \]
where
\[ T_k^r(\mathbf{x}) = \prod_{j \in \tau_k^r} \psi_j(\mathbf{x}_{C_j}) \]
and τ_k^r is the subtree rooted at k.

  12. Derivative of the joint CDF
\begin{align*}
\partial_{\mathbf{x}} F(\mathbf{x})
&= \partial_{\mathbf{x}} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in E_r} T_k^r(\mathbf{x}) \Big] \\
&= \partial_{\mathbf{x}_{C_r}} \partial_{\mathbf{x} \setminus \mathbf{x}_{C_r}} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in E_r} T_k^r(\mathbf{x}) \Big] \\
&= \partial_{\mathbf{x}_{C_r}} \Big[ \psi_r(\mathbf{x}_{C_r}) \, \partial_{\mathbf{x} \setminus \mathbf{x}_{C_r}} \prod_{k \in E_r} T_k^r(\mathbf{x}) \Big] \\
&= \partial_{\mathbf{x}_{C_r}} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in E_r} \partial_{\mathbf{x}_{\tau_k^r \setminus C_r}} T_k^r(\mathbf{x}) \Big]
\end{align*}
The last equality follows from the running intersection property: outside of C_r, the variables of the subtrees τ_k^r are disjoint, so the derivative distributes across the product.

  13. Messages to the root of the junction tree
Message from a child k to the root r, for A ⊆ C_r:
\[ m_{k \to r}(A) \equiv \partial_{\mathbf{x}_A} \Big[ \partial_{\mathbf{x}_{\tau_k^r \setminus C_r}} T_k^r(\mathbf{x}) \Big] \]
In particular,
\[ m_{k \to r}(\emptyset) = \partial_{\mathbf{x}_{\tau_k^r \setminus C_r}} T_k^r(\mathbf{x}) \]
At the root, for U_r ⊆ E_r and A ⊆ C_r:
\[ m_r(A, U_r) \equiv \partial_{\mathbf{x}_A} \Big[ \psi_r(\mathbf{x}_{C_r}) \prod_{k \in U_r} m_{k \to r}(\emptyset) \Big] \]

  14. Messages in the rest of the junction tree
\[ m_i(A, U_i) \equiv \partial_{\mathbf{x}_A} \Big[ \psi_i(\mathbf{x}_{C_i}) \prod_{j \in U_i} m_{j \to i}(\emptyset) \Big] \]
where A ⊆ C_i and U_i ⊆ E_i, and
\[ m_{j \to i}(A) \equiv \partial_{\mathbf{x}_A} \Big[ \partial_{\mathbf{x}_{\tau_j^i \setminus S_{i,j}}} T_j^i(\mathbf{x}) \Big] \]
where A ⊆ S_{i,j}.

  15. Messages in the rest of the junction tree
In terms of other messages, for any k ∈ U_i:
\begin{align*}
m_i(A, U_i) &= \partial_{\mathbf{x}_A} \Big[ \psi_i(\mathbf{x}_{C_i}) \, m_{k \to i}(\emptyset) \prod_{j \in U_i \setminus \{k\}} m_{j \to i}(\emptyset) \Big] \\
&= \sum_{B \subseteq A \cap S_{i,k}} m_{k \to i}(B) \, m_i(A \setminus B, \, U_i \setminus \{k\})
\end{align*}
\begin{align*}
m_{j \to i}(A) &= \partial_{\mathbf{x}_{A \cup (C_j \setminus S_{i,j})}} \Big[ \psi_j(\mathbf{x}_{C_j}) \prod_{l \in E_j \setminus \{i\}} T_l^j(\mathbf{x}) \Big] \\
&= m_j\big(A \cup (C_j \setminus S_{i,j}), \, E_j \setminus \{i\}\big)
\end{align*}

  16. Gradient of the likelihood
Likelihood:
\[ P(\mathbf{x} \mid \theta) = \partial_{\mathbf{x}} F(\mathbf{x} \mid \theta) = m_r(C_r, E_r) \]
The gradient ∇_θ m_r(C_r, E_r) decomposes over the junction tree in the same way as m_r(C_r, E_r), using
◮ g_i ≡ ∇_θ m_i
◮ g_{j→i} ≡ ∇_θ m_{j→i}
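For a CDF in closed form, the gradient step is routine once the mixed derivative is available; here is a minimal symbolic sketch, reusing the toy Gumbel-Hougaard CDF from earlier (an assumption, not the paper's model):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
theta = sp.Symbol('theta', positive=True)

# Toy one-parameter joint CDF.
F = sp.exp(-(sp.exp(-theta * x1) + sp.exp(-theta * x2)) ** (1 / theta))

P = sp.diff(F, x1, x2)             # likelihood = mixed derivative of F
score = sp.diff(sp.log(P), theta)  # gradient of the log-likelihood

# Evaluate at a data point, e.g. for one step of gradient ascent.
print(score.subs({x1: 0.5, x2: -0.3, theta: 2.0}).evalf())
```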

  17. JDiff algorithm: outline
For each cluster, from the leaves to the root:
1. compute the derivatives within the cluster
2. compute the messages from its children
3. send the resulting message to its parent
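Below is a minimal end-to-end sketch of this flow on a two-cluster chain, small enough to verify against brute-force differentiation. The Gumbel-type ψ functions and all names are illustrative choices of mine, not the paper's:

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3', real=True)

# Chain CDN with clusters C1 = {x1, x2}, C2 = {x2, x3} (root), separator {x2}.
def g(a, b):
    return sp.exp(-sp.exp(-a) - sp.exp(-b) - sp.exp(-a - b))

psi1, psi2 = g(x1, x2), g(x2, x3)

# Leaf cluster C1: messages m_{1->2}(A) = d_{x_A} d_{x1} psi1 for A ⊆ {x2}.
m = {frozenset(): sp.diff(psi1, x1),
     frozenset({x2}): sp.diff(psi1, x1, x2)}

# Root cluster C2: sum over subsets B of the separator (slide 15's recursion).
P = (m[frozenset({x2})] * sp.diff(psi2, x3)
     + m[frozenset()] * sp.diff(psi2, x2, x3))

# Sanity check against brute-force differentiation of the full product.
assert sp.expand(P - sp.diff(psi1 * psi2, x1, x2, x3)) == 0
```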

  18. Complexity of JDiff
O-notation for the number of steps/terms in each inner loop, for fixed j:
1. \( \sum_{k=1}^{|C_j|} \binom{|C_j|}{k} |M_j|^k \le (|M_j| + 1)^{|C_j|} \)
2. \( (|E_j| - 1) \max_{k \in E_j} \sum_{l=0}^{|S_{j,k}|} \binom{|S_{j,k}|}{l} \, 2^l \, 2^{|C_j \setminus S_{j,k}|} \)
3. \( 2^{|S_{j,k}|} \)
Total, which is exponential in the treewidth of the graph:
\[ O\Big( \max_j \, (|M_j| + 1)^{|C_j|} + \max_{(j,k) \in E} \, (|E_j| - 1) \, 2^{|C_j \setminus S_{j,k}|} \, 3^{|S_{j,k}|} \Big) \]

  19. Application: symbolic differentiation on graphs
Computation of ∂_x F(x) on CDNs:
◮ Grids: 3×3 to 9×9
◮ Cycles: 10 to 20 nodes
[Figure: running times of JDiff on the grid and cycle CDNs]

  20. Application: modeling heavy-tailed data
◮ Rainfall: 61 daily measurements of rainfall at 22 sites in China
◮ H1N1: 29 weekly mortality rates in 11 cities in the northeastern US during the 2008-2009 epidemic

  21. Application: modeling heavy-tailed data
Average test log-likelihoods from leave-one-out cross-validation.
[Figure: average test log-likelihoods on the rainfall data and on H1N1 mortality]

  22. Future work
◮ Develop compact models (of bounded treewidth) for applications in other areas, such as seismology
◮ Study the connection between CDNs and other copula-based algorithms
◮ Develop faster approximate algorithms
