Variational Autoencoders

p_θ(x) = ∫ p_θ(x | z) p(z) dz    — an explicit likelihood model!

log p_θ(x) ≥ E_{z ∼ q_ϕ(z|x)} [ log p_θ(x | z) ] − KL( q_ϕ(z|x) || p(z) )

… but computing log p_θ(x) is intractable: an infinite and uncountable mixture ⇒ no tractable EVI
⇒ we need to optimize the ELBO… which is “broken” [Alemi et al. 2017; Dai et al. 2019]

Rezende et al., “Stochastic backprop. and approximate inference in deep generative models”, 2014
Kingma et al., “Auto-Encoding Variational Bayes”, 2014
Probabilistic Graphical Models (PGMs)

Declarative semantics: a clean separation of modeling assumptions from inference

[figure: a graph over X1, …, X5]
Nodes: random variables
Edges: dependencies

Inference:
  conditioning [Darwiche 2001; Sang et al. 2005]
  elimination [Zhang et al. 1994; Dechter 1998]
  message passing [Yedidia et al. 2001; Dechter et al. 2002; Choi et al. 2010; Sontag et al. 2011]
PGMs: MNs and BNs

Markov Networks (MNs)
  p(X) = (1/Z) ∏_c φ_c(X_c),   Z = ∫ ∏_c φ_c(X_c) dX
  ⇒ EVI queries are intractable!

Bayesian Networks (BNs)
  p(X) = ∏_i p(X_i | pa(X_i))
  ⇒ EVI queries are tractable!
Marginal queries (MAR)

q1: What is the probability that today is a Monday at 12.00 and there is a traffic jam only on Herzl Str.?

q1(m) = p_m(Day = Mon, Jam_Herzl = 1)

General: p_m(e) = ∫ p_m(e, H) dH   where E ⊂ X and H = X \ E
Conditional queries (CON)

q4: What is the probability that there is a traffic jam on Herzl Str. given that today is a Monday?

q4(m) = p_m(Jam_Herzl = 1 | Day = Mon)

If you can answer MAR queries, then you can also answer conditional queries (CON):

p_m(Q | E) = p_m(Q, E) / p_m(E)
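To make the reduction concrete, here is a tiny sketch (the numbers are made up, not estimates from any real traffic model) of answering a CON query with two MAR queries:

```python
# Hypothetical outputs of two MAR queries on a traffic model m.
p_joint = 0.06      # p_m(Jam_Herzl = 1, Day = Mon)
p_evidence = 0.14   # p_m(Day = Mon)

# CON is just their ratio: p_m(Jam_Herzl = 1 | Day = Mon)
print(p_joint / p_evidence)   # ~0.43
```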
Complexity of MAR on PGMs

Exact complexity: computing MAR and CON is #P-complete [Cooper 1990; Roth 1996].

Approximation complexity: computing MAR and CON approximately within a relative error of 2^(n^(1−ϵ)) for any fixed ϵ is NP-hard [Dagum et al. 1993; Roth 1996].

Treewidth: informally, how tree-like is the graphical model m? Formally, the minimum width of any tree-decomposition of m.

Fixed-parameter tractable: MAR and CON on a graphical model m with treewidth w take time O(|X| · 2^w), which is linear for fixed width w [Dechter 1998; Koller et al. 2009].

⇒ what about bounding the treewidth by design?
Low-treewidth PGMs

[figure: example graphs over X1, …, X5]
Trees [Meilă et al. 2000]    Polytrees [Dasgupta 1999]    Thin junction trees [Bach et al. 2001]

If treewidth is bounded (e.g. ≊ 20), exact MAR and CON inference is possible in practice
Low-treewidth PGMs: trees

A tree-structured BN [Meilă et al. 2000] where each X_i ∈ X has at most one parent Pa_{X_i}.

p(X) = ∏_{i=1}^n p(x_i | pa_{x_i})

Exact querying: EVI, MAR, CON tasks are linear for trees: O(|X|)

Exact learning from d examples takes O(|X|² · d) with the classical Chow-Liu algorithm¹

¹ Chow et al., “Approximating discrete probability distributions with dependence trees”, 1968
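As a rough illustration of the Chow-Liu step (a sketch under simplifying assumptions, not the original algorithm's implementation), the following estimates pairwise mutual information from binary data and extracts a maximum-weight spanning tree; the data and function names are made up for this example.

```python
import numpy as np
from itertools import combinations

def chow_liu_tree(data, eps=1e-9):
    """Return tree edges (i, j) maximizing total pairwise mutual information.

    data: (d, n) array of {0, 1} samples over n binary variables.
    """
    d, n = data.shape

    def mutual_info(i, j):
        # Empirical pairwise joint, smoothed to avoid log(0).
        pij = np.zeros((2, 2))
        for a in (0, 1):
            for b in (0, 1):
                pij[a, b] = np.mean((data[:, i] == a) & (data[:, j] == b)) + eps
        pij /= pij.sum()
        pi, pj = pij.sum(axis=1), pij.sum(axis=0)
        return float((pij * np.log(pij / np.outer(pi, pj))).sum())

    # Kruskal-style maximum spanning tree over mutual-information weights.
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for i, j in sorted(combinations(range(n), 2), key=lambda e: -mutual_info(*e)):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, size=1000)
    x1 = np.where(rng.random(1000) < 0.9, x0, 1 - x0)    # strongly correlated with x0
    x2 = rng.integers(0, 2, size=1000)                    # independent noise
    print(chow_liu_tree(np.stack([x0, x1, x2], axis=1)))  # expect edge (0, 1) in the tree
```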
What do we lose?

Expressiveness: the ability to compactly represent rich and complex classes of distributions

[figure: a densely connected graph over X1, …, X5]

Bounded-treewidth PGMs lose the ability to represent all possible distributions…

Cohen et al., “On the expressive power of deep learning: A tensor analysis”, 2016
Martens et al., “On the Expressive Efficiency of Sum Product Networks”, 2014
Mixtures

Mixtures as a convex combination of k (simpler) probabilistic models:

p(X) = w1 · p1(X) + w2 · p2(X)
     = p(Z = 1) · p1(X | Z = 1) + p(Z = 2) · p2(X | Z = 2)

EVI, MAR, CON queries scale linearly in k

Mixtures are marginalizing a categorical latent variable Z with k values ⇒ increased expressiveness
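A minimal numerical sketch of this view, with made-up parameters for a two-component Gaussian mixture over X1; the density is computed by explicitly summing out the latent Z:

```python
import numpy as np

# Hypothetical parameters of a 2-component Gaussian mixture over X1.
weights = np.array([0.3, 0.7])                   # p(Z = k)
means, stds = np.array([-3.0, 2.0]), np.array([1.0, 2.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def p_x(x):
    """EVI: p(x) = sum_k p(Z = k) * p(x | Z = k), i.e. marginalizing the latent Z."""
    return float(np.sum(weights * gauss_pdf(x, means, stds)))

print(p_x(0.0))   # density at a single point
# MAR over X1: each normalized component integrates to 1, so the total mass is sum(weights) = 1.
```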
Expressiveness and efficiency

Expressiveness: the ability to compactly represent rich and effective classes of functions
⇒ a mixture of Gaussians can approximate any distribution!

Expressive efficiency (succinctness) compares model sizes in terms of their ability to compactly represent functions
⇒ but how many components do they need?

Cohen et al., “On the expressive power of deep learning: A tensor analysis”, 2016
Martens et al., “On the Expressive Efficiency of Sum Product Networks”, 2014
Mixture models

Expressive efficiency ⇒ deeper mixtures would be more efficient than shallow ones
Maximum A Posteriori (MAP), aka Most Probable Explanation (MPE)

q5: Which combination of roads is most likely to be jammed on Monday at 9am?

q5(m) = argmax_j p_m(j1, j2, … | Day = Mon, Time = 9)

General: argmax_q p_m(q | e)   where Q ∪ E = X

…intractable for latent variable models!

max_q p_m(q | e) = max_q ∑_z p_m(q, z | e) ≠ ∑_z max_q p_m(q, z | e)
Marginal MAP (MMAP), aka Bayesian Network MAP

q6: Which combination of roads is most likely to be jammed at 9am?

q6(m) = argmax_j p_m(j1, j2, … | Time = 9)

General: argmax_q p_m(q | e) = argmax_q ∑_h p_m(q, h | e)   where Q ∪ H ∪ E = X

⇒ NP^PP-complete [Park et al. 2006]
⇒ NP-hard for trees [Campos 2011]
⇒ NP-hard even for Naive Bayes [ibid.]
Advanced queries

q2: Which day is most likely to have a traffic jam on my route to work?

q2(m) = argmax_d p_m(Day = d ∧ ∨_{i ∈ route} Jam_{Str_i})
⇒ marginals + MAP + logical events

q7: What is the probability of seeing more traffic jams in Jaffa than Marina?
⇒ counts + group comparison

and more:
  expected classification agreement [Oztok et al. 2016; Choi et al. 2017, 2018]
  expected predictions [Khosravi et al. 2019a]

Bekker et al., “Tractable Learning for Complex Probability Queries”, 2015
Fully factorized models

A completely disconnected graph. Example: Product of Bernoullis (PoBs)

p(X) = ∏_{i=1}^n p(x_i)

⇒ Complete evidence, marginal, MAP and MMAP inference is linear!
but definitely not expressive…
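A small sketch of a Product of Bernoullis with illustrative parameters, where EVI, MAR and MAP each touch every variable once:

```python
import numpy as np

theta = np.array([0.9, 0.2, 0.5, 0.7, 0.1])   # hypothetical p(X_i = 1) for 5 variables

def evi(x):
    """Complete-evidence likelihood: product of independent Bernoulli factors."""
    return float(np.prod(np.where(x == 1, theta, 1 - theta)))

def mar(evidence):
    """Marginal of partial evidence {i: value}; unobserved variables integrate to 1."""
    return float(np.prod([theta[i] if v == 1 else 1 - theta[i] for i, v in evidence.items()]))

def map_state():
    """MAP assignment: maximize each factor independently."""
    return (theta >= 0.5).astype(int)

print(evi(np.array([1, 0, 1, 1, 0])), mar({0: 1, 3: 0}), map_state())
```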
[figure: models arranged by expressive efficiency (one axis) vs. tractability of queries (the other); NADEs, BNs, NFs, MNs, VAEs and GANs sit low on tractability, while fully factorized models, LTMs, trees, NB, mixtures, polytrees and TJTs sit low on expressiveness]

Expressive models are not very tractable…
and tractable ones are not very expressive…
⇒ probabilistic circuits are at the “sweet spot”
Probabilistic Circuits
Stay Tuned For…

Next:
1. What are the building blocks of tractable models?
   ⇒ a computational graph forming a probabilistic circuit
2. For which queries are probabilistic circuits tractable?
   ⇒ tractable classes induced by structural properties

After: How are probabilistic circuits related to the alphabet soup of models?
Base Case: Univariate Distributions

[figure: a univariate density p_X(x), e.g. evaluating to .74 and .33 at two inputs]

Generally, univariate distributions are tractable for:
  EVI: output p(X_i) (density or mass)
  MAR: output 1 (normalized) or Z (unnormalized)
  MAP: output the mode

⇒ often 100% probability for one value of a categorical random variable,
   for example X or ¬X for a Boolean random variable
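As a sketch of what a leaf unit might look like in code (the interface is illustrative, not tied to any particular circuit library):

```python
class BernoulliLeaf:
    """Univariate leaf over one Boolean variable: tractable EVI, MAR and MAP."""

    def __init__(self, var, p):
        self.var, self.p = var, p          # p = P(var = 1)

    def evi(self, assignment):
        x = assignment[self.var]
        return self.p if x == 1 else 1.0 - self.p

    def mar(self):
        return 1.0                          # a normalized leaf integrates to 1

    def map(self):
        return int(self.p >= 0.5), max(self.p, 1.0 - self.p)   # mode and its probability

leaf = BernoulliLeaf("X1", 0.74)
print(leaf.evi({"X1": 1}), leaf.mar(), leaf.map())
```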
Factorizations are products

Divide and conquer complexity:

p(x1, x2, x3) = p(x1) · p(x2) · p(x3)

[figure: a product node over univariate leaves for X1, X2, X3; e.g. leaf outputs 0.8, 0.5 and 0.9 multiply to .36]

⇒ e.g. modeling a multivariate Gaussian with a diagonal covariance matrix
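A tiny sketch of the product-node computation, first reproducing the slide's numbers and then the diagonal-covariance Gaussian case with made-up parameters:

```python
import numpy as np

# The slide's worked example: leaf outputs 0.8, 0.5 and 0.9 multiply to 0.36.
print(float(np.prod([0.8, 0.5, 0.9])))

# Same idea for a Gaussian with diagonal covariance: the joint density is a
# product of univariate densities (parameters below are illustrative).
mu, sigma = np.array([0.0, 1.0, -2.0]), np.array([1.0, 0.5, 2.0])

def diag_gauss_pdf(x):
    comps = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.prod(comps))

print(diag_gauss_pdf(np.array([0.1, 0.9, -1.5])))
```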
Mixtures are sums

Mixture models, too, can be treated as a simple computational unit over distributions:

p(x) = 0.2 · p1(x) + 0.8 · p2(x)

[figure: a sum node with weights 0.2 and 0.8 over two leaves for X1; e.g. leaf outputs 0.2 and 0.5 combine to .44]

With mixtures, we increase expressiveness
⇒ by stacking them we increase expressive efficiency
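And the corresponding sketch of the sum-node computation with the slide's numbers:

```python
weights = [0.2, 0.8]    # sum-node weights from the slide
children = [0.2, 0.5]   # outputs of the two leaf distributions at this input

# A smooth sum node outputs the weighted sum of its children's outputs.
print(sum(w * c for w, c in zip(weights, children)))   # 0.44
```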
A grammar for tractable models

Recursive semantics of probabilistic circuits:

[figure: a circuit grown recursively — a univariate leaf over X1, then a weighted sum of leaves over X1, then sums of products over pairs of variables, then deeper compositions over X1, …, X4]
Probabilistic circuits are not PGMs!

They are probabilistic and graphical, however…

              PGMs                     Circuits
Nodes:        random variables         units of computation
Edges:        dependencies             order of execution
Inference:    conditioning,            feedforward pass,
              elimination,             backward pass
              message passing

⇒ they are computational graphs, more like neural networks
Just sums, products and distributions?

[figure: an arbitrary composition of sum and product nodes over leaves for X1, …, X6]

Just arbitrarily compose them like a neural network!
⇒ structural constraints needed for tractability
How do we ensure tractability?
Decomposability

A product node is decomposable if its children depend on disjoint sets of variables
⇒ just like in factorization!

[figure: a decomposable product over X1, X2, X3 vs. a non-decomposable product whose children share X1]

Darwiche et al., “A knowledge compilation map”, 2002
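A minimal sketch of a decomposability check over a toy circuit representation (the dict-based node format is made up for illustration):

```python
def scope(node):
    """Set of variables a (sub)circuit depends on."""
    if node["type"] == "leaf":
        return {node["var"]}
    return set().union(*(scope(c) for c in node["children"]))

def is_decomposable(node):
    """Every product node must have children with pairwise disjoint scopes."""
    if node["type"] == "leaf":
        return True
    if node["type"] == "product":
        seen = set()
        for s in (scope(c) for c in node["children"]):
            if seen & s:
                return False
            seen |= s
    return all(is_decomposable(c) for c in node["children"])

ok = {"type": "product", "children": [{"type": "leaf", "var": "X1"},
                                      {"type": "leaf", "var": "X2"}]}
bad = {"type": "product", "children": [{"type": "leaf", "var": "X1"},
                                       {"type": "leaf", "var": "X1"}]}
print(is_decomposable(ok), is_decomposable(bad))   # True False
```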
Smoothness, aka completeness

A sum node is smooth if its children depend on the same set of variables
⇒ otherwise it is not accounting for some variables

[figure: a smooth sum whose children are both over X1 vs. a non-smooth sum whose children are over X1 and X2]

⇒ smoothness can be easily enforced [Shih et al. 2019]

Darwiche et al., “A knowledge compilation map”, 2002
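The analogous check for smoothness, on the same made-up toy representation:

```python
def scope(node):
    if node["type"] == "leaf":
        return {node["var"]}
    return set().union(*(scope(c) for c in node["children"]))

def is_smooth(node):
    """Every sum node must have children over the same set of variables."""
    if node["type"] == "leaf":
        return True
    if node["type"] == "sum":
        scopes = [scope(c) for c in node["children"]]
        if any(s != scopes[0] for s in scopes):
            return False
    return all(is_smooth(c) for c in node["children"])

smooth = {"type": "sum", "weights": [0.5, 0.5],
          "children": [{"type": "leaf", "var": "X1"}, {"type": "leaf", "var": "X1"}]}
non_smooth = {"type": "sum", "weights": [0.5, 0.5],
              "children": [{"type": "leaf", "var": "X1"}, {"type": "leaf", "var": "X2"}]}
print(is_smooth(smooth), is_smooth(non_smooth))   # True False
```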
Tractable MAR/CON

Smoothness and decomposability enable tractable MAR/CON queries.

If p(x, y) = p(x) p(y) (decomposability):

∫∫ p(x, y) dx dy = ∫∫ p(x) p(y) dx dy = (∫ p(x) dx) (∫ p(y) dy)

⇒ larger integrals decompose into easier ones

If p(x) = ∑_i w_i p_i(x) (smoothness):

∫ p(x) dx = ∫ ∑_i w_i p_i(x) dx = ∑_i w_i ∫ p_i(x) dx

⇒ integrals are “pushed down” to children

Forward pass evaluation for MAR ⇒ linear in circuit size!
E.g. to compute p(x2, x4):
  leaves over X1 and X3 output Z_i = ∫ p(x_i) dx_i
  ⇒ for normalized leaf distributions: 1.0
  leaves over X2 and X4 output EVI

[figure: the example circuit annotated with each unit's output during the forward pass]
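A compact sketch of the bottom-up MAR pass on a toy smooth and decomposable circuit (the structure and parameters are illustrative, not the circuit drawn on the slide):

```python
import numpy as np

def evaluate(node, evidence):
    """Single bottom-up pass: MAR/CON in time linear in circuit size.

    evidence maps observed variables to values; leaves over unobserved
    (marginalized) variables output 1 because they are normalized.
    """
    kind = node["type"]
    if kind == "leaf":
        var, p = node["var"], node["p"]            # Bernoulli leaf, p = P(var = 1)
        if var not in evidence:
            return 1.0
        return p if evidence[var] == 1 else 1.0 - p
    child_vals = [evaluate(c, evidence) for c in node["children"]]
    if kind == "product":
        return float(np.prod(child_vals))
    return float(np.dot(node["weights"], child_vals))   # sum node

leaf = lambda v, p: {"type": "leaf", "var": v, "p": p}
circuit = {"type": "sum", "weights": [0.4, 0.6], "children": [
    {"type": "product", "children": [leaf("X1", 0.8), leaf("X2", 0.3),
                                     leaf("X3", 0.6), leaf("X4", 0.9)]},
    {"type": "product", "children": [leaf("X1", 0.2), leaf("X2", 0.7),
                                     leaf("X3", 0.4), leaf("X4", 0.5)]},
]}
print(evaluate(circuit, {"X2": 1, "X4": 1}))   # p(x2, x4): X1 and X3 marginalized out
```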
Determinism, aka selectivity

A sum node is deterministic if the output of only one child is non-zero for any input
⇒ e.g. if their distributions have disjoint support

[figure: a deterministic sum whose products gate on X1 ≤ θ vs. X1 > θ, and a non-deterministic sum whose children overlap on X1]
Tractable MAP

The addition of determinism enables tractable MAP queries!

If p(q, e) = p(q_x, e_x, q_y, e_y) = p(q_x, e_x) p(q_y, e_y) (decomposable product node):

argmax_q p(q | e) = argmax_q p(q, e)
                  = argmax_{q_x, q_y} p(q_x, e_x, q_y, e_y)
                  = argmax_{q_x} p(q_x, e_x), argmax_{q_y} p(q_y, e_y)

⇒ solving the optimization independently

If p(q, e) = ∑_i w_i p_i(q, e) = max_i w_i p_i(q, e) (deterministic sum node):

argmax_q p(q, e) = argmax_q ∑_i w_i p_i(q, e)
                 = argmax_q max_i w_i p_i(q, e)
                 = max_i argmax_q w_i p_i(q, e)

⇒ only one non-zero child term, thus the sum is a max

Evaluating the circuit twice, bottom-up and top-down ⇒ still linear in circuit size!

[figure: the same circuit with sum nodes replaced by max nodes]

In practice:
1. turn sum nodes into max nodes
2. evaluate p(e) bottom-up
3. retrieve max activations top-down
4. compute MAP queries at the leaves
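A sketch of the two-pass MAP procedure on a toy circuit; the structure and parameters are made up, and the result is exact only if the sum nodes are actually deterministic (otherwise this is just the usual max-product approximation):

```python
def map_bottom_up(node, evidence):
    """Upward pass: sums act as weighted max nodes; exact under determinism."""
    if node["type"] == "leaf":
        var, p = node["var"], node["p"]            # Bernoulli leaf, p = P(var = 1)
        if var in evidence:
            return p if evidence[var] == 1 else 1.0 - p
        return max(p, 1.0 - p)                     # best completion at the leaf
    vals = [map_bottom_up(c, evidence) for c in node["children"]]
    if node["type"] == "product":
        out = 1.0
        for v in vals:
            out *= v
        return out
    node["best_child"] = max(range(len(vals)),
                             key=lambda i: node["weights"][i] * vals[i])
    return node["weights"][node["best_child"]] * vals[node["best_child"]]

def map_top_down(node, evidence, assignment):
    """Downward pass: follow max branches and read off states at the leaves."""
    if node["type"] == "leaf":
        var, p = node["var"], node["p"]
        assignment[var] = evidence.get(var, int(p >= 0.5))
    elif node["type"] == "product":
        for c in node["children"]:
            map_top_down(c, evidence, assignment)
    else:
        map_top_down(node["children"][node["best_child"]], evidence, assignment)

leaf = lambda v, p: {"type": "leaf", "var": v, "p": p}
circuit = {"type": "sum", "weights": [0.4, 0.6], "children": [
    {"type": "product", "children": [leaf("X1", 0.8), leaf("X2", 0.3)]},
    {"type": "product", "children": [leaf("X1", 0.2), leaf("X2", 0.7)]},
]}
score = map_bottom_up(circuit, {"X2": 1})
state = {}
map_top_down(circuit, {"X2": 1}, state)
print(score, state)
```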