Exact Inference: Variable Elimination Probabilistic Graphical Models Sharif University of Technology Spring 2018 Soleymani
Probabilistic Inference and Learning
We now have compact representations of probability distributions (graphical models). A GM M describes a unique probability distribution P.
Typical tasks:
Task 1: How do we answer queries about P_M, e.g., P_M(X|Y)? We use inference as a name for the process of computing answers to such queries.
Task 2: How do we estimate a plausible model M from data D?
i. We use learning as a name for the process of obtaining a point estimate of M.
ii. For Bayesians, however, the goal is p(M|D), which is itself an inference problem.
iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.
This slide has been adapted from Eric Xing, PGM 10-708, CMU.
Why we need inference
Given a graphical model, we use inference to compute marginal or conditional distributions efficiently.
We also need inference during learning, when we estimate a model from incomplete data or when the learning approach is Bayesian (as we will see in the next lectures).
Inference queries
Nodes: 𝒳 = {X_1, …, X_n}; e: evidence on a set of variables E; X = 𝒳 − E.
Likelihood (probability of evidence): P(e) = Σ_X P(X, e)
Marginal probability distribution: P(Y) = Σ_{𝒳−Y} P(𝒳)
Conditional probability distribution (a posteriori belief): P(X|e) = P(X, e) / Σ_X P(X, e)
Marginalized conditional probability distribution (X = Y ∪ Z): P(Y|e) = Σ_Z P(Y, Z, e) / Σ_Y Σ_Z P(Y, Z, e)
Here we query a subset Y of the domain variables X = {Y, Z} and "don't care" about the remaining Z.
Most Probable Assignment (MPA)
The most probable assignment for some variables of interest, given evidence E = e:
Y*|e = argmax_Y P(Y|e)
This is the maximum a posteriori configuration of Y.
Applications of MPA:
Classification: find the most likely label, given the evidence.
Explanation: find the most likely scenario, given the evidence.
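A tiny sketch of this definition (not from the slides; the 2 × 2 joint table below is hypothetical): the MPA of a binary Y given E = e is simply the argmax of the posterior P(Y|e).

    import numpy as np

    # Hypothetical joint P(Y, E) for binary Y and E.
    p_ye = np.array([[0.30, 0.10],              # rows: y = 0, 1
                     [0.15, 0.45]])             # columns: e = 0, 1
    e = 1
    posterior = p_ye[:, e] / p_ye[:, e].sum()   # P(Y | E = e)
    y_star = int(np.argmax(posterior))          # most probable assignment of Y
    print(y_star, posterior)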
MPA: Example (figure omitted)
This slide has been adapted from Eric Xing, PGM 10-708, CMU.
Marginal probability: Enumeration
P(Y|e) ∝ P(Y, e), where P(Y, e) = Σ_Z P(Y, Z, e)
Computing marginal probabilities by enumeration requires exponential computation in general: exact inference is a #P-complete problem, so enumeration is intractable.
Even for a graph of polynomial size, the computation can be exponential.
We cannot find a general procedure that works efficiently for arbitrary GMs.
Hardness of Inference
Hardness does not mean we cannot solve inference; it means we cannot find a general procedure that works efficiently for arbitrary GMs.
For particular families of GMs, we can have provably efficient procedures.
For special graph structures, provably efficient algorithms (avoiding exponential cost) are available.
Exact inference
Variable elimination algorithm: general graphs; one query at a time.
Belief propagation (sum-product on factor graphs): trees; marginal probabilities on all nodes.
Junction tree algorithm: general graphs; marginal probabilities on all clique nodes.
Inference on a chain
A → B → C → D
P(d) = Σ_a Σ_b Σ_c P(a, b, c, d)
P(d) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|b) P(d|c)
A naïve summation needs to enumerate over an exponential number of terms.
Inference on a chain: marginalization and elimination
A → B → C → D
P(d) = Σ_c Σ_b Σ_a P(a) P(b|a) P(c|b) P(d|c)
     = Σ_c P(d|c) Σ_b P(c|b) Σ_a P(a) P(b|a)
The innermost sum produces P(b), the next one P(c), and the last one P(d).
In a chain of n nodes, each taking k values, this costs O(nk²) instead of O(kⁿ).
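A minimal numeric sketch of this elimination on the chain A → B → C → D (the CPT values and helper name random_cpt are illustrative, not from the slides): each elimination step is a k × k sum-product, and the result matches naive enumeration over the full joint.

    import numpy as np

    k = 3                                     # number of values per variable
    rng = np.random.default_rng(0)

    def random_cpt(shape):
        # A random conditional table, normalized over its last axis.
        t = rng.random(shape)
        return t / t.sum(axis=-1, keepdims=True)

    p_a = random_cpt((k,))                    # P(a)
    p_b_a = random_cpt((k, k))                # P(b|a), rows indexed by a
    p_c_b = random_cpt((k, k))                # P(c|b)
    p_d_c = random_cpt((k, k))                # P(d|c)

    # Eliminate a, then b, then c: each step is a k x k sum-product, so the
    # whole chain costs O(n k^2) instead of O(k^n).
    p_b = p_a @ p_b_a                         # P(b) = sum_a P(a) P(b|a)
    p_c = p_b @ p_c_b                         # P(c) = sum_b P(b) P(c|b)
    p_d = p_c @ p_d_c                         # P(d) = sum_c P(c) P(d|c)

    # Naive enumeration over all k^4 joint configurations, for comparison.
    joint = (p_a[:, None, None, None] * p_b_a[:, :, None, None]
             * p_c_b[None, :, :, None] * p_d_c[None, None, :, :])
    assert np.allclose(p_d, joint.sum(axis=(0, 1, 2)))
    print(p_d)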
Inference on a chain
X_1 — X_2 — ⋯ — X_{N−1} — X_N
In both directed and undirected graphical models, the joint probability is a factored expression over subsets of the variables. For the undirected chain:
P(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ⋯ ψ_{N−1,N}(x_{N−1}, x_N)
P(x_i) = (1/Z) Σ_{x_1} ⋯ Σ_{x_{i−1}} Σ_{x_{i+1}} ⋯ Σ_{x_N} ψ_{1,2}(x_1, x_2) ⋯ ψ_{N−1,N}(x_{N−1}, x_N)
P(x_i) = (1/Z) [Σ_{x_{i−1}} ψ(x_{i−1}, x_i) Σ_{x_{i−2}} ψ(x_{i−2}, x_{i−1}) ⋯ Σ_{x_1} ψ(x_1, x_2)] × [Σ_{x_{i+1}} ψ(x_i, x_{i+1}) Σ_{x_{i+2}} ψ(x_{i+1}, x_{i+2}) ⋯ Σ_{x_N} ψ(x_{N−1}, x_N)]
Each elimination takes O(|Val(X_k)| × |Val(X_{k+1})|) operations.
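A small sketch of this two-sided computation with hypothetical random potentials (the function name and setup are illustrative): the left product sums out x_1, …, x_{i−1}, the right product sums out x_{i+1}, …, x_N, and normalizing the elementwise product performs the division by Z.

    import numpy as np

    N, k = 6, 4                                       # chain length, values per node
    rng = np.random.default_rng(1)
    psi = [rng.random((k, k)) for _ in range(N - 1)]  # psi[t]: potential on (X_{t+1}, X_{t+2})

    def chain_marginal(i):
        # P(x_i) for a 1-based node index i: multiply in the potentials to the
        # left of x_i, then those to the right, and normalize (dividing by Z).
        left = np.ones(k)
        for t in range(i - 1):                        # sum out x_1, ..., x_{i-1}
            left = left @ psi[t]
        right = np.ones(k)
        for t in range(N - 2, i - 2, -1):             # sum out x_N, ..., x_{i+1}
            right = psi[t] @ right
        unnorm = left * right
        return unnorm / unnorm.sum()

    print(chain_marginal(3))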
Inference on a chain: reasons for the improvement
We compute an expression of the form (sum-product inference): Σ_Z Π_{φ∈Φ} φ, where Φ is the set of factors.
We used the structure of the BN to factorize the joint distribution, so the scopes of the resulting factors are limited.
Distributive law: if X ∉ Scope(φ_1), then Σ_X φ_1·φ_2 = φ_1 · Σ_X φ_2.
Thus each summation is performed over the product of only a subset of the factors.
We also find sub-expressions that can be computed once, saved, and reused in later computations, instead of being recomputed exponentially many times.
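A tiny numeric check of the distributive law with hypothetical tables: because x is not in the scope of φ_1, φ_1 can be pulled out of the sum over x.

    import numpy as np

    phi1 = np.array([0.3, 0.7])                 # phi_1(y); x is not in its scope
    phi2 = np.array([[0.2, 0.5],                # phi_2(x, y), rows indexed by x
                     [0.8, 0.5]])
    lhs = (phi1[None, :] * phi2).sum(axis=0)    # sum_x phi_1(y) * phi_2(x, y)
    rhs = phi1 * phi2.sum(axis=0)               # phi_1(y) * sum_x phi_2(x, y)
    assert np.allclose(lhs, rhs)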
Variable elimination algorithm for sum-product inference
Sum out each variable one at a time:
All factors containing that variable are removed from the set of factors and multiplied to generate a product factor.
The variable is summed out of the product factor, yielding a new factor.
The new factor is added to the set of available factors.
The resulting factor does not necessarily correspond to any probability or conditional probability in the network.
Procedure Sum-Product-VE(G, Z)   // Z: the variables to be eliminated
  Φ ← all factors of G
  Select an elimination ordering Z_1, …, Z_k for Z
  for i = 1, …, k
    Φ ← Sum-Product-Eliminate-Var(Φ, Z_i)
  φ* ← Π_{φ∈Φ} φ
  return φ*

Procedure Sum-Product-Eliminate-Var(Φ, Z)   // Z: the variable to be eliminated
  Φ' ← {φ ∈ Φ : Z ∈ Scope(φ)}
  Φ'' ← Φ − Φ'
  τ ← Σ_Z Π_{φ∈Φ'} φ
  return Φ'' ∪ {τ}

• Move all factors irrelevant to the variable currently being eliminated outside of the summation.
• Perform the sum, getting a new term.
• Insert the new term into the product.
For a directed graph with no evidence, no normalization is needed.
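The two procedures above can be sketched directly in code. Below is a minimal Python/numpy sketch (not from the slides; the Factor class, its methods, and the function names are illustrative assumptions): a factor stores its scope and a table with one axis per variable, and the eliminate-var step multiplies the factors that mention the variable and sums it out.

    import numpy as np
    from functools import reduce

    class Factor:
        """A factor over named discrete variables; one table axis per variable."""
        def __init__(self, variables, table):
            self.vars = list(variables)
            self.table = np.asarray(table, dtype=float)

        def _expand(self, new_vars):
            # Reorder this factor's axes to follow new_vars and insert size-1
            # axes for variables outside its scope (so factors can broadcast).
            order = sorted(range(len(self.vars)),
                           key=lambda i: new_vars.index(self.vars[i]))
            t = np.transpose(self.table, order)
            shape, sizes = [], iter(t.shape)
            for v in new_vars:
                shape.append(next(sizes) if v in self.vars else 1)
            return t.reshape(shape)

        def multiply(self, other):
            # Factor product: union of scopes, element-wise product via broadcasting.
            new_vars = self.vars + [v for v in other.vars if v not in self.vars]
            return Factor(new_vars, self._expand(new_vars) * other._expand(new_vars))

        def sum_out(self, var):
            # Marginalize one variable out of the factor.
            i = self.vars.index(var)
            return Factor(self.vars[:i] + self.vars[i + 1:], self.table.sum(axis=i))

    def sum_product_eliminate_var(factors, z):
        # Multiply all factors whose scope contains z, sum z out (the new factor
        # tau), and return the untouched factors together with tau.
        used = [f for f in factors if z in f.vars]
        rest = [f for f in factors if z not in f.vars]
        if not used:
            return rest
        tau = reduce(lambda a, b: a.multiply(b), used).sum_out(z)
        return rest + [tau]

    def sum_product_ve(factors, elim_order):
        # Eliminate the variables one at a time, then multiply what remains.
        for z in elim_order:
            factors = sum_product_eliminate_var(factors, z)
        return reduce(lambda a, b: a.multiply(b), factors)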
Procedure Cond-Prob-VE(
    G,       // the network over X
    Y,       // set of query variables
    E = e)   // evidence
  Φ ← the factors parametrizing G
  Replace each φ ∈ Φ by φ[E = e]
  Select an elimination ordering Z_1, …, Z_k for Z = X − Y − E
  for i = 1, …, k
    Φ ← Sum-Product-Eliminate-Var(Φ, Z_i)
  φ* ← Π_{φ∈Φ} φ
  α ← Σ_{y∈Val(Y)} φ*(y)
  return α, φ*
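A hedged sketch of Cond-Prob-VE built on the Factor class above (the helper names and the evidence encoding, a map from variable name to observed value index, are assumptions): the reduction φ[E = e] slices the observed value out of each factor's table, Z = X − Y − E is eliminated with the routine above, and α = Σ_y φ*(y) is returned together with φ*, so the conditional is φ*/α.

    def reduce_evidence(factor, evidence):
        # phi[E = e]: restrict the factor to the observed value of each
        # evidence variable appearing in its scope.
        f = factor
        for var, val in evidence.items():
            if var in f.vars:
                i = f.vars.index(var)
                f = Factor(f.vars[:i] + f.vars[i + 1:], np.take(f.table, val, axis=i))
        return f

    def cond_prob_ve(factors, query_vars, evidence, all_vars):
        # Reduce by evidence, eliminate Z = X - Y - E, multiply the remaining
        # factors into phi*, and compute the normalizer alpha = sum_y phi*(y).
        factors = [reduce_evidence(f, evidence) for f in factors]
        elim = [v for v in all_vars if v not in query_vars and v not in evidence]
        phi_star = sum_product_ve(factors, elim)      # unnormalized P(Y, e)
        alpha = phi_star.table.sum()                  # = P(e)
        return alpha, phi_star                        # P(Y | e) = phi_star.table / alpha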
Directed example
Query: P(X_2 | X_7 = x_7)
[Figure: a directed network over X_1, …, X_8]
P(X_2 | x_7) ∝ P(X_2, x_7)
P(x_2, x_7) = Σ_{x_1} Σ_{x_3} Σ_{x_4} Σ_{x_5} Σ_{x_6} Σ_{x_8} P(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)
Consider the elimination order X_1, X_3, X_4, X_5, X_6, X_8:
P(x_2, x_7) = Σ_{x_8} Σ_{x_6} Σ_{x_5} Σ_{x_4} Σ_{x_3} Σ_{x_1} P(x_1) P(x_2) P(x_3|x_1, x_2) P(x_4|x_3) P(x_5|x_2) P(x_6|x_3, x_7) P(x_7|x_4, x_5) P(x_8|x_7)
P(x_2, x_7) = Σ_{x_8} Σ_{x_6} Σ_{x_5} Σ_{x_4} Σ_{x_3} P(x_2) P(x_4|x_3) P(x_5|x_2) P(x_6|x_3, x_7) P(x_7|x_4, x_5) P(x_8|x_7) Σ_{x_1} P(x_1) P(x_3|x_1, x_2)
= Σ_{x_8} Σ_{x_6} Σ_{x_5} Σ_{x_4} Σ_{x_3} P(x_2) P(x_4|x_3) P(x_5|x_2) P(x_6|x_3, x_7) P(x_7|x_4, x_5) P(x_8|x_7) m_1(x_2, x_3)
= Σ_{x_8} Σ_{x_6} Σ_{x_5} Σ_{x_4} P(x_2) P(x_5|x_2) P(x_7|x_4, x_5) P(x_8|x_7) Σ_{x_3} P(x_4|x_3) P(x_6|x_3, x_7) m_1(x_2, x_3)
= Σ_{x_8} Σ_{x_6} Σ_{x_5} Σ_{x_4} P(x_2) P(x_5|x_2) P(x_7|x_4, x_5) P(x_8|x_7) m_3(x_2, x_4, x_6)
= Σ_{x_8} Σ_{x_6} Σ_{x_5} P(x_2) P(x_5|x_2) P(x_8|x_7) Σ_{x_4} P(x_7|x_4, x_5) m_3(x_2, x_4, x_6)
= Σ_{x_8} Σ_{x_6} Σ_{x_5} P(x_2) P(x_5|x_2) P(x_8|x_7) m_4(x_2, x_5, x_6)
= Σ_{x_8} Σ_{x_6} P(x_2) P(x_8|x_7) Σ_{x_5} P(x_5|x_2) m_4(x_2, x_5, x_6)
= Σ_{x_8} Σ_{x_6} P(x_2) P(x_8|x_7) m_5(x_2, x_6)
= Σ_{x_8} P(x_2) P(x_8|x_7) Σ_{x_6} m_5(x_2, x_6)
= Σ_{x_8} P(x_2) P(x_8|x_7) m_6(x_2)
= m_8(x_2) m_6(x_2)
Conditional probability
P(x_2 | x_7) = m_8(x_2) m_6(x_2) / Σ_{x_2} m_8(x_2) m_6(x_2)
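As a numeric sanity check of this example, the sketch below (continuing the Factor and Cond-Prob-VE sketches above, with hypothetical binary CPTs since the slides give no numbers) builds the eight factors of the network, runs variable elimination for P(X_2 | X_7 = 1) with the elimination order X_1, X_3, X_4, X_5, X_6, X_8, and compares the answer with brute-force enumeration over the full joint.

    rng = np.random.default_rng(0)

    def cpt(child, parents):
        # A random CPT P(child | parents), stored as a Factor over parents + [child]
        # and normalized over the child axis (all variables are binary here).
        shape = (2,) * (len(parents) + 1)
        t = rng.random(shape)
        return Factor(parents + [child], t / t.sum(axis=-1, keepdims=True))

    factors = [cpt('x1', []), cpt('x2', []), cpt('x3', ['x1', 'x2']),
               cpt('x4', ['x3']), cpt('x5', ['x2']), cpt('x6', ['x3', 'x7']),
               cpt('x7', ['x4', 'x5']), cpt('x8', ['x7'])]
    all_vars = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8']

    alpha, phi = cond_prob_ve(factors, ['x2'], {'x7': 1}, all_vars)
    print(phi.table / alpha)                      # P(X2 | X7 = 1) via variable elimination

    # Brute force: multiply all eight factors, reorder axes to x1..x8, and
    # marginalize everything except x2 and x7 directly from the joint.
    joint = reduce(lambda a, b: a.multiply(b), factors)
    table = np.transpose(joint.table, [joint.vars.index(v) for v in all_vars])
    p_x2_x7 = table.sum(axis=(0, 2, 3, 4, 5, 7))  # keep the axes for x2 and x7
    print(p_x2_x7[:, 1] / p_x2_x7[:, 1].sum())    # P(X2 | X7 = 1) by enumeration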