Inference

Probabilistic inference is the computation of posterior probabilities for
query propositions given observed evidence. The full joint distribution
can be viewed as the KB from which answers to all questions may be derived.

Start with the joint distribution over Toothache, Catch, Cavity:

                  toothache             ¬toothache
               catch    ¬catch       catch    ¬catch
    cavity     .108      .012        .072      .008
    ¬cavity    .016      .064        .144      .576

For any proposition φ, sum the atomic events where it is true:
    P(φ) = Σ_{ω: ω ⊨ φ} P(ω)
Inference by enumeration

Start with the joint distribution above and sum the atomic events where
the proposition is true: P(φ) = Σ_{ω: ω ⊨ φ} P(ω)

    P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

    P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064
                          = 0.28

Can also compute conditional probabilities:

    P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                           = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
                           = 0.4
Normalization

The denominator can be viewed as a normalization constant α:

    P(Cavity | toothache)
      = α P(Cavity, toothache)
      = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
      = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
      = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩

Idea: compute the distribution on the query variable by fixing the
evidence variables and summing over the hidden variables
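As a concrete check, here is a minimal Python sketch (not from the slides; the dictionary layout and helper name are our own) that stores the joint table above keyed by (toothache, catch, cavity) values and reproduces the enumeration and normalization results.

    # Sketch: inference by enumeration over the dentist joint distribution.
    # Keys are (toothache, catch, cavity) truth values.
    joint = {
        (True,  True,  True):  0.108, (True,  False, True):  0.012,
        (False, True,  True):  0.072, (False, False, True):  0.008,
        (True,  True,  False): 0.016, (True,  False, False): 0.064,
        (False, True,  False): 0.144, (False, False, False): 0.576,
    }

    def prob(event):
        """P(phi): sum the joint entries of the worlds where phi holds."""
        return sum(p for world, p in joint.items() if event(*world))

    p_toothache = prob(lambda t, c, cav: t)                    # 0.2
    p_cav_or_tooth = prob(lambda t, c, cav: cav or t)          # 0.28
    p_not_cav_given_tooth = (prob(lambda t, c, cav: t and not cav)
                             / p_toothache)                    # 0.4

    # Normalization: P(Cavity | toothache) as a distribution
    unnorm = [prob(lambda t, c, cav: t and cav),
              prob(lambda t, c, cav: t and not cav)]           # [0.12, 0.08]
    alpha = 1 / sum(unnorm)
    print([alpha * x for x in unnorm])                         # [0.6, 0.4]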
Inference by enumeration contd.

Let X be all the variables. Ask the posterior joint distribution of the
query variables Y given specific values e for the evidence variables E

Let the hidden variables be H = X − Y − E
⇒ the required summation of joint entries is done by summing out the
hidden variables:
    P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)

The terms in the summation are joint entries because Y, E, and H together
exhaust the set of random variables

Problems
  1) Worst-case time complexity O(d^n) where d is the largest arity
  2) Space complexity O(d^n) to store the joint distribution
  3) How to find the numbers for O(d^n) entries?
Independence

A and B are independent iff
    P(A | B) = P(A)  or  P(B | A) = P(B)  or  P(A, B) = P(A) P(B)

(Figure: the network over Toothache, Catch, Cavity, Weather decomposes
into a Toothache–Catch–Cavity component and a separate Weather node)

    P(Toothache, Catch, Cavity, Weather)
      = P(Toothache, Catch, Cavity) P(Weather)

32 entries reduced to 12; for n independent biased coins, 2^n → n

Absolute independence is powerful but rare
Dentistry is a large field with hundreds of variables, none of which are
independent. What to do?
Conditional independence

P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries

If I have a cavity, the probability that the probe catches in it doesn't
depend on whether I have a toothache:
    (1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
    (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

Catch is conditionally independent of Toothache given Cavity:
    P(Catch | Toothache, Cavity) = P(Catch | Cavity)

Equivalent statements
    P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
    P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence

Write out the full joint distribution using the chain rule:
    P(Toothache, Catch, Cavity)
      = P(Toothache | Catch, Cavity) P(Catch, Cavity)
      = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
      = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers

In most cases, the use of conditional independence reduces the size of the
representation of the joint distribution from exponential in n to linear in n

Conditional independence is our most basic and robust form of knowledge
about uncertainty
Bayes’ Rule

Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

⇒ Bayes’ rule: P(a | b) = P(b | a) P(a) / P(b)

or in distribution form
    P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)

Useful for assessing diagnostic probability from causal probability:
    P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

E.g., let M be meningitis, S be stiff neck:
    P(m | s) = P(s | m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
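The meningitis example is a one-line calculation; the tiny sketch below (variable names are ours, not the slides') just verifies the arithmetic.

    # Sketch: the meningitis/stiff-neck example via Bayes' rule
    p_s_given_m, p_m, p_s = 0.8, 0.0001, 0.1   # numbers from the slide
    p_m_given_s = p_s_given_m * p_m / p_s
    print(p_m_given_s)                          # 0.0008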
Bayes’ Rule and conditional independence

    P(Cavity | toothache ∧ catch)
      = α P(toothache ∧ catch | Cavity) P(Cavity)
      = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)

This is an example of a naive Bayes model (Bayesian classifier):
    P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_i P(Effect_i | Cause)

(Figure: Cause with children Effect_1, ..., Effect_n; here Cavity with
children Toothache and Catch)

Total number of parameters is linear in n
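A quick way to see the naive Bayes computation at work is to plug in conditional probabilities derived from the dentist joint table earlier in the section (P(cavity) = 0.12/… wait, P(cavity) = 0.2, P(toothache | cavity) = 0.12/0.2 = 0.6, and so on). The sketch below is our own illustration, not part of the slides; it recovers P(Cavity | toothache, catch) ≈ ⟨0.87, 0.13⟩.

    # Sketch of the naive Bayes computation P(Cavity | toothache, catch).
    # Conditionals derived from the joint table given earlier.
    p_cavity = 0.2
    p_tooth = {True: 0.6, False: 0.1}    # P(toothache | Cavity)
    p_catch = {True: 0.9, False: 0.2}    # P(catch     | Cavity)

    unnorm = {cav: (p_cavity if cav else 1 - p_cavity)
                   * p_tooth[cav] * p_catch[cav]
              for cav in (True, False)}          # {True: 0.108, False: 0.016}
    alpha = 1 / sum(unnorm.values())
    posterior = {cav: alpha * p for cav, p in unnorm.items()}
    print(posterior)                             # ≈ {True: 0.871, False: 0.129}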
Example: Wumpus World

(Figure: 4×4 grid; squares [1,1], [1,2], [2,1] have been visited and are
OK, with breezes observed in [1,2] and [2,1])

P_ij = true iff [i,j] contains a pit
B_ij = true iff [i,j] is breezy

Include only B_{1,1}, B_{1,2}, B_{2,1} in the probability model
Specifying the probability model

The full joint distribution is P(P_{1,1}, ..., P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})

Apply the product rule:
    P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, ..., P_{4,4}) P(P_{1,1}, ..., P_{4,4})
(Do it this way to get P(Effect | Cause))

First term: 1 if pits are adjacent to breezes, 0 otherwise

Second term: pits are placed randomly, probability 0.2 per square:
    P(P_{1,1}, ..., P_{4,4}) = Π_{i,j=1,1}^{4,4} P(P_{i,j}) = 0.2^n × 0.8^{16−n}
for n pits
Observations and query

We know the following facts:
    b = ¬b_{1,1} ∧ b_{1,2} ∧ b_{2,1}
    known = ¬p_{1,1} ∧ ¬p_{1,2} ∧ ¬p_{2,1}

Query is P(P_{1,3} | known, b)

Define Unknown = the P_{ij}s other than P_{1,3} and Known

For inference by enumeration, we have
    P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)

Grows exponentially with the number of squares
Using conditional independence

Basic insight: observations are conditionally independent of other hidden
squares given neighbouring hidden squares

(Figure: the grid partitioned into KNOWN ([1,1], [1,2], [2,1]), the QUERY
square [1,3], the FRINGE squares adjacent to the known squares, and the
remaining OTHER squares)

Define Unknown = Fringe ∪ Other
    P(b | P_{1,3}, Known, Unknown) = P(b | P_{1,3}, Known, Fringe)

Manipulate the query into a form where we can use this
Using conditional independence

P(P_{1,3} | known, b)
  = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
  = α Σ_{unknown} P(b | P_{1,3}, known, unknown) P(P_{1,3}, known, unknown)
  = α Σ_{fringe} Σ_{other} P(b | known, P_{1,3}, fringe, other) P(P_{1,3}, known, fringe, other)
  = α Σ_{fringe} Σ_{other} P(b | known, P_{1,3}, fringe) P(P_{1,3}, known, fringe, other)
  = α Σ_{fringe} P(b | known, P_{1,3}, fringe) Σ_{other} P(P_{1,3}, known, fringe, other)
  = α Σ_{fringe} P(b | known, P_{1,3}, fringe) Σ_{other} P(P_{1,3}) P(known) P(fringe) P(other)
  = α P(known) P(P_{1,3}) Σ_{fringe} P(b | known, P_{1,3}, fringe) P(fringe) Σ_{other} P(other)
  = α′ P(P_{1,3}) Σ_{fringe} P(b | known, P_{1,3}, fringe) P(fringe)
Using conditional independence

(Figure: the five fringe models consistent with the observations; for
P_{1,3} = true the fringe configurations have probabilities
0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16; for P_{1,3} = false
they have probabilities 0.2 × 0.2 = 0.04 and 0.2 × 0.8 = 0.16)

P(P_{1,3} | known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩
                      ≈ ⟨0.31, 0.69⟩

P(P_{2,2} | known, b) ≈ ⟨0.86, 0.14⟩
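The fringe summation above is small enough to check by brute force. The following sketch is our own illustration (square names hard-coded): it enumerates the fringe squares [2,2] and [3,1] and reproduces ⟨0.31, 0.69⟩.

    # Sketch: the fringe summation for P(P13 | known, b) in the wumpus example.
    # Fringe squares are [2,2] and [3,1]; each contains a pit with prob. 0.2.
    from itertools import product

    def breeze_consistent(p13, p22, p31):
        breeze_12 = p13 or p22      # pit neighbours of [1,2]: [1,3], [2,2] ([1,1] is pit-free)
        breeze_21 = p22 or p31      # pit neighbours of [2,1]: [2,2], [3,1] ([1,1] is pit-free)
        return breeze_12 and breeze_21

    unnorm = {}
    for p13 in (True, False):
        total = 0.0
        for p22, p31 in product((True, False), repeat=2):
            if breeze_consistent(p13, p22, p31):
                total += (0.2 if p22 else 0.8) * (0.2 if p31 else 0.8)
        unnorm[p13] = (0.2 if p13 else 0.8) * total

    alpha = 1 / sum(unnorm.values())
    print({k: round(alpha * v, 2) for k, v in unnorm.items()})
    # {True: 0.31, False: 0.69}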
Bayesian networks

BNs: a graphical notation for conditional independence assertions and
hence for compact specification of full joint distributions
(also known as Probabilistic Graphical Models, PGMs)

Syntax:
  a set of nodes, one per variable
  a directed, acyclic graph (link ≈ "directly influences")
  a conditional distribution for each node given its parents:
      P(X_i | Parents(X_i))

In the simplest case, the conditional distribution is represented as a
conditional probability table (CPT) giving the distribution over X_i for
each combination of parent values
Example

Topology of the network encodes conditional independence assertions:

(Figure: Weather as an isolated node; Cavity with children Toothache and
Catch)

Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Example

I’m at work, neighbor John calls to say my alarm is ringing, but neighbor
Mary doesn’t call. Sometimes it’s set off by minor earthquakes.
Is there a burglar?

Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
  – A burglar can set the alarm off
  – An earthquake can set the alarm off
  – The alarm can cause Mary to call
  – The alarm can cause John to call
Example

(Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls)

    P(B) = .001          P(E) = .002

    B  E  P(A | B, E)       A  P(J | A)       A  P(M | A)
    T  T     .95            T    .90          T    .70
    T  F     .94            F    .05          F    .01
    F  T     .29
    F  F     .001
Compactness

A CPT for Boolean X_i with k Boolean parents has 2^k rows for the
combinations of parent values

Each row requires one number p for X_i = true
(the number for X_i = false is just 1 − p)

If each variable has no more than k parents, the complete network requires
O(n · 2^k) numbers

I.e., grows linearly with n, vs. O(2^n) for the full joint distribution

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)

In certain cases (assumptions of conditional independence), BNs make
O(2^n) ⇒ O(kn)  (NP ⇒ P!)
Global semantics

Global semantics defines the full joint distribution as the product of the
local conditional distributions:
    P(x_1, ..., x_n) = Π_{i=1}^n P(x_i | parents(X_i))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
        = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
        = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
        ≈ 0.00063
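The product of CPT entries can be checked directly. Below is a small sketch (our own encoding of the burglary CPTs from the previous slides; function and variable names are illustrative) that evaluates the joint for any assignment and reproduces the 0.00063 figure.

    # Sketch: the burglary network's joint as a product of CPT entries
    P_B, P_E = 0.001, 0.002
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
    P_J = {True: 0.90, False: 0.05}                      # P(j | A)
    P_M = {True: 0.70, False: 0.01}                      # P(m | A)

    def joint(b, e, a, j, m):
        p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
        p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
        p *= (P_J[a] if j else 1 - P_J[a]) * (P_M[a] if m else 1 - P_M[a])
        return p

    print(joint(b=False, e=False, a=True, j=True, m=True))  # ≈ 0.000628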
Local semantics

Local semantics: each node is conditionally independent of its
nondescendants (the Z_{ij}) given its parents (the U_i)

(Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and
nondescendants Z_{1j}, ..., Z_{nj})

Theorem: Local semantics ⇔ global semantics
Markov blanket

Each node is conditionally independent of all others given its
Markov blanket: parents + children + children's parents

(Figure: the Markov blanket of X — its parents U_1, ..., U_m, its children
Y_1, ..., Y_n, and the children's other parents Z_{1j}, ..., Z_{nj})
Constructing Bayesian networks

Algorithm: a series of locally testable assertions of conditional
independence guarantees the required global semantics

1. Choose an ordering of variables X_1, ..., X_n
2. For i = 1 to n
     add X_i to the network
     select parents from X_1, ..., X_{i−1} such that
         P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})

This choice of parents guarantees the global semantics:
    P(X_1, ..., X_n) = Π_{i=1}^n P(X_i | X_1, ..., X_{i−1})   (chain rule)
                     = Π_{i=1}^n P(X_i | Parents(X_i))        (by construction)

Each node is conditionally independent of its other predecessors in the
node (partial) ordering, given its parents
Example

Suppose we choose the ordering M, J, A, B, E
(adding MaryCalls, JohnCalls, Alarm, Burglary, Earthquake in turn)

    P(J | M) = P(J)?  No
    P(A | J, M) = P(A | J)?  P(A | J, M) = P(A)?  No
    P(B | A, J, M) = P(B | A)?  Yes
    P(B | A, J, M) = P(B)?  No
    P(E | B, A, J, M) = P(E | A)?  No
    P(E | B, A, J, M) = P(E | A, B)?  Yes
Example

(Figure: the resulting network for the ordering M, J, A, B, E:
MaryCalls → JohnCalls, both → Alarm, Alarm → Burglary,
Alarm and Burglary → Earthquake)

Assessing conditional probabilities is hard in noncausal directions

A network can be far more compact than the full joint distribution
But this network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers
(due to the ordering of the variables)
Probabilistic reasoning

• Exact inference by enumeration
• Exact inference by variable elimination
• Approximate inference by stochastic simulation
• Approximate inference by Markov chain Monte Carlo
Reasoning tasks in BNs (PGMs)

Simple queries: compute the posterior marginal P(X_i | E = e)
    e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)

Conjunctive queries:
    P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)

Optimal decisions: decision networks include utility information;
    probabilistic inference required for P(outcome | action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?
Inference by enumeration

Slightly intelligent way to sum out variables from the joint without
actually constructing its explicit representation

Simple query on the burglary network:
    P(B | j, m) = P(B, j, m) / P(j, m)
                = α P(B, j, m)
                = α Σ_e Σ_a P(B, e, a, j, m)

Rewrite full joint entries using a product of CPT entries:
    P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
                = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(n) space, O(d^n) time
Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a belief network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
      Q(x_i) ← Enumerate-All(bn.Vars, e_{x_i})
          where e_{x_i} is e extended with X = x_i
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
      then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
      else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e_y)
          where e_y is e extended with Y = y
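A direct Python transcription of Enumeration-Ask for the burglary network might look like the sketch below. The dictionary-based network encoding and helper names are our own; the recursion follows the pseudocode above and returns P(B | j, m) ≈ ⟨0.284, 0.716⟩.

    # Sketch of Enumeration-Ask for the burglary network.
    # Each variable maps to (parents, CPT); CPT keys are tuples of parent values.
    network = {                                   # insertion order = topological order
        'B': ((), {(): 0.001}),
        'E': ((), {(): 0.002}),
        'A': (('B', 'E'), {(True, True): 0.95, (True, False): 0.94,
                           (False, True): 0.29, (False, False): 0.001}),
        'J': (('A',), {(True,): 0.90, (False,): 0.05}),
        'M': (('A',), {(True,): 0.70, (False,): 0.01}),
    }

    def p(var, value, event):
        parents, cpt = network[var]
        prob_true = cpt[tuple(event[par] for par in parents)]
        return prob_true if value else 1 - prob_true

    def enumerate_all(variables, event):
        if not variables:
            return 1.0
        first, rest = variables[0], variables[1:]
        if first in event:
            return p(first, event[first], event) * enumerate_all(rest, event)
        return sum(p(first, v, event) * enumerate_all(rest, {**event, first: v})
                   for v in (True, False))

    def enumeration_ask(X, evidence):
        order = list(network)
        q = {x: enumerate_all(order, {**evidence, X: x}) for x in (True, False)}
        alpha = 1 / sum(q.values())
        return {x: alpha * v for x, v in q.items()}

    print(enumeration_ask('B', {'J': True, 'M': True}))  # ≈ {True: 0.284, False: 0.716}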
Evaluation tree

(Figure: the evaluation tree for P(b | j, m), summing at the "+" nodes;
the branches carry P(b) = .001, P(e)/P(¬e) = .002/.998, then
P(a | b, e), P(¬a | b, e), etc., with leaves P(j | a) P(m | a) and
P(j | ¬a) P(m | ¬a))

Enumeration is inefficient: repeated computation
    e.g., computes P(j | a) P(m | a) for each value of e
    improved by eliminating repeated variables
Inference by variable elimination

Variable elimination: carry out summations right-to-left, storing
intermediate results (factors) to avoid recomputation

P(B | j, m)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
        (one factor per variable: B, E, A, J, M)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a f_A(a, b, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) f_ĀJM(b, e)            (sum out A)
  = α P(B) f_ĒĀJM(b)                       (sum out E)
  = α f_B(b) × f_ĒĀJM(b)
Variable elimination: Basic operations

Summing out a variable from a product of factors:
  move any constant factors outside the summation
  add up submatrices in the pointwise product of the remaining factors
    Σ_x f_1 × ··· × f_k = f_1 × ··· × f_i Σ_x f_{i+1} × ··· × f_k
                        = f_1 × ··· × f_i × f_X̄
  assuming f_1, ..., f_i do not depend on X

Pointwise product of factors f_1 and f_2:
    f_1(x_1, ..., x_j, y_1, ..., y_k) × f_2(y_1, ..., y_k, z_1, ..., z_l)
      = f(x_1, ..., x_j, y_1, ..., y_k, z_1, ..., z_l)
    e.g., f_1(a, b) × f_2(b, c) = f(a, b, c)
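The two factor operations translate almost directly into code. In the sketch below (our own illustration, with a factor represented as a variable list plus a table over Boolean value tuples; the numbers in the example factors are arbitrary), pointwise_product implements f1 × f2 and sum_out implements Σ_x.

    # Sketch of the two factor operations: pointwise product and summing out.
    from itertools import product

    def pointwise_product(f1, f2):
        vars1, t1 = f1
        vars2, t2 = f2
        out_vars = vars1 + [v for v in vars2 if v not in vars1]
        table = {}
        for vals in product((True, False), repeat=len(out_vars)):
            assign = dict(zip(out_vars, vals))
            table[vals] = (t1[tuple(assign[v] for v in vars1)]
                           * t2[tuple(assign[v] for v in vars2)])
        return out_vars, table

    def sum_out(var, factor):
        vars_, table = factor
        out_vars = [v for v in vars_ if v != var]
        out = {}
        for vals, p in table.items():
            assign = dict(zip(vars_, vals))
            key = tuple(assign[v] for v in out_vars)
            out[key] = out.get(key, 0.0) + p
        return out_vars, out

    # e.g. f1(A,B) × f2(B,C) = f(A,B,C), then sum out B (illustrative numbers)
    f1 = (['A', 'B'], {(True, True): 0.3, (True, False): 0.7,
                       (False, True): 0.9, (False, False): 0.1})
    f2 = (['B', 'C'], {(True, True): 0.2, (True, False): 0.8,
                       (False, True): 0.6, (False, False): 0.4})
    print(sum_out('B', pointwise_product(f1, f2)))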
Variable elimination algorithm

function Elimination-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  factors ← [ ]
  for each var in Order(bn.Vars) do
      factors ← [Make-Factor(var, e) | factors]
      if var is a hidden variable then factors ← Sum-Out(var, factors)
  return Normalize(Pointwise-Product(factors))
Irrelevant variables

Consider the query P(JohnCalls | Burglary = true):
    P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)

The sum over m is identically 1; M is irrelevant to the query

Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and
    Ancestors({X} ∪ E) = {Alarm, Earthquake}
so MaryCalls is irrelevant

(Compare this to backward chaining from the query in Horn clause KBs)
Irrelevant variables

Defn: the moral graph of a Bayes net: marry all parents and drop arrows

Defn: A is m-separated from B by C iff separated by C in the moral graph

Thm 2: Y is irrelevant if m-separated from X by E

For P(JohnCalls | Alarm = true), both Burglary and Earthquake are
irrelevant
Complexity of exact inference

Singly connected networks (or polytrees):
  – any two nodes are connected by at most one (undirected) path
  – time and space cost of variable elimination are O(d^k n)

Multiply connected networks:
  – can reduce 3SAT to exact inference ⇒ NP-hard
  – equivalent to counting 3SAT models ⇒ #P-complete

(Figure: a network encoding the 3-CNF formula
1. A ∨ B ∨ C   2. C ∨ D ∨ ¬A   3. B ∨ C ∨ ¬D,
with A, B, C, D each true with probability 0.5, clause nodes 1, 2, 3,
and an AND node combining them)
Inference by stochastic simulation

Idea
  1) Draw N samples from a sampling distribution S
  2) Compute an approximate posterior probability P̂
  3) Show this converges to the true probability P

Methods
  – Sampling from an empty network
  – Rejection sampling: reject samples disagreeing with evidence
  – Likelihood weighting: use evidence to weight samples
  – Markov chain Monte Carlo (MCMC): sample from a stochastic process
    whose stationary distribution is the true posterior
Sampling from an empty network

Direct sampling from a network that has no evidence associated
(sampling each variable in turn, in topological order)

function Prior-Sample(bn) returns an event sampled from P(X_1, ..., X_n)
                            specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X_1, ..., X_n)
  x ← an event with n elements
  for each variable X_i in X_1, ..., X_n do
      x_i ← a random sample from P(X_i | Parents(X_i))
  return x
Example

(Figure: the sprinkler network: Cloudy → Sprinkler, Cloudy → Rain,
Sprinkler and Rain → WetGrass)

    P(C) = .50

    C  P(S | C)       C  P(R | C)       S  R  P(W | S, R)
    T    .10          T    .80          T  T     .99
    F    .50          F    .20          T  F     .90
                                        F  T     .90
                                        F  F     .01
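A possible Python rendering of Prior-Sample for this network is sketched below (our own encoding; the CPTs above are written as functions of the partially built event, and the estimate of P(WetGrass = true) ≈ 0.65 is just an illustration of how the samples are used).

    # Sketch of Prior-Sample for the sprinkler network above
    import random

    CPT = {   # P(var = true | parents), as a function of the partial event
        'Cloudy':    lambda e: 0.5,
        'Sprinkler': lambda e: 0.10 if e['Cloudy'] else 0.50,
        'Rain':      lambda e: 0.80 if e['Cloudy'] else 0.20,
        'WetGrass':  lambda e: {(True, True): 0.99, (True, False): 0.90,
                                (False, True): 0.90, (False, False): 0.01}
                               [(e['Sprinkler'], e['Rain'])],
    }
    ORDER = ['Cloudy', 'Sprinkler', 'Rain', 'WetGrass']   # topological order

    def prior_sample():
        event = {}
        for var in ORDER:
            event[var] = random.random() < CPT[var](event)
        return event

    # e.g. estimate P(WetGrass = true) from 10000 samples
    samples = [prior_sample() for _ in range(10000)]
    print(sum(s['WetGrass'] for s in samples) / len(samples))   # ≈ 0.65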
Sampling from an empty network contd.

Probability that Prior-Sample generates a particular event:
    S_PS(x_1 ... x_n) = Π_{i=1}^n P(x_i | parents(X_i)) = P(x_1 ... x_n)
i.e., the true prior probability

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x_1 ... x_n) be the number of samples generated for the event
x_1, ..., x_n. Then we have
    lim_{N→∞} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1 ... x_n)

That is, estimates derived from Prior-Sample are consistent

Shorthand: P̂(x_1, ..., x_n) ≈ P(x_1 ... x_n)
Rejection sampling

P̂(X | e) is estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: N, a vector of counts for each value of X, initially zero
  for j = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then   /* discard samples that do not match the evidence */
          N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
Example

Estimate P(Rain | Sprinkler = true) using 100 samples

27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure
Rejection sampling contd.

P̂(X | e) = α N_PS(X, e)                (algorithm defn.)
          = N_PS(X, e) / N_PS(e)        (normalized by N_PS(e))
          ≈ P(X, e) / P(e)              (property of Prior-Sample)
          = P(X | e)                    (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P(e) is small
P(e) drops off exponentially with the number of evidence variables
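Rejection sampling is a thin wrapper around prior sampling. The sketch below is an illustration that assumes the CPT, ORDER and prior_sample definitions from the prior-sampling sketch above; it estimates P(Rain | Sprinkler = true) ≈ ⟨0.3, 0.7⟩, consistent with the example.

    # Sketch of rejection sampling (reuses prior_sample from the earlier sketch)
    def rejection_sampling(query, evidence, n):
        counts = {True: 0, False: 0}
        for _ in range(n):
            sample = prior_sample()
            if all(sample[var] == val for var, val in evidence.items()):
                counts[sample[query]] += 1
            # samples disagreeing with the evidence are simply discarded
        total = sum(counts.values())
        return {v: c / total for v, c in counts.items()} if total else None

    print(rejection_sampling('Rain', {'Sprinkler': True}, 100_000))
    # ≈ {True: 0.3, False: 0.7}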
Likelihood weighting

Idea
  – fix evidence variables
  – sample only nonevidence variables
  – weight each sample by the likelihood it accords the evidence
Likelihood weighting

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: W, a vector of weighted counts for each value of X,
                   initially zero
  for j = 1 to N do
      x, w ← Weighted-Sample(bn, e)
      W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements from e; w ← 1
  for each variable X_i in X_1, ..., X_n do
      if X_i is an evidence variable with value x_i in e
          then w ← w × P(X_i = x_i | Parents(X_i))
          else x[i] ← a random sample from P(X_i | Parents(X_i))
  return x, w
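A compact rendering of the two functions above is sketched below; it again reuses the CPT and ORDER definitions from the prior-sampling sketch (so it is an illustration rather than a standalone program) and estimates P(Rain | Sprinkler = true, WetGrass = true) ≈ ⟨0.32, 0.68⟩.

    # Sketch of likelihood weighting (reuses CPT and ORDER from the earlier sketch)
    import random
    from collections import defaultdict

    def weighted_sample(evidence):
        event, w = dict(evidence), 1.0
        for var in ORDER:
            p = CPT[var](event)
            if var in evidence:
                w *= p if evidence[var] else 1 - p   # weight by the evidence likelihood
            else:
                event[var] = random.random() < p
        return event, w

    def likelihood_weighting(query, evidence, n):
        W = defaultdict(float)
        for _ in range(n):
            event, w = weighted_sample(evidence)
            W[event[query]] += w
        alpha = 1 / sum(W.values())
        return {v: alpha * w for v, w in W.items()}

    print(likelihood_weighting('Rain', {'Sprinkler': True, 'WetGrass': True}, 100_000))
    # ≈ {True: 0.32, False: 0.68}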
Example

Weighted sampling on the sprinkler network with evidence
Sprinkler = true, WetGrass = true:

  start with w = 1.0
  sample Cloudy from P(C) = ⟨0.5, 0.5⟩, say true
  Sprinkler is an evidence variable: w ← 1.0 × P(s | c) = 1.0 × 0.1
  sample Rain from P(R | c) = ⟨0.8, 0.2⟩, say true
  WetGrass is an evidence variable: w ← w × P(w | s, r)
                                      = 1.0 × 0.1 × 0.99 = 0.099
Likelihood weighting contd.

Sampling probability for Weighted-Sample is
    S_WS(z, e) = Π_{i=1}^l P(z_i | parents(Z_i))
Note: pays attention to evidence in ancestors only
    ⇒ somewhere "in between" prior and posterior distribution

Weight for a given sample z, e is
    w(z, e) = Π_{i=1}^m P(e_i | parents(E_i))

Weighted sampling probability is
    S_WS(z, e) w(z, e)
      = Π_{i=1}^l P(z_i | parents(Z_i)) Π_{i=1}^m P(e_i | parents(E_i))
      = P(z, e)       (by standard global semantics of the network)

Hence likelihood weighting returns consistent estimates
but performance still degrades with many evidence variables
because a few samples have nearly all the total weight
Inference by Markov chain Monte Carlo (MCMC)

"State" of the network = current assignment to all variables
⇒ generate the next state by making random changes to the current state

Generate the next state by sampling one variable given its Markov blanket
    recall Markov blanket: parents, children, and children's parents

Sample each variable in turn, keeping the evidence fixed

The specific transition probability with which the stochastic process
moves from one state to another is defined by the conditional distribution
given the Markov blanket of the variable being sampled
MCMC Gibbs sampling

function MCMC-Gibbs-Ask(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts for each value of X, initially zero
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
      for each Z_i in Z do          /* can choose at random */
          set the value of Z_i in x by sampling from P(Z_i | mb(Z_i))
                                    /* mb = Markov blanket */
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)
The Markov chain

With Sprinkler = true, WetGrass = true, there are four states

(Figure: the four states of (Cloudy, Rain), with Sprinkler and WetGrass
fixed to true, connected by the Gibbs transitions)

Wander about for a while
Example

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat.
Count the number of times Rain is true and false in the samples.

E.g., visit 100 states
    31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true)
    = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: the chain approaches a stationary distribution: the long-run
fraction of time spent in each state is exactly proportional to its
posterior probability
Markov blanket sampling

The Markov blanket of Cloudy is Sprinkler and Rain
The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

The probability given the Markov blanket is calculated as follows:
    P(x′_i | mb(X_i)) = P(x′_i | parents(X_i)) Π_{Z_j ∈ Children(X_i)} P(z_j | parents(Z_j))

Easily implemented in message-passing parallel systems, brains

Main computational problems:
  1) Difficult to tell if convergence has been achieved
  2) Can be wasteful if the Markov blanket is large:
     P(X_i | mb(X_i)) won't change much (law of large numbers)
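Gibbs sampling needs exactly the Markov-blanket probability above. The sketch below (reusing CPT and ORDER from the prior-sampling sketch, with a hand-written children table; all names are our own) implements it for the sprinkler network and converges to roughly ⟨0.32, 0.68⟩ for P(Rain | Sprinkler = true, WetGrass = true).

    # Sketch of Gibbs sampling using the Markov-blanket formula above
    import random

    CHILDREN = {'Cloudy': ['Sprinkler', 'Rain'], 'Sprinkler': ['WetGrass'],
                'Rain': ['WetGrass'], 'WetGrass': []}

    def cond_prob(var, value, event):
        p_true = CPT[var](event)
        return p_true if value else 1 - p_true

    def sample_given_mb(var, event):
        weights = {}
        for value in (True, False):
            e = {**event, var: value}
            w = cond_prob(var, value, e)                 # P(x'_i | parents)
            for child in CHILDREN[var]:                  # × Π P(child | parents)
                w *= cond_prob(child, e[child], e)
            weights[value] = w
        return random.random() < weights[True] / (weights[True] + weights[False])

    def gibbs_ask(query, evidence, n):
        nonevidence = [v for v in ORDER if v not in evidence]
        event = dict(evidence)
        for var in nonevidence:                          # random initial state
            event[var] = random.random() < 0.5
        counts = {True: 0, False: 0}
        for _ in range(n):
            for var in nonevidence:
                event[var] = sample_given_mb(var, event)
            counts[event[query]] += 1
        return {v: counts[v] / n for v in (True, False)}

    print(gibbs_ask('Rain', {'Sprinkler': True, 'WetGrass': True}, 50_000))
    # ≈ {True: 0.32, False: 0.68}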
Approximate inference

Exact inference by variable elimination:
  – polytime on polytrees, NP-hard on general graphs
  – space = time, very sensitive to topology

Approximate inference by LW (likelihood weighting), MCMC (Markov chain
Monte Carlo):
  – LW does poorly when there is lots of (downstream) evidence
  – LW, MCMC generally insensitive to topology
  – Convergence can be very slow with probabilities close to 1 or 0
  – Can handle arbitrary combinations of discrete and continuous variables
Dynamic Bayesian networks

DBNs are Bayesian networks that represent temporal probability models

Basic idea: copy state and evidence variables for each time step

X_t = set of unobservable state variables at time t
    e.g., BloodSugar_t, StomachContents_t, etc.

E_t = set of observable evidence variables at time t
    e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t

This assumes discrete time; step size depends on the problem

Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b−1}, X_b

X_t, E_t contain arbitrarily many variables in a replicated Bayes net
Hidden Markov models (HMMs)

Every HMM is a single-variable DBN; every discrete DBN is an HMM
(combine all the state variables in the DBN into a single one)

(Figure: a DBN slice with state variables X_t, Y_t, Z_t linked to their
copies at t+1)

Sparse dependencies ⇒ exponentially fewer parameters;
e.g., with 20 Boolean state variables, three parents each,
the DBN has 20 × 2^3 = 160 parameters, the HMM has 2^20 × 2^20 ≈ 10^12
Markov processes (Markov chains)

Construct a Bayes net from these variables: parents?

Markov assumption: X_t depends on a bounded subset of X_{0:t−1}

First-order Markov process:  P(X_t | X_{0:t−1}) = P(X_t | X_{t−1})
Second-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−2}, X_{t−1})

(Figure: first-order chain X_{t−2} → X_{t−1} → X_t → X_{t+1} → X_{t+2};
second-order chain with additional links skipping one step)

Sensor Markov assumption: P(E_t | X_{0:t}, E_{0:t−1}) = P(E_t | X_t)

Stationary process: transition model P(X_t | X_{t−1}) and
sensor model P(E_t | X_t) fixed for all t
Example

(Figure: the umbrella DBN: Rain_{t−1} → Rain_t → Rain_{t+1}, with
Rain_t → Umbrella_t at each step)

    R_{t−1}  P(R_t)          R_t  P(U_t)
       t       0.7             t    0.9
       f       0.3             f    0.2

First-order Markov assumption not exactly true in the real world!

Possible fixes:
  1. Increase the order of the Markov process
  2. Augment the state, e.g., add Temp_t, Pressure_t
HMMs

X_t is a single, discrete variable (usually E_t is too)
Domain of X_t is {1, ..., S}

Transition matrix T_ij = P(X_t = j | X_{t−1} = i), e.g.,
    T = ( 0.7  0.3 )
        ( 0.3  0.7 )

Sensor matrix O_t for each time step, diagonal elements P(e_t | X_t = i),
e.g., with U_1 = true,
    O_1 = ( 0.9   0  )
          (  0   0.2 )

Forward and backward messages as column vectors:
    f_{1:t+1} = α O_{t+1} T^⊤ f_{1:t}
    b_{k+1:t} = T O_{k+1} b_{k+2:t}

The forward-backward algorithm needs time O(S^2 t) and space O(S t)
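The matrix-form forward update is easy to check numerically. The NumPy sketch below (our own encoding of the umbrella model's T and O matrices) runs two filtering steps with the umbrella observed on both days and gives P(Rain_2 | u_1, u_2) ≈ ⟨0.883, 0.117⟩.

    # Sketch: the umbrella HMM in matrix form, applying f_{1:t+1} = alpha O T^T f_{1:t}
    import numpy as np

    T = np.array([[0.7, 0.3],         # T_ij = P(X_t = j | X_{t-1} = i); state 0 = rain
                  [0.3, 0.7]])
    O_umbrella = np.diag([0.9, 0.2])  # P(umbrella | X_t = i) on the diagonal

    f = np.array([0.5, 0.5])          # prior P(X_0)
    for _ in range(2):                # umbrella observed on days 1 and 2
        f = O_umbrella @ T.T @ f
        f = f / f.sum()               # normalize (the alpha)
    print(f)                          # ≈ [0.883, 0.117]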
Inference tasks in HMMs

Filtering: P(X_t | e_{1:t})
    belief state—input to the decision process of a rational agent

Prediction: P(X_{t+k} | e_{1:t}) for k > 0
    evaluation of possible action sequences; like filtering without the
    evidence

Smoothing: P(X_k | e_{1:t}) for 0 ≤ k < t
    better estimate of past states, essential for learning

Most likely explanation: argmax_{x_{1:t}} P(x_{1:t} | e_{1:t})
    speech recognition, decoding with a noisy channel
Filtering

Aim: devise a recursive state estimation algorithm
    P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t}))

P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
                       = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
                       = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})

I.e., prediction + estimation. Prediction by summing out X_t:
    P(X_{t+1} | e_{1:t+1})
      = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
      = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})

    f_{1:t+1} = Forward(f_{1:t}, e_{t+1}) where f_{1:t} = P(X_t | e_{1:t})

Time and space are constant (independent of t)
Inference in DBNs

Naive method: unroll the network and run any exact algorithm

(Figure: the umbrella DBN unrolled for seven time steps, with the same
transition and sensor CPTs repeated in every slice)

Problem: the inference cost for each update grows with t

Rollup filtering: add slice t+1, "sum out" slice t using variable
elimination

Largest factor is O(d^{n+1}), update cost O(d^{n+2})
(cf. HMM update cost O(d^{2n}))

Approximate inference by MCMC (Markov chain Monte Carlo) etc.
Probabilistic logic

Bayesian networks are essentially propositional:
  – the set of random variables is fixed and finite
  – each variable has a fixed domain of possible values

Probabilistic reasoning can be formalized as probabilistic logic

First-order probabilistic logic combines probability theory with the
expressive power of first-order logic
First-order probabilistic logic

Recall: propositional probabilistic logic
  – Proposition = disjunction of the atomic events in which it is true
  – Possible world (sample point) ω = propositional logic model
    (an assignment of values to all of the r.v.s under consideration)
  – ω ⊨ φ: for any proposition φ, the ω where it is true
  – probability model: a set Ω of possible worlds with a probability P(ω)
    for each world ω