Uncertain Knowledge and Reasoning
AI Slides (6e), Lin Zuoquan@PKU, 1998-2020

9 Uncertain Knowledge and Reasoning
9.1 Uncertainty
9.2 Probability: syntax and semantics; inference; independence; Bayes' rule
9.3 Bayesian networks


  1. Inference
Probabilistic inference is the computation of posterior probabilities for query propositions given observed evidence; the full joint distribution can be viewed as the KB from which answers to all questions may be derived.
Start with the joint distribution:

                toothache            ¬toothache
                catch    ¬catch      catch    ¬catch
    cavity      .108     .012        .072     .008
    ¬cavity     .016     .064        .144     .576

For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω: ω ⊨ φ} P(ω)

  2. Inference by enumeration
Start with the joint distribution (table above). For any proposition φ, sum the atomic events where it is true: P(φ) = Σ_{ω: ω ⊨ φ} P(ω)
E.g., P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

  3. Inference by enumeration
E.g., P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

  4. Inference by enumeration
Can also compute conditional probabilities:
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                       = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

  5. Normalization
The denominator can be viewed as a normalization constant α:
P(Cavity | toothache) = α P(Cavity, toothache)
  = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
  = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩] = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
Idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables

  6. Inference by enumeration contd.
Let X be all the variables. Ask for the posterior joint distribution of the query variables Y given specific values e for the evidence variables E.
Let the hidden variables be H = X − Y − E ⇒ the required summation of joint entries is done by summing out the hidden variables:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables.
Problems:
1) Worst-case time complexity O(d^n), where d is the largest arity
2) Space complexity O(d^n) to store the joint distribution
3) How to find the numbers for O(d^n) entries?
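As a concrete illustration of enumeration over a full joint distribution, here is a minimal Python sketch using the dentist table above; the dictionary layout and the prob helper are my own illustration, not part of the slides.

    # Full joint distribution over (Cavity, Toothache, Catch), from the table above
    joint = {
        (True,  True,  True):  0.108, (True,  True,  False): 0.012,
        (True,  False, True):  0.072, (True,  False, False): 0.008,
        (False, True,  True):  0.016, (False, True,  False): 0.064,
        (False, False, True):  0.144, (False, False, False): 0.576,
    }

    def prob(event):
        # P(phi): sum the atomic events consistent with the partial assignment;
        # event maps a variable index (0=Cavity, 1=Toothache, 2=Catch) to a value
        return sum(p for world, p in joint.items()
                   if all(world[i] == v for i, v in event.items()))

    print(prob({1: True}))                              # P(toothache) = 0.2
    print(prob({0: False, 1: True}) / prob({1: True}))  # P(~cavity | toothache) = 0.4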

  7. Independence
A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B)
The model over Toothache, Catch, Cavity, Weather decomposes into two independent pieces:
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
32 entries reduced to 12; for n independent biased coins, 2^n → n
Absolute independence is powerful but rare
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

  8. Conditional independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

  9. Conditional independence
Write out the full joint distribution using the chain rule:
P(Toothache, Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n
Conditional independence is our most basic and robust form of knowledge about uncertainty

  10. Bayes' rule
Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)
⇒ Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
or in distribution form: P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
Useful for assessing diagnostic probability from causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008

  11. Naive Bayes
Bayes' rule and conditional independence:
P(Cavity | toothache ∧ catch)
  = α P(toothache ∧ catch | Cavity) P(Cavity)
  = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
This is an example of a naive Bayes model (Bayesian classifier):
P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_i P(Effect_i | Cause)
(Figure: Cause with children Effect_1, ..., Effect_n; here Cavity with children Toothache and Catch)
The total number of parameters is linear in n
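The same computation can be written as a tiny naive Bayes classifier; the sketch below is an illustration with my own helper name, and the conditional probabilities are ones I read off the dentist joint distribution, not numbers given on this slide.

    def naive_bayes_posterior(prior, likelihoods, observed):
        # prior: {cause_value: P(cause_value)}
        # likelihoods: {effect_name: {cause_value: P(effect = true | cause_value)}}
        # observed: {effect_name: True/False}
        scores = {}
        for c, pc in prior.items():
            score = pc
            for e, val in observed.items():
                p_true = likelihoods[e][c]
                score *= p_true if val else 1.0 - p_true
            scores[c] = score
        z = sum(scores.values())                      # the normalization constant alpha
        return {c: s / z for c, s in scores.items()}

    # Numbers derived from the dentist joint distribution (e.g., P(cavity) = 0.2)
    prior = {True: 0.2, False: 0.8}
    likelihoods = {"toothache": {True: 0.6, False: 0.1},
                   "catch":     {True: 0.9, False: 0.2}}
    # P(Cavity | toothache, catch) ≈ {True: 0.871, False: 0.129}
    print(naive_bayes_posterior(prior, likelihoods, {"toothache": True, "catch": True}))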

  12. Example: Wumpus World
(Figure: 4×4 wumpus grid; squares [1,1], [2,1], [1,2] are known to be OK, with breezes observed in [1,2] and [2,1])
P_ij = true iff [i, j] contains a pit
B_ij = true iff [i, j] is breezy
Include only B_1,1, B_1,2, B_2,1 in the probability model

  13. Specifying the probability model
The full joint distribution is P(P_1,1, ..., P_4,4, B_1,1, B_1,2, B_2,1)
Apply the product rule: P(B_1,1, B_1,2, B_2,1 | P_1,1, ..., P_4,4) P(P_1,1, ..., P_4,4)
(Do it this way to get P(Effect | Cause))
First term: 1 if the breezes are adjacent to the pits, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
P(P_1,1, ..., P_4,4) = Π_{i,j=1,1}^{4,4} P(P_i,j) = 0.2^n × 0.8^{16−n} for n pits

  14. Observations and query
We know the following facts:
b = ¬b_1,1 ∧ b_1,2 ∧ b_2,1
known = ¬p_1,1 ∧ ¬p_1,2 ∧ ¬p_2,1
The query is P(P_1,3 | known, b)
Define Unknown = the P_ij's other than P_1,3 and Known
For inference by enumeration, we have
P(P_1,3 | known, b) = α Σ_unknown P(P_1,3, unknown, known, b)
This grows exponentially with the number of squares

  15. Using conditional independence
Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares
(Figure: the grid partitioned into KNOWN, FRINGE, OTHER, and the QUERY square [1,3])
Define Unknown = Fringe ∪ Other
P(b | P_1,3, Known, Unknown) = P(b | P_1,3, Known, Fringe)
Manipulate the query into a form where we can use this

  16. Using conditional independence
P(P_1,3 | known, b)
  = α Σ_unknown P(P_1,3, unknown, known, b)
  = α Σ_unknown P(b | P_1,3, known, unknown) P(P_1,3, known, unknown)
  = α Σ_fringe Σ_other P(b | known, P_1,3, fringe, other) P(P_1,3, known, fringe, other)
  = α Σ_fringe Σ_other P(b | known, P_1,3, fringe) P(P_1,3, known, fringe, other)
  = α Σ_fringe P(b | known, P_1,3, fringe) Σ_other P(P_1,3, known, fringe, other)
  = α Σ_fringe P(b | known, P_1,3, fringe) Σ_other P(P_1,3) P(known) P(fringe) P(other)
  = α P(known) P(P_1,3) Σ_fringe P(b | known, P_1,3, fringe) P(fringe) Σ_other P(other)
  = α′ P(P_1,3) Σ_fringe P(b | known, P_1,3, fringe) P(fringe)

  17. Using conditional independence
(Figure: the fringe configurations consistent with the breeze observations: three with P_1,3 = true, with probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16, and two with P_1,3 = false, with probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16)
P(P_1,3 | known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
P(P_2,2 | known, b) ≈ ⟨0.86, 0.14⟩
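The fringe sum can be checked with a few lines of Python; the square names and helper functions below are my own encoding of the figure (fringe squares [2,2] and [3,1], query square [1,3]), not part of the slides.

    from itertools import product

    P_PIT = 0.2  # prior probability of a pit in any unvisited square

    def consistent(p13, p22, p31):
        # Breeze at [1,2] needs a pit in [1,3] or [2,2];
        # breeze at [2,1] needs a pit in [2,2] or [3,1].
        return (p13 or p22) and (p22 or p31)

    def prior(*pits):
        p = 1.0
        for pit in pits:
            p *= P_PIT if pit else 1.0 - P_PIT
        return p

    score = {}
    for p13 in (True, False):
        score[p13] = prior(p13) * sum(
            prior(p22, p31)
            for p22, p31 in product((True, False), repeat=2)
            if consistent(p13, p22, p31))

    z = sum(score.values())
    print({k: round(v / z, 2) for k, v in score.items()})  # {True: 0.31, False: 0.69}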

  18. Bayesian networks
BNs: a graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions; alias Probabilistic Graphical Models (PGMs)
Syntax:
  a set of nodes, one per variable
  a directed acyclic graph (DAG; a link means "directly influences")
  a conditional distribution for each node given its parents: P(X_i | Parents(X_i))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values

  19. Example
The topology of the network encodes conditional independence assertions
(Figure: Weather as an isolated node; Cavity with children Toothache and Catch)
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity

  20. Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call

  21. Example
(Figure: the burglary network, Burglary → Alarm ← Earthquake, Alarm → JohnCalls, Alarm → MaryCalls, with CPTs:)

    P(B) = .001        P(E) = .002

    B  E  P(A|B,E)          A  P(J|A)        A  P(M|A)
    T  T  .95               T  .90           T  .70
    T  F  .94               F  .05           F  .01
    F  T  .29
    F  F  .001

  22. Compactness
A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values
Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p)
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
i.e., grows linearly with n, vs. O(2^n) for the full joint distribution
For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
In certain cases (under conditional independence assumptions), BNs reduce O(2^n) to O(kn) (NP ⇒ P!)

  23. Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
P(x_1, ..., x_n) = Π_{i=1}^{n} P(x_i | parents(X_i))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ?

  24. Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
P(x_1, ..., x_n) = Π_{i=1}^{n} P(x_i | parents(X_i))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j|a) P(m|a) P(a|¬b, ¬e) P(¬b) P(¬e)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
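A minimal sketch of this product computed directly from the CPTs of the burglary network on slide 21; the dictionary encoding is an assumption for illustration.

    # CPTs of the burglary network (slide 21); P_A, P_J, P_M give P(var = true | parents)
    P_B = {True: 0.001, False: 0.999}
    P_E = {True: 0.002, False: 0.998}
    P_A = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}
    P_M = {True: 0.70, False: 0.01}

    def bernoulli(p_true, value):
        return p_true if value else 1.0 - p_true

    def joint(b, e, a, j, m):
        # P(b, e, a, j, m) = product of the local conditional probabilities
        return (P_B[b] * P_E[e] * bernoulli(P_A[(b, e)], a) *
                bernoulli(P_J[a], j) * bernoulli(P_M[a], m))

    # P(j, m, a, ~b, ~e) ≈ 0.00063
    print(joint(b=False, e=False, a=True, j=True, m=True))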

  25. Local semantics
Local semantics: each node is conditionally independent of its nondescendants (the Z_ij) given its parents (the U_i)
(Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and nondescendants Z_1j, ..., Z_nj)
Theorem: local semantics ⇔ global semantics

  26. Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents
(Figure: the Markov blanket of X: its parents U_1, ..., U_m, its children Y_1, ..., Y_n, and the children's other parents Z_1j, ..., Z_nj)

  27. Constructing Bayesian networks
Algorithm: a series of locally testable assertions of conditional independence guarantees the required global semantics
1. Choose an ordering of variables X_1, ..., X_n
2. For i = 1 to n:
     add X_i to the network
     select parents from X_1, ..., X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})
This choice of parents guarantees the global semantics:
P(X_1, ..., X_n) = Π_{i=1}^{n} P(X_i | X_1, ..., X_{i−1})   (chain rule)
                 = Π_{i=1}^{n} P(X_i | Parents(X_i))        (by construction)
Each node is conditionally independent of its other predecessors in the node (partial) ordering, given its parents

  28. Example
Suppose we choose the ordering M, J, A, B, E
(Network so far: MaryCalls, JohnCalls)
P(J|M) = P(J)?

  29. Example
Suppose we choose the ordering M, J, A, B, E
(Network so far: MaryCalls, JohnCalls, Alarm)
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)?

  30. Example
Suppose we choose the ordering M, J, A, B, E
(Network so far: MaryCalls, JohnCalls, Alarm, Burglary)
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? P(B|A, J, M) = P(B)?

  31. Example
Suppose we choose the ordering M, J, A, B, E
(Network so far: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake)
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes
P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? P(E|B, A, J, M) = P(E|A, B)?

  32. Example
Suppose we choose the ordering M, J, A, B, E
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes
P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? No
P(E|B, A, J, M) = P(E|A, B)? Yes

  33. Example
Assessing conditional probabilities is hard in noncausal directions
The network can be far more compact than the full joint distribution
But this network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers (due to the ordering of the variables)

  34. Probabilistic reasoning ∗
• Exact inference by enumeration
• Exact inference by variable elimination
• Approximate inference by stochastic simulation
• Approximate inference by Markov chain Monte Carlo

  35. Reasoning tasks in BNs (PGMs)
Simple queries: compute the posterior marginal P(X_i | E = e)
  e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)
Optimal decisions: decision networks include utility information; probabilistic inference required for P(outcome | action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation/causal inference: why do I need a new starter motor?

  36. Inference by enumeration
A slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation
Simple query on the burglary network:
P(B | j, m) = P(B, j, m) / P(j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)
Rewrite full joint entries using products of CPT entries:
P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a|B, e) P(j|a) P(m|a)
            = α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)
Recursive depth-first enumeration: O(n) space, O(d^n) time

  37. Enumeration algorithm

  function Enumeration-Ask(X, e, bn) returns a distribution over X
    inputs: X, the query variable
            e, observed values for variables E
            bn, a belief network with variables {X} ∪ E ∪ Y
    Q(X) ← a distribution over X, initially empty
    for each value x_i of X do
      Q(x_i) ← Enumerate-All(bn.Vars, e_{x_i})
        where e_{x_i} is e extended with X = x_i
    return Normalize(Q(X))

  function Enumerate-All(vars, e) returns a real number
    if Empty?(vars) then return 1.0
    Y ← First(vars)
    if Y has value y in e
      then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
      else return Σ_y P(y | parents(Y)) × Enumerate-All(Rest(vars), e_y)
        where e_y is e extended with Y = y
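A runnable Python sketch of Enumeration-Ask, specialized to Boolean variables and hard-wired to the burglary network of slide 21; the encoding of the network is mine, not part of the slides.

    # Each variable: (parents, CPT giving P(var = True | parent values))
    NET = {
        "B": ((), {(): 0.001}),
        "E": ((), {(): 0.002}),
        "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                           (False, True): 0.29, (False, False): 0.001}),
        "J": (("A",), {(True,): 0.90, (False,): 0.05}),
        "M": (("A",), {(True,): 0.70, (False,): 0.01}),
    }
    ORDER = ["B", "E", "A", "J", "M"]   # a topological order of bn.Vars

    def p(var, value, e):
        parents, cpt = NET[var]
        p_true = cpt[tuple(e[pa] for pa in parents)]
        return p_true if value else 1.0 - p_true

    def enumerate_all(vars_, e):
        if not vars_:
            return 1.0
        y, rest = vars_[0], vars_[1:]
        if y in e:                                   # Y has a value in e
            return p(y, e[y], e) * enumerate_all(rest, e)
        return sum(p(y, v, e) * enumerate_all(rest, {**e, y: v})
                   for v in (True, False))

    def enumeration_ask(x, e):
        q = {v: enumerate_all(ORDER, {**e, x: v}) for v in (True, False)}
        z = sum(q.values())
        return {v: q[v] / z for v in q}

    # P(Burglary | JohnCalls = true, MaryCalls = true) ≈ {True: 0.284, False: 0.716}
    print(enumeration_ask("B", {"J": True, "M": True}))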

  38. Evaluation tree
Summing at the "+" nodes
(Figure: the evaluation tree for the query, with branches P(b) = .001, then P(e) = .002 / P(¬e) = .998, then P(a|b,e) = .95 / P(¬a|b,e) = .05 / P(a|b,¬e) = .94 / P(¬a|b,¬e) = .06, and leaves P(j|a) = .90, P(j|¬a) = .05, P(m|a) = .70, P(m|¬a) = .01)
Enumeration is inefficient: repeated computation
  e.g., it computes P(j|a) P(m|a) for each value of e
  improved by eliminating repeated variables

  39. Inference by variable elimination
Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation
P(B | j, m)
  = α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)      (the five factors correspond to B, E, A, J, M)
  = α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a P(a|B, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a f_A(a, b, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) f_ĀJM(b, e)                      (sum out A)
  = α P(B) f_ĒĀJM(b)                                 (sum out E)
  = α f_B(b) × f_ĒĀJM(b)

  40. Variable elimination: basic operations
Summing out a variable from a product of factors:
  move any constant factors outside the summation
  add up submatrices in the pointwise product of the remaining factors
Σ_x f_1 × ··· × f_k = f_1 × ··· × f_i Σ_x f_{i+1} × ··· × f_k = f_1 × ··· × f_i × f_X̄
  assuming f_1, ..., f_i do not depend on X
Pointwise product of factors f_1 and f_2:
f_1(x_1, ..., x_j, y_1, ..., y_k) × f_2(y_1, ..., y_k, z_1, ..., z_l) = f(x_1, ..., x_j, y_1, ..., y_k, z_1, ..., z_l)
e.g., f_1(a, b) × f_2(b, c) = f(a, b, c)
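The two operations can be prototyped in a few lines; the (variables, table) factor representation below, keyed by tuples of Boolean values, is my own simplification, not the slides' data structure.

    from itertools import product

    # A factor is (vars, table): vars is a tuple of names; table maps a tuple
    # of Boolean values (one per var, in order) to a number.

    def pointwise_product(f1, f2):
        v1, t1 = f1
        v2, t2 = f2
        out_vars = v1 + tuple(v for v in v2 if v not in v1)
        table = {}
        for vals in product((True, False), repeat=len(out_vars)):
            assign = dict(zip(out_vars, vals))
            table[vals] = (t1[tuple(assign[v] for v in v1)] *
                           t2[tuple(assign[v] for v in v2)])
        return out_vars, table

    def sum_out(var, factor):
        vars_, t = factor
        i = vars_.index(var)
        out_vars = vars_[:i] + vars_[i + 1:]
        table = {}
        for vals, p in t.items():
            key = vals[:i] + vals[i + 1:]
            table[key] = table.get(key, 0.0) + p
        return out_vars, table

    # e.g., f1(A, B) × f2(B, C) = f(A, B, C), then sum out B
    f1 = (("A", "B"), {(True, True): 0.3, (True, False): 0.7,
                       (False, True): 0.9, (False, False): 0.1})
    f2 = (("B", "C"), {(True, True): 0.2, (True, False): 0.8,
                       (False, True): 0.6, (False, False): 0.4})
    print(sum_out("B", pointwise_product(f1, f2)))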

  41. Variable elimination algorithm

  function Elimination-Ask(X, e, bn) returns a distribution over X
    inputs: X, the query variable
            e, observed values for variables E
            bn, a belief network specifying joint distribution P(X_1, ..., X_n)
    factors ← []
    for each var in Order(bn.Vars) do
      factors ← [Make-Factor(var, e) | factors]
      if var is a hidden variable then factors ← Sum-Out(var, factors)
    return Normalize(Pointwise-Product(factors))

  42. Irrelevant variables
Consider the query P(JohnCalls | Burglary = true)
P(J | b) = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a) Σ_m P(m|a)
The sum over m is identically 1; M is irrelevant to the query
Theorem: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)
Here X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant
(Compare this to backward chaining from the query in Horn clause KBs)

  43. Irrelevant variables
Defn: the moral graph of a BN: marry all parents and drop arrows
Defn: A is m-separated from B by C iff A and B are separated by C in the moral graph
Theorem: Y is irrelevant if it is m-separated from X by E
For P(JohnCalls | Alarm = true), both Burglary and Earthquake are irrelevant

  44. Complexity of exact inference
Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of variable elimination are O(d^k n)
Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete
(Figure: a multiply connected BN encoding the 3-CNF formula with clauses 1. A∨B∨C, 2. C∨D∨¬A, 3. B∨C∨¬D, root priors 0.5, and an AND node over the clause nodes)

  45. Inference by stochastic simulation
Idea:
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P
Methods:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Likelihood weighting: use evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior

  46. Sampling from an empty network
Direct sampling from a network that has no evidence associated with it (sampling each variable in turn, in topological order)

  function Prior-Sample(bn) returns an event sampled from P(X_1, ..., X_n) specified by bn
    inputs: bn, a BN specifying joint distribution P(X_1, ..., X_n)
    x ← an event with n elements
    for each variable X_i in X_1, ..., X_n do
      x_i ← a random sample from P(X_i | Parents(X_i))
    return x
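A runnable sketch of Prior-Sample on the sprinkler network of the next slide; the dictionary encoding of the network and its CPTs is my own.

    import random

    # Sprinkler network (next slide): Cloudy -> {Sprinkler, Rain} -> WetGrass
    # CPTs give P(var = True | parent values)
    NET = {
        "Cloudy":    ((), {(): 0.50}),
        "Sprinkler": (("Cloudy",), {(True,): 0.10, (False,): 0.50}),
        "Rain":      (("Cloudy",), {(True,): 0.80, (False,): 0.20}),
        "WetGrass":  (("Sprinkler", "Rain"), {(True, True): 0.99, (True, False): 0.90,
                                              (False, True): 0.90, (False, False): 0.01}),
    }
    ORDER = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]   # topological order

    def prior_sample():
        event = {}
        for var in ORDER:
            parents, cpt = NET[var]
            p_true = cpt[tuple(event[p] for p in parents)]
            event[var] = random.random() < p_true
        return event

    print(prior_sample())   # e.g., {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}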

  47. Example
The sprinkler network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass

    P(C) = .50

    C  P(S|C)       C  P(R|C)       S  R  P(W|S,R)
    T  .10          T  .80          T  T  .99
    F  .50          F  .20          T  F  .90
                                    F  T  .90
                                    F  F  .01

  48–53. Example contd.
(Slides 48–53 repeat the same network and CPTs while Prior-Sample is stepped through, sampling Cloudy, Sprinkler, Rain, and WetGrass in turn; the event generated in the running example is [true, false, true, true], used on the next slide)

  54. Sampling from an empty network contd.
Probability that Prior-Sample generates a particular event:
S_PS(x_1 ... x_n) = Π_{i=1}^{n} P(x_i | parents(X_i)) = P(x_1 ... x_n)
i.e., the true prior probability
E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
Let N_PS(x_1 ... x_n) be the number of samples generated for the event x_1, ..., x_n
Then we have
  lim_{N→∞} P̂(x_1, ..., x_n) = lim_{N→∞} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1 ... x_n)
That is, estimates derived from Prior-Sample are consistent
Shorthand: P̂(x_1, ..., x_n) ≈ P(x_1 ... x_n)

  55. Rejection sampling
P̂(X|e) is estimated from the samples agreeing with e

  function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
    inputs: X, the query variable
            e, observed values for variables E
            bn, a BN
            N, the total number of samples to be generated
    local variables: N, a vector of counts for each value of X, initially zero
    for j = 1 to N do
      x ← Prior-Sample(bn)
      if x is consistent with e then      /* reject samples that do not match the evidence */
        N[x] ← N[x] + 1 where x is the value of X in x
    return Normalize(N[X])
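Building on the NET, ORDER, and prior_sample definitions from the Prior-Sample sketch above, a minimal rejection-sampling estimate might look like this (an illustrative sketch, not the slides' code):

    def rejection_sampling(x, evidence, n):
        counts = {True: 0, False: 0}
        for _ in range(n):
            sample = prior_sample()
            # keep only samples consistent with the evidence; reject the rest
            if all(sample[v] == val for v, val in evidence.items()):
                counts[sample[x]] += 1
        total = sum(counts.values()) or 1
        return {v: c / total for v, c in counts.items()}

    # Estimate P(Rain | Sprinkler = true); the exact answer is <0.3, 0.7>
    print(rejection_sampling("Rain", {"Sprinkler": True}, 100_000))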

  56. Example
Estimate P(Rain | Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false
P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure

  57. Rejection sampling contd.
P̂(X|e) = α N_PS(X, e)              (algorithm defn.)
       = N_PS(X, e) / N_PS(e)       (normalized by N_PS(e))
       ≈ P(X, e) / P(e)             (property of Prior-Sample)
       = P(X|e)                     (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates
Problem: hopelessly expensive if P(e) is small
P(e) drops off exponentially with the number of evidence variables

  58. Likelihood weighting
Idea:
– fix the evidence variables
– sample only the nonevidence variables
– weight each sample by the likelihood it accords the evidence

  59. Likelihood weighting

  function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X|e)
    inputs: X, the query variable
            e, observed values for variables E
            bn, a BN
            N, the total number of samples to be generated
    local variables: W, a vector of weighted counts for each value of X, initially zero
    for j = 1 to N do
      x, w ← Weighted-Sample(bn, e)
      W[x] ← W[x] + w where x is the value of X in x
    return Normalize(W[X])

  function Weighted-Sample(bn, e) returns an event and a weight
    x ← an event with n elements from e; w ← 1
    for each variable X_i in X_1, ..., X_n do
      if X_i is an evidence variable with value x_i in e
        then w ← w × P(X_i = x_i | Parents(X_i))
        else x[i] ← a random sample from P(X_i | Parents(X_i))
    return x, w
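A sketch of Weighted-Sample and Likelihood-Weighting for the same sprinkler network, reusing NET and ORDER from the Prior-Sample sketch above (illustrative only, not the slides' code):

    import random

    def weighted_sample(evidence):
        event, w = dict(evidence), 1.0
        for var in ORDER:
            parents, cpt = NET[var]
            p_true = cpt[tuple(event[p] for p in parents)]
            if var in evidence:
                # evidence variable: multiply in its likelihood
                w *= p_true if evidence[var] else 1.0 - p_true
            else:
                event[var] = random.random() < p_true   # sample nonevidence variable
        return event, w

    def likelihood_weighting(x, evidence, n):
        weights = {True: 0.0, False: 0.0}
        for _ in range(n):
            event, w = weighted_sample(evidence)
            weights[event[x]] += w
        z = sum(weights.values())
        return {v: w / z for v, w in weights.items()}

    # Estimate P(Rain | Sprinkler = true, WetGrass = true); exact ≈ {True: 0.32, False: 0.68}
    print(likelihood_weighting("Rain", {"Sprinkler": True, "WetGrass": True}, 100_000))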

  60–66. Example (Weighted-Sample walk-through)
(Slides 60–66 repeat the sprinkler network and its CPTs while one weighted sample is generated with evidence Sprinkler = true and WetGrass = true; the weight evolves as follows)
w = 1.0 initially; Cloudy is sampled from its prior (say, true)
w = 1.0 × 0.1, multiplying in P(Sprinkler = true | Cloudy = true) = 0.1 for the evidence variable Sprinkler; Rain is then sampled (say, true)
w = 1.0 × 0.1 × 0.99 = 0.099, multiplying in P(WetGrass = true | Sprinkler = true, Rain = true) = 0.99 for the evidence variable WetGrass

  67. Likelihood weighting contd.
The sampling probability for Weighted-Sample is
S_WS(z, e) = Π_{i=1}^{l} P(z_i | parents(Z_i))
Note: it pays attention to evidence in ancestors only
  ⇒ somewhere "in between" the prior and the posterior distribution
The weight for a given sample z, e is
w(z, e) = Π_{i=1}^{m} P(e_i | parents(E_i))
The weighted sampling probability is
S_WS(z, e) w(z, e) = Π_{i=1}^{l} P(z_i | parents(Z_i)) Π_{i=1}^{m} P(e_i | parents(E_i)) = P(z, e)
  (by the standard global semantics of the network)
Hence likelihood weighting returns consistent estimates
but performance still degrades with many evidence variables, because a few samples have nearly all the total weight

  68. Inference by Markov chain Monte Carlo
"State" of the network = current assignment to all the variables
  ⇒ generate the next state by making random changes to the current state
Generate the next state by sampling one variable given its Markov blanket
  (recall: the Markov blanket is the parents, children, and children's parents)
Sample each variable in turn, keeping the evidence fixed
The specific transition probability with which the stochastic process moves from one state to another is defined by the conditional distribution given the Markov blanket of the variable being sampled

  69. MCMC Gibbs sampling

  function MCMC-Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
    local variables: N, a vector of counts for each value of X, initially zero
                     Z, the nonevidence variables in bn
                     x, the current state of the network, initially copied from e
    initialize x with random values for the variables in Z
    for j = 1 to N do
      for each Z_i in Z do               /* can also be chosen at random */
        set the value of Z_i in x by sampling from P(Z_i | mb(Z_i))   /* Markov blanket */
        N[x] ← N[x] + 1 where x is the value of X in x
    return Normalize(N)
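A Gibbs-sampling sketch for the same query, again reusing NET from the Prior-Sample sketch above; P(Z_i | mb(Z_i)) is computed from the CPTs as on slide 72, and the helper names are my own.

    import random

    def cond_p(var, value, state):
        parents, cpt = NET[var]
        p_true = cpt[tuple(state[p] for p in parents)]
        return p_true if value else 1.0 - p_true

    def p_given_markov_blanket(var, value, state):
        # P(x'_i | mb(X_i)) ∝ P(x'_i | parents(X_i)) * Π_{children Z_j} P(z_j | parents(Z_j))
        s = dict(state, **{var: value})
        score = cond_p(var, value, s)
        for child, (parents, _) in NET.items():
            if var in parents:
                score *= cond_p(child, s[child], s)
        return score

    def gibbs_ask(x, evidence, n):
        nonevidence = [v for v in NET if v not in evidence]
        state = dict(evidence, **{v: random.choice([True, False]) for v in nonevidence})
        counts = {True: 0, False: 0}
        for _ in range(n):
            for var in nonevidence:
                pt = p_given_markov_blanket(var, True, state)
                pf = p_given_markov_blanket(var, False, state)
                state[var] = random.random() < pt / (pt + pf)
                counts[state[x]] += 1
        total = sum(counts.values())
        return {v: c / total for v, c in counts.items()}

    # Estimate P(Rain | Sprinkler = true, WetGrass = true); exact ≈ {True: 0.32, False: 0.68}
    print(gibbs_ask("Rain", {"Sprinkler": True, "WetGrass": True}, 20_000))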

  70. The Markov chain
With Sprinkler = true, WetGrass = true, there are four states
(Figure: the four states over (Cloudy, Rain), with Sprinkler and WetGrass fixed to true, and the transitions between them)
Wander about for a while

  71. Example
Estimate P(Rain | Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat
Count the number of times Rain is true and false in the samples
E.g., visit 100 states: 31 have Rain = true, 69 have Rain = false
P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Theorem: the chain approaches its stationary distribution, i.e., the long-run fraction of time spent in each state is exactly proportional to its posterior probability

  72. Markov blanket sampling
The Markov blanket of Cloudy is Sprinkler and Rain
The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass
The probability given the Markov blanket is calculated as follows:
P(x′_i | mb(X_i)) = P(x′_i | parents(X_i)) Π_{Z_j ∈ Children(X_i)} P(z_j | parents(Z_j))
Easily implemented in message-passing parallel systems, brains
Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if the Markov blanket is large: P(X_i | mb(X_i)) won't change much (law of large numbers)

  73. Approximate inference
Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology
Approximate inference by LW (likelihood weighting) and MCMC (Markov chain Monte Carlo):
– LW does poorly when there is lots of (downstream) evidence
– LW and MCMC are generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables

  74. Dynamic Bayesian networks ∗
DBNs are BNs that represent temporal probability models
Basic idea: copy the state and evidence variables for each time step
X_t = set of unobservable state variables at time t
  e.g., BloodSugar_t, StomachContents_t, etc.
E_t = set of observable evidence variables at time t
  e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
This assumes discrete time; the step size depends on the problem
Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b−1}, X_b
X_t, E_t can contain arbitrarily many variables in a replicated Bayes net

  75. Hidden Markov models
HMMs: single-(state-)variable DBNs; every discrete DBN is an HMM (combine all the state variables in the DBN into a single one)
(Figure: a DBN slice with state variables X_t, Y_t, Z_t and their successors X_{t+1}, Y_{t+1}, Z_{t+1})
Sparse dependencies ⇒ exponentially fewer parameters
e.g., with 20 state variables and three parents each, the DBN has 20 × 2^3 = 160 parameters, while the HMM has 2^20 × 2^20 ≈ 10^12
(analogous to BNs vs. full tabulated joint distributions)

  76. Markov processes (Markov chains)
Construct a Bayes net from these variables: parents?
Markov assumption: X_t depends on a bounded subset of X_{0:t−1}
First-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−1})
Second-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−2}, X_{t−1})
(Figure: first-order and second-order chains over X_{t−2}, ..., X_{t+2})
Sensor Markov assumption: P(E_t | X_{0:t}, E_{0:t−1}) = P(E_t | X_t)
Stationary process: the transition model P(X_t | X_{t−1}) and the sensor model P(E_t | X_t) are fixed for all t

  77. Example
(Figure: the umbrella DBN: Rain_{t−1} → Rain_t → Rain_{t+1}, with Umbrella_t observed at each step)
Transition model P(R_t | R_{t−1}): P(r_t | r_{t−1}) = 0.7, P(r_t | ¬r_{t−1}) = 0.3
Sensor model P(U_t | R_t): P(u_t | r_t) = 0.9, P(u_t | ¬r_t) = 0.2
The first-order Markov assumption is not exactly true in the real world!
Possible fixes:
1. Increase the order of the Markov process
2. Augment the state, e.g., add Temp_t, Pressure_t

  78. HMMs
X_t is a single, discrete variable (usually E_t is too)
The domain of X_t is {1, ..., S}
Transition matrix T_ij = P(X_t = j | X_{t−1} = i), e.g., T = [[0.7, 0.3], [0.3, 0.7]]
Sensor matrix O_t for each time step, with diagonal elements P(e_t | X_t = i)
  e.g., with U_1 = true, O_1 = diag(0.9, 0.2)
Forward and backward messages as column vectors:
  f_{1:t+1} = α O_{t+1} T^⊤ f_{1:t}
  b_{k+1:t} = T O_{k+1} b_{k+2:t}
The forward-backward algorithm needs O(S^2 t) time and O(St) space
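A small numpy sketch of the forward (filtering) update for the umbrella HMM above, with state order [rain, ¬rain]; it shows only the forward recursion, not the full forward-backward algorithm, and the uniform prior over R_0 is an assumption for illustration.

    import numpy as np

    T = np.array([[0.7, 0.3],            # T[i, j] = P(X_t = j | X_{t-1} = i)
                  [0.3, 0.7]])
    O_umbrella = np.diag([0.9, 0.2])     # sensor matrix for U_t = true
    O_no_umbrella = np.diag([0.1, 0.8])  # sensor matrix for U_t = false

    def forward(f, O):
        # f_{1:t+1} = alpha * O_{t+1} * T^T * f_{1:t}
        f = O @ T.T @ f
        return f / f.sum()

    f = np.array([0.5, 0.5])                  # assumed uniform prior P(R_0)
    for O in (O_umbrella, O_umbrella):        # umbrella observed on days 1 and 2
        f = forward(f, O)
    print(f)   # ≈ [0.883, 0.117]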

  79. Inference tasks in HMMs
Filtering: P(X_t | e_{1:t})
  the belief state, input to the decision process of a rational agent
Prediction: P(X_{t+k} | e_{1:t}) for k > 0
  evaluation of possible action sequences; like filtering without the evidence
Smoothing: P(X_k | e_{1:t}) for 0 ≤ k < t
  better estimate of past states, essential for learning
Most likely explanation: arg max_{x_{1:t}} P(x_{1:t} | e_{1:t})
  speech recognition, decoding with a noisy channel

  80. Filtering
Aim: devise a recursive state estimation algorithm: P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t}))
P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
  = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
  = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})
i.e., prediction + estimation. Prediction by summing out X_t:
P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
  = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})
f_{1:t+1} = Forward(f_{1:t}, e_{t+1}), where f_{1:t} = P(X_t | e_{1:t})
Time and space requirements are constant (independent of t)

  81. Inference in DBNs
Naive method: unroll the network and run any exact algorithm
(Figure: the umbrella DBN unrolled for several time steps; each slice repeats the transition CPT P(r_t | r_{t−1}) = 0.7 / 0.3 and the sensor CPT P(u_t | r_t) = 0.9 / 0.2)
Problem: the inference cost for each update grows with t
Rollup filtering: add slice t+1, "sum out" slice t using variable elimination
The largest factor is O(d^{n+1}), update cost O(d^{n+2}) (cf. HMM update cost O(d^{2n}))
Approximate inference by MCMC (Markov chain Monte Carlo), etc.

  82. Causal Inference
Questions:
– Observations: "What if we see A?" (What is?) P(y | A)
– Actions: "What if we do A?" (What if?) P(y | do(A))
– Counterfactuals: "What if we had done things differently?" (Why?) P(y_{A′} | A)
E.g., recall C(limate)-S(prinkler)-R(ain)-W(etness):
"Would the pavement be wet HAD the sprinkler been ON?" (P(S|C) = 1)
Find whether P(W_{S=1} = 1) = P(W = 1 | do(S = 1))
Counterfactuals can be derived from a model

  83. Graphical representations
• Observations → Bayesian networks
• Actions → Causal Bayesian networks
• Counterfactuals → Functional causal diagrams
Hints:
– The action questions can be reduced to a symbolic calculus
– They can be estimated in polynomial time by a complete algorithm (given the independences in the distribution)
