
Uncertainty
AI Slides (5e), Lin Zuoquan @ PKU, 2003-2019

10 Uncertainty
  10.1 Uncertainty
  10.2 Probability
       Syntax and semantics
       Inference
       Independence
       Bayes' rule
  10.3 Bayesian networks
  10.4 Probabilistic reasoning


1. Inference

Probabilistic inference is the computation of posterior probabilities for query propositions given observed evidence. The full joint distribution can be viewed as the KB from which answers to all questions may be derived.

Start with the joint distribution:

                toothache            ¬toothache
                catch     ¬catch     catch     ¬catch
  cavity        .108      .012       .072      .008
  ¬cavity       .016      .064       .144      .576

For any proposition φ, sum the atomic events where it is true:
  P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

2. Inference by enumeration

Using the joint distribution above:
  P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

3. Inference by enumeration

  P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

4. Inference by enumeration

Can also compute conditional probabilities:
  P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
                         = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

5. Normalization

The denominator can be viewed as a normalization constant α:
  P(Cavity | toothache) = α P(Cavity, toothache)
    = α [ P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch) ]
    = α [ ⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩ ]
    = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩

Idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables
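
These enumeration and normalization steps are easy to check in code. A minimal Python sketch over the joint table above; the dictionary layout and helper names (joint, prob) are illustrative choices of this sketch, not part of the slides.

# Full joint distribution P(Toothache, Catch, Cavity) from the table above,
# keyed by (toothache, catch, cavity) truth values.
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """P(phi): sum the joint entries of the worlds where the predicate holds."""
    return sum(p for world, p in joint.items() if event(*world))

print(prob(lambda t, c, cav: t))           # P(toothache) = 0.2
print(prob(lambda t, c, cav: cav or t))    # P(cavity or toothache) = 0.28

# P(Cavity | toothache) by normalization over the two values of Cavity
unnormalized = [prob(lambda t, c, cav, v=v: t and cav == v) for v in (True, False)]
alpha = 1.0 / sum(unnormalized)
print([alpha * q for q in unnormalized])   # [0.6, 0.4]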

6. Inference by enumeration contd.

Let X be all the variables. Ask for the posterior joint distribution of the query variables Y given specific values e for the evidence variables E.
Let the hidden variables be H = X − Y − E.
The required summation of joint entries is done by summing out the hidden variables:
  P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables.

Problems (for n variables of arity at most d):
  1) Worst-case time complexity O(d^n)
  2) Space complexity O(d^n) to store the joint distribution
  3) How to find the numbers for O(d^n) entries?

7. Independence

A and B are independent iff
  P(A | B) = P(A)   or   P(B | A) = P(B)   or   P(A, B) = P(A) P(B)

E.g., P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)

32 entries reduced to 12; for n independent biased coins, 2^n → n

Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

8. Conditional independence

P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries

If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
  (1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
  (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)

Catch is conditionally independent of Toothache given Cavity:
  P(Catch | Toothache, Cavity) = P(Catch | Cavity)

Equivalent statements:
  P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
  P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
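
Statements (1) and (2) can be verified numerically from the joint table; a short sketch reusing the joint dictionary and prob helper from the sketch under slide 5 (cond is an illustrative helper name):

def cond(event, given):
    """P(event | given), both given as predicates over (toothache, catch, cavity)."""
    return prob(lambda *w: event(*w) and given(*w)) / prob(given)

# (1) P(catch | toothache, cavity) = P(catch | cavity): both come out 0.9
print(cond(lambda t, c, cav: c, lambda t, c, cav: t and cav))
print(cond(lambda t, c, cav: c, lambda t, c, cav: cav))

# (2) P(catch | toothache, ~cavity) = P(catch | ~cavity): both come out 0.2
print(cond(lambda t, c, cav: c, lambda t, c, cav: t and not cav))
print(cond(lambda t, c, cav: c, lambda t, c, cav: not cav))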

9. Conditional independence

Write out the full joint distribution using the chain rule:
  P(Toothache, Catch, Cavity)
    = P(Toothache | Catch, Cavity) P(Catch, Cavity)
    = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
    = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers

In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n

Conditional independence is our most basic and robust form of knowledge about uncertainty

10. Bayes' rule

Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  ⇒ Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)

or in distribution form:
  P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)

Useful for assessing diagnostic probability from causal probability:
  P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

E.g., let M be meningitis, S be stiff neck:
  P(m | s) = P(s | m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
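
In code the meningitis example is a one-line application of Bayes' rule (numbers as on the slide; the variable names are just illustrative):

p_s_given_m, p_m, p_s = 0.8, 0.0001, 0.1
p_m_given_s = p_s_given_m * p_m / p_s   # Bayes' rule
print(p_m_given_s)                      # 0.0008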

11. Bayes' rule and conditional independence

P(Cavity | toothache ∧ catch)
  = α P(toothache ∧ catch | Cavity) P(Cavity)
  = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)

This is an example of a naive Bayes model (Bayesian classifier):
  P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_i P(Effect_i | Cause)

Here the cause is Cavity and the effects are Toothache and Catch.
The total number of parameters is linear in n
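
A generic naive Bayes posterior for a Boolean cause is only a few lines; this sketch uses CPT values consistent with the dentistry joint on slide 1, and the function name and parameter layout are assumptions of the sketch:

def naive_bayes_posterior(prior, likelihoods, observed):
    """P(Cause | observed effects) for a Boolean cause.

    prior: P(cause = true)
    likelihoods: list of (P(effect_i = true | cause), P(effect_i = true | ~cause))
    observed: list of booleans, the observed effect values
    """
    p_true, p_false = prior, 1.0 - prior
    for (l_t, l_f), e in zip(likelihoods, observed):
        p_true  *= l_t if e else (1.0 - l_t)
        p_false *= l_f if e else (1.0 - l_f)
    z = p_true + p_false
    return p_true / z, p_false / z

# Cavity with effects Toothache and Catch, CPTs derived from the joint table:
# P(cavity) = 0.2, P(toothache|cavity) = 0.6, P(toothache|~cavity) = 0.1,
# P(catch|cavity) = 0.9, P(catch|~cavity) = 0.2
print(naive_bayes_posterior(0.2, [(0.6, 0.1), (0.9, 0.2)], [True, True]))  # ~(0.871, 0.129)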

12. Example: Wumpus World

[Figure: 4×4 wumpus grid; breezes observed in [1,2] and [2,1]; squares [1,1], [1,2], [2,1] are known to be pit-free (OK)]

P_{ij} = true iff [i, j] contains a pit
B_{ij} = true iff [i, j] is breezy
Include only B_{1,1}, B_{1,2}, B_{2,1} in the probability model

13. Specifying the probability model

The full joint distribution is P(P_{1,1}, ..., P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})

Apply the product rule:
  P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, ..., P_{4,4}) P(P_{1,1}, ..., P_{4,4})
(Do it this way to get P(Effect | Cause))

First term: 1 if the breezes are exactly those adjacent to pits, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
  P(P_{1,1}, ..., P_{4,4}) = Π_{i,j = 1,1}^{4,4} P(P_{i,j}) = 0.2^n × 0.8^{16−n} for n pits

14. Observations and query

We know the following facts:
  b = ¬b_{1,1} ∧ b_{1,2} ∧ b_{2,1}
  known = ¬p_{1,1} ∧ ¬p_{1,2} ∧ ¬p_{2,1}
The query is P(P_{1,3} | known, b)
Define Unknown = the P_{ij}'s other than P_{1,3} and Known

For inference by enumeration, we have
  P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
which grows exponentially with the number of squares

15. Using conditional independence

Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares

[Figure: the grid partitioned into KNOWN squares, the QUERY square [1,3], the FRINGE squares adjacent to the known region, and the remaining OTHER squares]

Define Unknown = Fringe ∪ Other
  P(b | P_{1,3}, Known, Unknown) = P(b | P_{1,3}, Known, Fringe)
Manipulate the query into a form where we can use this

16. Using conditional independence

P(P_{1,3} | known, b)
  = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
  = α Σ_{unknown} P(b | P_{1,3}, known, unknown) P(P_{1,3}, known, unknown)
  = α Σ_{fringe} Σ_{other} P(b | known, P_{1,3}, fringe, other) P(P_{1,3}, known, fringe, other)
  = α Σ_{fringe} Σ_{other} P(b | known, P_{1,3}, fringe) P(P_{1,3}, known, fringe, other)
  = α Σ_{fringe} P(b | known, P_{1,3}, fringe) Σ_{other} P(P_{1,3}, known, fringe, other)
  = α Σ_{fringe} P(b | known, P_{1,3}, fringe) Σ_{other} P(P_{1,3}) P(known) P(fringe) P(other)
  = α P(known) P(P_{1,3}) Σ_{fringe} P(b | known, P_{1,3}, fringe) P(fringe) Σ_{other} P(other)
  = α′ P(P_{1,3}) Σ_{fringe} P(b | known, P_{1,3}, fringe) P(fringe)

17. Using conditional independence

[Figure: the three fringe models consistent with P_{1,3} = true, with probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16, and the two consistent with P_{1,3} = false, with probabilities 0.2 × 0.2 = 0.04 and 0.2 × 0.8 = 0.16]

P(P_{1,3} | known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
P(P_{2,2} | known, b) ≈ ⟨0.86, 0.14⟩
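
The final computation can be reproduced by brute force over the two fringe squares [2,2] and [3,1]; a sketch under the stated observations (helper names are illustrative):

from itertools import product

P_PIT = 0.2

def consistent(p13, p22, p31):
    """Breeze observations: b12 and b21 are true, b11 is false.
    [1,1] has no breeze automatically since [1,2] and [2,1] are known pit-free.
    [1,2] is breezy iff a pit is in [1,3] or [2,2]; [2,1] iff a pit is in [2,2] or [3,1]."""
    return (p13 or p22) and (p22 or p31)

posterior = {}
for p13 in (True, False):
    total = 0.0
    for p22, p31 in product((True, False), repeat=2):
        if consistent(p13, p22, p31):
            total += (P_PIT if p22 else 1 - P_PIT) * (P_PIT if p31 else 1 - P_PIT)
    posterior[p13] = (P_PIT if p13 else 1 - P_PIT) * total

alpha = 1.0 / sum(posterior.values())
print({k: alpha * v for k, v in posterior.items()})   # ~{True: 0.31, False: 0.69}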

18. Bayesian networks

BNs: a graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions; also known as probabilistic graphical models (PGMs)

Syntax:
  a set of nodes, one per variable
  a directed, acyclic graph (link ≈ "directly influences")
  a conditional distribution for each node given its parents: P(X_i | Parents(X_i))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values

19. Example

Topology of the network encodes conditional independence assertions:

[Figure: Weather stands alone; Cavity has children Toothache and Catch]

Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity

20. Example

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
  – A burglar can set the alarm off
  – An earthquake can set the alarm off
  – The alarm can cause Mary to call
  – The alarm can cause John to call

21. Example

Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls

  P(B) = .001       P(E) = .002

  B  E | P(A)         A | P(J)        A | P(M)
  T  T | .95          T | .90         T | .70
  T  F | .94          F | .05         F | .01
  F  T | .29
  F  F | .001

22. Compactness

A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values

Each row requires one number p for X_i = true (the number for X_i = false is just 1 − p)

If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution

For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)

In certain cases (conditional independence assumptions), BNs reduce O(2^n) to O(kn) (NP ⇒ P!)

23. Global semantics

Global semantics defines the full joint distribution as the product of the local conditional distributions:
  P(x_1, ..., x_n) = Π_{i=1}^n P(x_i | parents(X_i))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ?

24. Global semantics

  P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
    = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
    = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
    ≈ 0.00063
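
The same joint entry can be computed directly from the CPTs of slide 21; a minimal sketch, where the dictionary encoding of the CPTs is just one convenient choice:

# CPTs of the burglary network: P(B), P(E), P(A|B,E), P(J|A), P(M|A)
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) = product of the local conditional probabilities."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

print(joint(False, False, True, True, True))   # ~0.00063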

25. Local semantics

Local semantics: each node is conditionally independent of its nondescendants (the Z_{ij}) given its parents (the U_i)

[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and nondescendants Z_{1j}, ..., Z_{nj}]

Theorem: local semantics ⇔ global semantics

26. Markov blanket

Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents

[Figure: the same node X with its Markov blanket shaded]

27. Constructing Bayesian networks

Algorithm: a series of locally testable assertions of conditional independence guarantees the required global semantics

1. Choose an ordering of variables X_1, ..., X_n
2. For i = 1 to n
     add X_i to the network
     select parents from X_1, ..., X_{i−1} such that
       P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})

This choice of parents guarantees the global semantics:
  P(X_1, ..., X_n) = Π_{i=1}^n P(X_i | X_1, ..., X_{i−1})   (chain rule)
                   = Π_{i=1}^n P(X_i | Parents(X_i))        (by construction)

Each node is conditionally independent of its other predecessors in the node (partial) ordering, given its parents

28-32. Example

Suppose we choose the ordering M, J, A, B, E (adding MaryCalls, JohnCalls, Alarm, Burglary, Earthquake in turn):

  P(J | M) = P(J)?                                  No
  P(A | J, M) = P(A | J)?  P(A | J, M) = P(A)?      No, No
  P(B | A, J, M) = P(B | A)?                        Yes
  P(B | A, J, M) = P(B)?                            No
  P(E | B, A, J, M) = P(E | A)?                     No
  P(E | B, A, J, M) = P(E | A, B)?                  Yes

33. Example

[Figure: the network obtained with the ordering M, J, A, B, E: M → J; M, J → A; A → B; A, B → E]

Assessing conditional probabilities is hard in noncausal directions
A network can be far more compact than the full joint distribution
But this network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers (due to the ordering of the variables)

34. Probabilistic reasoning

• Exact inference by enumeration
• Exact inference by variable elimination
• Approximate inference by stochastic simulation
• Approximate inference by Markov chain Monte Carlo

35. Reasoning tasks in BNs (PGMs)

Simple queries: compute the posterior marginal P(X_i | E = e)
  e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e)
Optimal decisions: decision networks include utility information;
  probabilistic inference is required for P(outcome | action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?

36. Inference by enumeration

A slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation

Simple query on the burglary network:
  P(B | j, m) = P(B, j, m) / P(j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)

Rewrite full joint entries using products of CPT entries:
  P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)
              = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)

Recursive depth-first enumeration: O(n) space, O(d^n) time

37. Enumeration algorithm

function EnumerationAsk(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a belief network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
    Q(x_i) ← EnumerateAll(bn.Vars, e_{x_i})
      where e_{x_i} is e extended with X = x_i
  return Normalize(Q(X))

function EnumerateAll(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × EnumerateAll(Rest(vars), e)
    else return Σ_y P(y | parents(Y)) × EnumerateAll(Rest(vars), e_y)
      where e_y is e extended with Y = y
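
A direct Python transcription of EnumerationAsk/EnumerateAll for Boolean variables, tested on the burglary network; the network encoding (variable → parents plus a CPT giving P(var = true) per parent-value tuple) is an assumption of this sketch, not something fixed by the pseudocode:

BN = {                     # var: (parents, {parent values: P(var = True)})
    'B': ((), {(): 0.001}),
    'E': ((), {(): 0.002}),
    'A': (('B', 'E'), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    'J': (('A',), {(True,): 0.90, (False,): 0.05}),
    'M': (('A',), {(True,): 0.70, (False,): 0.01}),
}
ORDER = ['B', 'E', 'A', 'J', 'M']      # topological order of the variables

def p(var, value, ev):
    """P(var = value | parents(var)), reading the CPT under assignment ev."""
    parents, cpt = BN[var]
    pt = cpt[tuple(ev[par] for par in parents)]
    return pt if value else 1.0 - pt

def enumerate_all(variables, ev):
    if not variables:
        return 1.0
    y, rest = variables[0], variables[1:]
    if y in ev:
        return p(y, ev[y], ev) * enumerate_all(rest, ev)
    return sum(p(y, v, ev) * enumerate_all(rest, {**ev, y: v})
               for v in (True, False))

def enumeration_ask(x, ev):
    q = {v: enumerate_all(ORDER, {**ev, x: v}) for v in (True, False)}
    z = sum(q.values())
    return {v: q[v] / z for v in q}

print(enumeration_ask('B', {'J': True, 'M': True}))   # ~{True: 0.284, False: 0.716}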

38. Evaluation tree

[Figure: evaluation tree for P(b | j, m); the tree branches on e/¬e (P(e) = .002, P(¬e) = .998) and then on a/¬a (P(a | b, e) = .95, etc.), with leaves P(j | a), P(m | a), etc. Summing happens at the "+" nodes]

Enumeration is inefficient because of repeated computation
  e.g., it computes P(j | a) P(m | a) for each value of e
This is improved by avoiding such repeated subcomputations

39. Inference by variable elimination

Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation

P(B | j, m)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
    (the five factors correspond to B, E, A, J, M respectively)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a P(a | B, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) Σ_a f_A(a, b, e) f_J(a) f_M(a)
  = α P(B) Σ_e P(e) f_{ĀJM}(b, e)      (sum out A)
  = α P(B) f_{ĒĀJM}(b)                 (sum out E)
  = α f_B(b) × f_{ĒĀJM}(b)

40. Variable elimination: basic operations

Summing out a variable from a product of factors:
  move any constant factors outside the summation
  add up submatrices in the pointwise product of the remaining factors
  Σ_x f_1 × ··· × f_k = f_1 × ··· × f_i Σ_x f_{i+1} × ··· × f_k = f_1 × ··· × f_i × f_X̄
  assuming f_1, ..., f_i do not depend on X

Pointwise product of factors f_1 and f_2:
  f_1(x_1, ..., x_j, y_1, ..., y_k) × f_2(y_1, ..., y_k, z_1, ..., z_l)
    = f(x_1, ..., x_j, y_1, ..., y_k, z_1, ..., z_l)
  e.g., f_1(a, b) × f_2(b, c) = f(a, b, c)
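
Both operations have a compact dictionary-based sketch, where a factor is a pair (variable list, table keyed by value tuples); this representation and the function names are assumptions of the sketch:

from itertools import product

def pointwise_product(f1, f2, domains):
    """Join two factors on their shared variables: f(x, y, z) = f1(x, y) * f2(y, z)."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product(*(domains[v] for v in out_vars)):
        assign = dict(zip(out_vars, values))
        k1 = tuple(assign[v] for v in vars1)
        k2 = tuple(assign[v] for v in vars2)
        table[values] = t1[k1] * t2[k2]
    return out_vars, table

def sum_out(var, factor):
    """Marginalize var out of a single factor."""
    fvars, table = factor
    i = fvars.index(var)
    out_vars = fvars[:i] + fvars[i + 1:]
    out = {}
    for values, p in table.items():
        key = values[:i] + values[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out_vars, out

# e.g., f1(A, B) x f2(B, C) = f(A, B, C), then sum out B
domains = {'A': (True, False), 'B': (True, False), 'C': (True, False)}
f1 = (['A', 'B'], {(a, b): 0.1 for a, b in product((True, False), repeat=2)})
f2 = (['B', 'C'], {(b, c): 0.2 for b, c in product((True, False), repeat=2)})
f = pointwise_product(f1, f2, domains)
print(sum_out('B', f))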

41. Variable elimination algorithm

function EliminationAsk(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  factors ← [ ]
  for each var in Order(bn.Vars) do
    factors ← [MakeFactor(var, e) | factors]
    if var is a hidden variable then factors ← SumOut(var, factors)
  return Normalize(PointwiseProduct(factors))

42. Irrelevant variables

Consider the query P(JohnCalls | Burglary = true):
  P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)

The sum over m is identically 1; M is irrelevant to the query

Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and
  Ancestors({X} ∪ E) = {Alarm, Earthquake}
so MaryCalls is irrelevant

(Compare this to backward chaining from the query in Horn clause KBs)

43. Irrelevant variables

Defn: the moral graph of a Bayes net: marry all parents and drop the arrows
Defn: A is m-separated from B by C iff it is separated by C in the moral graph

Thm 2: Y is irrelevant if it is m-separated from X by E

For P(JohnCalls | Alarm = true), both Burglary and Earthquake are irrelevant

44. Complexity of exact inference

Singly connected networks (or polytrees):
  – any two nodes are connected by at most one (undirected) path
  – time and space cost of variable elimination are O(d^k n)

Multiply connected networks:
  – can reduce 3SAT to exact inference ⇒ NP-hard
  – equivalent to counting 3SAT models ⇒ #P-complete

[Figure: reduction from 3SAT: variables A, B, C, D each with prior 0.5, clause nodes 1: A∨B∨C, 2: C∨D∨¬A, 3: B∨C∨¬D, and an AND node over the clauses]

45. Inference by stochastic simulation

Idea:
  1) Draw N samples from a sampling distribution S
  2) Compute an approximate posterior probability P̂
  3) Show this converges to the true probability P

Methods:
  – Sampling from an empty network
  – Rejection sampling: reject samples disagreeing with the evidence
  – Likelihood weighting: use the evidence to weight samples
  – Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior

46. Sampling from an empty network

Direct sampling from a network that has no evidence associated with it (sampling each variable in turn, in topological order)

function Prior-Sample(bn) returns an event sampled from P(X_1, ..., X_n) specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X_1, ..., X_n)
  x ← an event with n elements
  for each variable X_i in X_1, ..., X_n do
    x_i ← a random sample from P(X_i | Parents(X_i))
  return x
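
A Python sketch of Prior-Sample for the sprinkler network of the next slide; the encoding (variable → parents and a CPT giving P(var = true)) mirrors the earlier burglary sketch and is an assumption of this sketch:

import random

SPRINKLER_BN = {
    'Cloudy':    ((), {(): 0.5}),
    'Sprinkler': (('Cloudy',), {(True,): 0.10, (False,): 0.50}),
    'Rain':      (('Cloudy',), {(True,): 0.80, (False,): 0.20}),
    'WetGrass':  (('Sprinkler', 'Rain'), {(True, True): 0.99, (True, False): 0.90,
                                          (False, True): 0.90, (False, False): 0.01}),
}
TOPO = ['Cloudy', 'Sprinkler', 'Rain', 'WetGrass']

def prior_sample(bn=SPRINKLER_BN, order=TOPO):
    """Sample one complete event, each variable in topological order."""
    x = {}
    for var in order:
        parents, cpt = bn[var]
        x[var] = random.random() < cpt[tuple(x[p] for p in parents)]
    return x

print(prior_sample())   # e.g. {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}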

47. Example: the sprinkler network

  Cloudy → Sprinkler; Cloudy → Rain; Sprinkler, Rain → WetGrass

  P(C) = .50

  C | P(S|C)      C | P(R|C)      S R | P(W|S,R)
  T | .10         T | .80         T T | .99
  F | .50         F | .20         T F | .90
                                  F T | .90
                                  F F | .01

(Prior sampling walks through the variables in this topological order, sampling each from its CPT given the already-sampled parent values)

54. Sampling from an empty network contd.

Probability that PriorSample generates a particular event:
  S_PS(x_1, ..., x_n) = Π_{i=1}^n P(x_i | parents(X_i)) = P(x_1, ..., x_n)
i.e., the true prior probability

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x_1, ..., x_n) be the number of samples generated for the event x_1, ..., x_n
Then we have
  lim_{N→∞} P̂(x_1, ..., x_n) = lim_{N→∞} N_PS(x_1, ..., x_n) / N
                              = S_PS(x_1, ..., x_n)
                              = P(x_1, ..., x_n)
That is, estimates derived from PriorSample are consistent
Shorthand: P̂(x_1, ..., x_n) ≈ P(x_1, ..., x_n)

55. Rejection sampling

P̂(X | e) is estimated from the samples that agree with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: N, a vector of counts for each value of X, initially zero
  for j = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then      /* otherwise the sample is rejected */
      N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N[X])
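
Rejection sampling is then a thin loop over Prior-Sample; this sketch reuses SPRINKLER_BN, TOPO, and prior_sample from the direct-sampling sketch under slide 46 (the function name and sample count are illustrative):

def rejection_sampling(x_var, evidence, n=10000):
    """Estimate P(x_var | evidence) by discarding samples that disagree with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample()
        if all(sample[v] == val for v, val in evidence.items()):
            counts[sample[x_var]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else None  # None if every sample was rejected

print(rejection_sampling('Rain', {'Sprinkler': True}))   # ~{True: 0.3, False: 0.7}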

56. Example

Estimate P(Rain | Sprinkler = true) using 100 samples
  27 samples have Sprinkler = true
  Of these, 8 have Rain = true and 19 have Rain = false
  P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure

57. Rejection sampling contd.

P̂(X | e) = α N_PS(X, e)            (algorithm defn.)
          = N_PS(X, e) / N_PS(e)    (normalized by N_PS(e))
          ≈ P(X, e) / P(e)          (property of PriorSample)
          = P(X | e)                (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P(e) is small
  P(e) drops off exponentially with the number of evidence variables

58. Likelihood weighting

Idea:
  – fix the evidence variables
  – sample only the nonevidence variables
  – weight each sample by the likelihood it accords the evidence

59. Likelihood weighting

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: W, a vector of weighted counts for each value of X, initially zero
  for j = 1 to N do
    x, w ← Weighted-Sample(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements, initialized from e; w ← 1
  for each variable X_i in X_1, ..., X_n do
    if X_i is an evidence variable with value x_i in e
      then w ← w × P(X_i = x_i | Parents(X_i))
      else x[i] ← a random sample from P(X_i | Parents(X_i))
  return x, w
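
A sketch of Likelihood-Weighting and Weighted-Sample, again reusing SPRINKLER_BN and TOPO from the direct-sampling sketch; the exact posterior for this query works out to about ⟨0.32, 0.68⟩, so the estimate should land near that:

import random

def weighted_sample(evidence, bn=SPRINKLER_BN, order=TOPO):
    """Fix the evidence variables; sample the rest; weight by the evidence likelihood."""
    x, w = dict(evidence), 1.0
    for var in order:
        parents, cpt = bn[var]
        p_true = cpt[tuple(x[p] for p in parents)]
        if var in evidence:
            w *= p_true if evidence[var] else 1.0 - p_true
        else:
            x[var] = random.random() < p_true
    return x, w

def likelihood_weighting(x_var, evidence, n=10000):
    weights = {True: 0.0, False: 0.0}
    for _ in range(n):
        x, w = weighted_sample(evidence)
        weights[x[x_var]] += w
    z = sum(weights.values())
    return {v: w / z for v, w in weights.items()}

print(likelihood_weighting('Rain', {'Sprinkler': True, 'WetGrass': True}))  # ~{True: 0.32, False: 0.68}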

60. Example

Likelihood weighting on the sprinkler network, walking through one weighted sample for the query P(Rain | Sprinkler = true, WetGrass = true):

  start                                                      w = 1.0
  sample Cloudy from P(C) = ⟨.5, .5⟩; here Cloudy = true     w = 1.0
  Sprinkler is evidence: w ← w × P(s | c) = 0.1              w = 1.0 × 0.1
  sample Rain from P(R | c) = ⟨.8, .2⟩; here Rain = true     w = 1.0 × 0.1
  WetGrass is evidence: w ← w × P(w | s, r) = 0.99           w = 1.0 × 0.1 × 0.99 = 0.099

67. Likelihood weighting contd.

Sampling probability for WeightedSample is
  S_WS(z, e) = Π_{i=1}^l P(z_i | parents(Z_i))
Note: it pays attention to evidence in ancestors only
  ⇒ somewhere "in between" the prior and the posterior distribution

Weight for a given sample z, e is
  w(z, e) = Π_{i=1}^m P(e_i | parents(E_i))

Weighted sampling probability is
  S_WS(z, e) w(z, e) = Π_{i=1}^l P(z_i | parents(Z_i)) Π_{i=1}^m P(e_i | parents(E_i))
                     = P(z, e)   (by the standard global semantics of the network)

Hence likelihood weighting returns consistent estimates,
but performance still degrades with many evidence variables
because a few samples have nearly all the total weight

68. Inference by Markov chain Monte Carlo (MCMC)

"State" of the network = current assignment to all variables
  ⇒ generate the next state by making random changes to the current state

Generate the next state by sampling one variable given its Markov blanket
  (recall: the Markov blanket is the parents, children, and children's parents)
Sample each variable in turn, keeping the evidence fixed

The specific transition probability with which the stochastic process moves from one state to another is defined by the conditional distribution given the Markov blanket of the variable being sampled

69. MCMC Gibbs sampling

function MCMC-Gibbs-Ask(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts for each value of X, initially zero
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
    for each Z_i in Z do            /* can also choose Z_i at random */
      set the value of Z_i in x by sampling from P(Z_i | mb(Z_i))   /* Markov blanket */
    N[x] ← N[x] + 1 where x is the value of X in x
  return Normalize(N)
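
A Gibbs-sampling sketch for the sprinkler query; rather than deriving P(Z_i | mb(Z_i)) in closed form, it rescores the full joint with Z_i set to each value, which is proportional to the same distribution. It reuses SPRINKLER_BN and TOPO from the direct-sampling sketch; the names and the sample count are illustrative, and no burn-in is used for simplicity:

import random

def joint_prob(x, bn=SPRINKLER_BN):
    """Full-joint probability of a complete assignment x."""
    p = 1.0
    for var, (parents, cpt) in bn.items():
        pt = cpt[tuple(x[par] for par in parents)]
        p *= pt if x[var] else 1.0 - pt
    return p

def gibbs_ask(x_var, evidence, n=20000, bn=SPRINKLER_BN):
    nonevidence = [v for v in TOPO if v not in evidence]
    state = dict(evidence)
    for v in nonevidence:                    # random completion of the evidence
        state[v] = random.random() < 0.5
    counts = {True: 0, False: 0}
    for _ in range(n):
        for z in nonevidence:
            # P(z | mb(z)) is proportional to the full joint with z flipped
            p_t = joint_prob({**state, z: True})
            p_f = joint_prob({**state, z: False})
            state[z] = random.random() < p_t / (p_t + p_f)
        counts[state[x_var]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

print(gibbs_ask('Rain', {'Sprinkler': True, 'WetGrass': True}))  # ~{True: 0.32, False: 0.68}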

70. The Markov chain

With Sprinkler = true, WetGrass = true, there are four states (Cloudy and Rain each true or false)

[Figure: the four states of the sprinkler network with the evidence fixed, and the transitions between them]

Wander about for a while ...

71. Example

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat
Count the number of times Rain is true and false in the samples

E.g., visit 100 states: 31 have Rain = true, 69 have Rain = false
  P̂(Rain | Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: the chain approaches its stationary distribution:
the long-run fraction of time spent in each state is exactly proportional to its posterior probability

72. Markov blanket sampling

The Markov blanket of Cloudy is Sprinkler and Rain
The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

The probability given the Markov blanket is calculated as follows:
  P(x′_i | mb(X_i)) = α P(x′_i | parents(X_i)) Π_{Z_j ∈ Children(X_i)} P(z_j | parents(Z_j))

Easily implemented in message-passing parallel systems, brains

Main computational problems:
  1) Difficult to tell whether convergence has been achieved
  2) Can be wasteful if the Markov blanket is large:
     P(X_i | mb(X_i)) won't change much (law of large numbers)

73. Approximate inference

Exact inference by variable elimination:
  – polytime on polytrees, NP-hard on general graphs
  – space = time, very sensitive to topology

Approximate inference by LW (likelihood weighting) and MCMC (Markov chain Monte Carlo):
  – LW does poorly when there is lots of (downstream) evidence
  – LW and MCMC are generally insensitive to topology
  – convergence can be very slow with probabilities close to 1 or 0
  – can handle arbitrary combinations of discrete and continuous variables

74. Dynamic Bayesian networks

DBNs are Bayesian networks that represent temporal probability models

Basic idea: copy the state and evidence variables for each time step
  X_t = set of unobservable state variables at time t
    e.g., BloodSugar_t, StomachContents_t, etc.
  E_t = set of observable evidence variables at time t
    e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t

This assumes discrete time; the step size depends on the problem
Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b−1}, X_b

X_t and E_t may contain arbitrarily many variables in a replicated Bayes net

75. Hidden Markov models (HMMs)

Every HMM is a single-variable DBN; every discrete DBN is an HMM
  (combine all the state variables of the DBN into a single variable)

[Figure: a DBN slice with state variables X_t, Y_t, Z_t and their successors X_{t+1}, Y_{t+1}, Z_{t+1}]

Sparse dependencies ⇒ exponentially fewer parameters;
e.g., with 20 Boolean state variables, three parents each:
the DBN has 20 × 2^3 = 160 parameters, while the corresponding HMM transition matrix has 2^20 × 2^20 ≈ 10^12 entries

76. Markov processes (Markov chains)

Construct a Bayes net from these variables: what are the parents?

Markov assumption: X_t depends on a bounded subset of X_{0:t−1}
  First-order Markov process:  P(X_t | X_{0:t−1}) = P(X_t | X_{t−1})
  Second-order Markov process: P(X_t | X_{0:t−1}) = P(X_t | X_{t−2}, X_{t−1})

[Figure: first-order chain X_{t−2} → X_{t−1} → X_t → X_{t+1} → X_{t+2}; second-order chain with additional links skipping one step]

Sensor Markov assumption: P(E_t | X_{0:t}, E_{0:t−1}) = P(E_t | X_t)

Stationary process: the transition model P(X_t | X_{t−1}) and the sensor model P(E_t | X_t) are fixed for all t

77. Example

The umbrella world:
  Rain_{t−1} → Rain_t → Rain_{t+1}, with Rain_t → Umbrella_t

  R_{t−1} | P(R_t)        R_t | P(U_t)
     t    |  0.7           t  |  0.9
     f    |  0.3           f  |  0.2

The first-order Markov assumption is not exactly true in the real world!
Possible fixes:
  1. Increase the order of the Markov process
  2. Augment the state, e.g., add Temp_t, Pressure_t

78. HMMs

X_t is a single, discrete variable (usually E_t is too)
Domain of X_t is {1, ..., S}

Transition matrix T_{ij} = P(X_t = j | X_{t−1} = i), e.g.,
  T = [ 0.7  0.3 ]
      [ 0.3  0.7 ]

Sensor matrix O_t for each time step, with diagonal elements P(e_t | X_t = i),
e.g., with U_1 = true,
  O_1 = [ 0.9   0  ]
        [  0   0.2 ]

Forward and backward messages as column vectors:
  f_{1:t+1} = α O_{t+1} T^⊤ f_{1:t}
  b_{k+1:t} = T O_{k+1} b_{k+2:t}

The forward-backward algorithm needs time O(S^2 t) and space O(S t)

79. Inference tasks in HMMs

Filtering: P(X_t | e_{1:t})
  belief state; input to the decision process of a rational agent
Prediction: P(X_{t+k} | e_{1:t}) for k > 0
  evaluation of possible action sequences; like filtering without the evidence
Smoothing: P(X_k | e_{1:t}) for 0 ≤ k < t
  better estimate of past states, essential for learning
Most likely explanation: arg max_{x_{1:t}} P(x_{1:t} | e_{1:t})
  speech recognition, decoding with a noisy channel

80. Filtering

Aim: devise a recursive state estimation algorithm
  P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t}))

P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
  = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
  = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})

I.e., prediction + estimation. Prediction by summing out X_t:
  P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
                         = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})

  f_{1:t+1} = Forward(f_{1:t}, e_{t+1}) where f_{1:t} = P(X_t | e_{1:t})

Time and space requirements are constant (independent of t)
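
A sketch of the forward update in matrix form, f_{1:t+1} = α O_{t+1} T^⊤ f_{1:t}, for the umbrella model of slide 77; it uses numpy, and the SENSOR dictionary keyed by the umbrella observation is an encoding choice of this sketch:

import numpy as np

T = np.array([[0.7, 0.3],    # P(X_t | X_{t-1}); rows index the previous state (rain, ~rain)
              [0.3, 0.7]])
SENSOR = {True:  np.diag([0.9, 0.2]),   # O_t when Umbrella_t = true
          False: np.diag([0.1, 0.8])}   # O_t when Umbrella_t = false

def forward(f, umbrella):
    """One filtering step: prediction (T^T f), then sensor update and normalization."""
    f = SENSOR[umbrella] @ T.T @ f
    return f / f.sum()

f = np.array([0.5, 0.5])                 # prior P(Rain_0)
for u in [True, True]:                   # two umbrella observations
    f = forward(f, u)
print(f)                                 # ~[0.883, 0.117]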

81. Inference in DBNs

Naive method: unroll the network and run any exact algorithm

[Figure: the umbrella DBN unrolled for seven time steps, with the transition CPT P(R_t | R_{t−1}) = (0.7, 0.3) and the sensor CPT P(U_t | R_t) = (0.9, 0.2) repeated at every slice]

Problem: the inference cost for each update grows with t

Rollup filtering: add slice t + 1, "sum out" slice t using variable elimination
  The largest factor is O(d^{n+1}), update cost O(d^{n+2})
  (cf. HMM update cost O(d^{2n}))

Approximate inference by MCMC (Markov chain Monte Carlo), etc.

82. Probabilistic logic

Bayesian networks are essentially propositional:
  – the set of random variables is fixed and finite
  – each variable has a fixed domain of possible values

Probabilistic reasoning can be formalized as probabilistic logic

First-order probabilistic logic combines probability theory with the expressive power of first-order logic

83. First-order probabilistic logic

Recall: propositional probabilistic logic
  – Proposition = disjunction of the atomic events in which it is true
  – Possible world (sample point) ω = propositional logic model
    (an assignment of values to all of the random variables under consideration)
  – ω ⊨ φ: proposition φ holds in world ω
  – Probability model: a set Ω of possible worlds with a probability P(ω) for each world ω
