

  1. Bayesian networks Lecture 18 David Sontag New York University

  2. Outline for today
  • Modeling sequential data (e.g., time series, speech processing) using hidden Markov models (HMMs)
  • Bayesian networks
    – Independence properties
    – Examples
    – Learning and inference

  3. Example application: Tracking
  Observe noisy measurements of missile location: Y_1, Y_2, ...
  [Figure: radar tracking a missile]
  Where is the missile now? Where will it be in 10 seconds?

  4. Probabilistic approach
  • Our measurements of the missile location were Y_1, Y_2, ..., Y_n
  • Let X_t be the true <missile location, velocity> at time t
  • To keep this simple, suppose that everything is discrete, i.e., X_t takes the values 1, ..., k
  Grid the space:
  [Figure: the space discretized into a grid of k cells]

  5. Probabilistic approach
  • First, we specify the conditional distribution Pr(X_t | X_{t-1}): from basic physics, we can bound the distance that the missile can have traveled
  • Then, we specify Pr(Y_t | X_t = <(10,20), 200 mph toward the northeast>): with probability 1/2, Y_t = X_t (ignoring the velocity). Otherwise, Y_t is a uniformly chosen grid location
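
To make these two distributions concrete, below is a minimal sketch (not from the lecture) of how the transition and emission tables could be built for a gridded state space. The grid dimensions, the `side` and `max_step` names, and the one-cell-per-step physics bound are illustrative assumptions; only the probability-1/2 correct sensor reading comes from the slide.

```python
import numpy as np

side = 10                    # grid is side x side (hypothetical size)
K = side * side              # number of grid cells, i.e., number of states
max_step = 1                 # assumed physics bound: at most 1 cell per time step

def neighbors(cell):
    """Cells reachable from `cell` in one time step (including staying put)."""
    r, c = divmod(cell, side)
    result = []
    for dr in range(-max_step, max_step + 1):
        for dc in range(-max_step, max_step + 1):
            nr, nc = r + dr, c + dc
            if 0 <= nr < side and 0 <= nc < side:
                result.append(nr * side + nc)
    return result

# Transition distribution Pr(X_t | X_{t-1}): uniform over reachable cells (assumed).
transition = np.zeros((K, K))
for x_prev in range(K):
    reachable = neighbors(x_prev)
    transition[x_prev, reachable] = 1.0 / len(reachable)

# Emission distribution Pr(Y_t | X_t): with probability 1/2 the radar reports
# the true cell, otherwise a uniformly chosen grid cell (as on the slide).
emission = np.full((K, K), 0.5 / K)
emission[np.arange(K), np.arange(K)] += 0.5

assert np.allclose(transition.sum(axis=1), 1.0)
assert np.allclose(emission.sum(axis=1), 1.0)
```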

  6. Hidden Markov models (1960s)
  • Assume that the joint distribution on X_1, ..., X_n and Y_1, ..., Y_n factors as follows:

      Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) \prod_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

  • To find out where the missile is now, we do marginal inference:

      Pr(x_n | y_1, ..., y_n)

  • To find the most likely trajectory, we do MAP (maximum a posteriori) inference:

      \arg\max_x Pr(x_1, ..., x_n | y_1, ..., y_n)
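
As a small illustration of the factorization (not part of the lecture), the joint probability of a particular state/observation sequence can be evaluated directly once the tables are given. The `initial`, `transition`, and `emission` arrays below are hypothetical numpy tables and the toy numbers are made up:

```python
import numpy as np

def hmm_joint_prob(xs, ys, initial, transition, emission):
    """Pr(x_1..x_n, y_1..y_n) under the HMM factorization above.

    xs, ys     : sequences of integer state / observation indices
    initial    : initial[x]       = Pr(X_1 = x)
    transition : transition[a, b] = Pr(X_t = b | X_{t-1} = a)
    emission   : emission[x, y]   = Pr(Y_t = y | X_t = x)
    """
    prob = initial[xs[0]] * emission[xs[0], ys[0]]
    for t in range(1, len(xs)):
        prob *= transition[xs[t - 1], xs[t]] * emission[xs[t], ys[t]]
    return prob

# Tiny usage example with a hypothetical 2-state, 2-observation HMM.
initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3],
                       [0.2, 0.8]])
emission = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
print(hmm_joint_prob([0, 1, 1], [0, 1, 1], initial, transition, emission))
```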

  7. Inference
  • Recall, to find out where the missile is now, we do marginal inference: Pr(x_n | y_1, ..., y_n)
  • How does one compute this?
  • Applying the rule of conditional probability, we have:

      Pr(x_n | y_1, ..., y_n) = Pr(x_n, y_1, ..., y_n) / Pr(y_1, ..., y_n)
                              = Pr(x_n, y_1, ..., y_n) / \sum_{\hat{x}_n = 1}^{k} Pr(\hat{x}_n, y_1, ..., y_n)

  • Naively, computing the numerator

      Pr(x_n, y_1, ..., y_n) = \sum_{x_1, ..., x_{n-1}} Pr(x_1, ..., x_n, y_1, ..., y_n)

    would seem to require k^{n-1} summations. Is there a more efficient algorithm?
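
For concreteness, this is what the naive computation looks like as code: it enumerates all k^{n-1} settings of x_1, ..., x_{n-1}, so it is unusable beyond tiny n and k. A sketch only, using the same hypothetical table layout as above:

```python
import itertools

def marginal_brute_force(x_n, ys, initial, transition, emission):
    """Pr(x_n, y_1..y_n) by summing the joint over all x_1..x_{n-1}.

    Exponential in n: k^(n-1) terms. Shown only to motivate the
    dynamic-programming (forward) algorithm on the next slide.
    """
    n, k = len(ys), len(initial)
    total = 0.0
    for prefix in itertools.product(range(k), repeat=n - 1):
        xs = list(prefix) + [x_n]
        prob = initial[xs[0]] * emission[xs[0], ys[0]]
        for t in range(1, n):
            prob *= transition[xs[t - 1], xs[t]] * emission[xs[t], ys[t]]
        total += prob
    return total
```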

  8. Marginal inference in HMMs
  • Use dynamic programming:

      Pr(x_n, y_1, ..., y_n)
        = \sum_{x_{n-1}} Pr(x_{n-1}, x_n, y_1, ..., y_n)                                            [marginalization: Pr(A = a) = \sum_b Pr(B = b, A = a)]
        = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1}, y_1, ..., y_{n-1})   [chain rule: Pr(A = a, B = b) = Pr(A = a) Pr(B = b | A = a)]
        = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n, y_n | x_{n-1})                      [conditional independence in HMMs]
        = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n, x_{n-1})    [chain rule]
        = \sum_{x_{n-1}} Pr(x_{n-1}, y_1, ..., y_{n-1}) Pr(x_n | x_{n-1}) Pr(y_n | x_n)             [conditional independence in HMMs]

  • For n = 1, initialize Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
  • Easy to do filtering
  • Total running time is O(nk^2), i.e., linear in the sequence length!
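
The recursion above translates directly into code. Below is a minimal sketch of the forward (filtering) algorithm under the same hypothetical table layout as in the earlier sketches; each step is a vector-matrix product, which is where the O(nk^2) running time comes from:

```python
import numpy as np

def forward(ys, initial, transition, emission):
    """Forward (filtering) recursion for an HMM.

    Returns alpha, where alpha[t, x] = Pr(X_{t+1} = x, y_1, ..., y_{t+1})
    (0-indexed time), using the same hypothetical table layout as before.
    """
    n, k = len(ys), len(initial)
    alpha = np.zeros((n, k))
    alpha[0] = initial * emission[:, ys[0]]          # Pr(x_1, y_1) = Pr(x_1) Pr(y_1 | x_1)
    for t in range(1, n):
        # alpha[t, x] = sum_{x'} alpha[t-1, x'] Pr(x | x') Pr(y_t | x)
        alpha[t] = (alpha[t - 1] @ transition) * emission[:, ys[t]]
    return alpha

def filtering_distribution(ys, initial, transition, emission):
    """Pr(X_n | y_1, ..., y_n): normalize the last row of alpha."""
    alpha = forward(ys, initial, transition, emission)
    return alpha[-1] / alpha[-1].sum()
```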

  9. MAP inference in HMMs
  • MAP inference in HMMs can also be solved in linear time!

      \arg\max_x Pr(x_1, ..., x_n | y_1, ..., y_n) = \arg\max_x Pr(x_1, ..., x_n, y_1, ..., y_n)
                                                   = \arg\max_x \log Pr(x_1, ..., x_n, y_1, ..., y_n)
                                                   = \arg\max_x [ \log(Pr(x_1) Pr(y_1 | x_1)) + \sum_{i=2}^{n} \log(Pr(x_i | x_{i-1}) Pr(y_i | x_i)) ]

  • Formulate as a shortest-paths problem on a graph with a source s, a sink t, and k nodes per variable X_1, X_2, ..., X_{n-1}, X_n:
    – Weight for edge (s, x_1) is -\log[ Pr(x_1) Pr(y_1 | x_1) ]
    – Weight for edge (x_{i-1}, x_i) is -\log[ Pr(x_i | x_{i-1}) Pr(y_i | x_i) ]
    – Weight for edge (x_n, t) is 0
  • A shortest path from s to t gives the MAP assignment
  • Called the Viterbi algorithm
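
A minimal sketch of the Viterbi recursion in log space (same hypothetical tables as before); the dynamic program over the trellis plays the role of an explicit shortest-paths solver, with edge weights equal to negative log probabilities:

```python
import numpy as np

def viterbi(ys, initial, transition, emission):
    """MAP assignment argmax_x Pr(x_1..x_n | y_1..y_n) via the Viterbi algorithm."""
    n, k = len(ys), len(initial)
    log_T = np.log(transition)
    log_E = np.log(emission)

    score = np.zeros((n, k))            # score[t, x] = best log prob of a path ending in x at time t
    backptr = np.zeros((n, k), dtype=int)
    score[0] = np.log(initial) + log_E[:, ys[0]]
    for t in range(1, n):
        # candidate[x_prev, x] = score so far + log Pr(x | x_prev) + log Pr(y_t | x)
        candidate = score[t - 1][:, None] + log_T + log_E[:, ys[t]][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0)

    # Trace back the best path.
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```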

  10. Applications of HMMs
  • Speech recognition
    – Predict phonemes from the sounds forming words (i.e., the actual signals)
  • Natural language processing
    – Predict parts of speech (verb, noun, determiner, etc.) from the words in a sentence
  • Computational biology
    – Predict intron/exon regions from DNA
    – Predict protein structure from DNA (locally)
  • And many, many more!

  11. HMMs as a graphical model
  • We can represent a hidden Markov model with a graph:
  [Figure: chain X_1 -> X_2 -> ... -> X_6, with each X_t pointing to a shaded (observed) Y_t]
  • Shading denotes observed variables (e.g., what is available at test time)

      Pr(x_1, ..., x_n, y_1, ..., y_n) = Pr(x_1) Pr(y_1 | x_1) \prod_{t=2}^{n} Pr(x_t | x_{t-1}) Pr(y_t | x_t)

  • There is a 1-1 mapping between the graph structure and the factorization of the joint distribution

  12. Naïve Bayes as a graphical model
  • We can represent a naïve Bayes model with a graph:
  [Figure: label Y with arrows to the features X1, X2, X3, ..., Xn; the features are shaded (observed)]
  • Shading denotes observed variables (e.g., what is available at test time)

      Pr(y, x_1, ..., x_n) = Pr(y) \prod_{i=1}^{n} Pr(x_i | y)

  • There is a 1-1 mapping between the graph structure and the factorization of the joint distribution
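
At test time this factorization is typically used to predict the label by maximizing Pr(y) \prod_i Pr(x_i | y) over y. The sketch below illustrates that computation; it is not code from the lecture, and the `prior` and `likelihood` tables, the binary-feature assumption, and the toy spam numbers are all made up:

```python
import numpy as np

def naive_bayes_predict(x, prior, likelihood):
    """Pick argmax_y Pr(y) * prod_i Pr(x_i | y), computed in log space.

    x          : binary feature vector of length n
    prior      : prior[y]         = Pr(Y = y)
    likelihood : likelihood[y, i] = Pr(X_i = 1 | Y = y)
    """
    x = np.asarray(x)
    log_joint = np.log(prior) + (
        x * np.log(likelihood) + (1 - x) * np.log(1 - likelihood)
    ).sum(axis=1)
    return int(log_joint.argmax())

# Toy spam example: Y = 1 means "spam", features indicate word presence.
prior = np.array([0.7, 0.3])                      # Pr(not spam), Pr(spam)
likelihood = np.array([[0.05, 0.02, 0.20],        # Pr(word_i present | not spam)
                       [0.60, 0.40, 0.15]])       # Pr(word_i present | spam)
print(naive_bayes_predict([1, 1, 0], prior, likelihood))   # prints 1 (spam)
```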

  13. Bayesian networks
  • A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
    – One node i for each random variable X_i
    – One conditional probability distribution (CPD) per node, p(x_i | x_{Pa(i)}), specifying the variable's probability conditioned on its parents' values
  • Corresponds 1-1 with a particular factorization of the joint distribution:

      p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

  • Powerful framework for designing algorithms to perform probability computations
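
As a small illustration (not from the lecture) of how the factorization is evaluated, the sketch below multiplies one CPD entry per node, looking up each node's parents in the DAG. The dictionary-based representation (`parents`, `cpds`) is just a convenient assumption for the example, not a standard API:

```python
def bn_joint_prob(assignment, parents, cpds):
    """p(x_1, ..., x_n) = prod_i p(x_i | x_Pa(i)) for a discrete Bayesian network.

    assignment : dict variable -> value
    parents    : dict variable -> tuple of parent variables (the DAG)
    cpds       : dict variable -> function(value, parent_values) -> probability
    """
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpds[var](value, parent_values)
    return prob

# Tiny usage example: A -> B with binary variables.
parents = {"A": (), "B": ("A",)}
cpds = {
    "A": lambda a, _: [0.4, 0.6][a],                          # p(A)
    "B": lambda b, pa: [[0.9, 0.1], [0.3, 0.7]][pa[0]][b],    # p(B | A)
}
print(bn_joint_prob({"A": 1, "B": 0}, parents, cpds))          # 0.6 * 0.3 = 0.18
```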

  14. The 2011 Turing Award (Judea Pearl) was for Bayesian networks

  15. Example
  • Consider the following Bayesian network (example from Koller & Friedman, Probabilistic Graphical Models, 2009):
  [Figure: Difficulty -> Grade <- Intelligence -> SAT, and Grade -> Letter, with CPDs:]

      p(D):        d^0 = 0.6,  d^1 = 0.4
      p(I):        i^0 = 0.7,  i^1 = 0.3

      p(G | I, D):             g^1     g^2     g^3
                   i^0, d^0    0.3     0.4     0.3
                   i^0, d^1    0.05    0.25    0.7
                   i^1, d^0    0.9     0.08    0.02
                   i^1, d^1    0.5     0.3     0.2

      p(S | I):                s^0     s^1
                   i^0         0.95    0.05
                   i^1         0.2     0.8

      p(L | G):                l^0     l^1
                   g^1         0.1     0.9
                   g^2         0.4     0.6
                   g^3         0.99    0.01

  • What is its joint distribution?

      p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

      p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
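
For instance, plugging in one arbitrary assignment (chosen here for illustration, not taken from the lecture) and reading the numbers off the CPDs above:

      p(d^0, i^1, g^2, s^1, l^0) = p(d^0) p(i^1) p(g^2 | i^1, d^0) p(s^1 | i^1) p(l^0 | g^2)
                                 = 0.6 * 0.3 * 0.08 * 0.8 * 0.4
                                 ≈ 0.0046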

  16. Example
  • Consider the same Bayesian network (and CPDs) as on the previous slide (example from Koller & Friedman, Probabilistic Graphical Models, 2009)
  • What is this model assuming?

      SAT ⊥̸ Grade,   but   SAT ⊥ Grade | Intelligence

    That is, SAT score and grade are dependent marginally, but once Intelligence is observed, the path Grade <- Intelligence -> SAT is blocked and Grade carries no further information about SAT.

  17. Example
  • Consider the same Bayesian network (and CPDs) as on the previous slides (example from Koller & Friedman, Probabilistic Graphical Models, 2009)
  • Compared to a simple log-linear model to predict intelligence:
    – Captures the non-linearity between grade, course difficulty, and intelligence
    – Modular: training data can come from different sources!
    – Built-in feature selection: the letter of recommendation is irrelevant given the grade

  18. Bayesian networks enable use of domain knowledge

      p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

  Will my car start this morning?
  [Figure: car troubleshooting network from Heckerman et al., Decision-Theoretic Troubleshooting, 1995]

  19. Bayesian networks enable use of domain knowledge

      p(x_1, ..., x_n) = \prod_{i \in V} p(x_i | x_{Pa(i)})

  What is the differential diagnosis?
  [Figure: patient-monitoring network from Beinlich et al., The ALARM Monitoring System, 1989]

  20. Bayesian networks are generative models
  • Can sample from the joint distribution, top-down
  • Suppose Y can be "spam" or "not spam", and X_i is a binary indicator of whether word i is present in the e-mail
  • Let's try generating a few emails!
  [Figure: naïve Bayes graph, label Y with arrows to the features X1, X2, X3, ..., Xn]
  • Often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions
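
A minimal sketch of that top-down (ancestral) sampling process for the naïve Bayes spam model; the vocabulary and all probability values are made-up illustration numbers, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# First sample the label, then sample each word indicator given the label.
prior = np.array([0.7, 0.3])                      # Pr(Y = not spam), Pr(Y = spam)
word_probs = np.array([[0.05, 0.02, 0.20],        # Pr(word_i present | not spam)
                       [0.60, 0.40, 0.15]])       # Pr(word_i present | spam)
vocabulary = ["viagra", "lottery", "meeting"]     # hypothetical vocabulary

def generate_email():
    y = rng.choice(2, p=prior)                            # sample the label top-down
    x = rng.random(len(vocabulary)) < word_probs[y]       # then each word given the label
    words = [w for w, present in zip(vocabulary, x) if present]
    return ("spam" if y == 1 else "not spam"), words

for _ in range(3):
    print(generate_email())
```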

  21. Inference in Bayesian networks
  • Computing marginal probabilities in tree-structured Bayesian networks is easy
    – The algorithm called "belief propagation" generalizes what we showed for hidden Markov models to arbitrary trees
  [Figure: the HMM chain (X_1, ..., X_6 with observations Y_1, ..., Y_6) and the naïve Bayes graph shown earlier]
  • Wait... this isn't a tree! What can we do?

  22. Inference in Bayesian networks
  • In some cases (such as this) we can transform the network into what is called a "junction tree", and then run belief propagation

  23. Approximate inference
  • There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these
  • Markov chain Monte Carlo algorithms repeatedly sample assignments to estimate marginals
  • Variational inference algorithms (deterministic) find a simpler distribution which is "close" to the original, then compute marginals using the simpler distribution
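
As one concrete instance of the MCMC idea (a sketch, not an algorithm presented in the lecture), the snippet below runs Gibbs sampling on the student network from slide 15 to estimate Pr(i^1 | l^1): each variable is repeatedly resampled from its conditional given all the others, which is proportional to the joint, and the evidence variable is never resampled. Burn-in and convergence checks are omitted for brevity:

```python
import random

random.seed(0)

CARD = {"D": 2, "I": 2, "G": 3, "S": 2, "L": 2}        # value counts (g^1..g^3 -> 0..2, etc.)
p_D = [0.6, 0.4]
p_I = [0.7, 0.3]
p_G = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],   # key = (i, d)
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}
p_S = {0: [0.95, 0.05], 1: [0.2, 0.8]}                 # key = i
p_L = {0: [0.1, 0.9], 1: [0.4, 0.6], 2: [0.99, 0.01]}  # key = g

def joint(a):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
    return (p_D[a["D"]] * p_I[a["I"]] * p_G[(a["I"], a["D"])][a["G"]]
            * p_S[a["I"]][a["S"]] * p_L[a["G"]][a["L"]])

def gibbs_step(a, var):
    """Resample `var` from its conditional given all other variables."""
    weights = []
    for v in range(CARD[var]):
        a[var] = v
        weights.append(joint(a))
    a[var] = random.choices(range(CARD[var]), weights=weights)[0]

evidence = {"L": 1}                                    # observe a strong letter (l^1)
state = {"D": 0, "I": 0, "G": 0, "S": 0, **evidence}   # arbitrary initial assignment
count = 0
n_samples = 20000
for _ in range(n_samples):
    for var in ["D", "I", "G", "S"]:                   # never resample the evidence variable
        gibbs_step(state, var)
    count += state["I"]
print("Estimated Pr(i^1 | l^1) ~", count / n_samples)
```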
