Probabilistic Graphical Models Lecture 12 – Dynamical Models CS/CNS/EE 155 Andreas Krause
Announcements Homework 3 out tonight Start early!! Project milestones due today Please email to TAs 2
Parameter learning for log-linear models Feature functions φ i (C i ) defined over cliques Log linear model over undirected graph G Feature functions φ 1 (C 1 ),…, φ k (C k ) Domains C i can overlap Joint distribution (see below) How do we get weights w i ? 3
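The joint distribution on this slide is a formula image in the original deck; a standard way to write the log-linear model implied by these definitions (a sketch, not necessarily the exact notation used in lecture) is:

```latex
P(x_1,\dots,x_n) = \frac{1}{Z(\mathbf{w})} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(\mathbf{c}_i) \Big),
\qquad
Z(\mathbf{w}) = \sum_{x_1,\dots,x_n} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(\mathbf{c}_i) \Big),
```

where c_i denotes the assignment to the clique variables C_i under (x_1,…,x_n).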
Log-linear conditional random field Define log-linear model over outputs Y No assumptions about inputs X Feature functions φ i (C i ,x) defined over cliques and inputs Joint distribution 4
Example: CRFs in NLP Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Y 7 Y 8 Y 9 Y 10 Y 11 Y 12 X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 Mrs. Greene spoke today in New York. Green chairs the finance committee Classify into Person, Location or Other 5
Example: CRFs in vision 6
Gradient of conditional log-likelihood Partial derivative (see below) Requires one inference per training example Can optimize using conjugate gradient 7
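The partial derivative itself is a formula image in the original; for training pairs (x^(j), y^(j)), the standard gradient of the conditional log-likelihood of a log-linear CRF with respect to w_i, in the notation of the previous slides, is:

```latex
\frac{\partial}{\partial w_i} \sum_j \log P\big(y^{(j)} \mid x^{(j)}, \mathbf{w}\big)
= \sum_j \Big( \phi_i\big(\mathbf{c}_i^{(j)}, x^{(j)}\big)
  - \mathbb{E}_{Y \sim P(\cdot \mid x^{(j)}, \mathbf{w})}\big[ \phi_i(C_i, x^{(j)}) \big] \Big).
```

Evaluating each expectation requires inference in the CRF conditioned on x^(j), which is where the cost of one inference per training example comes from.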
Exponential Family Distributions Distributions for log-linear models More generally: Exponential family distributions h(x): Base measure w: natural parameters φ(x): Sufficient statistics A(w): log-partition function Here x can be continuous (defined over any set) 8
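In symbols, the exponential family form with the ingredients listed above is:

```latex
P(x \mid \mathbf{w}) = h(x) \, \exp\big( \mathbf{w}^{\top} \phi(x) - A(\mathbf{w}) \big),
\qquad
A(\mathbf{w}) = \log \int h(x) \, \exp\big( \mathbf{w}^{\top} \phi(x) \big) \, dx,
```

with the integral replaced by a sum for discrete x.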
Examples h(x): Base measure Exp. Family: w: natural parameters φ(x): Sufficient statistics A(w): log-partition function Gaussian distribution Other examples: Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, … 9
Moments and gradients Correspondence between moments and log-partition function (just like in log-linear models) Can compute moments from derivatives, and derivatives from moments! MLE ⇔ moment matching 10
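The correspondence referred to above is the standard log-partition identity; setting the log-likelihood gradient to zero then gives the moment-matching condition (written here as a sketch in the notation of the previous slides):

```latex
\nabla_{\mathbf{w}} A(\mathbf{w}) = \mathbb{E}_{\mathbf{w}}[\phi(X)], \qquad
\nabla^2_{\mathbf{w}} A(\mathbf{w}) = \mathrm{Cov}_{\mathbf{w}}[\phi(X)],
\qquad
\text{MLE: } \; \mathbb{E}_{\hat{\mathbf{w}}}[\phi(X)] = \frac{1}{m} \sum_{j=1}^{m} \phi\big(x^{(j)}\big).
```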
Conjugate priors in Exponential Family Any exponential family likelihood has a conjugate prior 11
Exponential family graphical models So far, only defined graphical models over discrete variables. Can define GMs over continuous distributions! For exponential family distributions: Can do much of what we discussed (VE, JT, parameter learning, etc.) for such exponential family models Important example: Gaussian Networks 12
Multivariate Gaussian distribution Joint distribution over n random variables P(X 1 ,…X n ) σ jk = E[ (X j – µ j ) (X k - µ k ) ] X j and X k independent ⇒ σ jk = 0 13
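For reference (the density itself is an image on the slide), the multivariate Gaussian with mean µ and covariance Σ = [σ jk ] is:

```latex
p(x_1,\dots,x_n) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}}
\exp\Big( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \Big).
```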
Marginalization Suppose (X 1 ,…,X n ) ~ N( µ , Σ ) What is P(X 1 )?? More generally: Let A={i 1 ,…,i k } ⊆ {1,…,n} Write X A = (X i1 ,…,X ik ) X A ~ N( µ A , Σ AA ) 14
Conditioning Suppose (X 1 ,…,X n ) ~ N( µ , Σ ) Decompose as (X A ,X B ) What is P(X A | X B )?? P(X A = x A | X B = x B ) = N(x A ; µ A|B , Σ A|B ), where µ A|B and Σ A|B are given below Computable using linear algebra! 15
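The conditional mean and covariance referred to above are the standard Gaussian conditioning formulas, µ A|B = µ A + Σ AB Σ BB ⁻¹ (x B − µ B ) and Σ A|B = Σ AA − Σ AB Σ BB ⁻¹ Σ BA . A minimal NumPy sketch of "computable using linear algebra" follows; the function name, index arguments, and example numbers are illustrative, not from the lecture.

```python
import numpy as np

def condition_gaussian(mu, Sigma, A, B, x_B):
    """Compute P(X_A | X_B = x_B) for (X_A, X_B) ~ N(mu, Sigma).

    A, B are index lists selecting the blocks of mu and Sigma.
    Returns the conditional mean and covariance.
    """
    mu_A, mu_B = mu[A], mu[B]
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    # K = Sigma_AB Sigma_BB^{-1}, computed via a linear solve instead of an explicit inverse.
    K = np.linalg.solve(S_BB, S_AB.T).T
    mu_cond = mu_A + K @ (x_B - mu_B)
    Sigma_cond = S_AA - K @ S_AB.T
    return mu_cond, Sigma_cond

# Illustrative 3-variable example: condition X_0 on X_1 and X_2.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, S = condition_gaussian(mu, Sigma, A=[0], B=[1, 2], x_B=np.array([2.0, 0.0]))
print(m, S)
```

Marginalization (previous slide) is even simpler in this representation: just select mu[A] and Sigma[np.ix_(A, A)].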
Conditional linear Gaussians 16
Canonical Representation Multivariate Gaussians in exponential family! Standard vs canonical form: η = Σ -1 µ Λ = Σ -1 17
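Spelled out, the canonical (information) form with natural parameters (η, Λ) is:

```latex
p(x) \propto \exp\Big( \eta^{\top} x - \tfrac{1}{2} x^{\top} \Lambda x \Big),
\qquad \Lambda = \Sigma^{-1}, \quad \eta = \Sigma^{-1} \mu,
```

which matches the exponential family template with sufficient statistics (x, x xᵀ).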
Gaussian Networks Zeros in precision matrix Λ indicate missing edges in log-linear model! 18
Inference in Gaussian Networks Can compute marginal distributions in O(n 3 )! For large numbers n of variables, this is still too expensive If Gaussian Network has low treewidth, can use variable elimination / JT inference! Need to be able to multiply and marginalize factors! 19
Multiplying factors in Gaussians 20
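The worked example on this slide is an image in the original; the key fact, consistent with the canonical form above, is that multiplying Gaussian factors in canonical form simply adds their natural parameters (factors over subsets of variables are first padded with zeros to a common scope):

```latex
\exp\Big( \eta_1^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_1 x \Big) \cdot
\exp\Big( \eta_2^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_2 x \Big)
= \exp\Big( (\eta_1 + \eta_2)^{\top} x - \tfrac{1}{2} x^{\top} (\Lambda_1 + \Lambda_2) x \Big).
```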
Conditioning in canonical form Joint distribution (X A , X B ) ~ N( µ AB , Σ AB ) Conditioning: P(X A | X B = x B ) = N(x A ; µ A|B=xB , Σ A|B=xB ) 21
Marginalizing in canonical form Recall conversion formulas η = Σ -1 µ Λ = Σ -1 Marginal distribution 22
Standard vs. canonical form Marginalization and conditioning in standard form vs. canonical form: ⇒ In standard form, marginalization is easy ⇒ In canonical form, conditioning is easy! 23
Variable elimination In Gaussian Markov Networks, Variable elimination = Gaussian elimination (fast for low bandwidth = low treewidth matrices) 24
Dynamical models 25
HMMs / Kalman Filters Most famous graphical models: Naïve Bayes model, Hidden Markov model, Kalman Filter Hidden Markov models: speech recognition, sequence analysis in comp. bio Kalman Filters: control (cruise control in cars), GPS navigation devices, tracking missiles, … Very simple models, but very powerful!! 26
HMMs / Kalman Filters X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 X 1 ,…,X T : Unobserved (hidden) variables Y 1 ,…,Y T : Observations HMMs: X i Multinomial, Y i arbitrary Kalman Filters: X i , Y i Gaussian distributions Non-linear KF: X i Gaussian, Y i arbitrary 27
HMMs for speech recognition Words X 1 X 2 X 3 X 4 X 5 X 6 Phoneme Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 “He ate the cookies on the couch” Infer spoken words from audio signals 28
Hidden Markov Models X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Inference: In principle, can use VE, JT etc. New variables X t , Y t at each time step ⇒ need to rerun Bayesian Filtering: Suppose we already have computed P(X t | y 1,…,t ) Want to efficiently compute P(X t+1 | y 1,…,t+1 ) 29
Bayesian filtering X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Start with P(X 1 ) At time t Assume we have P(X t | y 1…t-1 ) Condition: P(X t | y 1…t ) Prediction: P(X t+1 , X t | y 1…t ) Marginalization: P(X t+1 | y 1…t ) 30
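A minimal NumPy sketch of one condition / predict / marginalize cycle for a discrete HMM; the matrix layout (T[i, j] = P(X_{t+1}=j | X_t=i), O[i, y] = P(Y_t=y | X_t=i)), the function name, and the toy numbers are illustrative assumptions, not from the lecture.

```python
import numpy as np

def filter_step(belief, y_t, T, O):
    """One Bayesian filtering step for a discrete HMM.

    belief : P(X_t | y_{1:t-1}) as a length-k probability vector
    y_t    : observed symbol at time t (column index into O)
    T      : transition matrix, T[i, j] = P(X_{t+1}=j | X_t=i)
    O      : observation matrix, O[i, y] = P(Y_t=y | X_t=i)
    Returns P(X_{t+1} | y_{1:t}).
    """
    # Condition: multiply by the observation likelihood and renormalize.
    conditioned = belief * O[:, y_t]
    conditioned /= conditioned.sum()
    # Predict + marginalize: sum over x_t of P(X_{t+1} | x_t) P(x_t | y_{1:t}).
    return conditioned @ T

# Illustrative 2-state example.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
O = np.array([[0.7, 0.3], [0.1, 0.9]])
belief = np.array([0.5, 0.5])          # P(X_1)
for y in [0, 0, 1, 1]:                 # toy observation sequence y_1..y_4
    belief = filter_step(belief, y, T, O)
print(belief)                          # P(X_5 | y_{1:4})
```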
Parameter learning in HMMs Assume we have labels for hidden variables Assume stationarity P(X t+1 | X t ) is same over all time steps P(Y t | X t ) is same over all time steps Violates parameter independence (⇒ parameter “sharing”) Example: compute parameters for P(X t+1 =x | X t =x’) What if we don’t have labels for hidden vars? ⇒ Use EM (later this course) 31
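A minimal sketch of the "compute parameters for P(X t+1 =x | X t =x’)" example, assuming fully observed state sequences: because of parameter sharing, transition counts are pooled across all time steps and then normalized per row. The function name, the pseudo-count argument, and the toy data are illustrative assumptions.

```python
import numpy as np

def mle_transition_matrix(state_sequences, k, alpha=0.0):
    """Estimate the shared transition model P(X_{t+1}=x' | X_t=x).

    state_sequences : list of fully observed state sequences (values in 0..k-1)
    k               : number of hidden states
    alpha           : pseudo-count per transition (0 gives the plain MLE)
    """
    counts = np.full((k, k), alpha, dtype=float)
    for seq in state_sequences:
        for x_t, x_next in zip(seq[:-1], seq[1:]):
            counts[x_t, x_next] += 1.0   # pooled over all t, by stationarity
    return counts / counts.sum(axis=1, keepdims=True)

T_hat = mle_transition_matrix([[0, 0, 1, 1, 0], [1, 1, 1, 0]], k=2, alpha=1.0)
print(T_hat)
```

The emission model P(Y t | X t ) is estimated the same way, counting (x_t, y_t) pairs pooled over time.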
Kalman Filters (Gaussian HMMs) X 1 ,…,X T : Location of object being tracked Y 1 ,…,Y T : Observations P(X 1 ): Prior belief about location at time 1 P(X t+1 |X t ): “Motion model” How do I expect my target to move in the environment? Represented as CLG: X t+1 = A X t + N(0, Σ ε ) P(Y t | X t ): “Sensor model” What do I observe if target is at location X t ? Represented as CLG: Y t = H X t + N(0, Σ η ) X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 32
Understanding Motion model 33
Understanding sensor model 34
Bayesian Filtering for KFs Can use Gaussian elimination to perform inference in “unrolled” model X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Start with prior belief P(X 1 ) At every timestep have belief P(X t | y 1:t-1 ) Condition on observation: P(X t | y 1:t ) Predict (multiply motion model): P(X t+1 ,X t | y 1:t ) “Roll-up” (marginalize prev. time): P(X t+1 | y 1:t ) 35
Implementation Current belief: P(x t | y 1:t-1 ) = N(x t ; µ Xt , Σ Xt ) Multiply sensor and motion model Marginalize 36
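A minimal NumPy sketch of the condition / predict / roll-up cycle for the CLG model above; here the motion and sensor noise covariances are called Q and R (corresponding to Σ ε and Σ η on the Kalman filter slide), and all matrices and numbers in the example are illustrative assumptions.

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, Q, H, R):
    """From P(X_t | y_{1:t-1}) = N(mu, Sigma) to P(X_{t+1} | y_{1:t}).

    A, Q : motion model  X_{t+1} = A X_t + N(0, Q)
    H, R : sensor model  Y_t    = H X_t + N(0, R)
    """
    # Condition on y_t (measurement update).
    S = H @ Sigma @ H.T + R                 # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_c = mu + K @ (y - H @ mu)
    Sigma_c = Sigma - K @ H @ Sigma
    # Predict with the motion model and marginalize out X_t (roll-up).
    mu_next = A @ mu_c
    Sigma_next = A @ Sigma_c @ A.T + Q
    return mu_next, Sigma_next

# Illustrative 1-D tracking example.
A = np.array([[1.0]]); Q = np.array([[0.1]])
H = np.array([[1.0]]); R = np.array([[0.5]])
mu, Sigma = np.array([0.0]), np.array([[1.0]])   # prior belief P(X_1)
for y in [0.9, 1.1, 1.0]:
    mu, Sigma = kalman_step(mu, Sigma, np.array([y]), A, Q, H, R)
print(mu, Sigma)
```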
What if observations not “linear”? Linear observations: Y t = H X t + noise Nonlinear observations: 37
Incorporating non-Gaussian observations Nonlinear observation ⇒ P(Y t | X t ) not Gaussian First approach: Approximate P(Y t | X t ) as CLG Linearize P(Y t | X t ) around current estimate E[X t | y 1..t-1 ] Known as Extended Kalman Filter (EKF) Can perform poorly if P(Y t | X t ) highly nonlinear Second approach: Approximate P(Y t , X t ) as Gaussian Takes correlation in X t into account After obtaining approximation, condition on Y t =y t (now a “linear” observation) 38
Finding Gaussian approximations Need to find Gaussian approximation of P(X t ,Y t ) How? Gaussians in Exponential Family ⇒ Moment matching!! E[Y t ] = E[Y t 2 ] = E[X t Y t ] = 39
Linearization by integration Need to integrate product of Gaussian with arbitrary function Can do that by numerical integration Approximate integral as weighted sum of evaluation points Gaussian quadrature defines locations and weights of points For 1 dim: Exact for polynomials of degree 2D-1 if choosing D points using Gaussian quadrature For higher dimensions: Need exponentially many points to achieve exact evaluation for polynomials Application of this is known as “Unscented” Kalman Filter (UKF) 40
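A minimal sketch of the 1-dim case: approximating E[g(X)] for Gaussian X by Gauss-Hermite quadrature, which is the kind of weighted-sum-of-evaluation-points rule described above. The function g, the number of points, and the example values are illustrative; NumPy's hermegauss supplies nodes and weights for the weight function exp(-z^2/2).

```python
import numpy as np

def gauss_hermite_expectation(g, mu, sigma, num_points=5):
    """Approximate E[g(X)] for X ~ N(mu, sigma^2) by Gauss-Hermite quadrature.

    With num_points = D nodes, the rule is exact when g is a polynomial
    of degree at most 2D - 1.
    """
    z, w = np.polynomial.hermite_e.hermegauss(num_points)
    # E[g(X)] = (1/sqrt(2*pi)) * integral of g(mu + sigma*z) * exp(-z^2/2) dz
    return (w @ g(mu + sigma * z)) / np.sqrt(2.0 * np.pi)

# Illustrative nonlinear observation function.
g = lambda x: np.sin(x) ** 2
print(gauss_hermite_expectation(g, mu=0.5, sigma=1.0, num_points=7))
```

The same kind of moment computation, applied to E[Y t ], E[Y t 2 ], and E[X t Y t ] from the previous slide, yields the joint Gaussian approximation of P(X t , Y t ).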
Factored dynamical models So far: HMMs and Kalman filters X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 What if we have more than one variable at each time step? E.g., temperature at different locations, or road conditions in a road network? ⇒ Spatio-temporal models 41
Dynamic Bayesian Networks At every timestep have a Bayesian Network A 1 A 2 A 3 D 1 D 2 D 3 B 1 B 2 B 3 E 1 E 2 E 3 C 1 C 2 C 3 Variables at each time step t called a “slice” S t “Temporal” edges connecting S t+1 with S t 42
Tasks Read Koller & Friedman Chapters 6.2.3, 15.1 43