Probabilistic Graphical Models Lecture 12 – Dynamical Models CS/CNS/EE 155 Andreas Krause
Announcements Homework 3 out tonight Start early!! Project milestones due today Please email to TAs 2
Parameter learning for log-linear models Feature functions φ i (C i ) defined over cliques Log linear model over undirected graph G Feature functions φ 1 (C 1 ),…, φ k (C k ) Domains C i can overlap Joint distribution (see below) How do we get weights w i ? 3
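The joint distribution on this slide is a formula image in the original deck; a standard way to write the log-linear model implied by these definitions (a sketch, not necessarily the exact notation used in lecture) is:

```latex
P(x_1,\dots,x_n) = \frac{1}{Z(\mathbf{w})} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(\mathbf{c}_i) \Big),
\qquad
Z(\mathbf{w}) = \sum_{x_1,\dots,x_n} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(\mathbf{c}_i) \Big),
```

where c_i denotes the assignment to the clique variables C_i under (x_1,…,x_n).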
Log-linear conditional random field Define log-linear model over outputs Y No assumptions about inputs X Feature functions φ i (C i ,x) defined over cliques and inputs Joint distribution 4
Example: CRFs in NLP Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Y 7 Y 8 Y 9 Y 10 Y 11 Y 12 X 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X 9 X 10 X 11 X 12 Mrs. Greene spoke today in New York. Green chairs the finance committee Classify into Person, Location or Other 5
Example: CRFs in vision 6
Gradient of conditional log-likelihood Partial derivative (see below) Requires one inference per training example Can optimize using conjugate gradient 7
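The partial derivative itself is a formula image in the original; for training pairs (x^(j), y^(j)), the standard gradient of the conditional log-likelihood of a log-linear CRF with respect to w_i, in the notation of the previous slides, is:

```latex
\frac{\partial}{\partial w_i} \sum_j \log P\big(y^{(j)} \mid x^{(j)}, \mathbf{w}\big)
= \sum_j \Big( \phi_i\big(\mathbf{c}_i^{(j)}, x^{(j)}\big)
  - \mathbb{E}_{Y \sim P(\cdot \mid x^{(j)}, \mathbf{w})}\big[ \phi_i(C_i, x^{(j)}) \big] \Big).
```

Evaluating each expectation requires inference in the CRF conditioned on x^(j), which is where the cost of one inference per training example comes from.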
Exponential Family Distributions Distributions for log-linear models More generally: Exponential family distributions h(x): Base measure w: natural parameters φ(x): Sufficient statistics A(w): log-partition function Here x can be continuous (defined over any set) 8
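In symbols, the exponential family form with the ingredients listed above is:

```latex
P(x \mid \mathbf{w}) = h(x) \, \exp\big( \mathbf{w}^{\top} \phi(x) - A(\mathbf{w}) \big),
\qquad
A(\mathbf{w}) = \log \int h(x) \, \exp\big( \mathbf{w}^{\top} \phi(x) \big) \, dx,
```

with the integral replaced by a sum for discrete x.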
Examples h(x): Base measure Exp. Family: w: natural parameters φ(x): Sufficient statistics A(w): log-partition function Gaussian distribution Other examples: Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, … 9
Moments and gradients Correspondence between moments and log-partition function (just like in log-linear models) Can compute moments from derivatives, and derivatives from moments! MLE ⇔ moment matching 10
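The correspondence referred to above is the standard log-partition identity; setting the log-likelihood gradient to zero then gives the moment-matching condition (written here as a sketch in the notation of the previous slides):

```latex
\nabla_{\mathbf{w}} A(\mathbf{w}) = \mathbb{E}_{\mathbf{w}}[\phi(X)], \qquad
\nabla^2_{\mathbf{w}} A(\mathbf{w}) = \mathrm{Cov}_{\mathbf{w}}[\phi(X)],
\qquad
\text{MLE: } \; \mathbb{E}_{\hat{\mathbf{w}}}[\phi(X)] = \frac{1}{m} \sum_{j=1}^{m} \phi\big(x^{(j)}\big).
```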
Conjugate priors in Exponential Family Any exponential family likelihood has a conjugate prior 11
Exponential family graphical models So far, only defined graphical models over discrete variables. Can define GMs over continuous distributions! For exponential family distributions: Can do much of what we discussed (VE, JT, parameter learning, etc.) for such exponential family models Important example: Gaussian Networks 12
Multivariate Gaussian distribution Joint distribution over n random variables P(X 1 ,…X n ) σ jk = E[ (X j – µ j ) (X k - µ k ) ] X j and X k independent ⇒ σ jk = 0 13
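For reference (the density itself is an image on the slide), the multivariate Gaussian with mean µ and covariance Σ = [σ jk ] is:

```latex
p(x_1,\dots,x_n) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}}
\exp\Big( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \Big).
```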
Marginalization Suppose (X 1 ,…,X n ) ~ N( µ , Σ ) What is P(X 1 )?? More generally: Let A={i 1 ,…,i k } ⊆ {1,…,n} Write X A = (X i1 ,…,X ik ) X A ~ N( µ A , Σ AA ) 14
Conditioning Suppose (X 1 ,…,X n ) ~ N( µ , Σ ) Decompose as (X A ,X B ) What is P(X A | X B )?? P(X A = x A | X B = x B ) = N(x A ; µ A|B , Σ A|B ), where µ A|B and Σ A|B are given below Computable using linear algebra! 15
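The conditional mean and covariance referred to above are the standard Gaussian conditioning formulas, µ A|B = µ A + Σ AB Σ BB ⁻¹ (x B − µ B ) and Σ A|B = Σ AA − Σ AB Σ BB ⁻¹ Σ BA . A minimal NumPy sketch of "computable using linear algebra" follows; the function name, index arguments, and example numbers are illustrative, not from the lecture.

```python
import numpy as np

def condition_gaussian(mu, Sigma, A, B, x_B):
    """Compute P(X_A | X_B = x_B) for (X_A, X_B) ~ N(mu, Sigma).

    A, B are index lists selecting the blocks of mu and Sigma.
    Returns the conditional mean and covariance.
    """
    mu_A, mu_B = mu[A], mu[B]
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    # K = Sigma_AB Sigma_BB^{-1}, computed via a linear solve instead of an explicit inverse.
    K = np.linalg.solve(S_BB, S_AB.T).T
    mu_cond = mu_A + K @ (x_B - mu_B)
    Sigma_cond = S_AA - K @ S_AB.T
    return mu_cond, Sigma_cond

# Illustrative 3-variable example: condition X_0 on X_1 and X_2.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, S = condition_gaussian(mu, Sigma, A=[0], B=[1, 2], x_B=np.array([2.0, 0.0]))
print(m, S)
```

Marginalization (previous slide) is even simpler in this representation: just select mu[A] and Sigma[np.ix_(A, A)].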
Conditional linear Gaussians 16
Canonical Representation Multivariate Gaussians in exponential family! Standard vs canonical form: η = Σ -1 µ Λ = Σ -1 17
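Spelled out, the canonical (information) form with natural parameters (η, Λ) is:

```latex
p(x) \propto \exp\Big( \eta^{\top} x - \tfrac{1}{2} x^{\top} \Lambda x \Big),
\qquad \Lambda = \Sigma^{-1}, \quad \eta = \Sigma^{-1} \mu,
```

which matches the exponential family template with sufficient statistics (x, x xᵀ).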
Gaussian Networks Zeros in precision matrix Λ indicate missing edges in log-linear model! 18
Inference in Gaussian Networks Can compute marginal distributions in O(n 3 )! For large numbers n of variables, this is still too expensive If Gaussian Network has low treewidth, can use variable elimination / JT inference! Need to be able to multiply and marginalize factors! 19
Multiplying factors in Gaussians 20
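The worked example on this slide is an image in the original; the key fact, consistent with the canonical form above, is that multiplying Gaussian factors in canonical form simply adds their natural parameters (factors over subsets of variables are first padded with zeros to a common scope):

```latex
\exp\Big( \eta_1^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_1 x \Big) \cdot
\exp\Big( \eta_2^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_2 x \Big)
= \exp\Big( (\eta_1 + \eta_2)^{\top} x - \tfrac{1}{2} x^{\top} (\Lambda_1 + \Lambda_2) x \Big).
```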
Conditioning in canonical form Joint distribution (X A , X B ) ~ N( µ AB , Σ AB ) Conditioning: P(X A | X B = x B ) = N(x A ; µ A|B=xB , Σ A|B=xB ) 21
Marginalizing in canonical form Recall conversion formulas η = Σ -1 µ Λ = Σ -1 Marginal distribution 22
Standard vs. canonical form Marginalization and conditioning in standard form vs. canonical form: ⇒ In standard form, marginalization is easy ⇒ In canonical form, conditioning is easy! 23
Variable elimination In Gaussian Markov Networks, Variable elimination = Gaussian elimination (fast for low bandwidth = low treewidth matrices) 24
Dynamical models 25
HMMs / Kalman Filters Most famous graphical models: Naïve Bayes model, Hidden Markov model, Kalman Filter Hidden Markov models: speech recognition, sequence analysis in comp. bio Kalman Filters: control (cruise control in cars), GPS navigation devices, tracking missiles, … Very simple models, but very powerful!! 26
HMMs / Kalman Filters X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 X 1 ,…,X T : Unobserved (hidden) variables Y 1 ,…,Y T : Observations HMMs: X i Multinomial, Y i arbitrary Kalman Filters: X i , Y i Gaussian distributions Non-linear KF: X i Gaussian, Y i arbitrary 27
HMMs for speech recognition Words X 1 X 2 X 3 X 4 X 5 X 6 Phoneme Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 “He ate the cookies on the couch” Infer spoken words from audio signals 28
Hidden Markov Models X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Inference: In principle, can use VE, JT etc. New variables X t , Y t at each time step ⇒ need to rerun Bayesian Filtering: Suppose we already have computed P(X t | y 1,…,t ) Want to efficiently compute P(X t+1 | y 1,…,t+1 ) 29
Bayesian filtering X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Start with P(X 1 ) At time t Assume we have P(X t | y 1…t-1 ) Condition: P(X t | y 1…t ) Prediction: P(X t+1 , X t | y 1…t ) Marginalization: P(X t+1 | y 1…t ) 30
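A minimal NumPy sketch of one condition / predict / marginalize cycle for a discrete HMM; the matrix layout (T[i, j] = P(X_{t+1}=j | X_t=i), O[i, y] = P(Y_t=y | X_t=i)), the function name, and the toy numbers are illustrative assumptions, not from the lecture.

```python
import numpy as np

def filter_step(belief, y_t, T, O):
    """One Bayesian filtering step for a discrete HMM.

    belief : P(X_t | y_{1:t-1}) as a length-k probability vector
    y_t    : observed symbol at time t (column index into O)
    T      : transition matrix, T[i, j] = P(X_{t+1}=j | X_t=i)
    O      : observation matrix, O[i, y] = P(Y_t=y | X_t=i)
    Returns P(X_{t+1} | y_{1:t}).
    """
    # Condition: multiply by the observation likelihood and renormalize.
    conditioned = belief * O[:, y_t]
    conditioned /= conditioned.sum()
    # Predict + marginalize: sum over x_t of P(X_{t+1} | x_t) P(x_t | y_{1:t}).
    return conditioned @ T

# Illustrative 2-state example.
T = np.array([[0.9, 0.1], [0.2, 0.8]])
O = np.array([[0.7, 0.3], [0.1, 0.9]])
belief = np.array([0.5, 0.5])          # P(X_1)
for y in [0, 0, 1, 1]:                 # toy observation sequence y_1..y_4
    belief = filter_step(belief, y, T, O)
print(belief)                          # P(X_5 | y_{1:4})
```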
Parameter learning in HMMs Assume we have labels for hidden variables Assume stationarity P(X t+1 | X t ) is same over all time steps P(Y t | X t ) is same over all time steps Violates parameter independence (⇒ parameter “sharing”) Example: compute parameters for P(X t+1 =x | X t =x’) What if we don’t have labels for hidden vars? ⇒ Use EM (later this course) 31
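A minimal sketch of the "compute parameters for P(X t+1 =x | X t =x’)" example, assuming fully observed state sequences: because of parameter sharing, transition counts are pooled across all time steps and then normalized per row. The function name, the pseudo-count argument, and the toy data are illustrative assumptions.

```python
import numpy as np

def mle_transition_matrix(state_sequences, k, alpha=0.0):
    """Estimate the shared transition model P(X_{t+1}=x' | X_t=x).

    state_sequences : list of fully observed state sequences (values in 0..k-1)
    k               : number of hidden states
    alpha           : pseudo-count per transition (0 gives the plain MLE)
    """
    counts = np.full((k, k), alpha, dtype=float)
    for seq in state_sequences:
        for x_t, x_next in zip(seq[:-1], seq[1:]):
            counts[x_t, x_next] += 1.0   # pooled over all t, by stationarity
    return counts / counts.sum(axis=1, keepdims=True)

T_hat = mle_transition_matrix([[0, 0, 1, 1, 0], [1, 1, 1, 0]], k=2, alpha=1.0)
print(T_hat)
```

The emission model P(Y t | X t ) is estimated the same way, counting (x_t, y_t) pairs pooled over time.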
Kalman Filters (Gaussian HMMs) X 1 ,…,X T : Location of object being tracked Y 1 ,…,Y T : Observations P(X 1 ): Prior belief about location at time 1 P(X t+1 |X t ): “Motion model” How do I expect my target to move in the environment? Represented as CLG: X t+1 = A X t + N(0, Σ ε ) P(Y t | X t ): “Sensor model” What do I observe if target is at location X t ? Represented as CLG: Y t = H X t + N(0, Σ η ) X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 32
Understanding Motion model 33
Understanding sensor model 34
Bayesian Filtering for KFs Can use Gaussian elimination to perform inference in “unrolled” model X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 Start with prior belief P(X 1 ) At every timestep have belief P(X t | y 1:t-1 ) Condition on observation: P(X t | y 1:t ) Predict (multiply motion model): P(X t+1 ,X t | y 1:t ) “Roll-up” (marginalize prev. time): P(X t+1 | y 1:t ) 35
Implementation Current belief: P(x t | y 1:t-1 ) = N(x t ; µ Xt , Σ Xt ) Multiply sensor and motion model Marginalize 36
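A minimal NumPy sketch of the condition / predict / roll-up cycle for the CLG model above; here the motion and sensor noise covariances are called Q and R (corresponding to Σ ε and Σ η on the Kalman filter slide), and all matrices and numbers in the example are illustrative assumptions.

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, Q, H, R):
    """From P(X_t | y_{1:t-1}) = N(mu, Sigma) to P(X_{t+1} | y_{1:t}).

    A, Q : motion model  X_{t+1} = A X_t + N(0, Q)
    H, R : sensor model  Y_t    = H X_t + N(0, R)
    """
    # Condition on y_t (measurement update).
    S = H @ Sigma @ H.T + R                 # innovation covariance
    K = Sigma @ H.T @ np.linalg.inv(S)      # Kalman gain
    mu_c = mu + K @ (y - H @ mu)
    Sigma_c = Sigma - K @ H @ Sigma
    # Predict with the motion model and marginalize out X_t (roll-up).
    mu_next = A @ mu_c
    Sigma_next = A @ Sigma_c @ A.T + Q
    return mu_next, Sigma_next

# Illustrative 1-D tracking example.
A = np.array([[1.0]]); Q = np.array([[0.1]])
H = np.array([[1.0]]); R = np.array([[0.5]])
mu, Sigma = np.array([0.0]), np.array([[1.0]])   # prior belief P(X_1)
for y in [0.9, 1.1, 1.0]:
    mu, Sigma = kalman_step(mu, Sigma, np.array([y]), A, Q, H, R)
print(mu, Sigma)
```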
What if observations not “linear”? Linear observations: Y t = H X t + noise Nonlinear observations: 37
Incorporating non-Gaussian observations Nonlinear observation ⇒ P(Y t | X t ) not Gaussian First approach: Approximate P(Y t | X t ) as CLG Linearize P(Y t | X t ) around current estimate E[X t | y 1..t-1 ] Known as Extended Kalman Filter (EKF) Can perform poorly if P(Y t | X t ) highly nonlinear Second approach: Approximate P(Y t , X t ) as Gaussian Takes correlation in X t into account After obtaining approximation, condition on Y t =y t (now a “linear” observation) 38
Finding Gaussian approximations Need to find Gaussian approximation of P(X t ,Y t ) How? Gaussians in Exponential Family ⇒ Moment matching!! E[Y t ] = E[Y t 2 ] = E[X t Y t ] = 39
Linearization by integration Need to integrate product of Gaussian with arbitrary function Can do that by numerical integration Approximate integral as weighted sum of evaluation points Gaussian quadrature defines locations and weights of points For 1 dim: Exact for polynomials of degree 2D-1 if choosing D points using Gaussian quadrature For higher dimensions: Need exponentially many points to achieve exact evaluation for polynomials Application of this is known as “Unscented” Kalman Filter (UKF) 40
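A minimal sketch of the 1-dim case: approximating E[g(X)] for Gaussian X by Gauss-Hermite quadrature, which is the kind of weighted-sum-of-evaluation-points rule described above. The function g, the number of points, and the example values are illustrative; NumPy's hermegauss supplies nodes and weights for the weight function exp(-z^2/2).

```python
import numpy as np

def gauss_hermite_expectation(g, mu, sigma, num_points=5):
    """Approximate E[g(X)] for X ~ N(mu, sigma^2) by Gauss-Hermite quadrature.

    With num_points = D nodes, the rule is exact when g is a polynomial
    of degree at most 2D - 1.
    """
    z, w = np.polynomial.hermite_e.hermegauss(num_points)
    # E[g(X)] = (1/sqrt(2*pi)) * integral of g(mu + sigma*z) * exp(-z^2/2) dz
    return (w @ g(mu + sigma * z)) / np.sqrt(2.0 * np.pi)

# Illustrative nonlinear observation function.
g = lambda x: np.sin(x) ** 2
print(gauss_hermite_expectation(g, mu=0.5, sigma=1.0, num_points=7))
```

The same kind of moment computation, applied to E[Y t ], E[Y t 2 ], and E[X t Y t ] from the previous slide, yields the joint Gaussian approximation of P(X t , Y t ).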
Factored dynamical models So far: HMMs and Kalman filters X 1 X 2 X 3 X 4 X 5 X 6 Y 1 Y 2 Y 3 Y 4 Y 5 Y 6 What if we have more than one variable at each time step? E.g., temperature at different locations, or road conditions in a road network? ⇒ Spatio-temporal models 41
Dynamic Bayesian Networks At every timestep have a Bayesian Network A 1 A 2 A 3 D 1 D 2 D 3 B 1 B 2 B 3 E 1 E 2 E 3 C 1 C 2 C 3 Variables at each time step t called a “slice” S t “Temporal” edges connecting S t+1 with S t 42
Tasks Read Koller & Friedman Chapters 6.2.3, 15.1 43