

  1. Probabilistic Graphical Models, 10-708: Learning Partially Observed Graphical Models. Eric Xing. Lecture 13, Oct 26, 2005. Reading: MJ-Chap. 5, 10, 11.

     Partially observed GMs
     - Speech recognition: [Figure: a hidden Markov model with hidden state sequence $Y_1, Y_2, Y_3, \ldots, Y_T$ emitting the observation sequence $X_1, X_2, X_3, \ldots, X_T$.]

  2. Partially observed GM
     - Biological evolution: [Figure: a phylogenetic tree with an unobserved ancestor ("ancestor?"), branches spanning T years labeled $Q_m$ and $Q_h$, and observed leaf sequences G A G A and A C C A.]

     Unobserved Variables
     - A variable can be unobserved (latent) because:
       - it is an imaginary quantity meant to provide a simplified, abstract view of the data-generation process, e.g., speech recognition models, mixture models, ...
       - it is a real-world object or phenomenon that is difficult or impossible to measure, e.g., the temperature of a star, the causes of a disease, evolutionary ancestors, ...
       - it is a real-world object or phenomenon that sometimes simply wasn't measured, e.g., because of faulty sensors.
     - Discrete latent variables can be used to partition/cluster data into sub-groups.
     - Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.).

  3. Mixture models
     - A density model p(x) may be multi-modal.
     - We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians).
     - Each mode may correspond to a different sub-population (e.g., male and female).

     Gaussian Mixture Models (GMMs)
     - Consider a mixture of K Gaussian components ([Figure: two-node graphical model, latent Z pointing to observed X]):
       - Z is a latent class-indicator vector: $p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$
       - X is a conditional Gaussian variable with a class-specific mean/covariance:
         $p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\big\{ -\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \big\}$
       - The likelihood of a sample is a sum of mixture proportion times mixture component (evaluated numerically in the sketch after this slide):
         $p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma) = \sum_k \pi_k\, N(x_n : \mu_k, \Sigma_k) = \sum_{z_n} \prod_k \big(\pi_k\, N(x_n : \mu_k, \Sigma_k)\big)^{z_n^k}$
     - This model can be used for unsupervised clustering.
     - This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
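
The following is a minimal sketch (not from the lecture) of evaluating the GMM likelihood above with NumPy/SciPy; the component parameters `pis`, `mus`, and `Sigmas` are made-up illustrative values.

```python
# A minimal sketch of the GMM likelihood, assuming toy parameter values.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """log p(x_n | mu, Sigma) = log sum_k pi_k N(x_n | mu_k, Sigma_k), summed over n."""
    n, K = X.shape[0], len(pis)
    comp = np.zeros((n, K))
    for k in range(K):
        # weighted density of component k at every data point
        comp[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return np.sum(np.log(comp.sum(axis=1)))

# Example: two hypothetical 2-D Gaussian components.
pis = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
X = np.random.randn(100, 2)
print(gmm_log_likelihood(X, pis, mus, Sigmas))
```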

  4. Conditional mixture model: Mixture of experts
     - We will model p(Y | X) using different experts, each responsible for a different region of the input space.
     - A latent variable Z chooses the expert using a softmax gating function: $P(z^k = 1 \mid x) = \mathrm{Softmax}(\xi_k^T x)$
     - Each expert can be a linear regression model: $P(y \mid x, z^k = 1) = N(y;\, \theta_k^T x, \sigma_k^2)$
     - The posterior expert responsibilities are (see the sketch after this slide):
       $P(z^k = 1 \mid x, y, \theta) = \dfrac{p(z^k = 1 \mid x)\, p(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x)\, p(y \mid x, \theta_j, \sigma_j^2)}$

     Hierarchical mixture of experts
     - This is like a soft version of a depth-2 classification/regression tree.
     - P(Y | X, G1, G2) can be modeled as a GLIM, with parameters dependent on the values of G1 and G2 (which specify a "conditional path" to a given leaf in the tree).
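
Below is a small illustrative sketch of the gating and responsibility computation for the (non-hierarchical) mixture of experts above; the parameter arrays `xi`, `theta`, and `sigma2` are hypothetical toy values, not taken from the slides.

```python
# A minimal sketch: softmax gating, linear-Gaussian experts, and posterior responsibilities.
import numpy as np

def softmax(a):
    a = a - a.max()           # numerical stability
    e = np.exp(a)
    return e / e.sum()

def gaussian_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def expert_responsibilities(x, y, xi, theta, sigma2):
    """P(z^k=1 | x, y) proportional to P(z^k=1 | x) * N(y; theta_k^T x, sigma_k^2)."""
    gate = softmax(xi @ x)                    # gating: P(z^k = 1 | x), shape (K,)
    lik = gaussian_pdf(y, theta @ x, sigma2)  # expert likelihoods p(y | x, z^k = 1)
    post = gate * lik
    return post / post.sum()

# Toy example with K = 2 experts in a 2-D input space (made-up numbers).
xi = np.array([[1.0, 0.0], [-1.0, 0.0]])      # gating weights
theta = np.array([[2.0, 0.5], [-1.0, 1.0]])   # expert regression weights
sigma2 = np.array([0.5, 0.5])
print(expert_responsibilities(np.array([1.0, 2.0]), 3.0, xi, theta, sigma2))
```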

  5. Mixture of overlapping experts
     - By removing the X → Z arc, we can make the partitions independent of the input, thus allowing overlap.
     - This is a mixture of linear regressors; each subpopulation has a different conditional mean:
       $P(z^k = 1 \mid x, y, \theta) = \dfrac{p(z^k = 1)\, p(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1)\, p(y \mid x, \theta_j, \sigma_j^2)}$

     Why is Learning Harder?
     - In fully observed iid settings, the log likelihood decomposes into a sum of local terms (at least for directed models):
       $\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$
     - With latent variables, all the parameters become coupled together via marginalization (illustrated in the sketch after this slide):
       $\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)$
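
As a concrete illustration of the coupling argument above, here is a tiny made-up discrete model p(z) p(x | z): the complete log-likelihood splits into two separate terms, while the incomplete one is the log of a sum and does not decouple.

```python
# A toy numerical illustration; the distributions below are arbitrary examples.
import numpy as np

pi = np.array([0.4, 0.6])                  # p(z), parameterized by theta_z
px_given_z = np.array([[0.7, 0.3],         # p(x | z = 0), parameterized by theta_x
                       [0.2, 0.8]])        # p(x | z = 1)

x, z = 1, 0
complete_ll = np.log(pi[z]) + np.log(px_given_z[z, x])   # decouples into two terms

incomplete_ll = np.log(np.sum(pi * px_given_z[:, x]))    # log of a sum: no decoupling
print(complete_ll, incomplete_ll)
```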

  6. Gradient Learning for mixture models
     - We can learn mixture densities using gradient descent on the log likelihood. The gradients are quite interesting:
       $\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k\, p_k(x \mid \theta_k)$
       $\dfrac{\partial \ell}{\partial \theta} = \dfrac{1}{p(x \mid \theta)} \sum_k \pi_k \dfrac{\partial p_k(x \mid \theta_k)}{\partial \theta} = \sum_k \pi_k \dfrac{p_k(x \mid \theta_k)}{p(x \mid \theta)} \dfrac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k} = \sum_k r_k \dfrac{\partial \ell_k}{\partial \theta_k}$,
       where $r_k = \pi_k\, p_k(x \mid \theta_k) / p(x \mid \theta)$ is the responsibility of component k.
     - In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients.
     - Can pass this to a conjugate gradient routine.

     Parameter Constraints
     - Often we have constraints on the parameters, e.g. $\sum_k \pi_k = 1$ and $\Sigma$ symmetric positive definite (hence $\Sigma_{ii} > 0$).
     - We can use constrained optimization, or we can reparameterize in terms of unconstrained values (see the sketch after this slide):
       - For normalized weights, use the softmax transform: $\pi_k = \exp(\gamma_k) \big/ \sum_j \exp(\gamma_j)$.
       - For covariance matrices, use the Cholesky decomposition $\Sigma^{-1} = A^T A$, where A is upper triangular with a positive diagonal: $A_{ii} = \exp(\lambda_i) > 0$, $A_{ij} = \eta_{ij}$ for $j > i$, $A_{ij} = 0$ for $j < i$. The parameters $\gamma_i, \lambda_i, \eta_{ij} \in \mathbb{R}$ are unconstrained.
     - Use the chain rule to compute $\partial \ell / \partial \pi$ and $\partial \ell / \partial A$.
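
The sketch below illustrates the two reparameterizations above (softmax mixture weights and a Cholesky-style precision factor); it is a toy example, and the variable names `gamma`, `lam`, and `eta` simply mirror the slide's symbols.

```python
# A minimal sketch of the unconstrained reparameterizations, with made-up inputs.
import numpy as np

def weights_from_gamma(gamma):
    """pi_k = exp(gamma_k) / sum_j exp(gamma_j): always positive, sums to 1."""
    e = np.exp(gamma - gamma.max())
    return e / e.sum()

def precision_from_cholesky(lam, eta):
    """Build upper-triangular A with A_ii = exp(lam_i) > 0 and A_ij = eta_ij for j > i,
    then return Sigma^{-1} = A^T A, which is symmetric positive definite."""
    d = len(lam)
    A = np.zeros((d, d))
    A[np.diag_indices(d)] = np.exp(lam)
    A[np.triu_indices(d, k=1)] = eta
    return A.T @ A

pi = weights_from_gamma(np.array([0.2, -1.0, 0.5]))
P = precision_from_cholesky(np.array([0.0, 0.3]), np.array([0.7]))
print(pi.sum(), np.linalg.eigvalsh(P))   # 1.0 and strictly positive eigenvalues
```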

  7. Identifiability
     - A mixture model induces a multi-modal likelihood.
     - Hence gradient ascent can only find a local maximum.
     - Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood.
     - Hence we should be careful in trying to interpret the "meaning" of latent variables.

     Expectation-Maximization (EM) Algorithm
     - EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
     - It is much simpler than gradient methods:
       - No need to choose a step size.
       - Enforces constraints automatically.
       - Calls inference and fully observed learning as subroutines.
     - EM is an iterative algorithm with two linked steps (a worked sketch for a Gaussian mixture follows this slide):
       - E-step: fill in hidden values using inference, $p(z \mid x, \theta^t)$.
       - M-step: update the parameters $\theta^{t+1}$ using standard MLE/MAP methods applied to the completed data.
     - We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood.
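
For concreteness, here is a compact sketch of the two linked EM steps for a one-dimensional Gaussian mixture; the initialization and data are illustrative only, and this is not code from the course.

```python
# A minimal EM sketch for a 1-D GMM, assuming toy data and random initialization.
import numpy as np

def em_gmm_1d(x, K=2, iters=50):
    n = len(x)
    pi = np.full(K, 1.0 / K)
    mu = np.random.choice(x, K)
    var = np.full(K, np.var(x))
    for _ in range(iters):
        # E-step: responsibilities r[n, k] = p(z_n^k = 1 | x_n, theta_t)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLE on the completed data
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 0.5, 300)])
print(em_gmm_1d(x))
```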

  8. Complete & Incomplete Log Likelihoods
     - Complete log likelihood: let X denote the observable variable(s) and Z the latent variable(s). If Z could be observed, then
       $\ell_c(\theta; x, z) \stackrel{\mathrm{def}}{=} \log p(x, z \mid \theta)$
       - Usually, optimizing $\ell_c(\cdot)$ given both z and x is straightforward (c.f. MLE for fully observed models).
       - Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
       - But given that Z is not observed, $\ell_c(\cdot)$ is a random quantity and cannot be maximized directly.
     - Incomplete log likelihood: with z unobserved, our objective becomes the log of a marginal probability:
       $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
       - This objective won't decouple.

     Expected Complete Log Likelihood
     - For any distribution q(z), define the expected complete log likelihood:
       $\langle \ell_c(\theta; x, z) \rangle_q \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$
       - A deterministic function of $\theta$.
       - Linear in $\ell_c(\cdot)$, so it inherits its factorizability.
     - Does maximizing this surrogate yield a maximizer of the likelihood?
     - Jensen's inequality (checked numerically in the sketch after this slide):
       $\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \dfrac{p(x, z \mid \theta)}{q(z \mid x)} \ \geq\ \sum_z q(z \mid x) \log \dfrac{p(x, z \mid \theta)}{q(z \mid x)}$
       $\Rightarrow\ \ell(\theta; x) \geq \langle \ell_c(\theta; x, z) \rangle_q + H_q$
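
The following toy check (with a made-up joint table) illustrates Jensen's bound above numerically: any q(z) gives a lower bound on the incomplete log-likelihood, with equality when q(z) is the posterior p(z | x, theta).

```python
# A numerical check of the Jensen bound on a tiny discrete model; values are arbitrary.
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05])          # p(x, z | theta) for the observed x, z = 0, 1, 2
log_lik = np.log(p_xz.sum())                 # incomplete log-likelihood l(theta; x)

def lower_bound(q):
    # sum_z q(z) log [ p(x, z | theta) / q(z) ]
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q_arbitrary = np.array([0.3, 0.3, 0.4])
q_posterior = p_xz / p_xz.sum()              # p(z | x, theta)

print(log_lik, lower_bound(q_arbitrary), lower_bound(q_posterior))
# lower_bound(q_arbitrary) <= log_lik, and lower_bound(q_posterior) == log_lik
```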

  9. Lower Bounds and Free Energy
     - For fixed data x, define a functional called the free energy:
       $F(q, \theta) \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x) \log \dfrac{p(x, z \mid \theta)}{q(z \mid x)} \ \leq\ \ell(\theta; x)$
     - The EM algorithm is coordinate ascent on F:
       - E-step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
       - M-step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

     E-step: maximization of expected $\ell_c$ w.r.t. q
     - Claim: $q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$
       - This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).
     - Proof (easy): this setting attains the bound $\ell(\theta; x) \geq F(q, \theta)$:
       $F\big(p(z \mid x, \theta^t), \theta^t\big) = \sum_z p(z \mid x, \theta^t) \log \dfrac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$
     - Can also show this result using variational calculus or the fact that (checked numerically in the sketch after this slide)
       $\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\big\|\, p(z \mid x, \theta)\big)$
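
Finally, a small numerical check of the identity $\ell(\theta; x) - F(q, \theta) = \mathrm{KL}(q \,\|\, p(z \mid x, \theta))$ on an arbitrary toy discrete joint; the numbers are illustrative only.

```python
# Verify l(theta; x) - F(q, theta) = KL(q || p(z | x, theta)) on a made-up joint table.
import numpy as np

p_xz = np.array([0.08, 0.12, 0.30])                  # p(x, z | theta) for the observed x
log_lik = np.log(p_xz.sum())                         # l(theta; x)
posterior = p_xz / p_xz.sum()                        # p(z | x, theta)

q = np.array([0.5, 0.2, 0.3])                        # an arbitrary q(z | x)
free_energy = np.sum(q * (np.log(p_xz) - np.log(q))) # F(q, theta)
kl = np.sum(q * (np.log(q) - np.log(posterior)))     # KL(q || posterior)

print(np.isclose(log_lik - free_energy, kl))         # True
```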
