Composing graphical models with neural networks for structured representations and fast inference
Matt Johnson, David Duvenaud, Alex Wiltschko, Bob Datta, Ryan Adams
[Figure: depth-video frames and behavioral syllables ("dart", "pause", "rear"), alongside an analogous phoneme segmentation of speech, /b/ /ax/ /n/ /ae/ /n/ /ax/ [1,2].]

[1] Lee and Glass. A Nonparametric Bayesian Approach to Acoustic Model Discovery. ACL 2012.
[2] Lee. Discovering Linguistic Structures in Speech: Models and Applications. MIT Ph.D. Thesis 2014.
Alexander Wiltschko, Matthew Johnson, et al., Neuron 2015.
depth video → image manifold → manifold coordinates ("dart", "rear")
Recurrent neural networks? [1,2,3]

[Figure 1: LSTM unit. Figure 2: LSTM autoencoder model — input frames v1, v2, v3 are encoded (weights W1) into a learned representation, which is copied and decoded (weights W2) to reconstruct v̂3, v̂2, v̂1.]

Probabilistic graphical models? [4,5,6]

[1] Srivastava, Mansimov, Salakhutdinov. Unsupervised learning of video representations using LSTMs. ICML 2015.
[2] Ranzato et al. Video (language) modeling: a baseline for generative models of natural videos. Preprint 2015.
[3] Sutskever, Hinton, and Taylor. The Recurrent Temporal Restricted Boltzmann Machine. NIPS 2008.
[4] Fox, Sudderth, Jordan, Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE TSP 2011.
[5] Johnson and Willsky. Bayesian nonparametric hidden semi-Markov models. JMLR 2013.
[6] Murphy. Machine learning: a probabilistic perspective. MIT Press 2012.
unsupervised learning
supervised learning
Probabilistic graphical models:
+ structured representations
+ priors and uncertainty
+ data and computational efficiency
– rigid assumptions may not fit
– feature engineering
– top-down inference

Deep learning:
– neural net "goo"
– difficult parameterization
– can require lots of data
+ flexible
+ feature learning
+ recognition networks
Modeling idea: graphical models on latent variables, neural network models for observations
Inference: recognition networks output conjugate potentials, then apply fast graphical model inference
Application: learn a syllable representation of behavior from video
Modeling idea: graphical models on latent variables, neural network models for observations
Latent switching linear dynamical system: the discrete state follows a Markov chain with transition distributions $\pi = (\pi^{(1)}, \pi^{(2)}, \pi^{(3)})$, and the continuous state follows per-state linear dynamics with parameters $A^{(k)}, B^{(k)}$:

$$z_{t+1} \sim \pi^{(z_t)}, \qquad x_{t+1} = A^{(z_t)} x_t + B^{(z_t)} u_t, \qquad u_t \overset{\text{iid}}{\sim} \mathcal{N}(0, I).$$

[Graphical model: chain $z_1 \to z_2 \to \cdots \to z_7$ over discrete states; continuous states $x_1, \dots, x_7$ with per-state parameters $A^{(k)}, B^{(k)}$ and transition parameters $\pi^{(k)}$.]
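As a concrete illustration, here is a minimal sketch of sampling from this switching LDS; the number of states, dimensions, and parameter values are toy placeholders, not the ones used in the experiments.

```python
import numpy as np

def sample_slds(pi, A, B, T, x0, rng=None):
    """Sample z_{t+1} ~ pi^{(z_t)} and x_{t+1} = A^{(z_t)} x_t + B^{(z_t)} u_t,
    with u_t ~ N(0, I) i.i.d. and the initial discrete state fixed to 0."""
    rng = np.random.default_rng(rng)
    K, D = len(pi), x0.shape[0]
    z = np.zeros(T, dtype=int)
    x = np.zeros((T, D))
    x[0] = x0
    for t in range(T - 1):
        z[t + 1] = rng.choice(K, p=pi[z[t]])
        u = rng.standard_normal(B[z[t]].shape[1])
        x[t + 1] = A[z[t]] @ x[t] + B[z[t]] @ u
    return z, x

# Toy parameters: K = 3 discrete states, D = 2 continuous dimensions.
rng = np.random.default_rng(0)
K, D = 3, 2
pi = rng.dirichlet(np.ones(K), size=K)      # rows are transition distributions
A = [0.99 * np.eye(D) for _ in range(K)]    # per-state dynamics matrices
B = [0.1 * np.eye(D) for _ in range(K)]     # per-state noise maps
z, x = sample_slds(pi, A, B, T=100, x0=np.zeros(D))
```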
Collapsing the dynamics parameters into a single global variable $\theta$, and attaching observations through a separate set of observation parameters $\gamma$:

[Graphical model: global parameters $\theta$ over the latent chain $z_1, \dots, z_7$ and continuous states $x_1, \dots, x_7$; each observation $y_t$ depends on $x_t$ and $\gamma$.]

$$y_t \mid x_t, \gamma \sim \mathcal{N}\big(\mu(x_t; \gamma),\ \Sigma(x_t; \gamma)\big),$$

where the observation network with parameters $\gamma$ outputs the mean $\mu(x_t; \gamma)$ and the diagonal covariance $\operatorname{diag}(\Sigma(x_t; \gamma))$.
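A minimal sketch of such an observation network: a one-hidden-layer MLP maps the latent state $x_t$ to the mean and diagonal covariance of a Gaussian over $y_t$. The architecture and parameter layout here are illustrative assumptions, not the network used in the paper.

```python
import numpy as np

def decode(x_t, gamma):
    """Compute mu(x_t; gamma) and diag(Sigma(x_t; gamma)) with a
    one-hidden-layer MLP; gamma = (W1, b1, W_mu, b_mu, W_s, b_s)."""
    W1, b1, W_mu, b_mu, W_s, b_s = gamma
    h = np.tanh(W1 @ x_t + b1)
    mu = W_mu @ h + b_mu
    sigma_diag = np.exp(W_s @ h + b_s)   # exponentiated log-variances stay positive
    return mu, sigma_diag

def sample_observation(x_t, gamma, rng=None):
    """Draw y_t | x_t, gamma ~ N(mu(x_t; gamma), diag(Sigma(x_t; gamma)))."""
    rng = np.random.default_rng(rng)
    mu, sigma_diag = decode(x_t, gamma)
    return mu + np.sqrt(sigma_diag) * rng.standard_normal(mu.shape)
```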
The general template:
- $p(\theta)$: conjugate prior on the global latent variables $\theta$
- $p(x \mid \theta)$: exponential family on the local latent variables $x_n$
- $p(\gamma)$: any prior on the observation parameters $\gamma$
- $p(y \mid x, \gamma)$: neural network observation model

[Graphical model: plate over data points $n$ with local latents $x_n$ and observations $y_n$; global variables $\theta$ and observation parameters $\gamma$ outside the plate.]
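Spelled out, the four factors above compose into the joint density (this factorization is implied by the list and the plate over the $N$ data points):

$$p(\theta, \gamma, x, y) = p(\theta)\, p(\gamma) \prod_{n=1}^{N} p(x_n \mid \theta)\, p(y_n \mid x_n, \gamma).$$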
Examples in this model family: Gaussian mixture model [1], linear dynamical system [2], hidden Markov model [3], switching LDS [4], mixture of experts [5], driven LDS [2], IO-HMM [6], factorial HMM [7], canonical correlation analysis [8,9], admixture / LDA / NMF [10].

[1] Palmer, Wipf, Kreutz-Delgado, and Rao. Variational EM algorithms for non-Gaussian latent variable models. NIPS 2005.
[2] Ghahramani and Beal. Propagation algorithms for variational Bayesian learning. NIPS 2001.
[3] Beal. Variational algorithms for approximate Bayesian inference, Ch. 3. U of London Ph.D. Thesis 2003.
[4] Ghahramani and Hinton. Variational learning for switching state-space models. Neural Computation 2000.
[5] Jordan and Jacobs. Hierarchical Mixtures of Experts and the EM algorithm. Neural Computation 1994.
[6] Bengio and Frasconi. An Input Output HMM Architecture. NIPS 1995.
[7] Ghahramani and Jordan. Factorial Hidden Markov Models. Machine Learning 1997.
[8] Bach and Jordan. A probabilistic interpretation of Canonical Correlation Analysis. Tech. Report 2005.
[9] Archambeau and Bach. Sparse probabilistic projections. NIPS 2008.
[10] Hoffman, Bach, Blei. Online learning for Latent Dirichlet Allocation. NIPS 2010.
Inference?

[Graphical model: global variables $\theta$, local latents $x_n$, observation parameters $\gamma$, observations $y_n$.]
Conjugate case first: $p(x \mid \theta)$ is a linear dynamical system, $p(y \mid x, \theta)$ is linear-Gaussian, and $p(\theta)$ is a conjugate prior. The mean field approximation $q(\theta)\, q(x) \approx p(\theta, x \mid y)$, with natural parameters $q(\theta) \leftrightarrow \eta_\theta$ and $q(x) \leftrightarrow \eta_x$, is fit by maximizing

$$\mathcal{L}(\eta_\theta, \eta_x) \triangleq \mathbb{E}_{q(\theta) q(x)}\!\left[\log \frac{p(\theta, x, y)}{q(\theta)\, q(x)}\right].$$

Partially optimizing over the local factor gives the SVI objective:

$$\eta_x^*(\eta_\theta) \triangleq \arg\max_{\eta_x} \mathcal{L}(\eta_\theta, \eta_x), \qquad \mathcal{L}_{\mathrm{SVI}}(\eta_\theta) \triangleq \mathcal{L}\big(\eta_\theta, \eta_x^*(\eta_\theta)\big).$$

Proposition (natural gradient SVI of Hoffman et al. 2013). With $N$ conditionally independent data sequences,

$$\widetilde{\nabla} \mathcal{L}_{\mathrm{SVI}}(\eta_\theta) = \eta_\theta^0 + \sum_{n=1}^{N} \mathbb{E}_{q^*(x_n)}\big[(t_{xy}(x_n, y_n),\, 1)\big] - \eta_\theta.$$
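A schematic of the resulting update on the global natural parameters, assuming a routine `expected_stats` that optimizes each local factor $q(x_n)$ (e.g. by graphical-model message passing) and returns $\mathbb{E}_{q^*(x_n)}[(t_{xy}(x_n, y_n), 1)]$; that routine and the argument names are placeholders, not the paper's API.

```python
def natural_gradient_svi_step(eta_theta, eta_theta_prior, minibatch,
                              expected_stats, step_size, scale):
    """One stochastic natural-gradient ascent step on the global natural
    parameters, following the proposition above. `scale` is N / |minibatch|,
    the usual rescaling of minibatch sufficient statistics."""
    stats = sum(expected_stats(y_n, eta_theta) for y_n in minibatch)
    nat_grad = eta_theta_prior + scale * stats - eta_theta
    return eta_theta + step_size * nat_grad
```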
Step 1: compute evidence potentials

[1] Johnson and Willsky. Stochastic variational inference for Bayesian time series models. ICML 2014.
[2] Foti, Xu, Laird, and Fox. Stochastic variational inference for hidden Markov models. NIPS 2014.
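A sketch of what computing these potentials can look like in the LDS case: a recognition network maps each frame $y_t$ to a diagonal Gaussian node potential in natural-parameter form (a positive precision and a precision-weighted mean), which standard message passing (e.g. Kalman smoothing) can then consume. The architecture and parameter names are illustrative assumptions.

```python
import numpy as np

def recognition_potentials(y, phi):
    """Map observations y (shape (T, P)) to per-frame Gaussian evidence
    potentials in natural-parameter form: precisions J (positive) and
    precision-weighted means h, each of shape (T, D).
    phi = (W1, b1, W_h, b_h, W_J, b_J) parameterizes a small MLP."""
    W1, b1, W_h, b_h, W_J, b_J = phi
    hidden = np.tanh(y @ W1.T + b1)
    h = hidden @ W_h.T + b_h
    J = np.log1p(np.exp(hidden @ W_J.T + b_J))   # softplus keeps precisions positive
    return J, h

# The next step would hand (J, h) to the usual LDS message passing
# (Kalman smoothing) as node potentials to compute q*(x) and its statistics.
```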