
Variational Autoencoders + Deep Generative Models – Matt Gormley



  1. 10-418 / 10-618 Machine Learning for Structured Data
     Machine Learning Department
     School of Computer Science
     Carnegie Mellon University
     Variational Autoencoders + Deep Generative Models
     Matt Gormley | Lecture 27 | Dec. 4, 2019

  2. Reminders
     • Final Exam – Evening Exam – Thu, Dec. 5, 6:30pm – 9:00pm
     • 618 Final Poster:
       – Submission: Tue, Dec. 10 at 11:59pm
       – Presentation: Wed, Dec. 11 (time will be announced on Piazza)

  3. FINAL EXAM LOGISTICS

  4. Final Exam
     • Time / Location
       – Time: Evening Exam, Thu, Dec. 5, 6:30pm – 9:00pm
       – Room: Doherty Hall A302
       – Seats: There will be assigned seats. Please arrive early to find yours.
       – Please watch Piazza carefully for announcements
     • Logistics
       – Covered material: Lecture 1 – Lecture 26 (not the new material in Lecture 27)
       – Format of questions:
         • Multiple choice
         • True / False (with justification)
         • Derivations
         • Short answers
         • Interpreting figures
         • Implementing algorithms on paper
       – No electronic devices
       – You are allowed to bring one 8½ x 11 sheet of notes (front and back)

  5. Final Exam
     • Advice (for during the exam)
       – Solve the easy problems first (e.g. multiple choice before derivations)
         • If a problem seems extremely complicated, you're likely missing something
       – Don't leave any answer blank!
       – If you make an assumption, write it down
       – If you look at a question and don't know the answer:
         • we probably haven't told you the answer
         • but we've told you enough to work it out
         • imagine arguing for some answer and see if you like it

  6. Final Exam
     • Exam Contents
       – ~30% of material comes from topics covered before the Midterm Exam
       – ~70% of material comes from topics covered after the Midterm Exam

  7. Topics from before Midterm Exam
     • Search-Based Structured Prediction
       – Reductions to Binary Classification
       – Learning to Search
       – RNN-LMs
       – seq2seq models
     • Graphical Model Representation
       – Directed GMs vs. Undirected GMs vs. Factor Graphs
       – Bayesian Networks vs. Markov Random Fields vs. Conditional Random Fields
     • Graphical Model Learning
       – Fully observed Bayesian Network learning
       – Fully observed MRF learning
       – Fully observed CRF learning
       – Parameterization of a GM
       – Neural potential functions
     • Exact Inference
       – Three inference problems: (1) marginals, (2) partition function, (3) most probable assignment
       – Variable Elimination
       – Belief Propagation (sum-product and max-product)
       – MAP Inference via MILP

  8. Topics from after Midterm Exam
     • Learning for Structured Prediction
       – Structured Perceptron
       – Structured SVM
       – Neural network potentials
     • Approximate MAP Inference
       – MAP Inference via MILP
       – MAP Inference via LP relaxation
     • Approximate Inference by Sampling
       – Monte Carlo Methods
       – Gibbs Sampling
       – Metropolis-Hastings
       – Markov Chains and MCMC
     • Approximate Inference by Optimization
       – Variational Inference
       – Mean Field Variational Inference
       – Coordinate Ascent V.I. (CAVI)
       – Variational EM
       – Variational Bayes
     • Bayesian Nonparametrics
       – Dirichlet Process
       – DP Mixture Model
     • Deep Generative Models
       – Variational Autoencoders

  9. VARIATIONAL EM

  10. Variational EM
      Whiteboard:
      – Example: Unsupervised POS Tagging
      – Variational Bayes
      – Variational EM
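      The derivation itself is worked on the whiteboard; as a reference, here is a minimal sketch of the variational EM coordinate ascent in generic notation (x observed, z latent, θ parameters, q the variational distribution; this summary is ours, not transcribed from the lecture):

        \mathcal{L}(q, \theta) \;=\; \mathbb{E}_{q(z)}[\log p(x, z \mid \theta)] + H(q) \;\le\; \log p(x \mid \theta)

        \text{E-step: } q^{(t+1)} = \arg\max_{q \in \mathcal{Q}} \; \mathcal{L}(q, \theta^{(t)}) \quad (\text{e.g. mean field, } q(z) = \textstyle\prod_i q_i(z_i))

        \text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} \; \mathcal{L}(q^{(t+1)}, \theta) = \arg\max_{\theta} \; \mathbb{E}_{q^{(t+1)}}[\log p(x, z \mid \theta)]

      With a restricted family Q, the E-step only tightens, rather than closes, the gap to the marginal likelihood; that is what distinguishes variational EM from exact EM.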

  11. Unsupervised POS Tagging: Bayesian Inference for HMMs
      • Task: unsupervised POS tagging
      • Data: 1 million words (i.e. unlabeled sentences) of WSJ text
      • Dictionary: defines legal part-of-speech (POS) tags for each word type
      • Models:
        – EM: standard HMM
        – VB: uncollapsed variational Bayesian HMM
        – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
        – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
        – CGS: collapsed Gibbs sampler for Bayesian HMM

      Algo 1 mean field update:

        q(z_t = k) \propto
          \frac{\mathbb{E}_{q(\mathbf{z}^{\neg t})}[C^{\neg t}_{z_{t-1},k}] + \alpha}{\mathbb{E}_{q(\mathbf{z}^{\neg t})}[C^{\neg t}_{z_{t-1},\cdot}] + K\alpha}
          \cdot \frac{\mathbb{E}_{q(\mathbf{z}^{\neg t})}[C^{\neg t}_{k,z_{t+1}}] + \alpha + \mathbb{E}_{q(\mathbf{z}^{\neg t})}[\delta(z_{t-1} = k = z_{t+1})]}{\mathbb{E}_{q(\mathbf{z}^{\neg t})}[C^{\neg t}_{k,\cdot}] + K\alpha + \mathbb{E}_{q(\mathbf{z}^{\neg t})}[\delta(z_{t-1} = k)]}
          \cdot \frac{\mathbb{E}_{q(\mathbf{z}^{\neg t})}[C^{\neg t}_{k,w}] + \beta}{\mathbb{E}_{q(\mathbf{z}^{\neg t})}[C^{\neg t}_{k,\cdot}] + W\beta}

      CGS full conditional:

        p(z_t = k \mid \mathbf{x}, \mathbf{z}^{\neg t}, \alpha, \beta) \propto
          \frac{C^{\neg t}_{z_{t-1},k} + \alpha}{C^{\neg t}_{z_{t-1},\cdot} + K\alpha}
          \cdot \frac{C^{\neg t}_{k,z_{t+1}} + \alpha + \delta(z_{t-1} = k = z_{t+1})}{C^{\neg t}_{k,\cdot} + K\alpha + \delta(z_{t-1} = k)}
          \cdot \frac{C^{\neg t}_{k,w} + \beta}{C^{\neg t}_{k,\cdot} + W\beta}

      (Here C^{\neg t} are transition/emission counts with position t excluded, K is the number of tags, W the vocabulary size, and w = x_t.)
      Equations from Wang & Blunsom (2013)
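      As a concrete companion to the CGS full conditional above, here is a minimal NumPy sketch (ours, not code from the paper or the slides) of how the probabilities for z_t could be computed at an interior position t, assuming count matrices that already exclude position t:

        import numpy as np

        def cgs_full_conditional(t, z, x, trans_counts, emit_counts, alpha, beta):
            """Normalized p(z_t = k | x, z^{-t}) for all k, at an interior position t.

            Assumes trans_counts[j, k] (transitions j -> k) and emit_counts[k, w]
            already EXCLUDE position t's transitions and emission (the "neg t" counts).
            """
            K, W = emit_counts.shape
            prev_k, next_k, w = z[t - 1], z[t + 1], x[t]
            ks = np.arange(K)

            # Factor 1: transition from z_{t-1} into each candidate state k.
            p_in = (trans_counts[prev_k] + alpha) / (trans_counts[prev_k].sum() + K * alpha)
            # Factor 2: transition from k to z_{t+1}; the delta terms correct the
            # counts for the case where setting z_t = k reuses state z_{t-1}.
            num = trans_counts[:, next_k] + alpha + ((ks == prev_k) & (ks == next_k))
            den = trans_counts.sum(axis=1) + K * alpha + (ks == prev_k)
            p_out = num / den
            # Factor 3: emission of the observed word w from state k.
            p_emit = (emit_counts[:, w] + beta) / (emit_counts.sum(axis=1) + W * beta)

            scores = p_in * p_out * p_emit
            return scores / scores.sum()

      A Gibbs sweep would then resample z[t] via np.random.choice(K, p=cgs_full_conditional(...)) and add position t's counts back in.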

  12. Unsupervised POS Tagging: Bayesian Inference for HMMs
      • Task: unsupervised POS tagging
      • Data: 1 million words (i.e. unlabeled sentences) of WSJ text
      • Dictionary: defines legal part-of-speech (POS) tags for each word type
      • Models:
        – EM: standard HMM
        – VB: uncollapsed variational Bayesian HMM
        – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
        – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
        – CGS: collapsed Gibbs sampler for Bayesian HMM
      [Figure: test perplexity and tagging accuracy vs. number of iterations, comparing EM (28 mins), VB (35 mins), Algo 1 (15 mins), Algo 2 (50 mins), and CGS (480 mins); the variational algorithms are plotted over 50 iterations, CGS over 20,000. Figure from Wang & Blunsom (2013)]

  13. Unsupervised POS Tagging: Bayesian Inference for HMMs
      • Task: unsupervised POS tagging
      • Data: 1 million words (i.e. unlabeled sentences) of WSJ text
      • Dictionary: defines legal part-of-speech (POS) tags for each word type
      • Models:
        – EM: standard HMM
        – VB: uncollapsed variational Bayesian HMM
        – Algo 1 (CVB): collapsed variational Bayesian HMM (strong indep. assumption)
        – Algo 2 (CVB): collapsed variational Bayesian HMM (weaker indep. assumption)
        – CGS: collapsed Gibbs sampler for Bayesian HMM
      • Speed:
        – EM (28 mins) is slow b/c of log-space computations
        – VB (35 mins) is slow b/c of digamma computations
        – Algo 1 (CVB) (15 mins) is the fastest!
        – Algo 2 (CVB) (50 mins) is slow b/c it computes dynamic parameters
        – CGS (480 mins) is an order of magnitude slower than any deterministic algorithm
      Figure from Wang & Blunsom (2013)
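      To make the digamma point concrete, here is a small NumPy/SciPy illustration (ours, not the authors' code) of the per-iteration arithmetic the two variational families need for a K-state transition table:

        import numpy as np
        from scipy.special import digamma

        K, alpha = 45, 0.1
        trans_counts = np.random.rand(K, K) * 100   # stand-in for accumulated expected counts

        # VB: the E-step uses expected log-parameters under the Dirichlet posterior,
        #   E[log A_{j,k}] = psi(alpha + C_{j,k}) - psi(K*alpha + C_{j,.}),
        # i.e. one digamma evaluation per table entry on every iteration.
        elog_A = digamma(alpha + trans_counts) \
                 - digamma(K * alpha + trans_counts.sum(axis=1, keepdims=True))

        # CVB (Algo 1): the update shown on slide 11 needs only ratios of
        # (expected) counts -- plain arithmetic, no special functions.
        A_cvb = (alpha + trans_counts) / (K * alpha + trans_counts.sum(axis=1, keepdims=True))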

  14. Stochastic Variational Bayesian HMM
      • Task: Human Chromatin Segmentation
      • Goal: unsupervised segmentation of the genome
      • Data: from ENCODE, "250 million observations consisting of twelve assays carried out in the chronic myeloid leukemia cell line K562"
      • Metric: "the false discovery rate (FDR) of predicting active promoter elements in the sequence"
      • Models:
        – DBN HMM: dynamic Bayesian HMM trained with standard EM
        – SVIHMM: stochastic variational inference for a Bayesian HMM
      • Main Takeaway:
        – the two models perform at similar levels of FDR
        – SVIHMM takes one hour; DBN HMM takes days
      [Figure: held-out log-probability vs. subchain half-length L/2 and vs. iteration, for diagonally dominant and reversed-cycles transition matrices, with GrowBuffer on/off. Figure from Foti et al. (2014)]
      Figure from Mammana & Chung (2015)
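      Schematically, SVIHMM differs from batch VB by updating the global variational parameters from minibatches of subchains. A hedged sketch of that global step, following the general SVI recipe of Hoffman et al. (2013); the local forward-backward step is left as a placeholder and all names are ours:

        import numpy as np

        def svi_global_step(lam, subchain, local_expected_counts, prior, T_total, T_sub, rho):
            """One stochastic natural-gradient update of global Dirichlet parameters lam.

            local_expected_counts: placeholder for local inference (forward-backward
            over the buffered subchain) returning expected sufficient statistics.
            rho: step size, e.g. rho_t = (t0 + t) ** (-kappa) with kappa in (0.5, 1].
            """
            stats = local_expected_counts(lam, subchain)   # local step on the minibatch
            lam_hat = prior + (T_total / T_sub) * stats    # rescale to full-sequence size
            return (1.0 - rho) * lam + rho * lam_hat       # blend old and new estimates

      (GrowBuffer, in the figure, enlarges a buffer around each subchain to mitigate the bias from cutting dependencies at subchain boundaries.)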

  15. Grammar Induction
      Question: Can maximizing (unsupervised) marginal likelihood produce useful results?
      Answer: Let's look at an example…
      • Babies learn the syntax of their native language (e.g. English) just by hearing many sentences
      • Can a computer similarly learn the syntax of a human language just by looking at lots of example sentences?
        – This is the problem of Grammar Induction!
        – It's an unsupervised learning problem
        – We try to recover the syntactic structure for each sentence without any supervision

  16. Grammar Induction
      [Figure: four candidate parses of the ambiguous sentence "time flies like an arrow"; the trees carry no semantic interpretation]

  17. Grammar Induction
      Training Data: Sentences only, without parses
      – Sample 1: x(1) = time flies like an arrow
      – Sample 2: x(2) = real flies like soup
      – Sample 3: x(3) = flies fly with their wings
      – Sample 4: x(4) = with time you will see
      Test Data: Sentences with parses, so we can evaluate accuracy

  18. Grammar Induction
      Q: Does likelihood correlate with accuracy on a task we care about?
      A: Yes, but there is still a wide range of accuracies for a particular likelihood value.
      [Figure: scatter plot of attachment accuracy (%) vs. log-likelihood (per sentence) for the Dependency Model with Valence (Klein & Manning, 2004); Pearson's r = 0.63 (strong correlation). Figure from Gimpel & Smith (NAACL 2012) slides]
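      The reported correlation is just the Pearson coefficient across trained models; a tiny sketch with made-up numbers (the real values live in Gimpel & Smith's figure and are not reproduced here):

        import numpy as np
        from scipy.stats import pearsonr

        # Hypothetical (log-likelihood, attachment accuracy) pairs, one per trained
        # DMV model -- illustrative values only, not the paper's data.
        loglik = np.array([-20.1, -19.9, -19.7, -19.5, -19.3])  # per-sentence LL
        accuracy = np.array([22.0, 35.0, 41.0, 48.0, 55.0])     # attachment %
        r, _ = pearsonr(loglik, accuracy)
        print(f"Pearson's r = {r:.2f}")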
