10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1
Reminders • Homework 6: PAC Learning / Generative Models – Out: Wed, Oct 31 – Due: Wed, Nov 7 at 11:59pm (1 week) • Homework 7: HMMs – Out: Wed, Nov 7 – Due: Mon, Nov 19 at 11:59pm • Grades are up on Canvas 2
Q&A Q: Why would we use Naïve Bayes? Isn’t it too Naïve? A: Naïve Bayes has one key advantage over methods like Perceptron, Logistic Regression, Neural Nets: Training is lightning fast! While other methods require slow iterative training procedures that might require hundreds of epochs, Naïve Bayes computes its parameters in closed form by counting. 3
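To make "closed form by counting" concrete, here is a minimal sketch of maximum likelihood training for a Bernoulli Naïve Bayes model. The function name `train_bernoulli_nb` and the toy dataset are illustrative assumptions, not taken from the lecture.

```python
from collections import Counter, defaultdict

def train_bernoulli_nb(X, y):
    """Closed-form MLE for Bernoulli Naive Bayes: one pass of counting."""
    n_features = len(X[0])
    class_counts = Counter(y)                      # N_c = #(y = c)
    ones = defaultdict(lambda: [0] * n_features)   # ones[c][k] = #(x_k = 1, y = c)
    for x, label in zip(X, y):
        for k, value in enumerate(x):
            ones[label][k] += value
    # Class prior p(y = c) and per-feature parameters p(x_k = 1 | y = c).
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    theta = {c: [ones[c][k] / class_counts[c] for k in range(n_features)]
             for c in class_counts}
    return prior, theta

# Hypothetical toy data: 4 examples, 2 binary features.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
prior, theta = train_bernoulli_nb(X, y)
```

There is no loop over epochs and no learning rate: each parameter is a ratio of counts computed in a single pass over the data.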
DISCRIMINATIVE AND GENERATIVE CLASSIFIERS 4
Generative vs. Discriminative • Generative Classifiers: – Example: Naïve Bayes – Define a joint model of the observations x and the labels y: p ( x , y ) – Learning maximizes (joint) likelihood – Use Bayes’ Rule to classify based on the posterior: p ( y | x ) = p ( x | y ) p ( y ) /p ( x ) • Discriminative Classifiers: – Example: Logistic Regression – Directly model the conditional: p ( y | x ) – Learning maximizes conditional likelihood 5
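As a small illustration of how a generative classifier turns its joint model into a posterior, the sketch below applies Bayes' Rule to a single discrete observation. The labels, probabilities, and the helper name `posterior` are made up for the example.

```python
def posterior(prior, likelihood, x):
    """p(y | x) = p(x | y) p(y) / p(x), where p(x) = sum_y p(x | y) p(y)."""
    joint = {y: likelihood[y][x] * prior[y] for y in prior}   # p(x, y)
    evidence = sum(joint.values())                            # p(x)
    return {y: joint[y] / evidence for y in joint}

# Hypothetical model with two labels and two possible observations.
prior = {"spam": 0.25, "ham": 0.75}
likelihood = {"spam": {"offer": 0.6, "hello": 0.4},
              "ham": {"offer": 0.2, "hello": 0.8}}
post = posterior(prior, likelihood, "offer")
```

A discriminative classifier skips the `joint` and `evidence` steps and parameterizes p(y | x) directly.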
Generative vs. Discriminative Whiteboard – Contrast: To model p(x) or not to model p(x)? 6
Generative vs. Discriminative Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset] If model assumptions are correct: Naïve Bayes is a more efficient learner (requires fewer samples) than Logistic Regression. If model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes. 7
[Figure: plots comparing Naïve Bayes (solid lines) with Logistic Regression (dashed lines)] Slide courtesy of William Cohen 8
[Figure: plots comparing Naïve Bayes (solid lines) with Logistic Regression (dashed lines)] Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters. “On Discriminative vs Generative Classifiers: ….” Andrew Ng and Michael Jordan, NIPS 2001. Slide courtesy of William Cohen 9
Generative vs. Discriminative Learning (Parameter Estimation) Naïve Bayes: Parameters are decoupled → Closed form solution for MLE. Logistic Regression: Parameters are coupled → No closed form solution; must use iterative optimization techniques instead. 10
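For contrast with counting, here is a minimal sketch of the iterative training that Logistic Regression needs. It uses plain batch gradient ascent on the conditional log-likelihood; the learning rate, epoch count, and toy data are arbitrary choices for illustration, and other optimizers would work as well.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Batch gradient ascent on the conditional log-likelihood, y in {0, 1}."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            # Every weight feeds the same prediction, so the weights are coupled.
            error = label - sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
            for k, xk in enumerate(x):
                grad[k] += error * xk
        w = [wk + lr * gk / len(X) for wk, gk in zip(w, grad)]
    return w

# Hypothetical toy data; the first feature is a constant bias term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
w = train_logistic_regression(X, y)
```

Because the gradient for each weight depends on the current value of all the others through the prediction, the parameters cannot be solved for one at a time.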
Naïve Bayes vs. Logistic Reg. Learning (MAP Estimation of Parameters) Bernoulli Naïve Bayes: Parameters are probabilities → Beta prior (usually) pushes probabilities away from zero / one extremes. Logistic Regression: Parameters are not probabilities → Gaussian prior encourages parameters to be close to zero (effectively pushes the probabilities away from zero / one extremes). 11
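The effect of the Beta prior on a single Bernoulli parameter can be sketched as follows. The choice of Beta(2, 2) and the counts are assumptions for illustration; the formula is the mode of the Beta posterior, which is valid when both hyperparameters exceed 1.

```python
def bernoulli_mle(n_ones, n_total):
    """Maximum likelihood estimate: the raw fraction of ones."""
    return n_ones / n_total

def bernoulli_map(n_ones, n_total, alpha=2.0, beta=2.0):
    """Mode of the Beta(alpha + n_ones, beta + n_zeros) posterior (alpha, beta > 1)."""
    return (n_ones + alpha - 1) / (n_total + alpha + beta - 2)

# A feature that never fired in 5 examples of some class (hypothetical counts).
mle = bernoulli_mle(0, 5)   # sits at the extreme 0
map_ = bernoulli_map(0, 5)  # pulled into the interior: 1/7
```

With Beta(2, 2) the MAP estimate behaves like adding one imaginary "1" and one imaginary "0" to the observed counts.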
Naïve Bayes vs. Logistic Reg. Features Naïve Bayes: Features x are assumed to be conditionally independent given y . (i.e. Naïve Bayes Assumption) Logistic Regression: No assumptions are made about the form of the features x . They can be dependent and correlated in any fashion. 12
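Written out, the Naïve Bayes assumption says that the class-conditional distribution factorizes over the individual features. Here K denotes the number of features (notation assumed for this sketch):

```latex
p(x \mid y) = \prod_{k=1}^{K} p(x_k \mid y)
\qquad\Longrightarrow\qquad
p(x, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)
```

Logistic Regression never models p(x | y), so it has no corresponding factorization to assume.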
MOTIVATION: STRUCTURED PREDICTION 13
Structured Prediction • Most of the models we’ve seen so far were for classification – Given observations: x = (x_1, x_2, …, x_K) – Predict a (binary) label: y • Many real-world problems require structured prediction – Given observations: x = (x_1, x_2, …, x_K) – Predict a structure: y = (y_1, y_2, …, y_J) • Some classification problems benefit from latent structure 14
Structured Prediction Examples • Examples of structured prediction – Part-of-speech (POS) tagging – Handwriting recognition – Speech recognition – Word alignment – Congressional voting • Examples of latent structure – Object recognition 15
Dataset for Supervised Part-of-Speech (POS) Tagging
Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
Sample 1: x^(1) = time flies like an arrow; y^(1) = n v p d n
Sample 2: x^(2) = time flies like an arrow; y^(2) = n n v d n
Sample 3: x^(3) = flies fly with their wings; y^(3) = n v p n n
Sample 4: x^(4) = with time you will see; y^(4) = p n n v v
16
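Dataset for Supervised Handwriting Recognition
Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
Sample 1: y^(1) = u n e x p e c t e d; x^(1) = [image of the handwritten word]
Sample 2: y^(2) = v o l c a n i c; x^(2) = [image of the handwritten word]
Sample 3: y^(3) = e m b r a c e s; x^(3) = [image of the handwritten word]
Figures from (Chatzis & Demiris, 2013) 17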
Dataset for Supervised Phoneme (Speech) Recognition
Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
Sample 1: y^(1) = h# dh ih s iy w uh z z iy; x^(1) = [figure of the speech signal]
Sample 2: y^(2) = f ao r ah s s h#; x^(2) = [figure of the speech signal]
Figures from (Jansen & Niyogi, 2013) 18
Application: Word Alignment / Phrase Extraction • Variables (boolean): – For each (Chinese phrase, English phrase) pair, are they linked? • Interactions: – Word fertilities – Few “jumps” (discontinuities) – Syntactic reorderings – “ITG constraint” on alignment – Phrases are disjoint (?) (Burkett & Klein, 2012) 19
Application: Congressional Voting • Variables : – Text of all speeches of a representative – Local contexts of references between two representatives • Interactions : – Words used by representative and their vote – Pairs of representatives and their local context (Stoyanov & Eisner, 2012) 20
Case Study: Object Recognition Data consists of images x and labels y. [Figure: four example images x^(1), …, x^(4) with their labels y^(1), …, y^(4): pigeon, rhinoceros, leopard, llama] 22
Case Study: Object Recognition Data consists of images x and labels y. • Preprocess data into “patches” • Posit a latent labeling z describing the object’s parts (e.g. head, leg, tail, torso, grass) • Define graphical model with these latent variables in mind • z is not observed at train or test time [Figure: image labeled “leopard” overlaid with a graphical model connecting patch variables X_2, X_7, latent part variables Z_2, Z_7, and the label Y through factors ψ_1, ψ_2, ψ_3, ψ_4] 23-25
Structured Prediction Preview of challenges to come… • Consider the task of finding the most probable assignment to the output
Classification: $\hat{y} = \operatorname{argmax}_{y} \; p(y \mid \mathbf{x})$ where $y \in \{+1, -1\}$
Structured Prediction: $\hat{\mathbf{y}} = \operatorname{argmax}_{\mathbf{y}} \; p(\mathbf{y} \mid \mathbf{x})$ where $\mathbf{y} \in \mathcal{Y}$ and $|\mathcal{Y}|$ is very large
26
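To see why a very large output space is a challenge, consider the naive approach of scoring every candidate structure. The sketch below enumerates all tag sequences of a given length; the tag set, the scoring function `toy_score`, and its reference tagging are hypothetical stand-ins for a real model's p(y | x).

```python
from itertools import product

def brute_force_argmax(score, labels, length):
    """Enumerate all len(labels)**length structures and keep the best one."""
    return max(product(labels, repeat=length), key=score)

TAGS = ("n", "v", "p", "d")

def toy_score(y):
    # Hypothetical score: number of positions agreeing with a fixed tagging.
    reference = ("n", "v", "p", "d", "n")
    return sum(a == b for a, b in zip(y, reference))

best = brute_force_argmax(toy_score, TAGS, 5)
num_structures = len(TAGS) ** 5   # grows exponentially with sentence length
```

With only 4 tags and 5 words there are already 1024 candidates, and each extra word multiplies that count by 4, so exhaustive search does not scale to realistic tag sets and sentence lengths.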
Machine Learning
• Domain Knowledge: The data inspires the structures we want to predict
• Mathematical Modeling: Our model defines a score for each structure. It also tells us what to optimize
• Combinatorial Optimization: Inference finds { best structure, marginals, partition function } for a new observation
• Optimization: Learning tunes the parameters of the model
(Inference is usually called as a subroutine in learning) 27
Machine Learning: Data, Model, Objective, Inference, Learning [Figure: model over variables X_1, X_2, X_3, X_4, X_5 for the sentence “time flies like an arrow”] (Inference is usually called as a subroutine in learning) 28
BACKGROUND 29
Background: Chain Rule of Probability
For random variables A and B: $P(A, B) = P(A \mid B)\,P(B)$
For random variables $X_1, X_2, X_3, X_4$: $P(X_1, X_2, X_3, X_4) = P(X_1 \mid X_2, X_3, X_4)\,P(X_2 \mid X_3, X_4)\,P(X_3 \mid X_4)\,P(X_4)$
31
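The chain rule holds for any joint distribution, which can be checked numerically. The sketch below uses an arbitrary made-up joint over three binary variables (indexed 0, 1, 2) and the helper names `marginal` and `conditional`, all of which are assumptions for illustration.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (sums to 1).
values = [0.10, 0.05, 0.20, 0.15, 0.05, 0.10, 0.25, 0.10]
joint = dict(zip(product([0, 1], repeat=3), values))

def marginal(assignment):
    """P of a partial assignment {index: value}, summing out the rest."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in assignment.items()))

def conditional(target, given):
    """P(target | given) = P(target, given) / P(given)."""
    return marginal({**target, **given}) / marginal(given)

# P(X1, X2, X3) = P(X1 | X2, X3) P(X2 | X3) P(X3) for one assignment.
x1, x2, x3 = 1, 0, 1
chain = (conditional({0: x1}, {1: x2, 2: x3})
         * conditional({1: x2}, {2: x3})
         * marginal({2: x3}))
```

The product `chain` agrees with the table entry `joint[(1, 0, 1)]` up to floating point error, and the same holds for every other assignment.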
Background: Conditional Independence
Random variables A and B are conditionally independent given C if:
(1) $P(A, B \mid C) = P(A \mid C)\,P(B \mid C)$
or equivalently:
(2) $P(A \mid B, C) = P(A \mid C)$
We write this as:
(3) $A \perp B \mid C$
Later we will also write: I<A, {C}, B>
32
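Definition (1) can be verified numerically on a small example. The sketch below builds a joint in which A and B each depend only on C, using made-up probability tables, and checks the definition for every assignment. The helper names `prob` and `is_cond_independent` are illustrative.

```python
from itertools import product

# Hypothetical tables: A and B each depend only on C.
p_c = {0: 0.3, 1: 0.7}
p_a_given_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # indexed [c][a]
p_b_given_c = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}  # indexed [c][b]

joint = {(a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
         for a, b, c in product([0, 1], repeat=3)}

def prob(a=None, b=None, c=None):
    """Marginal probability; arguments left as None are summed out."""
    return sum(p for (ja, jb, jc), p in joint.items()
               if a in (None, ja) and b in (None, jb) and c in (None, jc))

def is_cond_independent(tol=1e-12):
    """Check P(A, B | C) = P(A | C) P(B | C) for every assignment."""
    for a, b, c in product([0, 1], repeat=3):
        lhs = prob(a, b, c) / prob(c=c)
        rhs = (prob(a=a, c=c) / prob(c=c)) * (prob(b=b, c=c) / prob(c=c))
        if abs(lhs - rhs) > tol:
            return False
    return True
```

In this example A and B are conditionally independent given C but not marginally independent: P(A=1, B=1) differs from P(A=1) P(B=1), because both variables carry information about C.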
HIDDEN MARKOV MODEL (HMM) 33
HMM Outline • Motivation – Time Series Data • Hidden Markov Model (HMM) – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] – Background: Markov Models – From Mixture Model to HMM – History of HMMs – Higher-order HMMs • Training HMMs – (Supervised) Likelihood for HMM – Maximum Likelihood Estimation (MLE) for HMM – EM for HMM (aka. Baum-Welch algorithm) • Forward-Backward Algorithm – Three Inference Problems for HMM – Great Ideas in ML: Message Passing – Example: Forward-Backward on 3-word Sentence – Derivation of Forward Algorithm – Forward-Backward Algorithm – Viterbi algorithm 34
Markov Models Whiteboard – Example: Squirrel Hill Tunnel Closures [courtesy of Roni Rosenfeld] – First-order Markov assumption – Conditional independence assumptions 35
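Since the example itself was worked on the whiteboard, the following is only a sketch of what the first-order Markov assumption buys: each state depends only on the previous one, so the joint probability of a state sequence is P(y_1) times a product of transition probabilities P(y_t | y_{t-1}). The two states (open "O", closed "C") and all numbers are hypothetical, not the values used in lecture.

```python
def markov_chain_prob(states, initial, transition):
    """P(y_1, ..., y_T) = P(y_1) * prod_t P(y_t | y_{t-1}) (first-order)."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

# Hypothetical numbers for a tunnel that is Open ("O") or Closed ("C") each day.
initial = {"O": 0.5, "C": 0.5}
transition = {"O": {"O": 0.75, "C": 0.25},   # transition[prev][cur]
              "C": {"O": 0.5, "C": 0.5}}
p = markov_chain_prob(["O", "O", "C"], initial, transition)  # 0.5 * 0.75 * 0.25
```

Without the Markov assumption, the chain rule would require a conditional table over the entire history at every step; here a single transition table is reused at each position.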