CMSC 723: Computational Linguistics I ― Session #5: Hidden Markov Models
Jimmy Lin, The iSchool, University of Maryland
Wednesday, September 30, 2009
Today’s Agenda
• The great leap forward in NLP
• Hidden Markov models (HMMs)
• Forward algorithm
• Viterbi decoding
• Supervised training
• Unsupervised training teaser
• HMMs for POS tagging
Deterministic to Stochastic
• The single biggest leap forward in NLP: from deterministic to stochastic models
• What? A stochastic process is one whose behavior is non-deterministic, in that a system’s subsequent state is determined both by the process’s predictable actions and by a random element.
• What’s the biggest challenge of NLP?
• Why are deterministic models poorly adapted?
• What’s the underlying mathematical tool?
• Why can’t you do this by hand?
FSM: Formal Specification
• Q: a finite set of N states
  - Q = {q0, q1, q2, q3, …}
  - The start state: q0
  - The set of final states: qF
• Σ: a finite input alphabet of symbols
• δ(q, i): transition function
  - Given state q and input symbol i, transition to new state q'
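To make the specification concrete, here is a minimal Python sketch of a deterministic FSM. The particular states, alphabet, and transition table are illustrative assumptions, not a machine from the slides.

```python
# Minimal deterministic FSM: states Q, alphabet Sigma, transition function delta,
# start state, and final states. All values here are illustrative.
Q = {"q0", "q1", "q2"}
SIGMA = {"a", "b", "c"}
DELTA = {                       # delta(q, i) -> q'
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q1", "c"): "q2",
    ("q2", "c"): "q2",
}
START, FINAL = "q0", {"q2"}

def accepts(string):
    """Return True if the FSM ends in a final state after consuming the string."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False        # no transition defined: reject
        state = DELTA[(state, symbol)]
    return state in FINAL

print(accepts("ab"))   # True
print(accepts("ba"))   # False
```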
[FSM diagram, annotated: finite number of states, transitions, input alphabet, start state, final state(s)]
The problem with FSMs…
• All state transitions are equally likely
• But what if we know that isn’t true?
• How might we know?
Weighted FSMs
• What if we know more about state transitions?
  - ‘a’ is twice as likely to be seen in state 1 as ‘b’ or ‘c’
  - ‘c’ is three times as likely to be seen in state 2 as ‘a’
[Weighted FSM diagram; edge weights: 3, 2, 1, 1, 1, 1]
• FSM → Weighted FSM
• What do we get out of it?
  - score(‘ab’) = 2 (?)
  - score(‘bc’) = 3 (?)
Introducing Probabilities
• What’s the problem with adding weights to transitions?
• What if we replace weights with probabilities?
  - Probabilities provide a theoretically sound way to model uncertainty (ambiguity in language)
• But how do we assign probabilities?
Probabilistic FSMs
• What if we know more about state transitions?
  - ‘a’ is twice as likely to be seen in state 1 as ‘b’ or ‘c’
  - ‘c’ is three times as likely to be seen in state 2 as ‘a’
[Probabilistic FSM diagram; edge probabilities: 0.75, 0.5, 0.25, 0.25, 1.0, 0.25]
• What do we get out of it? What’s the interpretation?
  - P(‘ab’) = 0.5
  - P(‘bc’) = 0.1875
• This is a Markov chain
Markov Chain: Formal Specification
• Q: a finite set of N states
  - Q = {q0, q1, q2, q3, …}
• The start state
  - An explicit start state: q0
  - Alternatively, a probability distribution over start states: {π1, π2, π3, …}, Σ πi = 1
• The set of final states: qF
• N × N transition probability matrix A = [a_ij]
  - a_ij = P(q_j | q_i), Σ_j a_ij = 1 ∀i
[Probabilistic FSM diagram from the previous slide, with probabilities 0.75, 0.5, 0.25, 0.25, 1.0, 0.25]
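A minimal sketch of how a Markov chain assigns probability to a state sequence: multiply the start probability of the first state by the transition probabilities along the path. The π and A values below are illustrative assumptions, not the ones in the diagram.

```python
# Markov chain: prior distribution pi over start states, transition matrix A.
# Values are illustrative only.
PI = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
A = {
    "s1": {"s1": 0.25, "s2": 0.50, "s3": 0.25},
    "s2": {"s1": 0.25, "s2": 0.25, "s3": 0.50},
    "s3": {"s1": 1.00, "s2": 0.00, "s3": 0.00},
}

def sequence_probability(states):
    """P(q1, ..., qn) = pi(q1) * prod_t A[q_{t-1}][q_t] under the Markov assumption."""
    prob = PI[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= A[prev][curr]
    return prob

print(sequence_probability(["s1", "s2", "s3"]))  # 0.5 * 0.5 * 0.5 = 0.125
```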
Let’s model the stock market…
[Markov chain diagram: each state corresponds to a physical state in the world. What’s missing? Add “priors” (0.5, 0.2, 0.3)]
• What’s special about this FSM?
  - Present state only depends on the previous state!
  - The (1st-order) Markov assumption: P(q_i | q_0 … q_{i-1}) = P(q_i | q_{i-1})
Are states always observable?
Day:    1     2     3     4     5     6
States: Bull  Bear  S     Bear  Bull  S    ← Not observable!
Here’s what you actually observe:
Obs:    ↑     ↓     ↔     ↑     ↓     ↔
Legend: Bull = Bull Market, Bear = Bear Market, S = Static Market; ↑ = Market is up, ↓ = Market is down, ↔ = Market hasn’t changed
Hidden Markov Models
• Markov chains aren’t enough!
  - What if you can’t directly observe the states?
  - We need to model problems where observations don’t directly correspond to states…
• Solution: A Hidden Markov Model (HMM)
  - Assume two probabilistic processes
  - Underlying process (state transition) is hidden
  - Second process generates sequence of observed events
HMM: Formal Specification
• Q: a finite set of N states
  - Q = {q0, q1, q2, q3, …}
• N × N transition probability matrix A = [a_ij]
  - a_ij = P(q_j | q_i), Σ_j a_ij = 1 ∀i
• Sequence of observations O = o1, o2, ... oT
  - Each drawn from a given set of symbols (vocabulary V)
• N × |V| emission probability matrix B = [b_it]
  - b_it = b_i(o_t) = P(o_t | q_i), Σ b_it = 1 ∀i
• Start and end states
  - An explicit start state q0, or alternatively a prior distribution over start states: {π1, π2, π3, …}, Σ πi = 1
  - The set of final states: qF
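One way to bundle these parameters (π, A, B) in Python is sketched below. The state and symbol names follow the stock-market example; the priors and a few entries echo the worked example later in the deck, but most of the numeric values are filled-in assumptions rather than the lecture’s actual parameters.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """An HMM is fully specified by pi (start), A (transition), and B (emission)."""
    pi: dict    # pi[state] = P(start in state)
    A: dict     # A[state_i][state_j] = P(state_j | state_i)
    B: dict     # B[state][symbol] = P(symbol | state)

# Mostly assumed parameter values, for illustration only.
stock_hmm = HMM(
    pi={"Bull": 0.2, "Bear": 0.5, "Static": 0.3},
    A={
        "Bull":   {"Bull": 0.6, "Bear": 0.2, "Static": 0.2},
        "Bear":   {"Bull": 0.5, "Bear": 0.3, "Static": 0.2},
        "Static": {"Bull": 0.4, "Bear": 0.3, "Static": 0.3},
    },
    B={
        "Bull":   {"↑": 0.7, "↓": 0.1, "↔": 0.2},
        "Bear":   {"↑": 0.1, "↓": 0.6, "↔": 0.3},
        "Static": {"↑": 0.3, "↓": 0.3, "↔": 0.4},
    },
)
```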
Stock Market HMM States? ✓ Transitions? Vocabulary? Emissions? Priors?
Stock Market HMM States? ✓ Transitions? ✓ Vocabulary? Emissions? Priors?
Stock Market HMM States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? Priors?
Stock Market HMM States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? ✓ Priors?
Stock Market HMM States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? ✓ Priors? ✓
[Diagram now includes priors: π1 = 0.5, π2 = 0.2, π3 = 0.3]
Properties of HMMs
• The (first-order) Markov assumption holds
• The probability of an output symbol depends only on the state generating it
• The number of states (N) does not have to equal the number of observations (T)
HMMs: Three Problems
• Likelihood: Given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
• Decoding: Given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
• Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B
Okay, but where did the structure of the HMM come from?
HMM Problem #1: Likelihood
Computing Likelihood
[Stock market HMM diagram, with priors π1 = 0.5, π2 = 0.2, π3 = 0.3]
t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔
Assuming λ_stock models the stock market, how likely are we to observe the sequence of outputs?
Computing Likelihood
• Easy, right?
  - Sum over all possible ways in which we could generate O from λ (a brute-force sketch follows below)
• What’s the problem?
  - Takes O(N^T) time to compute!
• Right idea, wrong algorithm!
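To see where the O(N^T) blow-up comes from, here is a brute-force sketch that literally sums over every possible state sequence. The parameter layout follows the earlier HMM sketch; the tiny two-state model below is an illustrative assumption, not the lecture’s stock-market parameters.

```python
from itertools import product

def brute_force_likelihood(states, pi, A, B, obs):
    """P(O | lambda) by summing over all N^T state sequences -- exponential in T."""
    total = 0.0
    for path in product(states, repeat=len(obs)):          # N^T paths
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

# Tiny illustrative HMM (assumed values).
states = ["H", "C"]
pi = {"H": 0.6, "C": 0.4}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {"u": 0.8, "d": 0.2}, "C": {"u": 0.3, "d": 0.7}}
print(brute_force_likelihood(states, pi, A, B, ["u", "d", "u"]))
```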
Computing Likelihood
• What are we doing wrong?
  - State sequences may have a lot of overlap…
  - We’re recomputing the shared subsequences every time
• Let’s store intermediate results and reuse them!
  - Can we do this?
  - Sounds like a job for dynamic programming!
Forward Algorithm
• Use an N × T trellis or chart [α_tj]
• Forward probabilities: α_tj or α_t(j)
  - = P(being in state j after seeing t observations)
  - = P(o1, o2, ... ot, q_t = j)
• Each cell = ∑ extensions of all paths from other cells
  - α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)
  - α_{t-1}(i): forward path probability until (t-1)
  - a_ij: transition probability of going from state i to j
  - b_j(o_t): probability of emitting symbol o_t in state j
• P(O | λ) = Σ_i α_T(i)
• What’s the running time of this algorithm?
Forward Algorithm: Formal Definition
• Initialization: α_1(j) = π_j b_j(o_1), 1 ≤ j ≤ N
• Recursion: α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t), 1 < t ≤ T
• Termination: P(O | λ) = Σ_i α_T(i)
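A minimal dynamic-programming sketch of the forward algorithm following the definition above (initialization, recursion, termination). The parameter layout and example values are the same illustrative assumptions used in the brute-force sketch.

```python
def forward(states, pi, A, B, obs):
    """P(O | lambda) via the forward trellis: O(N^2 * T) instead of O(N^T)."""
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[-1].values())

# Same tiny illustrative parameters as before (assumed, not the lecture's numbers).
states = ["H", "C"]
pi = {"H": 0.6, "C": 0.4}
A = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
B = {"H": {"u": 0.8, "d": 0.2}, "C": {"u": 0.3, "d": 0.7}}
print(forward(states, pi, A, B, ["u", "d", "u"]))  # agrees with the brute-force sum
```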
Forward Algorithm
O = ↑ ↓ ↑, find P(O | λ_stock)
Forward Algorithm
[Empty trellis: states Bull, Bear, Static (rows) × time t=1, t=2, t=3 (columns), observations ↑ ↓ ↑]
Forward Algorithm: Initialization
[Trellis, t=1 column filled in:]
  α_1(Static) = 0.3 × 0.3 = 0.09
  α_1(Bear) = 0.5 × 0.1 = 0.05
  α_1(Bull) = 0.2 × 0.7 = 0.14
[Observations ↑ ↓ ↑ at t=1, 2, 3]
Forward Algorithm: Recursion
[Trellis: t=1 column as before (α_1(Static) = 0.09, α_1(Bear) = 0.05, α_1(Bull) = 0.14)]
  One term of the sum for α_2(Bull): α_1(Bull) × a_BullBull × b_Bull(↓) = 0.14 × 0.6 × 0.1 = 0.0084
  Summing over all incoming paths: α_2(Bull) = 0.0145
  … and so on
Forward Algorithm: Recursion
Work through the rest of these numbers…
[Trellis: t=1 column filled in (0.09, 0.05, 0.14), α_2(Bull) = 0.0145, remaining cells marked ?]
What’s the asymptotic complexity of this algorithm?
Forward Algorithm: Recursion
[Completed trellis:]
            t=1 (↑)    t=2 (↓)    t=3 (↑)
  Static    0.09       0.0249     0.006477
  Bear      0.05       0.0312     0.001475
  Bull      0.14       0.0145     0.024
Forward Algorithm: Termination
[Completed trellis, as on the previous slide:]
            t=1 (↑)    t=2 (↓)    t=3 (↑)
  Static    0.09       0.0249     0.006477
  Bear      0.05       0.0312     0.001475
  Bull      0.14       0.0145     0.024
P(O) = 0.03195
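Spelling out the termination step with the t=3 column of the trellis:

P(O | λ_stock) = Σ_i α_3(i) = α_3(Bull) + α_3(Bear) + α_3(Static)
              = 0.024 + 0.001475 + 0.006477
              ≈ 0.03195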
HMM Problem #2: Decoding
Decoding
[Stock market HMM diagram, with priors π1 = 0.5, π2 = 0.2, π3 = 0.3]
t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔
Given λ_stock as our model and O as our observations, what are the most likely states the market went through to produce O?
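Viterbi decoding is covered later in the agenda; as a preview, here is a minimal sketch. It mirrors the forward algorithm but takes a max instead of a sum and keeps backpointers. The parameters are the same mostly assumed values as in the HMM sketch above, not the lecture’s actual figures.

```python
def viterbi(states, pi, A, B, obs):
    """Most likely hidden state sequence: forward's sum replaced by max + backpointers."""
    # Initialization
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    back = [{}]
    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
            back[t][j] = best_i
    # Termination: best final state, then follow backpointers
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), v[-1][last]

# Illustrative call (parameter values are assumptions).
states = ["Bull", "Bear", "Static"]
pi = {"Bull": 0.2, "Bear": 0.5, "Static": 0.3}
A = {"Bull":   {"Bull": 0.6, "Bear": 0.2, "Static": 0.2},
     "Bear":   {"Bull": 0.5, "Bear": 0.3, "Static": 0.2},
     "Static": {"Bull": 0.4, "Bear": 0.3, "Static": 0.3}}
B = {"Bull":   {"↑": 0.7, "↓": 0.1, "↔": 0.2},
     "Bear":   {"↑": 0.1, "↓": 0.6, "↔": 0.3},
     "Static": {"↑": 0.3, "↓": 0.3, "↔": 0.4}}
print(viterbi(states, pi, A, B, ["↑", "↓", "↔", "↑", "↓", "↔"]))
```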