Introduction to Markov Models: estimating the probability of phrases of words, sentences, etc.
But first: A few preliminaries
What counts as a word? A tricky question…
How to find sentences?
Q1: How to estimate the probability of a given sentence W? A crucial step in speech recognition (and lots of other applications).
First guess: products of unigrams, \hat{P}(W) \approx \prod_{w \in W} P(w)
Given word lattice: form/farm   subsidy/subsidies   for/far
Unigram counts (in 1.7 × 10^6 words of AP text):
  form 183      farm 74
  subsidy 15    subsidies 55
  for 18185     far 570
Not quite right…
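As a concrete illustration, here is a minimal Python sketch of the unigram "first guess", using the AP counts from the slide. The function and variable names are hypothetical, and a real system would work in log space.

# Minimal sketch of the "products of unigrams" first guess.
# Counts are the AP-corpus figures from the slide; names are illustrative only.
unigram_counts = {"form": 183, "farm": 74, "subsidy": 15,
                  "subsidies": 55, "for": 18185, "far": 570}
TOTAL_TOKENS = 1_700_000  # ~1.7 * 10^6 words of AP text

def unigram_prob(sentence):
    """Estimate P(W) as the product of per-word unigram probabilities."""
    p = 1.0
    for w in sentence.split():
        p *= unigram_counts.get(w, 0) / TOTAL_TOKENS
    return p

# Compare two paths through the form/farm, subsidy/subsidies, for/far lattice:
print(unigram_prob("farm subsidies for"))
print(unigram_prob("form subsidies for"))  # unigrams prefer this path; context is ignored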
Predicting a word sequence II
Next guess: products of bigrams. For W = w_1 w_2 w_3 … w_n,
  \hat{P}(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
Given word lattice: form/farm   subsidy/subsidies   for/far
Bigram counts (in 1.7 × 10^6 words of AP text):
  form subsidy 0      subsidy for 2
  form subsidies 0    subsidy far 0
  farm subsidy 0      subsidies for 6
  farm subsidies 4    subsidies far 0
Better (if not quite right)… (But the counts are tiny! Why?)
How can we estimate P correctly?
Problem: the Naïve Bayes model for bigrams violates independence assumptions. Let's do this right…
Let W = w_1 w_2 w_3 … w_n. Then, by the chain rule,
  P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdot \ldots \cdot P(w_n \mid w_1 \ldots w_{n-1})
We can estimate P(w_2 \mid w_1) by the Maximum Likelihood Estimator
  Count(w_1 w_2) / Count(w_1)
and P(w_3 \mid w_1 w_2) by
  Count(w_1 w_2 w_3) / Count(w_1 w_2)
and so on…
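A minimal Python sketch of the MLE bigram estimate Count(w_1 w_2) / Count(w_1); the helper names and the toy corpus are hypothetical.

from collections import Counter

def mle_bigram_estimator(tokens):
    """Return a function p(w2, w1) ~= Count(w1 w2) / Count(w1), the MLE."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def p(w2, w1):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return p

# Toy usage: Count("the cat") = 2 and Count("the") = 3, so P(cat | the) = 2/3
p = mle_bigram_estimator("the cat sat on the mat the cat ran".split())
print(p("cat", "the"))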
And finally, estimating P(w_n \mid w_1 w_2 \ldots w_{n-1})
Again, we can estimate P(w_n \mid w_1 w_2 \ldots w_{n-1}) with the MLE
  Count(w_1 w_2 \ldots w_n) / Count(w_1 w_2 \ldots w_{n-1})
So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot:
  Count("Heat up the oil in a large pot") / Count("Heat up the oil in a large") = 0 / 0
Hmm… The Web Changes Things (2008 or so): even the web in 2008 yields low counts!
Statistics and the Web II
So, P("pot" | "heat up the oil in a large ___") = 8/49 ≈ 0.16
But the web has grown!!!
… 165/891 ≈ 0.185
So… a larger corpus won't help much unless it's HUGE… but the web is! But what if we only have 100 million words for our estimates?
A BOTEC (back-of-the-envelope) Estimate of What We Can Estimate
What parameters can we estimate with 100 million words of training data?
Assuming (for now) a uniform distribution over a vocabulary of only 5000 words.
So even with 10^8 words of data, we already run into the sparse data problem for trigrams… (see the worked arithmetic below)
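A worked version of the back-of-the-envelope arithmetic, under the slide's assumption of a 5000-word vocabulary and 10^8 training words:
  unigrams: 5000 parameters, so roughly 2 \times 10^4 training words per parameter
  bigrams:  5000^2 = 2.5 \times 10^7 parameters, only about 4 training words per parameter
  trigrams: 5000^3 = 1.25 \times 10^{11} parameters, i.e. fewer than 10^{-3} training words per parameter on average
Most trigrams therefore never occur in the training data at all.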
The Markov Assumption: Only the Immediate Past Matters
The Markov Assumption: Estimation
We estimate the probability of each w_i given previous context by
  P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-1})
which can be estimated by
  Count(w_{i-1} w_i) / Count(w_{i-1})
So we're back to counting only unigrams and bigrams!
AND we have a correct, practical estimation method for P(W) given the Markov assumption!
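A sketch of how P(W) could then be scored in practice under the Markov assumption, using only unigram and bigram counts. The names are hypothetical; the initial P(w_1) term is omitted for brevity, and a real system would add a start symbol.

import math
from collections import Counter

def bigram_log_prob(sentence, training_tokens):
    """Approximate log P(W) = sum_i log P(w_i | w_{i-1}),
    with P(w_i | w_{i-1}) ~= Count(w_{i-1} w_i) / Count(w_{i-1})."""
    unigrams = Counter(training_tokens)
    bigrams = Counter(zip(training_tokens, training_tokens[1:]))
    words = sentence.split()
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        count = bigrams[(prev, cur)]
        if count == 0:
            return float("-inf")  # unseen bigram: zero probability (the problem smoothing fixes)
        logp += math.log(count / unigrams[prev])
    return logp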
Markov Models
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given unigram estimates:
• Fix some ordering of the vocabulary v_1 v_2 v_3 … v_k.
• For each word w_i, 1 ≤ i ≤ n:
  – Choose a random value r_i between 0 and 1
  – w_i = the first v_j such that \sum_{m=1}^{j} P(v_m) \ge r_i
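A minimal Python sketch of this unigram sampling step; the vocabulary, probability lists, and function names are hypothetical.

import random

def sample_word(vocab, probs):
    """Pick r in [0, 1) and return the first v_j whose cumulative
    probability sum_{m<=j} P(v_m) reaches r (the Shannon/Miller/Selfridge step).
    `vocab` and `probs` are parallel lists in a fixed vocabulary order."""
    r = random.random()
    cumulative = 0.0
    for v, p in zip(vocab, probs):
        cumulative += p
        if cumulative >= r:
            return v
    return vocab[-1]  # guard against floating-point rounding

def generate_unigram_text(vocab, probs, n):
    """Generate n words, each drawn independently from the unigram estimates."""
    return " ".join(sample_word(vocab, probs) for _ in range(n))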
Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method
To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):
• Fix some ordering of the vocabulary v_1 v_2 v_3 … v_k.
• Use the unigram method to generate an initial word w_1.
• For each remaining w_i, 2 ≤ i ≤ n:
  – Choose a random value r_i between 0 and 1
  – w_i = the first v_j such that \sum_{m=1}^{j} P(v_m \mid w_{i-1}) \ge r_i
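Extending the sketch above to the 1st-order Markov case, reusing sample_word from the previous sketch. Here cond_probs is a hypothetical mapping from a previous word to a conditional distribution over the same vocabulary ordering.

def generate_markov_text(vocab, unigram_probs, cond_probs, n):
    """Draw w_1 from the unigram estimates, then each w_i from P(. | w_{i-1}).
    `cond_probs[w]` is the list of P(v_j | w) in the fixed vocabulary order."""
    words = [sample_word(vocab, unigram_probs)]
    for _ in range(n - 1):
        words.append(sample_word(vocab, cond_probs[words[-1]]))
    return " ".join(words)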
The Shannon/Miller/Selfridge method trained on Shakespeare (This and next two slides from Jurafsky)
Wall Street Journal just isn’t Shakespeare
Shakespeare as corpus: N = 884,647 tokens, V = 29,066 word types.
Shakespeare produced about 300,000 bigram types out of V^2 ≈ 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table); see the arithmetic just below.
• Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
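The arithmetic behind the 99.96% figure, from the numbers on the slide:
  V^2 = 29{,}066^2 = 844{,}832{,}356 \approx 844 million possible bigrams
  300{,}000 / 844{,}832{,}356 \approx 0.036\% of possible bigrams observed, so about 99.96\% were never seen.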
The Sparse Data Problem Again: How likely is a 0 count? Much more likely than I let on!!!
English word frequencies are well described by Zipf's Law
Zipf (1949) characterized the relation between word frequency and rank as:
  f \cdot r = C (for constant C), i.e. r = C / f, so \log(r) = \log(C) - \log(f)
Purely Zipfian data plots as a straight line on a log-log scale.
*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
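A quick way to eyeball Zipf's law on any token list (a sketch; the helper name is hypothetical): for words sorted by decreasing frequency, the product rank × frequency should stay roughly constant at C, though real corpora deviate at the head and tail, as the Brown-corpus plot on the next slide shows.

from collections import Counter

def zipf_table(tokens, top=10):
    """Print rank, frequency, and rank*frequency for the most frequent words;
    under Zipf's law (f = C/r) the last column is roughly constant."""
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(top), start=1):
        print(f"{rank:>4}  {word:<15} f={freq:<8} r*f={rank * freq}")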
Word frequency & rank in Brown Corpus vs Zipf: lots of area under the tail of this curve! (From: Interactive Mathematics, http://www.intmath.com)
Zipf’s law for the Brown corpus
Smoothing: "This black art is why NLP is taught in the engineering school" – Jason Eisner
Smoothing
At least one unknown word is likely per sentence, given Zipf!
To fix the zero counts this causes, we can smooth the data:
• Assume we know how many types never occur in the data.
• Steal probability mass from types that occur at least once.
• Distribute this probability mass over the types that never occur.
Smoothing … is like Robin Hood:
• it steals from the rich
• and gives to the poor
Review: Add-One Smoothing
Estimate probabilities by assuming every possible word type v \in V actually occurred one extra time (as if by appending an unabridged dictionary).
So if there were N words in our corpus, then instead of estimating
  \hat{P}(w) = Count(w) / N
we estimate
  \hat{P}(w) = (Count(w) + 1) / (N + V)
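A minimal sketch of add-one smoothing for unigram estimates, written with an added count k so that the smaller-added-count variant mentioned on the next slide falls out for free; the names are hypothetical.

from collections import Counter

def add_k_unigram_probs(tokens, vocab, k=1.0):
    """Add-k smoothed unigram estimates: P(w) = (Count(w) + k) / (N + k*|V|).
    With k=1 this is classic add-one (Laplace) smoothing; every word type in
    `vocab` gets non-zero probability, even if it never occurs in `tokens`."""
    counts = Counter(tokens)
    n, v = len(tokens), len(vocab)
    return {w: (counts[w] + k) / (n + k * v) for w in vocab}

The returned probabilities sum to 1 as long as every training token appears in `vocab`.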
Add-One Smoothing (again)
Pro: very simple technique
Cons:
• Probability of frequent n-grams is underestimated
• Probability of rare (or unseen) n-grams is overestimated
• Therefore, too much probability mass is shifted towards unseen n-grams
• All unseen n-grams are smoothed in the same way
Using a smaller added count improves things, but only somewhat.
More advanced techniques (Kneser-Ney, Witten-Bell) use properties of the component (n-1)-grams and the like… (Hint for this homework)