III.4 Statistical Language Models (MRS book, Chapter 12*)
– 4.1 What is a statistical language model?
– 4.2 Smoothing Methods
– 4.3 Extended LMs

*With extensions from: C. Zhai, J. Lafferty: A Study of Smoothing Methods for Language Models Applied to Information Retrieval, TOIS 22(2), 2004
III.4.1 What is a Statistical Language Model?

A generative model for word sequences: it induces a probability distribution over word sequences (or bags-of-words, sets-of-words, structured documents, ...).

Example:
P["Today is Tuesday"] = 0.01
P["The Eigenvalue is positive"] = 0.001
P["Today Wednesday is"] = 0.000001

The LM itself is highly context- / application-dependent.

Application examples:
• speech recognition: given that we heard "Julia" and "feels", how likely will we next hear "happy" or "habit"?
• text classification: given that we saw "soccer" 3 times and "game" 2 times, how likely is the news about sports?
• information retrieval: given that the user is interested in math, how likely would the user use "distribution" in a query?
Types of Language Models

Key idea: A document is a good match to a query if the document model is likely to generate the query, i.e., if P(q|d) "is high".

A language model is well-formed over alphabet $\Sigma$ if $\sum_{s \in \Sigma^*} P(s) = 1$.

Generic Language Model:
"Today is Tuesday" 0.01
"The Eigenvalue is positive" 0.001
"Today Wednesday is" 0.00001
…

Unigram Language Model:
"Today" 0.1
"is" 0.3
"Tuesday" 0.2
"Wednesday" 0.2
…

Bigram Language Model:
"Today" 0.1
"is" | "Today" 0.4
"Tuesday" | "is" 0.8
…

How to handle sequences?
• Chain rule (requires long chains of conditional probabilities):
  $P(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2)\, P(t_4 \mid t_1 t_2 t_3)$
• Bigram LM (pairwise conditional probabilities):
  $P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2)\, P(t_4 \mid t_3)$
• Unigram LM (no conditional probabilities):
  $P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2)\, P(t_3)\, P(t_4)$
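A minimal sketch (not part of the original slides) contrasting the unigram and bigram factorizations above; the probability tables reuse the toy values from the slide, and the start marker "<s>" is an assumption made here for the first bigram.

```python
from functools import reduce

# Toy probability tables in the spirit of the slide; values are illustrative only.
unigram = {"Today": 0.1, "is": 0.3, "Tuesday": 0.2, "Wednesday": 0.2}
bigram = {("<s>", "Today"): 0.1, ("Today", "is"): 0.4, ("is", "Tuesday"): 0.8}

def p_unigram(tokens):
    """P_uni(t1 .. tn) = prod_i P(t_i): no conditioning at all."""
    return reduce(lambda acc, t: acc * unigram.get(t, 0.0), tokens, 1.0)

def p_bigram(tokens):
    """P_bi(t1 .. tn) = P(t1) * prod_i P(t_i | t_{i-1}): pairwise conditioning."""
    p, prev = 1.0, "<s>"   # "<s>" models the sequence start (assumption)
    for t in tokens:
        p *= bigram.get((prev, t), 0.0)
        prev = t
    return p

print(p_unigram(["Today", "is", "Tuesday"]))  # 0.1 * 0.3 * 0.2 = 0.006
print(p_bigram(["Today", "is", "Tuesday"]))   # 0.1 * 0.4 * 0.8 = 0.032
```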
Text Generation with a (Unigram) LM

LM $\theta_d$: P[word | d] → sample document d

LM for topic 1 (article on "Text Mining"):
text 0.2
mining 0.1
n-gram 0.01
cluster 0.02
healthy 0.000001
…

LM for topic 2 (health article on "Food Nutrition"):
food 0.25
nutrition 0.1
healthy 0.05
diet 0.02
n-gram 0.00002
…

Different documents d lead to different LMs $\theta_d$.
Basic LM for IR

Query q: "data mining algorithms"

The term probabilities (text, mining, n-gram, cluster, healthy, ... for the "Text Mining" article; food, nutrition, healthy, diet, n-gram, ... for the "Food Nutrition" article) are unknowns obtained by parameter estimation. Which of the two LMs is more likely to generate q, i.e., which one better explains q?
LM Illustration: Document as Model and Query as Sample

A model M over terms (A, B, C, D, E, ...): the document d is viewed as a sample of M and is used for parameter estimation; for the query we estimate the likelihood P[query | M] of observing the query under M.
LM Illustration: Document as Model and Query as Sample (with Smoothing)

Same setting as before, but the model M is now estimated from the document d plus a background corpus and/or smoothing; again we estimate the likelihood P[query | M] of observing the query under M.
Prob.-IR vs. Language Models

P[R | d, q]: user likes doc (R) given that it has features d and the user poses query q.

$P[d \mid R, q] \;\propto\; P[q, d \mid R] \cdot P[R] \;=\; P[q \mid d, R] \cdot P[d \mid R] \cdot P[R]$

prob. IR (ranking proportional to relevance odds)  vs.  statistical LM: $P[q \mid d]$ (ranking proportional to query likelihood)

query likelihood: $s(q,d) = \log P[q \mid d] = \sum_{j \in q} \log P[j \mid d]$

top-k query result: $k\text{-}\operatorname{argmax}_d \; \log P[q \mid d]$

The MLE of $P[j \mid d]$ would be $tf_j / |d|$ (see the sketch below).
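A minimal sketch (my own code, not from the slides) of query-likelihood scoring with pure MLE estimates; it makes the need for smoothing obvious, since a single unseen query term drives the score to minus infinity.

```python
import math
from collections import Counter

def mle_lm(doc_tokens):
    """Unigram MLE: P[j|d] = tf(j,d) / |d|, no smoothing yet."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    return {j: c / n for j, c in tf.items()}

def query_log_likelihood(query_tokens, lm):
    """s(q,d) = log P[q|d] = sum_{j in q} log P[j|d]."""
    score = 0.0
    for j in query_tokens:
        p = lm.get(j, 0.0)
        if p == 0.0:
            return float("-inf")   # unseen query term: exactly why smoothing is needed
        score += math.log(p)
    return score

d = "text mining and data mining on large text collections".split()
print(query_log_likelihood("data mining".split(), mle_lm(d)))             # finite score
print(query_log_likelihood("data mining algorithms".split(), mle_lm(d)))  # -inf
```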
Multi-Bernoulli vs. Multinomial LM

Multi-Bernoulli:
$P[q \mid d] = \prod_{j} p_j(d)^{X_j(q)} \, (1 - p_j(d))^{1 - X_j(q)}$
with $X_j(q) = 1$ if $j \in q$, 0 otherwise

Multinomial:
$P[q \mid d] = \binom{|q|}{f(j_1)\; f(j_2)\; \dots\; f(j_{|q|})} \prod_{j \in q} p_j(d)^{f_j(q)}$
with $f_j(q) = f(j)$ = frequency of j in q and $\sum_j f(j) = |q|$

The multinomial LM is more expressive and usually preferred (see the sketch below).
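A minimal sketch (my addition; it assumes the term probabilities p_j(d) are already smoothed so they lie strictly between 0 and 1, and that the vocabulary W is given explicitly) computing the two query log-likelihoods:

```python
import math
from collections import Counter

def multi_bernoulli_ll(query_tokens, p, vocabulary):
    """log P[q|d] = sum_{j in W} [X_j(q) log p_j(d) + (1 - X_j(q)) log(1 - p_j(d))];
    only presence/absence of a term in q matters."""
    q_terms = set(query_tokens)
    return sum(math.log(p[j]) if j in q_terms else math.log(1.0 - p[j])
               for j in vocabulary)

def multinomial_ll(query_tokens, p):
    """log P[q|d] = log(|q|! / prod_j f_j(q)!) + sum_{j in q} f_j(q) log p_j(d);
    term frequencies in q matter, which makes the model more expressive."""
    f = Counter(query_tokens)
    log_coeff = math.lgamma(len(query_tokens) + 1) - sum(math.lgamma(c + 1) for c in f.values())
    return log_coeff + sum(c * math.log(p[j]) for j, c in f.items())
```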
LM Scoring by Kullback-Leibler Divergence

$\log_2 P[q \mid d] \;=\; \log_2 \left[ \binom{|q|}{f(j_1)\; f(j_2)\; \dots\; f(j_{|q|})} \prod_{j \in q} p_j(d)^{f_j(q)} \right] \;\sim\; \sum_{j \in q} f_j(q)\, \log_2 p_j(d) \;=\; -H(f(q), p(d))$   (neg. cross-entropy)

neg. cross-entropy = entropy + neg. KL divergence:
$-H(f(q), p(d)) \;=\; -H(f(q)) \;-\; D(f(q) \,\|\, p(d))$ with $D(f(q) \,\|\, p(d)) = \sum_{j} f_j(q)\, \log_2 \frac{f_j(q)}{p_j(d)}$, the KL divergence of q and d.

(Here the multinomial coefficient is dropped because it does not depend on d, and f(q) is read as the relative term-frequency distribution of q; since H(f(q)) is also independent of d, ranking by the negative KL divergence is rank-equivalent to ranking by query likelihood.)
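A minimal sketch (my addition; assumes p_d holds smoothed document-model probabilities with p_d[j] > 0 for all query terms, and treats f(q) as relative frequencies) of the rank-equivalent negative-KL score:

```python
import math
from collections import Counter

def neg_kl_score(query_tokens, p_d):
    """-D(f(q) || p(d)) = -sum_j f_j(q) * log2(f_j(q) / p_j(d));
    rank-equivalent to log P[q|d] because H(f(q)) does not depend on d."""
    f = Counter(query_tokens)
    n = len(query_tokens)
    score = 0.0
    for j, c in f.items():
        fj = c / n                       # relative frequency of j in q
        score -= fj * math.log2(fj / p_d[j])
    return score
```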
III.4.2 Smoothing Methods

Smoothing is absolutely crucial to avoid overfitting and to make LMs useful in practice (one LM per doc, one LM per query)!

Possible methods:
• Laplace smoothing
• Absolute Discounting
• Jelinek-Mercer smoothing
• Dirichlet-prior smoothing
• Katz smoothing
• Good-Turing smoothing
• ...

Most come with their own parameters; the choice of method and the parameter setting are still mostly "black art" (or empirical).
Laplace Smoothing and Absolute Discounting

Estimation of $\theta_d$ by MLE would yield $p_j(d) = \frac{freq(j,d)}{|d|}$, where $|d| = \sum_j freq(j,d)$.

Additive Laplace smoothing (for a multinomial over vocabulary W with |W| = m):
$\hat p_j(d) = \frac{freq(j,d) + 1}{|d| + m}$

Absolute discounting (with corpus C and discount $\delta \in [0,1]$):
$\hat p_j(d) = \frac{\max(freq(j,d) - \delta,\, 0)}{|d|} + \delta_d \cdot \frac{freq(j,C)}{|C|}$, where $\delta_d = \delta \cdot \frac{\#\text{distinct terms in } d}{|d|}$

(see the sketch below)
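A minimal sketch of both estimators as reconstructed above (my own code, not from the slides; the discount value 0.7 and the toy texts are arbitrary):

```python
from collections import Counter

def laplace_prob(j, tf_d, doc_len, vocab_size):
    """Additive Laplace smoothing: (freq(j,d) + 1) / (|d| + m)."""
    return (tf_d.get(j, 0) + 1) / (doc_len + vocab_size)

def absolute_discount_prob(j, tf_d, doc_len, tf_c, corpus_len, delta=0.7):
    """Absolute discounting: subtract delta from every seen term's count and
    give the freed mass delta * (#distinct terms in d) / |d| to the corpus model."""
    seen_part = max(tf_d.get(j, 0) - delta, 0.0) / doc_len
    delta_d = delta * len(tf_d) / doc_len
    return seen_part + delta_d * tf_c.get(j, 0) / corpus_len

doc = "text mining and data mining on text".split()
corpus = "text mining data algorithms food nutrition text data collections".split()
tf_d, tf_c = Counter(doc), Counter(corpus)
m = len(set(corpus))                       # vocabulary size (toy assumption)
print(laplace_prob("algorithms", tf_d, len(doc), m))
print(absolute_discount_prob("algorithms", tf_d, len(doc), tf_c, len(corpus)))
```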
Jelinek-Mercer Smoothing

Idea: use a linear combination of the doc LM with a background LM (corpus, common language); one could also consider a query log as background LM for the query.

$\hat p_j(d) = \lambda \cdot \frac{freq(j,d)}{|d|} + (1-\lambda) \cdot \frac{freq(j,C)}{|C|}$

Parameter tuning of $\lambda$ by cross-validation with held-out data (see the sketch below):
• divide the set of relevant (d,q) pairs into n partitions
• build the LM on the pairs from n−1 partitions
• choose $\lambda$ to maximize precision (or recall or F1) on the n-th partition
• iterate with a different choice of the held-out partition and average
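A minimal sketch of the JM-smoothed estimator and the resulting query score (my own code; λ = 0.5 is an arbitrary default and would in practice be tuned by the held-out procedure above):

```python
import math
from collections import Counter

def jm_prob(j, tf_d, doc_len, tf_c, corpus_len, lam=0.5):
    """p̂_j(d) = λ * freq(j,d)/|d| + (1-λ) * freq(j,C)/|C|."""
    return lam * tf_d.get(j, 0) / doc_len + (1 - lam) * tf_c.get(j, 0) / corpus_len

def jm_query_score(query_tokens, doc_tokens, corpus_tokens, lam=0.5):
    """log P[q|d] under the JM-smoothed unigram document model.
    Assumes every query term occurs at least once in the corpus,
    so the smoothed probability is never zero."""
    tf_d, tf_c = Counter(doc_tokens), Counter(corpus_tokens)
    return sum(math.log(jm_prob(j, tf_d, len(doc_tokens), tf_c, len(corpus_tokens), lam))
               for j in query_tokens)
```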
Jelinek-Mercer Smoothing: Relationship to TF*IDF

$P[q \mid \theta] = \prod_{i \in q} \left( \lambda\, P[i \mid d] + (1-\lambda)\, P[i \mid C] \right)$

$\log P[q \mid \theta] = \sum_{i \in q} \log\!\left( \lambda\, \frac{tf(i,d)}{\sum_k tf(k,d)} + (1-\lambda)\, \frac{df(i)}{\sum_k df(k)} \right)$   with absolute frequencies tf, df

$\;\sim\; \sum_{i \in q} \log\!\left( 1 + \frac{\lambda}{1-\lambda} \cdot \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)} \right)$   relative tf ~ relative idf
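The last step drops a summand that does not depend on d and is therefore irrelevant for ranking; a sketch of that factoring (my addition, not on the original slide):

```latex
\log P[q \mid \theta]
  = \sum_{i \in q} \log\!\left[ (1-\lambda)\,\frac{df(i)}{\sum_k df(k)}
      \left( 1 + \frac{\lambda}{1-\lambda}
        \cdot \frac{tf(i,d)}{\sum_k tf(k,d)}
        \cdot \frac{\sum_k df(k)}{df(i)} \right) \right]
  = \sum_{i \in q} \log\!\left( 1 + \frac{\lambda}{1-\lambda}
        \cdot \frac{tf(i,d)}{\sum_k tf(k,d)}
        \cdot \frac{\sum_k df(k)}{df(i)} \right)
    \;+\; \underbrace{\sum_{i \in q} \log\!\left( (1-\lambda)\,\frac{df(i)}{\sum_k df(k)} \right)}_{\text{independent of } d}
```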
Dirichlet-Prior Smoothing

Use a Dirichlet distribution as prior: $\theta \sim \text{Dirichlet}(\alpha)$.

MAP estimation for $\theta$ with term frequencies f in document d:
$M(\theta) := P[\theta \mid f] = \frac{P[f \mid \theta]\, P[\theta]}{\int P[f \mid \theta']\, P[\theta']\, d\theta'}$,   posterior: $\theta \mid f \sim \text{Dirichlet}(\alpha + f)$

$\hat p_j(d) = \arg\max_{\theta_j} M(\theta) = \frac{f_j + \alpha_j - 1}{|d| + \sum_k \alpha_k - m} = \frac{|d|}{|d| + \mu}\, \hat P[j \mid d] + \frac{\mu}{|d| + \mu}\, P[j \mid C]$

with $\alpha_j$ set to $\mu \cdot P[j \mid C] + 1$ for the Dirichlet hypergenerator and $\mu > 1$ set to a multiple of the average document length (see the sketch below).

Dirichlet density:
$f(\theta_1, \dots, \theta_m;\, \alpha_1, \dots, \alpha_m) = \frac{\Gamma\!\left(\sum_{j=1..m} \alpha_j\right)}{\prod_{j=1..m} \Gamma(\alpha_j)} \prod_{j=1..m} \theta_j^{\alpha_j - 1}$

(The Dirichlet is the conjugate prior for the parameters of a multinomial distribution: a Dirichlet prior implies a Dirichlet posterior, only with different parameters.)
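A minimal sketch of the resulting estimator (my own code; μ = 2000 is just a common ballpark choice, in practice a multiple of the average document length as stated above, and the toy texts are arbitrary):

```python
from collections import Counter

def dirichlet_prob(j, tf_d, doc_len, tf_c, corpus_len, mu=2000.0):
    """Dirichlet-prior smoothing:
    p̂_j(d) = (freq(j,d) + μ * P[j|C]) / (|d| + μ)
            = |d|/(|d|+μ) * MLE(j,d) + μ/(|d|+μ) * P[j|C]."""
    p_jc = tf_c.get(j, 0) / corpus_len
    return (tf_d.get(j, 0) + mu * p_jc) / (doc_len + mu)

doc = "text mining and data mining on text".split()
corpus = "text mining data algorithms food nutrition text data collections".split()
print(dirichlet_prob("algorithms", Counter(doc), len(doc), Counter(corpus), len(corpus)))
```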