  1. CS54701: Information Retrieval
     Retrieval Models: Language Models
     Luo Si
     Department of Computer Science, Purdue University

  2. Retrieval Model: Language Model
     - Introduction to language models
     - Unigram language models
     - Document language model estimation
       - Maximum likelihood estimation
       - Maximum a posteriori estimation
       - Jelinek-Mercer smoothing
     - Model-based feedback

  3. Language Models: Motivation
     - Vector space model for information retrieval
       - Documents and queries are vectors in the term space
       - Relevance is measured by the similarity between document vectors and the query vector
     - Problems with the vector space model
       - Ad-hoc term weighting schemes
       - Ad-hoc similarity measurement
       - No justification of the relationship between relevance and similarity
     - We need more principled retrieval models…

  4. Introduction to Language Models
     - A language model can be created for any language sample
       - A document
       - A collection of documents
       - A sentence, paragraph, chapter, query…
     - The size of the language sample affects the quality of the language model
       - Long documents have more accurate models
       - Short documents have less accurate models
       - A model for a sentence, paragraph, or query may not be reliable

  5. Introduction to Language Models
     - A document language model defines a probability distribution over indexed terms
       - E.g., the probability of generating a term
       - The sum of the probabilities is 1
     - A query can be seen as observed data from unknown models
       - A query also defines a language model (more on this later)
     - How might the models be used for IR?
       - Rank documents by $\Pr(q \mid d_i)$
       - Rank documents by the Kullback-Leibler (KL) divergence between the language models of $q$ and $d_i$ (covered later)

  6. Language Model for IR: Example
     - Estimate a language model for each document:
       - $d_1$: sport, basketball, ticket, sport
       - $d_2$: basketball, ticket, finance, ticket, sport
       - $d_3$: stock, finance, finance, stock
     - For the query $q$ = "sport, basketball", estimate the generation probability $\Pr(q \mid d_i)$ under each document's language model, and rank the documents by it to generate the retrieval results (a sketch of this follows below).
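
A minimal sketch of this example in Python (illustrative, not from the slides; the count-based maximum likelihood estimate it uses is only derived on the later slides):

```python
from collections import Counter

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}

def estimate_lm(doc):
    """Unigram model: P_i(w) = tf_i(w) / |d_i| (maximum likelihood)."""
    counts = Counter(doc)
    return {w: tf / len(doc) for w, tf in counts.items()}

def query_likelihood(query, lm):
    """Pr(q | d_i): product of P_i(w) over query terms (0 if w is unseen)."""
    prob = 1.0
    for w in query:
        prob *= lm.get(w, 0.0)
    return prob

query = ["sport", "basketball"]
scores = {name: query_likelihood(query, estimate_lm(doc))
          for name, doc in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# d1: 0.5 * 0.25 = 0.125 > d2: 0.2 * 0.2 = 0.04 > d3: 0.0
```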

  7. Language Models
     Three basic problems for language models:
     - What type of probability distribution can be used to construct language models?
     - How to estimate the parameters of the distribution of the language models?
     - How to compute the likelihood of generating queries given the language models of documents?

  8. Multinomial/Unigram Language Models
     - A language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary
     - Example: five words in the vocabulary (sport, basketball, ticket, finance, stock). For a document $d_i$, its language model is
       {$P_i$("sport"), $P_i$("basketball"), $P_i$("ticket"), $P_i$("finance"), $P_i$("stock")}
     - Formally, the language model is $\{P_i(w)\}$ for every word $w$ in the vocabulary $V$, with $\sum_k P_i(w_k) = 1$ and $0 \le P_i(w_k) \le 1$
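
As a small illustration (a hypothetical representation, not from the slides), such a model can be stored as a dictionary over the full vocabulary and checked against the two constraints above:

```python
import math

V = ["sport", "basketball", "ticket", "finance", "stock"]

# Unigram model for a document: words absent from the document
# get explicit zero probability under maximum likelihood.
p_d1 = {"sport": 0.5, "basketball": 0.25, "ticket": 0.25,
        "finance": 0.0, "stock": 0.0}

assert math.isclose(sum(p_d1.values()), 1.0)  # probabilities sum to 1
assert all(0.0 <= p_d1[w] <= 1.0 for w in V)  # each probability in [0, 1]
```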

  9. Multinomial/Unigram Language Models
     Estimate a multinomial language model for each document:
     - $d_1$: sport, basketball, ticket, sport
     - $d_2$: basketball, ticket, finance, ticket, sport
     - $d_3$: stock, finance, finance, stock

  10. Maximum Likelihood Estimation (MLE)
      - Find the model parameters that make the generation likelihood reach its maximum: $M^* = \arg\max_M \Pr(D \mid M)$
      - Data: one document $d_i$ with term counts $tf_i(w_1), \ldots, tf_i(w_K)$ and length $|d_i|$; there are $K$ words in the vocabulary, $w_1, \ldots, w_K$ (e.g., 5)
      - Model: a multinomial $M$ with parameters $\{p_i(w_k)\}$
      - Likelihood: $\Pr(d_i \mid M)$
      - $M^* = \arg\max_M \Pr(d_i \mid M)$

  11. Maximum Likelihood Estimation (MLE)
      The multinomial likelihood (the combinatorial factor does not depend on the parameters):
      $$p(d_i \mid M) = \frac{|d_i|!}{tf_i(w_1)! \cdots tf_i(w_K)!} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)} \propto \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)}$$
      Take the log-likelihood:
      $$l(d_i \mid M) = \log p(d_i \mid M) = \sum_k tf_i(w_k) \log p_i(w_k)$$
      Use the Lagrange multiplier approach to enforce $\sum_k p_i(w_k) = 1$:
      $$l'(d_i \mid M) = \sum_k tf_i(w_k) \log p_i(w_k) + \lambda \Big( \sum_k p_i(w_k) - 1 \Big)$$
      Set the partial derivatives to zero:
      $$\frac{\partial l'}{\partial p_i(w_k)} = \frac{tf_i(w_k)}{p_i(w_k)} + \lambda = 0 \quad\Rightarrow\quad p_i(w_k) = -\frac{tf_i(w_k)}{\lambda}$$
      Since $\sum_k p_i(w_k) = 1$ and $\sum_k tf_i(w_k) = |d_i|$, we get $\lambda = -|d_i|$, and the maximum likelihood estimate is
      $$p_i(w_k) = \frac{tf_i(w_k)}{|d_i|}$$

  12. Maximum Likelihood Estimation (MLE)
      The MLE for each document, written as $(p_{sp}, p_b, p_t, p_f, p_{st})$ over (sport, basketball, ticket, finance, stock):
      - $d_1$ = {sport, basketball, ticket, sport}: (0.5, 0.25, 0.25, 0, 0)
      - $d_2$ = {basketball, ticket, finance, ticket, sport}: (0.2, 0.2, 0.4, 0.2, 0)
      - $d_3$ = {stock, finance, finance, stock}: (0, 0, 0, 0.5, 0.5)
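
A quick check of the closed-form estimate $p_i(w_k) = tf_i(w_k) / |d_i|$ against these tuples (document contents as in the earlier example; the code is illustrative):

```python
from collections import Counter

V = ["sport", "basketball", "ticket", "finance", "stock"]
docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}

for name, doc in docs.items():
    tf = Counter(doc)                         # term frequencies tf_i(w_k)
    mle = tuple(tf[w] / len(doc) for w in V)  # tf_i(w_k) / |d_i|
    print(name, mle)
# d1 (0.5, 0.25, 0.25, 0.0, 0.0)
# d2 (0.2, 0.2, 0.4, 0.2, 0.0)
# d3 (0.0, 0.0, 0.0, 0.5, 0.5)
```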

  13. Maximum Likelihood Estimation (MLE)
      - Maximum likelihood estimation assigns zero probability to unseen words in a small sample
      - A specific example: only two words in the vocabulary ($w_1$ = sport, $w_2$ = business), like (head, tail) for a coin; a document $d_i$ generates a sequence of the two words, like drawing a coin many times:
      $$\Pr(d_i \mid M) \propto p_i(w_1)^{tf_i(w_1)} \, (1 - p_i(w_1))^{tf_i(w_2)}$$
      - Observe only two words (flip the coin twice); the MLE estimates are:
        - "business sport": $P_i(w_1) = 0.5$
        - "sport sport": $P_i(w_1) = 1$ ?
        - "business business": $P_i(w_1) = 0$ ?

  14. Maximum Likelihood Estimation (MLE)
      The specific example again: observe only two words (flip the coin twice), and the MLE estimates are:
      - "business sport": $P_i(w_1)^* = 0.5$
      - "sport sport": $P_i(w_1)^* = 1$ ?
      - "business business": $P_i(w_1)^* = 0$ ?
      This is the data sparseness problem.
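
A tiny demonstration of why this matters for retrieval (illustrative, not from the slides): once the MLE assigns $P_i(\text{business}) = 0$, any query containing "business" gets generation probability exactly 0, however well the rest of the query matches:

```python
doc = ["sport", "sport"]
lm = {"sport": doc.count("sport") / len(doc),        # 1.0
      "business": doc.count("business") / len(doc)}  # 0.0

query = ["sport", "sport", "sport", "business"]
prob = 1.0
for w in query:
    prob *= lm[w]
print(prob)  # 0.0, despite three matching "sport" terms
```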

  15. Solutions to Sparse Data Problems
      - Maximum a posteriori (MAP) estimation
      - Shrinkage
      - Bayesian ensemble approach

  16. Maximum A Posteriori (MAP) Estimation
      - Select the model that maximizes the probability of the model given the observed data:
      $$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M) \Pr(M)$$
      - $\Pr(M)$: prior belief/knowledge
      - Use the prior $\Pr(M)$ to avoid zero probabilities
      - A specific example: only two words in the vocabulary (sport, business). For a document $d_i$, combine the likelihood with a prior distribution $\Pr(M)$:
      $$\Pr(d_i \mid M)\,\Pr(M) \propto p_i(w_1)^{tf_i(w_1)} \, (1 - p_i(w_1))^{tf_i(w_2)} \, \Pr(M)$$

  17. Maximum A Posteriori (MAP) Estimation
      - Introduce a prior on the multinomial distribution
        - Use the prior $\Pr(M)$ to avoid zero probabilities; most coins are more or less unbiased
        - Use a Dirichlet prior on $p(w)$:
      $$Dir(p \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}, \quad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$
      - The $\alpha_1, \ldots, \alpha_K$ are hyper-parameters; the ratio of gamma functions is a normalizing constant for $p$
      - $\Gamma(x)$ is the gamma function: $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, dt$, with $\Gamma(n) = (n-1)!$ for a positive integer $n$
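
The gamma-function facts above are easy to sanity-check with Python's standard library (a small sketch; `math.prod` requires Python 3.8+):

```python
import math

# Gamma(n) = (n - 1)! for positive integers
assert math.gamma(5) == math.factorial(4)  # both are 24

# Dirichlet normalizing constant Gamma(sum alpha_k) / prod Gamma(alpha_k),
# here for the Dir(3, 3) prior used in the two-word example below
alphas = [3, 3]
norm = math.gamma(sum(alphas)) / math.prod(math.gamma(a) for a in alphas)
print(norm)  # Gamma(6) / (Gamma(3) * Gamma(3)) = 120 / 4 = 30.0
```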

  18. Maximum A Posteriori (MAP) Estimation
      For the two-word example, use a Dirichlet prior with $\alpha_1 = \alpha_2 = 3$:
      $$\Pr(M) \propto p_i(w_1)^{2} \, (1 - p_i(w_1))^{2}$$
      (The slide plots this prior density: it peaks at $p_i(w_1) = 0.5$ and vanishes at 0 and 1.)
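
A few evaluations of this (unnormalized) prior density, just to confirm the shape described above:

```python
# Dir(3, 3) prior for the two-word case, up to the normalizing constant
def prior(p, alpha1=3, alpha2=3):
    return p ** (alpha1 - 1) * (1 - p) ** (alpha2 - 1)

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(p, prior(p))
# maximal at p = 0.5, zero at the extremes 0 and 1
```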

  19. Maximum A Posteriori (MAP) Estimation
      $$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M) \Pr(M)$$
      $$\Pr(d_i \mid M)\,\Pr(M) \propto p_i(w_1)^{tf_i(w_1)} (1 - p_i(w_1))^{tf_i(w_2)} \cdot p_i(w_1)^{\alpha_1 - 1} (1 - p_i(w_1))^{\alpha_2 - 1} = p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \, (1 - p_i(w_1))^{tf_i(w_2) + \alpha_2 - 1}$$
      The $\alpha_k - 1$ terms act as pseudo counts, so
      $$M^* = \arg\max_{p_i(w_1)} \; p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \, (1 - p_i(w_1))^{tf_i(w_2) + \alpha_2 - 1}$$

  20. Maximum A Posteriori (MAP) Estimation
      A specific example: observe only two words (flip the coin twice), "sport sport", where the MLE gave $P_i(w_1)^* = 1$?
      The slide plots the posterior: the likelihood times the prior curve $p_i(w_1)^2 (1 - p_i(w_1))^2$, which pulls the estimate away from 1.

  21. Maximum A Posteriori (MAP) Estimation
      The same example: observe only two words (flip the coin twice), "sport sport", where the MLE gave $P_i(w_1)^* = 1$?
      The MAP estimate is
      $$p_i(w_1)^* = \frac{tf_i(w_1) + \alpha_1 - 1}{tf_i(w_1) + \alpha_1 - 1 + tf_i(w_2) + \alpha_2 - 1} = \frac{2 + 3 - 1}{(2 + 3 - 1) + (0 + 3 - 1)} = \frac{4}{6} = \frac{2}{3}$$
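
The same arithmetic as a small function (names are illustrative):

```python
# MAP estimate for the two-word case under a Dir(alpha1, alpha2) prior:
# the alpha_k - 1 terms act as pseudo counts added to the observed tf's.
def map_estimate(tf1, tf2, alpha1=3, alpha2=3):
    return (tf1 + alpha1 - 1) / (tf1 + alpha1 - 1 + tf2 + alpha2 - 1)

# "sport sport": tf(sport) = 2, tf(business) = 0
print(map_estimate(2, 0))  # 4/6 = 0.666..., instead of the MLE's 1.0
```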

  22. MAP Estimation of Unigram Language Models
      - Maximum a posteriori estimation: use a Dirichlet prior for the multinomial distribution
      - How should the parameters of the Dirichlet prior be set?

  23. MAP Estimation of Unigram Language Models
      Maximum a posteriori estimation: use a Dirichlet prior for the multinomial distribution. There are $K$ terms in the vocabulary:
      $$\text{Multinomial: } p_i = \{p_i(w_1), \ldots, p_i(w_K)\}, \quad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$
      $$Dir(p \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}, \quad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$
      The $\alpha_1, \ldots, \alpha_K$ are hyper-parameters; the ratio of gamma functions is a normalizing constant for $p$.
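
The section ends here, but the two-word MAP result generalizes in the standard way to $K$ terms: $p_i(w_k)^* = \frac{tf_i(w_k) + \alpha_k - 1}{|d_i| + \sum_j (\alpha_j - 1)}$. The sketch below assumes that generalization (it is not stated on these slides); note that setting every $\alpha_k = 2$ recovers add-one (Laplace) smoothing:

```python
from collections import Counter

def map_unigram_lm(doc, vocab, alphas):
    """MAP unigram model under a Dirichlet(alphas) prior (assumed form)."""
    tf = Counter(doc)
    denom = len(doc) + sum(a - 1 for a in alphas)
    return {w: (tf[w] + a - 1) / denom for w, a in zip(vocab, alphas)}

V = ["sport", "basketball", "ticket", "finance", "stock"]
d1 = ["sport", "basketball", "ticket", "sport"]
print(map_unigram_lm(d1, V, alphas=[2] * 5))
# no term has zero probability: sport 3/9, basketball 2/9, ticket 2/9,
# finance 1/9, stock 1/9
```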
