
NPFL103: Information Retrieval (8)
Language Models for Information Retrieval, Text Classification
Pavel Pecina, Institute of Formal and Applied Linguistics, Charles University


  1. NPFL103: Information Retrieval (8)
     Language Models for Information Retrieval, Text Classification
     Pavel Pecina (pecina@ufal.mff.cuni.cz)
     Institute of Formal and Applied Linguistics
     Faculty of Mathematics and Physics, Charles University
     Original slides are courtesy of Hinrich Schütze, University of Stuttgart.

  2. Contents
     ▶ Language models
     ▶ Text classification
     ▶ Naive Bayes
     ▶ Evaluation of text classification

  3. Language models

  4. Using language models for Information Retrieval
     View the document d as a generative model that generates the query q. What we need to do:
     1. Define the precise generative model we want to use
     2. Estimate parameters (different for each document's model)
     3. Smooth to avoid zeros
     4. Apply to the query and find the document most likely to generate the query
     5. Present the most likely document(s) to the user

  5. What is a language model?
     ▶ We can view a finite state automaton as a deterministic language model.
     ▶ Example output: "I wish I wish I wish I wish …"
     ▶ Cannot generate: "wish I wish" or "I wish I"
     ▶ Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

  6. A probabilistic language model
     ▶ This is a one-state probabilistic finite-state automaton – a unigram language model – with its one state q1 and the state emission distribution P(w | q1).
     ▶ STOP is a special symbol indicating that the automaton stops.

       w      P(w | q1)
       STOP   0.2
       the    0.2
       a      0.1
       frog   0.01
       toad   0.01
       said   0.03
       likes  0.02
       that   0.04
       …      …

     ▶ Example: frog said that toad likes frog STOP
       P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
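To make the arithmetic concrete, here is a minimal Python sketch (illustrative, not from the slides) that scores the example string under this unigram model; the emission table is the one above, and words outside the table are simply given probability zero:

```python
# Minimal sketch of the one-state unigram model above (illustrative only).
emissions = {
    "STOP": 0.2, "the": 0.2, "a": 0.1, "frog": 0.01,
    "toad": 0.01, "said": 0.03, "likes": 0.02, "that": 0.04,
}

def string_probability(tokens, model):
    """Multiply the emission probabilities of all tokens (unigram assumption)."""
    p = 1.0
    for t in tokens:
        p *= model.get(t, 0.0)  # words outside the table get probability 0
    return p

print(string_probability("frog said that toad likes frog STOP".split(),
                         emissions))  # ~4.8e-12, as on the slide
```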

  7. A different language model for each document

     language model of d1:
       w      P(w | .)
       STOP   .20
       the    .20
       a      .10
       frog   .01
       toad   .01
       said   .03
       likes  .02
       that   .04
       …      …

     language model of d2:
       w      P(w | .)
       STOP   .20
       the    .15
       a      .08
       frog   .01
       toad   .02
       said   .03
       likes  .02
       that   .05
       …      …

     query: frog said that toad likes frog STOP
     P(query | Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
     P(query | Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 12 · 10^-12
     P(query | Md1) < P(query | Md2): d2 is more relevant to the query than d1
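A short sketch of the same ranking computation: score the query under each document's model and prefer the higher likelihood (the tables and query are the ones above):

```python
# Sketch: rank two documents by query likelihood under their unigram models.
model_d1 = {"STOP": 0.20, "the": 0.20, "a": 0.10, "frog": 0.01,
            "toad": 0.01, "said": 0.03, "likes": 0.02, "that": 0.04}
model_d2 = {"STOP": 0.20, "the": 0.15, "a": 0.08, "frog": 0.01,
            "toad": 0.02, "said": 0.03, "likes": 0.02, "that": 0.05}

def likelihood(tokens, model):
    p = 1.0
    for t in tokens:
        p *= model.get(t, 0.0)
    return p

query = "frog said that toad likes frog STOP".split()
print(likelihood(query, model_d1))  # ~4.8e-12
print(likelihood(query, model_d2))  # ~1.2e-11  ->  d2 ranks higher
```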

  8. Using language models in IR
     ▶ Each document is treated as (the basis for) a language model.
     ▶ Given a query q, rank documents based on P(d | q):

       P(d | q) = P(q | d) P(d) / P(q)

     ▶ P(q) is the same for all documents, so we can ignore it.
     ▶ P(d) is the prior – often treated as the same for all d, but we can give a higher prior to "high-quality" documents (e.g., by PageRank).
     ▶ P(q | d) is the probability of q given d.
     ▶ Under the assumptions we made, ranking documents according to P(q | d) and P(d | q) is equivalent.

  9. Where we are
     ▶ In the LM approach to IR, we model the query generation process.
     ▶ Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
     ▶ That is, we rank according to P(q | d).
     ▶ Next: how do we compute P(q | d)?

  10. How to compute P(q | d)
     ▶ The conditional independence assumption:

       P(q | M_d) = P(⟨t_1, …, t_|q|⟩ | M_d) = ∏_{1 ≤ k ≤ |q|} P(t_k | M_d)

       ▶ |q|: length of q
       ▶ t_k: the token occurring at position k in q
     ▶ This is equivalent to:

       P(q | M_d) = ∏_{distinct term t in q} P(t | M_d)^{tf_{t,q}}

       ▶ tf_{t,q}: term frequency (# occurrences) of t in q
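A small sketch (illustrative, not from the slides) confirming that the position-wise product and the distinct-term product with tf exponents give the same value:

```python
from collections import Counter
import math

def p_tokenwise(tokens, model):
    p = 1.0
    for t in tokens:            # product over token positions 1..|q|
        p *= model[t]
    return p

def p_termwise(tokens, model):
    p = 1.0
    for t, tf in Counter(tokens).items():  # product over distinct terms
        p *= model[t] ** tf                # raised to tf_{t,q}
    return p

model = {"STOP": 0.2, "frog": 0.01, "toad": 0.01, "said": 0.03,
         "likes": 0.02, "that": 0.04}
q = "frog said that toad likes frog STOP".split()
assert math.isclose(p_tokenwise(q, model), p_termwise(q, model))
```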

  11. Parameter estimation
     ▶ Missing piece: Where do the parameters P(t | M_d) come from?
     ▶ Start with maximum likelihood estimates:

       P̂(t | M_d) = tf_{t,d} / |d|

       ▶ |d|: length of d
       ▶ tf_{t,d}: # occurrences of t in d
     ▶ The zero problem (in numerator and denominator): a single t with P(t | M_d) = 0 will make P(q | M_d) = ∏ P(t | M_d) zero.
     ▶ Example: for the query [Michael Jackson top hits], a document about "top songs" (but without the word "hits") would have P(q | M_d) = 0.
     ▶ We need to smooth the estimates to avoid zeros.
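A quick sketch of the zero problem; the document and query strings here are made up for illustration:

```python
from collections import Counter

# Maximum likelihood estimates P(t | M_d) = tf_{t,d} / |d| for one document.
doc = "the top songs by michael jackson".split()   # hypothetical document
counts, length = Counter(doc), len(doc)

p = 1.0
for t in "michael jackson top hits".split():
    p *= counts[t] / length     # counts["hits"] == 0
print(p)  # 0.0 -- one unseen query term zeroes out the whole product
```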

  12. Smoothing
     ▶ Idea: A nonoccurring term is possible (even though it didn't occur), but no more likely than expected by chance in the collection.
     ▶ We will use P̂(t | M_c) to "smooth" P(t | d) away from zero:

       P̂(t | M_c) = cf_t / T

       ▶ M_c: the collection model
       ▶ cf_t: the number of occurrences of t in the collection
       ▶ T = ∑_t cf_t: the total number of tokens in the collection
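A sketch of the collection estimate, reusing the two-document collection from the worked example a few slides below:

```python
from collections import Counter

# Collection model: P(t | M_c) = cf_t / T over all documents.
docs = [
    "jackson was one of the most talented entertainers of all time".split(),
    "michael jackson anointed himself king of pop".split(),
]
cf = Counter(t for d in docs for t in d)   # collection frequencies cf_t
T = sum(cf.values())                       # total token count T

print(T)                   # 18
print(cf["jackson"] / T)   # 2/18 ~ 0.111
```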

  13. Jelinek-Mercer smoothing
     ▶ Intuition: Mix the probability from the document with the general collection frequency of the word:

       P(t | d) = λ P(t | M_d) + (1 − λ) P(t | M_c)

     ▶ High value of λ: "conjunctive-like" search – tends to retrieve documents containing all query words.
     ▶ Low value of λ: more disjunctive, suitable for long queries.
     ▶ Correctly setting λ is very important for good performance.
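A minimal sketch of the mixture on the same toy collection; note how a term absent from the document still gets nonzero probability:

```python
from collections import Counter

d1 = "jackson was one of the most talented entertainers of all time".split()
d2 = "michael jackson anointed himself king of pop".split()
cf = Counter(d1 + d2)       # collection frequencies
T = len(d1) + len(d2)       # total token count

def p_jm(t, doc, lam=0.5):
    counts = Counter(doc)
    # lambda * P(t | M_d)  +  (1 - lambda) * P(t | M_c)
    return lam * counts[t] / len(doc) + (1 - lam) * cf[t] / T

print(p_jm("talented", d2))  # ~0.028: nonzero although "talented" is not in d2
```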

  14. Jelinek-Mercer smoothing: Summary

       P(q | d) ∝ ∏_{1 ≤ k ≤ |q|} (λ P(t_k | M_d) + (1 − λ) P(t_k | M_c))

     ▶ What we model: The user has a document in mind and generates the query from this document.
     ▶ The equation represents the probability that the document that the user had in mind was in fact this one.

  15. Example
     ▶ Collection: d1 and d2
       ▶ d1: Jackson was one of the most talented entertainers of all time.
       ▶ d2: Michael Jackson anointed himself King of Pop.
     ▶ Query q: Michael Jackson
     ▶ Use the mixture model with λ = 1/2:
       ▶ P(q | d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
       ▶ P(q | d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
     ▶ Ranking: d2 > d1
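The numbers above can be reproduced with a short script (a sketch; tokenization here is naive lowercase splitting, so "Jackson" and "jackson" are conflated):

```python
from collections import Counter

d1 = "jackson was one of the most talented entertainers of all time".split()
d2 = "michael jackson anointed himself king of pop".split()
cf = Counter(d1 + d2)                 # collection frequencies
T = len(d1) + len(d2)                 # 18 tokens in the collection

def score(query, doc, lam=0.5):
    counts, n = Counter(doc), len(doc)
    p = 1.0
    for t in query.split():
        p *= lam * counts[t] / n + (1 - lam) * cf[t] / T
    return p

print(score("michael jackson", d1))   # ~0.0028 -> 0.003
print(score("michael jackson", d2))   # ~0.0126 -> 0.013
```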

  16. Dirichlet smoothing
     ▶ Intuition: Before having seen any part of the document, we start with the background distribution as our estimate:

       P̂(t | d) = (tf_{t,d} + μ P̂(t | M_c)) / (L_d + μ)

     ▶ The background distribution P̂(t | M_c) is the prior for P̂(t | d).
     ▶ As we read the document and count terms, we update the background distribution.
     ▶ The weighting factor μ determines how strong an effect the prior has.
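A sketch of the Dirichlet estimate on the same toy collection; the value μ = 2000 is only an illustrative assumption, not a recommendation (parameter tuning is discussed on the next slide):

```python
from collections import Counter

d1 = "jackson was one of the most talented entertainers of all time".split()
d2 = "michael jackson anointed himself king of pop".split()
cf = Counter(d1 + d2)
T = len(d1) + len(d2)

def p_dirichlet(t, doc, mu=2000.0):
    counts, L = Counter(doc), len(doc)
    p_bg = cf[t] / T                        # background estimate P(t | M_c)
    return (counts[t] + mu * p_bg) / (L + mu)

# With a large mu the estimate stays close to the background prior:
print(p_dirichlet("michael", d1))  # ~0.0553, near the background 1/18 ~ 0.0556
```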

  17. Jelinek-Mercer or Dirichlet?
     ▶ Dirichlet performs better for keyword queries, Jelinek-Mercer performs better for verbose queries.
     ▶ Both models are sensitive to the smoothing parameters – you shouldn't use these models without parameter tuning.

  18. Sensitivity of Dirichlet to smoothing parameter
     [figure not preserved in this transcript]

  19. Language model vs. Vector space model: Example
     Precision at different recall levels (* marks statistically significant differences):

       Recall   TF-IDF   LM        %Δ
       0.0      0.7439   0.7590     +2.0
       0.1      0.4521   0.4910     +8.6
       0.2      0.3514   0.4045    +15.1 *
       0.4      0.2093   0.2572    +22.9 *
       0.6      0.1024   0.1405    +37.1 *
       0.8      0.0160   0.0432   +169.6 *
       1.0      0.0028   0.0050    +76.9
       average  0.1868   0.2233    +19.6 *

     The language modeling approach always does better in these experiments, but significant gains are shown at higher levels of recall.

  20. Language model vs. Vector space model: Things in common
     1. Term frequency is directly in the model.
        ▶ But it is not scaled in LMs.
        ▶ Cosine normalization does something similar for vector space.
     2. Probabilities are inherently "length-normalized".
     3. Mixing document/collection frequencies has an effect similar to idf.
        ▶ Terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.
