

  1. Model Divergence Retrieval (LM, session 10), CS6200: Information Retrieval. Slides by: Jesse Anderton

  2. Retrieval With Language Models
  There are three obvious ways to perform retrieval using language models:
  1. Query Likelihood Retrieval trains a model on the document and estimates the query's likelihood. We've focused on these so far.
  2. Document Likelihood Retrieval trains a model on the query and estimates the document's likelihood. Queries are very short, so these seem less promising.
  3. Model Divergence Retrieval trains models on both the document and the query, and compares them.
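As a rough sketch of the three directions (not from the slides), the toy functions below score a query/document pair each way, using unsmoothed maximum-likelihood models; the small probability floor is an illustrative stand-in for the smoothing discussed in later slides, and all names are made up for this example.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, doc):
    p_d = ml_model(doc)                                      # model the document ...
    return sum(math.log2(p_d.get(w, 1e-9)) for w in query)   # ... score the query

def document_likelihood(query, doc):
    p_q = ml_model(query)                                     # model the (short) query ...
    return sum(math.log2(p_q.get(w, 1e-9)) for w in doc)     # ... score the document

def model_divergence_score(query, doc):
    p_q, p_d = ml_model(query), ml_model(doc)                 # model both, then compare
    return -sum(pw * math.log2(pw / p_d.get(w, 1e-9)) for w, pw in p_q.items())

doc = "world war one was a global war".split()
query = "world war one".split()
print(query_likelihood(query, doc))
print(document_likelihood(query, doc))
print(model_divergence_score(query, doc))
```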

  3. Comparing Distributions
  The most common way to compare probability distributions is with Kullback-Leibler ("KL") divergence:

  $D_{KL}(p \| q) = \sum_e p(e) \log \frac{p(e)}{q(e)}$

  This is a measure from information theory which can be interpreted as the expected number of bits you would waste if you compressed data distributed along p as if it was distributed along q. If p = q, then $D_{KL}(p \| q) = 0$.
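A minimal sketch of this definition, using base-2 logarithms to match the "wasted bits" interpretation; the two toy distributions below are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_e p(e) * log2(p(e) / q(e)).
    Assumes q(e) > 0 wherever p(e) > 0."""
    return sum(p_e * math.log2(p_e / q[e]) for e, p_e in p.items() if p_e > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(kl_divergence(p, p))  # 0.0 -- identical distributions do not diverge
print(kl_divergence(p, q))  # > 0 -- expected bits wasted encoding p with q's code
```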

  4. Divergence-based Retrieval
  Model Divergence Retrieval works as follows:
  1. Choose a language model for the query, $p(w|q)$.
  2. Choose a language model for the document, $p(w|d)$.
  3. Rank by $-D_{KL}(p(w|q) \| p(w|d))$: more divergence means a worse match.

  This can be simplified to a cross-entropy calculation:

  $D_{KL}(p(w|q) \| p(w|d)) = \sum_w p(w|q) \log \frac{p(w|q)}{p(w|d)}$
  $= \sum_w p(w|q) \log p(w|q) - \sum_w p(w|q) \log p(w|d)$
  $\stackrel{rank}{=} -\sum_w p(w|q) \log p(w|d)$
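The sketch below (toy word distributions, not from the slides) checks that ranking by $-D_{KL}$ and ranking by the simplified cross-entropy term give the same ordering, since the dropped $\sum_w p(w|q) \log p(w|q)$ term is constant for a fixed query.

```python
import math

def neg_kl(p_q, p_d):
    """-D_KL(p(w|q) || p(w|d)) over the query model's vocabulary."""
    return -sum(pw * math.log2(pw / p_d[w]) for w, pw in p_q.items() if pw > 0)

def neg_cross_entropy(p_q, p_d):
    """The rank-equivalent simplification: sum_w p(w|q) log p(w|d)."""
    return sum(pw * math.log2(p_d[w]) for w, pw in p_q.items() if pw > 0)

p_q = {"world": 0.4, "war": 0.4, "one": 0.2}
docs = {
    "d1": {"world": 0.02, "war": 0.03, "one": 0.05},
    "d2": {"world": 0.001, "war": 0.002, "one": 0.04},
}

rank_kl = sorted(docs, key=lambda d: neg_kl(p_q, docs[d]), reverse=True)
rank_ce = sorted(docs, key=lambda d: neg_cross_entropy(p_q, docs[d]), reverse=True)
print(rank_kl == rank_ce)  # True -- same ranking either way
```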

  5. Retrieval Flexibility
  Model Divergence Retrieval generalizes the Query and Document Likelihood models, and is the most flexible of the three. Any language model can be used for the query or the document; they don't have to be the same. It can help to smooth or normalize them differently.

  Equivalence to the Query Likelihood Model: if you pick the maximum likelihood model for the query,

  $p(w|q) := \frac{tf_{w,q}}{|q|} = \frac{1}{|q|}$ (when each query term occurs once),

  then

  $D_{KL}(p(w|q) \| p(w|d)) \stackrel{rank}{=} -\sum_w p(w|q) \log p(w|d) = -\sum_w \frac{1}{|q|} \log p(w|d)$

  and this is equivalent to the query likelihood model.
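A small sketch of that equivalence, with hypothetical document models: under the maximum-likelihood query model, the divergence-based score is $1/|q|$ times the log query likelihood, so the two produce identical document rankings.

```python
import math

query = ["world", "war", "one"]

def divergence_score(query, p_d):
    """Rank-equivalent form of -D_KL: sum_w p(w|q) log p(w|d) with p(w|q) = tf/|q|."""
    n = len(query)
    return sum((query.count(w) / n) * math.log2(p_d[w]) for w in set(query))

def query_likelihood_score(query, p_d):
    """log P(q|d): sum over query tokens of log p(w|d)."""
    return sum(math.log2(p_d[w]) for w in query)

docs = {
    "d1": {"world": 0.02, "war": 0.03, "one": 0.05},
    "d2": {"world": 0.001, "war": 0.002, "one": 0.04},
}
by_div = sorted(docs, key=lambda d: divergence_score(query, docs[d]), reverse=True)
by_ql = sorted(docs, key=lambda d: query_likelihood_score(query, docs[d]), reverse=True)
print(by_div == by_ql)  # True -- the scores differ only by the constant factor 1/|q|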

  6. Example: Model Divergence Retrieval
  We make the following model choices:

  • $p(w|q)$ is Dirichlet-smoothed with a background of words used in historical queries. Let $qf_w$ := count(word w in query log):

  $p(w|q, \mu=2) = \frac{tf_{w,q} + 2 \cdot qf_w / \sum_{w'} qf_{w'}}{|q| + 2}$

  • $p(w|d)$ is Dirichlet-smoothed with a background of words used in documents from the corpus:

  $p(w|d, \mu=2000) = \frac{tf_{w,d} + 2{,}000 \cdot cf_w / \sum_{w'} cf_{w'}}{|d| + 2{,}000}$

  • $\sum_w qf_w$ = 500,000
  • $\sum_w cf_w$ = 1,000,000,000

  We rank by the negative divergence:

  $D_{KL}(p(w|q) \| p(w|d)) \stackrel{rank}{=} -\sum_w p(w|q) \log p(w|d) = -\sum_w \frac{tf_{w,q} + 2 \cdot qf_w / \sum_{w'} qf_{w'}}{|q| + 2} \log \frac{tf_{w,d} + 2{,}000 \cdot cf_w / \sum_{w'} cf_{w'}}{|d| + 2{,}000}$
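One way these two smoothed estimators could be coded; the function and parameter names are illustrative rather than from the lecture, and only the background totals come from the slide.

```python
def dirichlet(tf, length, bg_count, bg_total, mu):
    """p(w | text, mu) = (tf_w + mu * bg_count_w / bg_total) / (length + mu)."""
    return (tf + mu * bg_count / bg_total) / (length + mu)

def p_w_given_q(w, query_tf, query_len, qf, total_qf, mu=2):
    """Query model, smoothed against a query-log background with mu = 2."""
    return dirichlet(query_tf.get(w, 0), query_len, qf.get(w, 0), total_qf, mu)

def p_w_given_d(w, doc_tf, doc_len, cf, total_cf, mu=2000):
    """Document model, smoothed against the corpus background with mu = 2000."""
    return dirichlet(doc_tf.get(w, 0), doc_len, cf.get(w, 0), total_cf, mu)

# Background totals from the slide:
TOTAL_QF = 500_000          # total word occurrences in the historical query log
TOTAL_CF = 1_000_000_000    # total word occurrences in the corpus
```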

  7. Example: Model Divergence Retrieval
  Query: "world war one"
  Document (Wikipedia: WWI): "World War I ( WWI or WW1 or World War One ), also known as the First World War or the Great War , was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918. More than 9 million combatants and 7 million civilians died as a result of the war, a casualty rate exacerbated by the belligerents' technological and industrial sophistication, and tactical stalemate. It was one of the ..."

  Per-term contributions to $\sum_w p(w|q) \log p(w|d)$, with $p(w|q)$ and $p(w|d)$ computed as on the previous slide:

  word    qf_w     cf_w       p(w|q)   p(w|d)    Score
  world   2,500    90,000     0.202    0.002     -1.891
  war     2,000    35,000     0.202    0.003     -1.700
  one     6,000    5E+07      0.205    0.049     -0.893
                                       Total:    -4.484
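As a check on the p(w|q) column, the snippet below recomputes it from slide 6's query model (mu = 2, total query-log count 500,000, each query term occurring once, so |q| = 3 and tf_{w,q} = 1). The per-term scores appear consistent with base-2 logarithms, matching the bits interpretation of KL divergence, though the slides don't state the base; the displayed p(w|d) values are rounded, so recomputing scores from them would differ slightly from the slide's totals.

```python
qf = {"world": 2_500, "war": 2_000, "one": 6_000}
total_qf = 500_000
q_len, mu_q = 3, 2   # query "world war one", Dirichlet mu = 2

for w in ["world", "war", "one"]:
    p_wq = (1 + mu_q * qf[w] / total_qf) / (q_len + mu_q)
    print(w, round(p_wq, 3))
# world 0.202
# war   0.202
# one   0.205   -- matches the slide's p(w|q) column
```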

  8. Example: Model Divergence Retrieval
  Query: "world war one"
  Document (Wikipedia: Taiping Rebellion): "The Taiping Rebellion was a massive civil war in southern China from 1850 to 1864, against the ruling Manchu Qing dynasty. It was a millenarian movement led by Hong Xiuquan, who announced that he had received visions, in which he learned that he was the younger brother of Jesus. At least 20 million people died, mainly civilians, in one of the deadliest military conflicts in history."

  Per-term contributions to $\sum_w p(w|q) \log p(w|d)$:

  word    qf_w     cf_w       p(w|q)   p(w|d)     Score
  world   2,500    90,000     0.202    8.75E-05   -2.723
  war     2,000    35,000     0.202    0.001      -2.199
  one     6,000    5E+07      0.205    0.049      -0.890
                                       Total:     -5.812
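Comparing the two worked examples: a higher (less negative) total means the query and document models diverge less, so that document ranks higher. The totals below are taken directly from the two slides above.

```python
scores = {"World War I": -4.484, "Taiping Rebellion": -5.812}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['World War I', 'Taiping Rebellion'] -- the WWI article is the better match
```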

  9. Wrapping Up
  Ranking by (negative) KL divergence provides a very flexible and theoretically sound retrieval system. You are free to model queries and documents any way you like, so you don't have to assume people use the same linguistic behaviors to write each. Next, we'll see how to use a divergence retrieval model to build a pseudo-relevance feedback method that outperforms the Rocchio algorithm.
