III.4 Statistical Language Models
1. Basics of Statistical Language Models
2. Query-Likelihood Approaches
3. Smoothing Methods
4. Divergence Approaches
5. Extensions
Based on MRS Chapter 12 and [Zhai 2008]
1. Basics of Statistical Language Models
• Statistical language models (LMs) are generative models of word sequences (or bags of words, sets of words, etc.)
• Example: unigram LM with word-emission probabilities dog: 0.5, cat: 0.4, hog: 0.1, continuation probability 0.9, and stop probability 0.1
  P(⟨hog⟩) = 0.1 × 0.1
  P(⟨cat, dog⟩) = 0.4 × 0.9 × 0.5 × 0.1
  P(⟨dog, dog, hog⟩) = 0.5 × 0.9 × 0.5 × 0.9 × 0.1 × 0.1
• Application examples:
  • Speech recognition, e.g., to select among multiple phonetically similar sentences (“get up at 8 o’clock” vs. “get a potato clock”)
  • Statistical machine translation, e.g., to select among multiple candidate translations (“logical closing” vs. “logical reasoning”)
  • Information retrieval, e.g., to rank documents in response to a query
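The following sketch (not part of the original slides) reproduces the three example probabilities, assuming the diagram encodes a unigram model with continuation probability 0.9 and stop probability 0.1:

```python
# Toy unigram LM over {dog, cat, hog} with a continuation probability of 0.9
# and a stop probability of 0.1 (this reading of the slide's diagram is an
# assumption made for illustration).
P_WORD = {"dog": 0.5, "cat": 0.4, "hog": 0.1}
P_CONTINUE = 0.9
P_STOP = 0.1

def sequence_probability(words):
    """P(<w_1, ..., w_n>) = prod_i P(w_i), with a continuation factor
    between consecutive words and a stop factor at the end."""
    p = P_STOP
    for i, w in enumerate(words):
        p *= P_WORD[w]
        if i < len(words) - 1:
            p *= P_CONTINUE
    return p

print(sequence_probability(["hog"]))                # 0.1 * 0.1 = 0.01
print(sequence_probability(["cat", "dog"]))         # 0.4 * 0.9 * 0.5 * 0.1 = 0.018
print(sequence_probability(["dog", "dog", "hog"]))  # 0.5 * 0.9 * 0.5 * 0.9 * 0.1 * 0.1
```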
Types of Language Models
• Unigram LM: based only on single words (unigrams), considers no context, and assumes independent generation of words
  $P(\langle t_1, \dots, t_m \rangle) = \prod_{i=1}^{m} P(t_i)$
• Bigram LM: conditions on the preceding term
  $P(\langle t_1, \dots, t_m \rangle) = P(t_1) \prod_{i=2}^{m} P(t_i \mid t_{i-1})$
• n-Gram LM: conditions on the preceding (n−1) terms
  $P(\langle t_1, \dots, t_m \rangle) = P(t_1)\, P(t_2 \mid t_1) \cdots \prod_{i=n}^{m} P(t_i \mid t_{i-n+1} \dots t_{i-1})$
Parameter Estimation
• Parameters (e.g., P(t_i), P(t_i | t_{i−1})) of a language model θ are estimated based on a sample of documents, which are assumed to have been generated by θ
• Example: unigram language models θ_Sports and θ_Politics estimated from documents about sports and politics
  θ_Sports generates a sample of sports documents: soccer: 0.20, goal: 0.15, tennis: 0.10, player: 0.05, …
  θ_Politics generates a sample of politics documents: party: 0.20, debate: 0.20, scandal: 0.15, election: 0.05, …
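A small sketch of maximum-likelihood parameter estimation for the unigram and bigram models defined on the previous slide; the toy corpus acting as the sample and the scored sequence are made up for illustration:

```python
from collections import Counter, defaultdict

# Toy "sample" of documents from which the LM parameters are estimated.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

unigram_counts = Counter(t for doc in corpus for t in doc)
total_terms = sum(unigram_counts.values())

bigram_counts = defaultdict(Counter)
for doc in corpus:
    for prev, cur in zip(doc, doc[1:]):
        bigram_counts[prev][cur] += 1

def p_unigram(sequence):
    """P(<t_1,...,t_m>) = prod_i P(t_i) under the unigram LM."""
    p = 1.0
    for t in sequence:
        p *= unigram_counts[t] / total_terms
    return p

def p_bigram(sequence):
    """P(<t_1,...,t_m>) = P(t_1) * prod_{i>=2} P(t_i | t_{i-1});
    unseen bigrams get probability zero (no smoothing yet)."""
    p = unigram_counts[sequence[0]] / total_terms
    for prev, cur in zip(sequence, sequence[1:]):
        p *= bigram_counts[prev][cur] / unigram_counts[prev]
    return p

print(p_unigram(["the", "cat", "sat"]), p_bigram(["the", "cat", "sat"]))
```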
Probabilistic IR vs. Statistical Language Models
“User finds document d relevant to query q”
  $P[R \mid d, q] \;\propto\; \frac{P[R \mid d, q]}{P[\bar{R} \mid d, q]} \;\propto\; \frac{P[q, d \mid R]}{P[q, d \mid \bar{R}]} \;=\; \frac{P[q \mid d, R]\, P[R \mid d]}{P[q \mid d, \bar{R}]\, P[\bar{R} \mid d]} \;\propto\; P[q \mid d, R]$
• Probabilistic IR ranks according to relevance odds
• Statistical LMs rank according to query likelihood
2. Query-Likelihood Approaches
Figure: query q is scored against document language models θ_d1 (apple: 0.20, pie: 0.15, …) and θ_d2 (cake: 0.20, apple: 0.15, …), each estimated from a sample document d_1 or d_2, yielding P(q | d_1) and P(q | d_2)
• P(q | d) is the likelihood that the query was generated by the language model θ_d estimated from document d
• Intuition:
  • The user formulates query q by selecting words from a prototype document
  • Which document is “closest” to that prototype document?
Multi-Bernoulli LM
• Query q is seen as a set of terms and generated from document d by tossing a coin for every word in the vocabulary V
  $P(q \mid d) \;=\; \prod_{t \in q} P(t \mid d) \;\times \prod_{t \in V \setminus q} (1 - P(t \mid d)) \;\approx\; \prod_{t \in q} P(t \mid d)$  (assuming $|q| \ll |V|$)
• [Ponte and Croft ’98] pioneered the use of LMs in IR
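A sketch of the Multi-Bernoulli query likelihood; using P(t | d) = tf(t, d)/|d| as the per-term coin bias and the toy vocabulary are illustrative assumptions (the slide does not fix the estimator):

```python
from collections import Counter

def multi_bernoulli_likelihood(query, doc, vocabulary):
    """Full Multi-Bernoulli P(q|d): one coin toss per vocabulary term."""
    tf = Counter(doc)
    query_terms = set(query)
    p = 1.0
    for t in vocabulary:
        p_t = tf[t] / len(doc)                     # assumed estimator for P(t|d)
        p *= p_t if t in query_terms else (1.0 - p_t)
    return p

doc = ["apple", "pie", "apple", "crumble"]
vocabulary = {"apple", "pie", "crumble", "cake", "muffin"}
print(multi_bernoulli_likelihood(["apple", "pie"], doc, vocabulary))
```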
Multinomial LM
• Query q is seen as a bag of terms and generated from document d by drawing terms from the bag of terms corresponding to d
  $P(q \mid d) \;=\; \binom{|q|}{tf(t_1, q) \dots tf(t_{|q|}, q)} \prod_{t_i \in q} P(t_i \mid d)^{tf(t_i, q)}$
  $\propto\; \prod_{t_i \in q} P(t_i \mid d)^{tf(t_i, q)}$
  $\approx\; \prod_{t_i \in q} P(t_i \mid d)$  (assuming $\forall t_i \in q: tf(t_i, q) = 1$)
• The Multinomial LM is more expressive than the Multi-Bernoulli LM and is therefore usually preferred
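A sketch contrasting the full multinomial likelihood with the rank-equivalent product; the maximum-likelihood estimate P(t | d) = tf(t, d)/|d| (introduced on the next slide) and the example data are assumptions:

```python
from collections import Counter
from math import factorial, prod

def multinomial_likelihood(query, doc):
    """Full multinomial P(q|d), including the multinomial coefficient."""
    tf_d, tf_q = Counter(doc), Counter(query)
    coeff = factorial(len(query))
    for n in tf_q.values():
        coeff //= factorial(n)
    return coeff * prod((tf_d[t] / len(doc)) ** n for t, n in tf_q.items())

def rank_equivalent(query, doc):
    """Drops the query-dependent (document-independent) coefficient."""
    tf_d, tf_q = Counter(doc), Counter(query)
    return prod((tf_d[t] / len(doc)) ** n for t, n in tf_q.items())

doc = ["apple", "pie", "apple", "pie", "recipe"]
query = ["apple", "apple", "pie"]          # tf(apple, q) = 2, tf(pie, q) = 1
print(multinomial_likelihood(query, doc))  # 3 * (2/5)^2 * (2/5)
print(rank_equivalent(query, doc))         # (2/5)^2 * (2/5): same ranking
```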
Multinomial LM (cont’d)
• The maximum-likelihood estimate for the parameters P(t_i | d)
  $P(t_i \mid d) \;=\; \frac{tf(t_i, d)}{|d|}$
  is prone to overfitting and leads to
  • a bias in favor of short documents / against long documents
  • conjunctive query semantics, i.e., the query cannot be generated from the language model of a document that misses one of the query terms
3. Smoothing
• Smoothing methods avoid overfitting to the sample (often: a single document) and are essential for LMs to work in practice
  • Laplace smoothing (cf. Chapter III.3)
  • Absolute discounting
  • Jelinek-Mercer smoothing
  • Dirichlet smoothing
  • Good-Turing smoothing
  • Katz’s back-off model
  • …
• The choice of smoothing method and parameter setting is still mostly a “black art” (or empirical, i.e., based on training data)
Jelinek-Mercer Smoothing
• Uses a linear combination (mixture) of the document language model θ_d and the document-collection language model θ_D
  $P(t \mid d) \;=\; \lambda\, \frac{tf(t, d)}{|d|} \;+\; (1 - \lambda)\, \frac{tf(t, D)}{|D|}$
  with document D as the concatenation of the entire document collection
• Parameter λ can be tuned by cross-validation with held-out data:
  • divide the set of relevant (q, d) pairs into n partitions
  • build the LM on the pairs from n − 1 partitions
  • choose λ to maximize precision (or recall or F1) on the held-out partition
  • iterate with a different choice of held-out partition and average
• Parameter λ can also be made document- or term-dependent
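A sketch of the Jelinek-Mercer-smoothed query likelihood; λ = 0.5 and the toy collection are arbitrary illustrative choices:

```python
from collections import Counter

def p_jm(t, doc, collection, lam=0.5):
    """Jelinek-Mercer-smoothed P(t|d): mixture of document and collection model."""
    tf_d, tf_D = Counter(doc), Counter(collection)
    return lam * tf_d[t] / len(doc) + (1 - lam) * tf_D[t] / len(collection)

def query_likelihood_jm(query, doc, collection, lam=0.5):
    p = 1.0
    for t in query:
        p *= p_jm(t, doc, collection, lam)
    return p

docs = {"d1": ["apple", "pie", "apple"], "d2": ["cake", "apple"]}
collection = [t for d in docs.values() for t in d]   # D: concatenated collection
# d2 no longer gets likelihood zero although it lacks the query term "pie":
print(query_likelihood_jm(["apple", "pie"], docs["d2"], collection))
```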
Jelinek-Mercer Smoothing vs. TF*IDF
  $P(q \mid d) \;=\; \prod_{t \in q} P(t \mid d)$
  $=\; \prod_{t \in q} \left( \lambda\, \frac{tf(t, d)}{|d|} + (1 - \lambda)\, \frac{tf(t, D)}{|D|} \right)$
  $\propto\; \sum_{t \in q} \log \left( \lambda\, \frac{tf(t, d)}{|d|} + (1 - \lambda)\, \frac{tf(t, D)}{|D|} \right)$
  $\propto\; \sum_{t \in q} \log \left( 1 + \frac{\lambda}{1 - \lambda}\, \underbrace{\frac{tf(t, d)}{|d|}}_{\sim\, tf}\, \underbrace{\frac{|D|}{tf(t, D)}}_{\sim\, idf} \right)$
• (Jelinek-Mercer) smoothing has an effect similar to IDF weighting
• Jelinek-Mercer smoothing thus leads to a TF*IDF-style model
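A small numeric check (on made-up data) that the smoothed query log-likelihood and the TF*IDF-style sum from the derivation above rank documents identically:

```python
from collections import Counter
from math import log

def p_jm(t, doc, collection, lam=0.5):
    return lam * Counter(doc)[t] / len(doc) + (1 - lam) * Counter(collection)[t] / len(collection)

def log_likelihood(query, doc, collection, lam=0.5):
    return sum(log(p_jm(t, doc, collection, lam)) for t in query)

def tfidf_style(query, doc, collection, lam=0.5):
    tf_d, tf_D = Counter(doc), Counter(collection)
    return sum(log(1 + (lam / (1 - lam)) * (tf_d[t] / len(doc)) * (len(collection) / tf_D[t]))
               for t in query)

docs = {"d1": ["apple", "pie", "apple"], "d2": ["cake", "apple"]}
collection = [t for d in docs.values() for t in d]
q = ["apple", "pie"]
for score in (log_likelihood, tfidf_style):
    print(sorted(docs, key=lambda d: score(q, docs[d], collection), reverse=True))
# both print ['d1', 'd2']
```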
Dirichlet-Prior Smoothing
• Uses Bayesian estimation with a conjugate Dirichlet prior instead of maximum-likelihood estimation
  $P(t \mid d) \;=\; \frac{tf(t, d) + \alpha\, \frac{tf(t, D)}{|D|}}{|d| + \alpha}$
• Intuition: document d is extended by α terms generated by the document-collection language model
• Parameter α is usually set as a multiple of the average document length
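A sketch of Dirichlet-prior smoothing; α = 2000 is merely a commonly used order of magnitude (a multiple of typical average document lengths), not a value prescribed by the slide, and the data is illustrative:

```python
from collections import Counter

def p_dirichlet(t, doc, collection, alpha=2000):
    """Dirichlet-prior-smoothed P(t|d)."""
    tf_d = Counter(doc)[t]
    p_coll = Counter(collection)[t] / len(collection)
    return (tf_d + alpha * p_coll) / (len(doc) + alpha)

doc = ["apple", "pie", "apple"]
collection = doc + ["cake", "apple", "muffin", "pie"]
print(p_dirichlet("pie", doc, collection))    # term present in d
print(p_dirichlet("cake", doc, collection))   # unseen in d, but nonzero
```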
Dirichlet Smoothing vs. Jelinek-Mercer Smoothing
  $P(t \mid d) \;=\; \lambda\, \frac{tf(t, d)}{|d|} + (1 - \lambda)\, \frac{tf(t, D)}{|D|}$
  $=\; \frac{|d|}{|d| + \alpha}\, \frac{tf(t, d)}{|d|} + \frac{\alpha}{|d| + \alpha}\, \frac{tf(t, D)}{|D|}$  (setting $\lambda = \frac{|d|}{|d| + \alpha}$)
  $=\; \frac{tf(t, d) + \alpha\, \frac{tf(t, D)}{|D|}}{|d| + \alpha}$
• Dirichlet smoothing thus coincides with Jelinek-Mercer smoothing for this particular document-dependent choice of λ
4. Divergence Approaches
Figure: a query language model θ_q (apple: 0.20, muffin: 0.15, …) estimated from q is compared against document language models θ_d1 (apple: 0.20, pie: 0.15, …) and θ_d2 (cake: 0.20, apple: 0.15, …) via the divergences D(θ_q || θ_d1) and D(θ_q || θ_d2)
• Query-likelihood approaches see the query as a sample from a LM
• Query expansion, relevance feedback, etc. are difficult to express as query-likelihood approaches, since they would require tinkering with the sample (i.e., the query) and more fine-grained control than adding/removing terms
• Divergence approaches instead estimate a query language model θ_q and rank documents by increasing divergence D(θ_q || θ_d)
Kullback-Leibler Divergence
• Kullback-Leibler divergence (aka. information gain or relative entropy) is an information-theoretic, non-symmetric measure of distance between probability distributions
  $D(\theta_q \,\|\, \theta_d) \;=\; \sum_{t \in V} P(t \mid \theta_q)\, \log \frac{P(t \mid \theta_q)}{P(t \mid \theta_d)}$
• Example: θ_q = { apple: 0.50, muffin: 0.50 }, θ_d = { apple: 0.25, muffin: 0.25, recipe: 0.10, water: 0.10, sugar: 0.30 }
  $D(\theta_q \,\|\, \theta_d) \;=\; P(\text{apple} \mid \theta_q) \log \frac{P(\text{apple} \mid \theta_q)}{P(\text{apple} \mid \theta_d)} + P(\text{muffin} \mid \theta_q) \log \frac{P(\text{muffin} \mid \theta_q)}{P(\text{muffin} \mid \theta_d)}$
  $=\; 0.50\, \log_2 \frac{0.50}{0.25} + 0.50\, \log_2 \frac{0.50}{0.25} \;=\; 1.00$
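A sketch that reproduces the example above; base-2 logarithms are assumed, consistent with the stated result of 1.00:

```python
from math import log2

def kl_divergence(theta_q, theta_d):
    """D(theta_q || theta_d), summed over terms with nonzero query-model probability."""
    return sum(p * log2(p / theta_d[t]) for t, p in theta_q.items() if p > 0)

theta_q = {"apple": 0.50, "muffin": 0.50}
theta_d = {"apple": 0.25, "muffin": 0.25, "recipe": 0.10, "water": 0.10, "sugar": 0.30}
print(kl_divergence(theta_q, theta_d))  # 0.5*log2(2) + 0.5*log2(2) = 1.0
```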
Relevance Feedback LM
• [Zhai and Lafferty ’01] re-estimate the query language model as
  $P(t \mid \theta_q') \;=\; (1 - \alpha)\, P(t \mid \theta_q) + \alpha\, P(t \mid \theta_F)$
  with F as the set of documents with positive feedback from the user
• The MLE of θ_F is obtained by maximizing the log-likelihood function
  $\log P(F \mid \theta_F) \;=\; \sum_{t \in V} tf(t, F)\, \log \left( (1 - \lambda)\, P(t \mid \theta_F) + \lambda\, P(t \mid \theta_D) \right)$
  with tf(t, F) as the total term frequency of t in the documents from F and θ_D as the document-collection language model
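The slide gives only the log-likelihood; a common way to maximize it is the EM algorithm for this two-component mixture. The following sketch is an assumed implementation along those lines (with made-up feedback documents and collection model), not the authors' exact procedure:

```python
from collections import Counter

def estimate_feedback_model(F, theta_D, lam=0.5, iterations=20):
    """Estimate theta_F for the mixture (1-lam)*theta_F + lam*theta_D by EM.
    F is a list of tokenized feedback documents, theta_D a dict of
    collection-model probabilities."""
    tf_F = Counter(t for doc in F for t in doc)
    vocab = list(tf_F)
    theta_F = {t: 1.0 / len(vocab) for t in vocab}            # uniform start
    for _ in range(iterations):
        # E-step: probability that an occurrence of t was generated by theta_F
        z = {t: (1 - lam) * theta_F[t] /
                ((1 - lam) * theta_F[t] + lam * theta_D.get(t, 1e-9))
             for t in vocab}
        # M-step: re-estimate theta_F from the expected counts
        norm = sum(tf_F[t] * z[t] for t in vocab)
        theta_F = {t: tf_F[t] * z[t] / norm for t in vocab}
    return theta_F

feedback_docs = [["apple", "pie", "recipe"], ["apple", "crumble", "recipe"]]
theta_D = {"apple": 0.05, "pie": 0.01, "recipe": 0.02, "crumble": 0.01, "the": 0.2}
print(estimate_feedback_model(feedback_docs, theta_D))
```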
5. Extensions
• Statistical language models have been one of the most active areas of IR research during the past decade and continue to be so
• Extensions:
  • Term-specific and document-specific smoothing (JM-style smoothing with a term-specific λ_t or a document-specific λ_d)
  • (Semantic) translation LMs (e.g., to consider synonyms or support cross-lingual IR)
  • Time-based LMs (e.g., with a time-dependent document prior to favor recent documents)
  • LMs for (semi-)structured XML and RDF data (e.g., for entity search or question answering)
  • …