Models for Retrieval


  1. Models for Retrieval
     1. HMM/N-gram-based
     2. Latent Semantic Indexing (LSI)
     3. Probabilistic Latent Semantic Analysis (PLSA)
     Berlin Chen, 2003
     References:
     1. Berlin Chen et al., “An HMM/N-gram-based Linguistic Processing Approach for Mandarin Spoken Document Retrieval,” EUROSPEECH 2001
     2. M. W. Berry et al., “Using Linear Algebra for Intelligent Information Retrieval,” technical report, 1994
     3. Thomas Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, 2001

  2. HMM/N-gram-based Model
     • Model the query as a sequence of input observations (index terms): Q = q_1 q_2 ... q_n ... q_N
     • Model the doc D as a discrete HMM composed of distributions of N-gram parameters
     • The relevance measure P(Q | D is R) can be estimated by the N-gram probabilities of the index term sequence for the query, Q = q_1 q_2 ... q_n ... q_N, predicted by the doc D
       – A generative model for IR:
         D* = arg max_D P(D is R | Q)
            ≈ arg max_D P(Q | D is R) P(D is R)
            ≈ arg max_D P(Q | D is R),
         with the assumption that the prior P(D is R) is uniform across docs
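A minimal sketch of the generative ranking rule above, assuming a scoring function query_likelihood(query_terms, doc) that returns P(Q | D is R) (the interpolated N-gram scorers defined on the later slides are natural candidates); rank_documents and its arguments are illustrative names, not from the paper.

def rank_documents(query_terms, docs, query_likelihood):
    # docs: mapping doc_id -> document representation
    # Score every doc by P(Q | D is R) and rank, i.e., D* = arg max_D P(Q | D is R)
    scores = {doc_id: query_likelihood(query_terms, doc) for doc_id, doc in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)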

  3. HMM/N-gram-based Model
     • Decomposing P(W) for a term sequence W = w_1 w_2 ... w_n ... w_N (chain rule):
       P(W) = P(w_1 w_2 ... w_n ... w_N)
            = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ..... P(w_N | w_1 w_2 .... w_{N-1})
     • N-gram approximation (Language Model)
       – Unigram:
         P(W) = P(w_1) P(w_2) P(w_3) ..... P(w_N)
       – Bigram:
         P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ..... P(w_N | w_{N-1})
       – Trigram:
         P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ..... P(w_N | w_{N-2} w_{N-1})
       – ......
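A small illustrative sketch (not from the slides) of the unigram and bigram approximations above, computed in log space; the probability tables p_uni and p_bi are assumed given (their MLE estimation is covered on slide 7).

import math

def unigram_logprob(words, p_uni):
    # log P(W) = sum_n log P(w_n)
    return sum(math.log(p_uni[w]) for w in words)

def bigram_logprob(words, p_uni, p_bi):
    # log P(W) = log P(w_1) + sum_{n>=2} log P(w_n | w_{n-1})
    logp = math.log(p_uni[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_bi[(prev, cur)])
    return logp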

  4. HMM/N-gram-based Model
     • A discrete HMM composed of distributions of N-gram parameters, mixed with weights m_1 + m_2 + m_3 + m_4 = 1:
       m_1 : P(q_n | D)
       m_2 : P(q_n | Corpus)
       m_3 : P(q_n | q_{n-1}, D)
       m_4 : P(q_n | q_{n-1}, Corpus)
     • For a query Q = q_1 q_2 ... q_n ... q_N:
       P(Q | D is R) = [m_1 P(q_1 | D) + m_2 P(q_1 | Corpus)]
         · ∏_{n=2}^{N} [m_1 P(q_n | D) + m_2 P(q_n | Corpus) + m_3 P(q_n | q_{n-1}, D) + m_4 P(q_n | q_{n-1}, Corpus)]

  5. HMM/N-gram-based Model
     • Three Types of HMM Structures
       – Type I: Unigram-Based (Uni)
         P(Q | D is R) = ∏_{n=1}^{N} [m_1 P(q_n | D) + m_2 P(q_n | Corpus)]
       – Type II: Unigram/Bigram-Based (Uni+Bi)
         P(Q | D is R) = [m_1 P(q_1 | D) + m_2 P(q_1 | Corpus)]
           · ∏_{n=2}^{N} [m_1 P(q_n | D) + m_2 P(q_n | Corpus) + m_3 P(q_n | q_{n-1}, D)]
       – Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)
         P(Q | D is R) = [m_1 P(q_1 | D) + m_2 P(q_1 | Corpus)]
           · ∏_{n=2}^{N} [m_1 P(q_n | D) + m_2 P(q_n | Corpus) + m_3 P(q_n | q_{n-1}, D) + m_4 P(q_n | q_{n-1}, Corpus)]
     • Example (Type III):
       P(陳水扁 總統 視察 阿里山 小火車 | D)
         = [m_1 P(陳水扁 | D) + m_2 P(陳水扁 | C)]
         × [m_1 P(總統 | D) + m_2 P(總統 | C) + m_3 P(總統 | 陳水扁, D) + m_4 P(總統 | 陳水扁, C)]
         × [m_1 P(視察 | D) + m_2 P(視察 | C) + m_3 P(視察 | 總統, D) + m_4 P(視察 | 總統, C)]
         × ..........
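A minimal sketch of the Type III (Uni+Bi*) score above, assuming the doc/corpus unigram and bigram tables and the tied weights m1..m4 (summing to 1) are already estimated; unseen N-grams contribute 0 to their mixture term and are covered by the remaining components. Function and variable names are illustrative.

import math

def type3_log_score(query, doc_uni, doc_bi, cor_uni, cor_bi, m1, m2, m3, m4):
    # log P(Q | D is R) for the Uni+Bi* structure (Type III)
    q1 = query[0]
    logp = math.log(m1 * doc_uni.get(q1, 0.0) + m2 * cor_uni.get(q1, 0.0))
    for prev, cur in zip(query, query[1:]):
        mix = (m1 * doc_uni.get(cur, 0.0)
               + m2 * cor_uni.get(cur, 0.0)
               + m3 * doc_bi.get((prev, cur), 0.0)
               + m4 * cor_bi.get((prev, cur), 0.0))
        logp += math.log(mix)
    return logp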

  6. HMM/N-gram-based Model
     • The role of the corpus N-gram probabilities P(q_n | Corpus) and P(q_n | q_{n-1}, Corpus)
       – Model the general distribution of the index terms
         • Help to solve the zero-frequency problem (P(q_n | D) = 0!)
         • Help to differentiate the contributions of different missing terms in a doc
       – The corpus N-gram probabilities were estimated using an outside corpus
     • Example doc D = q_c q_b q_a q_b q_a q_a q_c q_d q_a q_b:
       P(q_a | D) = 0.4, P(q_b | D) = 0.3, P(q_c | D) = 0.2, P(q_d | D) = 0.1, P(q_e | D) = 0.0, P(q_f | D) = 0.0
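A tiny numeric illustration (all numbers made up) of how the corpus unigram differentiates the two terms q_e and q_f that are missing from the toy doc D above: both get P(q | D) = 0, but the interpolated scores differ.

m1, m2 = 0.7, 0.3                       # hypothetical tied weights
p_corpus = {"q_e": 0.02, "q_f": 0.001}  # hypothetical corpus unigrams

score_e = m1 * 0.0 + m2 * p_corpus["q_e"]  # 0.006
score_f = m1 * 0.0 + m2 * p_corpus["q_f"]  # 0.0003
# Both are nonzero (no zero-frequency problem) and different
# (missing terms are differentiated by their corpus probabilities).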

  7. HMM/N-gram-based Model
     • Estimation of N-grams (Language Models)
       – Maximum likelihood estimation (MLE) for doc N-grams
         • Unigram:
           P(q_i | D) = C_D(q_i) / Σ_{q_j ∈ D} C_D(q_j)
           C_D(q_i): count of term q_i in the doc D; the denominator is the length of the doc D
         • Bigram:
           P(q_i | q_j, D) = C_D(q_j, q_i) / C_D(q_j)
           C_D(q_j, q_i): count of the term pair (q_j, q_i) in the doc D; C_D(q_j): count of term q_j in the doc D
       – Similar formulas for the corpus N-grams:
         P(q_i | Corpus) = C_Corpus(q_i) / Σ_{q_j ∈ Corpus} C_Corpus(q_j)
         P(q_i | q_j, Corpus) = C_Corpus(q_j, q_i) / C_Corpus(q_j)
         Corpus: an outside corpus or just the doc collection
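A short sketch of the MLE formulas above for doc N-grams, using standard-library counters; mle_doc_ngrams is an illustrative name. The corpus N-grams would be computed the same way over the outside corpus (or the doc collection).

from collections import Counter

def mle_doc_ngrams(doc_tokens):
    # P(q_i | D)      = C_D(q_i) / |D|
    # P(q_i | q_j, D) = C_D(q_j, q_i) / C_D(q_j)
    uni_counts = Counter(doc_tokens)
    bi_counts = Counter(zip(doc_tokens, doc_tokens[1:]))
    doc_len = len(doc_tokens)
    p_uni = {q: c / doc_len for q, c in uni_counts.items()}
    p_bi = {(qj, qi): c / uni_counts[qj] for (qj, qi), c in bi_counts.items()}
    return p_uni, p_bi

# The toy doc of slide 6: qa -> 0.4, qb -> 0.3, qc -> 0.2, qd -> 0.1
p_uni, p_bi = mle_doc_ngrams(["qc", "qb", "qa", "qb", "qa", "qa", "qc", "qd", "qa", "qb"])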

  8. HMM/N-gram-based Model
     • Basically, m_1, m_2, m_3, m_4 can be estimated using the Expectation-Maximization (EM) algorithm because of the insufficiency of training data
       – All docs share the same weights here
       – The N-gram probability distributions can also be estimated using the EM algorithm instead of maximum likelihood estimation
     • For those docs with training queries, m_1, m_2, m_3, m_4 can be estimated using the Minimum Classification Error (MCE) training algorithm
       – The docs can have different weights

  9. HMM/N-gram-based Model
     • Expectation-Maximization (EM) Training
       – The weights are tied among the documents
       – E.g., m_1 of the Type I HMM can be trained using the following equation, where the hatted weights m̂_1, m̂_2 are the old weights and m_1 is the new weight (training data: 2265 docs, 819 queries):
         m_1 = { Σ_{Q ∈ TrainSet} Σ_{D ∈ Doc_{R to Q}} Σ_{q_n ∈ Q} [ m̂_1 P(q_n | D) / ( m̂_1 P(q_n | D) + m̂_2 P(q_n | Corpus) ) ] } / { Σ_{Q ∈ TrainSet} |Q| · |Doc_{R to Q}| }
     • Where TrainSet is the set of training query exemplars, Doc_{R to Q} is the set of docs that are relevant to a specific training query exemplar Q, |Q| is the length of the query Q, and |Doc_{R to Q}| is the total number of docs relevant to the query Q
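A minimal sketch of one EM re-estimation step for the tied Type I weights, following the update equation above; the per-doc and corpus unigram tables are assumed given, and the data layout (a list of training queries with their relevant doc ids) is an assumption made for illustration.

def em_update_m1(train_queries, m1, m2, p_doc, p_corpus):
    # train_queries: list of (query_terms, relevant_doc_ids)
    # p_doc[d][q]: P(q | D = d), p_corpus[q]: P(q | Corpus)
    numer, denom = 0.0, 0.0
    for query_terms, rel_docs in train_queries:
        for d in rel_docs:
            for q in query_terms:
                mix = m1 * p_doc[d].get(q, 0.0) + m2 * p_corpus.get(q, 0.0)
                if mix > 0.0:
                    # posterior probability that q was generated by the doc state
                    numer += m1 * p_doc[d].get(q, 0.0) / mix
        denom += len(query_terms) * len(rel_docs)   # |Q| * |Doc relevant to Q|
    new_m1 = numer / denom
    return new_m1, 1.0 - new_m1   # new (m1, m2)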

  10. HMM/N-gram-based Model
      • Expectation-Maximization Training: does the re-estimation improve the likelihood, i.e., is P(Q | D̂) > P(Q | D)? (D: the old model, D̂: the new model; k indexes the hidden states, i.e., the mixture components, of the HMM)
      • Empirical derivation:
        log P(Q | D̂) − log P(Q | D)
          = Σ_{q_n ∈ Q} Σ_k P(k | q_n, D) log [ P(q_n, k | D̂) / P(k | q_n, D̂) ]
            − Σ_{q_n ∈ Q} Σ_k P(k | q_n, D) log [ P(q_n, k | D) / P(k | q_n, D) ]
          = [ Φ(D, D̂) − Φ(D, D) ] + [ H(D, D) − H(D, D̂) ]
        where Φ(D, D̂) = Σ_{q_n ∈ Q} Σ_k P(k | q_n, D) log P(q_n, k | D̂)
        and   H(D, D̂) = Σ_{q_n ∈ Q} Σ_k P(k | q_n, D) log P(k | q_n, D̂)
      • By Jensen's inequality (log x ≤ x − 1):
          Σ_i p_i log q_i − Σ_i p_i log p_i = Σ_i p_i log (q_i / p_i) ≤ Σ_i p_i (q_i / p_i − 1) = 0
        so H(D, D) − H(D, D̂) ≥ 0
      • Therefore, if Σ_{q_n ∈ Q} Σ_k P(k | q_n, D) log P(q_n, k | D̂) ≥ Σ_{q_n ∈ Q} Σ_k P(k | q_n, D) log P(q_n, k | D),
        i.e., Φ(D, D̂) ≥ Φ(D, D), then P(Q | D̂) ≥ P(Q | D)
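As a small sanity check of the monotonicity argument above, a self-contained toy example (all numbers made up): iterating the E and M steps for the tied weight of a two-state (doc/corpus) unigram mixture never decreases log P(Q | D).

import math

p_doc = {"a": 0.5, "b": 0.0, "c": 0.5}   # hypothetical P(q | D); "b" is unseen in D
p_cor = {"a": 0.2, "b": 0.1, "c": 0.7}   # hypothetical P(q | Corpus)
query = ["a", "b", "c", "a"]

def loglik(m1):
    return sum(math.log(m1 * p_doc[q] + (1 - m1) * p_cor[q]) for q in query)

m1, prev = 0.5, -math.inf
for _ in range(10):
    assert loglik(m1) >= prev - 1e-12    # EM guarantee: likelihood never decreases
    prev = loglik(m1)
    # E-step: posterior of the doc state; M-step: re-estimate the tied weight
    post = [m1 * p_doc[q] / (m1 * p_doc[q] + (1 - m1) * p_cor[q]) for q in query]
    m1 = sum(post) / len(query)
print(m1, loglik(m1))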
