Statistical Modeling Approaches for Information Retrieval



  1. Statistical Modeling Approaches for Information Retrieval: 1. HMM/N-gram-based Model, 2. Topical Mixture Model (TMM), 3. Latent Semantic Indexing (LSI), 4. Probabilistic Latent Semantic Analysis (PLSA). Berlin Chen, 2004. References: 1. W. B. Croft and J. Lafferty (Eds.), Language Modeling for Information Retrieval, July 2003. 2. Berlin Chen et al., "A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents," ACM Transactions on Asian Language Information Processing, June 2004. 3. Berlin Chen, "Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval," 2004. 4. M. W. Berry et al., "Using Linear Algebra for Intelligent Information Retrieval," Technical Report, 1994. 5. Thomas Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis," Machine Learning, 2001.

  2. Taxonomy of Classic IR Models
  • User Task: Retrieval (Ad hoc, Filtering) and Browsing
  • Classic Models: Boolean, Vector, Probabilistic
  • Set Theoretic: Fuzzy, Extended Boolean
  • Algebraic: Generalized Vector, Latent Semantic Indexing (LSI), Neural Networks
  • Probabilistic (probability-based): Inference Network, Belief Network, Hidden Markov Model, Topical Mixture Model, Probabilistic LSI, Language Model
  • Structured Models: Non-Overlapping Lists, Proximal Nodes
  • Browsing: Flat, Structure Guided, Hypertext

  3. Two Perspectives for IR Models
  • Matching Strategy
  – Literal term matching: Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)
  – Concept matching: Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSI), Topical Mixture Model (TMM)
  • Learning Capability
  – Term weighting, query expansion, document expansion, etc.: Vector Space Model, Latent Semantic Indexing
  – Solid statistical foundations: Hidden Markov Model, Probabilistic Latent Semantic Indexing (PLSI), Topical Mixture Model (TMM)

  4. Two Perspectives for IR Models (cont.)
  • Literal Term Matching vs. Concept Matching
  • Example (a Mandarin broadcast-news story, automatically transcribed and containing recognition errors): the terms set off beside the passage, 中國解放軍 (China's People's Liberation Army), 蘇愷戰機 (Sukhoi fighter jets), and 中共新一代空軍戰力 (the PRC's new-generation air force strength), should be matched against a report that, citing military observers quoted by Hong Kong's Sing Tao Daily, Taiwan will completely lose its air superiority by 2005 because mainland China's fighters will surpass Taiwan's in both quantity and performance; China is importing advanced Russian weapons while speeding up development of its own systems (the improved Flying Leopard fighter built at the Xi'an aircraft plant, the Su-30), and plans to field some two hundred additional Sukhoi fighters by 2004. Many of these term pairs do not match literally, so concept matching is needed.

  5. HMM/N-gram-based Model
  • Model the query as a sequence of input observations (index terms), Q = q_1 q_2 ... q_n ... q_N
  • Model the doc D as a discrete HMM composed of distributions of N-gram parameters
  • The relevance measure, P(Q | D is R), can be estimated by the N-gram probabilities of the query's index term sequence as predicted by the doc D
  • A generative model for IR:
  D^* = \arg\max_D P(D \text{ is } R \mid Q) \approx \arg\max_D P(Q \mid D \text{ is } R)\, P(D \text{ is } R) \approx \arg\max_D P(Q \mid D \text{ is } R)
  with the assumption that the prior relevance probability P(D is R) is the same for all docs
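A minimal sketch of the ranking step this implies, assuming each document has already been turned into a language model that returns log P(Q | D is R); the names rank_documents and doc_models are illustrative, not from the slides:

```python
from typing import Callable, Dict, List, Tuple

def rank_documents(
    query_terms: List[str],
    doc_models: Dict[str, Callable[[List[str]], float]],
) -> List[Tuple[str, float]]:
    """Rank docs by log P(Q | D is R), assuming a uniform prior P(D is R)."""
    scored = [(doc_id, model(query_terms)) for doc_id, model in doc_models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```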

  6. HMM/N-gram-based Model (cont.)
  • Given a word sequence W = w_1 w_2 ... w_n ... w_N of length N, how can its probability P(W) be estimated?
  • The chain rule is applied:
  P(W) = P(w_1 w_2 \ldots w_n \ldots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \cdots w_{N-1})
  • Too complicated to estimate all the necessary probability terms!

  7. HMM/N-gram-based Model (cont.)
  • N-gram approximation (Language Model)
  – Unigram: P(W) = P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)
  – Bigram: P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})
  – Trigram: P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_{N-2} w_{N-1})
  – ...
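A small sketch of scoring a word sequence under the unigram and bigram approximations; the probability tables p_uni and p_bi are assumed to be given and are illustrative, not from the slides:

```python
import math

def unigram_logprob(words, p_uni):
    """log P(W) = sum_n log P(w_n) under the unigram approximation."""
    return sum(math.log(p_uni[w]) for w in words)

def bigram_logprob(words, p_uni, p_bi):
    """log P(W) = log P(w_1) + sum_{n>=2} log P(w_n | w_{n-1})."""
    logp = math.log(p_uni[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(p_bi[(prev, cur)])
    return logp
```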

  8. HMM/N-gram-based Model (cont.)
  • A discrete HMM composed of distributions of N-gram parameters (viewed as a language model source)
  • Bigram modeling:
  P(Q \mid D \text{ is } R) = P(q_1 \mid D) \prod_{n=2}^{N} P(q_n \mid q_{n-1}, D)
  • Smoothing/interpolation is applied to avoid zero probabilities and to bring in the general corpus statistics:
  P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\big]
  • Each query term q_n of Q = q_1 q_2 ... q_n ... q_N is thus generated by a mixture of N-gram probability distributions, with mixture weights m_1 + m_2 + m_3 + m_4 = 1
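An illustrative sketch of this interpolated scoring, assuming unigram/bigram probability tables for the doc and for the background corpus; the names term_mixture_prob, doc_lm, and corpus_lm are assumptions, not from the paper:

```python
import math

def term_mixture_prob(q_prev, q, doc_lm, corpus_lm, m1, m2, m3, m4):
    """m1*P(q|D) + m2*P(q|Corpus) + m3*P(q|q_prev,D) + m4*P(q|q_prev,Corpus)."""
    p = m1 * doc_lm["uni"].get(q, 0.0) + m2 * corpus_lm["uni"].get(q, 0.0)
    if q_prev is not None:  # the first query term uses only the unigram mixture
        p += m3 * doc_lm["bi"].get((q_prev, q), 0.0)
        p += m4 * corpus_lm["bi"].get((q_prev, q), 0.0)
    return p

def log_relevance_score(query_terms, doc_lm, corpus_lm, m1, m2, m3, m4):
    """log P(Q | D is R) under the interpolated bigram model."""
    logp, q_prev = 0.0, None
    for q in query_terms:
        logp += math.log(term_mixture_prob(q_prev, q, doc_lm, corpus_lm, m1, m2, m3, m4))
        q_prev = q
    return logp
```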

  9. HMM/N-gram-based Model (cont.)
  • Variants: Three Types of HMM Structures
  – Type I: Unigram-Based (Uni)
  P(Q \mid D \text{ is } R) = \prod_{n=1}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})\big]
  – Type II: Unigram/Bigram-Based (Uni+Bi)
  P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D)\big]
  – Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)
  P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\big]
  • Example (Type III), for the query 陳水扁 總統 視察 阿里山 小火車 ("President Chen Shui-bian inspects the Alishan forest train"):
  P(陳水扁 總統 視察 阿里山 小火車 | D) = [m_1 P(陳水扁 | D) + m_2 P(陳水扁 | C)] × [m_1 P(總統 | D) + m_2 P(總統 | C) + m_3 P(總統 | 陳水扁, D) + m_4 P(總統 | 陳水扁, C)] × [m_1 P(視察 | D) + m_2 P(視察 | C) + m_3 P(視察 | 總統, D) + m_4 P(視察 | 總統, C)] × ...
  • A small usage sketch follows below.
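Continuing the earlier log_relevance_score sketch, the three structures differ only in which mixture weights are non-zero; the weight values and the toy models below are placeholders, not values from the paper:

```python
# Toy placeholder models reusing the log_relevance_score sketch above.
doc_lm = {"uni": {"president": 0.05, "train": 0.02}, "bi": {("president", "train"): 0.01}}
corpus_lm = {"uni": {"president": 0.001, "train": 0.002}, "bi": {("president", "train"): 0.0005}}
query_terms = ["president", "train"]

type1_weights = dict(m1=0.7, m2=0.3, m3=0.0, m4=0.0)  # Type I:   Uni
type2_weights = dict(m1=0.4, m2=0.3, m3=0.3, m4=0.0)  # Type II:  Uni+Bi
type3_weights = dict(m1=0.4, m2=0.3, m3=0.2, m4=0.1)  # Type III: Uni+Bi*

print(log_relevance_score(query_terms, doc_lm, corpus_lm, **type3_weights))
```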

  10. HMM/N-gram-based Model (cont.)
  • The role of the corpus N-gram probabilities P(q_n | Corpus) and P(q_n | q_{n-1}, Corpus)
  – Model the general distribution of the index terms
  – Help to solve the zero-frequency problem (P(q_n | D) = 0!)
  – Help to differentiate the contributions of different missing terms in a doc (global information, much like IDF)
  – The corpus N-gram probabilities were estimated using an outside corpus
  • Example: Doc D = q_c q_b q_a q_b q_a q_a q_c q_d q_a q_b, so that P(q_a | D) = 0.4, P(q_b | D) = 0.3, P(q_c | D) = 0.2, P(q_d | D) = 0.1, P(q_e | D) = 0.0, P(q_f | D) = 0.0

  11. HMM/N-gram-based Model (cont.)
  • Estimation of N-grams (Language Models)
  – Maximum likelihood estimation (MLE) for doc N-grams
  • Unigram: P(q_i \mid D) = C_D(q_i) / \sum_{q_j \in D} C_D(q_j), where C_D(q_i) is the count of term q_i in the doc D and the denominator is the length of the doc D (i.e., the number of terms in D)
  • Bigram: P(q_i \mid q_j, D) = C_D(q_j, q_i) / C_D(q_j), where C_D(q_j, q_i) is the count of the term pair (q_j, q_i) in the doc D
  – Similar formulas for the corpus N-grams: P(q_i \mid \text{Corpus}) = C_{\text{Corpus}}(q_i) / \sum_{q_j} C_{\text{Corpus}}(q_j) and P(q_i \mid q_j, \text{Corpus}) = C_{\text{Corpus}}(q_j, q_i) / C_{\text{Corpus}}(q_j), where C_{\text{Corpus}}(q_i) is the count of term q_i in the Corpus
  – Corpus: an outside corpus or just the doc collection
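A small sketch of these ML estimates computed from raw counts; tokenisation is assumed to be done already and the function names are illustrative:

```python
from collections import Counter

def ml_unigram(doc_terms):
    """P(q_i | D) = C_D(q_i) / |D|."""
    counts, total = Counter(doc_terms), len(doc_terms)
    return {q: c / total for q, c in counts.items()}

def ml_bigram(doc_terms):
    """P(q_i | q_j, D) = C_D(q_j, q_i) / C_D(q_j)."""
    history_counts = Counter(doc_terms[:-1])  # C_D(q_j) counted over bigram histories
    pair_counts = Counter(zip(doc_terms, doc_terms[1:]))
    return {(qj, qi): c / history_counts[qj] for (qj, qi), c in pair_counts.items()}
```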

  12. HMM/N-gram-based Model (cont.)
  • Basically, the weights m_1, m_2, m_3, m_4 can be estimated with the Expectation-Maximization (EM) algorithm because the training data are insufficient
  – All docs share the same weights m_i here
  – The N-gram probability distributions can also be re-estimated with the EM algorithm instead of maximum likelihood (ML) estimation
  • Unsupervised: using the doc itself, ML
  • Supervised: using query exemplars, EM
  • For those docs with training queries, m_1, m_2, m_3, m_4 can instead be estimated with the Minimum Classification Error (MCE) training algorithm
  – The docs can then have different weights

  13. HMM/N-gram-based Model (cont.)
  • Expectation-Maximization Training
  – The weights are tied among the documents
  – E.g., m_1 of the Type I HMM can be re-estimated from the old weights with the following update (training set on the slide: 2265 docs, 819 queries):
  \hat{m}_1 = \frac{\sum_{Q \in \text{TrainSet}} \sum_{D \in \text{Doc}_{R \text{ to } Q}} \sum_{q_n \in Q} \dfrac{m_1 P(q_n \mid D)}{m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})}}{\sum_{Q \in \text{TrainSet}} |Q| \cdot |\text{Doc}_{R \text{ to } Q}|}
  • where TrainSet is the set of training query exemplars, Doc_{R to Q} is the set of docs that are relevant to a specific training query exemplar Q, |Q| is the length of the query Q, and |Doc_{R to Q}| is the total number of docs relevant to the query Q
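A sketch of this tied-weight EM update for the Type I (unigram) model; the data layout (train_set as (query_terms, relevant_doc_ids) pairs, plus doc_lms and corpus_lm unigram tables) is an assumption made for illustration:

```python
def em_update_weights(train_set, doc_lms, corpus_lm, m1=0.5, n_iters=10):
    """Re-estimate the tied Type I weights (m1, m2), with m1 + m2 = 1."""
    m2 = 1.0 - m1
    for _ in range(n_iters):
        num, den = 0.0, 0.0
        for query_terms, rel_doc_ids in train_set:        # Q in TrainSet
            for d in rel_doc_ids:                         # D in Doc_{R to Q}
                for q in query_terms:                     # q_n in Q
                    p_doc = m1 * doc_lms[d].get(q, 0.0)
                    p_mix = p_doc + m2 * corpus_lm.get(q, 0.0)
                    if p_mix > 0.0:
                        num += p_doc / p_mix              # E-step: posterior of the doc state
            den += len(query_terms) * len(rel_doc_ids)    # |Q| * |Doc_{R to Q}|
        m1 = num / den                                    # M-step: new tied weight
        m2 = 1.0 - m1
    return m1, m2
```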
