Analysis of the Paragraph Vector Model for Information Retrieval


  1. Analysis of the Paragraph Vector Model for Information Retrieval Qingyao Ai 1 , Liu Yang 1 , Jiafeng Guo 2 , W. Bruce Croft 1 1 College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA {aiqy, lyang, croft}@cs.umass.edu 2 CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, China guojiafeng@ict.ac.cn

  2. Motivation
• Most tasks in IR benefit from representations that reflect the semantic relationships between words and documents.
• Word-document matching is essential for language modeling approaches.
• Topic models: PLSA, LDA, …
• Neural models: Word2vec, paragraph vector
[Slide figure: a query ("president car") and a document ("government vehicle") share no terms, so their bag-of-words vectors (e.g. [0,0,1,0,0,1,0,0] vs. [2,0,0,0,0,0,0,3]) give a matching score of 0, while topic/neural embeddings give a score > 0]
• Advantages of the paragraph vector model:
  – No a priori topic number
  – Highly efficient in training
  – Automatically learns document representations
  – Acts as a language model
  – Optimizes a weighting scheme widely used in IR

  3. Outline
• Paragraph Vector Based Retrieval Model
  – What is paragraph vector model
  – How to use it for retrieval
• Issues of Paragraph Vector Model in Retrieval Scenario
  – Over-fitting on short documents
  – Improper noise distribution
  – Insufficient modeling for word substitution
• Experiments
  – Experiment setup
  – Results
  – Parameter sensitivity

  4. Paragraph Vector Model
• The paragraph vector model [13] jointly learns embeddings for words and documents by optimizing the probabilities of observed word-document pairs, defined as:

    P(w|d) = \frac{\exp(\vec{w} \cdot \vec{d})}{\sum_{w' \in V_w} \exp(\vec{w}' \cdot \vec{d})}    (1)

• The structure shown on this slide is the paragraph vector model with the distributed bag-of-words assumption (PV-DBOW).
[Slide figure: document d mapped into a semantic space together with words such as "food", "research", "vaccine", "drug"]
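The following is a minimal NumPy sketch of how the softmax in Equation (1) could be evaluated over a toy vocabulary; the function name, array names, and dimensions are illustrative assumptions, not details from the paper.

```python
import numpy as np

def pv_dbow_word_prob(word_vecs, doc_vec):
    """Softmax P(w|d) over the whole vocabulary, as in Eq. (1).

    word_vecs: (|V_w|, dim) matrix of word embeddings
    doc_vec:   (dim,) paragraph (document) embedding
    Returns a (|V_w|,) vector of probabilities.
    """
    scores = word_vecs @ doc_vec        # w . d for every word in the vocabulary
    scores -= scores.max()              # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy usage: 5-word vocabulary, 8-dimensional embeddings (illustrative only)
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(5, 8))
doc_vec = rng.normal(size=8)
print(pv_dbow_word_prob(word_vecs, doc_vec))  # probabilities sum to 1.0
```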

  5. Language Estimation with Paragraph Vector Model
• Inspired by the LDA-based retrieval model [24], we apply the paragraph vector model by smoothing the probability estimation in language modeling approaches with PV-DBOW, and propose a paragraph vector based retrieval model (PV-LM):

    P(q_1|d) = \lambda P_{PV}(q_1|d) + (1 - \lambda) P_{LM}(q_1|d)    (2)

[Slide figure: query "food drug law" with terms q1 (food), q2 (drug), q3 (law) matched against document d in the semantic space]
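A small sketch of the interpolation in Equation (2), assuming the PV-DBOW estimate and the baseline language-model estimate are available as callables; all function and parameter names here are hypothetical.

```python
import math

def pv_lm_query_score(query_terms, doc, p_pv, p_lm, lam=0.5):
    """Log-likelihood of a query under the PV-LM mixture of Eq. (2).

    p_pv(w, d): probability of w given d from PV-DBOW (Eq. 1)
    p_lm(w, d): probability of w given d from the baseline language model
    lam:        interpolation weight lambda in [0, 1]
    """
    score = 0.0
    for w in query_terms:
        p = lam * p_pv(w, doc) + (1.0 - lam) * p_lm(w, doc)
        score += math.log(max(p, 1e-12))  # guard against zero probabilities
    return score
```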

  6. Language Estimation with Paragraph Vector Model
• However, PV-LM did not produce promising results:
  – The performance of PV-LM is highly sensitive to the training iteration of PV-DBOW.
  – The mean average precision (MAP) of PV-LM does not outperform LDA-LM [24] on Robust04 (0.259).

Figure 1: The MAP of QL and the PV-based retrieval model with the original PV-DBOW on Robust04 with title queries, with respect to the number of training iterations.

  7. Outline
• Paragraph Vector Based Retrieval Model
  – What is paragraph vector model
  – How to use it for retrieval
• Issues of Paragraph Vector Model in Retrieval Scenario
  – Over-fitting on short documents
  – Improper noise distribution
  – Insufficient modeling for word substitution
• Experiments
  – Experiment setup
  – Results
  – Parameter sensitivity

  8. Overfitting on Short Documents
[Slide figures; legend: Iter 5, Iter 20, Iter 80]
Figure 2: The distribution of documents with respect to document length for the top 50 documents retrieved by the PV-based retrieval model on Robust04 (title queries).
Figure 3: The distribution of vector norms with respect to document length for 10,000 documents randomly sampled from Robust04.
• The PV-based retrieval model tends to retrieve more short documents as the number of training iterations increases.
• In a subset of 10,000 randomly sampled documents, we observed a significant norm increase for short documents' vectors.

  9. Overfitting on Short Documents
[Figures 2 and 3 repeated from the previous slide]
• Large document vector norms change the probability distribution of document language models and make them focus on observed words.
• One direct solution to this problem is L2 regularization:

    \ell'(w,d) = \ell(w,d) - \frac{\gamma}{\#d} \|\vec{d}\|^2    (3)
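As a rough illustration of Equation (3), the sketch below applies the length-normalized L2 penalty inside a single gradient-ascent step on a document vector; grad_l, gamma, and the learning rate are assumed names and values, not the authors' implementation.

```python
import numpy as np

def regularized_doc_update(doc_vec, grad_l, doc_len, gamma, lr=0.025):
    """One gradient-ascent step on Eq. (3): l'(w,d) = l(w,d) - (gamma / #d) * ||d||^2.

    doc_vec: current document embedding
    grad_l:  gradient of the unregularized objective l(w,d) w.r.t. doc_vec
    doc_len: document length #d; because the penalty is applied once per
             word-document pair, dividing by #d keeps the total shrinkage
             per document roughly independent of its length
    """
    grad = grad_l - 2.0 * (gamma / doc_len) * doc_vec
    return doc_vec + lr * grad
```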

  10. Negative Sampling
• Proposed by Mikolov et al. [17], negative sampling is a technique that approximates the global objective of PV-DBOW by sampling k "negative" terms from the corpus:

    \ell = \sum_{w \in V_w} \sum_{d \in V_d} \#(w,d) \log \sigma(\vec{w} \cdot \vec{d}) + \sum_{w \in V_w} \sum_{d \in V_d} \#(w,d) \left( k \cdot E_{w_N \sim P_V}[\log \sigma(-\vec{w}_N \cdot \vec{d})] \right)    (4)

• If we derive the local objective of a specific word-document pair and set its partial derivative to zero, we have:

    \vec{w} \cdot \vec{d} = \log\left( \#(w,d) \frac{1}{\#(d)\, P_V(w)} \right) - \log k    (5)

[Slide figure: for document d, the observed word "food" is a positive example (+), while k words sampled from the corpus (e.g. "computer", "morning", "America") serve as negative examples (-)]
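The sketch below approximates the local negative-sampling objective of Equation (4) for a single word-document pair by drawing k negatives from the noise distribution; the array shapes and identifiers are assumptions for illustration, not the original training code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(word_vec, doc_vec, word_vecs, noise_probs, k, rng):
    """Local negative-sampling objective for one (w, d) pair, as in Eq. (4).

    word_vecs:   (|V_w|, dim) matrix of all word embeddings
    noise_probs: (|V_w|,) noise distribution P_V (or P_D) over the vocabulary
    k:           number of negative samples; summing k samples approximates
                 the k * E[...] expectation term
    """
    pos = np.log(sigmoid(word_vec @ doc_vec))
    neg_ids = rng.choice(len(word_vecs), size=k, p=noise_probs)
    neg = np.log(sigmoid(-word_vecs[neg_ids] @ doc_vec)).sum()
    return pos + neg
```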

  11. Improper Noise Distribution
• The original negative sampling technique adopts an empirical word distribution as the noise distribution:

    P_V(w_N) = \frac{\# w_N}{|C|}    (6)

  which makes the original PV-DBOW optimize a variation of the TF-ICF weighting scheme (cf. Eq. (5)).
• Empirically:
  – CF-based negative sampling suppresses frequent words too much.
  – TF-ICF weighting loses the document structure information.
• We propose a document-frequency based noise distribution:

    P_D(w_N) = \frac{\#D(w_N)}{\sum_{w' \in V_w} \#D(w')}    (7)

  which makes PV-DBOW optimize a variation of the TF-IDF weighting scheme.

Figure 4: The distribution of the original negative sampling (PV) and the document-frequency based negative sampling (PD). The horizontal axis represents the log value of word frequency (base 10); the vertical axis represents the negative sampling probability.
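A tiny sketch contrasting the two noise distributions of Equations (6) and (7), given corpus-frequency and document-frequency counts; the input names are assumptions.

```python
import numpy as np

def cf_noise_distribution(corpus_freq):
    """Original noise distribution P_V (Eq. 6): proportional to corpus frequency #w_N."""
    cf = np.asarray(corpus_freq, dtype=float)
    return cf / cf.sum()

def df_noise_distribution(doc_freq):
    """Document-frequency based noise distribution P_D (Eq. 7): proportional to #D(w_N)."""
    df = np.asarray(doc_freq, dtype=float)
    return df / df.sum()
```

Since both functions return a probability vector over the vocabulary, switching the sampler from the CF-based to the DF-based distribution only requires passing a different `noise_probs` array to the negative-sampling step sketched above.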

  12. Insufficient Modeling for Word Substitution

Table 1: The cosine similarities between "clothing", "garment" and four relevant documents in Robust04 query 361 ("clothing sweatshops").

                                                         PV-DBOW
                                                    clothing   garment
  clothing                                            1.000      0.632
  LA112689-0194 (TF_clothing = 2, TF_garment = 26)    0.044      0.134
  LA112889-0108 (TF_clothing = 0, TF_garment = 10)   -0.003      0.100
  LA021090-0137 (TF_clothing = 7, TF_garment = 9)     0.052      0.092
  LA022890-0105 (TF_clothing = 6, TF_garment = 6)     0.066      0.079

• Existing topic models and embedding models mainly focus on two types of word relations: co-occurrence (e.g. topic-related words) and substitution (e.g. synonyms).
• PV-DBOW focuses on capturing word co-occurrence but ignores word-context information, which makes it difficult to capture word substitution relations (e.g. "clothing" and "garment").

  13. Insufficient Modeling for Word Substitution

Table 1 (extended): The cosine similarities between "clothing", "garment" and four relevant documents in Robust04 query 361 ("clothing sweatshops"), for PV-DBOW and for PV with the joint objective.

                                                         PV-DBOW              PV joint objective
                                                    clothing   garment      clothing   garment
  clothing                                            1.000      0.632        1.000      0.638
  LA112689-0194 (TF_clothing = 2, TF_garment = 26)    0.044      0.134        0.107      0.169
  LA112889-0108 (TF_clothing = 0, TF_garment = 10)   -0.003      0.100        0.126      0.155
  LA021090-0137 (TF_clothing = 7, TF_garment = 9)     0.052      0.092        0.147      0.119
  LA022890-0105 (TF_clothing = 6, TF_garment = 6)     0.066      0.079        0.107      0.107

• As suggested by Dai et al. [5] and Sun et al. [22], one approach to alleviate the problem is regularizing PV-DBOW by requiring word vectors to predict their context. Specifically, we apply a joint objective:

    \ell = \log \sigma(\vec{w}_i \cdot \vec{d}) + k \cdot E_{w_N \sim P_V}[\log \sigma(-\vec{w}_N \cdot \vec{d})] + \sum_{j = i - L,\, j \neq i}^{i + L} \left( \log \sigma(\vec{w}_i \cdot \vec{c}_j) + k \cdot E_{c_N \sim P_V}[\log \sigma(-\vec{w}_i \cdot \vec{c}_N)] \right)    (8)
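To make the joint objective of Equation (8) concrete, here is an illustrative sketch that combines the document term with a context window of half-width L; the expectation terms are approximated by summing over k sampled negatives, and all identifiers (ctx_vecs, noise_probs, etc.) are assumed rather than taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_objective(i, words, word_vecs, ctx_vecs, doc_vec, noise_probs, k, L, rng):
    """Joint PV-DBOW + word-context objective for position i, as in Eq. (8).

    words:     list of word ids in the document
    word_vecs: word (input) embeddings; ctx_vecs: context (output) embeddings
    L:         half window size; k: number of negative samples per term
    """
    w_i = word_vecs[words[i]]

    # Document part: predict w_i from the document vector
    obj = np.log(sigmoid(w_i @ doc_vec))
    neg_ids = rng.choice(len(word_vecs), size=k, p=noise_probs)
    obj += np.log(sigmoid(-word_vecs[neg_ids] @ doc_vec)).sum()

    # Context part: require w_i to predict the surrounding words in the window
    for j in range(max(0, i - L), min(len(words), i + L + 1)):
        if j == i:
            continue
        obj += np.log(sigmoid(w_i @ ctx_vecs[words[j]]))
        neg_ids = rng.choice(len(ctx_vecs), size=k, p=noise_probs)
        obj += np.log(sigmoid(-ctx_vecs[neg_ids] @ w_i)).sum()
    return obj
```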

  14. Outline
• Paragraph Vector Based Retrieval Model
  – What is paragraph vector model
  – How to use it for retrieval
• Issues of Paragraph Vector Model in Retrieval Scenario
  – Over-fitting on short documents
  – Improper noise distribution
  – Insufficient modeling for word substitution
• Experiments
  – Experiment setup
  – Results
  – Parameter sensitivity

  15. Experiment Setup
• Datasets:
  – TREC collections: Robust04 and GOV2* with title and description queries
  – Five-fold cross validation
  – Evaluation: mean average precision (MAP), normalized discounted cumulative gain (NDCG@20) and precision (P@20)
• Reported models:
  – QL: query likelihood model [19] with Dirichlet smoothing.
  – LDA-LM: the LDA-based retrieval model proposed by Wei and Croft [15].
  – PV-LM: the PV-based retrieval model with the PV-DBOW proposed by Le et al. [13].
  – EPV-R-LM: the PV-LM model with L2 regularization.
  – EPV-DR-LM: the EPV-R-LM model with document-frequency based negative sampling.
  – EPV-DRJ-LM: the EPV-DR-LM model with the joint objective.

* Due to efficiency issues, we used a random subset of 500k documents to train LDA and PV on GOV2.
