Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms
Tobias Berka, Marian Vajteršic
April 30, 2011

Outline
1 Introduction
  Dimensionality Reduction
2 Rare Term Vector Replacement
  Zipf's Law
  Replacement Vectors
  Rare Term Replacement
3 Evaluation
  Retrieval Performance
  Computational Performance
  Stability
4 Summary & Conclusions

Introduction

Goals
- Reduce dimensionality.
- Preserve or improve:
  - pair-wise distances,
  - cross-class scatter,
  - retrieval / clustering / classification performance.
- Detect:
  - contributing factors,
  - individual components,
  - signals or noise.

Methods
- Great classics: linear methods
  - Singular Value Decomposition,
  - Principal Component Analysis (PCA),
  - Non-negative Matrix Factorization(s),
  - Independent Component Analysis.
- Canonical extension: kernel methods.
- Maps: mesh fitting, self-organization.
- Manifold learning: local linearization, local non-linear reduction.

My Interest
- Better retrieval, more complete retrieval.
- Dynamic searching, less reliance on static indices.
- Interactive semi-supervised clustering, exploratory data analysis, search.
- Sparse → dense: good for super-scalar CPUs, more efficient parallelism.

Rare Term Vector Replacement

Zipf's Law
"The [document] frequency of a word is reciprocally proportional to its frequency rank."
\[ f_i \propto \frac{1}{\mathrm{rank}(f_i)} \]

Zipf's Law in Practice
"Most words occur only in a very small number of documents."

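This observation can be checked on any tokenized corpus by counting document frequencies. The following is a minimal sketch in Python, assuming a hypothetical four-document toy corpus; the function name `document_frequencies` and the data are illustrative and not taken from the talk.

```python
from collections import Counter
from statistics import median


def document_frequencies(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return df


# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "a quantum leap for the cat".split(),
]

df = document_frequencies(docs)
singletons = sum(1 for count in df.values() if count == 1)

print("terms:", len(df))
print("median document frequency:", median(df.values()))
print("terms occurring in exactly one document:", singletons)
```

Even on this tiny corpus, more than half of the vocabulary occurs in a single document, which is the pattern the quartiles in the plots of this section describe.
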
Zipf's Law in Pictures
[Figure: occurrences (log scale, 1 to 10000) against feature (relative, 0 to 1). Q1=1, Q2=2, Q3=7, mean=75.93, cut-off=694 occurrences.]

Zipf's Law in Pictures
[Figure: occurrences (log scale, 1 to 100) against feature (relative, 0 to 1). Q1=1, Q2=1, Q3=4, mean=6.52, cut-off=10 occurrences.]

Zipf's Law vs. Dimensionality Reduction
- Eliminate rare terms? High importance for information retrieval!
- Can we compress them?

Replacement Vectors
Let's compute replacement vectors!

Centroid Summarization
- We operate on a corpus in vector form.
- Select the vectors containing a rare term.
- Compute the centroid.

Vector Truncation
Discard the rare features.
[Figure: occurrences (log scale, 1 to 100) against feature (relative, 0 to 1). Q1=1, Q2=1, Q3=4, mean=6.52, cut-off=10 occurrences.]

Computing Replacement Vectors
For all rare terms, we compute the following:
- Select all documents containing the rare term,
- Compute the (weighted) average vector,
- Truncate all rare terms from this vector.

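The three steps above can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: documents are assumed to be sparse vectors stored as `dict`s from term to weight, the set `rare_terms` is assumed to have been chosen beforehand by a document-frequency cut-off, and the "weighted" average is assumed to weight each document by the weight of the rare term in it. The slides do not specify the weighting, so that choice is a guess.

```python
def replacement_vector(term, docs, rare_terms):
    """Weighted centroid of the documents containing `term`,
    truncated to the common (non-rare) terms."""
    total = {}
    weight_sum = 0.0
    for doc in docs:
        w = doc.get(term, 0.0)
        if w == 0.0:
            continue  # select only documents containing the rare term
        weight_sum += w
        for t, x in doc.items():
            if t not in rare_terms:  # truncation: keep common terms only
                total[t] = total.get(t, 0.0) + w * x
    if weight_sum == 0.0:
        return {}
    return {t: x / weight_sum for t, x in total.items()}


def replacement_vectors(docs, rare_terms):
    """One replacement vector per rare term."""
    return {t: replacement_vector(t, docs, rare_terms) for t in rare_terms}


# Hypothetical toy corpus: "quark" and "boson" play the role of rare terms.
docs = [
    {"the": 1.0, "cat": 2.0, "quark": 1.0},
    {"the": 1.0, "cat": 1.0, "quark": 3.0},
    {"the": 2.0, "cat": 1.0, "boson": 1.0},
    {"the": 1.0, "cat": 1.0},
]
rare_terms = {"quark", "boson"}

R = replacement_vectors(docs, rare_terms)
print(R["quark"])  # {'the': 1.0, 'cat': 1.25}
print(R["boson"])  # {'the': 2.0, 'cat': 1.0}
```

This version scans the whole corpus once per rare term, which is simple to read but costly when there are many rare terms.
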
A More Efficient Algorithm
- For all documents,
  - For all rare terms,
    - Add the common terms to the average vector.

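One way to read this loop nest is as a single pass over the corpus that updates one accumulator per rare term. The sketch below makes the same assumptions as before (sparse `dict` vectors, a precomputed `rare_terms` set, weighting by the rare term's weight), and interprets "for all rare terms" as "for all rare terms occurring in the current document". All names are hypothetical.

```python
def replacement_vectors_single_pass(docs, rare_terms):
    """Accumulate all replacement vectors in one pass over the corpus."""
    sums = {r: {} for r in rare_terms}       # per rare term: weighted sum of common terms
    weights = {r: 0.0 for r in rare_terms}   # per rare term: total weight seen
    for doc in docs:
        rare = [(t, w) for t, w in doc.items() if t in rare_terms and w != 0.0]
        if not rare:
            continue
        common = [(t, x) for t, x in doc.items() if t not in rare_terms]
        for r, w in rare:
            weights[r] += w
            acc = sums[r]
            for t, x in common:
                acc[t] = acc.get(t, 0.0) + w * x
    # Divide by the accumulated weight to obtain the weighted average.
    return {
        r: {t: x / weights[r] for t, x in acc.items()}
        for r, acc in sums.items()
        if weights[r] > 0.0
    }


# Hypothetical toy corpus: "quark" and "boson" play the role of rare terms.
docs = [
    {"the": 1.0, "cat": 2.0, "quark": 1.0},
    {"the": 1.0, "cat": 1.0, "quark": 3.0},
    {"the": 2.0, "cat": 1.0, "boson": 1.0},
    {"the": 1.0, "cat": 1.0},
]
R = replacement_vectors_single_pass(docs, {"quark", "boson"})
print(R)
```

Each document is touched once, and the work per document is proportional to the number of its rare terms times the number of its common terms, so documents without rare terms cost almost nothing.
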
New Document Representation
For all documents, we compute the following:
- Truncate all rare terms from the document vector (i.e. retain only common terms),
- Add the linear combination of the replacement vectors
  - for all rare terms in the document,
  - scaled by the weighted term frequency,
- Normalize the result to unit length.

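As an illustration of these steps, the sketch below transforms a single document. It assumes sparse `dict` vectors, a precomputed mapping `replacements` from each rare term to its replacement vector, and Euclidean normalization; the function name and the toy values are hypothetical.

```python
import math


def replace_rare_terms(doc, replacements):
    """Map a sparse document vector onto the common terms only."""
    # Step 1: truncate, i.e. keep only the common terms.
    new = {t: x for t, x in doc.items() if t not in replacements}
    # Step 2: add each rare term's replacement vector, scaled by the
    # weight of that rare term in this document.
    for r, w in doc.items():
        if r in replacements:
            for t, x in replacements[r].items():
                new[t] = new.get(t, 0.0) + w * x
    # Step 3: normalize to unit Euclidean length.
    norm = math.sqrt(sum(x * x for x in new.values()))
    if norm == 0.0:
        return new
    return {t: x / norm for t, x in new.items()}


# Hypothetical replacement vector for the rare term "quark".
replacements = {"quark": {"the": 2.0, "cat": 2.0}}
doc = {"the": 1.0, "cat": 2.0, "quark": 1.0}

new_doc = replace_rare_terms(doc, replacements)
print(new_doc)  # approximately {'the': 0.6, 'cat': 0.8}
```

The result contains only common terms, so every document lives in the same low-dimensional space regardless of which rare terms it originally contained.
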
Subsequent Rank Reduction
- Once we have computed the replacement vectors, we compute a rank-reduced PCA.
- Reduces number of features by 50%, improves the retrieval performance.
- Low number of features, dense data matrix: use a symmetric eigensolver.
- In LAPACK terms: xSPEVX, xSYEV, etc.

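A rank-reduced PCA of a small dense matrix can be sketched with NumPy, whose `numpy.linalg.eigh` is a symmetric eigensolver backed by LAPACK. The sketch assumes a dense document-by-feature matrix `X`, mean-centering before forming the covariance matrix, and a random toy matrix in place of real data; whether the authors center the data is not stated on the slide, and the function name `rank_reduce` is hypothetical.

```python
import numpy as np


def rank_reduce(X, k):
    """Project the rows of X (documents x features) onto the k leading
    principal axes, using a symmetric eigensolver on the covariance matrix."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # centering (an assumption of this sketch)
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # symmetric, features x features
    evals, evecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]        # indices of the k largest eigenvalues
    components = evecs[:, order]
    return Xc @ components, components, evals[order]


# Random stand-in for the dense matrix obtained after vector replacement.
rng = np.random.default_rng(0)
X = rng.random((20, 6))

k = X.shape[1] // 2                            # keep half of the features
Z, W, ev = rank_reduce(X, k)
print(Z.shape)
```

Because the eigenproblem is only as large as the number of remaining features, a dense solver is practical here even when the original term space has tens of thousands of dimensions.
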
Evaluation

Reuters Corpus
Reuters corpus, training set, all categories.
[Figure: mean precision (0.5 to 0.8) against hit list rank (0 to 100) for three representations: sparse TF-IDF (47,236), vector replacement (535), rank-reduced vector replacement (392).]