Searching String Collections for the Most Relevant Documents


  1. Searching String Collections for the Most Relevant Documents. Wing-Kai Hon (NTHU, Taiwan), Rahul Shah (LSU), Jeff Vitter (Texas A&M Univ.)

  2. Outline • Background on compressed data structures • Our framework • Achieving optimal results • Construction algorithms • Succinct solutions • Conclusions

  3. The Attack of Massive Data • Lots of massive data sets being generated – Web publishing, bioinformatics, XML, e-mail, satellite geographical data – IP address information, UPCs, credit cards, ISBN numbers, large inverted files • Data sets need to be compressed (and are compressible) – Mobile devices have limited storage available – I/O overhead is reduced – There is never enough memory! • Goal: design data structures to manage massive data sets – Near-minimum amount of space: measure space in a data-aware way, i.e., in terms of each individual data set – Near-optimal query times for powerful queries – Efficient in external memory

  4. Parallel Disk Model [Vitter, Shriver 90, 94] • N = problem size (80 GB – 100 TB and more!) • M = internal memory size (1 – 4 GB) • B = disk block size (8 – 500 KB) • D = # of independent disks • Scan: O(N / DB) • Sorting: O((N / DB) log_{M/B}(N / M)) • Search: O(log_{DB} N) • See the book [Vitter 08] for an overview
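  To make these bounds concrete, here is a small Python calculation with hypothetical parameter values (the sizes below are illustrative, not from the talk; constant factors and ceilings are dropped):

      import math

      def pdm_bounds(N, M, B, D):
          # I/O counts in the Parallel Disk Model (constants and ceilings dropped).
          scan = N / (D * B)                         # O(N / DB)
          sort = scan * math.log(N / M, M / B)       # O((N/DB) * log_{M/B}(N/M))
          search = math.log(N, D * B)                # O(log_{DB} N)
          return scan, sort, search

      # Hypothetical sizes: 1 PB of data, 4 GB of RAM, 64 KB blocks, 4 disks.
      print(pdm_bounds(N=2**50, M=2**32, B=2**16, D=4))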

  5. Indexing all the books in a library • a 10-floor library • a catalogue of books: each title and some keywords • negligible additional space: a small card (a few bytes) per book, one bookshelf to store the cards • but limited search operations!

  6. Word-level indexing (à la Google): search for a word using an inverted index. 1. Split the text into words. 2. Collect all distinct words in a dictionary. 3. For each word w, store the inverted list of its locations i1, i2, … in the text.
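  A minimal Python sketch of steps 1-3 (whitespace tokenization and a plain dictionary stand in for a real tokenizer and dictionary structure; this is illustrative, not the presenters' code):

      from collections import defaultdict

      def build_inverted_index(text):
          """Map each distinct word to the sorted list of its word positions."""
          index = defaultdict(list)
          for position, word in enumerate(text.lower().split()):  # step 1: split into words
              index[word].append(position)                        # steps 2-3: dictionary + inverted lists
          return index

      index = build_inverted_index("to be or not to be")
      print(index["to"])   # -> [0, 4]: the inverted list i1, i2, ... for the word "to"
      print(index["be"])   # -> [1, 5]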

  7. Word-level indexing • Simple implementation: one pointer per location; avg. word size ≈ pointer size, so index space ≈ text size • Much better implementation: compress the inverted lists by encoding the gaps between adjacent entries (e.g., γ and δ codes [WMB99]); index space drops to 10%-15% of the text size (library analogy: 1½ floors of index on top of the 10 floors of text)
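  As a small illustration of the gap-encoding idea, here is a sketch using Elias γ codes for the gaps (one of the codes discussed in [WMB99]; the posting list below is made up, and entries are assumed positive and strictly increasing):

      def gamma(x):
          """Elias gamma code of a positive integer: (len-1) zeros, then x in binary."""
          b = bin(x)[2:]
          return "0" * (len(b) - 1) + b

      def encode_list(postings):
          """Encode a strictly increasing inverted list by gamma-coding the gaps."""
          gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
          return "".join(gamma(g) for g in gaps)

      postings = [3, 7, 8, 21, 22, 25]
      bits = encode_list(postings)
      print(bits, len(bits), "bits")   # small gaps -> short codes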

  8. Full-text indexing (searching for a general pattern P) • Not handled efficiently by Google • No clear notion of word is always available: some Eastern languages, unknown structure (e.g., DNA sequences) • Alphabet Σ, text T of n symbols (i.e., n log |Σ| bits): each text position is the start of a potential occurrence of P • Naive approach: blow-up with O(n^2) words of space • Can we do better with O(n) words (i.e., O(n log n) bits)? Or even better with linear space, O(n log |Σ|) bits? Or best yet with compressed space, n H_k (1 + o(1)) bits?
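  A back-of-the-envelope comparison of these space regimes, for a hypothetical DNA-scale input of n = 10^9 symbols over |Σ| = 4 with 64-bit words (the H_k value is invented purely for illustration):

      import math

      n, sigma = 10**9, 4
      word_bits = 64
      GB = 8 * 2**30                                   # bits per gigabyte

      naive = n**2 * word_bits                         # O(n^2) words
      pointer_based = n * math.log2(n)                 # O(n log n) bits (suffix tree/array order)
      linear = n * math.log2(sigma)                    # n log|S| bits (the text itself)
      compressed = n * 1.9                             # n*H_k bits with a hypothetical H_k of 1.9

      for name, bits in [("O(n^2) words", naive), ("n log n bits", pointer_based),
                         ("n log|S| bits", linear), ("n*H_k bits", compressed)]:
          print(f"{name:>14}: {bits / GB:,.2f} GB")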

  9. Suffix tree / Patricia trie, |Σ| = 2 • Compact trie storing the suffixes of the input string bababa# (assuming a < # < b) • Space is O(n log n) bits >> text size of n bits • In practice, space is roughly 16n bytes [MM93] • (Figure, library analogy: roughly 160 floors of index versus 10 floors of text)
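  For illustration only, here is a naive quadratic-time Python construction of such a compact trie (edge labels stored as slices into the text; real suffix trees use linear-time constructions and careful memory layouts):

      class STNode:
          """Node of the compact trie; children map a first character to
          (start, end, child), i.e., an edge labeled text[start:end]."""
          def __init__(self):
              self.children = {}

      def build_suffix_tree(text):
          """Naive O(n^2)-time build: insert every suffix from the root.
          Assumes text ends with a unique terminator such as '#'."""
          root = STNode()
          n = len(text)
          for i in range(n):
              node, j = root, i
              while j < n:
                  c = text[j]
                  if c not in node.children:                 # new leaf edge
                      node.children[c] = (j, n, STNode())
                      break
                  start, end, child = node.children[c]
                  k = start
                  while k < end and text[k] == text[j]:      # walk down the edge
                      k += 1
                      j += 1
                  if k == end:                               # edge fully matched
                      node = child
                  else:                                      # mismatch: split the edge
                      mid = STNode()
                      node.children[c] = (start, k, mid)
                      mid.children[text[k]] = (k, end, child)
                      mid.children[text[j]] = (j, n, STNode())
                      break
          return root

      root = build_suffix_tree("bababa#")   # the slide's example string
      print(sorted(root.children))          # -> ['#', 'a', 'b']: one branch per first character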

  10. Suffix array • Sorted list of suffixes (assuming a < b < #) • Better space occupancy: n log n bits, 4n bytes in practice • Additional n bytes for the lcps [MM93] • Can find pattern P by binary search (actually there are better ways) • (Figure, library analogy: 40-50 floors of index versus 10 floors of text)
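  A minimal Python sketch of the suffix array and its binary-search lookup (naive O(n^2 log n) construction by sorting suffix slices; practical builders are far more efficient):

      def suffix_array(text):
          """Indices of all suffixes of text, sorted lexicographically."""
          return sorted(range(len(text)), key=lambda i: text[i:])

      def find(text, sa, P):
          """Return (lo, hi) so that sa[lo:hi] are the starting positions of P."""
          lo, hi = 0, len(sa)
          while lo < hi:                                  # first suffix >= P
              mid = (lo + hi) // 2
              if text[sa[mid]:] < P:
                  lo = mid + 1
              else:
                  hi = mid
          left, hi = lo, len(sa)
          while lo < hi:                                  # first suffix not starting with P
              mid = (lo + hi) // 2
              if text[sa[mid]:sa[mid] + len(P)] == P:
                  lo = mid + 1
              else:
                  hi = mid
          return left, lo

      # Note: in ASCII '#' sorts before 'a', unlike the slide's convention;
      # the occurrence ranges are still contiguous either way.
      text = "banana#"
      sa = suffix_array(text)            # -> [6, 5, 3, 1, 0, 4, 2]
      print(find(text, sa, "ana"))       # -> (2, 4): occurrences at sa[2]=3 and sa[3]=1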

  11. Space reduction • The importance of space saving (money saving): – Portable computing with limited memory – Search engines use DRAM in place of hard disks – Next-generation cellular phones will charge by the number of bits transmitted • Sparse suffix tree [KU96] and other data structures based on suffix trees, arrays, and automata [K95, CV97, ...] • Practical implementations of suffix trees reduce space, but still take 10n bytes [K99] or 2.5n bytes [M01] on average

  12. Compressed Suffix Array (Grossi, Gupta, Vitter 03) • O(|P| + polylog(n)) search time (in the RAM model) • Size of index equals size of text (with multiplicative constant 1)! • New indexes in entropy-compressed form (such as our CSA) require 20%-40% of the text size • Self-indexing text: no need to keep the text! Any portion of the text can be decoded from the index; decoding is fast and does not require scanning the whole text • Can cut search time further by a log n factor (word size) • First external-memory version in SPIRE 2009 • (Figure, library analogy: text 10 floors, inverted index 1½ floors, suffix array 40-50 floors, new compressed self-index 2-4 floors)

  13. Fundamental Problems in Text Search • Pattern Matching: Given a text T and pattern P drawn from alphabet Σ, find all locations of P in T. – Data structures: suffix trees and suffix arrays – Better: Compressed Suffix Arrays [GGV03], FM-Index [FM05] • Document Listing: Given a collection of text strings (documents) d1, d2, …, dD of total length n, search for a query pattern P (of length p). – Output the ndoc documents which contain pattern P. – Issue: the total number ndoc of documents output might be much smaller than the total number of pattern occurrences, so going through all occurrences is inefficient. – Muthukrishnan: O(n) words of space, answers queries in optimal O(p + ndoc) time.

  14. Modified Problem: Using Relevance • Instead of listing all documents (strings) in which the pattern occurs, list only highly ``relevant'' documents. – Frequency: where pattern P occurs most frequently. – Proximity: where two occurrences of P are close to each other. – Importance: where each document has a static weight (e.g., Google's PageRank). • Threshold vs. Top-k – Thresholding: K-mine and K-repeats problems (Muthu). – Top-k: retrieve only the k most relevant documents; intuitive for the user.
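  For the top-k flavor, a tiny Python illustration with made-up per-document frequencies (a real index cannot afford to touch every document like this; the point is only the k-selection step):

      import heapq

      # Hypothetical: number of occurrences of a pattern P in each document.
      freq = {"d1": 2, "d2": 1, "d3": 7, "d4": 4}
      k = 2
      print(heapq.nlargest(k, freq.items(), key=lambda kv: kv[1]))   # -> [('d3', 7), ('d4', 4)]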

  15. Approaches • Inverted Indexes – Popular in the IR community. – Need to know the patterns (words) in advance. – In strings, the word boundaries are not well defined. – Inverted indexes for all possible substrings can take a lot more space. – Else they may not be able to answer arbitrary pattern queries (provably efficiently). • Muthukrishnan's structures (based on suffix trees) – Take O(n log n) words of space for the K-mine and K-repeats problems (thresholding) while answering queries in O(p + ndoc) time. – Top-k queries require additional overhead.

  16. Suffix tree-based solutions • Document Retrieval Problem – Store all suffixes of the D documents. – Each leaf in the suffix tree stores: the document id, and a D-value = the leaf-rank of the previous leaf of the same document. – Traverse the suffix tree to get the range [L,R] such that all the occurrences of the pattern correspond to the leaves with leaf-ranks L to R. – To obtain each document uniquely, output only those leaves with D-value < L (i.e., the leaf with the smallest leaf-rank for its document within the range). – This is a 3-sided query in 2 dimensions, a (2,1)-range query, and can be done using repeated applications of RMQs in O(p + ndoc) time (see figure and the sketch below). • K-mine and K-repeats – Fixed K, with a separate structure for each K value: O(n log n) words of space.
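  Here is a compact Python sketch of this pipeline, using a suffix array plus a range-minimum structure in place of the suffix tree; the separator character, the naive index build, and the sparse-table RMQ are illustrative choices, not the authors' implementation:

      class SparseTableMin:
          """Range-minimum: O(n log n) preprocessing, O(1) argmin queries.
          (The actual solutions use linear-space RMQ structures; this is a stand-in.)"""
          def __init__(self, a):
              self.a = a
              n = len(a)
              self.log = [0] * (n + 1)
              for i in range(2, n + 1):
                  self.log[i] = self.log[i // 2] + 1
              self.t = [list(range(n))]
              j = 1
              while (1 << j) <= n:
                  prev, cur = self.t[-1], []
                  for i in range(n - (1 << j) + 1):
                      l, r = prev[i], prev[i + (1 << (j - 1))]
                      cur.append(l if a[l] <= a[r] else r)
                  self.t.append(cur)
                  j += 1

          def argmin(self, lo, hi):                      # inclusive range [lo, hi]
              k = self.log[hi - lo + 1]
              l, r = self.t[k][lo], self.t[k][hi - (1 << k) + 1]
              return l if self.a[l] <= self.a[r] else r

      def build_index(docs, sep="\x01"):
          """Concatenate documents, build a naive suffix array, per-suffix document
          ids, and C[r] = previous suffix rank of the same document (or -1)."""
          text = sep.join(docs) + sep
          doc_of = [d for d, s in enumerate(docs) for _ in range(len(s) + 1)]
          sa = sorted(range(len(text)), key=lambda i: text[i:])
          doc = [doc_of[p] for p in sa]
          last, C = {}, []
          for r, d in enumerate(doc):
              C.append(last.get(d, -1))
              last[d] = r
          return text, sa, doc, C

      def pattern_range(text, sa, P):
          """Suffix-array range [L, R) of suffixes starting with P (binary search)."""
          lo, hi = 0, len(sa)
          while lo < hi:
              mid = (lo + hi) // 2
              if text[sa[mid]:] < P:
                  lo = mid + 1
              else:
                  hi = mid
          L, hi = lo, len(sa)
          while lo < hi:
              mid = (lo + hi) // 2
              if text[sa[mid]:sa[mid] + len(P)] == P:
                  lo = mid + 1
              else:
                  hi = mid
          return L, lo

      def list_docs(C, rmq, L, R, out):
          """Muthukrishnan-style listing: rank r in [L, R) is a document's first
          occurrence iff C[r] < L; find all such ranks by recursing around the minimum."""
          if L >= R:
              return
          r = rmq.argmin(L, R - 1)
          if C[r] >= L:
              return
          out.append(r)
          list_docs(C, rmq, L, r, out)
          list_docs(C, rmq, r + 1, R, out)

      docs = ["banana", "urban"]                         # the slides' d1 and d2
      text, sa, doc, C = build_index(docs)
      rmq = SparseTableMin(C)
      L, R = pattern_range(text, sa, "an")
      hits = []
      list_docs(C, rmq, L, R, hits)
      print(sorted(doc[r] for r in hits))                # -> [0, 1]: "an" occurs in d1 and d2

  The recursion around the minimum C-value is what keeps the query output-sensitive: each RMQ call either reports a new document or terminates a branch, so the number of calls is proportional to ndoc.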

  17. Suffix tree-based solutions (example) • d1: banana, d2: urban ($ < a < b) • (Figure: generalized suffix tree over the suffixes of banana$ and urban$, with each leaf labeled by the document, d1 or d2, its suffix comes from.) • Search pattern: "an" • We look at the subtree below the node reached by "an": d1 appears twice and d2 appears once in this subtree

  18. Preliminary: RMQs for top-k on an array • Range Maximum Query: Given an array A and a query (i,j), report the maximum of A[i..j] – Linear-space, linear-preprocessing-time data structure with O(1) query time • Range threshold: Given an array A and a query (i,j,τ), report all the numbers in A[i..j] which are >= τ – Can be done using repeated RMQs in O(output) time • Range top-k: Given an array A and a query (i,j,k), report the top-k highest numbers in A[i..j] – Repeated RMQs + heap = O(k log k) time • Generalization: the query specifies a set of t ranges [i1,j1], [i2,j2], …, [it,jt] – Threshold: O(t + output); top-k: O(t + k log k)
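  A sketch of the range top-k procedure in Python (the sparse table below uses O(n log n) preprocessing, standing in for the linear-space O(1) RMQ structure mentioned above; the array is made up):

      import heapq

      class SparseTableMax:
          """Range-maximum: O(n log n) preprocessing, O(1) argmax queries."""
          def __init__(self, a):
              self.a = a
              n = len(a)
              self.log = [0] * (n + 1)
              for i in range(2, n + 1):
                  self.log[i] = self.log[i // 2] + 1
              self.t = [list(range(n))]
              j = 1
              while (1 << j) <= n:
                  prev, cur = self.t[-1], []
                  for i in range(n - (1 << j) + 1):
                      l, r = prev[i], prev[i + (1 << (j - 1))]
                      cur.append(l if a[l] >= a[r] else r)
                  self.t.append(cur)
                  j += 1

          def argmax(self, i, j):                    # inclusive range [i, j]
              k = self.log[j - i + 1]
              l, r = self.t[k][i], self.t[k][j - (1 << k) + 1]
              return l if self.a[l] >= self.a[r] else r

      def range_top_k(a, rmq, i, j, k):
          """Top-k largest values in A[i..j] via repeated RMQs and a heap: O(k log k)."""
          out, heap = [], []
          p = rmq.argmax(i, j)
          heapq.heappush(heap, (-a[p], p, i, j))
          while heap and len(out) < k:
              _, p, lo, hi = heapq.heappop(heap)     # largest remaining candidate
              out.append(a[p])
              if lo <= p - 1:                        # split the range around p
                  q = rmq.argmax(lo, p - 1)
                  heapq.heappush(heap, (-a[q], q, lo, p - 1))
              if p + 1 <= hi:
                  q = rmq.argmax(p + 1, hi)
                  heapq.heappush(heap, (-a[q], q, p + 1, hi))
          return out

      A = [5, 1, 9, 3, 7, 7, 2, 8]
      rmq = SparseTableMax(A)
      print(range_top_k(A, rmq, 1, 6, 3))            # top-3 of A[1..6] -> [9, 7, 7]

  Each popped maximum spawns at most two new candidate ranges, so after k pops the heap has held O(k) entries, which is where the O(k log k) bound comes from.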
