1 Indexing Approaches Indexing Approaches R. Baeza-Yates and B. R. Baeza-Yates and B. Ribeiro-Neto Ribeiro-Neto: : Modern Informa Modern Information Retrie ion Retrieval, l, Chapter 8. Chapt Chapter Chapter 8 1999 . 1999 1999 Addision 1999. . Addision Addision Wesley Addision Wesley Wesley Wesley. Jon Atle Gulla TDT4215 – Indexing & Searching 2 Indexing and Searching Indexing and Searching •TDT4215 TDT4215 – Indexing & Searching
3 Outline • Indexing approaches: – Inverted files – Suffix Trees & Suffix Arrays – Signature Files • Search part of chapter optional TDT4215 – Indexing & Searching 4 Inverted Files - Definition • The inverted file stucture is composed of two elements: • The inverted file stucture is composed of two elements: – Vocabulary the set of all different words in the text. – Occurrences . O For each word a list of all the text positions where the word appears is stored. The set of all those list is called the occurrences. •TDT4215 TDT4215 – Indexing & Searching
5 Inverted Files - Block Inverted Files - Block Addressing (1) g ( ) • The space required for the vocabulary is rather small. • Occurrences demand much more space since each word • Occurrences demand much more space, since each word appearing in the text is referenced once in that structure. • To reduce space requirements a technique called block To reduce space requirements, a technique called block addressing is used: – the text is divided in blocks; – the occurrences point to the blocks where the word appears. • By using block addressing – the pointers are smaller, because there are fewer blocks than positions, th i t ll b th f bl k th iti – all the occurrences of a word inside a single block are collapsed to one reference. TDT4215 – Indexing & Searching 6 Inverted Files - Block Inverted Files - Block Addressing (2) g ( ) Note that we do not know now that there are 2 occurrences of “words” in block 3 words in block 3 • Size: – Fixed – imposing a logical block structure over the text database. Fixed imposing a logical block structure over the text database – Natural division – text collection into files, documents, Web pages and others. TDT4215 – Indexing & Searching
7 Index Size (1) • Vocabulary: O(n ) typically between 0.4 and 0.6 • Occurrences: O(n) O O( ) • Example: E l – TREC-2 collection 1 Gb – Vocabulary: 5 Mb Vocabulary: 5 Mb TDT4215 – Indexing & Searching 8 Index Size(2) I d Si (2) Index Small collection (1 Medium collection Large collection (2 Mb) Mb) (200 Mb) (200 Mb) Gb) Gb) stopwords stopwords stopwords Adressing 45% 73% 36% 64% 35% 63% words Adressing 19% 26% 18% 32% 26% 47% documents (10 Kb) Adressing 18% 25% 1.7% 2.4% 0.5% 0.7% 256 blocks E.g. For document addressing indexes, the index size is 26% of the total document collection, provided that stopwords are deleted. If stopwords are not deleted from collection, the index is 47% . IMPORTANT NOTE: The left and right column for each collection type should be switched. f g f yp This error also appears in the book (Table 8.1, page 195). TDT4215 – Indexing & Searching
9 Inverted Files. Searching • What to search on inverted files: – Single-word queries – the process ends by delivering the list of Single word queries the process ends by delivering the list of occurrences. – Context queries are more difficult to solve with inverted indices • each element must be searched separately and a list generated for h l t t b h d t l d li t t d f each one. • then the list of all elements are traversed in synchronization to find places where all the words appear in sequence or appear close places where all the words appear in sequence or appear close enough. TDT4215 – Indexing & Searching 10 Inverted Files Search Inverted Files. Search Algorithm g The search algorithm on an inverted index follows three general steps: s eps 1. Vocabulary search. – The words and patterns present in the query are isolated and searched in the vocabulary. y – Notice that phrases and proximity queries are split into single words. 2. Retrieval of occurrences. – The list of the occurences of the words found are retrieved. The list of the occurences of the words found are retrieved. 3. Manipulation of occurrences. – The occurrences are processed to solve phrases, proximity or Boolean operations. – If block addressing is used it may be necessary to directly search the text to find the If block addressing is used it may be necessary to directly search the text to find the information missing from the occurrences. TDT4215 – Indexing & Searching
11 Inverted Files. Construction • All the vocabulary known up to a moment is kept in a trie data structure storing for each All the vocabulary known up to a moment is kept in a trie data structure, storing for each word a list of its occurrences. • Each word of the text is read and searched in the trie. • If it is not found it is added to the trie with empty list of occurrences. p y • Once it is in the trie, the new position is added to the end of its list of occurrences. •TDT4215 TDT4215 – Indexing & Searching 12 Inverted Files. Construction • It is a good practice to split the index into two files. – In the first the list of occurrences are stored contiguously (posting In the first the list of occurrences are stored contiguously (posting file). – In the second file, the vocabulary is stored in lexicographical order and for each word a pointer to its list in the first file is also and, for each word, a pointer to its list in the first file is also included. – This allows the vocabulary to be kept in memory at search time in many cases. TDT4215 – Indexing & Searching
13 Inverted Files Construction of Inverted Files. Construction of large text g • The algorithm is not practical for large text where the index does not fit in main memory: index does not fit in main memory: – The algorithm is used until the main memory is exhausted. – The partial index I i obtained up to now written to disk and erased from main memory before continuing with the rest of the text memory before continuing with the rest of the text. – Finally, a number of partial indices I i exist on disk. These indices are then merged in a hierarchical fashion. • I 1 and I 2 , I 3 and I 4 and so on • I and I I and I and so on • I 1..2 and I 3..4 , I 5..6 and I 7..8 and so on • This continued until there is just one index comprising the whole text. – Merging two indices consists of: Merging two indices consists of: • merging the sorted vocabularies and whenever the same word appears in both indices; • merging both list of occurrences • merging both list of occurrences. TDT4215 – Indexing & Searching 14 Inverted Files Construction of Inverted Files. Construction of large text g TDT4215 – Indexing & Searching
15 Suffix Trees and Suffix Arrays Suffix Trees and Suffix Arrays. Definition • Suffix Trees and Suffix Arrays indexes see the text as one long string. Each position in the text is considered as a text suffix . Each suffix is thus uniquely identified by its position identified by its position. • Index points are selected from the text, which point to the beginning of the text positions which will be retrievable. • • This structure can be used to index words or characters This structure can be used to index words or characters. •TDT4215 TDT4215 – Indexing & Searching 16 Suffix Trees and Suffix Arrays. Suffix Trees and Suffix Arrays Structure • A suffix tree is a trie data structure built over all the structure built over all the suffixes of the text. • The pointers to the suffixes are stored at the leaf nodes. • The problem with this structure is its space. st uctu e s ts space • Compression of the trie structure by compressing unary paths unary paths. TDT4215 – Indexing & Searching
17 Suffix Trees and Suffix Arrays. Suffix Trees and Suffix Arrays Structure • Suffix arrays provide essentially the same functionality as suffix trees with much less space requirements. • • A suffix array is simply an array containing all the pointers to the text suffixes listed A suffix array is simply an array containing all the pointers to the text suffixes listed in lexicographical order. • Suffix arrays are designed to allow binary searches done by comparing the contents of each pointer. co te ts o eac po te TDT4215 – Indexing & Searching 18 Suffix Trees and Suffix Arrays Suffix Trees and Suffix Arrays. Structure • Supra-indices over the suffix arrays. • The simplest supra-index is no more than a sampling of one out of b suffix array entries, where for each sample the first l suffix characters are stored array entries, where for each sample the first l suffix characters are stored in the supra-index. • This supra-index is then used as a first step of the search to reduce external accesses external accesses. •TDT4215 TDT4215 – Indexing & Searching
Recommend
More recommend