
NPFL103: Information Retrieval (3) – Index construction, Distributed and dynamic indexing, Index compression



  1. NPFL103: Information Retrieval (3)
Index construction, Distributed and dynamic indexing, Index compression
Pavel Pecina, pecina@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
Original slides are courtesy of Hinrich Schütze, University of Stuttgart.

  2. Contents
Index construction: BSBI algorithm, SPIMI algorithm
Distributed indexing: MapReduce
Dynamic indexing: Logarithmic merge
Index compression: Term statistics, Dictionary compression, Postings compression

  3. Index construction

  4. Hardware basics
▶ Data access is much faster in memory than on hard disk (approx. 10×).
▶ Disk seeks are “idle” time: no data is transferred from disk while the disk head is being positioned.
▶ To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.
▶ Disk I/O is block-based: reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB.
▶ Servers used in IR systems typically have tens or hundreds of GBs of RAM, and TBs of disk space.
▶ Fault tolerance is expensive: it’s cheaper to use many regular machines than one fault-tolerant machine.

  5. Some HW statistics
symbol  statistic                                         value
s       average seek time                                 5 ms = 5 × 10^-3 s
b       transfer time per byte                            0.02 µs = 2 × 10^-8 s
        processor's clock rate                            10^9 s^-1
p       low-level operation (e.g., compare+swap a word)   0.01 µs = 10^-8 s
        size of main memory                               several GBs
        size of disk space                                1 TB or more
▶ SSD (Solid State Drive): faster but smaller, more expensive, limited write cycles
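To make these numbers concrete, here is a small back-of-the-envelope calculation of my own (not from the slides), using the seek and transfer times in the table; the 100,000,000 postings × 12 bytes figure anticipates the RCV1 numbers used later:

  # Rough cost model from the table above: each chunk read costs one seek
  # plus sequential transfer of its bytes.
  seek_time = 5e-3          # s, average seek time
  transfer_per_byte = 2e-8  # s, transfer time per byte

  def read_time(total_bytes, n_chunks):
      """Estimated time to read total_bytes split into n_chunks (one seek per chunk)."""
      return n_chunks * seek_time + total_bytes * transfer_per_byte

  total = 100_000_000 * 12                 # e.g., 1.2 GB of postings
  print(read_time(total, 1))               # ~24 s: one large sequential read
  print(read_time(total, 1_000_000))       # ~5024 s: a million small reads, dominated by seeks

This is exactly the point of the “one large chunk” bullet on the previous slide: for the same amount of data, seeks quickly dominate the total I/O time.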

  6. RCV1 collection
▶ Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.
▶ As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
▶ English newswire articles published in 1995–1996 (one year).
▶ https://trec.nist.gov/data/reuters/reuters.html

  7. A Reuters RCV1 document (example document shown as a figure)

  8. Reuters RCV1 statistics
N   documents                                 800,000
L   tokens per document                       200
M   terms (= word types)                      400,000
    bytes per token (incl. spaces/punct.)     6
    bytes per token (without spaces/punct.)   4.5
    bytes per term (= word type)              7.5
T   non-positional postings                   100,000,000
Exercise:
1. Average frequency of a term (how many tokens)?
2. 4.5 bytes per token vs. 7.5 bytes per type: why the difference?
3. How many positional postings?
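The following tiny Python computation (my own arithmetic, not part of the slides) derives the quantities the exercise asks about from the table above:

  # RCV1 statistics from the table above.
  N = 800_000        # documents
  L = 200            # tokens per document
  M = 400_000        # terms (word types)
  T = 100_000_000    # non-positional postings

  tokens = N * L               # 160,000,000 tokens in the collection
  print(tokens / M)            # ~400: average collection frequency of a term (question 1)
  print(T / M)                 # ~250: average number of postings (documents) per term
  print(tokens)                # positional postings ≈ one per token (question 3)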

  9. Goal: construct the inverted index
dictionary → postings
Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132, …
Calpurnia → 2, 31, 54, 101

  10. Index construction: Sort postings in memory
(figure: the term–docID pairs generated from Doc 1 “I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.” and Doc 2 “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.” are listed in order of occurrence and then sorted, first by term and then by docID)

  11. Sort-based index construction
▶ As we build the index, we parse documents one at a time.
▶ The final postings for any term are incomplete until the end.
▶ Can we keep all postings in memory and then do the sort in-memory at the end?
▶ No, not for large collections.
▶ At 10–12 bytes per postings entry, we need a lot of space for large collections.
▶ T = 100,000,000 in the case of RCV1: we can do this in memory on a typical current machine.
▶ But in-memory index construction does not scale for large collections.
▶ Thus: we need to store intermediate results on disk.

  12. Same algorithm for disk?
▶ Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
▶ No: sorting T = 100,000,000 records on disk is too slow – too many disk seeks.
▶ We need an external sorting algorithm.

  13. “External” sorting algorithm (using few disk seeks)
▶ We must sort T = 100,000,000 non-positional postings.
▶ Each posting has size 12 bytes (4+4+4: termID, docID, doc. freq).
▶ Define a block to consist of 10,000,000 such postings.
▶ We can easily fit that many postings into memory (see the quick check below).
▶ We will have 10 such blocks for RCV1.
▶ Basic idea of the algorithm:
  ▶ For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk.
  ▶ Then merge the blocks into one long sorted order.
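A quick check (my own, under the 12-byte-per-posting assumption stated above) that a block of this size indeed fits comfortably in memory:

  postings_per_block = 10_000_000
  bytes_per_posting = 12                          # 4 + 4 + 4 as above
  block_mb = postings_per_block * bytes_per_posting / 10**6
  print(block_mb)                                 # 120 MB per block, easily held in RAM
  print(100_000_000 // postings_per_block)        # 10 blocks for RCV1's 100,000,000 postings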

  14. Merging two blocks
(figure: two blocks of sorted postings lists on disk – the “postings lists to be merged”, with terms such as brutus, caesar, julius, killed, noble, with and docIDs d1–d4 – are combined into a single sequence of “merged postings lists”)
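A minimal sketch of this merge step in Python (my own illustration, not from the slides; the block contents are hypothetical stand-ins for the figure, and heapq.merge performs the k-way merge of the already-sorted runs):

  import heapq
  from itertools import groupby

  # Two sorted blocks of (term, docID) postings, loosely mirroring the figure.
  block1 = [("brutus", "d3"), ("caesar", "d4"), ("with", "d4")]
  block2 = [("brutus", "d2"), ("caesar", "d1"), ("julius", "d1"), ("killed", "d2")]

  # Merge the sorted runs, then collect postings of the same term into one list.
  merged = heapq.merge(block1, block2)
  index = {term: [doc for _, doc in group]
           for term, group in groupby(merged, key=lambda posting: posting[0])}
  print(index)   # {'brutus': ['d2', 'd3'], 'caesar': ['d1', 'd4'], 'julius': ['d1'], ...}

Because both runs are already sorted, the merge reads each block sequentially, which is exactly what keeps the number of disk seeks small.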

  15. Blocked Sort-Based Indexing (BSBI)
BSBIndexConstruction()
  n ← 0
  while (all documents have not been processed)
    do n ← n + 1
       block ← ParseNextBlock()
       BSBI-Invert(block)
       WriteBlockToDisk(block, f_n)
  MergeBlocks(f_1, …, f_n; f_merged)
▶ BSBI-Invert:
  1. sort [termID, docID] pairs
  2. collect [termID, docID] pairs with the same termID into a postings list
▶ Key decision: What is the size of one block?
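A compact Python sketch of this flow (my own rendering under the slide's assumptions; parse_next_block is a hypothetical generator yielding one block of (termID, docID) pairs at a time, and the final merge is analogous to the heapq.merge sketch after slide 14):

  import os
  import pickle

  def bsbi_index_construction(parse_next_block, block_dir="blocks"):
      """Blocked sort-based indexing: invert each block in memory and write it
      to its own file; MergeBlocks over the returned files would follow."""
      os.makedirs(block_dir, exist_ok=True)
      block_files = []
      for n, block in enumerate(parse_next_block(), start=1):
          block.sort()                                  # BSBI-Invert: sort (termID, docID) pairs
          path = os.path.join(block_dir, f"f{n}.bin")
          with open(path, "wb") as f:
              pickle.dump(block, f)                     # WriteBlockToDisk(block, f_n)
          block_files.append(path)
      return block_files                                # inputs to MergeBlocks(f_1, ..., f_n)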

  16. Problem with sort-based algorithm
▶ Our assumption was: we can keep the dictionary in memory.
▶ We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
▶ Actually, we could work with [term, docID] postings instead of [termID, docID] postings …
▶ …but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

  17. Single-pass in-memory indexing (SPIMI)
▶ Key idea 1: Generate separate dictionaries for each block – no need to maintain term–termID mapping across blocks.
▶ Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
▶ With these two ideas we can generate a complete inverted index for each block.
▶ These separate indexes can then be merged into one big index.

  18. SPIMI-Invert
SPIMI-Invert(token_stream)
  output_file ← NewFile()
  dictionary ← NewHash()
  while (free memory available)
    do token ← next(token_stream)
       if term(token) ∉ dictionary
          then postings_list ← AddToDictionary(dictionary, term(token))
          else postings_list ← GetPostingsList(dictionary, term(token))
       if full(postings_list)
          then postings_list ← DoublePostingsList(dictionary, term(token))
       AddToPostingsList(postings_list, docID(token))
  sorted_terms ← SortTerms(dictionary)
  WriteBlockToDisk(sorted_terms, dictionary, output_file)
  return output_file
▶ Merging of blocks is analogous to BSBI.
▶ Compression of terms/postings makes SPIMI even more efficient. (A rough Python rendering follows below.)
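The same procedure as a rough Python sketch (my own rendering, not the official algorithm text): a plain dict plays the role of the block-local hash dictionary, Python lists grow automatically so no explicit DoublePostingsList is needed, and memory_left is a hypothetical callback standing in for the free-memory test:

  import pickle

  def spimi_invert(token_stream, output_path, memory_left):
      """One SPIMI pass: accumulate postings per term in a block-local
      dictionary, then write the block to disk with its terms sorted."""
      dictionary = {}                                       # term -> postings list (this block only)
      for term, doc_id in token_stream:
          dictionary.setdefault(term, []).append(doc_id)    # add posting for this token
          if not memory_left():                             # block is full: stop and flush
              break
      with open(output_path, "wb") as f:
          pickle.dump(sorted(dictionary.items()), f)        # sorted terms simplify block merging
      return output_path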

  19. Distributed indexing

  20. Distributed indexing
▶ For web-scale indexing: must use a distributed computer cluster.
▶ Individual machines are fault-prone: they can unpredictably slow down or fail.
▶ How do we exploit such a pool of machines?
