
Information Retrieval Tutorial 3: Index Compression
Professor: Michel Schellekens
TA: Ang Gao
University College Cork
2012-11-09


How big is the term vocabulary?
- That is, how many distinct words are there?
- In practice, the vocabulary keeps growing with collection size (e.g., names of new people).
- Heaps' law: M = k·T^b, where M is the size of the vocabulary and T is the number of tokens in the collection.
- Typical values for the parameters are 30 ≤ k ≤ 100 and b ≈ 0.5, so M ≈ k·√T.
- Notice that log M = log k + b·log T (of the form y = c + bx), so Heaps' law is linear in log-log space.
- It is the simplest possible relationship between collection size and vocabulary size in log-log space.
- It is an empirical finding (an empirical law).
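
As a quick illustration, here is a minimal sketch (not from the tutorial) of Heaps' law as code; the example call uses k = 44 and b = 0.5 purely as illustrative values within the typical ranges above.

```python
import math

def heaps_vocabulary_size(num_tokens: float, k: float, b: float) -> float:
    """Heaps' law: predicted vocabulary size M = k * T^b for T tokens."""
    return k * num_tokens ** b

def heaps_loglog(num_tokens: float, k: float, b: float) -> float:
    """Same prediction via the linear log-log form: log10 M = log10 k + b * log10 T."""
    return 10 ** (math.log10(k) + b * math.log10(num_tokens))

# With illustrative parameters, both forms agree:
print(heaps_vocabulary_size(1_000_000, k=44, b=0.5))   # ~44,000 terms
print(heaps_loglog(1_000_000, k=44, b=0.5))            # same value
```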

Heaps' law for Reuters
[Figure: log10 M plotted against log10 T] Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49·log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 · T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49.
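
The dashed line is a least-squares fit in log-log space. Below is a minimal sketch of how such a fit could be computed; the (tokens, vocabulary) pairs are made-up placeholders, not the actual Reuters-RCV1 measurements.

```python
import numpy as np

# Hypothetical (T, M) measurements: tokens read vs. distinct terms seen so far.
# Replace with real collection statistics; these values are illustrative only.
tokens = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
vocab  = np.array([3.2e3, 1.1e4, 3.8e4, 1.2e5, 3.9e5])

# Fit log10 M = b * log10 T + c  (ordinary least squares on the logs).
b, c = np.polyfit(np.log10(tokens), np.log10(vocab), deg=1)
k = 10 ** c
print(f"b = {b:.2f}, k = {k:.1f}")   # for Reuters-RCV1 the slide reports b = 0.49, k ≈ 44
```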

Empirical fit for Reuters
- Good, as we just saw in the graph.
- Example: for the first 1,000,020 tokens, Heaps' law predicts 38,323 terms: 44 × 1,000,020^0.49 ≈ 38,323.
- The actual number is 38,365 terms, very close to the prediction.
- Empirical observation: the fit is good in general.
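
Checking that prediction numerically (a one-line sketch using the fitted Reuters parameters from the previous slide):

```python
k, b = 44, 0.49
predicted = k * 1_000_020 ** b
print(round(predicted))   # ~38,323 predicted terms, vs. 38,365 actually observed
```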

Exercise: compute vocabulary size M
- Looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.
- Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average.
- What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?
- log M1 = log k + b·log T1 with M1 = 3,000 and T1 = 10,000, so log 3,000 = log k + b·log 10,000.
- log M2 = log k + b·log T2 with M2 = 30,000 and T2 = 1,000,000, so log 30,000 = log k + b·log 1,000,000.
- Subtracting the first equation from the second gives 1 = 2b, so b = 0.5; then log k = log 3,000 − 2 ≈ 1.477 and k ≈ 30.
- The collection has T = 2 × 10^10 pages × 200 tokens = 4 × 10^12 tokens, so log M = log k + ½·log(4 × 10^12) ≈ 7.778 and M = 10^7.778 ≈ 6 × 10^7.
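
The same calculation as a small sketch: solve for k and b from the two observations, then apply Heaps' law to the full collection.

```python
import math

def fit_heaps(t1, m1, t2, m2):
    """Solve log M = log k + b log T from two (tokens, vocabulary) observations."""
    b = (math.log10(m2) - math.log10(m1)) / (math.log10(t2) - math.log10(t1))
    log_k = math.log10(m1) - b * math.log10(t1)
    return 10 ** log_k, b

k, b = fit_heaps(10_000, 3_000, 1_000_000, 30_000)
total_tokens = 20_000_000_000 * 200        # 2e10 pages x 200 tokens/page = 4e12 tokens
vocabulary = k * total_tokens ** b
print(round(k, 2), b)                      # ~30.0, 0.5
print(f"{vocabulary:.1e}")                 # ~6e7 terms
```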

Basic knowledge to remember
To represent an integer n in binary, the number of bits needed is ⌊log2(n)⌋ + 1. For example: 2 = 10 in binary (2 bits), 3 = 11 (2 bits), 4 = 100 (3 bits).
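
A direct translation of that formula (a sketch; for positive integers Python's built-in `int.bit_length()` gives the same answer):

```python
import math

def bits_needed(n: int) -> int:
    """Number of bits to write n in binary: floor(log2 n) + 1."""
    return math.floor(math.log2(n)) + 1

for n in (2, 3, 4):
    print(n, bin(n), bits_needed(n), n.bit_length())
# 2 -> 0b10 (2 bits), 3 -> 0b11 (2 bits), 4 -> 0b100 (3 bits)
```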

Outline
1. Introduction
2. Dictionary compression
3. Postings compression

Dictionary compression
- The dictionary is small compared to the postings file.
- But we want to keep it in memory.
- Also: competition with other applications, cell phones, onboard computers, fast startup time.
- So compressing the dictionary is important.

Recall: dictionary as array of fixed-width entries

  term      document frequency   pointer to postings list
  a         656,265              →
  aachen    65                   →
  ...       ...                  ...
  zulu      221                  →

Space needed per entry: 20 bytes for the term, 4 bytes for the frequency, 4 bytes for the pointer.
Space for Reuters: (20 + 4 + 4) × 400,000 = 11.2 MB.

Fixed-width entries are bad
- Most of the bytes in the term column are wasted.
- We allot 20 bytes for terms of length 1.
- We can't handle hydrochlorofluorocarbons and supercalifragilisticexpialidocious.
- Average length of a term in English: 8 characters.
- How can we use, on average, 8 characters per term?

Dictionary as a string
Store the terms as one long string of characters:

  ...systilesyzygeticsyzygialsyzygyszaibelyiteszecinszono...

  freq.     postings ptr.   term ptr.
  9         →               →
  92        →               →
  5         →               →
  71        →               →
  12        →               →
  4 bytes   4 bytes         3 bytes

Each term pointer points to the start of its term in the string; a term ends where the next term begins.
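
A minimal sketch of that layout (an assumed representation, not code from the tutorial): concatenate the sorted terms into one string and keep only a start offset per term.

```python
def build_string_dictionary(sorted_terms):
    """Concatenate sorted terms into one string; keep a start offset per term."""
    big_string = "".join(sorted_terms)
    offsets, pos = [], 0
    for term in sorted_terms:
        offsets.append(pos)
        pos += len(term)
    return big_string, offsets

def term_at(big_string, offsets, i):
    """Recover term i: it spans from its offset to the next term's offset."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
    return big_string[offsets[i]:end]

s, offs = build_string_dictionary(["systile", "syzygetic", "syzygial", "syzygy"])
print(term_at(s, offs, 2))   # 'syzygial'
```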

Space for dictionary as a string
- 4 bytes per term for frequency
- 4 bytes per term for pointer to postings list
- 8 bytes (on average) for the term in the string
- 3 bytes per pointer into the string (we need log2(8 · 400,000) < 24 bits to resolve 8 · 400,000 positions)
- Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for the fixed-width array)
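
The two space figures recomputed (a sketch; MB here means 10^6 bytes, and 400,000 terms with 8 characters per term on average are the Reuters numbers from the slides):

```python
TERMS = 400_000

# Fixed-width array: 20-byte term field + 4-byte frequency + 4-byte postings pointer.
fixed_width_mb = TERMS * (20 + 4 + 4) / 1e6

# Dictionary as a string: 4 (freq) + 4 (postings ptr) + 3 (term ptr) + 8 (avg term chars).
as_string_mb = TERMS * (4 + 4 + 3 + 8) / 1e6

print(fixed_width_mb, as_string_mb)   # 11.2 MB vs. 7.6 MB
```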

Dictionary as a string with blocking

  ...7systile9syzygetic8syzygial6syzygy11szaibelyite6szecin...

  freq.   postings ptr.   term ptr.
  9       →               →
  92      →
  5       →
  71      →
  12      →               →

Each term in the string is prefixed by its length, and only one term pointer is stored per block of k terms; the length bytes let us skip from term to term within a block.

Space for dictionary as a string with blocking
- Example block size k = 4.
- Where we used 4 × 3 bytes for term pointers without blocking ...
- ... we now use 3 bytes for one pointer plus 4 bytes for indicating the length of each term.
- We save 12 − (3 + 4) = 5 bytes per block.
- Total savings: 400,000 / 4 × 5 bytes = 0.5 MB.
- This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
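
The same accounting written as a function of the block size k (a sketch under the slide's assumptions: one length byte per term and one 3-byte term pointer per block):

```python
TERMS = 400_000

def string_dict_mb(block_size=None):
    """Dictionary-as-string size in MB (10^6 bytes), optionally with blocking."""
    per_term = 4 + 4 + 8                       # freq + postings ptr + avg term chars
    if block_size is None:
        per_term += 3                          # one 3-byte term pointer per term
    else:
        per_term += 1 + 3 / block_size         # 1 length byte per term, 3-byte ptr per block
    return TERMS * per_term / 1e6

print(string_dict_mb())               # 7.6  (no blocking)
print(string_dict_mb(block_size=4))   # 7.1  (k = 4: saves 5 bytes per 4-term block)
```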

Lookup of a term without blocking
[Figure: binary search tree over the terms aid, box, den, ex, job, ox, pit, win]
Average search cost: (1·1 + 2·2 + 4·3 + 1·4) / 8 ≈ 2.6 steps (one term at depth 1, two at depth 2, four at depth 3, one at depth 4).

Lookup of a term with blocking: (slightly) slower
[Figure: binary search down to a block of k = 4 terms, then a linear scan within the block, over aid, box, den, ex, job, ox, pit, win]
Average search cost: (2 + 3 + 4 + 5 + 1 + 2 + 3 + 4) / 8 = 3 steps.

Question: can we increase k arbitrarily, or is there a problem with that?
Answer: we can't increase k arbitrarily, because term lookup time goes up. Binary search only works on the per-block term pointers; with a single pointer for a huge block we would have to scan from the beginning to the end of the block to find a term.
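
A sketch of why that happens (an assumed structure, consistent with the slides): binary search narrows the query to one block via the block-leading terms, and everything inside the block is found by linear scan, which costs O(k) per lookup.

```python
from bisect import bisect_right

def lookup(blocks, query):
    """blocks: list of blocks, each a sorted list of terms (block size = k).
    Binary search on each block's first term, then scan linearly inside the block."""
    first_terms = [block[0] for block in blocks]    # what the per-block term pointers give us
    i = bisect_right(first_terms, query) - 1        # candidate block
    if i < 0:
        return False
    return query in blocks[i]                       # linear scan: O(k) in the worst case

blocks = [["aid", "box", "den", "ex"], ["job", "ox", "pit", "win"]]
print(lookup(blocks, "ox"), lookup(blocks, "cat"))  # True False
```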

Front coding
One block in blocked compression (k = 4):

  ...8automata8automate9automatic10automation...

⇓ further compressed with front coding:

  ...8automat*a1◊e2◊ic3◊ion...

The first term is stored in full, with '*' marking the end of the common prefix 'automat'; each later term stores only the number of characters after the prefix, a '◊', and that suffix.
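
A minimal sketch of front-coding one block in the slide's format (length of the first term, the common prefix once with '*', then for each later term the count of extra characters, '◊', and the suffix):

```python
import os

def front_code_block(terms):
    """Front-code one block of lexicographically sorted terms, slide-style."""
    prefix = os.path.commonprefix(terms)
    first = terms[0]
    out = f"{len(first)}{prefix}*{first[len(prefix):]}"
    for term in terms[1:]:
        suffix = term[len(prefix):]
        out += f"{len(suffix)}\u25ca{suffix}"      # '◊' separates the length from the suffix
    return out

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# 8automat*a1◊e2◊ic3◊ion
```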

Dictionary compression for Reuters: summary

  data structure                            size in MB
  dictionary, fixed-width                   11.2
  dictionary, term pointers into string     7.6
  ~, with blocking, k = 4                   7.1
  ~, with blocking & front coding           5.9

Outline
1. Introduction
2. Dictionary compression
3. Postings compression

Postings compression
- The postings file is much larger than the dictionary, by a factor of at least 10.
- Key desideratum: store each posting compactly.
- A posting, for our purposes, is a docID.
- For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
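
Using the ⌊log2 n⌋ + 1 rule from earlier, a Reuters docID needs only about 20 bits, so a plain 4-byte integer wastes roughly 12 bits per posting (a quick sketch):

```python
import math

DOCS = 800_000                                # Reuters-RCV1 document count
min_bits = math.floor(math.log2(DOCS)) + 1    # bits needed for the largest docID
print(min_bits)                               # 20 bits, vs. 32 bits for a 4-byte integer
```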
