Indices, Tomasz Bartoszewski (PowerPoint presentation)

  1. Indices Tomasz Bartoszewski

  2. Inverted Index • Search • Construction • Compression

  3. Inverted Index • In its simplest form, the inverted index of a document collection is a data structure that associates each distinctive term with a list of all documents that contain the term.
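As a hedged illustration (toy documents of my own, not from the slides), such a structure can be built as a dictionary mapping each distinct term to a sorted list of the IDs of documents containing it:

```python
# Minimal inverted index sketch: term -> sorted list of doc IDs.
docs = {
    1: "web mining is useful",
    2: "usage mining applications",
    3: "web structure mining studies the web hyperlink structure",
}

index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):      # each distinct term once per doc
        index.setdefault(term, []).append(doc_id)
for postings in index.values():
    postings.sort()                     # keep postings in doc-ID order

print(index["mining"])   # [1, 2, 3]
print(index["web"])      # [1, 3]
```

A real index would also store per-document term frequencies or positions for ranking; this sketch keeps only the document lists.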

  4. Search Using an Inverted Index

  5. Step 1 – vocabulary search • finds each query term in the vocabulary • if the query has a single term, go directly to step 3; otherwise go to step 2

  6. Step 2 – results merging • merging of the lists is performed to find their intersection • use the shortest list as the base • partial match is possible

  7. Step 3 – rank score computation • based on a relevance function (e.g. Okapi BM25, cosine similarity) • the score is used in the final ranking
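The three steps can be sketched as follows (toy code of my own, not from the slides; step 3 is stubbed out, since a real system would score the surviving documents with a relevance function such as Okapi BM25 or cosine similarity):

```python
# Search over an inverted index whose postings are sorted doc-ID lists.
def search(index, query_terms):
    # Step 1: vocabulary search - look up each query term
    postings = [index[t] for t in query_terms if t in index]
    if not postings:
        return []
    # Step 2: results merging - intersect, using the shortest list as base
    postings.sort(key=len)
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)
    # Step 3 would compute rank scores; here docs are returned in ID order
    return sorted(result)

index = {"web": [1, 3], "mining": [1, 2, 3]}
print(search(index, ["web", "mining"]))   # [1, 3]
```

A single-term query skips the merge loop entirely, mirroring the slide's "goto step 3" shortcut.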

  8. Example

  9. Index Construction

  10. Time complexity • O(T), where T is the number of all terms (including duplicates) in the document collection (after pre-processing)
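The O(T) bound can be seen in a one-pass construction sketch (my own toy code; it assumes documents arrive as pre-processed term lists keyed by doc ID): each of the T term occurrences is touched a constant number of times.

```python
from collections import defaultdict

# One-pass index construction: O(T) over all term occurrences.
def build_index(docs):                     # docs: {doc_id: [terms...]}
    index = defaultdict(list)
    for doc_id in sorted(docs):            # process docs in ID order
        for term in docs[doc_id]:
            postings = index[term]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)    # postings stay sorted for free
    return dict(index)

idx = build_index({1: ["a", "b", "a"], 2: ["b", "c"]})
print(idx)   # {'a': [1], 'b': [1, 2], 'c': [2]}
```

Because documents are processed in increasing ID order, each postings list is built already sorted, with no extra sorting cost.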

  11. Index Compression

  12. Why? • avoid disk I/O • the size of an inverted index can be reduced dramatically • the original index can also be reconstructed • all the information is represented with positive integers -> integer compression

  13. Use gaps • store differences between consecutive doc IDs: 4, 10, 300, and 305 -> 4, 6, 290, and 5 • smaller numbers compress better • gaps stay large for rare terms, but their postings lists are short – not a big problem
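A hedged sketch of the gap transformation (function names are my own): each doc ID is replaced by its difference from the previous one, and the original list is recovered by a running sum.

```python
# Convert a sorted postings list to gaps and back.
def to_gaps(postings):
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([4, 10, 300, 305]))    # [4, 6, 290, 5]
print(from_gaps([4, 6, 290, 5]))     # [4, 10, 300, 305]
```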

  14. All in one

  15. Unary • For x: x-1 bits of 0 followed by one bit of 1, e.g. 5 -> 00001, 7 -> 0000001
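The unary code is a one-liner (illustrative name of my own):

```python
# Unary code: x-1 zero bits terminated by a single 1 bit.
def unary(x):
    return "0" * (x - 1) + "1"

print(unary(5))   # 00001
print(unary(7))   # 0000001
```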

  16. Elias Gamma Coding • 1 + ⌊log2 x⌋ in unary (i.e., ⌊log2 x⌋ 0-bits followed by a 1-bit) • followed by the binary representation of x without its most significant bit • efficient for small integers but not suited to large integers • 1 + ⌊log2 x⌋ is simply the number of bits of x in binary • 9 -> 000 1001
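A sketch of the gamma code (illustrative name of my own): ⌊log2 x⌋ zero bits, then the full binary form of x, whose leading 1 doubles as the terminator of the unary part.

```python
# Elias gamma code for a positive integer x.
def elias_gamma(x):
    n = x.bit_length() - 1           # = floor(log2 x)
    return "0" * n + bin(x)[2:]      # unary zeros + binary form of x

print(elias_gamma(9))   # 0001001, written on the slide as 000 1001
```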

  17. Elias Delta Coding • for small integers, delta codes are longer than gamma codes (delta is better for larger integers) • gamma code representation of 1 + ⌊log2 x⌋ • followed by the binary representation of x without its most significant bit • For 9: 1 + ⌊log2 9⌋ = 4 -> 00100, so 9 -> 00100 001
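The delta code can be sketched on top of the gamma code (names are my own): gamma-code the bit length 1 + ⌊log2 x⌋, then append x's binary form without its most significant bit.

```python
# Elias delta code built from the gamma code.
def elias_gamma(x):
    n = x.bit_length() - 1           # = floor(log2 x)
    return "0" * n + bin(x)[2:]

def elias_delta(x):
    # x.bit_length() == 1 + floor(log2 x); bin(x)[3:] drops '0b' and the MSB
    return elias_gamma(x.bit_length()) + bin(x)[3:]

print(elias_delta(9))   # 00100001, i.e. 00100 followed by 001
```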

  18. Golomb Coding • codes values relative to a constant b • several variations of the original Golomb code exist • e.g. quotient q = ⌊x/b⌋, remainder r = x - qb (b possible remainders, e.g. b=3: 0, 1, 2) • the binary representation of a remainder requires ⌊log2 b⌋ or ⌈log2 b⌉ bits: the first few remainders are written with ⌊log2 b⌋ bits, the rest with ⌈log2 b⌉ bits

  19. Example • b=3 and x=9 • q = ⌊9/3⌋ = 3 • j = ⌊log2 3⌋ = 1 => e = 1 (e = 2^(j+1) - b) • r = 9 - 3*3 = 0 • since r < e, r is written in j = 1 bit: 0; the quotient 3 in unary is 0001 • Result: 00010
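A sketch of this variant (my own code, matching the slide's worked example; other Golomb variants code x-1 or choose the quotient differently):

```python
# Golomb code: quotient q = floor(x/b) in unary, remainder r = x - q*b
# in truncated binary: j = floor(log2 b) bits for the first
# e = 2^(j+1) - b remainders, j+1 bits (offset by e) for the rest.
def golomb(x, b):
    q, r = divmod(x, b)
    out = "0" * q + "1"              # quotient in unary
    if b == 1:
        return out                   # only one possible remainder
    j = b.bit_length() - 1           # = floor(log2 b)
    e = 2 ** (j + 1) - b
    if r < e:
        return out + format(r, "b").zfill(j)
    return out + format(r + e, "b").zfill(j + 1)

print(golomb(9, 3))   # 00010, matching the slide's worked example
```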

  20. The coding tree for b=5

  21. Selection of b • b ≈ 0.69 * (N / n_t) • N – total number of documents • n_t – number of documents that contain term t

  22. Variable-Byte Coding • seven bits in each byte are used to code an integer • last bit 0 – end, 1 – continue • E.g. 135 -> 00000011 00001110
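A sketch using the slide's convention (my own code; byte order and the placement of the continuation flag vary between implementations): seven payload bits per byte, with the last bit 1 meaning "continue" and 0 meaning "end".

```python
# Variable-byte code: 7-bit groups, most significant group first,
# low bit of each byte = continuation flag (1 = more bytes follow).
def vbyte(x):
    groups = []
    while True:
        groups.append(x & 0x7F)      # take the low 7 bits
        x >>= 7
        if x == 0:
            break
    groups.reverse()                 # most significant group first
    return " ".join(format(g, "07b") + ("1" if i < len(groups) - 1 else "0")
                    for i, g in enumerate(groups))

print(vbyte(135))   # 00000011 00001110, the slide's example
```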

  23. Summary • Golomb coding compresses better than Elias coding • gamma coding does not work well • variable-byte codes are often faster to decode than variable-bit codes (at higher storage cost) • a compression technique can make retrieval up to twice as fast as without compression • the space requirement averages 20% – 25% of the cost of storing uncompressed integers

  24. Latent Semantic Indexing

  25. Reason • many concepts or objects can be described in multiple ways • documents should be found even when they use synonyms of the words in the user query • LSI deals with this problem through the identification of statistical associations of terms

  26. Singular value decomposition (SVD) • estimate latent structure, and to remove the “noise” • hidden “concept” space, which associates syntactically different but semantically similar terms and documents

  27. LSI • LSI starts with an m × n term-document matrix A • row = term; column = document • values are e.g. term frequencies

  28. Singular Value Decomposition • factor matrix A into three matrices: A = U E V^T, where m is the number of rows in A, n is the number of columns in A, and r is the rank of A, r ≤ min(m, n)

  29. Singular Value Decomposition • U is an m × r matrix and its columns, called left singular vectors, are eigenvectors associated with the r non-zero eigenvalues of AA^T • V is an n × r matrix and its columns, called right singular vectors, are eigenvectors associated with the r non-zero eigenvalues of A^TA • E is an r × r diagonal matrix, E = diag(σ1, σ2, …, σr), σi > 0. σ1, σ2, …, σr, called singular values, are the non-negative square roots of the r non-zero eigenvalues of AA^T; they are arranged in decreasing order, i.e., σ1 ≥ σ2 ≥ … ≥ σr > 0 • truncating U, E and V reduces the size of the matrices

  30. A_k = U_k E_k V_k^T

  31. Query and Retrieval • q – user query (treated as a new document) • its representation in the k-concept space is denoted q_k • q_k = q^T U_k E_k^-1
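A sketch with NumPy under the slide's notation (toy term-document matrix of my own; `np.linalg.svd` returns the factors of A = U E V^T): the query is projected with q_k = q^T U_k E_k^-1 and documents are ranked by cosine similarity in the k-concept space.

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],     # rows = terms, columns = documents
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Ek, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # Vk rows = documents

q = np.array([1.0, 1.0, 0.0, 0.0])       # query uses terms 0 and 1
qk = q @ Uk @ np.linalg.inv(Ek)          # q_k = q^T U_k E_k^-1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(qk, doc) for doc in Vk]
print(np.argmax(scores))   # document 0 (which contains both query terms)
```

Here the query coincides with document 0's term vector, so its projection lands exactly on that document's row of V_k and scores a cosine of 1.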

  32. Example

  33. Example

  34. Example

  35. Example

  36. Example

  37. Example

  38. Example q - “user interface”

  39. Example

  40. Summary • The original paper of LSI suggests 50 – 350 dimensions. • k needs to be determined based on the specific document collection • association rules may be able to approximate the results of LSI
