Indices, by Tomasz Bartoszewski

Inverted Index • Search • Construction • Compression

Inverted Index • In its simplest form, the inverted index of a document collection is a data structure that associates each distinct term with a list of all documents that contain the term.

Search Using an Inverted Index

Step 1 – vocabulary search • find each query term in the vocabulary • if the query contains a single term, go to Step 3; otherwise go to Step 2

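Step 2 – results merging • the document lists of the query terms are merged to find their intersection • use the shortest list as the base, so intermediate results stay small • partial match (documents containing only some of the query terms) is possible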
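
The merging step can be sketched as follows. This is a minimal illustration, not the presentation's own code: the function name `intersect_postings` and the toy index are hypothetical, and each list is assumed to hold document IDs in increasing order.

```python
def intersect_postings(postings_lists):
    """Intersect lists of document IDs, using the shortest list as the base."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)   # shortest list first
    result = ordered[0]
    for other in ordered[1:]:
        other_set = set(other)
        # Keep only documents that also appear in the next list.
        result = [doc_id for doc_id in result if doc_id in other_set]
        if not result:                          # nothing left to match
            break
    return result


# Hypothetical toy index: term -> sorted list of document IDs.
index = {"web": [1, 2, 4, 7], "mining": [2, 4, 5], "data": [2, 3, 4, 5, 7]}
print(intersect_postings([index[t] for t in ("web", "data", "mining")]))  # [2, 4]
```

Starting from the shortest list bounds the size of every intermediate result by the length of that list.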

Step 3 – rank score computation • based on a relevance function (e.g. Okapi, cosine) • the score is used in the final ranking

Example

Index Construction

Time complexity • O(T), where T is the number of all terms (including duplicates) in the document collection (after pre-processing)
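
A single pass over the collection is enough to build the index, which is where the O(T) bound comes from. The sketch below assumes whitespace tokenisation and lowercasing as the only pre-processing, and average constant-time dictionary access; `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):   # visit documents in ID order
        for term in text.lower().split():       # each term occurrence is handled once
            postings = index[term]
            # Documents arrive in increasing ID order, so a duplicate can
            # only be the last entry of the list.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(index)


docs = {1: "web mining", 2: "Web data web", 3: "data mining"}
print(build_inverted_index(docs))
# {'web': [1, 2], 'mining': [1, 3], 'data': [2, 3]}
```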

Index Compression

Why? • avoid disk I/O • the size of an inverted index can be reduced dramatically • the original index can also be reconstructed • all the information is represented with positive integers -> integer compression

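Use gaps • store the differences between consecutive document IDs instead of the IDs themselves: 4, 10, 300, and 305 -> 4, 6, 290, and 5 • the numbers to encode become smaller • gaps are large for rare terms, but this is not a big problem because rare terms have short lists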
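
The gap transformation and its inverse can be sketched as below. The function names are hypothetical, and the document IDs are assumed to be positive and strictly increasing.

```python
def to_gaps(doc_ids):
    """Replace each document ID by its difference from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the original document IDs with a running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


print(to_gaps([4, 10, 300, 305]))   # [4, 6, 290, 5]
print(from_gaps([4, 6, 290, 5]))    # [4, 10, 300, 305]
```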

All in one

Unary • x is coded as x − 1 bits of 0 followed by a single 1 bit • e.g. 5 -> 00001, 7 -> 0000001
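
A minimal sketch of unary coding, using strings of '0' and '1' for readability rather than packed bits (an assumption made for illustration; the function names are hypothetical):

```python
def unary_encode(x):
    """Code a positive integer x as x - 1 zero bits followed by a one bit."""
    if x < 1:
        raise ValueError("unary coding is defined for positive integers only")
    return "0" * (x - 1) + "1"


def unary_decode(bits):
    """Decode the unary code at the start of the bit string."""
    return bits.index("1") + 1


print(unary_encode(5))  # 00001
print(unary_encode(7))  # 0000001
```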

Elias Gamma Coding • $1 + \lfloor \log_2 x \rfloor$ in unary (i.e., $\lfloor \log_2 x \rfloor$ 0-bits followed by a 1-bit) • followed by the binary representation of x without its most significant bit • efficient for small integers but not suited to large integers • $1 + \lfloor \log_2 x \rfloor$ is simply the number of bits of x in binary • 9 -> 000 1001
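
The gamma code can be sketched as follows, again with bit strings instead of packed bits for readability. The function names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a string of bits."""
    if x < 1:
        raise ValueError("gamma coding is defined for positive integers only")
    binary = bin(x)[2:]                 # binary form; its length is 1 + floor(log2 x)
    prefix = "0" * (len(binary) - 1) + "1"   # that length in unary
    return prefix + binary[1:]          # then x without its most significant bit


def gamma_decode(bits):
    """Decode the gamma code at the start of the bit string."""
    n = bits.index("1")                 # number of leading zeros = floor(log2 x)
    tail = bits[n + 1:2 * n + 1]        # the n bits after the unary part
    return (1 << n) | (int(tail, 2) if tail else 0)


print(gamma_encode(9))          # 0001001
print(gamma_decode("0001001"))  # 9
```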

Elias Delta Coding • for small integers the codes are longer than gamma codes (delta is better for larger integers) • gamma code representation of $1 + \lfloor \log_2 x \rfloor$ • followed by the binary representation of x less the most significant bit • For 9: $1 + \lfloor \log_2 9 \rfloor = 4$ -> 00100, so 9 -> 00100 001
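
A sketch of the delta code built on top of the gamma code. Bit strings are used for readability, and the function names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a string of bits."""
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + "1" + binary[1:]


def delta_encode(x):
    """Elias delta code: gamma code of the bit length, then x without its top bit."""
    if x < 1:
        raise ValueError("delta coding is defined for positive integers only")
    binary = bin(x)[2:]                       # len(binary) = 1 + floor(log2 x)
    return gamma_encode(len(binary)) + binary[1:]


print(delta_encode(9))                                  # 00100001
print(len(gamma_encode(9)), len(delta_encode(9)))       # 7 8
print(len(gamma_encode(1000)), len(delta_encode(1000))) # 19 16
```

The last two lines show the trade-off stated above: delta is one bit longer than gamma for 9, but three bits shorter for 1000.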

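Golomb Coding • values are coded relative to a constant b • there are several variations of the original Golomb code • e.g. quotient $q = \lfloor x/b \rfloor$ • remainder $r = x - qb$ (b possible remainders, e.g. for b = 3: 0, 1, 2) • the binary representation of a remainder requires $\lfloor \log_2 b \rfloor$ or $\lceil \log_2 b \rceil$ bits • write the first few remainders using $\lfloor \log_2 b \rfloor$ bits and the rest using $\lceil \log_2 b \rceil$ bits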

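Example • b = 3 and x = 9 • $q = \lfloor 9/3 \rfloor = 3$ • $i = \lfloor \log_2 3 \rfloor = 1 \Rightarrow d = 1$ (where $d = 2^{i+1} - b$ is the number of remainders coded with i bits) • $r = 9 - 3 \cdot 3 = 0$ • q + 1 = 4 in unary is 0001, and r = 0 < d is written with i = 1 bit as 0 • Result: 00010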
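
The example can be reproduced with the sketch below. It follows one common variant: q + 1 in unary, then the remainder in truncated binary, where remainders of d or more are written as r + d in i + 1 bits. That last convention is an assumption here, since the slides only state the bit lengths; `golomb_encode` is a hypothetical name.

```python
def golomb_encode(x, b):
    """Golomb code of x relative to b, as a string of bits (one common variant)."""
    if x < 0 or b < 1:
        raise ValueError("expected x >= 0 and b >= 1")
    q, r = divmod(x, b)                 # quotient and remainder
    quotient_bits = "0" * q + "1"       # q + 1 in unary
    i = b.bit_length() - 1              # floor(log2 b)
    d = (1 << (i + 1)) - b              # number of remainders that fit in i bits
    if r < d:
        remainder_bits = format(r, "b").zfill(i) if i > 0 else ""
    else:
        remainder_bits = format(r + d, "b").zfill(i + 1)
    return quotient_bits + remainder_bits


print(golomb_encode(9, 3))                          # 00010
print([golomb_encode(r, 5)[1:] for r in range(5)])  # ['00', '01', '10', '110', '111']
```

The second line lists the remainder codes for b = 5: three remainders take two bits and the other two take three bits.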

The coding tree for b=5

Selection of b • $b \approx 0.69 \cdot \frac{N}{n_t}$ • $N$ – total number of documents • $n_t$ – number of documents that contain term t
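
As a small worked illustration of the rule: the rounding to the nearest integer and the lower bound of 1 are assumptions added here, since the slide only gives an approximation, and `choose_b` is a hypothetical name.

```python
def choose_b(total_docs, docs_with_term):
    """Pick the Golomb parameter b for one term from b ≈ 0.69 * N / n_t."""
    if docs_with_term < 1 or total_docs < docs_with_term:
        raise ValueError("expected 1 <= docs_with_term <= total_docs")
    # Rounding and the floor of 1 are illustrative choices, not part of the rule.
    return max(1, round(0.69 * total_docs / docs_with_term))


print(choose_b(1000, 10))   # 69: a rare term, large gaps, large b
print(choose_b(1000, 500))  # 1: a frequent term, small gaps, small b
```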

Variable-Byte Coding • seven bits in each byte are used to code an integer • the least significant bit of each byte is a flag: 0 – this is the last byte, 1 – more bytes follow • e.g. 135 -> 00000011 00001110
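
The byte layout can be sketched as below, with the most significant 7-bit group written first, which is the order the 135 example implies. The function names are hypothetical.

```python
def vbyte_encode(x):
    """Variable-byte code: 7 data bits per byte, lowest bit set to 1 if more bytes follow."""
    if x < 0:
        raise ValueError("expected a non-negative integer")
    groups = []
    while True:
        groups.append(x & 0x7F)         # take the lowest 7 bits
        x >>= 7
        if x == 0:
            break
    groups.reverse()                    # most significant group first
    out = bytearray()
    for pos, group in enumerate(groups):
        more = 1 if pos < len(groups) - 1 else 0
        out.append((group << 1) | more) # data in the top 7 bits, flag in the lowest bit
    return bytes(out)


def vbyte_decode(data):
    """Decode a concatenation of variable-byte codes into a list of integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte >> 1)
        if byte & 1 == 0:               # flag 0 marks the last byte of a value
            values.append(current)
            current = 0
    return values


print(" ".join(format(b, "08b") for b in vbyte_encode(135)))  # 00000011 00001110
print(vbyte_decode(vbyte_encode(135) + vbyte_encode(5)))      # [135, 5]
```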

Summary • Golomb coding performs better than Elias coding • gamma coding does not work well • variable-byte integers are often faster than variable-bit codes (at higher storage cost) • a compression technique can allow retrieval to be up to twice as fast as without compression • the space requirement averages 20% – 25% of the cost of storing uncompressed integers

Latent Semantic Indexing

Reason • many concepts or objects can be described in multiple ways • relevant documents may use synonyms of the words in the user query rather than the words themselves • LSI deals with this problem by identifying statistical associations between terms

Singular value decomposition (SVD) • used to estimate the latent structure and to remove the “noise” • yields a hidden “concept” space, which associates syntactically different but semantically similar terms and documents

LSI • LSI starts with an m × n term-document matrix A • row = term; column = document • value: e.g. term frequency

Singular Value Decomposition • factor matrix A into three matrices: $A = U E V^T$ • m is the number of rows in A • n is the number of columns in A • r is the rank of A, $r \le \min(m, n)$

Singular Value Decomposition • U is an m × r matrix and its columns, called left singular vectors, are eigenvectors associated with the r non-zero eigenvalues of $AA^T$ • V is an n × r matrix and its columns, called right singular vectors, are eigenvectors associated with the r non-zero eigenvalues of $A^T A$ • E is an r × r diagonal matrix, $E = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$, $\sigma_i > 0$ • $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the non-negative square roots of the r non-zero eigenvalues of $AA^T$; they are arranged in decreasing order, i.e., $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ • reduce the size of the matrices by keeping only the k largest singular values

$A_k = U_k E_k V_k^T$

Query and Retrieval • q – user query (treated as a new document) • q is mapped to a document in the k-concept space, denoted by $q_k$ • $q_k = q^T U_k E_k^{-1}$
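
The query formula follows from the truncated decomposition $A_k = U_k E_k V_k^T$. A short derivation, assuming (as is standard for the SVD) that the columns of $U_k$ are orthonormal:

```latex
\begin{align*}
A_k &= U_k E_k V_k^T \\
U_k^T A_k &= E_k V_k^T
  && \text{since } U_k^T U_k = I \\
V_k^T &= E_k^{-1} U_k^T A_k \\
V_k &= A_k^T U_k E_k^{-1}
  && \text{since } E_k \text{ is diagonal, so } (E_k^{-1})^T = E_k^{-1}
\end{align*}
```

Each row of $V_k$ represents one document (one column of $A_k$) in the k-concept space. Treating the query q as one more column and applying the same mapping gives $q_k = q^T U_k E_k^{-1}$, which can then be compared with the rows of $V_k$, for example by cosine similarity.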

Example

Example q - “user interface”

Example

Summary • the original LSI paper suggests 50 – 350 dimensions • k needs to be determined based on the specific document collection • association rules may be able to approximate the results of LSI
