Indexing: Extracts from Witten, Moffat, and Bell, Managing Gigabytes

May 10, 2023


Indexing
Extracts from:
• Witten, Moffat, and Bell, Managing Gigabytes, 2nd ed., Morgan Kaufmann, 1999.
• Melnik et al., Building a Distributed Full-Text Index for the Web. Proc. 10th Int. WWW Conf., 2001.
Indexing Documents
Basic task: process the document collection so that docs containing a query word can be retrieved fast.
Input: document collection.
Output: search structure for the collection.
Standard Solution
Inverted file + lexicon
• Inverted file = for each word w, the list of docs containing w.
• Lexicon = dictionary over all words occurring in the doc collection (key = word, value = pointer into the inverted file + additional info for the word, e.g. length of its inverted list).
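A minimal sketch of this structure, assuming the collection fits in memory and is given as a dict from DocID to a list of words. The names build_index and lookup, and the choice of a flat Python list standing in for the inverted file, are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict doc_id -> list of words. Returns (lexicon, inverted_file)."""
    postings = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            postings[w].add(doc_id)

    inverted_file = []  # all inverted lists concatenated into one flat list
    lexicon = {}        # word -> (offset into inverted_file, list length)
    for w in sorted(postings):
        ids = sorted(postings[w])
        lexicon[w] = (len(inverted_file), len(ids))
        inverted_file.extend(ids)
    return lexicon, inverted_file

def lookup(lexicon, inverted_file, word):
    """Return the sorted doc IDs containing word (empty list if unknown)."""
    if word not in lexicon:
        return []
    offset, length = lexicon[word]
    return inverted_file[offset:offset + length]
```

Here the lexicon value plays the role of "pointer + list length"; on disk the offset would be a byte position in the inverted file rather than a list index.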
Lexicon
• Sorted list of occurring words + binary search. How to store variable-length strings?
– Array of pointers into the concatenated strings.
– The same + blocking.
– The same + blocking + front coding (prefix compression).
• Hash tables.
• Tries, ternary search trees, suffix arrays (later).
• External: blocking + lexicon over the first string in each block. Repeat ⇒ prefix B-tree.
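Front coding with blocking can be sketched as follows. This is an illustration under assumptions: the function names and the block_size parameter are hypothetical, and the output is kept as Python tuples rather than packed bytes. Each word is stored as the length of the prefix it shares with its predecessor plus the remaining suffix; the first word of each block is stored in full so that binary search over block heads remains possible.

```python
import os

def front_encode(sorted_words, block_size=4):
    """Encode each word as (shared prefix length, remaining suffix)."""
    encoded, prev = [], ""
    for i, w in enumerate(sorted_words):
        if i % block_size == 0:
            prev = ""  # start of a block: store the word in full
        k = len(os.path.commonprefix([prev, w]))
        encoded.append((k, w[k:]))
        prev = w
    return encoded

def front_decode(encoded):
    """Rebuild the word list from (prefix length, suffix) pairs."""
    words, prev = [], ""
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        words.append(prev)
    return words
```

With a block size of 4, the words index, indexed, indexing, inverted are stored as (0, "index"), (5, "ed"), (5, "ing"), (2, "verted").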
Inverted File
Simple (one occurrence per doc):
w1: DocID, DocID, DocID
w2: DocID, DocID
w3: DocID, DocID, DocID, DocID, DocID, DocID
...
Detailed (all occurrences in docs):
w1: DocID, Position, Position, DocID, Position, ...
Even more detailed: each Position annotated with info (heading, boldface, anchor text, ...). Useful for ranking.
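The detailed variant can be sketched as a mapping from word to DocID to a list of positions. The function names are hypothetical, and the adjacent-word (phrase) check is added here only as one example of what stored positions make possible; the slides themselves mention ranking.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict doc_id -> list of words.
    Returns word -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, w in enumerate(docs[doc_id]):
            index[w].setdefault(doc_id, []).append(pos)
    return index

def phrase_match(index, w1, w2):
    """Doc IDs in which w2 occurs directly after w1."""
    hits = []
    for doc_id, positions in index.get(w1, {}).items():
        following = set(index.get(w2, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits
```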
Compressing the inverted file
• "Hand coding"
– Store diffs between DocIDs, not absolute DocIDs.
– Code this diff efficiently (unary, γ, δ, Bernoulli (global or local), ...).
• Use generic compression tools (gzip, ...).
• Compress each entire inverted list.
• Block the list file, compress each block.
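As an illustration of the "hand coding" route, the sketch below stores gaps between DocIDs and codes each gap with the Elias γ code. Assumptions: DocIDs are strictly increasing positive integers, the γ convention used is "N zeros followed by the (N+1)-bit binary form" (other texts use the mirrored unary part), and bits are kept as a Python string for readability rather than packed into bytes.

```python
def gaps(doc_ids):
    """Turn sorted doc IDs into differences from the previous ID."""
    prev, out = 0, []
    for d in doc_ids:
        out.append(d - prev)
        prev = d
    return out

def gamma_encode(x):
    """Elias gamma code of x >= 1: unary length prefix, then binary."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def encode_list(doc_ids):
    return "".join(gamma_encode(g) for g in gaps(doc_ids))

def decode_list(bits):
    """Invert encode_list: read gamma codes and sum the gaps."""
    doc_ids, i, prev = [], 0, 0
    while i < len(bits):
        n = 0
        while bits[i] == "0":  # count the leading zeros
            n += 1
            i += 1
        prev += int(bits[i:i + n + 1], 2)
        i += n + 1
        doc_ids.append(prev)
    return doc_ids
```

For the list 3, 7, 8, 20 the gaps are 3, 4, 1, 12, which take 16 bits in total under this code.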
Combine inverted list and lexicon
Melnik et al.:
• Use a standard (embedded) DB library (e.g. Berkeley DB).
• Sample entries in the inverted file evenly (such that the part between two samples can be coded within a page size). Use the DB with (key, value) = (sample, next coded part). Generic compression can be applied to the parts too.
Preprocessing
• Find words
– Remove mark-up, scripts, ...
– Coding scheme? Unicode, Latin-1, ASCII?
– Lowercase.
– Definition of word? (Suggestion: alphanumeric sequence, max 4 digits, max 256 chars.)
• Stemming? (Don't.)
• Stop words? (Probably don't: store all words, and allow stop words at query time.)
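One possible reading of the suggested word definition, as a sketch: ASCII input with mark-up already removed, and tokens that exceed either limit simply discarded. Both of those choices, the regular expression, and the name tokenize are assumptions; the slides do not say how over-long tokens should be handled.

```python
import re

# A word is a maximal run of lowercase ASCII letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")

def tokenize(text, max_digits=4, max_chars=256):
    """Lowercase text and return its words, dropping tokens that are
    longer than max_chars or contain more than max_digits digits."""
    words = []
    for tok in TOKEN.findall(text.lower()):
        if len(tok) > max_chars:
            continue
        if sum(c.isdigit() for c in tok) > max_digits:
            continue
        words.append(tok)
    return words
```

No stemming or stop-word removal is applied, in line with the recommendations above.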
Building the index
• Hashing only good within RAM. Normally not relevant for web.
• I/O-efficient sorting: OK.

Distribution
• Split on DocID ("local inverted files").
• Split on WordID ("global inverted files").
Split on DocID is probably better, since for AND-queries the filtering of lists can be done at each machine (less communication).
Melnik et al. give further considerations on efficient distributed building. Among other things: interleave CPU, disk I/O, and net traffic (the idea of interleaving CPU time and I/O is also useful for external sorting).
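The argument for splitting on DocID can be illustrated with a small simulation. Each "machine" is modelled as a plain dict holding the local inverted file for its own documents; the name and_query and this in-process setup are assumptions for illustration, with no real networking involved. Each shard intersects its own lists, so only the final matching DocIDs would need to be sent to the coordinator.

```python
def and_query(shards, words):
    """shards: list of local inverted files (dict word -> doc ID list),
    one per machine. Returns doc IDs containing all words."""
    results = []
    for local_index in shards:
        # Intersection happens locally, on the machine holding the docs.
        lists = [set(local_index.get(w, [])) for w in words]
        local_hits = set.intersection(*lists) if lists else set()
        # Only the (usually short) hit list leaves the machine.
        results.extend(local_hits)
    return sorted(results)
```

With a split on WordID, by contrast, the full inverted lists for the query words may live on different machines and would have to be shipped before they can be intersected.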
