Toward a Multi-tier Index for Information Retrieval System Madhav Ram FJ37459

Introduction
• IR systems are developed mainly to help manage the large body of literature that has accumulated
• IR systems give users easy access to information; their main functions are the representation, storage, and organization of information items, and access to them
• The performance of the retrieval process decreases drastically as the amount of information stored in the system increases

Related work
• Previous work in this field follows two main directions
• The first is sequential processing, in which one processor at a time is used to construct the inverted file index used for information retrieval
• The second is parallel processing, which uses multiple processors to construct the inverted file index
• An inverted index is a sorted list of keywords, with each keyword linked to the documents that contain it

Sequential Processing
Fig: Implementation of an inverted index using a sorted array
• The first file, which contains the list of keywords, is called the dictionary file; the second, which contains the document links, is called the posting file
• Keywords are located by binary search over the sorted array
• A sorted array is easy to implement and reasonably fast to search
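The dictionary/posting layout on this slide can be illustrated with a minimal Python sketch. It is not the presenters' implementation: the function names `build_index` and `lookup`, the in-memory lists standing in for the two files, and the sample documents are all assumptions made for illustration.

```python
from bisect import bisect_left


def build_index(docs):
    """Build a sorted keyword array (the dictionary) and a parallel postings array.

    `docs` maps a document id to its text; this input format is an assumption.
    """
    postings_by_word = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            postings_by_word.setdefault(word, []).append(doc_id)
    dictionary = sorted(postings_by_word)  # sorted array of keywords
    postings = [sorted(postings_by_word[w]) for w in dictionary]  # same order as dictionary
    return dictionary, postings


def lookup(dictionary, postings, keyword):
    """Binary-search the sorted dictionary and return the matching postings list."""
    keyword = keyword.lower()
    i = bisect_left(dictionary, keyword)
    if i < len(dictionary) and dictionary[i] == keyword:
        return postings[i]
    return []


# Hypothetical sample collection.
docs = {1: "inverted index search", 2: "binary search array", 3: "index array"}
dictionary, postings = build_index(docs)
print(lookup(dictionary, postings, "search"))
```

Keeping the postings in a second array aligned with the dictionary mirrors the two-file split: the position found by the binary search is all that is needed to reach the document links.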

Sequential Processing (contd.)
• A system called Glimpse uses the block-addressing idea to speed up construction of the inverted file
• The main advantage of block addressing is that it shrinks the inverted file to an overhead of only 5% of the original text size
• Partial indexing divides the original text files into smaller buckets that fit into main memory
Fig: Partial indexing technique, merging the partial indexes in a binary fashion
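Partial indexing with binary merging, as described on this slide, could look roughly like the following Python sketch. It runs entirely in memory, whereas a real system would write each partial index to disk; the names `index_bucket`, `merge_two`, `merge_binary`, and `build_partial_index`, and the `bucket_size` parameter, are hypothetical.

```python
def index_bucket(bucket):
    """Index one bucket: a list of (doc_id, text) pairs assumed to fit in memory."""
    index = {}
    for doc_id, text in bucket:
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def merge_two(a, b):
    """Merge two partial indexes, combining the postings of shared keywords."""
    merged = {}
    for word in a.keys() | b.keys():
        merged[word] = sorted(set(a.get(word, [])) | set(b.get(word, [])))
    return merged


def merge_binary(partials):
    """Merge partial indexes pairwise, level by level, until one index remains."""
    if not partials:
        return {}
    while len(partials) > 1:
        next_level = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                next_level.append(merge_two(partials[i], partials[i + 1]))
            else:
                next_level.append(partials[i])  # odd one out moves up unchanged
        partials = next_level
    return partials[0]


def build_partial_index(docs, bucket_size):
    """Split docs into buckets, index each bucket, then merge in a binary fashion."""
    buckets = [docs[i:i + bucket_size] for i in range(0, len(docs), bucket_size)]
    return merge_binary([index_bucket(b) for b in buckets])


# Hypothetical sample collection, indexed two documents at a time.
docs = [(1, "a b"), (2, "b c"), (3, "c d"), (4, "a d"), (5, "e a")]
full_index = build_partial_index(docs, bucket_size=2)
```

With five documents and a bucket size of two, three partial indexes are built; the first two are merged in the first round and the result is merged with the third in the second round.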

Parallel Processing
• In the bulk-synchronous parallel model of computing, parallelism is tackled using two approaches: the local index approach and the global index approach
• With local indexing, each processor constructs an inverted index over only the documents stored on that processor
• With global indexing, the whole document collection is used to produce a single inverted index, identical to the one produced sequentially
• Three distributed algorithms are used to build global inverted files for very large text collections: the Local Buffer and Local List algorithm (LL), the Local Buffer and Remote List algorithm (LR), and the Remote Buffer and Remote List algorithm (RR)
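The difference between the local and global approaches can be illustrated with a small Python sketch. It only simulates processors as list partitions and does no real parallel work, and it does not implement the LL, LR, or RR algorithms; the names `local_indexes` and `query_local` and the round-robin partitioning are assumptions.

```python
def build_index(docs):
    """Build an inverted index over a list of (doc_id, text) pairs."""
    index = {}
    for doc_id, text in docs:
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def local_indexes(docs, n_processors):
    """Local approach: each simulated processor indexes only its own documents."""
    partitions = [docs[p::n_processors] for p in range(n_processors)]
    return [build_index(part) for part in partitions]


def query_local(indexes, word):
    """A query against local indexes must visit every processor and combine the hits."""
    hits = []
    for index in indexes:
        hits.extend(index.get(word, []))
    return sorted(hits)


# Hypothetical sample collection.
docs = [(1, "tier index"), (2, "index file"), (3, "file update"), (4, "tier update")]
local = local_indexes(docs, n_processors=2)
global_index = build_index(docs)  # global approach: one index over the whole collection
print(query_local(local, "index"), global_index["index"])
```

Both approaches return the same documents for a keyword; they differ in where the postings live and in how many processors a query has to consult.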

Approach
• There are two ways to enhance IR systems: the first is to use special-purpose hardware, and the second is to use the Multi-Tier index algorithm
• The second is discussed in this paper; it is based on new algorithms
• The most common indexing technique is the inverted file index, which stores the data in indexed form
• The main disadvantage of the inverted file is that updating the index is expensive
• The factors that affect the indexing process are the construction, search, and update times of the inverted file index

Approach
• The inverted file index built by the developed algorithms consists of two associated files: the dictionary and the postings
• The main benefits of the multi-tier design are a faster search for any query and easier updating
• The first step of a search is to look up the first letter of the query in the first-tier directory, which determines the name of the second-tier file to search
• The second step is the search in the second tier
• The third step is updating: an inverted file index is created for the updated files and remerged
• Finally, the posting file is updated
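The two-tier lookup described on this slide might be sketched in Python as follows. The slides do not give the actual data layout, so this is only one plausible reading: the class name `MultiTierIndex`, the `tier2_<letter>.idx` file names, and the in-memory dictionaries standing in for files are all hypothetical.

```python
from bisect import bisect_left, insort


class MultiTierIndex:
    """Illustrative two-tier index; not the presenters' implementation."""

    def __init__(self):
        # First tier: first letter -> name of a second-tier file (names are hypothetical).
        self.first_tier = {}
        # Second tier: file name -> (sorted keyword list, postings keyed by keyword).
        self.second_tier = {}

    def _tier_for(self, word):
        """Return the second-tier structure for a word, creating it if needed."""
        letter = word[0]
        name = self.first_tier.setdefault(letter, f"tier2_{letter}.idx")
        return self.second_tier.setdefault(name, ([], {}))

    def update(self, doc_id, text):
        """Merge one document into the index, touching only the affected second-tier files."""
        for word in set(text.lower().split()):
            keywords, postings = self._tier_for(word)
            if word not in postings:
                insort(keywords, word)  # keep the dictionary sorted
                postings[word] = []
            if doc_id not in postings[word]:
                insort(postings[word], doc_id)

    def search(self, word):
        """Step 1: first-tier lookup by first letter. Step 2: binary search in the second tier."""
        word = word.lower()
        if not word:
            return []
        name = self.first_tier.get(word[0])
        if name is None:
            return []
        keywords, postings = self.second_tier[name]
        i = bisect_left(keywords, word)
        if i < len(keywords) and keywords[i] == word:
            return postings[word]
        return []


# Hypothetical usage.
idx = MultiTierIndex()
idx.update(1, "tier index")
idx.update(2, "index update")
print(idx.search("index"))
```

In this sketch an update rewrites only the second-tier structures whose first letter occurs in the new document, which is one way the design could keep update cost from growing with the size of the whole index.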

Experimental Results
• All datasets used in this research are synthetic
• A random function generator is used to create the words of the text documents
• The partial-indices concept is used to construct the inverted file index
• The implementation uses Visual Basic on two hardware systems: a PII 333 MHz with 64 MB RAM, and a 2.8 GHz Dell server with 1 GB RAM
• Update performance is measured for different update file sizes against different inverted file sizes

Experimental Results: PII 333 MHz
Figure a: Updating time by 1 KB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size, 1 K, 512 K, 2 M, 8 M; y-axis: Updating Time; series: Partial, Multi-Tier)
Figure b: Updating time by 1 MB file size using Partial and Multi-Tier inverted file (same axes and series)

Experimental Results: PII 333 MHz
Figure c: Updating time by 2 MB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size; y-axis: Updating Time; series: Partial, Multi-Tier)

Experimental Results: Dell Server 2.8 GHz
Figure d: Updating time by 1 KB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size, 1 K, 512 K, 2 M, 8 M; y-axis: Updating Time; series: Partial, Multi-Tier)
Figure e: Updating time by 1 MB file size using Partial and Multi-Tier inverted file (same axes and series)

Experimental Results: Dell Server 2.8 GHz
Figure f: Updating time by 2 MB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size, 1 K, 512 K, 2 M, 8 M; y-axis: Updating Time; series: Partial, Multi-Tier)

Conclusion
• The Multi-Tier indexing technique outperforms the partial index technique
• Updating with a Multi-Tier index performs better than with a partial index
• This indicates that updating can be performed for both large and small file sizes with predictable performance

Thank You!
