Toward a Multi-tier Index for Information Retrieval System Madhav Ram FJ37459
Introduction
• IR systems are mainly developed to help manage the huge body of literature that has accumulated.
• IR systems give users easy access to information; their main functions are the representation, storage, and organization of information items, and access to them.
• The performance of the information retrieval process decreases drastically as the amount of information stored in the system increases.
Related work
• Previous work in this field follows two main directions.
• The first is sequential processing: one processor at a time is used to construct the inverted file index used for information retrieval.
• The second is parallel processing, which uses multiple processors to construct the inverted file index.
• An inverted index is a sorted list of keywords, with each keyword linked to the documents that contain it.
Sequential Processing
Fig: Implementation of an inverted index using a sorted array.
• The first file, which contains the list of keywords, is called the dictionary file; the second, which contains the document links, is called the posting file.
• The dictionary is stored as a sorted array and searched with binary search.
• A sorted array is easy to implement and reasonably fast to search.
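The sorted-array layout described on this slide can be sketched as follows. This is a minimal in-memory illustration, not the implementation behind the slides: the function names `build_index` and `lookup` are hypothetical, and the "dictionary file" and "posting file" are modeled as two parallel Python lists rather than files on disk.

```python
from bisect import bisect_left


def build_index(docs):
    """Build (dictionary, postings) from a {doc_id: text} mapping.

    dictionary is a sorted list of keywords; postings[i] is the sorted
    list of document ids that contain dictionary[i].
    """
    temp = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            temp.setdefault(word, set()).add(doc_id)
    dictionary = sorted(temp)                          # sorted keyword array
    postings = [sorted(temp[w]) for w in dictionary]   # parallel posting lists
    return dictionary, postings


def lookup(dictionary, postings, keyword):
    """Binary-search the sorted dictionary and return the posting list."""
    i = bisect_left(dictionary, keyword)
    if i < len(dictionary) and dictionary[i] == keyword:
        return postings[i]
    return []
```

Because the dictionary is sorted, a lookup costs a logarithmic number of comparisons, but inserting a new keyword means shifting array entries, which is one reason updates are expensive in this layout.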
Sequential Processing (contd.)
• A system called Glimpse uses the block addressing idea to speed up construction of the inverted file.
• The main advantage of block addressing is that the inverted file shrinks to an overhead of only 5% of the original text size.
• Partial indexing divides the original text files into smaller buckets that fit into main memory.
Fig: Partial indexing technique, merging the partial indexes in a binary fashion.
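Partial indexing with binary merging might look roughly like the sketch below. It assumes documents arrive as a list of `(doc_id, text)` pairs and that a fixed `bucket_size` stands in for "what fits in main memory"; all function names are hypothetical, and the merge works on in-memory dicts rather than on-disk index files.

```python
def index_bucket(bucket):
    """Build a partial index for one in-memory bucket of (doc_id, text) pairs."""
    index = {}
    for doc_id, text in bucket:
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def merge_two(a, b):
    """Merge two partial indexes into a new one without mutating the inputs."""
    merged = {word: set(ids) for word, ids in a.items()}
    for word, ids in b.items():
        merged.setdefault(word, set()).update(ids)
    return merged


def merge_binary(partials):
    """Merge partial indexes pairwise, level by level, until one remains."""
    if not partials:
        return {}
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_two(level[i], level[i + 1]))
        if len(level) % 2:               # odd one out moves up unmerged
            next_level.append(level[-1])
        level = next_level
    return level[0]


def build_partial(docs, bucket_size):
    """Split docs into buckets, index each bucket, then merge in binary fashion."""
    buckets = [docs[i:i + bucket_size] for i in range(0, len(docs), bucket_size)]
    return merge_binary([index_bucket(b) for b in buckets])
```

With this scheme an update to the collection generally forces at least part of the merge tree to be redone, which is the cost the multi-tier design later tries to reduce.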
Parallel Processing
• In the bulk-synchronous parallel model of computing, parallelism is tackled using two approaches: the local index approach and the global index approach.
• With local indexing, each processor constructs an inverted list index from only the documents stored on that processor.
• With global indexing, the whole document collection is used to produce a single inverted list index, identical to the sequential one.
• Three distributed algorithms are used to build global inverted files for very large text collections: the Local Buffer and Local List algorithm (LL); the Local Buffer and Remote List algorithm (LR); and the Remote Buffer and Remote List algorithm (RR).
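The difference between the local and global index approaches can be illustrated with a small single-process simulation. This sketch does not implement the LL, LR, or RR algorithms; each "processor" is simply modeled as one `{doc_id: text}` partition, and every function name is hypothetical.

```python
def invert(docs):
    """Build an inverted index {word: sorted doc ids} from {doc_id: text}."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def local_indexes(partitions):
    """Local approach: one independent index per partition (processor)."""
    return [invert(part) for part in partitions]


def global_index(partitions):
    """Global approach: a single index over the whole collection."""
    whole = {}
    for part in partitions:
        whole.update(part)
    return invert(whole)


def query_local(indexes, word):
    """A query on local indexes must visit every partition and combine hits."""
    hits = []
    for idx in indexes:
        hits.extend(idx.get(word, []))
    return sorted(hits)
```

The trade-off visible even in this toy version: local indexes are cheap to build independently, while a global index answers a query from one place and matches what sequential construction would produce.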
Approach
• There are two methods to enhance IR systems: the first is to use special-purpose hardware, and the second is to use the Multi-Tier index algorithm.
• The second is discussed in this paper; it is based on the use of new algorithms.
• The most common indexing technique is the inverted file index, which stores the data in indexed form.
• The main disadvantage of the inverted file is that updating the index is expensive.
• The factors that affect the indexing process are the construction, searching, and updating times of the inverted file index.
Approach (contd.)
• The inverted file index built by the developed algorithms consists of two associated files: the dictionary and the postings.
• The main benefits of the multi-tier design are a faster search process for any query and easier updating.
• The first step in the search process is a lookup in the first-tier directory to identify the first letter of the query, which determines the name of the second-tier file in which to perform the search.
• The second step is the search in the second tier.
• The third step is the updating process: an inverted file index is created for the updated files and re-merged.
• Finally, the posting file is updated.
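The search and update steps listed above can be sketched as a two-level structure. This is an illustrative reading of the slide, not the authors' code: the class name `MultiTierIndex`, the `tier_<letter>` naming scheme, and the use of in-memory dicts in place of second-tier files are all assumptions.

```python
class MultiTierIndex:
    """Two-tier index: first letter -> second-tier name -> {keyword: doc ids}."""

    def __init__(self):
        self.first_tier = {}    # first letter -> second-tier name
        self.second_tier = {}   # second-tier name -> {keyword: set of doc ids}

    def _tier_for(self, word, create=False):
        """Step 1: use the first letter to find the second-tier name."""
        letter = word[0]
        name = self.first_tier.get(letter)
        if name is None and create:
            name = f"tier_{letter}"          # hypothetical naming scheme
            self.first_tier[letter] = name
            self.second_tier[name] = {}
        return name

    def search(self, word):
        """Step 2: search only inside the selected second tier."""
        word = word.lower()
        if not word:
            return []
        name = self._tier_for(word)
        if name is None:
            return []
        return sorted(self.second_tier[name].get(word, ()))

    def update(self, docs):
        """Step 3: index the updated docs, then merge into affected tiers only.

        Returns the set of second-tier names that were touched.
        """
        fresh = {}
        for doc_id, text in docs.items():
            for word in text.lower().split():
                fresh.setdefault(word, set()).add(doc_id)
        touched = set()
        for word, ids in fresh.items():
            name = self._tier_for(word, create=True)
            self.second_tier[name].setdefault(word, set()).update(ids)
            touched.add(name)
        return touched
```

In this reading, an update only rewrites the second-tier entries whose first letters appear in the new documents, which is one plausible explanation for why update time would depend less on the total index size.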
Experimental Results
• All datasets used for this research are synthetic.
• For the synthetic datasets, a random function generator is used to create the words of the text documents.
• The partial indices concept is used for constructing the inverted file index.
• The implementation is in Visual Basic and is run on two hardware systems: a PII 333 MHz with 64 MB RAM and a 2.8 GHz Dell server with 1 GB RAM.
• Updating performance is measured for different update file sizes on inverted files of different sizes.
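A synthetic text document of a given size could be generated along the following lines. The slides do not describe the generator in detail, so the word-length range, the lowercase alphabet, and the function names here are assumptions made purely for illustration (and the sketch is in Python rather than the Visual Basic used in the experiments).

```python
import random
import string


def random_word(rng, min_len=3, max_len=8):
    """Return a random lowercase word; the length bounds are assumed values."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def synthetic_document(target_bytes, seed=0):
    """Generate space-separated random words totalling about target_bytes."""
    rng = random.Random(seed)   # seeded so that runs are repeatable
    words = []
    size = 0
    while size < target_bytes:
        word = random_word(rng)
        words.append(word)
        size += len(word) + 1   # +1 for the separating space
    return " ".join(words)
```

Calling `synthetic_document(1024)` or `synthetic_document(1024 * 1024)` would give update files on the order of the 1 KB and 1 MB sizes used in the experiments.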
Experimental Results: PII 333 MHz
Figure a: Updating time by 1 KB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size; y-axis: Updating Time).
Figure b: Updating time by 1 MB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size; y-axis: Updating Time).
Experimental Results: PII 333 MHz
Figure c: Updating time by 2 MB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size; y-axis: Updating Time).
Experimental Results: Dell Server 2.8 GHz
Figure d: Updating time by 1 KB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size, 1 K to 8 M; y-axis: Updating Time).
Figure e: Updating time by 1 MB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size, 1 K to 8 M; y-axis: Updating Time).
Experimental Results: Dell Server 2.8 GHz
Figure f: Updating time by 2 MB file size using Partial and Multi-Tier inverted file (x-axis: Inverted File Index Size, 1 K to 8 M; y-axis: Updating Time).
Conclusion
• The Multi-Tier indexing technique performs better than the partial index technique.
• The updating process using a Multi-Tier index performs better than with a partial index.
• This indicates that updating can be performed for both large and small file sizes with predictable performance.
Thank You!