CSE 373: Analysis of Algorithms
Topic: Reinventing search engines using Tries
Nov 03, 2003
Lecturer: Piyush Kumar
Scribed by: Piyush Kumar

Please help us improve this draft. If you read these notes and find any errors or have an idea to improve them, please send feedback.

"Search others for their virtues, thyself for thy vices."
Benjamin Franklin (1706 - 1790)

1 Search Engines

The World Wide Web today contains billions of pages for you to explore. Google, AltaVista, Infoseek, Yahoo and thousands of other search engines exist today to help you find what you are looking for on the web. Below is a list of approximately how many pages you can search in some of the popular search engines (Dec 2002).

Search Engine    Approximate number of pages (billions)
Google           3.1
AlltheWeb        2.1
AltaVista        1.7
WiseNut          1.5

Have you ever wondered how they work? In this set of lectures we peek into a data structure that would help us design a prototype of a search engine. Designing and implementing a small search engine is both easy and fun.

2 The Prototype

There will be two major parts of our prototype. The first is a crawler: a program that gathers the web pages that our search engine will search on. Crawlers are also known as robots, bots or spiders. Real crawlers have to deal with many issues that we will not consider in our prototype design. Our prototype design will be simple enough to implement in a hundred lines of Perl code. Also, our prototype will not implement page ranking (although we encourage you to think about how we could incorporate page ranking in our prototype).

Our prototype crawler will use a simple breadth first search on the web graph (possibly with a bound on the depth reached). Many scientists claim that breadth first search crawling tends to find high quality pages early in the crawl (Why?).

Once the crawler has gathered the pages that we want our search engine to search on, how do we implement the second part, the search data structure?
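Before turning to that question, here is a minimal sketch of the depth-bounded breadth first crawl described above. It is written in Python rather than Perl, and it runs on a small in-memory graph instead of the real web: the names `bfs_crawl`, `toy_links` and `TOY_WEB` are made up for illustration, and a real crawler would replace `toy_links` with code that downloads a page and extracts its links.

```python
from collections import deque


def bfs_crawl(seed, get_links, max_depth):
    """Return the pages reachable from `seed` within `max_depth` links,
    in the order a breadth first search visits them."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])          # (page, depth at which it was found)
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                    # depth bound reached: do not follow links
        for nxt in get_links(url):
            if nxt not in visited:      # never enqueue a page twice
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Hypothetical toy web graph: page name -> pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": ["e"],
    "e": [],
}


def toy_links(url):
    """Stand-in for 'fetch the page and extract its links' (an assumption)."""
    return TOY_WEB.get(url, [])


crawl_order = bfs_crawl("a", toy_links, max_depth=2)
print(crawl_order)
```

With a depth bound of 2, page "e" is never visited, because it is three links away from the seed.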
3 Occurrence Lists

Before we go to the design of a search data structure, we will create a set of elements for the data structure called the occurrence lists. An occurrence list is a list of web pages that contain a particular word w. If we assign a number to each of our web pages, then the occurrence lists can be created by parsing the web pages and storing a pair (w, UrlID) for each word w as we go through a web page. Once this big list is stored on disk, we can use our external memory sort engine (that we designed in the programming project) to sort this data, so that for each distinct word we can collect the pages that contain it.

Thus we now have a way to create, for each distinct word or term occurring in our web pages, its occurrence list. We will call these distinct words or terms index terms.
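The construction above can be sketched in a few lines of Python. This is only an illustration under some assumptions: an in-memory `sort()` stands in for the external memory sort engine, words are lowercased and split on non-alphanumeric characters (the notes do not fix a tokenization rule), and the names `build_occurrence_lists` and `PAGES` are hypothetical.

```python
import re
from itertools import groupby


def build_occurrence_lists(pages):
    """pages: dict mapping UrlID -> page text.
    Returns a dict mapping each index term to its sorted occurrence list."""
    pairs = []
    for url_id, text in pages.items():
        # Assumed tokenization: lowercase runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            pairs.append((word, url_id))

    pairs.sort()  # in-memory stand-in for the external memory sort

    lists = {}
    for word, group in groupby(pairs, key=lambda p: p[0]):
        ids = []
        for _, url_id in group:
            if not ids or ids[-1] != url_id:   # drop repeated (w, UrlID) pairs
                ids.append(url_id)
        lists[word] = ids
    return lists


# Hypothetical three-page collection.
PAGES = {
    1: "What are suffix trees?",
    2: "Trees play an important role",
    3: "Suffix trees and what we do with trees",
}

occ = build_occurrence_lists(PAGES)
print(occ["trees"], occ["suffix"])
```

Because the pairs are sorted by word and then by UrlID, every occurrence list comes out already sorted by page number.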
4 The Search Engine Data Structure

If we did not care about efficiency, we could implement the search of a few keywords using a simple script which calls grep and find. But this would be horribly inefficient if we want to answer searches fast. The core of our prototype search engine will be a dictionary called an inverted index or inverted file. For each index term that appears in the collection of our stored web pages, an inverted file lists each document where it appears. In other words, it stores pairs (w, L_w), where w is an index term and L_w is the list of pages containing the index term w (its occurrence list). This data structure is especially good at boolean queries.

What kinds of queries do we want our data structure to answer? If we look for "What are Suffix Trees", we should output all the pages that have all the index terms "What", "are", "Suffix" and "Trees".

[Figure: a collection of sample web pages, a few of which contain the words of the query.]

Note that this is a boolean query. Our data structure should retrieve ("What", L_What), ("are", L_are), ("Suffix", L_Suffix), ("Trees", L_Trees) and then compute and output

L_What ∩ L_are ∩ L_Suffix ∩ L_Trees

One thing this operation suggests is that we should keep all the occurrence lists in sorted order, so that we can do boolean operations fast.

How do we implement this data structure? We will implement it as a trie for the set of index terms. A trie is a tree based data structure for storing strings in order to support fast pattern matching.
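To make this concrete, here is a small Python sketch of a trie over index terms in which the node that ends a term holds that term's sorted occurrence list, together with an AND query that intersects the lists by a linear merge. The class and function names (`TrieNode`, `Trie`, `intersect_sorted`, `and_query`) and the sample lists are assumptions made for illustration, as is the choice to lowercase the keywords; this is one possible layout, not a prescribed implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}       # character -> child TrieNode
        self.occurrences = None  # sorted page ids if an index term ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, occurrences):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.occurrences = occurrences

    def lookup(self, term):
        """Return the occurrence list of `term`, or [] if it is not indexed."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.occurrences or []


def intersect_sorted(a, b):
    """Intersect two sorted lists in O(len(a) + len(b)) by merging."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(trie, keywords):
    """Pages containing every keyword (keywords are lowercased: an assumption)."""
    result = None
    for word in keywords:
        lst = trie.lookup(word.lower())
        result = lst if result is None else intersect_sorted(result, lst)
        if not result:
            break                # the intersection is already empty
    return result or []


# Hypothetical occurrence lists for four index terms.
index = Trie()
for term, pages in {"what": [1, 3], "are": [1],
                    "suffix": [1, 3], "trees": [1, 2, 3]}.items():
    index.insert(term, pages)

hits = and_query(index, ["What", "are", "Suffix", "Trees"])
print(hits)
```

Note that looking up a term costs time proportional to the length of the term, independent of how many index terms are stored, and that the merge only works because the occurrence lists are kept sorted.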
The trie will have pointers to the corresponding occurrence lists, so that as soon as there is a match for an index term w, we can get hold of its list L_w. After looking up all the search keywords that are in the query, we have all their occurrence lists; we just need to do an intersection (AND) or union (OR) computation and output the intersection or union of these lists. Hence, the main job left in the design of our prototype search engine is to do fast pattern matching using a data structure.

What do we want from this data structure? Our goal is to process the text so that an occurrence of any search keyword can be found quickly in our list of words or terms. We will preprocess our set of words to facilitate fast queries of search keywords.

How long does searching a 2-3-4 tree, a treap, a balanced BST, or a sorted array take? The answer which we have come to know is O(log s), where s is the number of elements stored. This, however, is not strictly true. If the elements we are dealing with are ints, it is close enough, but what if they are strings? In order to find out whether one string is greater than, less than, or equal to another, we may need to go through their characters and compare them one by one, and in the worst case we look at every character. So the real answer is that it takes O(M log s), where M is the number of bytes in