
CSE 373: Analysis of Algorithms

Topic: Reinventing search engines using Tries

Nov 03, 2003
Lecturer: Piyush Kumar
Scribed by: Piyush Kumar

Please help us improve this draft. If you read these notes and find any errors or have an idea to improve them, please send feedback.

Search others for their virtues, thyself for thy vices.
Benjamin Franklin (1706 - 1790)

1 Search Engines

The world wide web today contains billions of pages for you to explore. Google, AltaVista, Infoseek, Yahoo and thousands of other search engines exist today to help you find what you are looking for on the web. Below is a list of approximately how many pages you can search in some of the popular search engines (Dec 2002).

Search Engine    Approximate number of pages (billions)
Google           3.1
AlltheWeb        2.1
AltaVista        1.7
WiseNut          1.5

Have you ever wondered how they work? In this set of lectures we peek into a data structure that will help us design a prototype of a search engine. Designing and implementing a small search engine is both easy and fun.

2 The Prototype

There will be two major parts of our prototype. A crawler is a program that gathers the web pages that our search engine will search. Crawlers are also known as robots, bots or spiders. Real crawlers have to deal with many issues that we will not consider in our prototype design. Our prototype will be simple enough to implement in a hundred lines of Perl code. It will also not implement page ranking (although we encourage you to think about how we could incorporate page ranking into it). Our prototype crawler will use a simple breadth-first search of the web graph (possibly with a bound on the depth reached). Many scientists claim that breadth-first crawling tends to find high-quality pages early in the crawl (Why?).

Once the crawler has gathered the pages that we want our search engine to search, how do we implement the search data structure?
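The depth-bounded breadth-first crawl described above might be sketched as follows. This is only an illustration in Python, not the Perl prototype from the notes: the names bfs_crawl and TOY_WEB are hypothetical, and a small in-memory dictionary stands in for fetching pages and extracting their links.

```python
from collections import deque

def bfs_crawl(links, start, max_depth):
    """Return the pages reachable from `start` in breadth-first order,
    following at most `max_depth` links away from the start page."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # depth bound reached: do not follow this page's links
        for nxt in links.get(page, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Hypothetical toy web graph: page -> pages it links to.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["d.html"],
    "c.html": ["d.html", "a.html"],
    "d.html": ["e.html"],
}

print(bfs_crawl(TOY_WEB, "a.html", 2))
# prints ['a.html', 'b.html', 'c.html', 'd.html']
```

With a depth bound of 2, e.html is never visited because it is three links away from the start page. A real crawler would replace the dictionary lookup with an HTTP fetch and a link extractor.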

3 Occurrence Lists

Before we go to the design of a search data structure, we will create a set of elements for the data structure called occurrence lists. An occurrence list is the list of web pages that contain a particular word w. If we assign a number to each of our web pages, then the occurrence lists can be created by parsing the web pages and storing a pair (w, UrlID) for each word w we encounter in the page numbered UrlID. Once this big list is stored on disk, we can use our external memory sort engine (which we designed in the programming project) to sort it, so that for each distinct word we can collect the pages that contain it. We thus have a way to create, for each distinct word or term occurring in our web pages, its occurrence list. We will call these distinct words or terms index terms.
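The pair-and-sort construction above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the pages fit in memory, so Python's built-in sort stands in for the external memory sort engine; words are lowercased and split on non-alphanumeric characters, which the notes do not specify; and the names occurrence_lists and TOY_PAGES are hypothetical.

```python
import re
from itertools import groupby

def occurrence_lists(pages):
    """pages maps UrlID -> page text.
    Returns a dict mapping each index term to its sorted list of UrlIDs."""
    pairs = []
    for url_id, text in pages.items():
        # Assumption: a word is a run of letters or digits, lowercased.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            pairs.append((word, url_id))
    pairs.sort()  # in-memory stand-in for the external memory sort
    lists = {}
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        ids = []
        for _, url_id in group:
            if not ids or ids[-1] != url_id:  # skip repeated (w, UrlID) pairs
                ids.append(url_id)
        lists[word] = ids
    return lists

# Hypothetical toy collection of three numbered pages.
TOY_PAGES = {
    1: "What are suffix trees",
    2: "Trees are fun",
    3: "What we do",
}

lists = occurrence_lists(TOY_PAGES)
print(lists["what"])   # prints [1, 3]
print(lists["trees"])  # prints [1, 2]
```

Because the pairs are sorted by word and then by UrlID, each occurrence list comes out already in sorted order.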


4 The Search Engine Data Structure

If we did not care about efficiency, we could implement the search for a few keywords using a simple script that calls grep and find. But this would be horribly inefficient if we want to answer searches fast. The core of our prototype search engine will be a dictionary called an inverted index or inverted file. For each index term that appears in our collection of stored web pages, an inverted file lists each document in which it appears. In other words, it stores pairs (w, L_w), where w is an index term and L_w is the list of pages containing w (its occurrence list). This data structure is especially good at boolean queries.

What kinds of queries do we want our data structure to answer? If we look for “What are Suffix Trees”, we should output all the pages that contain all of the index terms “What”, “are”, “Suffix” and “Trees”.

[Figure: a collection of sample web pages, with titles such as “Project”, “Trees” and “Suffix Trees”.]

Note that this is a boolean query. Our data structure should retrieve (“What”, L_What), (“are”, L_are), (“Suffix”, L_Suffix), (“Trees”, L_Trees) and then compute and output

L_What ∩ L_are ∩ L_Suffix ∩ L_Trees

One thing this operation suggests is that we should keep all the occurrence lists in sorted order so that we can do boolean operations fast.

How do we implement this data structure? We will implement it as a trie for the set of index terms. A trie is a tree-based data structure for storing strings in order to support fast pattern matching. The trie will have pointers to the corresponding occurrence lists, so that as soon as there is a match for an index term w, we can get hold of its list L_w. After looking up all the search keywords in the query, we have all their occurrence lists; we then just need to compute and output the intersection (AND) or union (OR) of these lists.

Hence, the main job left in the design of our prototype search engine is to do fast pattern matching using a data structure. What do we want from this data structure? Our goal is to process the text so that the occurrence of any search keyword can be found quickly in our list of words or terms. We will preprocess our set of words to facilitate fast queries of search keywords.
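To make the boolean query concrete, here is a small sketch of an AND/OR query against an inverted index. It assumes the index is a plain Python dictionary from lowercased index terms to occurrence lists (a stand-in for the trie), and it uses Python sets for the boolean operations rather than exploiting the sorted order of the lists. The names boolean_query and TOY_INDEX are hypothetical.

```python
def boolean_query(index, terms, mode="AND"):
    """Return the sorted list of page numbers matching all (AND)
    or any (OR) of the given terms."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    # A term that is not in the index has an empty occurrence list.
    sets = [set(index.get(term.lower(), [])) for term in terms]
    if not sets:
        return []
    result = sets[0]
    for other in sets[1:]:
        result = result & other if mode == "AND" else result | other
    return sorted(result)

# Hypothetical inverted index over three numbered pages.
TOY_INDEX = {
    "what": [1, 3],
    "are": [1, 2],
    "suffix": [1],
    "trees": [1, 2],
}

print(boolean_query(TOY_INDEX, ["What", "are", "Suffix", "Trees"]))
# prints [1]
print(boolean_query(TOY_INDEX, ["What", "Trees"], mode="OR"))
# prints [1, 2, 3]
```

Only page 1 contains all four terms, so it is the only result of the AND query.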

How long does searching [a 2-3-4 tree, a treap, a balanced BST, a sorted array] take? The answer we have come to know is O(log s), where s is the number of elements stored. This, however, is not strictly true. If the elements we are dealing with are ints, it is close enough, but what if they are strings? In order to find out whether one string is >, < or = to another, we may need to go through their characters and compare them one by one. So the real answer is that it takes O(M log s), where M is the number of bytes in the longest string or key. Quite clearly we can’t get rid of the M: we have to compare the keys no matter what else we do. What about the log s, though?

The data structure that we will build in the preprocessing phase is called a full-text index. One of the most important full-text indexes is the trie. A search can be done in O(m) time, where m is the length of the keyword being searched. Note that for multiple keywords we have to search the trie multiple times. Note also that this is almost as fast as one can get for searching, because one has to look at the keyword to find its occurrence, and looking at the keyword takes Ω(m) time.

So if we use a trie for our purpose, we have the occurrence list for each word almost as fast as we can read the input keywords. The main time taken by the algorithm is then to bring the occurrence lists into memory (if these lists are on disk) and to do union/intersection computations on these lists. We leave it to you to come up with a fast union/intersection computation on two sorted lists. We now look at the design of the magical super-fast data structure for searching keywords...

5 Trie

A trie is a basic data structure for storing strings in order to support fast pattern matching. Let S be the set of all our index terms, and let s be the number of index terms in our web pages. Let all our strings come from an alphabet Σ of size d. We will also assume that no string in S is a prefix of another string; in other words, S is prefix free. Note that this assumption can be made true for any S by appending a special character that does not occur in Σ to the end of every word.

A trie is a rooted tree whose edges are labelled with characters so that the labels on the edges from a node to its children are all different. A node of a trie represents the string resulting from the concatenation of the labels on the path from the root to the node. In particular, the root represents the empty string. Because of the uniqueness of child edge labels, no two nodes can represent the same string.

The trie, also known as the digital search tree, is a useful data structure for storing a set of strings. Let Trie(S) denote the unique smallest trie that represents all strings of S. Since S is prefix free, the set of strings represented by the leaves of Trie(S) is exactly S. An example will be shown in class.

If n is the total length of the strings of S, the space requirement of Trie(S) is O(n) and it can be constructed in time O(dn). Checking whether a string P of length m is in S, and locating the leaf representing it if it exists, takes O(dm) time (and can easily be improved to O(m) time). More complicated representations exist that save space at the cost of time.

In the next class we will see the design of three kinds of tries: standard tries, Patricia tries and suffix tries. Stay tuned.
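The definition above might be turned into code along the following lines. This is a rough sketch, not the design presented in class: it assumes "$" is a terminator character that never occurs in an index term (which makes the set prefix free), it stores each node's children in a Python dictionary so that following an edge takes expected constant time, and the names END, TrieNode and Trie are hypothetical.

```python
END = "$"  # assumed terminator, must not occur in any index term

class TrieNode:
    def __init__(self):
        self.children = {}       # edge label (one character) -> child node
        self.occurrences = None  # occurrence list, set only at leaves

class Trie:
    def __init__(self):
        self.root = TrieNode()   # the root represents the empty string

    def insert(self, word, occurrences):
        """Add `word` and attach its occurrence list to its leaf."""
        node = self.root
        for ch in word + END:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.occurrences = occurrences

    def search(self, word):
        """Return the occurrence list of `word`, or None if it is absent.
        Follows one edge per character of the keyword."""
        node = self.root
        for ch in word + END:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.occurrences

trie = Trie()
trie.insert("tree", [1, 2])
trie.insert("trees", [1])
trie.insert("trie", [3])

print(trie.search("tree"))   # prints [1, 2]
print(trie.search("trees"))  # prints [1]
print(trie.search("tre"))    # prints None
```

Note that "tree" is a prefix of "trees", yet both end at their own leaves because of the terminator. A search for "tre" fails at the terminator edge, since "tre" was never inserted.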
