Internet Engineering: Search Ali Kamandi Sharif University of Technology kamandi@ce.sharif.edu Spring 2007
Statistics � In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents. � Up to 09/2005 Google indexes 8,200,000,000 web pages. 2
Search Engines � A search engine is a system which collects , organizes & presents a way to select Web documents based on certain words, phrases, or patterns within documents � Model the Web as a full-text DB � Index a portion of the Web docs � Search Web documents using user-specified words/patterns in a text 3
Search Engines � Two categories of search engine � general-purpose search engine, e.g. Yahoo !, AltaVista and Google � special-purpose search engines (or Internet Portals), e.g. LinuxStart ( www.linuxstart.com ) 4
Search Engines � Two main components of a search engine: � web crawler (spider), which collects massive Web pages. � large database, which stores and indexes collected Web pages. � Ranking has to be performed without accessing the text, just the index � Ranking algorithms : all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries 5
What’s Wrong with SQL (Search Quality) (1) select * from content where body like ‘%running%’ select * from content where upper(body) like upper(‘%running%’) 6
What’s Wrong with SQL (Search Quality) (2) select * from content where upper(body) like upper(‘%running shoes%’) select * from content where upper(body) like upper(‘%running%’) and upper(body) like upper(‘%shoes%’) 7
What’s Wrong with SQL (Search Quality) (3) � the more the user tells us about her interests the fewer documents we’ll return in response to a search, � Note that public search engines circa 2005, such as Google, Yahoo, A9, and MSN, do implicitly use AND � If there aren’t any rows with all query terms, we should probably offer the user rows that contain some of the query terms. 8
Stemming � ‘‘My brother-in-law Billy Bob ran 20 miles yesterday’’ � ‘‘My cousin Gertrude runs 15 miles every day.’’ � stemming both the query terms and the indexed terms � ‘‘running,’’ ‘‘runs,’’ and ‘‘ran’’ would all be bashed down to the stem word ‘‘run’’ for indexing and retrieval. 9
expanding queries through a thesaurus � ‘‘I attended the 100th anniversary Boston Marathon’’? � expanding queries through a thesaurus powerful enough to make the connection between ‘‘running’’ and ‘‘marathon.’’ 10
What’s Wrong with SQL (Performance) (1) select * from content where body like ‘%running%’ � The RDBMS must examine every row in the content table to answer this query, scan (O[N] time, where N is the number of rows in the table) 11
What’s Wrong with SQL (Performance) (2) � Suppose that a standard RDBMS index is defined on the body column. The values of body will be used as keys for a BTree and we could perform select * from content where body = ‘running’ � and maybe, depending on the implementation, select * from content where body like ‘running%’ � in O[log N] time. 12
Abandoning the RDBMS � We can solve both the performance and search quality problems by dumping all of our data into a full-text search system. � these systems index every word in a document, not just the first words as with the standard RDBMS B-tree. � A full-text index can answer the question ‘‘Find me the documents containing the word ‘running’ ’’ in time that approaches O[1], indexed. 13
Indexing 14
Constant Time � If there are 10 million documents in the corpus, a search through those 10 million documents will not take much longer than a search through a corpus of 1,000 documents. � Getting close to constant time in this situation would require that � the 10-million-document collection did not use a larger vocabulary than the 1,000-document collection � and that it was not the case that, say, 90 percent of the documents contained the word ‘‘running.’’ 15
Stopwords � Word “the” � O(N) � stopwords, words that are too common to be worth indexing. � For standard English, the stopword list includes such words as ‘‘a,’’ ‘‘and,’’ ‘‘as,’’ ‘‘at,’’ ‘‘for,’’ ‘‘or,’’ ‘‘the,’’ and so forth. 16
Cost � Inserting a new document into the collection will be slow. We’ll have to go through the document, word by word, and update as many rows in the index as there are distinct words in the document. 17
word-frequency histogram � suppose that there are 1,000 documents in the collection containing ‘‘running shoes’’. Which are the most relevant to the user’s query of ‘‘running shoes’’? � We need a new data structure: the word- frequency histogram. � which words occur in a document and how frequently they occur 18
Example ‘‘All happy families resemble one another, but each unhappy family is unhappy in its own way,’ the first sentence of Tolstoy’s Anna Karenina: 19
More Refinements � After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of ‘‘resemble’’ is more interesting than ‘‘happy’’ because ‘‘resemble’’ occurs less frequently in standard English. � Stopwords such as ‘‘is’’ are thrown away altogether. � Stemming is another useful refinement. � In the index and in queries � ‘‘families,’’ � ‘‘family’’ 20
inter-document similarity � Given a body of histograms it is possible to answer queries such as ‘‘Show me documents that are similar to this one’’ or ‘‘Show me documents whose histogram is closest to a user-entered string.’’ � The inter-document similarity query can be handled by comparing histograms already stored in the text database. 21
Working with the Public Search Engines � First, Google has to know about your server. This happens either when someone already in the Google index links to your site or when you manually add your URL from a form off the google.com home page. � Second, Google has to be able to read the text on your server. At least as of 2005 none of the public search engines implemented optical character recognition (OCR). � This means that text embedded in a GIF, Flash animation, or a Java applet won’t be indexed. � Third, Google has to be able to get into all the pages on your server. � Pages requiring registration won’t be indexed by Google unless your software is smart enough to recognize that it is Google behind the request and make an exception. 22
Prevent Search Engine from Archiving � some search engines archive what they index � prevent search engines from archiving the page � <META NAME="ROBOTS“ CONTENT="NOARCHIVE"> placed in the HEAD of your HTML documents � robots are not guaranteed to follow such directives. 23
Add Extra Keywords � in the online table of contents page for this book, we have the following META tags in the HEAD: <meta name="keywords" content="web development online communities MIT 6.171 textbook"> <meta name="description" content="This is the textbook for the MIT course Software Engineering for Internet Applications"> � The ‘‘keywords’’ tag adds some words that are relevant to the document, but not present in the visible text. 24
Tags � These tags have been routinely abused. A publisher might add popular search terms to a site that is unrelated to those terms, in hopes of capturing more readers. � A company might add the names of its competitors as keywords. � Users wouldn’t see these dirty tricks � unless they went to the trouble of using the View Source command in their browser. � Because of this history of abuse, many public search engines ignore these tags. 25
Robot.txt � Standard for Web Exclusion, a protocol for communication between Web publishers and Web crawlers � http://www.robotstxt.org/wc/norobots.html. � /robots.txt, with instructions for robots. � Example: User-agent: * # let’s keep the robots away from our half-baked stuff Disallow: /staging 26
Mobile Users � Mobile Internet devices put an even greater stress on information retrieval. � Connection speeds are slower. � Screens are smaller. � It isn’t practical for a user to drill down into 20 documents returned by a search engine as possibly relevant to a query, especially if the user is driving a car and using a voice browser. 27
Hint � Generally users prefer to browse rather than search. If users are resorting to searches in order to get standard answers or perform common tasks, there may be something wrong with a site’s navigation or information architecture. 28
A split-system approach to providing full-text search. 29
Extra Features � Who Links To You? � link:siteURL � example:link:www.google.com � Web Page Translation � Google offers the following translation pairs: English to and from Arabic, Chinese, French, German, Italian, Korean, Japanese, Spanish, and Portuguese; and German to and from French. 30
Recommend
More recommend