Internet Engineering: Search Ali Kamandi Sharif University of - PowerPoint PPT Presentation

Internet Engineering: Search Ali Kamandi Sharif University of Technology kamandi@ce.sharif.edu Spring 2007

Statistics � In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web accessible documents. � Up to 09/2005 Google indexes 8,200,000,000 web pages. 2

Search Engines � A search engine is a system which collects , organizes & presents a way to select Web documents based on certain words, phrases, or patterns within documents � Model the Web as a full-text DB � Index a portion of the Web docs � Search Web documents using user-specified words/patterns in a text 3

Search Engines � Two categories of search engine � general-purpose search engine, e.g. Yahoo !, AltaVista and Google � special-purpose search engines (or Internet Portals), e.g. LinuxStart ( www.linuxstart.com ) 4

Search Engines � Two main components of a search engine: � web crawler (spider), which collects massive Web pages. � large database, which stores and indexes collected Web pages. � Ranking has to be performed without accessing the text, just the index � Ranking algorithms : all information is “top secret;” it is almost impossible to measure recall as the number of relevant pages can be quite large for simple queries 5

What’s Wrong with SQL (Search Quality) (1) select * from content where body like ‘%running%’ select * from content where upper(body) like upper(‘%running%’) 6

What’s Wrong with SQL (Search Quality) (2) select * from content where upper(body) like upper(‘%running shoes%’) select * from content where upper(body) like upper(‘%running%’) and upper(body) like upper(‘%shoes%’) 7

What’s Wrong with SQL (Search Quality) (3) � the more the user tells us about her interests the fewer documents we’ll return in response to a search, � Note that public search engines circa 2005, such as Google, Yahoo, A9, and MSN, do implicitly use AND � If there aren’t any rows with all query terms, we should probably offer the user rows that contain some of the query terms. 8

Stemming � ‘‘My brother-in-law Billy Bob ran 20 miles yesterday’’ � ‘‘My cousin Gertrude runs 15 miles every day.’’ � stemming both the query terms and the indexed terms � ‘‘running,’’ ‘‘runs,’’ and ‘‘ran’’ would all be bashed down to the stem word ‘‘run’’ for indexing and retrieval. 9

expanding queries through a thesaurus � ‘‘I attended the 100th anniversary Boston Marathon’’? � expanding queries through a thesaurus powerful enough to make the connection between ‘‘running’’ and ‘‘marathon.’’ 10

What’s Wrong with SQL (Performance) (1) select * from content where body like ‘%running%’ � The RDBMS must examine every row in the content table to answer this query, scan (O[N] time, where N is the number of rows in the table) 11

What’s Wrong with SQL (Performance) (2) � Suppose that a standard RDBMS index is defined on the body column. The values of body will be used as keys for a BTree and we could perform select * from content where body = ‘running’ � and maybe, depending on the implementation, select * from content where body like ‘running%’ � in O[log N] time. 12

Abandoning the RDBMS � We can solve both the performance and search quality problems by dumping all of our data into a full-text search system. � these systems index every word in a document, not just the first words as with the standard RDBMS B-tree. � A full-text index can answer the question ‘‘Find me the documents containing the word ‘running’ ’’ in time that approaches O[1], indexed. 13

Indexing 14

Constant Time � If there are 10 million documents in the corpus, a search through those 10 million documents will not take much longer than a search through a corpus of 1,000 documents. � Getting close to constant time in this situation would require that � the 10-million-document collection did not use a larger vocabulary than the 1,000-document collection � and that it was not the case that, say, 90 percent of the documents contained the word ‘‘running.’’ 15

Stopwords � Word “the” � O(N) � stopwords, words that are too common to be worth indexing. � For standard English, the stopword list includes such words as ‘‘a,’’ ‘‘and,’’ ‘‘as,’’ ‘‘at,’’ ‘‘for,’’ ‘‘or,’’ ‘‘the,’’ and so forth. 16

Cost � Inserting a new document into the collection will be slow. We’ll have to go through the document, word by word, and update as many rows in the index as there are distinct words in the document. 17

word-frequency histogram � suppose that there are 1,000 documents in the collection containing ‘‘running shoes’’. Which are the most relevant to the user’s query of ‘‘running shoes’’? � We need a new data structure: the word- frequency histogram. � which words occur in a document and how frequently they occur 18

Example ‘‘All happy families resemble one another, but each unhappy family is unhappy in its own way,’ the first sentence of Tolstoy’s Anna Karenina: 19

More Refinements � After the crude histogram is made, it is typically adjusted for the prevalence of words in standard English. So, for example, the appearance of ‘‘resemble’’ is more interesting than ‘‘happy’’ because ‘‘resemble’’ occurs less frequently in standard English. � Stopwords such as ‘‘is’’ are thrown away altogether. � Stemming is another useful refinement. � In the index and in queries � ‘‘families,’’ � ‘‘family’’ 20

inter-document similarity � Given a body of histograms it is possible to answer queries such as ‘‘Show me documents that are similar to this one’’ or ‘‘Show me documents whose histogram is closest to a user-entered string.’’ � The inter-document similarity query can be handled by comparing histograms already stored in the text database. 21

Working with the Public Search Engines � First, Google has to know about your server. This happens either when someone already in the Google index links to your site or when you manually add your URL from a form off the google.com home page. � Second, Google has to be able to read the text on your server. At least as of 2005 none of the public search engines implemented optical character recognition (OCR). � This means that text embedded in a GIF, Flash animation, or a Java applet won’t be indexed. � Third, Google has to be able to get into all the pages on your server. � Pages requiring registration won’t be indexed by Google unless your software is smart enough to recognize that it is Google behind the request and make an exception. 22

Prevent Search Engine from Archiving � some search engines archive what they index � prevent search engines from archiving the page � <META NAME="ROBOTS“ CONTENT="NOARCHIVE"> placed in the HEAD of your HTML documents � robots are not guaranteed to follow such directives. 23

Add Extra Keywords � in the online table of contents page for this book, we have the following META tags in the HEAD: <meta name="keywords" content="web development online communities MIT 6.171 textbook"> <meta name="description" content="This is the textbook for the MIT course Software Engineering for Internet Applications"> � The ‘‘keywords’’ tag adds some words that are relevant to the document, but not present in the visible text. 24

Tags � These tags have been routinely abused. A publisher might add popular search terms to a site that is unrelated to those terms, in hopes of capturing more readers. � A company might add the names of its competitors as keywords. � Users wouldn’t see these dirty tricks � unless they went to the trouble of using the View Source command in their browser. � Because of this history of abuse, many public search engines ignore these tags. 25

Robot.txt � Standard for Web Exclusion, a protocol for communication between Web publishers and Web crawlers � http://www.robotstxt.org/wc/norobots.html. � /robots.txt, with instructions for robots. � Example: User-agent: * # let’s keep the robots away from our half-baked stuff Disallow: /staging 26

Mobile Users � Mobile Internet devices put an even greater stress on information retrieval. � Connection speeds are slower. � Screens are smaller. � It isn’t practical for a user to drill down into 20 documents returned by a search engine as possibly relevant to a query, especially if the user is driving a car and using a voice browser. 27

Hint � Generally users prefer to browse rather than search. If users are resorting to searches in order to get standard answers or perform common tasks, there may be something wrong with a site’s navigation or information architecture. 28

A split-system approach to providing full-text search. 29

Extra Features � Who Links To You? � link:siteURL � example:link:www.google.com � Web Page Translation � Google offers the following translation pairs: English to and from Arabic, Chinese, French, German, Italian, Korean, Japanese, Spanish, and Portuguese; and German to and from French. 30

Internet Engineering: Search Ali Kamandi Sharif University of - PowerPoint PPT Presentation

Internet Engineering: Search Ali Kamandi Sharif University of Technology kamandi@ce.sharif.edu Spring 2007 Statistics In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 web pages and web

The Economics of Internet Search Hal R. Varian Sept 31, 2007 Search engine use Search

Search Engines Issues Avi Rappoport Search Tools Consulting Search Issues Enterprise Search

Tabu Search Search Tabu Page 1 Part I Part I Tabu Search Principles Search Principles Tabu

Uninformed Search 2 Informed Search Rest of blind search An informed search strategyone

Informed search algorithms Outline Best-first search Greedy best-first search A *

Foundations of Artificial Intelligence 9. State-Space Search: Tree Search and Graph Search Malte

Elastic Search - Aditi Choksi (EW18455) Elastic Search Search engine Distributed

2 EBI Search 3 EBI Search 4 EBI

Balanced Search Trees Binary Search Trees Binary Search Tree Binary Search Tree A binary tree is

Search Algorithms 3 AI Slides (6e) c Lin Zuoquan@PKU 2003-2020 3 1 3 Search Algorithms

Query DB structures Manipulation queries DB search Hits Memory search 2 Standardization of

Search 3 AI Slides (5e) c Lin Zuoquan@PKU 2003-2019 3 1 3 Search 3.1 Problem-solving

Informed Search strategies AIMA sections 3.5, 3.6 Summary Informed Search strategies

Search Overview Introduction to Search Blind Search Techniques Heuristic Search

INTERNET FOR A MOBILE INTERNET FOR A MOBILE GENERATION GENERATION www.itu.int/mobileinternet

History of the Internet Pat Morin COMP 2405 Outline Origins of the Internet Internet

Marktoberdorf NATO Summer School 2016, Lecture 5 Assurance in the Internet of Things And for

FRI I C++ Primer Instructor: Justin Hart http://justinhart.net/teaching/2020_spring_cs309/

Internet-Wide Scanning and its Measurement Applications Zakir Durumeric University of Michigan

4. Searching the Web. Pagerank October 17, 2019 Slides by Marta Arias, Jos Luis Balczar,

Machine Learning 2007: Slides 1 Instructor: Tim van Erven (Tim.van.Erven@cwi.nl) Website:

Pl Plann annin ing g Computer ter Sc Science ce cpsc3 c322 22, , Lectur ture e 18 (Te

Lego Mindstorms Radek Pel anek Lego Mindstorms produced by Lego, http://mindstorms.lego.com

Black Boxes Boxes: : Making Making Ends Ends Meet Meet Black in Data Driven Driven

Sambuz

Useful Links

Newsletter

Mail Us