Class Overview
[Course overview diagram: Content from the Web / Other Cool Stuff, layered as Network Layer, Document Layer (Servers + Crawlers, Crawling), Indexing, Content Analysis, Query Processing; today is a closeup view of the crawling layer]

Today 10/13 – Crawlers
• Search Engine Overview
• HTTP
• Crawlers
• Server Architecture
Schedule: 10/15 – DL in atrium; 10/20 – No class (group meetings); 10/22 – IR, indexing; 10/27 – AltaVista, PageRank; 10/29 – No class

Standard Web Search Engine Architecture
[Diagram: crawl the web -> store documents, check for duplicates, extract links -> DocIds -> create an inverted index -> inverted index; user query -> search engine servers -> show results to user]
[Slide adapted from Marti Hearst / UC Berkeley]

A Closeup View: Crawling
[Same architecture diagram with the crawling portion highlighted: crawl the web; store documents, check for duplicates, extract links; DocIds]
[Slide adapted from Marti Hearst / UC Berkeley]
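As a concrete companion to the "create an inverted index" box in the architecture diagram, here is a minimal sketch of what that data structure looks like; the toy documents and the whitespace tokenizer are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

# Toy documents standing in for the crawled, de-duplicated document store.
docs = {
    1: "search engines crawl the web",
    2: "an inverted index maps words to documents",
    3: "crawlers fetch pages from web servers",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of DocIds containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(docs)
print(index["web"])       # -> [1, 3]
print(index["inverted"])  # -> [2]
```

Real indexers also record positions, term frequencies, and compressed postings, which is exactly what the Indexing closeup below raises as questions.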
A Closeup View: Query Processing
• Efficient processing
• Ranking
[Diagram excerpt: search engine servers -> show results to user]
[Slide adapted from Marti Hearst / UC Berkeley]

A Closeup View: Indexing
• What data is necessary?
• Format?
• Compression?
• Efficient creation
[Diagram excerpt: DocIds -> create an inverted index -> inverted index]
[Slide adapted from Marti Hearst / UC Berkeley]

Precision and Recall
• Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
• Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                   Relevant   Not Relevant
  Retrieved           tp           fp
  Not Retrieved       fn           tn

• Precision P = tp / (tp + fp)
• Recall    R = tp / (tp + fn)

Precision & Recall
• Precision = tp / (tp + fp): proportion of selected items that are correct (the system returned these)
• Recall = tp / (tp + fn): proportion of the actually relevant (target) items that were selected
• A precision-recall curve shows the tradeoff between the two
(A small worked example in code follows below.)

But Really
• Precision & recall are too simple
• Evaluation is a very thorny problem

What Did I Forget?
• Little change to UI
• Faceted interfaces
• Personalization
• Revisiting user query
[Diagram excerpt: show results to user]
[Slide adapted from Marti Hearst / UC Berkeley]
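Here is the small worked example promised above: a sketch that computes precision and recall directly from the tp/fp/fn counts. The document ids and the retrieved/relevant sets are made up purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant docs we returned
    fp = len(retrieved - relevant)          # irrelevant docs we returned
    fn = len(relevant - retrieved)          # relevant docs we missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returns 4 docs, 3 of them relevant,
# out of 6 relevant docs total.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)   # 0.75  0.5
```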
Your Project Architecture?
[Diagram: crawl the web -> store documents, check for duplicates, extract links -> DocIds -> Classify? / Extract -> Relational DB; user query -> Front end -> show results to user]
[Slide adapted from Marti Hearst / UC Berkeley]

Scalability?
[Same project architecture diagram, considered at scale]
[Slide adapted from Marti Hearst / UC Berkeley]

Outline
• Search Engine Overview
• HTTP
• Crawlers
• Server Architecture

Connecting on the WWW
[Diagram: browsers and servers connected through the Internet]

What Happens When You Click?
• Suppose
  – You are at www.yahoo.com/index.html
  – You click on www.grippy.org/mattmarg/
• Browser uses DNS => IP addr for www.grippy.org
• Opens TCP connection to that address
• Sends HTTP request (below)
• One click => several responses
• HTTP 1.0: new TCP connection for each element/page
• HTTP 1.1: KeepAlive allows several requests per connection

HTTP Request
  Request line:
    GET /mattmarg/ HTTP/1.0
  Request headers:
    User-Agent: Mozilla/2.0 (Macintosh; I; PPC)
    Accept: text/html; */*
    Cookie: name = value
    Referer: http://www.yahoo.com/index.html
    Host: www.grippy.org
    Expires: ...
    If-Modified-Since: ...

HTTP Response
  Status line:
    HTTP/1.0 200 OK
  Headers:
    Date: Mon, 10 Feb 1997 23:48:22 GMT
    Server: Apache/1.1.1 HotWired/1.0
    Content-type: text/html
    Last-Modified: Tue, 11 Feb 1999 22:45:55 GMT
  (Response bodies may be text/html, image/jpeg, ...)
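To see those request and response lines on the wire, here is a sketch that issues the slide's GET by hand over a raw TCP socket. The grippy.org host is the slide's example and may no longer resolve; a real browser or crawler would use an HTTP library instead.

```python
import socket

HOST, PATH = "www.grippy.org", "/mattmarg/"

request = (
    f"GET {PATH} HTTP/1.0\r\n"
    f"Host: {HOST}\r\n"
    "User-Agent: example-client/0.1\r\n"
    "Accept: text/html\r\n"
    "\r\n"                              # blank line ends the request headers
)

with socket.create_connection((HOST, 80), timeout=10) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):     # HTTP/1.0: read until the server closes
        response += chunk

status_line = response.split(b"\r\n", 1)[0]
print(status_line.decode())             # e.g. "HTTP/1.0 200 OK" or a redirect
```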
Response Status Lines
• 1xx Informational
• 2xx Success
  – 200 OK
• 3xx Redirection
  – 302 Moved Temporarily
• 4xx Client Error
  – 404 Not Found
• 5xx Server Error

HTTP Methods
• GET
  – Bring back a page
• HEAD
  – Like GET, but just return the headers
• POST
  – Used to send data to the server to be processed (e.g. CGI)
  – Different from GET:
    • A block of data is sent with the request, in the body, usually with extra headers like Content-Type: and Content-Length:
    • The request URL is not a resource to retrieve; it's a program to handle the data being sent
    • The HTTP response is normally program output, not a static file
• PUT, DELETE, ...

Logging Web Activity
• Most servers support "common logfile format" or "extended logfile format"
    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
• Apache lets you customize the format
• Every HTTP event is recorded
  – Page requested
  – Remote host
  – Browser type
  – Referring page
  – Time of day
• Applications of data-mining logfiles??

Cookies
• Small piece of info
  – Sent by server as part of response header
  – Stored on disk by browser; returned in request header
  – May have expiration date (then deleted from disk)
• Associated with a specific domain & directory
  – Only given to the site where originally made
  – Many sites have multiple cookies
  – Some have multiple cookies per page!
• Most data stored as name=value pairs
• See
  – C:\Program Files\Netscape\Users\default\cookies.txt
  – C:\WINDOWS\Cookies

HTTPS
• Secure connections
• Encryption: SSL/TLS
• Fairly straightforward:
  – Agree on crypto protocol
  – Exchange keys
  – Create a shared key
  – Use the shared key to encrypt data
• Certificates

CRAWLERS…
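Before turning to full crawlers, here is a sketch of the single-page fetch they are built around, using Python's urllib from the standard library: it sends a descriptive User-Agent, lets the library handle the TLS handshake for HTTPS, and branches on the status classes listed above. The URL and contact address are placeholders, not values from the slides.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

# Placeholder values: substitute a real crawl target and real contact info.
URL = "https://example.com/"
USER_AGENT = "course-crawler/0.1 (contact: student@example.edu)"

def fetch(url):
    """Fetch one page over HTTP(S); return (status, content type, body) or None."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urlopen(req, timeout=10) as resp:      # urllib performs the TLS handshake
            return resp.status, resp.headers.get("Content-Type"), resp.read()
    except HTTPError as err:                        # 4xx / 5xx status lines
        print(f"{url}: server answered {err.code} {err.reason}")
    except URLError as err:                         # DNS failure, refused connection, ...
        print(f"{url}: could not connect ({err.reason})")
    return None

result = fetch(URL)
if result:
    status, ctype, body = result
    print(status, ctype, len(body), "bytes")
```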
Danger Will Robinson!!
• Consequences of a bug
• Max 6 hits/server/minute, plus…
  http://www.cs.washington.edu/lab/policies/crawlers.html

Open-Source Crawlers
• GNU Wget
  – Utility for downloading files from the Web
  – Fine if you just need to fetch files from 2-3 sites
• Heritrix
  – Open-source, extensible, Web-scale crawler
  – Easy to get running
  – Web-based UI
• Nutch
  – Featureful, industrial-strength Web search package
  – Includes Lucene information retrieval part
    • TF/IDF and other document ranking
    • Optimized, inverted-index data store
  – You get complete control through easy programming

Thinking about Efficiency
• Clock cycle: 2 GHz
  – Typically completes 2 instructions / cycle
    • ~10 cycles / instruction, but pipelining & parallel execution
  – Thus: 4 billion instructions / sec
• Disk access: 1-10 ms
  – Depends on seek distance; published average is 5 ms
  – Thus perform 200 seeks / sec
  – (And we are ignoring rotation and transfer times)
• Disk is 20 million times slower!!!
• Store the index in an Oracle database?
• Store the index using files and the unix filesystem?

Search Engine Architecture
• Crawler (Spider)
  – Searches the web to find pages. Follows hyperlinks. Never stops
• Indexer
  – Produces data structures for fast searching of all words in the pages
• Retriever
  – Query interface
  – Database lookup to find hits
    • 300 million documents
    • 300 GB RAM, terabytes of disk
  – Ranking, summaries
• Front End

Spiders = Crawlers
• 1000s of spiders
• Various purposes:
  – Search engines
  – Digital rights management
  – Advertising
  – Spam
  – Link checking, site validation

Spiders (Crawlers, Bots)
• Queue := initial page URL0
• Do forever
  – Dequeue URL
  – Fetch P
  – Parse P for more URLs; add them to queue
  – Pass P to (specialized?) indexing program
• Issues…
  – Which page to look at next?
    • keywords, recency, focus, ???
  – Avoid overloading a site
  – How deep within a site to go?
  – How frequently to visit pages?
  – Traps!
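A minimal sketch of the queue-driven loop just described, bounded rather than running forever, with a per-host delay of roughly 10 seconds to stay under the 6 hits/server/minute limit. The seed URL and page cap are arbitrary placeholders, the regex link "parser" is deliberately naive, and robots.txt handling (covered next) is left out so the loop stays visible.

```python
import re
import time
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

SEED = "https://example.com/"          # placeholder seed
MAX_PAGES = 20                         # bounded, unlike the "do forever" on the slide
PER_HOST_DELAY = 10.0                  # about 6 hits/server/minute
USER_AGENT = "course-crawler/0.1 (contact: student@example.edu)"

def index_page(url, page):
    """Stand-in for the (specialized?) indexing program the page is passed to."""
    print(f"indexed {url}: {len(page)} chars")

def crawl(seed):
    frontier = deque([seed])           # Queue := initial page URL0
    seen = {seed}                      # avoid circularities and duplicates
    last_hit = {}                      # host -> time of last request (politeness)
    fetched = 0

    while frontier and fetched < MAX_PAGES:
        url = frontier.popleft()                      # dequeue URL
        host = urlparse(url).netloc
        wait = PER_HOST_DELAY - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)                          # don't overload any one server
        last_hit[host] = time.time()

        try:                                          # fetch P
            req = Request(url, headers={"User-Agent": USER_AGENT})
            with urlopen(req, timeout=10) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except OSError:                               # DNS errors, 4xx/5xx, timeouts, ...
            continue
        fetched += 1

        index_page(url, page)                         # pass P to the indexing program
        for href in re.findall(r'href="([^"]+)"', page):   # parse P for more URLs
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)                 # add them to the queue

crawl(SEED)
```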
Crawling Issues
• Storage efficiency
• Search strategy
  – Where to start
  – Link ordering
  – Circularities
  – Duplicates
  – Checking for changes
• Politeness
  – Forbidden zones: robots.txt
  – CGI & scripts
  – Load on remote servers
  – Bandwidth (download only what you need)
• Parsing pages for links
• Scalability
• Malicious servers: SEOs

Robot Exclusion
• A person may not want certain pages indexed
• Crawlers should obey the Robot Exclusion Protocol
  – But some don't
• Look for the file robots.txt at the highest directory level
  – If the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line:
    <META NAME="ROBOTS" CONTENT="NOINDEX">

Danger, Danger
• Ensure that your crawler obeys robots.txt
• Don't make any of these specific gaffes
• Provide contact info in the user-agent field
• Monitor the email address
• Notify the CS Lab Staff
• Honor all Do Not Scan requests
• Post any "stop-scanning" requests
• "The scanee is always right."
• Max 6 hits/server/minute

Robots Exclusion Protocol
• Format of robots.txt
  – Two fields: User-agent to specify a robot
  – Disallow to tell the agent what to ignore
• To exclude all robots from a server:
    User-agent: *
    Disallow: /
• To exclude one robot from two directories:
    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/
• View the robots.txt specification at
  http://info.webcrawler.com/mak/projects/robots/norobots.html

Outgoing Links?
• Parse HTML…
• Looking for…what?
[Figure: a fragment of messy raw HTML with an "A HREF = www.cs…" link buried inside it]

Which Tags / Attributes Hold URLs?
• Anchor tag: <a href="URL" …> … </a>
• Option tag: <option value="URL" …> … </option>
• Map: <area href="URL" …>
• Frame: <frame src="URL" …>
• Link to an image: <img src="URL" …>
• Relative path vs. absolute path: <base href= …>
• Bonus problem: Javascript
• In our favor: Search Engine Optimization
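A sketch of pulling URLs out of the tags and attributes listed above, using the standard library's html.parser; it honors <base href=…> so relative paths resolve against the right base, but, as the slide warns, links generated by Javascript are out of reach for this kind of static parsing. The sample HTML and page URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute pairs that can carry URLs, per the slide's list.
URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
             "img": "src", "option": "value"}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url      # default base is the page's own URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):   # <base href=...> overrides the base
            self.base = attrs["href"]
        attr = URL_ATTRS.get(tag)
        if attr and attrs.get(attr):
            self.links.append(urljoin(self.base, attrs[attr]))

sample_html = ('<base href="http://www.cs.washington.edu/">'
               '<a href="lab/">lab</a><img src="logo.gif">')
parser = LinkExtractor("http://example.com/page.html")
parser.feed(sample_html)
print(parser.links)
# ['http://www.cs.washington.edu/lab/', 'http://www.cs.washington.edu/logo.gif']
```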