Servers + Crawlers

Outline
• HTTP
• Crawling the Web
• Server Architecture

Connecting on the WWW

What happens when you click?
• Suppose
  – You are at www.yahoo.com/index.html
  – You click on www.grippy.org/mattmarg/
• Browser uses DNS to map www.grippy.org to an IP address
• Opens a TCP connection to that address
• Sends an HTTP request:

    GET /mattmarg/ HTTP/1.0                        (request line)
    User-Agent: Mozilla/2.0 (Macintosh; I; PPC)    (request headers follow)
    Accept: text/html; */*
    Cookie: name = value
    Referer: http://www.yahoo.com/index.html
    Host: www.grippy.org
    Expires: …
    If-Modified-Since: ...

HTTP Response
• Response status lines
  – 1xx Informational
  – 2xx Success
    • 200 OK
  – 3xx Redirection
    • 302 Moved Temporarily
  – 4xx Client Error
    • 404 Not Found
  – 5xx Server Error
• Example response:

    HTTP/1.0 200 OK                                (status line)
    Date: Mon, 10 Feb 1997 23:48:22 GMT            (response headers follow)
    Server: Apache/1.1.1 HotWired/1.0
    Content-Type: text/html                        (or image/jpeg, ...)
    Last-Modified: Tue, 11 Feb 1999 22:45:55 GMT

• One click => several responses
• HTTP/1.0: a new TCP connection for each element/page
• HTTP/1.1: Keep-Alive – several requests per connection
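To make the exchange above concrete, here is a minimal Python sketch that sends the slide's HTTP/1.0 request over a raw TCP socket and prints the status line and headers that come back. The host and path are the slide's example values (www.grippy.org may not resolve today); any reachable HTTP server would do.

```python
# Minimal sketch: issue the slide's HTTP/1.0 request over a raw TCP socket
# and read back the status line and response headers.
import socket

HOST, PATH = "www.grippy.org", "/mattmarg/"   # example values from the slide

request = (
    f"GET {PATH} HTTP/1.0\r\n"
    f"Host: {HOST}\r\n"
    "User-Agent: Mozilla/2.0 (Macintosh; I; PPC)\r\n"
    "Accept: text/html, */*\r\n"
    "Referer: http://www.yahoo.com/index.html\r\n"
    "\r\n"                                    # blank line ends the request headers
)

with socket.create_connection((HOST, 80), timeout=10) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):           # HTTP/1.0: server closes when done
        response += chunk

head, _, body = response.partition(b"\r\n\r\n")
status_line, *headers = head.decode("iso-8859-1").split("\r\n")
print(status_line)                            # e.g. "HTTP/1.0 200 OK"
for h in headers:
    print(h)                                  # Date:, Server:, Content-Type:, ...
```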
HTTP Methods
• GET
  – Bring back a page
• HEAD
  – Like GET, but return only the headers
• POST
  – Used to send data to the server to be processed (e.g., CGI)
  – Different from GET:
    • A block of data is sent with the request, in the body, usually with
      extra headers like Content-Type: and Content-Length:
    • The request URL is not a resource to retrieve; it names a program to
      handle the data being sent
    • The HTTP response is normally program output, not a static file
• PUT, DELETE, ...

Logging Web Activity
• Most servers support the "common logfile format" or "extended logfile
  format" (a parsing sketch follows these slides):

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

• Apache lets you customize the format
• Every HTTP event is recorded
  – Page requested
  – Remote host
  – Browser type
  – Referring page
  – Time of day
• Applications of data-mining logfiles??

HTTPS
• Secure connections
• Encryption: SSL/TLS
• Fairly straightforward:
  – Agree on a crypto protocol
  – Exchange keys
  – Create a shared key
  – Use the shared key to encrypt data
• Certificates

Cookies
• Small piece of info
  – Sent by the server as part of the response header
  – Stored on disk by the browser; returned in the request header
  – May have an expiration date (then deleted from disk)
• Associated with a specific domain & directory
  – Only given to the site where originally made
  – Many sites have multiple cookies
  – Some have multiple cookies per page!
• Most data stored as name=value pairs
• See
  – C:\Program Files\Netscape\Users\default\cookies.txt
  – C:\WINDOWS\Cookies

Standard Web Search Engine Architecture
(diagram: crawlers crawl the web; pages are stored, checked for duplicates, and
links are extracted; DocIds feed an indexer that creates an inverted index;
search engine servers use the inverted index to answer user queries and show
results to the user)
[Slide adapted from Marti Hearst / UC Berkeley]
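Since the logging slide shows the common logfile format, here is a small sketch that parses that exact example line with a regular expression. The group names are my own labels for the fields the slide lists (remote host, user, timestamp, request, status, bytes); a production parser would handle more edge cases.

```python
# Sketch: parse the Common Log Format line from the slide with a regex.
import re

LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
entry = LOG_RE.match(line).groupdict()
method, path, version = entry["request"].split()

print(entry["host"], entry["time"])    # 127.0.0.1 10/Oct/2000:13:55:36 -0700
print(method, path, entry["status"])   # GET /apache_pb.gif 200
```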
Your Project Architecture?
(diagram, adapted from the standard architecture: a standard crawler crawls the
web; pages are stored, checked for duplicates, and links are extracted; DocIds
feed classify/extract steps and a relational DB; a front end answers user
queries and shows results to the user)
[Slide adapted from Marti Hearst / UC Berkeley]

Open-Source Crawlers
• GNU Wget
  – Utility for downloading files from the Web
  – Fine if you just need to fetch files from 2-3 sites
• Heritrix
  – Open-source, extensible, Web-scale crawler
  – Easy to get running
  – Web-based UI
• Nutch
  – Featureful, industrial-strength Web search package
  – Includes the Lucene information-retrieval part
    • TF/IDF and other document ranking
    • Optimized, inverted-index data store
  – You get complete control through easy programming

How Inverted Files are Created
(diagram: the crawler fills a repository of docs; a scan pass builds the
forward index, with pointers to the docs, and the lexicon; sorting yields the
sorted index; a second scan produces the inverted file list – a toy code sketch
follows these slides)

Search Engine Architecture
• Crawler (Spider)
  – Searches the web to find pages. Follows hyperlinks. Never stops
• Indexer
  – Produces data structures for fast searching of all words in the pages
• Retriever
  – Query interface
  – Database lookup to find hits
  – Ranking, summaries
• Front End

Thinking about Efficiency
• Clock cycle: 2 GHz
  – Typically completes 2 instructions / cycle
  – ~10 cycles / instruction, but pipelining & parallel execution
  – Thus: 4 billion instructions / sec
• Disk access: 1-10 ms
  – Depends on seek distance; published average is 5 ms
  – Thus: 200 seeks / sec
  – (And we are ignoring rotation and transfer times)
• Disk is 20 million times slower!!!
• 300 million documents; 300 GB RAM, terabytes of disk
  – Store the index in an Oracle database?
  – Store the index using files and the Unix filesystem?
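As a toy illustration of the scan–sort–scan pipeline in "How Inverted Files are Created", the sketch below builds an inverted file from a three-document in-memory repository. The document texts are made up, and real systems sort postings on disk and store them compactly; this only shows the data flow.

```python
# Minimal sketch of the forward-index -> sort -> inverted-file pipeline,
# run over a tiny in-memory "repository".
from collections import defaultdict

repository = {                       # docid -> document text (toy data)
    1: "crawlers crawl the web",
    2: "the inverted index maps words to docids",
    3: "crawl the web then index the pages",
}

# Pass 1: scan the repository, emitting (term, docid) postings and the lexicon.
postings = []
lexicon = set()
for docid, text in repository.items():
    for term in text.lower().split():
        postings.append((term, docid))
        lexicon.add(term)

# Sort by term (then docid) and collapse into the inverted file:
# term -> sorted list of docids containing it.
inverted = defaultdict(list)
for term, docid in sorted(set(postings)):
    inverted[term].append(docid)

print(sorted(lexicon)[:5])
print(inverted["the"])               # [1, 2, 3]
print(inverted["crawl"])             # [1, 3]
```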
Spiders (Crawlers, Bots)
• Queue := initial page URL0
• Do forever (sketched in code after these slides)
  – Dequeue URL
  – Fetch P
  – Parse P for more URLs; add them to the queue
  – Pass P to a (specialized?) indexing program
• Issues…
  – Which page to look at next?
    • Keywords, recency, focus, ???
  – Avoid overloading a site
  – How deep within a site to go?
  – How frequently to visit pages?
  – Traps!

Spiders = Crawlers
• 1000s of spiders
• Various purposes:
  – Search engines
  – Digital rights management
  – Advertising
  – Spam
  – Link checking – site validation

Crawling Issues
• Storage efficiency
• Search strategy
  – Where to start
  – Link ordering
  – Circularities
  – Duplicates
  – Checking for changes
• Politeness
  – Forbidden zones: robots.txt
  – CGI & scripts
  – Load on remote servers
  – Bandwidth (download only what you need)
• Parsing pages for links
• Scalability
• Malicious servers: SEOs

Robot Exclusion
• A person may not want certain pages indexed
• Crawlers should obey the Robot Exclusion Protocol – but some don't
• Look for the file robots.txt at the highest directory level
  – If the domain is www.ecom.cmu.edu, robots.txt goes at
    www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from a crawler by adding the line:

    <META NAME="ROBOTS" CONTENT="NOINDEX">

Robots Exclusion Protocol
• Format of robots.txt
  – Two fields: User-agent to specify a robot, Disallow to tell that agent
    what to ignore
• To exclude all robots from a server:

    User-agent: *
    Disallow: /

• To exclude one robot from two directories:

    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/

• View the robots.txt specification at
  http://info.webcrawler.com/mak/projects/robots/norobots.html

Outgoing Links?
• Parse HTML…
• Looking for…what?
(figure: a jumble of page tokens – anns, html, foos, Bar, baz, "A href =
www.cs", Frame, font, ",li>", … – illustrating that the crawler must pick the
href values out of the surrounding token soup; the relevant tags and
attributes are listed on the next slide)
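A minimal Python sketch of the spider loop from the "Spiders (Crawlers, Bots)" slide, with the robots.txt check the Robot Exclusion slides call for, via the standard library's urllib.robotparser. The seed URL, the "ExampleBot" user-agent, and the stubbed-out extract_links/index helpers are placeholders, not anything from the slides; the tags to look for are listed on the next slide.

```python
# Sketch of the slide's spider loop: dequeue a URL, fetch, index, enqueue links.
from collections import deque
from urllib.parse import urlsplit
from urllib import request, robotparser

def allowed(url, agent="ExampleBot", _cache={}):
    """Fetch and cache robots.txt for the URL's host, then consult it."""
    host = "{0.scheme}://{0.netloc}".format(urlsplit(url))
    if host not in _cache:                       # per-host cache (deliberate mutable default)
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()
        _cache[host] = rp
    return _cache[host].can_fetch(agent, url)

def extract_links(base_url, html):
    return []          # placeholder: see the tag/attribute list and extractor sketch below

def index(url, html):
    pass               # placeholder: hand the page to the indexing program

def crawl(seed, limit=100):
    queue, seen = deque([seed]), {seed}          # breadth-first (FIFO) frontier
    while queue and limit > 0:
        url = queue.popleft()                    # dequeue URL
        if not allowed(url):
            continue                             # obey the Robot Exclusion Protocol
        with request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        index(url, html)                         # pass P to the indexing program
        for link in extract_links(url, html):    # parse P for more URLs
            if link not in seen:                 # avoid circularities/duplicates
                seen.add(link)
                queue.append(link)
        limit -= 1

# crawl("http://example.com/")
```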
Which tags / attributes hold URLs?
• Anchor tag: <a href="URL" …> … </a>
• Option tag: <option value="URL" …> … </option>
• Map: <area href="URL" …>
• Frame: <frame src="URL" …>
• Link to an image: <img src="URL" …>
• Relative path vs. absolute path: <base href=…>
• Bonus problem: Javascript
• In our favor: Search Engine Optimization
(see the extractor sketch after these slides)

Web Crawling Strategy
• Starting location(s)
• Traversal order
  – Depth first (LIFO)
  – Breadth first (FIFO)
  – Or ???
• Politeness
• Cycles?
• Coverage?

Structure of Mercator Spider
(diagram: URL frontier, protocol and processing modules, document fingerprints)
1. Remove URL from queue
2. Simulate network protocols & REP
3. Read with RewindInputStream (RIS)
4. Has document been seen before? (checksums and fingerprints)
5. Extract links
6. Download new URL?
7. Has URL been seen before?
8. Add URL to frontier

URL Frontier (priority queue)
• Most crawlers do breadth-first search from seeds
• Politeness constraint: don't hammer servers!
  – Obvious implementation: a "live host table"
  – Will it fit in memory?
  – Is this efficient?
• Mercator's politeness:
  – One FIFO subqueue per thread
  – Choose the subqueue by hashing the host's name
  – Dequeue the first URL whose host has NO outstanding requests

Fetching Pages
• Need to support http, ftp, gopher, ....
  – Extensible!
• Need to fetch multiple pages at once
• Need to cache as much as possible
  – DNS
  – robots.txt
  – Documents themselves (for later processing)
• Need to be defensive!
  – Time out HTTP connections
  – Watch for "crawler traps" (e.g., infinite URL names)
  – See section 5 of the Mercator paper
  – Use a URL filter module
  – Checkpointing!

Duplicate Detection
• URL-seen test: has this URL been seen before?
  – To save space, store a hash
• Content-seen test: different URL, same doc
  – Suppress link extraction from mirrored pages
• What to save for each doc?
  – 64-bit "document fingerprint"
  – Minimize the number of disk reads upon retrieval
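Following the tag/attribute list above, here is a sketch of a link extractor built on Python's html.parser. It covers the <a>, <area>, <frame>, <img>, and <option> cases from the slide and resolves relative paths against <base href> or the page URL. The example page URL reuses the slide's www.grippy.org; JavaScript-generated links are out of scope, as the slide notes.

```python
# Sketch: pull URLs out of the tag/attribute pairs listed on the slide and
# resolve relative paths against <base href> or the page's own URL.
from html.parser import HTMLParser
from urllib.parse import urljoin

URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
             "img": "src", "option": "value"}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url          # overridden if a <base href> appears
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]
        elif tag in URL_ATTRS and attrs.get(URL_ATTRS[tag]):
            # urljoin handles both relative and absolute paths
            self.links.append(urljoin(self.base, attrs[URL_ATTRS[tag]]))

html = '<a href="/mattmarg/">link</a> <img src="logo.gif"> <frame src="http://example.com/f.html">'
p = LinkExtractor("http://www.grippy.org/index.html")
p.feed(html)
print(p.links)
# ['http://www.grippy.org/mattmarg/', 'http://www.grippy.org/logo.gif', 'http://example.com/f.html']
```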
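And a rough sketch of the Mercator-style frontier and the two duplicate tests described above, under the simplifying assumption that everything fits in memory. Subqueues are chosen by hashing the host name, a URL is handed out only when its host has no outstanding request, and truncated SHA-1 digests stand in for the paper's 64-bit fingerprints; the function names are my own, not Mercator's.

```python
# Sketch: polite subqueues + URL-seen and content-seen tests (in-memory).
import hashlib
from collections import deque
from urllib.parse import urlsplit

NUM_SUBQUEUES = 8
subqueues = [deque() for _ in range(NUM_SUBQUEUES)]
busy_hosts = set()            # hosts with an outstanding request
seen_urls = set()             # URL-seen test: store a hash, not the URL
seen_docs = set()             # content-seen test: 64-bit document fingerprint

def fingerprint(data: bytes) -> int:
    """64-bit fingerprint (first 8 bytes of SHA-1) standing in for Mercator's checksum."""
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def add_url(url: str):
    h = fingerprint(url.encode())
    if h in seen_urls:
        return                                    # URL seen before: skip
    seen_urls.add(h)
    host = urlsplit(url).netloc
    subqueues[fingerprint(host.encode()) % NUM_SUBQUEUES].append(url)

def next_url():
    """Dequeue the first URL whose host has NO outstanding request."""
    for q in subqueues:
        if q and urlsplit(q[0]).netloc not in busy_hosts:
            url = q.popleft()
            busy_hosts.add(urlsplit(url).netloc)
            return url
    return None                                   # every queued host is busy

def finished(url: str, body: bytes) -> bool:
    """Mark the host idle; return True only if this content is new."""
    busy_hosts.discard(urlsplit(url).netloc)
    fp = fingerprint(body)
    if fp in seen_docs:
        return False        # same doc under a different URL: suppress its links
    seen_docs.add(fp)
    return True
```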