1. Web Crawling
• Najork and Heydon, High-Performance Web Crawling, Compaq SRC Research Report 173, 2001. Also in Handbook of Massive Data Sets, Kluwer, 2001.
• Najork and Wiener, Breadth-first search crawling yields high-quality pages. Proc. 10th Int. WWW Conf., 2001.

2. Web Crawling
Web crawling = graph traversal:

    S = {start pages}
    repeat
        remove an element s from S
        foreach edge (s, v)
            if v not crawled before
                insert v in S
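
A minimal runnable sketch of this loop in Python; fetch_links is a hypothetical helper (not from the slides) that downloads a page and returns the URLs it links to:

    from collections import deque

    def crawl(start_pages, fetch_links):
        """Graph-traversal crawler skeleton.

        fetch_links(url) -> iterable of URLs linked from the page at url
        (hypothetical helper; a real one would do an HTTP fetch and
        parse the HTML for links).
        """
        frontier = deque(start_pages)   # S: discovered but not yet crawled
        seen = set(start_pages)         # everything ever inserted in S
        while frontier:
            s = frontier.popleft()      # FIFO removal = breadth-first order
            for v in fetch_links(s):
                if v not in seen:       # "if v not crawled before"
                    seen.add(v)
                    frontier.append(v)

How s is removed from S is exactly the crawl-strategy choice of the following slides: a FIFO gives BFS, a stack gives DFS, a priority queue gives priority search.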

3. Issues
Theoretical:
• Start set S
• Choice of s (crawl strategy)
• Refreshing of changing pages
Practical:
• Load balancing (own resources and resources of crawled sites)
• Size of data (compact representations)
• Performance (I/Os)

4. Crawl Strategy
• Breadth-first search
• Depth-first search
• Random
• Priority search (see the sketch below)
Possible priorities:
• Often-changing pages (how to estimate the change rate?)
• A global ranking scheme (e.g. PageRank)
• A query-dependent ranking scheme ("focused crawling", "collection building")
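
A priority-search frontier is just a max-priority queue over URLs; a minimal sketch, where the score function (e.g. an estimated PageRank or change rate) is assumed to be supplied by the caller:

    import heapq
    import itertools

    class PriorityFrontier:
        """Frontier that always yields the highest-scoring URL next."""

        def __init__(self, score):
            self.score = score            # url -> number; higher = crawl sooner
            self.heap = []
            self.tie = itertools.count()  # tie-breaker for equal scores

        def push(self, url):
            # heapq is a min-heap, so negate the score for max-first order.
            heapq.heappush(self.heap, (-self.score(url), next(self.tie), url))

        def pop(self):
            return heapq.heappop(self.heap)[2]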

5. BFS is Good
[Figure 1: Average PageRank score by day of crawl. Figure 2: Average day on which the top N pages were crawled. Pages crawled early under BFS have markedly higher average PageRank.]
[From: Najork and Wiener, 2001]
Statistics for a crawl of 328 million pages.

6. PageRank Priority is Even Better
(but computationally expensive to use. . . )
[Figure 2: The performance of various ordering metrics (PageRank, backlink count, breadth-first, random) for IB(P); G = 100. Plots the fraction of hot pages crawled against the fraction of pages crawled.]
[From: Arasu et al., Searching the Web. ACM Trans. Internet Technology, 1, 2001]
Statistics for a crawl of 225,000 pages at Stanford.

7. Load Balancing
Own resources:
• Bandwidth (control the global rate of requests)
• Storage (compact representations, compression)
• Industrial-strength crawlers must be distributed (e.g. partition the URL space)
Resources of others:
• BANDWIDTH. Control the local rate of requests (e.g. 30 seconds between requests to the same site); see the sketch below.
• Identify yourself in the request. Give contact info (mail and WWW).
• Monitor the crawl.
• Obey the Robots Exclusion Protocol (see www.robotstxt.org).
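
A minimal sketch of the per-site rate limit, assuming a single-threaded crawler and a fixed 30-second delay per host:

    import time
    from urllib.parse import urlsplit

    POLITENESS_DELAY = 30.0   # seconds between requests to the same site
    next_allowed = {}         # host -> earliest time we may contact it again

    def wait_for_host(url):
        """Block until url's host may politely be contacted, then reserve it."""
        host = urlsplit(url).netloc
        now = time.monotonic()
        ready = next_allowed.get(host, now)
        if ready > now:
            time.sleep(ready - now)
        next_allowed[host] = time.monotonic() + POLITENESS_DELAY

A real multi-threaded crawler serves other hosts instead of blocking; the Mercator frontier later in the deck shows that design.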

8. Efficiency
• RAM: never enough for serious crawls. Efficient use of disk-based storage is important; I/O when accessing data structures is often a bottleneck.
• CPU cycles: not a problem (Java and scripting languages are fine).
• DNS lookup can be a bottleneck if a synchronous resolver is used. Use asynchronous DNS (e.g. the GNU adns library); see the sketch below.
Rates reported for serious crawlers: 200-400 pages/sec.
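
The point of asynchronous DNS is to overlap many lookups; the effect can be approximated in plain Python with a thread pool, as in this sketch:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    def resolve_all(hostnames, workers=64):
        """Resolve many hostnames concurrently so that one slow lookup
        does not stall the crawl; returns {hostname: IP or None}."""
        def lookup(host):
            try:
                return socket.gethostbyname(host)
            except socket.gaierror:
                return None
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(zip(hostnames, pool.map(lookup, hostnames)))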

9. Crawler Example: Mercator
[Figure 1: Mercator's main components. The URL Frontier feeds worker threads that run a pipeline: DNS Resolver; protocol modules (HTTP, FTP, Gopher) fetching from the Internet; RIS (Rewind Input Stream); a Content-Seen? test against the set of document fingerprints (FPs); processing modules (Link Extractor, Tag Counter, GIF Stats) plus Log files; URL Filter; and the DUE (Duplicate URL Eliminator) with its URL Set, which feeds new URLs back into the URL Frontier.]
[From: Najork and Heydon, 2001]

10. Mercator
Further features:
• Uses fingerprinting (a (sparse) hash function on strings) for URL IDs, e.g. MD5 (128 bits) or the SHA family (160-512 bits); see the sketch below.
• Continuous crawling: crawled pages are put back in the queue (prioritized using update history).
• Checkpointing (crash recovery).
• Very modular structure.
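
A URL fingerprint of this kind takes a few lines with Python's hashlib; truncating the digest to 64 bits is an illustrative choice here, not necessarily Mercator's exact scheme:

    import hashlib

    def fingerprint(url: str) -> int:
        """64-bit URL fingerprint taken from an MD5 digest. At crawl
        scale, 64-bit collisions are rare enough for this sketch;
        keep more digest bytes if that worries you."""
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")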

11. Details: Politeness
A polite, dynamic, prioritizing frontier:
[Figure 3: Our best URL frontier implementation. A prioritizer routes incoming URLs into k front-end FIFO queues, one per priority level. A random queue chooser, biased toward high-priority queues, hands URLs to a back-end queue router, which uses a host-to-queue table to append each URL to one of n back-end FIFO queues (one per host, many more queues than worker threads). A back-end queue selector, driven by a priority queue (e.g. a heap) of per-host ready times, picks which queue to serve next.]
[From: Najork and Heydon, 2001]
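
A compressed sketch of that design, assuming one back-end queue per host and a heap keyed on the earliest time each host may be contacted again:

    import heapq
    import random
    import time
    from collections import deque
    from urllib.parse import urlsplit

    class PoliteFrontier:
        """Mercator-style frontier sketch: priority front-end queues,
        one back-end FIFO queue per host, heap of (ready_time, host)."""

        def __init__(self, levels=3, delay=30.0):
            self.front = [deque() for _ in range(levels)]  # 0 = highest
            self.back = {}      # host -> deque of URLs for that host
            self.ready = []     # heap of (time host may be contacted, host)
            self.delay = delay

        def add(self, url, priority=0):
            self.front[priority].append(url)

        def _refill(self):
            """Move one URL from a front-end queue to its host's queue,
            preferring (with some randomness) high-priority queues."""
            order = sorted(range(len(self.front)),
                           key=lambda i: i + random.random())
            for i in order:
                if self.front[i]:
                    url = self.front[i].popleft()
                    host = urlsplit(url).netloc
                    if host not in self.back:
                        self.back[host] = deque()
                        heapq.heappush(self.ready, (time.monotonic(), host))
                    self.back[host].append(url)
                    return True
            return False

        def next_url(self):
            """Next URL whose host may politely be contacted, or None."""
            while not self.ready:
                if not self._refill():
                    return None
            ready_time, host = heapq.heappop(self.ready)
            time.sleep(max(0.0, ready_time - time.monotonic()))
            url = self.back[host].popleft()
            if self.back[host]:   # host has more work: schedule it later
                heapq.heappush(self.ready,
                               (time.monotonic() + self.delay, host))
            else:
                del self.back[host]
            return url

In the real design a worker thread never sleeps: it is handed a URL from some host that is already ready, which is why there are many more back-end queues than threads.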

12. Details: Efficient URL Elimination
• Fingerprinting.
• Sorted disk file of the fingerprints of all seen URLs (100M to 1B entries).
• Cache of the most-used URLs' fingerprints (FP cache, 2^16 entries).
• Non-cached URLs are checked in batches: fingerprints are collected in a buffer (2^21 entries), then merged against the disk file in one sequential pass (merge with file I/O).
[Figure 4: Our most efficient disk-based DUE implementation. Front and back buffers hold fingerprints plus indices into disk files of the corresponding URLs; the back buffer is merged with the FP disk file.]
[From: Najork and Heydon, 2001]
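
The batch check is a sorted merge; a minimal in-memory sketch, assuming integer fingerprints and treating the "disk file" as any sorted iterable that is streamed once:

    def new_fingerprints(batch, seen_sorted):
        """Return the fingerprints in batch that are absent from
        seen_sorted (the sorted on-disk FP file, read sequentially)."""
        new = []
        it = iter(seen_sorted)
        cur = next(it, None)
        for fp in sorted(set(batch)):
            while cur is not None and cur < fp:
                cur = next(it, None)      # stream past smaller entries
            if cur != fp:                 # fp never seen before
                new.append(fp)
        return new

Per the figure, the buffers also carry indices into files of the actual URLs, so surviving fingerprints can be mapped back to URLs for the frontier.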

13. Details: Parallelization
[Figure 2: A four-node distributed crawling hive. Each node runs the full pipeline (HTTP module, RIS, link extractor, URL filter, host splitter, DUE, URL frontier); the host splitter sends each extracted URL to the node responsible for that URL's host.]
[From: Najork and Heydon, 2001]
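
The host splitter amounts to a stable hash partition of the host space; a minimal sketch:

    import hashlib
    from urllib.parse import urlsplit

    def node_for_url(url: str, num_nodes: int) -> int:
        """Which crawler node owns this URL. Hashing the host (not the
        full URL) keeps per-host politeness local to one node; a stable
        hash keeps the partition consistent across processes, unlike
        Python's builtin hash()."""
        host = urlsplit(url).netloc.lower()
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_nodes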

14. Some Experiences
Outcome of download attempts (Figure 6): 200 OK (81.36%), 404 Not Found (5.94%), excluded by robots.txt (3.92%), TCP error (3.12%), 302 Moved Temporarily (3.04%), DNS error (1.02%), other (1.59%).
Distribution of content types (Figure 7): text/html (65.34%), image/gif (15.77%), image/jpeg (14.36%), text/plain (1.24%), application/pdf (1.04%), other (2.26%).
[Figure 8: Distribution of document sizes (log-scale buckets from 1 byte to 1 MB).]

15. Some Experiences
[Figure 9: Document and web server size distributions. (a) Distribution of pages over web servers. (b) Distribution of bytes over web servers.]
Distribution of hosts over top-level domains: .com (47.20%), .de (7.93%), .net (7.88%), .org (4.63%), .uk (3.29%), raw IP addresses (3.25%), .jp (1.80%), .edu (1.53%), .ru (1.35%), .br (1.31%), .kr (1.30%), .nl (1.05%), .pl (1.02%), .au (0.95%), other (15.52%).
Distribution of pages over top-level domains: .com (51.44%), .net (6.74%), .org (6.31%), .edu (5.56%), .jp (4.09%), .de (3.37%), .uk (2.45%), raw IP addresses (1.43%), .ca (1.36%), .gov (1.19%), .us (1.14%), .cn (1.08%), .au (1.08%), .ru (1.00%), other (11.76%).
[From: Najork and Heydon, 2001]

16. Robot Exclusion Protocol
Simple protocol suggested by Martijn Koster in 1993. De facto standard for robot exclusion. Full details at www.robotstxt.org.
• A single file named robots.txt in the root of the server.
• Contains simple directives for excluding parts of the site.
Example:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /joe/

    User-agent: BadBot
    Disallow: /
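
Python's standard library can evaluate these rules directly; a minimal sketch, with the site and agent names as placeholders:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")   # placeholder site
    rp.read()                                         # fetch and parse

    # Under the example rules above: True for a generic crawler,
    # False for BadBot, False for anyone under /cgi-bin/.
    print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))
    print(rp.can_fetch("BadBot", "http://www.example.com/index.html"))
    print(rp.can_fetch("MyCrawler", "http://www.example.com/cgi-bin/x"))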

17. Robot Exclusion in HTML
Per-page exclusion through the META tag in HTML. Example:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Further details at www.w3.org/TR/html4/ (the HTML 4.01 specification) and at www.robotstxt.org.
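
A crawler can honor this tag with a small parser hook; a sketch using only the standard library:

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Collect the directives of any robots META tag."""
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)   # HTMLParser lowercases tag/attribute names
            if tag == "meta" and a.get("name", "").lower() == "robots":
                content = a.get("content") or ""
                self.directives |= {d.strip().lower()
                                    for d in content.split(",")}

    p = RobotsMetaParser()
    p.feed('<head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head>')
    print("noindex" in p.directives, "nofollow" in p.directives)  # True True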

18. HTTP Protocol
One request message, one response message (over a single TCP connection). Format of messages:

    Request          Response
    --------------   --------------
    Request line     Response line
    Header line      Header line
    ...              ...
    Header line      Header line
    (Body)           Body

19. HTTP Example
Request:

    GET /somedir/page.html HTTP/1.1
    Host: www.somefirm.com
    Accept: text/*
    User-Agent: Mozilla 7.0 [en]

Response:

    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Length: 345

    <HTML> <HEAD> . . .
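
This exchange maps directly onto one TCP connection; a minimal sketch in Python over a raw socket (host and contact address are placeholders):

    import socket

    def http_get(host, path, port=80):
        """Send a minimal HTTP/1.1 GET and return the raw response bytes.
        Connection: close makes the server end the response by closing
        the connection, so reading to EOF suffices."""
        request = (f"GET {path} HTTP/1.1\r\n"
                   f"Host: {host}\r\n"
                   "User-Agent: ExampleCrawler/0.1 (admin@example.org)\r\n"
                   "Connection: close\r\n"
                   "\r\n").encode("ascii")
        with socket.create_connection((host, port)) as sock:
            sock.sendall(request)
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

    # print(http_get("www.example.com", "/")[:200])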

20. URLs
Absolute:

    http://www.somefirm.dk:80/main/test
    http://www.somefirm.dk/main/test#thirdEntry
    http://www.somefirm.dk/cgi-bin?item=123

Relative:

    ./dir/test.html

Relative to:
• the URL of the document containing the URL, or
• the URL specified in a <BASE> HTML tag.
Encoded characters: www.sdu.dk/~rolf → www.sdu.dk/%7Erolf
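
Both resolution and decoding are covered by Python's urllib.parse; a short sketch reusing the slide's own examples:

    from urllib.parse import urljoin, unquote

    # Resolve a relative URL against the URL of the containing document.
    base = "http://www.somefirm.dk/main/test"
    print(urljoin(base, "./dir/test.html"))
    # -> http://www.somefirm.dk/main/dir/test.html

    # Convert percent-encoded characters back to real characters.
    print(unquote("www.sdu.dk/%7Erolf"))   # -> www.sdu.dk/~rolf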

21. Normalizing URLs
• Add the port number if not present (:80).
• Convert escaped characters to real characters.
• Remove ...#target from the URL.
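
A sketch of all three steps with urllib.parse (real crawlers normalize more, e.g. case and path segments):

    from urllib.parse import urlsplit, urlunsplit, unquote

    def normalize(url: str) -> str:
        scheme, netloc, path, query, _fragment = urlsplit(url)
        if scheme == "http" and ":" not in netloc:
            netloc += ":80"        # add default port if not present
        path = unquote(path)       # convert escaped chars to real chars
        return urlunsplit((scheme, netloc, path, query, ""))  # drop #target

    print(normalize("http://www.sdu.dk/%7Erolf#top"))
    # -> http://www.sdu.dk:80/~rolf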
