Web Crawling

• Najork and Heydon, High-Performance Web Crawling. Compaq SRC Research Report 173, 2001. Also in Handbook of Massive Data Sets, Kluwer, 2001.
• Heydon and Najork, Mercator: A scalable, extensible Web crawler. World Wide Web, 4, 1999.
• Najork and Wiener, Breadth-first search crawling yields high-quality pages. Proc. 10th Int. WWW Conf., 2001.
• Arasu et al., Searching the Web. ACM Trans. Internet Technology, 1, 2001.
Web Crawling

Web crawling = graph traversal:

    S = {startpage}
    repeat
        remove an element s from S
        foreach edge (s, v)
            if v not crawled before
                insert v in S
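The traversal above can be sketched as a small runnable crawler. This is a minimal sketch, assuming a toy in-memory link graph (`GRAPH`) standing in for fetching a page and extracting its links; with a FIFO queue as the set S, the traversal becomes breadth-first.

```python
from collections import deque

# Toy link graph standing in for the web; in a real crawler, the
# neighbors of s would come from fetching page s and parsing its links.
GRAPH = {
    "start": ["a", "b"],
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["start"],
}

def crawl(start):
    """Graph traversal from the slide: the frontier holds discovered-
    but-uncrawled pages; a FIFO queue gives breadth-first order."""
    frontier = deque([start])
    crawled = set()
    order = []
    while frontier:
        s = frontier.popleft()          # remove an element s from S
        if s in crawled:
            continue
        crawled.add(s)
        order.append(s)
        for v in GRAPH.get(s, []):      # foreach edge (s, v)
            if v not in crawled:        # if v not crawled before
                frontier.append(v)      # insert v in S
    return order
```

Choosing a different data structure for the frontier (stack, random sampling, priority queue) yields the other crawl strategies discussed below.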
Issues

Theoretical:
• Start set S
• Choice of s (crawl strategy)
• Refreshing of changing pages

Practical:
• Load balancing (own resources and resources of crawled sites)
• Size of data (compact representations)
• Performance (I/Os)
Crawl Strategy

• Breadth-first search
• Depth-first search
• Random
• Priority search

Possible priorities:
• Frequently changing pages (how to estimate the change rate?)
• A global ranking scheme (e.g. PageRank)
• A query-dependent ranking scheme ("focused crawling", "collection building")
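A priority-search frontier can be sketched with a heap. This is a hypothetical interface (not any particular crawler's API), assuming the caller supplies a numeric priority per URL, e.g. a PageRank estimate or an estimated change rate:

```python
import heapq

class PriorityFrontier:
    """Sketch of a prioritizing frontier: URLs are popped in order of a
    caller-supplied priority score, highest first."""
    def __init__(self):
        self._heap = []
        self._seen = set()       # avoids re-inserting discovered URLs
        self._counter = 0        # tie-breaker keeps pop order deterministic

    def push(self, url, priority):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so negate to pop the highest priority first
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

# Demo: three discovered URLs with (hypothetical) priority scores.
f = PriorityFrontier()
f.push("http://a.com/", 0.9)
f.push("http://b.com/", 0.2)
f.push("http://c.com/", 0.5)
```

Replacing the heap with a FIFO queue, a stack, or random choice recovers the other three strategies.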
BFS is Good

[Figure 1: Average PageRank score by day of crawl. Figure 2: Average day on which the top N pages were crawled. From: Najork and Wiener, 2001]

Statistics for a crawl of 328 million pages: pages crawled early in a breadth-first crawl have a markedly higher average PageRank than pages crawled later, and the top-ranked pages tend to be found in the first days of the crawl.
PageRank Priority is Even Better

(but computationally expensive to use. . . )

[Figure 2: The performance of various ordering metrics (PageRank, backlink count, breadth-first, random) for IB(P); G = 100. From: Arasu et al., 2001]

Statistics for a crawl of 225,000 pages at Stanford: ordering the frontier by PageRank reaches the hot pages fastest; random ordering is worst.
Load Balancing

Own resources:
• Bandwidth (control the global rate of requests)
• Storage (compact representations, compression)
• Industrial-strength crawlers must be distributed (e.g. partition the URL space)

Resources of others:
• Bandwidth: control the local rate of requests (e.g. 30 seconds between requests to the same site)
• Identify yourself in the request; give contact info
• Monitor the crawl
• Obey the Robots Exclusion Protocol (www.robotstxt.org). [Also read the other material there.]
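The per-site rate limit can be sketched as a small gate that tracks the last request time per host. A minimal sketch with a hypothetical interface; the clock is injectable so the behavior can be checked without waiting:

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Sketch: enforce a minimum delay between requests to the same
    host (the slide's example uses 30 seconds)."""
    def __init__(self, min_delay=30.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock            # injectable for testing
        self.last_request = {}        # host -> time of last request

    def wait_time(self, url):
        """Seconds the caller must still wait before hitting this host."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to this URL's host was just issued."""
        self.last_request[urlparse(url).netloc] = self.clock()

# Demo with a fake clock (no real waiting):
t = [0.0]
gate = PolitenessGate(min_delay=30.0, clock=lambda: t[0])
gate.record("http://a.com/page1")
t[0] = 10.0                           # 10 seconds later...
remaining = gate.wait_time("http://a.com/page2")
```

A worker thread would sleep for `wait_time(url)` before fetching, or requeue the URL, which is what the per-host back-end queues described later achieve without blocking workers.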
Efficiency

• RAM: never enough for serious crawls. Efficient use of disk-based storage is important; I/O when accessing data structures is often a bottleneck.
• CPU cycles: not a problem (Java and scripting languages are fine).
• DNS lookup can be a bottleneck (as it is normally synchronous). For asynchronous DNS, check the GNU adns library.

Rates reported for serious crawlers: 200–400 pages/sec.
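One simple way around the synchronous-DNS bottleneck, short of an asynchronous resolver like GNU adns, is to overlap many blocking lookups with a thread pool. A sketch; the resolver function is injectable, and the demo uses a hypothetical in-memory table so no network is needed:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def resolve_all(hosts, resolver=socket.gethostbyname, workers=32):
    """Sketch: run many blocking DNS lookups concurrently so that one
    slow name server does not stall the whole crawler."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {h: pool.submit(resolver, h) for h in hosts}
    results = {}
    for host, fut in futures.items():
        try:
            results[host] = fut.result()
        except OSError:
            results[host] = None      # DNS failure for this host
    return results

# Demo with a hypothetical in-memory resolver (no network needed):
TABLE = {"a.example": "1.2.3.4", "b.example": "5.6.7.8"}
def fake_resolver(host):
    if host in TABLE:
        return TABLE[host]
    raise OSError("host not found")

addrs = resolve_all(["a.example", "b.example", "nope.invalid"],
                    resolver=fake_resolver, workers=4)
```

In production one would also cache resolved addresses, since a crawl hits the same hosts repeatedly.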
Example: Mercator

[Figure 1: Mercator's main components: protocol modules (HTTP, FTP, Gopher) feed a rewound input stream (RIS); processing modules include the content-seen test (document fingerprints), link extractor, tag counter, and GIF stats; extracted links pass through a URL filter and the duplicate URL eliminator (DUE, backed by the URL set) into the URL frontier; a DNS resolver and log files support the pipeline. From: Najork and Heydon, 2001]
Mercator

Further ideas:
• Fingerprinting ((sparse) hash function on strings)
• Continuous crawling: crawled pages are put back in the queue (prioritized using their update history)
• Checkpointing (crash recovery)
• Very modular structure
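Fingerprinting maps each URL (or document) string to a short fixed-width value so that equality checks and storage work on small integers instead of strings. Mercator used Rabin fingerprints; as a simple stand-in, this sketch truncates a cryptographic hash, which gives a comparably low collision probability:

```python
import hashlib

def fingerprint(url, bits=64):
    """Sketch: derive a fixed-width fingerprint from a string.
    A truncated SHA-1 digest stands in for a Rabin fingerprint."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    # keep the first bits/8 bytes as an unsigned integer
    return int.from_bytes(digest[: bits // 8], "big")
```

With 64-bit fingerprints, collisions among even billions of URLs are unlikely, and a seen-URL test reduces to membership of an 8-byte integer in a (sorted) set.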
Details: Politeness

A polite, dynamic, prioritizing frontier:

[Figure 3: Our best URL frontier implementation: a prioritizer feeds k front-end FIFO queues (one per priority level); a random queue chooser with a bias toward high-priority queues feeds a back-end queue router; a host-to-queue table maps each host to one of n back-end FIFO queues (many more queues than worker threads); a back-end queue selector, implemented as a priority queue (e.g. a heap), picks the next queue to service. From: Najork and Heydon, 2001]
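The front end of such a frontier can be sketched as a set of per-priority FIFO queues drained with a bias toward high priorities. A simplified, hypothetical interface (the per-host back-end queues that enforce politeness are elided); the random source is injectable so the bias can be tested deterministically:

```python
import random
from collections import deque

class FrontEndQueues:
    """Sketch of the frontier's front end: one FIFO queue per priority
    level, drained with a bias toward high-priority levels."""
    def __init__(self, levels, rng=random.random):
        self.queues = [deque() for _ in range(levels)]
        self.rng = rng

    def push(self, url, level):
        self.queues[level].append(url)    # level 0 = highest priority

    def pop(self):
        # Biased choice: visit non-empty levels in priority order and
        # take each with probability 1/2, falling through otherwise;
        # the last non-empty level is taken unconditionally.
        nonempty = [q for q in self.queues if q]
        for q in nonempty[:-1]:
            if self.rng() < 0.5:
                return q.popleft()
        return nonempty[-1].popleft()     # raises IndexError if empty

# Demo with a deterministic "random" source that always takes the
# highest-priority non-empty queue:
f = FrontEndQueues(3, rng=lambda: 0.0)
f.push("http://low.example/", 2)
f.push("http://high.example/", 0)
```

In the full design, a URL popped here is routed to the back-end FIFO queue for its host, so each host is fetched from by at most one worker at a time.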
Details: Efficient URL Elimination

• Fingerprinting ((sparse) hash function on strings).
• Sorted file of fingerprints of seen URLs.
• Cache most used URLs.
• Non-cached URLs checked in batches (merge with file I/O).

[Figure 4: Our most efficient disk-based DUE implementation: an FP cache (2^16 entries), front and back buffers of fingerprints and URL indices (2^21 entries each) with accompanying disk files of URLs, and a sorted FP disk file of 100 million to 1 billion entries. From: Najork and Heydon, 2001]
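The batched check amortizes disk I/O: a buffer of candidate fingerprints is sorted and merged against the sorted on-disk fingerprint file in one sequential pass. A minimal in-memory sketch, with a Python list standing in for the sorted FP disk file:

```python
def merge_check(sorted_seen, batch):
    """Sketch of the batched duplicate-URL-elimination check: return
    the fingerprints in `batch` that do not occur in `sorted_seen`,
    using a single merge pass over the sorted "disk file"."""
    new = []
    i = 0
    for fp in sorted(set(batch)):       # sort the batch once
        # advance the disk-file pointer past smaller entries
        while i < len(sorted_seen) and sorted_seen[i] < fp:
            i += 1
        if i == len(sorted_seen) or sorted_seen[i] != fp:
            new.append(fp)              # genuinely new fingerprint
    return new

seen = [3, 8, 15, 21, 42]               # stands in for the sorted FP file
batch = [8, 5, 42, 99, 5]               # fingerprints of candidate URLs
```

Because both sequences are sorted, the pass reads the disk file strictly sequentially, which is why batching beats one random disk probe per URL.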
Some Experiences

Outcome of download attempts (Figure 6):
200 OK 81.36%, 404 Not Found 5.94%, 302 Moved Temporarily 3.04%, excluded by robots.txt 3.92%, TCP error 3.12%, DNS error 1.02%, other 1.59%.

Distribution of content types (Figure 7):
text/html 65.34%, image/gif 15.77%, image/jpeg 14.36%, text/plain 1.24%, application/pdf 1.04%, other 2.26%.

[Figure 8: Distribution of document sizes (histogram over sizes from 1 byte to 1 MB).]
Some Experiences

[Figure 9: Document and web server size distributions: (a) distribution of pages over web servers; (b) distribution of bytes over web servers.]

Distribution of hosts over top-level domains:
.com 47.20%, .de 7.93%, .net 7.88%, .org 4.63%, .uk 3.29%, raw IP addresses 3.25%, .jp 1.80%, .edu 1.53%, .ru 1.35%, .br 1.31%, .kr 1.30%, .nl 1.05%, .pl 1.02%, .au 0.95%, other 15.52%.

Distribution of pages over top-level domains:
.com 51.44%, .net 6.74%, .org 6.31%, .edu 5.56%, .jp 4.09%, .de 3.37%, .uk 2.45%, raw IP addresses 1.43%, .ca 1.36%, .gov 1.19%, .us 1.14%, .cn 1.08%, .au 1.08%, .ru 1.00%, other 11.76%.

[From: Najork and Heydon, 2001]
Further Resources

Further resources for implementing a crawler:
• Another good paper with practical info: Shkapenyuk and Suel, Design and Implementation of a High-Performance Distributed Web Crawler. IEEE Int. Conf. on Data Engineering (ICDE), February 2002. (http://cis.poly.edu/suel/papers/crawl.ps)
• HTML specification (www.w3.org)
• A free book on programming web agents (http://www.oreilly.com/openbook/webclient)
• Software libraries (Java, Perl, Python, C++) for net programming