Distributed Web Crawling over DHTs
Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy
CS294-4
Search Today
[Diagram: today's search engines crawl the web, build an index, and answer searches against that index]
What's Wrong?
• Users have a limited search interface.
• Today's web is dynamic and growing:
  • Timely re-crawls required.
  • Not feasible for all web sites.
• Search engines control your search results:
  • Decide which sites get crawled:
    • 550 billion documents estimated in 2001 (BrightPlanet).
    • Google indexes 3.3 billion documents.
  • Decide which sites get updated more frequently.
  • May censor or skew result rankings.
• Challenge: user-customizable searches that scale.
Our Solution: A Distributed Crawler
• P2P users donate excess bandwidth and computation resources to crawl the web.
  • Organized using Distributed Hash Tables (DHTs).
• DHT- and query-processor-agnostic crawler:
  • Designed to work over any DHT.
  • Crawls can be expressed as declarative recursive queries:
    • Easy for user customization.
  • Queries can be executed over PIER, a DHT-based relational P2P query processor.
• Crawlees: web servers. Crawlers: PIER nodes.
Potential
• Infrastructure for crawl personalization:
  • User-defined focused crawlers.
  • Collaborative crawling/filtering (special interest groups).
• Other possibilities:
  • Bigger, better, faster web crawler.
  • Enables new search and indexing technologies.
  • P2P web search.
  • Web archival and storage (with OceanStore).
• Generalized crawler for querying distributed graph structures:
  • Monitor file-sharing networks, e.g. Gnutella.
  • P2P network maintenance:
    • Routing information.
    • OceanStore metadata.
Challenges that We Investigated
• Scalability and throughput:
  • DHT communication overheads.
  • Balance network load on crawlers:
    • 2 components of network load: download and DHT bandwidth.
  • Network proximity: exploit network locality of crawlers.
• Limit download rates on web sites (sketch below):
  • Prevents denial-of-service attacks.
• Main tradeoff: tension between coordination and communication:
  • Balance load either on crawlers or on crawlees!
  • Exploit network proximity at the cost of communication.
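As an illustration of the per-server rate limiting mentioned above, here is a minimal sketch of a per-hostname throttle a crawler node might keep. The MIN_INTERVAL_S politeness value and the function names are our own assumptions, not values from the slides.

```python
import time
from collections import defaultdict

# Minimal per-hostname throttle: each crawler remembers when it last hit a
# given web server and refuses to fetch again until MIN_INTERVAL_S has passed.
MIN_INTERVAL_S = 5.0  # assumed politeness interval, not from the slides

_last_fetch = defaultdict(float)

def may_fetch(hostname):
    """Return True (and record the fetch time) if `hostname` can be hit again."""
    now = time.time()
    if now - _last_fetch[hostname] >= MIN_INTERVAL_S:
        _last_fetch[hostname] = now
        return True
    return False
```

This only works as a "control point" if the same node sees all requests for a hostname, which is exactly the tension with load balancing discussed in the partitioning slides.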
Crawl as a Recursive Query
[Dataflow diagram: seed URLs and a DHT scan of WebPage(url) supply input URLs, which pass through rate throttle & reorder, redirect, filters, and duplicate elimination into a crawler thread (CrawlWrapper: downloader + links extractor); output links are projected (Π: Link.destUrl → WebPage(url)) and published back as WebPage(url) and Link(sourceUrl, destUrl) tuples, feeding the query recursively.]
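For intuition, the recursive query above can be read as a fixpoint computation. The sketch below collapses the DHT publish/scan steps into local sets and runs on a single node; `download` and `extract_links` are hypothetical stand-ins for the CrawlWrapper's downloader and links extractor, not PIER APIs.

```python
# Single-node reading of the recursive crawl query: newly published
# WebPage(url) tuples are downloaded, their extracted links are projected
# (Link.destUrl -> WebPage(url)) and published back, and the loop runs
# until no new URLs appear (or a page budget is hit).

def crawl_fixpoint(seed_urls, download, extract_links, max_pages=1000):
    webpage = set(seed_urls)        # WebPage(url) relation
    link = set()                    # Link(sourceUrl, destUrl) relation
    frontier = list(seed_urls)      # URLs not yet crawled
    while frontier and len(webpage) < max_pages:
        url = frontier.pop(0)
        body = download(url)
        if body is None:
            continue
        for dest in extract_links(url, body):
            link.add((url, dest))          # publish Link tuple
            if dest not in webpage:        # duplicate elimination
                webpage.add(dest)          # publish WebPage tuple
                frontier.append(dest)      # recursion: new input URL
    return webpage, link
```

In the real system the two sets are DHT-resident relations, so which node executes each iteration is decided by the partitioning scheme described next.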
Crawl Distribution Strategies
• Partition by URL:
  • Ensures even distribution of crawler workload.
  • High DHT communication traffic.
• Partition by hostname:
  • One crawler per hostname.
  • Creates a "control point" for per-server rate throttling.
  • May lead to uneven crawler load distribution.
  • Single point of failure:
    • A "bad" choice of crawler affects per-site crawl throughput.
  • Slight variation: X crawlers per hostname (sketch below).
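A minimal sketch of how the two partitioning rules map a URL to a crawler, assuming a flat list of node IDs and simple modular hashing in place of a real DHT key space; the function names and hashing shortcut are ours, not Bamboo/PIER APIs.

```python
import hashlib
from urllib.parse import urlparse

def _slot(key, n):
    """Hash a string key into one of n slots."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % n

def crawler_by_url(url, nodes):
    """Partition by URL: work spreads evenly, but almost every extracted
    link is shipped to a different node (high DHT traffic)."""
    return nodes[_slot(url, len(nodes))]

def crawler_by_hostname(url, nodes, x=1):
    """Partition by hostname: the same crawler (or one of X crawlers) always
    handles a given web server, giving a control point for rate throttling."""
    host = urlparse(url).netloc
    base = _slot(host, len(nodes))
    offset = _slot(url, x)          # variation: spread over X crawlers per host
    return nodes[(base + offset) % len(nodes)]
```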
Redirection
• A simple technique that allows a crawler to redirect, or pass on, its assigned work to another crawler (and so on...).
• A second-chance distribution mechanism, orthogonal to the partitioning scheme.
• Example (partition by hostname): the node responsible for www.google.com (red) dispatches work, by URL, to the grey nodes.
  • Load-balancing benefits of partition by URL.
  • Control benefits of partition by hostname.
• When? Policy-based (sketch below):
  • Crawler load (queue size).
  • Network proximity.
• Why not?
  • Cost of redirection.
  • Increased DHT control traffic.
  • Hence, limit the number of redirections per URL.
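A sketch of such a policy: when the responsible crawler is overloaded, pass the URL to another node, at most MAX_REDIRECTS times. The queue-size threshold and the helper callables (queue_len, pick_peer, enqueue_local, enqueue_remote) are hypothetical; the slides only state that redirection is triggered by crawler load or proximity and bounded per URL.

```python
MAX_REDIRECTS = 1        # slides evaluate one level of redirection on overload
QUEUE_THRESHOLD = 100    # assumed "overloaded" queue size

def maybe_redirect(url, hops, queue_len, pick_peer, enqueue_local, enqueue_remote):
    """Keep the URL locally unless we are overloaded and may still redirect it."""
    if hops < MAX_REDIRECTS and queue_len() > QUEUE_THRESHOLD:
        peer = pick_peer(url)                     # chosen e.g. by load or ping time
        if peer is not None:
            enqueue_remote(peer, url, hops + 1)   # redirected work carries a hop count
            return
    enqueue_local(url)                            # crawl it ourselves
```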
Experiments
• Deployment: web crawler over PIER, Bamboo DHT, up to 80 PlanetLab nodes.
  • 3 crawl threads per crawler, 15-minute crawl duration.
• Distribution (partition) schemes:
  • URL.
  • Hostname.
  • Hostname with 8 crawlers per unique host.
  • Hostname, with one level of redirection on overload.
• Crawl workloads:
  • Exhaustive crawl:
    • Seed URL: http://www.google.com
    • 78,244 different web servers.
  • Crawl of a fixed number of sites:
    • Seed URL: http://www.google.com
    • 45 web servers within google.
  • Crawl of a single site within http://groups.google.com
Crawl of Multiple Sites I
[Graphs: CDF of per-crawler downloads (80 nodes); crawl throughput scaleup]
• Partition by hostname shows poor load balance (70% of crawlers idle); throughput is better when more crawlers are busy.
• Hostname can exploit at most 45 crawlers.
• Redirect (hybrid hostname/URL) does the best.
Crawl of Multiple Sites II
[Graph: per-URL DHT overheads]
• Redirect: per-URL DHT overheads hit their maximum around 70 nodes; redirection incurs higher overheads only after queue size exceeds a threshold.
• Hostname incurs low overheads, since the crawl only looks at google.com, which has many self-links.
Network Proximity
• Sampled 5,100 crawl targets and measured ping times from each of 80 PlanetLab hosts.
• Partition by hostname approximates random assignment.
• Best-3 random is "close enough" to best-5 random (sketch below).
• Sanity check: what if a single host crawls all targets?
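The "best-of-k random" assignment measured here can be sketched as follows: probe k random crawlers and hand the target to the one with the lowest measured latency. `ping_ms` is a hypothetical probe function; in the experiments the ping times were measured offline.

```python
import random

def best_of_k(target_host, crawlers, ping_ms, k=3):
    """Assign target_host to the nearest of k randomly chosen crawler nodes."""
    candidates = random.sample(crawlers, min(k, len(crawlers)))
    return min(candidates, key=lambda node: ping_ms(node, target_host))
```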
Summary of Schemes

Scheme   | Load-balance download bandwidth | Load-balance DHT bandwidth | Rate-limit crawlees | Network proximity | DHT communication overheads
---------|---------------------------------|----------------------------|---------------------|-------------------|----------------------------
URL      | +                               | +                          | -                   | -                 | -
Hostname | -                               | -                          | +                   | ?                 | +
Redirect | +                               | ?                          | +                   | +                 | --
Related Work
• Herodotus, at MIT (Chord-based):
  • Partition by URL.
  • Batching with ring-based forwarding.
  • Experimented on 4 local machines.
• Apoidea, at GaTech (Chord-based):
  • Partition by hostname.
  • Forwards crawl work to the DHT neighbor closest to the website.
  • Experimented on 12 local machines.
Conclusion
Our main contributions:
• Propose a DHT- and QP-agnostic distributed crawler:
  • Express crawl as a query.
  • Permits user-customizable refinement of crawls.
• Discover important trade-offs in distributed crawling:
  • Coordination comes with extra communication costs.
• Deployment and experimentation on PlanetLab:
  • Examine crawl distribution strategies under different workloads on live web sources.
  • Measure the potential benefits of network proximity.
Backup slides
Existing Crawlers
• Cluster-based crawlers:
  • Google: centralized dispatcher sends URLs to be crawled.
  • Hash-based parallel crawlers.
• Focused crawlers:
  • BINGO!: crawls the web given a basic training set.
• Peer-to-peer:
  • Grub: SETI@Home infrastructure, 23,993 members.
Exhaustive Crawl
• Partition by hostname shows imbalance: some crawlers are over-utilized for downloads.
• Little difference in throughput: most crawler threads are kept busy.
Single Site
• URL is best, followed by redirect and hostname.
Future Work
• Fault tolerance.
• Security.
• Single-node throughput.
• Work-sharing between crawl queries:
  • Essential for overlapping users.
• Global crawl prioritization:
  • A requirement of personalized crawls.
  • Online relevance feedback.
• Deep web retrieval.