  1. Web Crawling. Introduction to Information Retrieval, INF 141, Donald J. Patterson. Content adapted from Hinrich Schütze, http://www.informationretrieval.org

  2. Web Crawlers

  3. Robust Crawling: A Robust Crawl Architecture
     [Architecture diagram: WWW → Fetch (with DNS resolution) → Parse → Content Seen? (checked against doc fingerprints / index) → URL Filter (robots.txt) → Duplicate URL Elimination → URL Frontier queue, which feeds back into Fetch]

  4. Parsing: URL normalization
     • When a fetched document is parsed, some outlink URLs are relative.
     • For example, http://en.wikipedia.org/wiki/Main_Page has a link to “/wiki/Special:Statistics”, which is the same as http://en.wikipedia.org/wiki/Special:Statistics.
     • Parsing therefore involves normalizing (expanding) relative URLs; see the sketch below.
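
A minimal sketch of this normalization step in Python (the slides don't prescribe an implementation); urljoin and urldefrag are standard-library calls:

```python
# Expand a relative outlink against the URL of the page it was found on,
# and drop any #fragment so equivalent URLs compare equal.
from urllib.parse import urljoin, urldefrag

def normalize(base_url, outlink):
    """Return the absolute, fragment-free form of an outlink."""
    absolute = urljoin(base_url, outlink)
    absolute, _fragment = urldefrag(absolute)
    return absolute

# The example from the slide:
print(normalize("http://en.wikipedia.org/wiki/Main_Page", "/wiki/Special:Statistics"))
# -> http://en.wikipedia.org/wiki/Special:Statistics
```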

  5. Robust Crawling: A Robust Crawl Architecture (same architecture diagram as slide 3)

  6. Duplication: Content Seen?
     • Duplication is widespread on the web.
     • If a page just fetched is already in the index, don't process it any further.
     • This can be done using document fingerprints/shingles, a type of hashing scheme; a sketch follows.
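
A rough sketch of a shingle-based "content seen?" check, under the assumption that exact duplicates share the same shingle set (real near-duplicate detection would compare shingle overlap instead):

```python
import hashlib

def shingle_fingerprints(text, k=5):
    """Hash every k-word window (shingle) of the document."""
    words = text.split()
    shingles = (" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1)))
    return frozenset(hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles)

seen_fingerprints = set()          # in practice this lives in the index, not memory

def content_seen(text):
    """True if an identical document has already been processed."""
    fp = shingle_fingerprints(text)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```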

  7. Robust Crawling: A Robust Crawl Architecture (same architecture diagram as slide 3)

  8. Filters: Compliance with webmasters' wishes
     • Robots.txt: a filter is a regular expression for URLs to be excluded.
     • How often do you check robots.txt? Cache it to avoid using bandwidth and loading the web server (see the sketch below).
     • Sitemaps: a mechanism to better manage the URL frontier.
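
A hedged sketch of a robots.txt check with caching, using Python's standard urllib.robotparser; the one-day cache lifetime is an arbitrary choice, not something the slides specify:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 3600            # assumed cache lifetime: one day
_robots_cache = {}                # host -> (RobotFileParser, fetch_time)

def allowed(url, user_agent="*"):
    """Check a URL against the host's robots.txt, refetching only when stale."""
    host = urlparse(url).netloc
    entry = _robots_cache.get(host)
    if entry is None or time.time() - entry[1] > ROBOTS_TTL:
        rp = RobotFileParser("http://%s/robots.txt" % host)
        rp.read()                 # fetches and parses robots.txt once per TTL
        _robots_cache[host] = (rp, time.time())
    return _robots_cache[host][0].can_fetch(user_agent, url)
```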

  9. Robust Crawling: A Robust Crawl Architecture (same architecture diagram as slide 3)

  10. Duplicate Elimination
     • For a one-time crawl: test whether an extracted, parsed, filtered URL
       • has already been sent to the frontier, or
       • has already been indexed.
     • For a continuous crawl (see the full frontier implementation): update the URL's priority
       • based on staleness,
       • based on quality,
       • based on politeness.
     • A minimal sketch for the one-time case follows.
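
A minimal sketch of duplicate URL elimination for the one-time crawl, assuming a simple in-memory set and a list standing in for the frontier (a real crawler would use a persistent or distributed store):

```python
urls_seen = set()                 # URLs already sent to the frontier or indexed

def admit_to_frontier(url, frontier):
    """Add the URL to the frontier unless it has been seen before."""
    if url in urls_seen:
        return False              # already queued or indexed; drop it
    urls_seen.add(url)
    frontier.append(url)          # `frontier` stands in for the real URL frontier
    return True
```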

  11. Distributing the crawl
     • The key goal for the architecture of a distributed crawl is cache locality.
     • We want multiple crawl threads, in multiple processes, at multiple nodes, for robustness.
     • Geographically distributed for speed.
     • Partition the hosts being crawled across nodes; a hash is typically used for the partition (sketched below).
     • How do the nodes communicate?
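
A sketch of partitioning hosts across crawl nodes by hashing the host name; because every node applies the same function, any node can decide which node owns an extracted URL and forward it there:

```python
import hashlib
from urllib.parse import urlparse

def node_for_url(url, num_nodes):
    """Map a URL to the crawl node responsible for its host."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# URLs from the same host always land on the same node:
assert node_for_url("http://example.org/a", 4) == node_for_url("http://example.org/b", 4)
```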

  12. Robust Crawling: The output of the URL Filter at each node is sent to the Duplicate URL Eliminator at all other nodes.
     [Architecture diagram as in slide 3, with a Host Splitter between the URL Filter and Duplicate URL Elimination that routes URLs to and from the other nodes]

  13. URL Frontier
     • Freshness: crawl some pages more often than others; keep track of the change rate of sites; incorporate sitemap info.
     • Quality: high-quality pages should be prioritized, based on link analysis, popularity, and heuristics on content.
     • Politeness: when was the last time you hit a server?

  14. URL Frontier
     • Freshness, quality, and politeness: these goals conflict with each other.
     • A simple priority queue will fail because links are bursty: many sites have lots of links pointing to themselves, creating bursty references.
     • Time influences the priority.
     • Politeness challenge: even if only one thread is assigned to hit a particular host, it can hit it repeatedly.
     • Heuristic: insert a time gap between successive requests (sketched below).
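
A sketch of the time-gap heuristic: record the last hit time per host and wait out a gap before hitting the same host again. The constant gap and the blocking sleep are simplifications; some crawlers scale the gap with the previous download time, and a real crawler would pick another host instead of sleeping. `fetch` is a caller-supplied download function:

```python
import time
from urllib.parse import urlparse

MIN_GAP_SECONDS = 2.0             # assumed politeness gap
_last_hit = {}                    # host -> time of the last request

def polite_fetch(url, fetch):
    """Fetch a URL, but never hit the same host twice within MIN_GAP_SECONDS."""
    host = urlparse(url).netloc
    wait = MIN_GAP_SECONDS - (time.time() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)          # simplification: block instead of switching hosts
    _last_hit[host] = time.time()
    return fetch(url)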

  15. Magnitude of the crawl
     • To fetch 1,000,000,000 pages in one month (a small fraction of the web), we need to fetch roughly 400 pages per second!
     • Since many fetches will be duplicates, unfetchable, filtered, etc., even 400 pages per second isn't fast enough.
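
For reference, the arithmetic behind the 400-pages-per-second figure, assuming a 30-day month:

```python
pages = 1000000000
seconds_per_month = 30 * 24 * 3600          # 2,592,000 seconds
print(pages / seconds_per_month)            # ~385.8, i.e. roughly 400 pages per second
```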

  16. Web Crawling Outline: Overview
     • Introduction
     • URL Frontier
     • Robust Crawling
     • DNS
     • Various parts of the architecture
     • URL Frontier
     • Index
     • Distributed Indices
     • Connectivity Servers

  17. Robust Crawling: The output of the URL Filter at each node is sent to the Duplicate URL Eliminator at all other nodes (architecture diagram repeated from slide 12).

  18. URL Frontier Implementation - Mercator
     • URLs flow from top to bottom.
     • Front queues manage priority.
     • Back queues manage politeness.
     • Each queue is FIFO.
     [Diagram: Prioritizer → front queues 1..F ("Front" queues) → Front Queue Selector → Back Queue Router (with a host-to-back-queue mapping table) → back queues 1..B ("Back" queues) → Back Queue Selector with Timing Heap]
     http://research.microsoft.com/~najork/mercator.pdf

  19. URL Frontier Implementation - Mercator: Front queues
     • The prioritizer takes a URL and assigns it a priority: an integer between 1 and F.
     • It appends the URL to the appropriate front queue.
     • Priority is based on the rate of change, on quality (spam), and on the application; see the sketch below.
     [Diagram: Prioritizer → front queues 1..F → Front Queue Selector]
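
A sketch of the prioritizer under two assumptions not stated on the slide: priority 1 is the highest, and the score comes from some caller-supplied function combining change rate, quality, and application needs:

```python
from collections import deque

F = 4                                               # number of front queues (assumed)
front_queues = {i: deque() for i in range(1, F + 1)}

def prioritize(url, score):
    """Map a score in [0, 1] (higher = crawl sooner) to a front queue 1..F."""
    priority = max(1, min(F, F - int(score * F)))   # 1 = highest priority (assumed)
    front_queues[priority].append(url)
    return priority
```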

  20. URL Frontier Implementation - Mercator: Back queues
     • Selection from the front queues is initiated from the back queues.
     • Pick a front queue, but how?
       • Round robin
       • Randomly
       • Monte Carlo, biased toward high priority (sketched below)
     [Diagram: Back Queue Router (host-to-back-queue mapping table) → back queues 1..B → Back Queue Selector with Timing Heap]
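
A sketch of the biased random ("Monte Carlo") choice among front queues, weighting queue 1 (highest priority, as assumed above) most heavily; the exponential weights are an illustrative choice, not something the slides specify:

```python
import random

def pick_front_queue(num_front_queues, bias=2.0):
    """Return a front queue index in 1..F, favouring low (high-priority) indices."""
    weights = [bias ** -i for i in range(num_front_queues)]   # 1, 1/2, 1/4, ...
    return random.choices(range(1, num_front_queues + 1), weights=weights, k=1)[0]
```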

  21. URL Frontier Implementation - Mercator: Back queues
     • Each back queue is non-empty while crawling.
     • Each back queue has URLs from one host only.
     • Maintain a table mapping hosts to back queues to help with this.

  22. URL Frontier Implementation - Mercator: Timing Heap
     • One entry per back queue.
     • Each entry holds the earliest time that its host can be hit again.
     • The earliest time is based on the last access to that host, plus any appropriate heuristic.

  23. URL Frontier Implementation - Mercator: Getting a URL
     • A crawler thread needs a URL.
     • It pops the timing-heap root to get the next eligible back queue b, based on time.
     • It gets a URL from b.
     • If b is empty:
       • Pull a URL v from a front queue.
       • If a back queue for v's host exists, place v in that queue and repeat.
       • Else add v to b and update the heap.
     • A condensed sketch follows.
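
A condensed, single-threaded sketch of this slide's logic. The timing heap is a heapq of (earliest-allowed-time, queue-id) pairs; B, the gap, and the helpers front_queue_pop (pull a URL from the front queues) and host_of (extract a URL's host) are assumed here, and the real Mercator implementation keeps the queues on disk and handles locking:

```python
import heapq
import time
from collections import deque

B = 8                                             # number of back queues (assumed)
back_queues = {b: deque() for b in range(B)}      # one FIFO of URLs per back queue
host_to_queue = {}                                # host -> back queue id
queue_host = {b: None for b in range(B)}          # back queue id -> host it holds
timing_heap = [(0.0, b) for b in range(B)]        # (earliest next hit time, queue id)
heapq.heapify(timing_heap)

def next_url(front_queue_pop, host_of, gap=2.0):
    """Return the next URL a crawler thread should fetch."""
    ready_at, b = heapq.heappop(timing_heap)      # heap root = next eligible queue
    time.sleep(max(0.0, ready_at - time.time()))  # wait until its host is eligible
    while not back_queues[b]:                     # queue b drained: refill it
        if queue_host[b] is not None:             # release the host b used to hold
            del host_to_queue[queue_host[b]]
            queue_host[b] = None
        v = front_queue_pop()                     # pull a URL v from a front queue
        h = host_of(v)
        if h in host_to_queue:                    # v's host already has a back queue:
            back_queues[host_to_queue[h]].append(v)   # place v there and repeat
        else:                                     # else assign the host to b, add v
            host_to_queue[h] = b
            queue_host[b] = h
            back_queues[b].append(v)
    url = back_queues[b].popleft()
    heapq.heappush(timing_heap, (time.time() + gap, b))   # update heap for queue b
    return url
```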

  24. URL Frontier Implementation - Mercator: How many queues?
     • Keep all threads busy: use roughly 3 times as many back queues as crawler threads.
     • Web-scale issue: this won't fit in memory.
     • Solution: keep the queues on disk and keep a portion in memory.
