Frontera: Large-Scale Open Source Web Crawling Framework
Alexander Sibiryakov, 20 July 2015, sibiryakov@scrapinghub.com
Hello, participants! • Born in Yekaterinburg, RU. • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false-positive resolution, large-scale prediction of malicious download attempts.
«A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.» –Wikipedia: Web Crawler article, July 2015
Motivation • A client needed to crawl 1B+ pages/week and identify frequently changing HUB pages (in the sense of Hyperlink-Induced Topic Search, Jon Kleinberg, 1999). • Scrapy is hard to use for broad crawling and had no crawl frontier capabilities out of the box. • People tended to favor Apache Nutch instead of Scrapy.
Frontera: single-threaded and distributed • Frontera is all about knowing what to crawl next and when to stop. • Single-threaded mode can be used for up to 100 websites (parallel downloading). • For broad crawls at scale there is a distributed mode.
Main features • Online operation: scheduling of new batches, updating of DB state. • Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included); see the sketch after this list. • Canonical URL resolution abstraction: each document has many URLs, which one should be used? • Scrapy ecosystem: good documentation, big community, ease of customization.
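To make the storage abstraction concrete, here is a minimal sketch of a custom backend that keeps the frontier in memory as a FIFO queue. The method names follow Frontera's Backend interface of that period, but exact signatures vary between releases, so treat this as a schematic rather than production code.

    from frontera.core.components import Backend  # base class; check your Frontera version

    class InMemoryFIFOBackend(Backend):
        """Schematic FIFO backend: keeps the whole frontier in a plain list."""
        name = 'In-memory FIFO backend (sketch)'

        def __init__(self, manager):
            self.queue = []

        @classmethod
        def from_manager(cls, manager):
            return cls(manager)

        def frontier_start(self):
            pass

        def frontier_stop(self):
            pass

        def add_seeds(self, seeds):
            self.queue.extend(seeds)

        def page_crawled(self, response, links):
            # Newly discovered links go to the back of the queue.
            self.queue.extend(links)

        def request_error(self, page, error):
            pass

        def get_next_requests(self, max_next_requests, **kwargs):
            batch, self.queue = self.queue[:max_next_requests], self.queue[max_next_requests:]
            return batch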
Single-threaded use cases • Need for URL metadata and content storage. • Need to isolate URL ordering/queueing logic from the spider. • Advanced URL ordering logic (big websites, or revisiting).
Single-threaded architecture (diagram)
Frontera and Scrapy • Frontera is implemented as a custom scheduler and spider middlewares for Scrapy. • Frontera doesn't require Scrapy and can be used separately. • Scrapy's role is process management and fetching. • And we're friends forever!
Single-threaded Frontera quickstart • $pip install frontera • write a spider, or take an example one from the Frontera repo, • edit the spider's settings.py, changing the scheduler and adding Frontera's spider middleware (see the sketch below), • $scrapy crawl [your_spider] • check the contents of your chosen DB after the crawl.
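A minimal illustration of those settings.py changes, assuming a project called myproject (a placeholder); module paths are those documented for Frontera around version 0.3 and may differ in later releases:

    # settings.py of the Scrapy project
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999,
    }
    # Frontera's own settings live in a separate module of your project.
    FRONTERA_SETTINGS = 'myproject.frontera_settings'

    # myproject/frontera_settings.py (hypothetical module name)
    BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'  # SQLAlchemy-backed queue
    MAX_REQUESTS = 2000       # finish the crawl after this many requests (0 = no limit)
    MAX_NEXT_REQUESTS = 10    # batch size handed to Scrapy per scheduler call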
Distributed use cases: broad crawls • You have a set of URLs and need to revisit them (e.g. to track changes). • Building a search engine with content retrieval from the Web. • All kinds of research work on the web graph: gathering link statistics, graph structure, tracking domain counts, etc. • You have a topic and want to crawl the documents about that topic. • More general focused crawling tasks: e.g. searching for pages that are big hubs and change frequently over time.
Frontera architecture: distributed (diagram: spiders, Kafka topics, strategy workers (SW), DB workers, DB)
Main features: distributed • Communication layer is Apache Kafka: topic partitioning, offsets mechanism. • Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module (see the sketch after this list). • Polite by design: each website is downloaded by at most one spider. • Python: workers, spiders.
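The crawling strategy abstraction can be pictured with a schematic score-based strategy like the one below. This is not the exact distributed-frontera base class (class name, method names and return values are illustrative assumptions), but it shows the shape of such a module: react to crawl events, assign scores that drive URL ordering, and decide when the goal is reached.

    # Schematic crawling strategy (illustrative only, not the library's real interface).
    class BreadthFirstStrategy(object):
        """Scores newly discovered links with a score that decays with depth."""

        def add_seeds(self, seeds):
            # Seeds get the maximum score so they are fetched first.
            return {seed.meta['fingerprint']: 1.0 for seed in seeds}

        def page_crawled(self, response, links):
            depth = response.meta.get('depth', 0) + 1
            # The DB worker orders the priority queue by these scores.
            return {link.meta['fingerprint']: 1.0 / depth for link in links}

        def page_error(self, request, error):
            # Failed pages are not rescheduled in this sketch.
            return {request.meta['fingerprint']: 0.0}

        def finished(self):
            # Return True to stop the crawl; this sketch never stops on its own.
            return False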
Software requirements • Apache HBase and Apache Kafka (available in CDH, a 100% open-source Hadoop distribution), • Python 2.7+, • Scrapy 0.24+, • DNS service.
Hardware requirements • A single-threaded Scrapy spider gives about 1200 pages/min from about 100 websites crawled in parallel. • Spider-to-strategy-worker ratio is 4:1 (without passing content). • 1 GB of RAM for every SW (state cache, tunable). • Example (worked numbers below): • 12 spiders ~ 14.4K pages/min, • 3 SW and 3 DB workers, • 18 cores in total.
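The example figures follow from simple arithmetic, assuming one CPU core per spider or worker process (an assumption, not stated on the slide):

    # Back-of-the-envelope sizing for the example cluster above.
    pages_per_min_per_spider = 1200
    spiders = 12

    throughput = spiders * pages_per_min_per_spider        # 14,400 pages/min ("~14.4K")
    strategy_workers = spiders // 4                        # 4:1 spiders-to-SW ratio -> 3
    db_workers = 3                                         # as in the example
    total_cores = spiders + strategy_workers + db_workers  # 12 + 3 + 3 = 18 cores
    sw_ram_gb = strategy_workers * 1                       # 1 GB state cache per SW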
Hardware requirements: gotchas • The network can become a bottleneck for internal communication. Solution: increase the number of network interfaces. • HBase can be backed by HDDs, and free RAM helps a lot with caching the priority queue. • Kafka throughput is a key performance issue; make sure the Kafka brokers have enough IOPS.
Quickstart for distributed Frontera • $pip install distributed-frontera • prepare HBase and Kafka, • write a simple Scrapy spider passing links and/or content (see the sketch below), • configure Frontera workers and spiders, • run the workers and spiders, and pull in the seeds. Consult http://distributed-frontera.readthedocs.org/ for more information.
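The "simple Scrapy spider" from this checklist can be little more than a link extractor, since Frontera injects the seeds and schedules all requests. A minimal sketch (spider name is a placeholder; import paths assume Scrapy 1.0, on 0.24 use scrapy.contrib.linkextractors):

    import scrapy
    from scrapy.linkextractors import LinkExtractor

    class BroadCrawlSpider(scrapy.Spider):
        """Fetches pages and hands every extracted link back to Frontera."""
        name = 'broad_crawl'  # placeholder name
        # No start_urls: seeds are pulled in through Frontera, not the spider.

        def parse(self, response):
            for link in LinkExtractor().extract_links(response):
                # The Frontera scheduler decides if and when these are crawled.
                yield scrapy.Request(link.url, callback=self.parse)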
Quick Spanish (.es) internet crawl • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites found, • 68.7K domains found, • 46.5M pages crawled overall, • 1.5 months of crawling, • 22 websites with more than 50M pages. For more info and graphs, check the poster.
Future plans: distributed version • Revisiting strategy, • PageRank- or HITS-based strategy, • Own URL and HTML parsing, • Integration with Scrapinghub's paid services, • Testing at larger scales.
Questions! Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com