Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 1, 2015 sibiryakov@scrapinghub.com

Hello, participants! • Born in Yekaterinburg, RU. • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts.

Task • Crawl the Spanish web to gather statistics about hosts and their sizes. • Limit the crawl to the .es zone. • Breadth-first strategy: first crawl documents at 1-click distance, then 2 clicks, and so on. • Finishing condition: no hosts remain with fewer than 100 crawled documents. • Low cost.
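
The breadth-first strategy on this slide can be illustrated with a minimal single-process sketch. It is not Frontera's implementation: `get_links` is a hypothetical callable that returns the outgoing links of a page, and the `.es` check is a simple hostname suffix test.

```python
from collections import deque
from urllib.parse import urlparse


def breadth_first_crawl(seeds, get_links, max_depth):
    """Visit URLs level by level (1 click, 2 clicks, ...), staying in the .es zone.

    get_links is a hypothetical callable: url -> list of outgoing links.
    Returns the visit order as (url, click_distance) pairs.
    """
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            host = urlparse(link).hostname or ""
            if host.endswith(".es") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Because the queue is first-in, first-out, every document at distance N is visited before any document at distance N+1.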

Spanish internet (.es) in 2012 • Domain names registered: 1.56M (39% growth per year) • Web servers in zone: 283.4K (33.1%) • Hosts: 4.2M (21%) • Spanish web sites in DMOZ catalog: 22,043 * Source: OECD Communications Outlook 2013 report

Solution • Scrapy* - network operations. • Apache Kafka - data bus (offsets, partitioning). • Apache HBase - storage (random access, linear scanning, scalability). • Twisted.Internet - library of async primitives for use in workers. • Snappy - efficient compression algorithm for I/O-bound applications. * Network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet.

Architecture (diagram; labels: Kafka topic, Crawling strategy, SW workers, DB, Storage workers)

1. Big and small hosts problem • When the crawler encounters a huge number of links from one host and a simple prioritization model is used, the queue gets flooded with URLs from that host. • This leaves spider resources underused. • We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory.
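
One way to picture the per-host metering is a batch builder that caps how many URLs any single host may contribute and holds the rest back in memory. This is a rough sketch, not the algorithm used in Frontera; the function name and the `max_per_host` parameter are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse


def next_batch(candidates, batch_size, max_per_host):
    """Pick up to batch_size URLs, taking at most max_per_host from any host.

    candidates: URLs already ordered by priority.
    Returns (batch, deferred); deferred URLs stay in memory for a later batch.
    """
    per_host = defaultdict(int)
    batch, deferred = [], []
    for url in candidates:
        host = urlparse(url).hostname
        if len(batch) < batch_size and per_host[host] < max_per_host:
            per_host[host] += 1
            batch.append(url)
        else:
            deferred.append(url)
    return batch, deferred
```

With a cap in place, a host with many queued links can no longer crowd smaller hosts out of a batch, so spiders keep downloading from many sites in parallel.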

3. DDoS of the Amazon AWS DNS service • A breadth-first strategy visits previously unknown hosts first, and therefore generates a huge number of DNS requests. • We ran a recursive DNS server on each downloading node, with upstreams set to Verizon and OpenDNS. • We used dnsmasq.

4. Tuning the Scrapy thread pool for efficient DNS resolution • Scrapy uses a thread pool to resolve DNS names to IP addresses. • When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks. • Scrapy reported numerous errors related to DNS name resolution and timeouts. • We added options to Scrapy for adjusting the thread pool size and timeout.
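
In later Scrapy releases, comparable knobs are available as the `REACTOR_THREADPOOL_MAXSIZE`, `DNS_TIMEOUT` and `DNSCACHE_ENABLED` settings. Whether these are exactly the options the slide refers to is an assumption, and the values below are illustrative, not the ones used in this crawl.

```python
# settings.py fragment for a Scrapy project (values are illustrative assumptions)

# Maximum size of the Twisted reactor thread pool, which Scrapy uses
# for blocking DNS lookups.
REACTOR_THREADPOOL_MAXSIZE = 30

# Seconds to wait for a DNS query before it is treated as timed out.
DNS_TIMEOUT = 120

# Keep resolved addresses in Scrapy's in-memory DNS cache.
DNSCACHE_ENABLED = True
```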

5. Overloaded HBase region servers during state check • The crawler extracts hundreds of links from a document on average. • Before these links are added to the queue, they need to be checked to see whether they were already crawled (to avoid repeated visits). • On small volumes SSDs were just fine. After the table grew, we had to move to HDDs, and response times grew dramatically. • Host-local fingerprint function for keys in HBase. • Tuning the HBase block cache to fit the average host's states into one block.
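
The slides do not give the exact fingerprint function. A plausible sketch of a host-local key, assuming a CRC32 of the hostname as a prefix followed by a SHA-1 of the full URL, could look like this:

```python
import hashlib
import struct
import zlib
from urllib.parse import urlparse


def host_local_fingerprint(url):
    """Return a 24-byte row key: 4-byte CRC32(hostname) + 20-byte SHA-1(url).

    The layout is an assumption for illustration. Keys sharing a hostname
    share a prefix, so they sort next to each other in HBase.
    """
    host = urlparse(url).hostname or ""
    host_prefix = struct.pack(">I", zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF)
    url_digest = hashlib.sha1(url.encode("utf-8")).digest()
    return host_prefix + url_digest
```

Since HBase stores rows sorted by key, the states of one host end up physically adjacent, and checking the links extracted from a page (which mostly point to the same host) touches few blocks.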

6. Intensive network traffic from workers to services • We observed throughput of up to 1 Gbit/s between workers, Kafka and HBase. • Switched to the Thrift compact protocol for HBase communication. • Message compression in Kafka using Snappy.

7. Further query and traffic optimizations to HBase • The state check accounted for the lion’s share of requests and network throughput. • Consistency was another requirement. • We created a local state cache in the strategy worker. • For consistency, the spider log was partitioned by host, to avoid cache overlap between workers.
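
Partitioning by host can be sketched as a stable hash of the hostname taken modulo the number of partitions. The use of CRC32 here is an assumption, and `partition_for` is a hypothetical helper, not a Frontera or Kafka API.

```python
import zlib
from urllib.parse import urlparse


def partition_for(url, num_partitions):
    """Map a URL to a partition so that all URLs of a host land together."""
    host = urlparse(url).hostname or ""
    return (zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF) % num_partitions
```

Every URL of a given host maps to the same partition, so only one strategy worker ever caches that host's states and the caches of different workers do not overlap.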

State cache • All operations are batched: • if a key is absent from the cache, it is requested from HBase, • every ~4K documents the cache is flushed to HBase. • On reaching 3M (~1 GB) elements, a flush and cleanup happens. • A Least-Recently-Used (LRU) algorithm seems to be a good fit here.
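
A minimal sketch of such a cache with LRU cleanup follows. It assumes a dict-like `backend` standing in for the HBase table; the class and method names are hypothetical and do not mirror Frontera's code.

```python
from collections import OrderedDict


class StateCache:
    """Write-back cache of URL states with least-recently-used eviction."""

    def __init__(self, backend, max_size):
        self.backend = backend      # dict-like stand-in for the HBase table
        self.max_size = max_size
        self.cache = OrderedDict()  # least recently used entries come first
        self.dirty = set()          # keys changed since the last flush

    def fetch(self, keys):
        """Batch-load the keys that are missing from the cache."""
        for key in keys:
            if key in self.cache:
                self.cache.move_to_end(key)
            elif key in self.backend:
                self.cache[key] = self.backend[key]
        self._cleanup()

    def get(self, key):
        return self.cache.get(key)

    def set(self, key, state):
        self.cache[key] = state
        self.cache.move_to_end(key)
        self.dirty.add(key)
        self._cleanup()

    def flush(self):
        """Write all changed states back to the backend in one pass."""
        for key in self.dirty:
            self.backend[key] = self.cache[key]
        self.dirty.clear()

    def _cleanup(self):
        """On overflow, flush first, then drop the least recently used entries."""
        if len(self.cache) > self.max_size:
            self.flush()
            while len(self.cache) > self.max_size:
                self.cache.popitem(last=False)
```

Flushing before eviction means no unsaved state is ever dropped. A caller would also invoke `flush()` periodically, for example every few thousand documents as on the slide.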

Spider priority queue (slot) • Each cell holds an array of: - fingerprint, - Crc32(hostname), - URL, - score. • Dequeueing takes the top N. • Such a design is vulnerable to flooding by huge hosts. • This problem can be partially solved with a scoring model that takes into account the known document count per host.

8. Problem of big and small hosts (strikes back!) • During crawling we found a few very large hosts (>20M docs). • All queue partitions were flooded with pages from these few huge hosts, because of the queue design and the scoring model used. • We wrote two MapReduce jobs: • queue shuffling, • limiting all hosts to no more than 100 documents.
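
The per-host limiting job can be pictured as a group-by-host step followed by a top-N selection. The sketch below runs in plain Python rather than as a MapReduce job, and keeping the highest-scored URLs per host is an assumption: the slide only states the limit itself.

```python
import heapq
from collections import defaultdict


def limit_per_host(records, limit):
    """Keep at most `limit` queue records per host.

    records: iterable of (host, url, score) tuples.
    """
    by_host = defaultdict(list)          # "map" phase: group records by host
    for host, url, score in records:
        by_host[host].append((score, url))
    kept = []
    for host, items in by_host.items():  # "reduce" phase: top `limit` per host
        for score, url in heapq.nlargest(limit, items):
            kept.append((host, url, score))
    return kept
```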

Hardware requirements • A single-threaded Scrapy spider gives 1200 pages/min. from about 100 websites in parallel. • Spiders to workers ratio is 4:1 (without content). • 1 GB of RAM for every SW (state cache, tunable). • Example: • 12 spiders ~ 14.4K pages/min., • 3 SW and 3 DB workers, • Total 18 cores.
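
The example on this slide can be reproduced with a small back-of-the-envelope calculator. It assumes that the 4:1 ratio applies to each worker type separately and that every spider and worker occupies one core, which matches the 18-core total but is not stated explicitly.

```python
PAGES_PER_SPIDER_PER_MIN = 1200  # from the slide
SPIDERS_PER_WORKER = 4           # 4:1 spiders-to-workers ratio
RAM_GB_PER_STRATEGY_WORKER = 1   # state cache, tunable


def estimate_cluster(spiders):
    """Estimate throughput and resources for a given number of spiders."""
    # Ceiling division: a partial group of spiders still needs a full worker.
    workers = -(-spiders // SPIDERS_PER_WORKER)
    return {
        "pages_per_min": spiders * PAGES_PER_SPIDER_PER_MIN,
        "strategy_workers": workers,
        "db_workers": workers,
        "cores": spiders + 2 * workers,  # assumes one core per process
        "sw_ram_gb": workers * RAM_GB_PER_STRATEGY_WORKER,
    }
```

For 12 spiders this gives 14,400 pages/min., 3 strategy workers, 3 DB workers and 18 cores, as on the slide.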

Software requirements • Apache HBase, • Apache Kafka, • CDH (100% open source Hadoop package), • Python 2.7+, • Scrapy 0.24+, • DNS service.

Maintaining Cloudera Hadoop on Amazon EC2 • CDH is very sensitive to free space on the root partition, parcels, and the Cloudera Manager storage. • We moved them to a separate EBS partition using symbolic links. • The EBS volume should be at least 30 GB; base IOPS should be enough. • Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB RAM, 2 x 40 GB SSD). • After one week of crawling we ran out of space and started to move DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3 x 2 TB HDD).

Spanish (.es) internet crawl results • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites, • 68.7K domains found (~600K expected), • 46.5M crawled pages overall, • 1.5 months, • 22 websites with more than 50M pages.

Where are the rest of the web servers?!

Bow-tie model A. Broder et al. / Computer Networks 33 (2000) 309-320

Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005

Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014

Main features • Online operation: scheduling of new batches, updating of DB state. • Storage abstraction: write your own backend (SQLAlchemy and HBase are included). • Canonical URL resolution abstraction: each document has many URLs, which one to use? • Scrapy ecosystem: good documentation, big community, ease of customization.

Main features • Communication layer is Apache Kafka: topic partitioning, offsets mechanism. • Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module. • Polite by design: each website is downloaded by at most one spider. • Python: workers, spiders.

References • Distributed Frontera. https://github.com/scrapinghub/distributed-frontera • Frontera. https://github.com/scrapinghub/frontera • Documentation: • http://distributed-frontera.readthedocs.org/ • http://frontera.readthedocs.org/

Future plans • Lighter version, without HBase and Kafka, communicating over sockets. • Revisiting strategy out of the box. • Watchdog solution: tracking website content changes. • PageRank or HITS strategy. • Own HTML and URL parsers. • Integration into Scrapinghub services. • Testing on larger volumes.

Contribute! • Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python. • Truly resource-intensive task: CPU, network, disks. • Made at Scrapinghub, the company where Scrapy was created. • Plans to become an Apache Software Foundation project.

We’re hiring! http://scrapinghub.com/jobs/

Köszönöm! Thank you! Alexander Sibiryakov, sibiryakov@scrapinghub.com
