Apache Nutch as a Web mining platform: the present and the future
Andrzej Białecki, ab@sigram.com
Nutch – Berlin Buzzwords '10

Intro
● Started using Lucene in 2003 (1.2-dev?)
● Created Luke – the Lucene Index Toolbox
● Nutch, Lucene committer, Lucene PMC member
● Nutch project lead

Agenda
● Nutch architecture overview
● Crawling in general – strategies and challenges
● Nutch workflow
● Web data mining with Nutch, with examples
● Nutch present and future
● Questions and answers

Apache Nutch project
● Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
● Apache project since 2004 (sub-project of Lucene)
● Spin-offs:
  – Map-Reduce and distributed FS → Hadoop
  – Content type detection and parsing → Tika
● Many installations in operation, mostly vertical search
● Collections typically 1 million to 200 million documents
● Apache Top-Level Project since May
● Current release 1.1

What's in a search engine?
… a few things that may surprise you!

Search engine building blocks
[Diagram of components: Injector, Scheduler, Crawler, Parser, Updater, Indexer, Searcher; data stores: Web graph (page info, links in/out), Content repository; Crawling frontier controls]

Nutch features at a glance
● Plugin-based, highly modular:
  – Most behaviors can be changed via plugins
● Data repository:
  – Page status database and link database (web graph)
  – Content and parsed data database (shards)
● Multi-protocol, multi-threaded, distributed crawler
● Robust crawling frontier controls
● Scalable data processing framework
  – Hadoop MapReduce processing
● Full-text indexer & search front-end
  – Using Solr (or Lucene)
  – Support for distributed search
● Flexible integration options

Nutch building blocks
[Diagram of components: Injector, Generator, Fetcher, Parser, Updater, Link inverter, Indexer, Searcher; data stores: CrawlDB, Shards (segments), LinkDB; plugins: URL filters & normalizers, parsing/indexing filters, scoring plugins]

Nutch data: CrawlDB
Maintains info on all known URLs:
● Fetch schedule
● Fetch status
● Page signature
● Metadata
[Diagram: Nutch building blocks]

Nutch data: LinkDB
For each target URL, keeps info on incoming links, i.e. a list of source URLs and their associated anchor text
[Diagram: Nutch building blocks]
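
The idea of keeping, for each target URL, its incoming links and anchor text can be illustrated with a minimal sketch. This is not Nutch code: the function name invert_links, the dictionary layout, and the example URLs are all assumptions made for illustration. It only shows how outlinks discovered per source page can be regrouped by target.

```python
from collections import defaultdict


def invert_links(outlinks):
    """Regroup {source: [(target, anchor), ...]} by target.

    Returns {target: [(source, anchor), ...]}, i.e. the incoming links
    of every target URL together with their anchor text.
    """
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, anchor in links:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


# Hypothetical outlinks, as they might be discovered while parsing two pages.
outlinks = {
    "http://a.example/": [
        ("http://b.example/", "B home"),
        ("http://c.example/", "see C"),
    ],
    "http://b.example/": [
        ("http://c.example/", "C docs"),
    ],
}

inlinks = invert_links(outlinks)
for target in sorted(inlinks):
    print(target, inlinks[target])
```

Pages that nobody links to (here http://a.example/) simply have no entry in the inverted structure.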

Nutch data: shards
Shards (“segments”) keep:
● Raw page content
● Parsed content + discovered metadata + outlinks
● Plain text for indexing and snippets
[Diagram: Nutch building blocks]

Shard-based workflow
● Unit of work (batch) – easier to process massive datasets
● Convenience placeholder, using predefined directory names
● Unit of deployment to the search infrastructure
  – Solr-based search may discard shards once indexed
● Once completed they are basically unmodifiable
  – No in-place updates of content, or replacing of obsolete content
● Periodically phased out by new, re-crawled shards
  – Solr-based search can update Solr index in-place

Shard directory layout (e.g. 200904301234/):
  crawl_generate/   written by Generator
  crawl_fetch/      written by Fetcher
  content/          written by Fetcher; source of the “cached” view
  crawl_parse/      written by Parser
  parse_data/       written by Parser
  parse_text/       written by Parser; used for snippets and by the Indexer

Crawling frontier challenge
● No authoritative catalog of web pages
● Crawlers need to discover their own view of the web universe
● Start from a “seed list” & follow (walk) some (useful? interesting?) outlinks
● Many dangers of simply wandering around
  – Explosion or collapse of the frontier; collecting unwanted content (spam, junk, offensive)
[Illustration caption: “I need a few interesting items...”]

High-quality seed list
● Reference sites:
  – Wikipedia, FreeBase, DMOZ
  – Existing verticals
● Seeding from existing search engines
  – Collect top-N URLs for characteristic keywords
● Seed URLs plus 1:
  – First hop usually retains high quality and focus
  – Remove blatantly obvious junk
[Diagram: seed + 1 hop (i = 1)]

Controlling the crawling frontier
● URL filter plugins
  – White-list, black-list, regex
  – May use external resources (DBs, services ...)
● URL normalizer plugins
  – Resolving relative path elements
  – “Equivalent” URLs
● Additional controls
  – Priority, metadata select/block
  – Breadth first, depth first, per-site mixed ...
[Diagram: frontier growing from the seed over hops i = 1, i = 2, i = 3]
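
A rough sketch of what URL normalization and regex filtering do to a discovered link, written in plain Python rather than as Nutch plugins. The rule list, the first-match-wins convention, the example.org domain, and the function names normalize and accept are assumptions of this sketch, not the behavior of any specific Nutch plugin.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical rules: the first matching rule decides; "+" accepts, "-" rejects.
# URLs that match no rule are rejected (white-list behavior).
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("-", re.compile(r"[?&]sessionid=", re.IGNORECASE)),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
]


def normalize(base, link):
    """Resolve a link against its page URL and map equivalent forms together."""
    absolute = urljoin(base, link)  # resolves "../" and "./" path elements
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    # Lower-case scheme and host, use "/" for an empty path, drop the fragment.
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))


def accept(url):
    """Return True if the URL may enter the crawling frontier."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


page = "http://www.Example.org/docs/a/index.html"
for link in ["../b/./page.html#top", "/logo.PNG", "http://spam.example.com/"]:
    url = normalize(page, link)
    print(url, accept(url))
```

Normalizing before filtering means the filter rules only have to describe one canonical form of each URL.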

Wide vs. focused crawling
● Differences:
  – Little technical difference in configuration
  – Big difference in operations, maintenance and quality
● Wide crawling:
  – (Almost) unlimited crawling frontier
  – High risk of spamming and junk content
  – “Politeness” a very important limiting factor
  – Bandwidth & DNS considerations
● Focused (vertical or enterprise) crawling:
  – Limited crawling frontier
  – Bandwidth or politeness is often not an issue
  – Low risk of spamming and junk content

Vertical & enterprise search
● Vertical search
  – Range of selected “reference” sites
  – Robust control of the crawling frontier
  – Extensive content post-processing
  – Business-driven decisions about ranking
● Enterprise search
  – Variety of data sources and data formats
  – Well-defined and limited crawling frontier
  – Integration with in-house data sources
  – Little danger of spam
  – PageRank-like scoring usually works poorly

Face to face with Nutch

Installation & basic config
● http://nutch.apache.org
● Java 1.5+
● Single-node out of the box
  – Also comes as a “job” jar to run on an existing Hadoop cluster
● File-based configuration: conf/
  – Plugin list
  – Per-plugin configuration
● … much, much more on this on the Wiki

Main Nutch workflow
Command-line: bin/nutch
● Inject: initial creation of CrawlDB [inject]
  – Insert seed URLs
  – Initial LinkDB is empty
● Generate new shard's fetchlist [generate]
● Fetch raw content [fetch]
● Parse content (discovers outlinks) [parse]
● Update CrawlDB from shards [updatedb]
● Update LinkDB from shards [invertlinks]
● Index shards [index / solrindex]
(repeat)
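
The inject / generate / fetch / parse / updatedb cycle can be mimicked with a toy in-memory model. The sketch below is an illustration only and makes several assumptions: the WEB dictionary stands in for the real web, a plain dict stands in for CrawlDB, the status strings are invented, and the function names merely echo the command names above. It does not call or reproduce any Nutch code.

```python
# Hypothetical in-memory "web": page URL -> list of outlink URLs.
WEB = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["seed"],
}


def inject(crawldb, seeds):
    """Add seed URLs to the (toy) CrawlDB without touching known entries."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Pick up to top_n unfetched URLs as the fetchlist of a new shard."""
    return sorted(u for u, status in crawldb.items() if status == "unfetched")[:top_n]


def fetch(fetchlist):
    """'Download' each URL; the raw content here is just its outlink list."""
    return {url: WEB.get(url) for url in fetchlist}


def parse(content):
    """Extract outlinks from the pages that were fetched successfully."""
    return {url: list(links) for url, links in content.items() if links is not None}


def updatedb(crawldb, content, parsed):
    """Record fetch results and add newly discovered URLs."""
    for url, raw in content.items():
        crawldb[url] = "fetched" if raw is not None else "gone"
    for links in parsed.values():
        for target in links:
            crawldb.setdefault(target, "unfetched")


def crawl(seeds, rounds, top_n):
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        fetchlist = generate(crawldb, top_n)
        if not fetchlist:
            break  # frontier exhausted
        content = fetch(fetchlist)
        parsed = parse(content)
        updatedb(crawldb, content, parsed)
    return crawldb


print(crawl(["seed"], rounds=2, top_n=10))
```

Each pass through the loop corresponds to one shard: the fetchlist is fixed at generate time, and URLs discovered while parsing only become fetchable in a later round.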

Injecting new URLs
[Diagram: Nutch building blocks]

Generating fetchlists
[Diagram: Nutch building blocks]

Fetching content
[Diagram: Nutch building blocks]

Content processing
[Diagram: Nutch building blocks]

Link inversion
[Diagram: Nutch building blocks]

Page importance - scoring
[Diagram: Nutch building blocks]

Indexing
[Diagram: Nutch building blocks]
