The original vision of Nutch, 14 years later: Building an open source search engine • Apache Big Data Europe 2016 • sylvain@sylvainzimmer.com • @sylvinus

/usr/bin/whoami • Jamendo (Founder & CTO, 2004-2011) • TEDxParis (Co-founder, 2009-2012) • dotConferences (Founder, 2012-) • Pricing Assistant (Co-founder & CTO, 2012-)

"The original motivation for the Nutch project was to provide a transparent alternative to the growing power of a handful of private search services over most users’ view of the Web. However, as Nutch has been adopted with greater enthusiasm by smaller organizations, the Nutch Organization has de-emphasized operating a multi-billion-page index in the public interest." CommerceNet Labs Technical Report, Nov 2004

again?

transparency reproducibility

https://uidemo.commonsearch.org

https://explain.commonsearch.org/?q=python&g=en

Agenda • Values & tech choices • Search engine components • Challenges • Opportunities

Values & tech choices

Radical transparency • Open source (Apache License v2) • Open data • (Governance)

Privacy • Results can be tailored by language/country, but NOT by user/cookie/sessionid • \o/ Cache everything! • Tor service: http://comsearchl2zlnre.onion
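Because results depend only on the query and the language/country, one cached response can be shared by every user. The sketch below illustrates that idea and is not Common Search's actual code: `search_backend`, `cached_search` and `handle_request` are hypothetical names, the `q` and `g` parameters mirror the explain URL above, and the `c` country parameter is an assumption.

```python
from functools import lru_cache

BACKEND_CALLS = []  # records backend hits so cache behaviour is observable


def search_backend(query, lang, country):
    """Hypothetical stand-in for the real index query."""
    BACKEND_CALLS.append((query, lang, country))
    return ("result for %s [%s-%s]" % (query, lang, country),)


@lru_cache(maxsize=10000)
def cached_search(query, lang, country):
    # The cache key is exactly (query, lang, country): no user id,
    # cookie or session id ever reaches this function.
    return search_backend(query, lang, country)


def handle_request(params, cookies=None):
    """Normalize the request; `cookies` is accepted but deliberately ignored."""
    query = params.get("q", "").strip().lower()
    lang = params.get("g", "en")
    country = params.get("c", "")  # assumed parameter name
    return cached_search(query, lang, country)
```

Two requests for the same query and language hit the backend once, whatever cookies they carry.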

Participation & Pragmatism • Use high-level languages as much as possible (Python, Go) • Embrace active communities (Spark, Elasticsearch) • Use mainstream participation platforms, even if they are nonfree (GitHub, Slack)

Search engines

Crawler • Indexer • Database • Ranker • Searcher • Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) http://infolab.stanford.edu/~backrub/google.html

Crawler

http://commoncrawl.org

Today at 3:30pm !

http://scrapy.org

http://github.com/cocrawler/cocrawler

Indexer

Specs • HTML parsing & analysis • Tokenization / NLP • Static rankings • Language detection • I/O from crawls to databases

Common Search Pipeline • Doc sources (Common Crawl, WARC files, URLs ...) • Filter plugins • Document parsing • Output plugins • Data output (Database, file, HDFS, S3, ...)
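The stages on this slide (doc sources, filter plugins, document parsing, output plugins) can be sketched as a small streaming loop. This is only an illustration of the shape, assuming in-memory records instead of WARC files; every function name here is hypothetical and none comes from the Common Search code base.

```python
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def list_source(records):
    """Doc source: yields raw records; a real one would read WARC files."""
    for record in records:
        yield record


def http_only(raw):
    """Filter plugin: keep only http(s) URLs."""
    return raw["url"].startswith(("http://", "https://"))


def parse_document(raw):
    """Document parsing: extract the title from the raw HTML."""
    match = TITLE_RE.search(raw["html"])
    title = match.group(1).strip() if match else ""
    return {"url": raw["url"], "title": title}


def run_pipeline(sources, filters, parse, outputs):
    """Stream every record through the filters, the parser and the outputs."""
    count = 0
    for source in sources:
        for raw in source:
            if not all(keep(raw) for keep in filters):
                continue  # rejected by a filter plugin
            doc = parse(raw)
            for write in outputs:  # output plugins: database, file, ...
                write(doc)
            count += 1
    return count


records = [
    {"url": "http://example.com/", "html": "<title> Hello </title>"},
    {"url": "ftp://example.com/file", "html": ""},
]
index = []  # stands in for a database output
processed = run_pipeline([list_source(records)], [http_only],
                         parse_document, [index.append])
```

Keeping each stage a plain callable is one way to make sources, filters and outputs swappable.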

HTML parsers • BeautifulSoup & friends • lxml • html5lib • Gumbo!

https://github.com/google/gumbo-parser

Gumbocy • Use Cython instead of ctypes • Smaller API • Tree traversal on the Cython side with basic boilerplate/visibility support https://github.com/commonsearch/gumbocy
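To give a feel for what "tree traversal with basic boilerplate/visibility support" means, here is a rough sketch that swaps in Python's standard-library `html.parser` for Gumbo. It is not Gumbocy's API: the tag sets and the `tokenize_html` helper are assumptions chosen for illustration, and the depth counter assumes well-nested tags.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets: content that is never displayed, and common boilerplate.
HIDDEN_TAGS = {"script", "style", "head", "noscript", "template"}
BOILERPLATE_TAGS = {"nav", "footer", "aside"}
SKIPPED_TAGS = HIDDEN_TAGS | BOILERPLATE_TAGS
TOKEN_RE = re.compile(r"\w+")


class VisibleText(HTMLParser):
    """Collects text that sits outside hidden or boilerplate elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def tokenize_html(html):
    """Return lowercase word tokens from the visible, non-boilerplate text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return [t.lower() for t in TOKEN_RE.findall(" ".join(parser.chunks))]
```

A spec-compliant HTML5 parser such as Gumbo recovers from malformed markup far better than this sketch does.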

https://github.com/commonsearch/urlparse4

Database(s)

http://lucene.apache.org/

Ranker

Ranking formula: rank = f(static_score, dynamic_score(query)) • static_score: Alexa, DMOZ, Blacklists, PageRank, ... • dynamic_score: Elasticsearch & Lucene, TF-IDF, BM25, ...
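As a toy illustration of the formula, the sketch below computes a BM25 dynamic score per document and combines it with a query-independent static score. The slide does not define f, so the weighted sum in `rank` is purely an assumption, as are the sample documents and static scores; in practice the dynamic part is delegated to Elasticsearch/Lucene rather than reimplemented.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document (docs must be non-empty)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(static_score, dynamic_score, weight=0.5):
    """Hypothetical f(): a weighted sum of the two scores."""
    return weight * static_score + (1 - weight) * dynamic_score


docs = [["python", "tutorial"], ["python", "python", "snake"], ["cooking", "recipes"]]
static_scores = [0.9, 0.1, 0.5]  # made-up values, e.g. from a PageRank-like signal
dynamic = bm25_scores(["python"], docs)
final = [rank(s, d) for s, d in zip(static_scores, dynamic)]
order = sorted(range(len(docs)), key=lambda i: final[i], reverse=True)
```

Since the static score does not depend on the query, it can be computed once at indexing time.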

https://about.commonsearch.org/developer/get-started

Today @ 4:30pm ;-)

Searcher / Frontend

Specs • Send user query to databases • Search-as-you-type • HTML & JSON endpoints • High performance
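Search-as-you-type can be approximated with a prefix lookup over a sorted list of terms. This is a generic technique shown for illustration, not a description of how cosr-front implements it; `suggest` and the sample terms are hypothetical.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix`, in sorted order."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    matches = []
    # All terms sharing the prefix are contiguous, starting at this position.
    i = bisect.bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and len(matches) < limit:
        if not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
        i += 1
    return matches


terms = sorted(["python", "pylons", "pytest", "perl", "ruby"])
suggestions = suggest(terms, "py")
```

The binary search keeps each keystroke at O(log n) plus the number of suggestions returned.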

https://github.com/commonsearch/cosr-front

Crawler • Parser • Index • Ranker • Searcher • Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) http://infolab.stanford.edu/~backrub/google.html

Challenges

Funding / Scale • Frugalism • Caching • In-kind services • Individual donations / Foundation grants • General economic incentives

Spam • Email spam • Wikipedia vandalism • Algorithm complexity & scale • Given enough eyeballs, all spam is shallow?

Relevance • Exhaustivity • Rescoring • Evaluation • More at 4:30pm ;-)

More search dimensions • Realtime search • Local search • Universal search

Semantic search • Wikidata • YAGO • Conversational / Voice search

Outreach • Easy onboarding & docs • Making people care / believe

Opportunities

Decentralization • YaCy • Extremely high technical & social cost! • Transparency?

Research • More people should know how to build search engines • Spam, Relevance, Large-scale data processing • We need more open datasets!

https://about.commonsearch.org/blog/

Make the Web a better place! • SEO • Transparency • Influence of money • Public service

Questions? https://about.commonsearch.org/contributing https://github.com/commonsearch contact@commonsearch.org slack.commonsearch.org
