Spinn3r architecture and data
Kevin Burton, Founder/CEO

Jun 21, 2023

What is Spinn3r?
• Licensed weblog, forum, and social media crawler
  – Save $40k per month
• 300k posts per hour
• 21TB of content (1.2TB per month)
• 18 months of archives
• 3B documents
• 150+ Mb/s, 24/7

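The headline figures above can be cross-checked with a little arithmetic. The sketch below assumes a constant growth rate (the slide does not say growth was constant), and the function names are illustrative only.

```python
# Figures quoted on the slide.
POSTS_PER_HOUR = 300_000
TOTAL_CONTENT_TB = 21.0
GROWTH_TB_PER_MONTH = 1.2


def posts_per_second(posts_per_hour: float) -> float:
    """Convert an hourly post rate to a per-second rate."""
    return posts_per_hour / 3600


def months_of_archive(total_tb: float, tb_per_month: float) -> float:
    """Months needed to accumulate total_tb at a constant monthly rate (assumption)."""
    return total_tb / tb_per_month


if __name__ == "__main__":
    print(f"{posts_per_second(POSTS_PER_HOUR):.1f} posts/s")  # 83.3 posts/s
    print(f"{months_of_archive(TOTAL_CONTENT_TB, GROWTH_TB_PER_MONTH):.1f} months")  # 17.5 months
```

Under that assumption, 21TB at 1.2TB per month comes to about 17.5 months, which is in line with the 18 months of archives quoted on the slide.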
Theory of Operation
• Index content as quickly as possible
• Make compromises for latency and throughput
• No spam
• Discard no metadata

Hardware
• 40 mid-range (scale diagonally) Intel servers
• 22TB of raw storage, ~60TB effective
• 200GB of in-memory data
• Three replicas
• Fault tolerant database
• Highly available

Live indexing
• Receive pings from social media sites
• Index content cyclically (30 minutes) for sites without pings
• Traditional crawlers must make sacrifices (crawl rate)
• Hybrid approach works well

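The hybrid approach on this slide can be sketched as a small scheduler: a ping makes a source due immediately, and every source is otherwise revisited on a fixed 30-minute cycle. This is a minimal sketch, not Spinn3r's implementation; the class `HybridScheduler` and its methods are hypothetical names, and time is passed in as plain seconds to keep the example deterministic.

```python
import heapq
from dataclasses import dataclass, field

POLL_INTERVAL_S = 30 * 60  # the 30-minute cycle quoted on the slide


@dataclass
class HybridScheduler:
    """Pinged sources are fetched right away; all others are polled on a cycle."""
    poll_interval: float = POLL_INTERVAL_S
    _due: list = field(default_factory=list)       # min-heap of (due_time, url)
    _next_due: dict = field(default_factory=dict)  # url -> currently valid due time

    def add_source(self, url, now):
        self._schedule(url, now)

    def ping(self, url, now):
        # A ping signals fresh content, so the source becomes due immediately.
        self._schedule(url, now)

    def _schedule(self, url, due):
        self._next_due[url] = due
        heapq.heappush(self._due, (due, url))

    def pop_due(self, now):
        """Return sources due at `now` and reschedule each for the next cycle."""
        ready = []
        while self._due and self._due[0][0] <= now:
            due, url = heapq.heappop(self._due)
            if self._next_due.get(url) != due:
                continue  # stale heap entry, superseded by a later schedule or ping
            ready.append(url)
            self._schedule(url, now + self.poll_interval)
        return ready
```

A source that pings often is fetched as soon as it pings, while a silent source is still visited once per cycle, so neither kind depends on the other's crawl rate.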
Indexing Rates
• ~2-5M HTTP requests per hour
• 2-4k HTTP requests per second
  – RSS
  – Permalink URLs
  – New source discovery
  – Spam detection (90% of the ping stream)
  – Ping handling

RSS and Atom
• Rich metadata
  – Accurate title
  – Tags
  – Publication time
  – Huge waste of bandwidth

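To illustrate the metadata listed above, the following sketch pulls the title, tags, and publication time out of an RSS 2.0 item using only the Python standard library. The sample feed, its URL, and the function name `parse_items` are made up for illustration; a production crawler would also need to handle Atom, namespaces, and malformed feeds.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed used only for this example.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Hello world</title>
      <link>http://example.com/hello</link>
      <category>intro</category>
      <category>meta</category>
      <pubDate>Mon, 18 May 2009 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> with its title, link, tags, and publication time."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "tags": [c.text for c in item.findall("category")],
            # RSS 2.0 dates follow RFC 822, which email.utils can parse.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items
```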
Language classification
• Do not trust manually selected languages
• N-gram model
• Code page detection
• In production for more than three years

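The slide names an N-gram model without saying which one. One well-known variant ranks character trigrams by frequency and compares rank orders (the "out-of-place" measure); the sketch below assumes that variant and uses two toy training sentences, so it illustrates the idea rather than the classifier Spinn3r ran. All names and training texts are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams in `text`, most common first."""
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))


def classify(text, profiles):
    """Pick the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (assumption): real profiles are built from large corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat with the other animals",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato "
          "se sentó en la alfombra con los otros animales",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
```

Because the classifier looks only at the text itself, it does not depend on the language an author selected by hand, which is the point of the first bullet.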
Fighting Spam
• Link analysis
• Text analysis
• Long tail content is the hardest

Spam Statistics
• 30% of our time is spent fighting spam
• 95% of pings are from spammers
• Primarily stolen content
• 10% malware
  – BAD when it happens

Smart Spammers
• Don’t assume you can win
• Spammers are getting smarter
• Your elegant theory will be torn to shreds in practice
  – Pragmatism rules

Content Extraction
• High-ranking sites disable full content in RSS/Atom feeds
  – Increases ad revenue
  – Reduces bandwidth cost
  – Probability that you will have summary content is directly proportional to your rank
• Full content is needed for search, sentiment analysis, link graph, etc.

Identify Full Content
• Strip all redundant HTML
• Only return content
• Result should be well formed XHTML including <strong> <em> <a> elements

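The tag-stripping step above can be sketched with the standard library's `html.parser`. This is a minimal sketch under stated assumptions: it keeps only the three inline elements named on the slide, drops the text inside `<script>` and `<style>` (an assumption), and does not attempt the harder problem of locating the main content block on a page. The names `ContentCleaner` and `clean` are hypothetical, and the output is only well formed if the kept tags were balanced in the input.

```python
from html import escape
from html.parser import HTMLParser

KEEP = {"strong", "em", "a"}        # inline elements named on the slide
SKIP_CONTENT = {"script", "style"}  # assumption: their text is never content


class ContentCleaner(HTMLParser):
    """Drop every tag except KEEP, and escape text so the output stays valid."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip_depth += 1
        elif tag in KEEP and not self._skip_depth:
            if tag == "a":
                # Keep only href; other attributes (onclick, style, ...) are dropped.
                href = dict(attrs).get("href") or ""
                self.out.append(f'<a href="{escape(href, quote=True)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in KEEP and not self._skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))


def clean(html_text):
    """Return the text of `html_text` with only whitelisted inline markup."""
    parser = ContentCleaner()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left behind by the removed block elements.
    return " ".join("".join(parser.out).split())
```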
Ranking
• Time based rank
• Indegree
• Multiple stable ranking vectors
  – Language
  – Category
  – Time

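The slide lists time and indegree as ranking signals but does not say how they are combined. One simple reading, shown here purely as an assumption, is to count the distinct sources that linked to each target within a recent time window. The function name, the tuple layout, and the 30-day default are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def indegree_rank(links, now, window=timedelta(days=30)):
    """Rank targets by the number of distinct sources linking to them recently.

    `links` is an iterable of (source, target, seen_at) tuples. Self-links and
    links seen outside the window are ignored.
    """
    cutoff = now - window
    recent = {(src, dst) for src, dst, seen in links
              if cutoff <= seen <= now and src != dst}
    counts = Counter(dst for _, dst in recent)
    # Highest indegree first; ties are broken by name for a deterministic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Restricting the count to one language or category would give a separate ranking per vector, in the spirit of the last bullet.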
Comments
• RSS/Atom feeds
• Template parsing
• Comment hosting

What’s next
• More data for ICWSM in 2010
  – Comments
  – Content extract
  – Full HTML
  – 4TB
• Tighter duplicate content suppression
• New ranking
• Clustering
