Rainbird: Real-time Analytics @Twitter
Kevin Weil -- @kevinweil
Product Lead for Revenue, Twitter
Thursday, February 3, 2011

Agenda
‣ Why Real-time Analytics?
‣ Rainbird and Cassandra
‣ Production Uses at Twitter
‣ Open Source

My Background
‣ Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
‣ Now revenue products!

Why Real-time Analytics
‣ Twitter is real-time
‣ ... even in space

And My Personal Favorite

Real-time Reporting
‣ Discussion around ad-based revenue model
‣ Help shape the conversation in real-time with Promoted Tweets
‣ Real-time reporting ties it all together

Requirements
‣ Extremely high write volume
  ‣ Needs to scale to 100,000s of WPS
‣ High read volume
  ‣ Needs to scale to 10,000s of RPS
‣ Horizontally scalable (reads, storage, etc)
  ‣ Needs to scale to 100+ TB
‣ Low latency
  ‣ Most reads <100 ms (esp. recent data)

Cassandra
‣ Pro: In-house expertise
‣ Pro: Open source Apache project
‣ Pro: Writes are extremely fast
‣ Pro: Horizontally scalable, low latency
‣ Pro: Other startup adoption (Digg, SimpleGeo)
‣ Con: It was really young (0.3a)

Cassandra
‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra
‣ Say hi to @kelvin
‣ And @lenn0x
‣ A dude from Sweden began helping: @skr
‣ Now all at Twitter :)

Rainbird
‣ It counts things. Really quickly.
‣ Layers on top of the distributed counters patch, CASSANDRA-1072
‣ Relies on ZooKeeper, Cassandra, Scribe, Thrift
‣ Written in Scala

Rainbird Design
‣ Aggregators buffer for 1m (one minute)
‣ Intelligent flush to Cassandra
‣ Query servers read once written
‣ 1m is configurable
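
The buffering step above can be sketched roughly as follows. This is a minimal illustration, not Rainbird's implementation (which is written in Scala): the class name `BufferingAggregator`, the `flush` callback, and the injectable `clock` are all assumptions made for the example, and the flush here is a plain timer that does not attempt whatever makes the real flush "intelligent".

```python
import time
from collections import defaultdict


class BufferingAggregator:
    """Sums increments in memory and hands them to a sink once per interval."""

    def __init__(self, flush, interval_seconds=60.0, clock=time.monotonic):
        self._flush = flush                # hypothetical sink, e.g. a Cassandra writer
        self._interval = interval_seconds  # the configurable "1m" from the slide
        self._clock = clock
        self._buffer = defaultdict(int)
        self._last_flush = clock()

    def record(self, counter_key, value=1):
        """Flush first if the interval has elapsed, then buffer the increment."""
        self.flush_if_due()
        self._buffer[counter_key] += value

    def flush_if_due(self):
        now = self._clock()
        if now - self._last_flush >= self._interval:
            self.flush_now(now)

    def flush_now(self, now=None):
        """Send the buffered totals as one batch and reset the buffer."""
        if self._buffer:
            self._flush(dict(self._buffer))
            self._buffer.clear()
        self._last_flush = self._clock() if now is None else now


# Usage: two increments to the same key become a single buffered total.
written = []
agg = BufferingAggregator(written.append, interval_seconds=60.0)
agg.record(("pti", "advertiser_1"))
agg.record(("pti", "advertiser_1"))
agg.flush_now()
```

Buffering trades freshness for write volume: many increments to one counter within the interval reach the store as a single write, at the cost of reads lagging by up to one interval.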

Rainbird Data Structures
struct Event {
  1: i32 timestamp,        // Unix timestamp of event
  2: string category,      // Stat category name
  3: list<string> key,     // Stat keys (hierarchical)
  4: i64 value,            // Actual count (diff)
  // More later:
  5: optional set<Property> properties,
  6: optional map<Property, i64> propertiesWithCounts
}
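
For readers who do not use Thrift, the struct above can be mirrored as a plain Python dataclass. This is only an illustrative sketch: the slides do not define `Property`, so it is aliased to `str` here as a placeholder, and the sample identifiers are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

Property = str  # placeholder assumption: the slides do not define Property


@dataclass
class Event:
    timestamp: int   # Unix timestamp of event
    category: str    # stat category name
    key: List[str]   # stat keys (hierarchical)
    value: int       # actual count (diff)
    properties: Optional[Set[Property]] = None
    properties_with_counts: Optional[Dict[Property, int]] = None


# One event with made-up identifiers and an arbitrary Unix timestamp.
impression = Event(
    timestamp=1296691200,
    category="pti",
    key=["advertiser_7", "campaign_3", "tweet_42"],
    value=1,
)
```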

Hierarchical Aggregation
‣ Say we’re counting Promoted Tweet impressions
  ‣ category = pti
  ‣ keys = [advertiser_id, campaign_id, tweet_id]
  ‣ count = 1
‣ Rainbird automatically increments the count for
  ‣ [advertiser_id, campaign_id, tweet_id]
  ‣ [advertiser_id, campaign_id]
  ‣ [advertiser_id]
‣ Means fast queries over each level of hierarchy
‣ Configurable in rainbird.conf, or dynamically via ZK
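
The prefix expansion described above can be sketched in a few lines. The function names `key_prefixes` and `count_event`, the in-memory `Counter`, and the sample identifiers are assumptions for illustration; Rainbird itself stores these counts in Cassandra.

```python
from collections import Counter


def key_prefixes(keys):
    """Return every non-empty prefix of keys, longest first."""
    return [tuple(keys[:n]) for n in range(len(keys), 0, -1)]


def count_event(counters, category, keys, value=1):
    """Increment the counter at each level of the key hierarchy."""
    for prefix in key_prefixes(keys):
        counters[(category, prefix)] += value


counters = Counter()
count_event(counters, "pti", ["advertiser_7", "campaign_3", "tweet_42"])
count_event(counters, "pti", ["advertiser_7", "campaign_3", "tweet_99"])
count_event(counters, "pti", ["advertiser_7", "campaign_8", "tweet_11"])

print(counters[("pti", ("advertiser_7",))])               # 3
print(counters[("pti", ("advertiser_7", "campaign_3"))])  # 2
```

Each event costs one write per hierarchy level, which is what makes a read at any level a single counter lookup rather than a scan.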

Hierarchical Aggregation
‣ Another example: tracking URL shortener tweets/clicks
  ‣ full URL = http://music.amazon.com/some_really_long_path
  ‣ keys = [com, amazon, music, full URL]
  ‣ count = 1
‣ Rainbird automatically increments the count for
  ‣ [com, amazon, music, full URL] (How many people tweeted the full URL?)
  ‣ [com, amazon, music] (How many people tweeted any music.amazon.com URL?)
  ‣ [com, amazon] (How many people tweeted any amazon.com URL?)
  ‣ [com]
‣ Means we can count clicks on full URLs
‣ And automatically aggregate over domains and subdomains!
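
One way to derive the hierarchical keys from a URL, as in the example above, is to reverse the hostname labels and append the full URL. The helper name `url_to_keys` is an assumption for this sketch; the slides do not show how Rainbird's clients actually build the key list.

```python
from urllib.parse import urlsplit


def url_to_keys(url):
    """Split a URL into reversed hostname labels followed by the full URL."""
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    return list(reversed(labels)) + [url]


print(url_to_keys("http://music.amazon.com/some_really_long_path"))
# ['com', 'amazon', 'music', 'http://music.amazon.com/some_really_long_path']
```

This treats every dot-separated label as its own level, so a host under a multi-part suffix such as `co.uk` would get one more level than a `.com` host.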
