Building real-time analytics applications using Pinot: A LinkedIn case study

Member Job Ad Post Company Course

LinkedIn Activity Data Model
Entities: Member, Job, Ad, Post, Company, Course
Activity: Actor, Verb, Object
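The Actor / Verb / Object model can be pictured as a small record type. This is only a sketch: the field names, the identifier format ("member:42") and the timestamp field are assumptions for illustration, not LinkedIn's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One activity event in Actor / Verb / Object form (hypothetical fields)."""
    actor: str         # who acted, e.g. a member
    verb: str          # what they did, e.g. "view", "like", "share"
    obj: str           # what was acted on, e.g. a post, job, or company
    timestamp_ms: int  # event time in epoch milliseconds (assumed field)


# A member liking a post, expressed as one event.
event = Activity(actor="member:42", verb="like", obj="post:111",
                 timestamp_ms=1_700_000_000_000)
```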

Activity Data Scale
● 610+ million users
● 30 million companies
● 3+ million jobs posted per month
● Tens of millions of posts liked/shared per day
● Trillions of events/day

Create Generate LifeCycle Analyze What can we do with all the activity data?

Pinot @ LinkedIn

Who Am I ESPRESSO ThirdEye

Use case 1: Article Analytics

Option 1: Join on the Fly
Activity Table (Stream from App: View, Like, Shares, Comment): Member Id, Action, ArticleId, Time
Member Table: Member Id, Industry, Geo, Skills, Company
  SELECT M.industry, count(*)
  FROM Activity as A INNER join Member as M ON A.memberId = M.memberId
  WHERE A.articleId=<111>
  GROUP BY M.industry
● REALTIME (Depending on storage)
● High Latency
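A minimal in-memory sketch of what Option 1 does at query time: build a lookup from the Member table, scan the Activity table, and count by industry. The sample rows and the function name `count_by_industry` are hypothetical; a real engine would run the SQL on the slide rather than this loop.

```python
from collections import Counter

# Hypothetical sample rows; column names follow the slide.
activity = [
    {"memberId": 1, "action": "view", "articleId": 111, "time": 1},
    {"memberId": 2, "action": "like", "articleId": 111, "time": 2},
    {"memberId": 3, "action": "view", "articleId": 222, "time": 3},
    {"memberId": 1, "action": "share", "articleId": 111, "time": 4},
]
member = [
    {"memberId": 1, "industry": "software"},
    {"memberId": 2, "industry": "finance"},
    {"memberId": 3, "industry": "software"},
]


def count_by_industry(article_id):
    """Join Activity with Member at query time, then group by industry."""
    # Build side of a hash join: memberId -> industry.
    industry_of = {m["memberId"]: m["industry"] for m in member}
    counts = Counter()
    # Probe side: every query scans the activity rows again.
    for row in activity:
        if row["articleId"] == article_id and row["memberId"] in industry_of:
            counts[industry_of[row["memberId"]]] += 1
    return dict(counts)


result = count_by_industry(111)
```

All of the join and aggregation work happens per query, which is why data is visible immediately but latency is high.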

Option 2: Pre Join + Pre Aggregate
Activity Stream (from App: View, Like, Shares, Comment) → Processing Framework (Pre Join + Pre Agg, with Look Up against the Member Table)
Member Table: Member Id, Industry, Geo, Skills, Company
Pre-joined table: Article Id, Industry, ..., Action, Time, Count
  SELECT industry, sum(count)
  FROM PreJoined_Activity_Member
  WHERE articleId=<111>
  GROUP BY industry
● Near real-time ingestion
● Low latency (unpredictable*)
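A minimal sketch of Option 2, assuming the stream processor enriches each event with the member's industry and increments a counter keyed by (articleId, industry, action, hour). The hourly bucket, the sample events, and the names `ingest` and `pre_agg` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical Member lookup table: memberId -> industry.
member_industry = {1: "software", 2: "finance"}

# Pre-joined, pre-aggregated rows: (articleId, industry, action, hour) -> count.
pre_agg = defaultdict(int)


def ingest(event):
    """Pre Join (look up industry) and Pre Agg (bump a counter) at ingestion time."""
    industry = member_industry.get(event["memberId"], "unknown")
    hour = event["time"] // 3600  # assumed time bucket
    pre_agg[(event["articleId"], industry, event["action"], hour)] += 1


def count_by_industry(article_id):
    """Query side: sum(count) over pre-aggregated rows, grouped by industry."""
    out = defaultdict(int)
    for (aid, industry, _action, _hour), count in pre_agg.items():
        if aid == article_id:
            out[industry] += count
    return dict(out)


for e in [
    {"memberId": 1, "action": "view", "articleId": 111, "time": 10},
    {"memberId": 1, "action": "view", "articleId": 111, "time": 20},
    {"memberId": 2, "action": "like", "articleId": 111, "time": 30},
    {"memberId": 1, "action": "view", "articleId": 222, "time": 40},
]:
    ingest(e)
```

The query no longer joins, but it still scans however many pre-aggregated rows match the filter, so latency depends on the data.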

Option 3: Pre Join + Pre Cube + Pre Agg
Activity Stream (from App: View, Like, Shares, Comment) → Processing Framework (Pre Cube, with Look Up against the Member Table)
Member Table: Member Id, Industry, Geo, Skills, Company
Pre-cubed table: Article Id, Industry, ..., Action, Time, Count
  SELECT industry, sum(count)
  FROM PreCubed_Activity_Member
  WHERE articleId=<111> AND company = '*' AND … = '*'
  GROUP BY industry
● Batch (Hourly/Daily)
● Very fast (mostly lookup)
● Extra storage (Curse of dimensionality)
● Re-bootstrap on schema changes
● Limited query capability
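A minimal sketch of pre-cubing, assuming three dimensions: every row is written once for each combination of "keep the value" or "roll up to '*'", so one row with d dimensions touches 2^d cube cells. The dimension subset, the sample rows, and the names `add_to_cube` and `lookup` are hypothetical.

```python
from collections import defaultdict
from itertools import product

DIMENSIONS = ("industry", "geo", "company")  # hypothetical subset of dimensions

# (articleId, industry, geo, company) -> count, where '*' means "all values".
cube = defaultdict(int)


def add_to_cube(article_id, row, count=1):
    """Write one row into every roll-up combination (2 ** len(DIMENSIONS) cells)."""
    choices = [(row[d], "*") for d in DIMENSIONS]
    for key in product(*choices):
        cube[(article_id,) + key] += count


def lookup(article_id, **filters):
    """Answer a query with a single key lookup; unfiltered dimensions are '*'."""
    key = tuple(filters.get(d, "*") for d in DIMENSIONS)
    return cube.get((article_id,) + key, 0)


add_to_cube(111, {"industry": "software", "geo": "us", "company": "acme"})
add_to_cube(111, {"industry": "software", "geo": "ca", "company": "acme"})
add_to_cube(111, {"industry": "finance", "geo": "us", "company": "initech"})

software_total = lookup(111, industry="software")
```

Queries become key lookups, while storage grows with the number of dimension combinations, which is the curse of dimensionality the slide refers to.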

Comparison
[Chart: Latency vs. Flexibility across Activity Table + Member (join), Pre Join, PreAggregation, and PreCubed. Systems shown: Presto, BigQuery, RedShift, Pinot, Druid, ElasticSearch, InfluxDB, Kylin, KV Store]

Publisher Analytics Architecture
[Diagram: Article Activity (View, Like, Shares, Comment) and the Espresso Member DB feed Samza; Activity + Member data goes to PINOT (2 year retention), which serves Article Analytics]

Can we use the activities data to improve the feed?

Feed Relevance
Rank the feed based on relevance
01 Identity
● Company
● Geography
● Skills
02 Content
● Views, Likes, Comments
● Age
● Category
03 Behaviour
● Prior interactions
● Interests
● Engagement

Feed Ranking Architecture
[Diagram: Article Activity (View, Like, Shares, Comment), the Espresso Member Table, and the Article Table feed Samza; Activity + Member data + Article Data goes to PINOT (30 day retention), which serves the Feed Ranker]

Feed Ranking Perf Numbers
  SELECT sum(count) from T
  WHERE memberId = <> AND article in (list of 1500 items) AND time >= (now - 14 days)
  GROUP BY action, item, position, time
QPS    p50   p90    p99    p99.9
6400   5ms   25ms   45ms   100ms
Significant increase in engagement

Site Facing use case: Pinot vs Druid
● Sorted Index
● Per query optimizer
● Optional indexing

What Business Insights can we generate from this data?

Posts Published: Breakdown By Country

Distribution: By Industry

Views: Breakdown by Referrer

Slice and Dice UI
● 1000’s of Business Metrics
● Trillions of rows

Dashboard Pipeline Architecture
[Diagram components: Activity Data; HDFS; UMP (Metric Definition and Compute Logic); Espresso (Member, Company, and Article Data); UI (Raptor)]

Dashboard use case: Pinot vs Druid
● ~ 5000 random queries of the form
  ○ select sum(views), time from T where country = 'us' and browser = 'chrome' and … group by Date
● run sequentially one after the other
             Pinot        Druid
Total time   11 minutes   24 minutes
p50          84 ms        136 ms
p90          206 ms       667 ms

Anomaly Detection
Why don’t we monitor these metrics and alert on them?

ThirdEye: Anomaly Detection

ThirdEye: Root Cause Analysis
● Interactive break down
● Multiple sub-second queries
● Domain with 18 Dimensions: action_type, article_type, author_type, #connections, industry, location, verb_type, ….

Anomaly Detection
TOP LEVEL
  SELECT sum(view), time FROM PostView GROUP BY time
MULTI DIMENSIONAL
  for d1 in [us, ca, … ]
    for d2 in [chrome, ie, … ]
      …
      SELECT sum(view), time FROM PostView
      WHERE country = d1 AND browser = d2 AND ...
      GROUP BY time
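The nested loops on this slide can be sketched as a generator of one top-level query plus one query per dimension combination. The function name `build_queries` and the sample dimension values are assumptions; the query text follows the slide.

```python
from itertools import product


def build_queries(dimension_values):
    """Return the top-level query plus one query per combination of dimension values."""
    names = list(dimension_values)
    # TOP LEVEL: no filters.
    queries = ["SELECT sum(view), time FROM PostView GROUP BY time"]
    # MULTI DIMENSIONAL: one query per (d1, d2, ...) combination.
    for combo in product(*(dimension_values[n] for n in names)):
        where = " AND ".join(f"{n} = '{v}'" for n, v in zip(names, combo))
        queries.append(
            f"SELECT sum(view), time FROM PostView WHERE {where} GROUP BY time"
        )
    return queries


queries = build_queries({"country": ["us", "ca"], "browser": ["chrome", "ie"]})
```

The number of queries is the product of the dimension cardinalities plus one, so it grows quickly as dimensions are added.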

Multi-dimensional anomaly detection challenges
  for d1 in [us, ca, … ]
    for d2 in [chrome, ie, … ]
      …
      SELECT sum(view), time FROM PostView
      WHERE country = d1 AND browser = d2 AND ...
      GROUP BY time
1. Identifying issues requires monitoring all possible combinations
2. No Id column (ArticleId, Member Id)
3. Latency is unpredictable even with Inverted Index
select sum(view) where
● country = 'us': scans 60-70% of the rows (Slow)
● country = 'ireland': scans <1% of the rows (Fast)
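A small sketch of why an inverted index alone gives unpredictable latency: the index finds the matching rows quickly, but the aggregation still has to touch every one of them. The row distribution below (65% "us", 0.5% "ireland") is made up to mirror the slide's example, and `sum_views` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical data set of 1000 rows with a skewed country distribution.
rows = (
    [{"country": "us", "view": 1}] * 650
    + [{"country": "ireland", "view": 1}] * 5
    + [{"country": "ca", "view": 1}] * 345
)

# Inverted index: country value -> list of row ids that contain it.
inverted = defaultdict(list)
for doc_id, row in enumerate(rows):
    inverted[row["country"]].append(doc_id)


def sum_views(country):
    """Return (sum of views, fraction of all rows that had to be read)."""
    doc_ids = inverted.get(country, [])
    total = sum(rows[i]["view"] for i in doc_ids)
    return total, len(doc_ids) / len(rows)


us_total, us_fraction = sum_views("us")
ie_total, ie_fraction = sum_views("ireland")
```

The same query shape reads 65% of the rows for one value and under 1% for another, so the cost depends on the filter value rather than on the query.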

Space-Time trade off (Latency vs. Storage)
● Columnar Store: No pre-computation
● Startree Index: Partial pre-computation
● KV Store (Pre-computed): Full Pre-Cube
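A simplified sketch of partial pre-computation, loosely following the star-tree idea and not Pinot's actual implementation: the tree splits on one dimension per level, adds a '*' child that rolls the dimension up, and stops splitting once a node holds few enough rows to scan cheaply. The threshold `MAX_LEAF_RECORDS`, the dimension list, and the sample rows are assumptions.

```python
from collections import defaultdict

DIMS = ("country", "browser")  # hypothetical dimension order
MAX_LEAF_RECORDS = 2           # hypothetical threshold: small nodes are not split


def build(records, depth=0):
    """records is a list of (dimension_values_tuple, metric) pairs."""
    node = {"sum": sum(m for _, m in records), "records": None, "children": None}
    if depth == len(DIMS) or len(records) <= MAX_LEAF_RECORDS:
        node["records"] = records  # leaf: keep raw rows
        return node
    groups = defaultdict(list)
    for dims, m in records:
        groups[dims[depth]].append((dims, m))
    node["children"] = {v: build(g, depth + 1) for v, g in groups.items()}
    node["children"]["*"] = build(records, depth + 1)  # roll-up of this dimension
    return node


def query(node, filters, depth=0):
    """Return (sum, rows_scanned) for equality filters such as {"country": "us"}."""
    if depth == len(DIMS):
        return node["sum"], 0  # every dimension resolved: use the pre-aggregate
    if node["records"] is not None:
        # Small leaf: scan at most MAX_LEAF_RECORDS raw rows.
        matched = [m for dims, m in node["records"]
                   if all(dims[DIMS.index(k)] == v for k, v in filters.items())]
        return sum(matched), len(node["records"])
    child = node["children"].get(filters.get(DIMS[depth], "*"))
    return query(child, filters, depth + 1) if child else (0, 0)


data = ([(("us", "chrome"), 1)] * 3 + [(("us", "ie"), 1)] * 2
        + [(("ireland", "chrome"), 1)])
tree = build(data)
us_result = query(tree, {"country": "us"})
ireland_result = query(tree, {"country": "ireland"})
```

Frequent values such as "us" are answered from a pre-aggregate, while rare values such as "ireland" scan a bounded number of raw rows, which keeps latency predictable without storing the full cube.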

Anomaly Detection: Druid vs Pinot
[Chart comparing Druid, Pinot (Inv Index), and Pinot (Star tree Index)]

Anomaly Detection Architecture
[Diagram components: HDFS; UMP (Metric Definition and Compute Logic); Espresso; ThirdEye; UI (Raptor)]

Pinot usage
✓ MarketPlace
✓ UberPool
✓ UberFreight
✓ Jump
✓ UberEATS
50TB, 1000 qps

Conclusion
[Diagram: Activity Data flows through a Stream Processing Engine into a Key Value store and an OLAP Store, serving Site Facing Applications, Dashboard: Business Analytics, and Anomaly Detection]

Questions
Website: http://pinot.apache.org
Slack: apache-pinot.slack.com
Twitter Handle: @apachepinot, @kishoreBytes
