Building real-time analytics applications using Pinot: A LinkedIn case study
Member Job Ad Post Company Course
LinkedIn Activity Data Model
Entities: Member, Job, Ad, Post, Company, Course
Each activity is recorded as: Actor, Verb, Object
Activity Data Scale
● Tens of million posts liked/shared per day
● 3+ million jobs posted per month
● 30 million companies
● 610+ million users
● Trillions of events/day
What can we do with all the activity data?
LifeCycle: Create, Generate, Analyze
Pinot @ LinkedIn
Who Am I ESPRESSO ThirdEye
Use case 1: Article Analytics
Option 1: Join on the Fly
Activity Stream from the App: View, Like, Shares, Comment
Activity Table: Member Id, Action, ArticleId, Time
Member Table: Member Id, Industry, Geo, Skills, Company
SELECT M.industry, count(*) FROM Activity AS A INNER JOIN Member AS M ON A.memberId = M.memberId WHERE A.articleId = <111> GROUP BY M.industry
● REALTIME (depending on storage)
● High latency
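The join-on-the-fly query can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module as a stand-in store (an assumption for illustration only, not the storage used in the talk); the table layouts follow the slide, and the sample rows and the helper names `build_demo_db` and `join_on_the_fly` are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create in-memory Activity and Member tables with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Member (memberId INTEGER PRIMARY KEY, industry TEXT)")
    conn.execute(
        "CREATE TABLE Activity (memberId INTEGER, action TEXT, articleId INTEGER, time INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Member VALUES (?, ?)",
        [(1, "software"), (2, "finance"), (3, "software")],
    )
    conn.executemany(
        "INSERT INTO Activity VALUES (?, ?, ?, ?)",
        [
            (1, "view", 111, 100),
            (2, "like", 111, 101),
            (3, "view", 111, 102),
            (1, "view", 222, 103),
        ],
    )
    return conn


def join_on_the_fly(conn, article_id):
    """Join Activity with Member at query time and count activities per industry."""
    rows = conn.execute(
        """
        SELECT M.industry, COUNT(*)
        FROM Activity AS A
        INNER JOIN Member AS M ON A.memberId = M.memberId
        WHERE A.articleId = ?
        GROUP BY M.industry
        """,
        (article_id,),
    ).fetchall()
    return dict(rows)
```

Every query pays for the join again, which is why this option is flexible but slow as the Activity table grows.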
Option 2: Pre Join + Pre Aggregate
Activity Stream from the App: View, Like, Shares, Comment
Stream Processing Framework: Pre Join + Pre Agg, with a Look Up against the Member Table (Member Id, Industry, Geo, Skills, Company)
Output table: Article Id, Industry, ..., Action, Time, Count
SELECT industry, sum(count) FROM PreJoined_Activity_Member WHERE articleId = <111> GROUP BY industry
● Near real-time ingestion
● Low latency (unpredictable*)
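A minimal sketch of the pre-join plus pre-aggregate step, written as a plain Python function rather than a job for any particular stream processing framework. The event fields, the hourly time bucket, and the names `pre_join_pre_aggregate` and `industry_counts` are assumptions made for illustration.

```python
from collections import Counter


def pre_join_pre_aggregate(events, members, bucket_seconds=3600):
    """Enrich each activity event with member attributes, then count per key.

    The key mirrors the pre-joined table: (articleId, industry, action, time bucket).
    """
    counts = Counter()
    for event in events:
        member = members.get(event["memberId"])
        if member is None:
            continue  # no matching member: dropped, as an inner join would do
        bucket = event["time"] - event["time"] % bucket_seconds
        key = (event["articleId"], member["industry"], event["action"], bucket)
        counts[key] += 1
    return counts


def industry_counts(counts, article_id):
    """Answer the slide's query: sum(count) grouped by industry for one article."""
    result = Counter()
    for (aid, industry, _action, _bucket), count in counts.items():
        if aid == article_id:
            result[industry] += count
    return dict(result)
```

The query no longer joins anything, but it still aggregates over however many pre-joined rows match the article, so latency depends on the data.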
Option 3: Pre Join + Pre Cube + Pre Agg
Activity Stream from the App: View, Like, Shares, Comment
Stream Processing Framework: Pre Cube, with a Look Up against the Member Table (Member Id, Industry, Geo, Skills, Company)
Output table: Article Id, Industry, ..., Action, Time, Count
SELECT industry, sum(count) FROM PreCubed_Activity_Member WHERE articleId = <111> AND company = '*' AND … = '*' GROUP BY industry
● Very fast (mostly lookup)
● Batch (Hourly/Daily)
● Extra storage (Curse of dimensionality)
● Re-bootstrap on schema changes
● Limited query capability
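The storage cost of pre-cubing is easier to see in code. The sketch below is a generic data-cube build, not the pipeline used in the talk: every input row is expanded into all combinations of "actual value" or '*' per dimension, so a row with n dimensions contributes 2^n keys. The name `pre_cube` and the sample dimension names are hypothetical.

```python
from collections import Counter
from itertools import product

STAR = "*"  # wildcard meaning "aggregated over every value of this dimension"


def pre_cube(rows, dimensions):
    """Pre-compute sum(count) for every combination of dimension value or '*'."""
    cube = Counter()
    for row in rows:
        # For each dimension, a key either pins the row's value or uses the wildcard.
        choices = [(row[dim], STAR) for dim in dimensions]
        for key in product(*choices):
            cube[key] += row["count"]
    return cube
```

A query such as "article 111, industry software, any company" then becomes a single lookup, `cube[(111, "software", "*")]`, at the price of up to 2^n stored keys per input row.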
Comparison (Latency vs. Flexibility)
Data layouts compared: Activity Table + Member, Pre Join, PreAggregation, PreCubed
Systems shown: Presto, BigQuery, RedShift, Pinot, Druid, ElasticSearch, InfluxDB, Kylin, KV Store
Publisher Analytics Architecture
Article Activity (View, Like, Shares, Comment) and the Member DB (Espresso) feed Samza, which emits Activity + Member data into PINOT (2 year retention) to serve Article Analytics.
Can we use the activities data to improve the feed?
Feed Relevance
Rank the feed based on relevance
01 Identity
● Company
● Geography
● Skills
02 Content
● Views, Likes, Comments
● Age
● Category
03 Behaviour
● Prior interactions
● Interests
● Engagement
Feed Ranking Architecture
Article Activity (View, Like, Shares, Comment), the Member Table and the Article Table (Espresso) feed Samza, which emits Activity + Member data + Article Data into PINOT (30 day retention), queried by the Feed Ranker.
Feed Ranking Perf Numbers
SELECT sum(count) FROM T WHERE memberId = <> AND article IN (list of 1500 items) AND time >= (now - 14 days) GROUP BY action, item, position, time

QPS    p50   p90    p99    p99.9
6400   5ms   25ms   45ms   100ms

Significant increase in engagement
Site Facing use case: Pinot vs Druid
● Sorted Index
● Per query optimizer
● Optional indexing
What Business Insights can we generate from this data?
Posts Published: Breakdown By Country
Distribution: By Industry
Views: Breakdown by Referrer
Slice and Dice UI
● 1000's of Business Metrics
● Trillions of rows
Dashboard Pipeline Architecture
Components: Activity Data (HDFS); Member, Company, and Article Data (Espresso); UMP (Metric Definition and Compute Logic); UI (Raptor)
Dashboard use case: Pinot vs Druid
● ~5000 random queries of the form
  ○ select sum(views), time from T where country = us, browser = chrome,… group by Date
● run sequentially one after the other

             Pinot        Druid
Total time   11 minutes   24 minutes
p50          84 ms        136 ms
p90          206 ms       667 ms

Anomaly Detection: Why don't we monitor these metrics and alert?
ThirdEye: Anomaly Detection
ThirdEye: Root Cause Analysis
● Interactive break down
● 18 Dimensions
● Multiple queries, sub-second
Dimensions (Domain): action_type, article_type, author_type, #connections, industry, …., location, verb_type
Anomaly Detection
TOP LEVEL
SELECT sum(view), time FROM PostView GROUP BY time
MULTI DIMENSIONAL
for d1 in [us, ca, … ]
  for d2 in [chrome, ie, … ]
    …
      SELECT sum(view), time FROM PostView WHERE country = d1 AND browser = d2 AND ... GROUP BY time
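The nested loops over dimension values amount to a Cartesian product. A minimal sketch, assuming a hypothetical helper `multi_dimensional_queries` that only builds query strings (values are interpolated directly for readability; real code should bind parameters instead):

```python
from itertools import product


def multi_dimensional_queries(table, metric, dimensions):
    """Yield one time-series query per combination of dimension values.

    `dimensions` maps a column name to the list of values to monitor.
    """
    names = list(dimensions)
    for values in product(*(dimensions[name] for name in names)):
        where = " AND ".join(f"{name} = '{value}'" for name, value in zip(names, values))
        yield f"SELECT sum({metric}), time FROM {table} WHERE {where} GROUP BY time"
```

The number of queries is the product of the value counts per dimension, so even a few dimensions with a few dozen values each produce thousands of queries per monitoring run.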
Multi-dimensional anomaly detection challenges
for d1 in [us, ca, … ]
  for d2 in [chrome, ie, … ]
    …
      SELECT sum(view), time FROM PostView WHERE country = d1 AND browser = d2 AND ... GROUP BY time
1. Identifying issues requires monitoring all possible combinations
2. No Id column (ArticleId, Member Id)
3. Latency is unpredictable even with Inverted Index
   select sum(view) where country = "us": scans 60-70% of the rows (Slow)
   select sum(view) where country = "ireland": scans <1% of the rows (Fast)
Space-Time trade off (axes: Latency vs. Storage)
● Columnar Store: No pre-computation
● Startree Index: Partial pre-computation
● KV Store (Pre-computed): Full Pre-Cube
Anomaly Detection: Druid vs Pinot
[Chart comparing Druid, Pinot (Star tree Index), and Pinot (Inv Index)]
Anomaly Detection Architecture
Components: HDFS, Espresso, UMP (Metric Definition and Compute Logic), ThirdEye, UI (Raptor)
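The middle point of this trade-off, partial pre-computation, can be illustrated with a toy index. This is not Pinot's star-tree implementation; it is a simplified sketch of the underlying idea that only the expensive cases get pre-aggregated: dimension values matching more than `threshold` rows are stored as a single pre-computed sum, and rarer values are answered by scanning their few raw rows. The names `build_partial_index`, `query_sum`, and `threshold` are hypothetical.

```python
from collections import defaultdict


def build_partial_index(rows, dimension, metric, threshold):
    """Pre-aggregate only the dimension values that match more than `threshold` rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)

    pre_aggregated = {}  # value -> pre-computed sum (frequent values)
    raw = {}             # value -> raw rows, scanned at query time (rare values)
    for value, matched in groups.items():
        if len(matched) > threshold:
            pre_aggregated[value] = sum(r[metric] for r in matched)
        else:
            raw[value] = matched
    return pre_aggregated, raw


def query_sum(index, value, metric):
    """Return (sum of metric, number of records read) for one dimension value."""
    pre_aggregated, raw = index
    if value in pre_aggregated:
        return pre_aggregated[value], 1  # one pre-computed record
    matched = raw.get(value, [])
    return sum(r[metric] for r in matched), len(matched)
```

With this layout no query reads more than `threshold` records, so a frequent value such as "us" costs about the same as a rare one such as "ireland", while storage stays far below a full pre-cube.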
Pinot usage
● MarketPlace ✓
● UberPool ✓
● UberFreight ✓
● Jump ✓
● UberEATS ✓
● 50TB, 1000 qps
Conclusion
Activity Data flows through a Stream Processing Engine into a Key Value store and an OLAP Store, which serve Site Facing Applications, Dashboard: Business Analytics, and Anomaly Detection.
Questions
Website: http://pinot.apache.org
Slack: apache-pinot.slack.com
Twitter Handle: @apachepinot, @kishoreBytes