
Building real-time analytics applications using A LinkedIn case - PowerPoint PPT Presentation



  1. Building real-time analytics applications using A LinkedIn case study

  2. Member Job Ad Post Company Course

  3. LinkedIn Activity Data Model Member Job Ad Post Company Course Actor Verb Object
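
The Actor / Verb / Object model on this slide can be sketched as a small record type. This is only an illustration: the class names, field names, and the sample verb "like" are assumptions, not LinkedIn's actual schema; only the six entity types come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Entity types listed on the slide. Everything else below is hypothetical.
ENTITY_TYPES = {"Member", "Job", "Ad", "Post", "Company", "Course"}


@dataclass(frozen=True)
class Entity:
    kind: str  # one of ENTITY_TYPES
    id: int

    def __post_init__(self):
        if self.kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.kind}")


@dataclass(frozen=True)
class ActivityEvent:
    actor: Entity        # who did it
    verb: str            # what they did (assumed free-form string)
    obj: Entity          # what it was done to
    timestamp: datetime  # when it happened


# Example: member 42 likes post 111.
event = ActivityEvent(
    actor=Entity("Member", 42),
    verb="like",
    obj=Entity("Post", 111),
    timestamp=datetime(2019, 1, 1, tzinfo=timezone.utc),
)
```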

  4. Activity Data Scale
     ● Tens of million posts liked/shared per day
     ● 3+ million jobs posted per month
     ● 30 million companies
     ● 610+ million users
     ● Trillions of events/day

  5. Create Generate LifeCycle Analyze What can we do with all the activity data?

  6. Pinot @ LinkedIn

  7. Pinot @ LinkedIn

  8. Who Am I ESPRESSO ThirdEye

  9. Use case 1: Article Analytics

  10. Option 1: Join on the Fly
      Activity Table: Member Id, Action, ArticleId, Time (fed by the App stream: View, Like, Shares, Comment)
      Member Table: Member Id, Industry, Geo, Skills, Company
      Query: SELECT M.industry, count(*) FROM Activity as A INNER JOIN Member as M ON A.memberId = M.memberId WHERE A.articleId=<111> GROUP BY M.industry
      ● Real-time (depending on storage)
      ● High latency
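
To make the join-on-the-fly option concrete, here is a minimal sketch that runs the slide's query against an in-memory SQLite database. SQLite stands in for whatever store holds the two tables; the sample rows and the `activity_by_industry` helper are invented for illustration, and the `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Member (memberId INTEGER PRIMARY KEY, industry TEXT, geo TEXT, company TEXT);
    CREATE TABLE Activity (memberId INTEGER, action TEXT, articleId INTEGER, time INTEGER);
    -- Made-up sample data.
    INSERT INTO Member VALUES (1, 'Software', 'US', 'A'), (2, 'Finance', 'CA', 'B'), (3, 'Software', 'US', 'C');
    INSERT INTO Activity VALUES (1, 'view', 111, 100), (2, 'like', 111, 101),
                                (3, 'view', 111, 102), (1, 'view', 222, 103);
    """
)


def activity_by_industry(conn, article_id):
    """Join Activity with Member at query time and count events per industry."""
    rows = conn.execute(
        """
        SELECT M.industry, count(*)
        FROM Activity AS A
        INNER JOIN Member AS M ON A.memberId = M.memberId
        WHERE A.articleId = ?
        GROUP BY M.industry
        ORDER BY M.industry
        """,
        (article_id,),
    ).fetchall()
    return dict(rows)


result = activity_by_industry(conn, 111)
print(result)
```

Every call repeats the join, which is where the high latency noted on the slide comes from once the tables are large.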

  11. Option 2: Pre Join + Pre Aggregate
      The Activity stream from the App (View, Like, Shares, Comment) goes through a Stream Processing Framework that does Pre Join + Pre Agg, with a Look Up against the Member Table (Member Id, Industry, Geo, Skills, Company)
      Pre-joined table: Article Id, Industry, ..., Action, Time, Count
      Query: SELECT industry, sum(count) FROM PreJoined_Activity_Member WHERE articleId=<111> GROUP BY industry
      ● Near real-time ingestion
      ● Low latency (unpredictable*)
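
A rough sketch of the pre-join + pre-aggregate step, assuming events arrive as `(member_id, action, article_id, timestamp)` tuples and the member table fits in a dictionary. The function names, the hourly time bucket, and the sample data are assumptions made for illustration; a real pipeline would run this logic inside a stream processor rather than over a list.

```python
from collections import Counter

# Hypothetical member lookup table: memberId -> attributes.
MEMBER_TABLE = {
    1: {"industry": "Software"},
    2: {"industry": "Finance"},
    3: {"industry": "Software"},
}


def pre_join_and_aggregate(events, member_table, bucket_seconds=3600):
    """Look up each event's member, then count events per
    (articleId, industry, action, time bucket)."""
    counts = Counter()
    for member_id, action, article_id, ts in events:
        industry = member_table[member_id]["industry"]  # the pre-join (lookup)
        time_bucket = ts - ts % bucket_seconds          # truncate to the bucket start
        counts[(article_id, industry, action, time_bucket)] += 1
    return counts


def count_by_industry(pre_agg, article_id):
    """Equivalent of: SELECT industry, sum(count) ... WHERE articleId = ? GROUP BY industry."""
    totals = Counter()
    for (a_id, industry, _action, _bucket), count in pre_agg.items():
        if a_id == article_id:
            totals[industry] += count
    return dict(totals)


events = [
    (1, "view", 111, 10),
    (3, "view", 111, 20),
    (2, "like", 111, 30),
    (1, "view", 222, 40),
    (1, "view", 111, 3700),
]
pre_agg = pre_join_and_aggregate(events, MEMBER_TABLE)
by_industry = count_by_industry(pre_agg, 111)
```

The query no longer joins anything, but it still sums over however many pre-aggregated rows match the filter, which is one plausible reading of the "unpredictable" latency caveat.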

  12. Option 3: Pre Join + Pre Cube + Pre Agg
      The Activity stream from the App (View, Like, Shares, Comment) goes through a Stream Processing Framework that does the Pre Cube, with a Look Up against the Member Table (Member Id, Industry, Geo, Skills, Company)
      Pre-cubed table: Article Id, Industry, ..., Action, Time, Count
      Query: SELECT industry, sum(count) FROM PreCubed_Activity_Member WHERE articleId=<111> AND company = '*' AND … = '*' GROUP BY industry
      ● Very fast (mostly lookup)
      ● Batch (Hourly/Daily)
      ● Extra storage (Curse of dimensionality)
      ● Re-bootstrap on schema changes
      ● Limited query capability
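
The pre-cube idea, and why it costs extra storage, can be sketched as follows. Each input row is expanded into every combination of "keep the value" or "roll up to `*`" per dimension, so a row with n dimensions produces 2^n cube entries. The dimension names, the sample rows, and `build_cube` are illustrative assumptions, not the schema from the talk.

```python
from collections import Counter
from itertools import product

DIMENSIONS = ("industry", "geo", "company")  # assumed dimension set


def build_cube(rows):
    """Pre-compute sum(count) for every combination of concrete value or '*'
    across DIMENSIONS, keyed by (articleId, industry, geo, company)."""
    cube = Counter()
    for row in rows:
        values = [row[d] for d in DIMENSIONS]
        # mask[i] is True when dimension i is rolled up to '*'.
        for mask in product((False, True), repeat=len(DIMENSIONS)):
            key = (row["articleId"],) + tuple(
                "*" if star else value for star, value in zip(mask, values)
            )
            cube[key] += row["count"]
    return cube


rows = [
    {"articleId": 111, "industry": "Software", "geo": "US", "company": "A", "count": 2},
    {"articleId": 111, "industry": "Software", "geo": "CA", "company": "B", "count": 1},
    {"articleId": 111, "industry": "Finance", "geo": "US", "company": "C", "count": 4},
]
cube = build_cube(rows)

# A query with company = '*' AND geo = '*' becomes a single key lookup.
software_total = cube[(111, "Software", "*", "*")]
```

Adding a dimension doubles the number of entries per row, and changing the dimension set means rebuilding the cube, which matches the storage and re-bootstrap drawbacks listed on the slide.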

  13. Comparison Pinot, Kylin, Presto, Druid, KV Store BigQuery, ElasticSearch, Pinot RedShift InfluxDB Activity Table Pre Join PreAggregation PreCubed Member Latency Flexibility

  14. Publisher Analytics Architecture: Article Activity (View, Like, Shares, Comment) and Member data from Espresso (Member DB) are combined in Samza (Activity + Member data) and loaded into PINOT, which serves Article Analytics; 2 year retention

  15. Can we use the activities data to improve the feed?

  16. Feed Relevance: Rank the feed based on relevance
      01 Identity
         ● Company
         ● Geography
         ● Skills
      02 Content
         ● Views, Likes, Comments
         ● Age
         ● Category
      03 Behaviour
         ● Prior interactions
         ● Interests
         ● Engagement

  17. Feed Ranking Architecture: Article Activity (View, Like, Shares, Comment) is combined in Samza with the Member Table and Article Table from Espresso (Activity + Member data + Article Data) and loaded into PINOT, which serves the Feed Ranker; 30 day retention

  18. Feed Ranking Perf Numbers
      Query: SELECT sum(count) from T WHERE memberId = <> AND article in (list of 1500 items) AND time >= (now - 14 days) GROUP BY action, item, position, time
      QPS: 6400
      Latency: p50 5ms, p90 25ms, p99 45ms, p99.9 100ms
      Significant increase in engagement

  19. Site Facing use case: Pinot vs Druid
      ● Sorted Index
      ● Per query optimizer
      ● Optional indexing

  20. What Business Insights can we generate from this data?

  21. Posts Published: Breakdown By Country

  22. Distribution: By Industry

  23. Views: Breakdown by Referrer

  24. Slice and Dice UI
      ● 1000's of Business Metrics
      ● Trillions of rows

  25. Dashboard Pipeline Architecture (diagram labels): Activity Data; UI (Raptor); HDFS; UMP; Espresso; Member, Company, and Article Data; Metric Definition and Compute Logic

  26. Dashboard use case: Pinot vs Druid
      ● ~5000 random queries of the form
        ○ select sum(views), time from T where country = us, browser = chrome,… group by Date
      ● run sequentially one after the other
                   Pinot        Druid
      Total time   11 minutes   24 minutes
      p50          84 ms        136 ms
      p90          206 ms       667 ms

  27. Anomaly Detection: Why don't we monitor these metrics and alert?

  28. ThirdEye: Anomaly Detection

  29. ThirdEye: Root Cause Analysis
      Dimensions shown: action_type, article_type, author_type, #connections, industry, …., location, verb_type
      Callouts: Domain; Interactive break down; 18 Dimensions; Multiple sub-second queries

  30. Anomaly Detection
      TOP LEVEL: SELECT sum(view), time FROM PostView GROUP BY time
      MULTI DIMENSIONAL: for d1 in [us, ca, … ] for d2 in [chrome, ie, … ] … SELECT sum(view), time FROM PostView WHERE country = d1 AND browser = d2 AND ... GROUP BY time
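
The nested loops on this slide amount to one query per combination of dimension values. A small sketch of that enumeration follows; the `multi_dimensional_queries` helper and the two sample dimensions are assumptions, and the values are interpolated into the SQL text only for readability (a real system should use parameterized queries).

```python
from itertools import product


def multi_dimensional_queries(dimensions, table="PostView", metric="view"):
    """Yield one time-series query per combination of dimension values.

    `dimensions` maps a column name to the list of values to monitor.
    """
    names = list(dimensions)
    for combo in product(*(dimensions[name] for name in names)):
        predicates = " AND ".join(
            f"{name} = '{value}'" for name, value in zip(names, combo)
        )
        yield (
            f"SELECT sum({metric}), time FROM {table} "
            f"WHERE {predicates} GROUP BY time"
        )


dims = {"country": ["us", "ca"], "browser": ["chrome", "ie"]}
queries = list(multi_dimensional_queries(dims))
for q in queries:
    print(q)
```

The number of queries is the product of the value counts per dimension, so it grows quickly as dimensions are added; that growth is what makes per-query latency matter for this use case.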

  31. Multi-dimensional anomaly detection challenges
      1. Identifying issues requires monitoring all possible combinations
      2. No Id column (ArticleId, Member Id)
      3. Latency is unpredictable even with Inverted Index
      MULTI DIMENSIONAL: for d1 in [us, ca, … ] for d2 in [chrome, ie, … ] … SELECT sum(view), time FROM PostView WHERE country = d1 AND browser = d2 AND ... GROUP BY time
      Example: select sum(view) where country="us" scans 60-70% of the rows (Slow); country="ireland" scans <1% of the rows (Fast)

  32. Space-Time trade-off (Latency vs Storage)
      ● Columnar Store: No pre-computation
      ● Startree Index: Partial pre-computation
      ● KV Store (Pre-computed): Full Pre-Cube
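
One way to picture "partial pre-computation" is a threshold rule: pre-aggregate only the dimension values that match many rows, and scan the rest at query time. The sketch below shows that simplified idea for a single dimension. It is not Pinot's star-tree implementation; the class name, the threshold parameter, and the sample data are all assumptions.

```python
from collections import defaultdict


class PartialPreAggIndex:
    """Pre-compute sums only for values matching more than `threshold` rows."""

    def __init__(self, rows, dimension, metric, threshold):
        self.metric = metric
        # value -> matching rows (plays the role of an inverted index).
        self.groups = defaultdict(list)
        for row in rows:
            self.groups[row[dimension]].append(row)
        # Spend storage only where a scan would be expensive.
        self.precomputed = {
            value: sum(r[metric] for r in matching)
            for value, matching in self.groups.items()
            if len(matching) > threshold
        }

    def sum_where(self, value):
        """Return (total, rows_scanned) for `dimension = value`."""
        if value in self.precomputed:
            return self.precomputed[value], 0  # answered by lookup
        matching = self.groups.get(value, [])
        return sum(r[self.metric] for r in matching), len(matching)


# Skewed sample data: 'us' dominates, 'ireland' is rare.
rows = [{"country": "us", "view": 1} for _ in range(7)]
rows.append({"country": "ireland", "view": 5})

index = PartialPreAggIndex(rows, dimension="country", metric="view", threshold=3)
us_total, us_scanned = index.sum_where("us")
ie_total, ie_scanned = index.sum_where("ireland")
```

With this rule no query scans more than `threshold` rows, so the threshold is the knob that trades storage for a latency bound.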

  33. Anomaly Detection: Druid vs Pinot (chart comparing Druid, Pinot with Inv Index, and Pinot with Star tree Index)

  34. Anomaly Detection Architecture (diagram labels): UI (Raptor); HDFS; UMP; Espresso; ThirdEye; Metric Definition and Compute Logic

  35. Pinot usage
      ✓ MarketPlace
      ✓ UberPool
      ✓ UberFreight
      ✓ Jump
      ✓ UberEATS
      50TB, 1000 qps

  36. Conclusion (diagram): Activity Data flows through a Stream Processing Engine into a Key Value store and an OLAP Store, which serve Site Facing Applications, Dashboard: Business Analytics, and Anomaly Detection

  37. Questions Website http://pinot.apache.org Slack apache-pinot.slack.com Twitter Handle @apachepinot, @kishoreBytes
