swimming in the data river


  1. Swimming in the data river. Or, when “streaming analytics” isn’t. Gian Merlino, gian@imply.io

  2. Who am I? Gian Merlino. Committer & PMC member on Apache Druid. Cofounder at Imply. 10 years working on scalable systems.

  3. Agenda
     ● From warehouses to rivers
     ● What can we do with streaming data?
     ● Streaming analytics
     ● Enter the Druid
     ● Do try this at home!

  4. Rolling down the river

  5. Data warehouses. Tightly coupled architecture with limited flexibility. [Diagram: Data Sources (data) → Processing (ETL) → Store and Compute (data warehouse) → Querying (analytics, reporting, data mining)]

  6. Data lakes. Modern data architectures are more application-centric. [Diagram labels: data sources; ETL (MapReduce, Spark); data lake storage; apps (SQL, ML/AI, TSDB)] Confidential. Do not redistribute.

  7. Data rivers. Streaming architectures are true-to-life and enable faster decision cycles. [Diagram labels: data sources; stream hub; storage; ETL (stream processors); apps (streaming analytics, databases, archive to data lake)]

  8. Streaming data. Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/

  9. Streaming data. Analytical or transactional? How does data move between systems? Alerting, charting, observability, APM? Do search and NoSQL need apps? App or set of requirements? Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/ (plus question marks)

  10. Streaming data: direct production. [Diagram labels: app; stream hub library; stream hub]

  11. Streaming data: change data capture. [Diagram labels: app; DB connection; transactional DB; something DB-specific; stream hub]

  12. Streaming data: streaming data pipeline. [Diagram labels: stream hub; Kafka Connect, Kinesis Firehose; EDW or data lake]

  13. Streaming data: microservice communication. [Diagram labels: service #1; stream hub; service #2]

  14. Streaming data

  15. Streaming data. [Diagram labels: stream hub; Spark Streaming, Flink, Storm, Samza, Kafka Streams, Kinesis Data Analytics, Azure Stream Analytics, Google Cloud Dataflow; ?]

  16. Streaming data: real-time actions. [Diagram labels: stream hub; stream processors; alerting, API calls]

  17. Streaming data: data movement. [Diagram labels: stream hub; stream processors; storage (HDFS, cloud storage)]

  18. Streaming data: data movement + enrichment. [Diagram labels: stream hub; stream processors (two stages); continuous query; storage (HDFS, cloud storage)]

  19. Streaming data: continuous query + write to serving layer. [Diagram labels: stream hub; stream processors (two stages); K/V stores (HBase, Cassandra, Redis); realtime dashboard]

  20. Streaming data: continuous query + write to serving layer + unemitted state serving. [Diagram labels: stream hub; stream processors (two stages); K/V stores (HBase, Cassandra, Redis); realtime dashboard; direct state access]

  21. Streaming data: continuous query + write to serving layer + unemitted state serving. [Diagram labels: stream hub; stream processors (two stages); K/V stores (HBase, Cassandra, Redis); realtime dashboard. Annotations: continuous query; access unemitted aggregation state, etc.; direct state access; key lookups; short range scans; ‘precomputation’]
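
The pattern on slides 19 to 21 can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the talk: a plain dict stands in for the K/V store (HBase, Cassandra, Redis), and the class name, the 60-second window, and the event fields `ts` and `page` are all hypothetical. It shows a continuous query that counts events per page and window, writes closed windows to the serving layer, and also serves unemitted state directly.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling-window size


class ContinuousCount:
    """Counts events per (page, window) and emits closed windows to a K/V store."""

    def __init__(self, kv_store):
        self.kv_store = kv_store       # dict standing in for HBase / Cassandra / Redis
        self.state = defaultdict(int)  # unemitted, in-flight aggregation state
        self.newest_window = 0         # start of the newest window seen so far

    def process(self, event):
        window = event["ts"] - event["ts"] % WINDOW_SECONDS
        if window > self.newest_window:
            # A newer window has started: emit everything older than it.
            self.flush(up_to=window)
            self.newest_window = window
        self.state[(event["page"], window)] += 1

    def flush(self, up_to):
        # Write windows that start before `up_to` to the serving layer.
        for key in [k for k in self.state if k[1] < up_to]:
            self.kv_store[key] = self.kv_store.get(key, 0) + self.state.pop(key)

    def lookup(self, page, window):
        # Key lookup in the serving layer plus direct access to unemitted state.
        key = (page, window)
        return self.kv_store.get(key, 0) + self.state.get(key, 0)


kv = {}
query = ContinuousCount(kv)
for event in [
    {"ts": 5, "page": "a"},
    {"ts": 30, "page": "a"},
    {"ts": 61, "page": "a"},
    {"ts": 62, "page": "b"},
]:
    query.process(event)

print(kv)                      # closed windows written to the serving layer
print(query.lookup("a", 60))   # still-open window, answered from unemitted state
```

Note that the only questions this setup can answer are the ones the continuous query was written for: lookups by `(page, window)`. Any other breakdown needs another continuous query, which is the ‘precomputation’ limit the slide annotates.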

  22. [Slide without text]

  23. The problem

  24. The problem

  25. The problem
     ● Slice-and-dice for big data streams
     ● Interactive exploration
     ● Look under the hood of reports and dashboards
     ● And we want our data fresh, too

  26. Challenges
     ● Scale: when data is large, we need a lot of servers
     ● Speed: aiming for sub-second response time
     ● Complexity: too much fine grain to precompute
     ● High dimensionality: 10s or 100s of dimensions
     ● Concurrency: many users and tenants
     ● Freshness: load from streams
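
A rough way to see why high dimensionality defeats precomputation is to count the distinct group-by combinations a slice-and-dice user could ask for. The arithmetic below is an illustration added here, not a figure from the talk; the function names are hypothetical. With d dimensions there are 2^d possible sets of grouping columns, before even considering filters or the number of distinct values per dimension.

```python
from math import comb


def grouping_sets(num_dimensions):
    """Number of distinct sets of group-by columns over the given dimensions."""
    return 2 ** num_dimensions


def grouping_sets_up_to(num_dimensions, max_group_by):
    """Same count, restricted to queries that group by at most `max_group_by` columns."""
    return sum(comb(num_dimensions, k) for k in range(max_group_by + 1))


print(grouping_sets(10))            # 1024
print(grouping_sets_up_to(100, 3))  # 166751
print(grouping_sets(100))           # a 31-digit number
```

Even when queries are limited to three grouping columns, 100 dimensions already give more than 160,000 rollups to maintain, each of which would need its own continuous query in a precomputation design.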

  27. high performance analytics data store for event-driven data

  28. What is Druid?
     ● “high performance”: low query latency, high ingest rates
     ● “analytics”: counting, ranking, groupBy, time trend
     ● “data store”: the cluster stores a copy of your data
     ● “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT

  29. Streaming data. [Diagram labels: stream hub; stream processors (enrichment); analytical database (like Apache Druid!); SQL; apps (Imply Pivot, Apache Superset, Looker, Your App Here™). Annotations: continuous load; aggregation + serving layers merged]

  30. Key features
     ● Column oriented
     ● High concurrency
     ● Scalable to 100s of servers, millions of messages/sec
     ● Continuous, real-time ingest
     ● Indexes on all dimensions by default
     ● Query through SQL
     ● Target query latency sub-second to a few seconds
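
As an illustration of the “Query through SQL” point, the sketch below posts a SQL statement to Druid's HTTP SQL endpoint (`/druid/v2/sql`) using only the Python standard library. The host and port, the `wikipedia` datasource, and the `page` column are assumptions made for the example (they match a typical local tutorial setup, not anything shown in the slides); `__time` is Druid's built-in timestamp column.

```python
import json
import urllib.request

# Assumption: a local Druid router listening on port 8888.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Assumption: a datasource named "wikipedia" with a "page" dimension.
TOP_PAGES_SQL = """
SELECT page, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY page
ORDER BY edits DESC
LIMIT 10
"""


def build_sql_request(sql, url=DRUID_SQL_URL):
    """Builds the POST request; the SQL text goes in the "query" field of a JSON body."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRUID_SQL_URL):
    """Sends the query and returns the decoded JSON result (needs a running cluster)."""
    with urllib.request.urlopen(build_sql_request(sql, url)) as response:
        return json.load(response)
```

With a cluster running and the assumed datasource loaded, `run_query(TOP_PAGES_SQL)` would return the ten most-edited pages of the last hour, which is the kind of ranking and time-filtered query the earlier slides describe.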

  31. Use cases
     ● Clickstreams, user behavior
     ● Digital advertising
     ● Application performance management
     ● Network flows
     ● IoT

  32. Powered by Apache Druid. [Company logos] + many more! Source: http://druid.io/druid-powered.html

  33. Powered by Apache Druid. From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.” Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

  34. Architecture. [Diagram labels: files and streams; indexers (three shown); historicals (three shown); brokers (two shown); queries]

  35. Why this works
     ● Computers are fast these days
     ● Indexes help save work and cost
     ● But don’t be afraid to scan tables: it can be done efficiently
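
The interplay of indexes and scans can be sketched with a toy column store. This is a simplified illustration, not Druid's actual storage format: data is held one list per column, each dimension gets a bitmap index (a Python integer used as a bit set stands in for a compressed bitmap), and a filtered sum can be answered either through the index or by scanning the columns. The class, method, and column names are hypothetical.

```python
from collections import defaultdict


class ColumnTable:
    """Toy column-oriented table with a bitmap index on every dimension."""

    def __init__(self, rows, dimensions, metric):
        self.num_rows = len(rows)
        self.metric = metric
        # Column orientation: one list per column instead of one record per row.
        self.columns = {
            name: [row[name] for row in rows] for name in list(dimensions) + [metric]
        }
        # index[dim][value] is an int whose bit i is set when row i has that value.
        self.index = {dim: defaultdict(int) for dim in dimensions}
        for row_id, row in enumerate(rows):
            for dim in dimensions:
                self.index[dim][row[dim]] |= 1 << row_id

    def sum_where(self, **filters):
        """Index path: AND the bitmaps, then read only the matching metric values."""
        bitmap = (1 << self.num_rows) - 1  # start with every row selected
        for dim, value in filters.items():
            bitmap &= self.index[dim].get(value, 0)
        values = self.columns[self.metric]
        total, row_id = 0, 0
        while bitmap:
            if bitmap & 1:
                total += values[row_id]
            bitmap >>= 1
            row_id += 1
        return total

    def sum_by_scan(self, **filters):
        """Scan path: walk the filter columns and the metric column together."""
        checks = [(self.columns[dim], value) for dim, value in filters.items()]
        return sum(
            metric_value
            for i, metric_value in enumerate(self.columns[self.metric])
            if all(column[i] == value for column, value in checks)
        )


rows = [
    {"country": "US", "device": "mobile", "clicks": 3},
    {"country": "US", "device": "desktop", "clicks": 5},
    {"country": "DE", "device": "mobile", "clicks": 7},
    {"country": "US", "device": "mobile", "clicks": 11},
]
table = ColumnTable(rows, dimensions=["country", "device"], metric="clicks")
print(table.sum_where(country="US", device="mobile"))    # 14
print(table.sum_by_scan(country="US", device="mobile"))  # 14
```

Both paths touch only the columns a query names, which is part of why scanning a column store can stay cheap; the index path additionally skips rows that cannot match.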

  36. Integration patterns

  37. Deployment patterns. [Diagram labels: event streams (millions of events/sec); stream hub; enrichment]
     ● Modern data architecture
     ● Centered around stream hub

  38. Deployment patterns. [Diagram labels: file dumps (hourly, daily); data lake (Hadoop, S3); enrichment (Spark, Hive)]
     ● (Slightly less) modern data architecture
     ● Centered around data lake

  39. [Slide without text]

  40. Download
     ● Apache Druid community site (new): https://druid.apache.org/
     ● Apache Druid community site (legacy): http://druid.io/
     ● Imply distribution: https://imply.io/get-started

  41. Contribute: https://github.com/apache/druid

  42. Stay in touch. Follow the Druid project on Twitter! @druidio. Join the community! http://druid.apache.org/
