Swimming in the data river
Or, when “streaming analytics” isn’t
Gian Merlino · gian@imply.io
Who am I?
● Gian Merlino
● Committer & PMC member on Apache Druid
● Cofounder at Imply
● 10 years working on scalable systems
Agenda
● From warehouses to rivers
● What can we do with streaming data?
● Streaming analytics
● Enter the Druid
● Do try this at home!
Rolling down the river
Data warehouses
Tightly coupled architecture with limited flexibility.
(Diagram: data sources → ETL → data warehouse → analytics, reporting, data mining; stages labelled processing, store and compute, and querying)
Data lakes
Modern data architectures are more application-centric.
(Diagram: data sources → ETL (MapReduce, Spark) → data lake storage → SQL, ML/AI, TSDB, apps)
Data rivers
Streaming architectures are true-to-life and enable faster decision cycles.
(Diagram: data sources → ETL → stream hub → stream processors → streaming analytics, storage, databases, apps; archive to data lake)
Streaming data
(Diagram: streaming ETL pipeline, from the source below)
Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/
Streaming data
● Analytical or transactional?
● How does data move between systems?
● Alerting, charting, observability, APM?
● Do search and NoSQL need apps?
● App or set of requirements?
Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/ (plus question marks)
Streaming data
Direct production: app → stream hub (via the stream hub client library)
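A minimal sketch of direct production, assuming the stream hub is Apache Kafka and the kafka-python client library; the broker address and the "app-events" topic are made up for illustration.

```python
# Sketch of direct production: the app writes events straight to the stream hub.
# Assumes a local Kafka broker and the kafka-python library; "app-events" is a made-up topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Emit one event per user action as it happens in the app.
producer.send("app-events", {"user": "alice", "action": "click", "ts": "2019-05-01T00:00:00Z"})
producer.flush()
```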
Streaming data
Change data capture: app → transactional DB → stream hub (via some DB-specific connection)
Streaming data
Streaming data pipeline: stream hub → Kafka Connect or Kinesis Firehose → EDW or data lake
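As one way this pipeline could be wired up, here is a sketch that registers a sink connector with Kafka Connect's REST API, assuming Connect runs on localhost:8083 and the Confluent S3 sink plugin is installed; the connector name, topic, bucket, and settings are hypothetical.

```python
# Sketch: register a sink connector with Kafka Connect's REST API so the
# "app-events" topic is continuously copied into a data lake bucket.
# Assumes Connect on localhost:8083 and the Confluent S3 sink plugin; all
# names and settings below are hypothetical.
import requests

connector = {
    "name": "app-events-to-s3",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "app-events",
        "s3.bucket.name": "my-data-lake",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()
```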
Streaming data
Microservice communication: service #1 → stream hub → service #2
Streaming data
Stream processors that plug into the stream hub:
● Spark Streaming
● Flink
● Storm
● Samza
● Kafka Streams
● Kinesis Data Analytics
● Azure Stream Analytics
● Google Cloud Dataflow
Streaming data
Real-time actions: stream hub → stream processors → alerting, API calls
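A minimal sketch of the real-time-actions pattern: a small consumer watches the stream and calls out when a threshold is crossed. It assumes Kafka and kafka-python; the topic, field names, and webhook URL are made up for illustration, and a production setup would likely use a full stream processor instead.

```python
# Sketch of real-time actions: watch the stream, fire an alert on a threshold.
# Topic name, field names, and the webhook URL are made up.
import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "server-metrics",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    metric = message.value
    if metric.get("cpu_percent", 0) > 95:
        # Real-time action: call an alerting service as events arrive.
        requests.post("https://alerts.example.com/notify", json={
            "host": metric.get("host"),
            "cpu_percent": metric["cpu_percent"],
        })
```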
Streaming data
Data movement: stream hub → stream processors → storage (HDFS, cloud storage)
Streaming data
Data movement + enrichment: stream hub → stream processors (continuous query) → storage (HDFS, cloud storage)
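To make the enrichment step concrete, here is a minimal sketch that reads raw events, joins in a reference attribute, and writes enriched events downstream. A real deployment would use a stream processor such as Flink or Kafka Streams; this plain consumer/producer loop, and all topic and field names, are stand-ins for illustration.

```python
# Minimal enrichment sketch: read raw events, add a looked-up attribute, and
# write enriched events to a downstream topic. Topic and field names are made up.
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical reference data, e.g. a country lookup keyed by user id.
USER_COUNTRY = {"alice": "US", "bob": "DE"}

consumer = KafkaConsumer(
    "app-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

for message in consumer:
    event = dict(message.value)
    event["country"] = USER_COUNTRY.get(event.get("user"), "unknown")  # enrichment step
    producer.send("app-events-enriched", event)
```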
Streaming data
Continuous query + write to serving layer: stream hub → stream processors (continuous query) → K/V stores (HBase, Cassandra, Redis) → realtime dashboard
Streaming data
Continuous query + write to serving layer + unemitted state serving: as above, plus direct state access from the realtime dashboard to the stream processors
Streaming data
Continuous query + write to serving layer + unemitted state serving:
● Stream processors run continuous query aggregation (‘precomputation’) and write results to K/V stores (HBase, Cassandra, Redis)
● The realtime dashboard reads the serving layer via key lookups and short range scans
● Direct state access lets the dashboard reach unemitted state still held in the stream processors
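A sketch of the ‘precomputation’ idea, assuming Kafka (kafka-python) for the stream and Redis (redis-py) as the K/V serving layer; the topic, key layout, and field names are made up. The processor keeps running counts per minute and country, and a dashboard reads them back with simple key lookups.

```python
# Sketch of 'precomputation': keep running counts per (minute, country) in a
# K/V serving layer that a realtime dashboard reads via key lookups.
# Topic, key layout, and field names are made up.
import json
import redis
from kafka import KafkaConsumer

kv = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer(
    "app-events-enriched",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    minute = event["ts"][:16]  # e.g. "2019-05-01T00:00"
    key = f"clicks:{minute}:{event['country']}"
    kv.incr(key)  # continuously updated aggregate

# A dashboard would then issue key lookups such as:
#   kv.get("clicks:2019-05-01T00:00:US")
```

Note the trade-off the slide is pointing at: only the aggregates you precomputed are queryable, which is what motivates the next section.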
The problem
The problem
● Slice-and-dice for big data streams
● Interactive exploration
● Look under the hood of reports and dashboards
● And we want our data fresh, too
Challenges
● Scale: when data is large, we need a lot of servers
● Speed: aiming for sub-second response times
● Complexity: too fine-grained to precompute
● High dimensionality: 10s or 100s of dimensions
● Concurrency: many users and tenants
● Freshness: load from streams
Apache Druid: a high performance analytics data store for event-driven data
What is Druid?
● “high performance”: low query latency, high ingest rates
● “analytics”: counting, ranking, groupBy, time trends
● “data store”: the cluster stores a copy of your data
● “event-driven data”: fact data like clickstreams, network flows, user behavior, digital marketing, server metrics, IoT
Streaming data
Aggregation + serving layers merged:
● Continuous load from the stream hub into an analytical database (like Apache Druid!)
● Stream processors handle enrichment along the way
● Apps query it over SQL: Imply Pivot, Apache Superset, Looker, Your App Here™
Key features
● Column oriented
● High concurrency
● Scalable to 100s of servers, millions of messages/sec
● Continuous, real-time ingest
● Indexes on all dimensions by default
● Query through SQL (see the sketch below)
● Target query latency: sub-second to a few seconds
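Since Druid is queried through SQL, here is a hedged sketch of issuing a query against a broker's SQL HTTP endpoint (/druid/v2/sql). The broker address, datasource name, and columns are assumptions about a particular deployment, not part of the talk.

```python
# Sketch: ask a Druid broker for a grouped count over fresh streaming data via
# its SQL HTTP endpoint. Broker URL, datasource, and columns are assumptions.
import requests

sql = """
SELECT country, COUNT(*) AS clicks
FROM "app-events"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY clicks DESC
LIMIT 10
"""

response = requests.post(
    "http://localhost:8082/druid/v2/sql",
    json={"query": sql},
)
response.raise_for_status()
for row in response.json():
    print(row["country"], row["clicks"])
```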
Use cases
● Clickstreams, user behavior
● Digital advertising
● Application performance management
● Network flows
● IoT
Powered by Apache Druid
+ many more!
Source: http://druid.io/druid-powered.html
Powered by Apache Druid
From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
Architecture
(Diagram: files and streams are loaded by indexers, handed off to historicals, and queried through brokers)
Why this works
● Computers are fast these days
● Indexes help save work and cost
● But don’t be afraid to scan tables: it can be done efficiently
Integration patterns
Deployment patterns
(Diagram: event streams (millions of events/sec) → stream hub → enrichment → Druid)
● Modern data architecture
● Centered around the stream hub
Deployment patterns
(Diagram: file dumps (hourly, daily) → data lake (Hadoop, S3) → enrichment (Spark, Hive) → Druid)
● (Slightly less) modern data architecture
● Centered around the data lake
Download
● Apache Druid community site (new): https://druid.apache.org/
● Apache Druid community site (legacy): http://druid.io/
● Imply distribution: https://imply.io/get-started
Contribute
https://github.com/apache/druid
Stay in touch
● Follow the Druid project on Twitter! @druidio
● Join the community! http://druid.apache.org/