Swimming in the data river
Or, when “streaming analytics” isn’t
Gian Merlino · gian@imply.io
Who am I?
● Gian Merlino
● Committer & PMC member on Apache Druid
● Cofounder at Imply
● 10 years working on scalable systems
Agenda
● From warehouses to rivers
● What can we do with streaming data?
● Streaming analytics
● Enter the Druid
● Do try this at home!
Rolling down the river
Data warehouses
Tightly coupled architecture with limited flexibility.
(Diagram: data sources → ETL → data warehouse → analytics, reporting, data mining; stages labelled processing, store and compute, and querying)
Data lakes
Modern data architectures are more application-centric.
(Diagram: data sources → ETL (MapReduce, Spark) → data lake storage → SQL, ML/AI, TSDB, apps)
Data rivers
Streaming architectures are true-to-life and enable faster decision cycles.
(Diagram: data sources → ETL → stream hub → stream processors → streaming analytics, storage, databases, apps; archive to data lake)
Streaming data
(Diagram: streaming ETL pipeline, from the source below)
Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/
Streaming data
● Analytical or transactional?
● How does data move between systems?
● Alerting, charting, observability, APM?
● Do search and NoSQL need apps?
● App or set of requirements?
Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/ (plus question marks)
Streaming data
Direct production: app → stream hub (via the stream hub client library)
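A minimal sketch of direct production, assuming the stream hub is Apache Kafka and the kafka-python client library; the broker address and the "app-events" topic are made up for illustration.

```python
# Sketch of direct production: the app writes events straight to the stream hub.
# Assumes a local Kafka broker and the kafka-python library; "app-events" is a made-up topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Emit one event per user action as it happens in the app.
producer.send("app-events", {"user": "alice", "action": "click", "ts": "2019-05-01T00:00:00Z"})
producer.flush()
```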
Streaming data
Change data capture: app → transactional DB → stream hub (via some DB-specific connection)
Streaming data
Streaming data pipeline: stream hub → Kafka Connect or Kinesis Firehose → EDW or data lake
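As one way this pipeline could be wired up, here is a sketch that registers a sink connector with Kafka Connect's REST API, assuming Connect runs on localhost:8083 and the Confluent S3 sink plugin is installed; the connector name, topic, bucket, and settings are hypothetical.

```python
# Sketch: register a sink connector with Kafka Connect's REST API so the
# "app-events" topic is continuously copied into a data lake bucket.
# Assumes Connect on localhost:8083 and the Confluent S3 sink plugin; all
# names and settings below are hypothetical.
import requests

connector = {
    "name": "app-events-to-s3",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "app-events",
        "s3.bucket.name": "my-data-lake",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()
```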
Streaming data
Microservice communication: service #1 → stream hub → service #2
Streaming data
Stream processors that plug into the stream hub:
● Spark Streaming
● Flink
● Storm
● Samza
● Kafka Streams
● Kinesis Data Analytics
● Azure Stream Analytics
● Google Cloud Dataflow
Streaming data
Real-time actions: stream hub → stream processors → alerting, API calls
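A minimal sketch of the real-time-actions pattern: a small consumer watches the stream and calls out when a threshold is crossed. It assumes Kafka and kafka-python; the topic, field names, and webhook URL are made up for illustration, and a production setup would likely use a full stream processor instead.

```python
# Sketch of real-time actions: watch the stream, fire an alert on a threshold.
# Topic name, field names, and the webhook URL are made up.
import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "server-metrics",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    metric = message.value
    if metric.get("cpu_percent", 0) > 95:
        # Real-time action: call an alerting service as events arrive.
        requests.post("https://alerts.example.com/notify", json={
            "host": metric.get("host"),
            "cpu_percent": metric["cpu_percent"],
        })
```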
Streaming data
Data movement: stream hub → stream processors → storage (HDFS, cloud storage)
Streaming data
Data movement + enrichment: stream hub → stream processors (continuous query) → storage (HDFS, cloud storage)
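To make the enrichment step concrete, here is a minimal sketch that reads raw events, joins in a reference attribute, and writes enriched events downstream. A real deployment would use a stream processor such as Flink or Kafka Streams; this plain consumer/producer loop, and all topic and field names, are stand-ins for illustration.

```python
# Minimal enrichment sketch: read raw events, add a looked-up attribute, and
# write enriched events to a downstream topic. Topic and field names are made up.
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical reference data, e.g. a country lookup keyed by user id.
USER_COUNTRY = {"alice": "US", "bob": "DE"}

consumer = KafkaConsumer(
    "app-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

for message in consumer:
    event = dict(message.value)
    event["country"] = USER_COUNTRY.get(event.get("user"), "unknown")  # enrichment step
    producer.send("app-events-enriched", event)
```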
Streaming data
Continuous query + write to serving layer: stream hub → stream processors (continuous query) → K/V stores (HBase, Cassandra, Redis) → realtime dashboard
Streaming data
Continuous query + write to serving layer + unemitted state serving: as above, plus direct state access from the realtime dashboard to the stream processors
Streaming data
Continuous query + write to serving layer + unemitted state serving:
● Stream processors run continuous query aggregation (‘precomputation’) and write results to K/V stores (HBase, Cassandra, Redis)
● The realtime dashboard reads the serving layer via key lookups and short range scans
● Direct state access lets the dashboard reach unemitted state still held in the stream processors
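A sketch of the ‘precomputation’ idea, assuming Kafka (kafka-python) for the stream and Redis (redis-py) as the K/V serving layer; the topic, key layout, and field names are made up. The processor keeps running counts per minute and country, and a dashboard reads them back with simple key lookups.

```python
# Sketch of 'precomputation': keep running counts per (minute, country) in a
# K/V serving layer that a realtime dashboard reads via key lookups.
# Topic, key layout, and field names are made up.
import json
import redis
from kafka import KafkaConsumer

kv = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer(
    "app-events-enriched",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    minute = event["ts"][:16]  # e.g. "2019-05-01T00:00"
    key = f"clicks:{minute}:{event['country']}"
    kv.incr(key)  # continuously updated aggregate

# A dashboard would then issue key lookups such as:
#   kv.get("clicks:2019-05-01T00:00:US")
```

Note the trade-off the slide is pointing at: only the aggregates you precomputed are queryable, which is what motivates the next section.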
The problem
The problem
● Slice-and-dice for big data streams
● Interactive exploration
● Look under the hood of reports and dashboards
● And we want our data fresh, too
Challenges
● Scale: when data is large, we need a lot of servers
● Speed: aiming for sub-second response times
● Complexity: too fine-grained to precompute
● High dimensionality: 10s or 100s of dimensions
● Concurrency: many users and tenants
● Freshness: load from streams
Apache Druid: a high performance analytics data store for event-driven data
What is Druid?
● “high performance”: low query latency, high ingest rates
● “analytics”: counting, ranking, groupBy, time trends
● “data store”: the cluster stores a copy of your data
● “event-driven data”: fact data like clickstreams, network flows, user behavior, digital marketing, server metrics, IoT
Streaming data
Aggregation + serving layers merged:
● Continuous load from the stream hub into an analytical database (like Apache Druid!)
● Stream processors handle enrichment along the way
● Apps query it over SQL: Imply Pivot, Apache Superset, Looker, Your App Here™
Key features
● Column oriented
● High concurrency
● Scalable to 100s of servers, millions of messages/sec
● Continuous, real-time ingest
● Indexes on all dimensions by default
● Query through SQL (see the sketch below)
● Target query latency: sub-second to a few seconds
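Since Druid is queried through SQL, here is a hedged sketch of issuing a query against a broker's SQL HTTP endpoint (/druid/v2/sql). The broker address, datasource name, and columns are assumptions about a particular deployment, not part of the talk.

```python
# Sketch: ask a Druid broker for a grouped count over fresh streaming data via
# its SQL HTTP endpoint. Broker URL, datasource, and columns are assumptions.
import requests

sql = """
SELECT country, COUNT(*) AS clicks
FROM "app-events"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY country
ORDER BY clicks DESC
LIMIT 10
"""

response = requests.post(
    "http://localhost:8082/druid/v2/sql",
    json={"query": sql},
)
response.raise_for_status()
for row in response.json():
    print(row["country"], row["clicks"])
```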
Use cases
● Clickstreams, user behavior
● Digital advertising
● Application performance management
● Network flows
● IoT
Powered by Apache Druid
+ many more!
Source: http://druid.io/druid-powered.html
Powered by Apache Druid
From Yahoo: “The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
Architecture
(Diagram: files and streams are loaded by indexers, handed off to historicals, and queried through brokers)
Why this works
● Computers are fast these days
● Indexes help save work and cost
● But don’t be afraid to scan tables: it can be done efficiently
Integration patterns
Deployment patterns
(Diagram: event streams (millions of events/sec) → stream hub → enrichment → Druid)
● Modern data architecture
● Centered around the stream hub
Deployment patterns
(Diagram: file dumps (hourly, daily) → data lake (Hadoop, S3) → enrichment (Spark, Hive) → Druid)
● (Slightly less) modern data architecture
● Centered around the data lake
Download
● Apache Druid community site (new): https://druid.apache.org/
● Apache Druid community site (legacy): http://druid.io/
● Imply distribution: https://imply.io/get-started
Contribute
https://github.com/apache/druid
Stay in touch
● Follow the Druid project on Twitter! @druidio
● Join the community! http://druid.apache.org/