Architecting Distributed Databases for Failure: A Case Study with Druid
Fangjin Yang, Cofounder @ Imply
Overview
- The Bad
- The Really Bad
- The Catastrophic
- Best Practices: Operations
Everything is going to fail!
Requirements
Scalable
- Tens of thousands of nodes
- Petabytes of raw data
Available
- 24 x 7 x 365 uptime
Performant
- Run as smoothly as possible when things go wrong
Druid
Open source distributed data store
Column-oriented storage of event data
Low-latency OLAP queries & low-latency data ingestion
Initially designed to power a SaaS for online advertising (in AWS)
Our real-world case study
The Bad
Single Server Failures
Common - occurs for every imaginable and unimaginable reason
- Hardware malfunction, kernel panic, network outage, etc.
- Minimal impact
Standard solution: replication
Druid Segments
[Diagram: event rows (Timestamp, Dimensions, Measures) are partitioned by time into segments, e.g. Segment_2015-01-01/2015-01-02, Segment_2015-01-02/2015-01-03, Segment_2015-01-03/2015-01-04]
Partition by time
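To make the partition-by-time idea concrete, here is a minimal Python sketch (not Druid's actual ingestion code; the event shape and one-day segment granularity are assumptions) that buckets event rows into interval-named segments:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def segment_key(timestamp: str) -> str:
    """Bucket an ISO timestamp into a one-day segment interval,
    e.g. 'Segment_2015-01-01/2015-01-02'."""
    day = datetime.fromisoformat(timestamp).date()
    return f"Segment_{day.isoformat()}/{(day + timedelta(days=1)).isoformat()}"

def partition_by_time(events):
    """Group event rows (timestamp, dimensions, measures) into segments by time."""
    segments = defaultdict(list)
    for event in events:
        segments[segment_key(event["timestamp"])].append(event)
    return segments

events = [
    {"timestamp": "2015-01-01T00:00:00", "dimensions": {"page": "a"}, "measures": {"count": 1}},
    {"timestamp": "2015-01-02T05:00:00", "dimensions": {"page": "b"}, "measures": {"count": 3}},
]
print(list(partition_by_time(events)))
# ['Segment_2015-01-01/2015-01-02', 'Segment_2015-01-02/2015-01-03']
```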
Replication Example / Query Segment_1
[Diagram sequence: a client sends queries to Druid Brokers; Segment_1, Segment_2, and Segment_3 are each loaded onto two Druid Historicals, so a query for Segment_1 can be answered by either replica]
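A rough sketch of the routing idea behind this example, assuming a hypothetical replica map and node names (Druid's broker does considerably more than this):

```python
import random

# Hypothetical view of which historicals currently serve which segments.
SEGMENT_REPLICAS = {
    "Segment_2015-01-01/2015-01-02": ["historical-1", "historical-2"],
}

def route_query(segment: str, healthy_nodes: set) -> str:
    """Pick any healthy replica for the segment; with replication, losing one
    historical does not make the segment unavailable."""
    candidates = [n for n in SEGMENT_REPLICAS[segment] if n in healthy_nodes]
    if not candidates:
        raise RuntimeError(f"{segment} has no available replica")
    return random.choice(candidates)

# historical-1 has failed; the query is still served by historical-2.
print(route_query("Segment_2015-01-01/2015-01-02", {"historical-2", "historical-3"}))
```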
Multi-Server Failures
Common: 1 server fails
Less common: >1 server fails
Data center issues (rack failure)
Two strategies:
- Fast recovery
- Multi-datacenter replication
Fast Recovery
Complete data availability in the face of multi-server failures is hard!
Focus on fast recovery instead
Be careful of the pitfalls of fast recovery
More viable in the cloud
Fast Recovery Example
[Diagram sequence: every segment also lives in Deep Storage (S3/HDFS) and is loaded onto Druid Historicals; when historicals fail, the Druid Coordinator tells the surviving historicals to load the lost segments ("Load Segment_1, Segment_3", "Load Segment_2, Segment_3") straight from deep storage until the cluster is fully loaded again]
Dangers of Fast Recovery
Easy to create bottlenecks
- Prioritize how resources are spent during recovery
- Druid prioritizes data availability and throttles replication
Beware query hotspots
- Intelligent load balancing during recovery is important
Fast Recovery Example
[Diagram: without throttling or load balancing, every lost segment is reloaded from Deep Storage (S3/HDFS) onto a single remaining historical, which becomes overloaded]
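A minimal sketch of the recovery policy described above: assign completely unavailable segments first, throttle re-replication, and spread loads across the least-loaded nodes. All names, data structures, and the throttle value are illustrative assumptions, not Druid's actual coordinator code:

```python
# Hypothetical current load (number of segments) per surviving historical.
NODE_LOAD = {"historical-1": 3, "historical-2": 1, "historical-3": 2}

def least_loaded_node(exclude):
    """Pick the historical holding the fewest segments, avoiding hotspots."""
    candidates = {n: load for n, load in NODE_LOAD.items() if n not in exclude}
    return min(candidates, key=candidates.get)

def plan_recovery(cluster_view, replication_factor=2, throttle=5):
    """One coordinator pass: return (segment, target_node) load assignments.

    Availability first: segments with zero live copies are always assigned.
    Re-replication of segments that still have a copy is throttled so that
    recovery traffic cannot saturate the remaining historicals."""
    assignments = []
    replication_budget = throttle
    # Handle completely unavailable segments before under-replicated ones.
    for segment, replicas in sorted(cluster_view.items(), key=lambda kv: len(kv[1])):
        if len(replicas) >= replication_factor:
            continue
        if replicas and replication_budget <= 0:
            continue  # still queryable elsewhere: defer under the throttle
        assignments.append((segment, least_loaded_node(exclude=replicas)))
        if replicas:
            replication_budget -= 1
    return assignments

cluster_view = {
    "Segment_1": [],                 # all copies lost: load immediately
    "Segment_2": ["historical-2"],   # under-replicated: load under the throttle
}
print(plan_recovery(cluster_view))
# [('Segment_1', 'historical-2'), ('Segment_2', 'historical-3')]
```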
The Really Bad
Data Center Outage
Very uncommon
Power loss
Can be extremely disruptive without proper planning
Solution: multi-datacenter replication
Beware pitfalls of multi-datacenter replication
Multi-Datacenter Replication
[Diagram: the client's queries go to Druid Brokers; the Druid Coordinator keeps replicas of Segment_1, Segment_2, and Segment_3 on Druid Historicals in both Data Center 1 and Data Center 2]
Multi-Datacenter Pitfalls
Coordination + leader election can be tricky
Communication can require non-trivial network time
Coordination usually done with heartbeats and quorum decisions
Writes, failovers, & consistent reads require round trips
Multi-Datacenter Replication
[Diagram: a client interacting with Data Center 1 and Data Center 2 across a wide-area link]
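A back-of-the-envelope sketch of why quorum-based coordination across data centers is costly; the latency figures are assumed for illustration, not measured:

```python
# Illustrative (not measured) latencies: a write must be acknowledged by a
# quorum of replicas, and any ack from the remote data center adds a
# cross-DC round trip to the commit path.
LOCAL_RTT_MS = 0.5      # within one data center (assumed)
CROSS_DC_RTT_MS = 40.0  # between data centers (assumed)

def quorum_commit_latency(local_replicas, remote_replicas, quorum):
    """Latency until `quorum` acks arrive, assuming local acks return first."""
    acks = sorted([LOCAL_RTT_MS] * local_replicas + [CROSS_DC_RTT_MS] * remote_replicas)
    return acks[quorum - 1]

# 3 replicas per DC, quorum of 4: at least one ack must cross data centers.
print(quorum_commit_latency(local_replicas=3, remote_replicas=3, quorum=4))  # 40.0
```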
The Catastrophic
“Why are things slow today?”
Poor performance is much worse than things completely failing
Causes:
- Heavy concurrent usage (multi-tenancy)
- Hotspots & variability
- Bad software update
Architecting for Multi-tenancy
Small units of computation
- No single query should starve the cluster
Druid Multi-tenancy
[Diagram: a Druid Historical splits each query into per-segment units of work and interleaves them in its processing order (Segment_query_1, Segment_query_2, Segment_query_1, Segment_query_3, ...), so one large query cannot monopolize the node]
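A simplified sketch of interleaving per-segment work from different queries (hypothetical query and segment names; not Druid's actual query processing code):

```python
from collections import deque

def interleave_segment_work(queries):
    """Break each query into per-segment units and process them round-robin,
    so a query scanning many segments cannot starve a small query."""
    queues = {q: deque(segments) for q, segments in queries.items()}
    order = []
    while queues:
        for query in list(queues):
            order.append((query, queues[query].popleft()))
            if not queues[query]:
                del queues[query]
    return order

queries = {
    "query_1": ["Segment_1", "Segment_2", "Segment_3", "Segment_4"],
    "query_2": ["Segment_1"],
}
print(interleave_segment_work(queries))
# [('query_1', 'Segment_1'), ('query_2', 'Segment_1'), ('query_1', 'Segment_2'), ...]
```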
Architecting for Multi-tenancy
Resource prioritization and isolation
- Not all queries are equal
- Not all users are equal
Druid Multi-tenancy
[Diagram: Druid Historicals are split into tiers - Tier 1 dedicated to older data, Tier 2 dedicated to newer data - and Druid Brokers route each client query to the appropriate tier]
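A minimal sketch of age-based tier routing; the tier names and cutoff date are assumptions for illustration, not Druid's actual rule syntax:

```python
from datetime import date

# Hypothetical tier layout: recent data on a well-provisioned hot tier,
# older data on a cheaper cold tier.
TIER_CUTOFF = date(2015, 1, 1)

def pick_tier(query_interval_start: date) -> str:
    """Route a query to the tier dedicated to the data it touches, so heavy
    scans of old data cannot slow down queries on fresh data."""
    return "tier_hot_recent" if query_interval_start >= TIER_CUTOFF else "tier_cold_older"

print(pick_tier(date(2015, 2, 1)))  # tier_hot_recent
print(pick_tier(date(2014, 6, 1)))  # tier_cold_older
```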
Hotspots
Incredible variability in query performance among nodes
Nodes may become slow but not fail
Difficult to detect as there is nothing obviously wrong
Solutions:
- Hedged requests
- Selective replication
- Latency-induced probation
Hedged Requests
[Diagram sequence: the Druid Broker sends a query for Segment_1 to one historical replica; when it responds slowly, the broker issues the same query to the other replica holding Segment_1 and uses whichever answer comes back first]
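A sketch of the hedged-request pattern using a thread pool; the replica names and simulated latencies are assumptions, and a real broker would cancel or reuse the straggling request rather than simply abandon it:

```python
import concurrent.futures as cf
import time

def query_replica(node: str, segment: str):
    """Stand-in for an HTTP call to a historical (simulated latencies)."""
    simulated_latency = {"historical-1": 2.0, "historical-2": 0.05}  # hotspot vs healthy node
    time.sleep(simulated_latency[node])
    return f"{segment} result from {node}"

def hedged_query(segment: str, replicas, hedge_after_s: float = 0.1):
    """Query the first replica; if it has not answered within hedge_after_s,
    issue the same query to the second replica and take whichever answer
    arrives first."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(query_replica, replicas[0], segment)]
    done, _ = cf.wait(futures, timeout=hedge_after_s)
    if not done:  # first replica is slow: hedge to the next one
        futures.append(pool.submit(query_replica, replicas[1], segment))
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)  # abandon the straggler; its result is discarded
    return next(iter(done)).result()

print(hedged_query("Segment_1", ["historical-1", "historical-2"]))
```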
Minimizing Variability
Selective replication
Latency-induced probation
Great paper: “The Tail at Scale” - https://web.stanford.edu/class/cs240/readings/tail-at-scale.pdf
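One possible shape of latency-induced probation: track recent per-node latencies and temporarily stop routing to nodes that drift well above the cluster norm. The window size, slowness factor, and probation period below are illustrative assumptions:

```python
import time
from collections import defaultdict, deque

class ProbationTracker:
    """Track recent per-node query latencies and put nodes whose latency
    drifts far above the cluster norm on temporary probation (no traffic),
    re-admitting them after a cool-off period."""

    def __init__(self, window=50, slow_factor=3.0, probation_s=60.0):
        self.latencies = defaultdict(lambda: deque(maxlen=window))
        self.probation_until = {}
        self.slow_factor = slow_factor
        self.probation_s = probation_s

    def record(self, node: str, latency_ms: float):
        self.latencies[node].append(latency_ms)
        cluster_avg = self._cluster_average(exclude=node)
        node_avg = sum(self.latencies[node]) / len(self.latencies[node])
        if cluster_avg and node_avg > self.slow_factor * cluster_avg:
            self.probation_until[node] = time.monotonic() + self.probation_s

    def available(self, node: str) -> bool:
        return time.monotonic() >= self.probation_until.get(node, 0.0)

    def _cluster_average(self, exclude: str) -> float:
        samples = [l for n, ls in self.latencies.items() if n != exclude for l in ls]
        return sum(samples) / len(samples) if samples else 0.0

tracker = ProbationTracker()
for _ in range(10):
    tracker.record("historical-1", 20.0)   # healthy
    tracker.record("historical-2", 400.0)  # slow but not failed
print(tracker.available("historical-1"), tracker.available("historical-2"))  # True False
```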
Bad Software Updates
It is very difficult to simulate production traffic
- Testing/staging clusters mostly verify correctness
A bad update may show no noticeable failures for a long time
Common cause of cascading failures
Rolling Upgrades
Be able to update different components with no downtime
Backwards compatibility is extremely important
Roll back if things go bad
Rolling Upgrades
[Diagram sequence: Druid Brokers and Historicals are upgraded one node at a time; the cluster keeps serving client queries with a mix of V1 and V2 nodes until every node runs V2]
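A generic sketch of a rolling upgrade driver; `upgrade`, `health_check`, and `rollback` are placeholders for whatever deployment tooling is in use, and nothing here is Druid-specific:

```python
def rolling_upgrade(nodes, new_version, upgrade, health_check, rollback):
    """Upgrade one node at a time; if any upgraded node fails its health
    check, roll back everything upgraded so far and stop."""
    upgraded = []
    for node in nodes:
        upgrade(node, new_version)
        upgraded.append(node)
        if not health_check(node):
            for bad in reversed(upgraded):  # roll back if things go bad
                rollback(bad)
            return False
    return True

# Example wiring with trivial stand-ins:
ok = rolling_upgrade(
    nodes=["broker-1", "historical-1", "historical-2"],
    new_version="V2",
    upgrade=lambda node, v: print(f"upgrading {node} to {v}"),
    health_check=lambda node: True,  # e.g. probe a status endpoint, run smoke queries
    rollback=lambda node: print(f"rolling {node} back to V1"),
)
print("cluster fully on V2" if ok else "rolled back")
```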
Best Practices: Operations
Monitoring
Detection of when things go badly
Define your critical metrics and acceptable values
Alerts
Alert on critical errors
- Out of disk space, out of cluster capacity, etc.
Design alerts to reduce “noise”
- Distinguish warnings and alerts
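A small sketch of separating warnings from alerts with per-metric thresholds; the metric names and threshold values are hypothetical:

```python
# Hypothetical metrics and thresholds: the same metric produces a low-noise
# warning at one level and a page-someone alert at a more critical level.
THRESHOLDS = {
    "disk_used_pct":    {"warn": 80, "alert": 95},
    "cluster_used_pct": {"warn": 70, "alert": 90},
}

def evaluate(metrics: dict) -> dict:
    """Classify each metric as 'ok', 'warn', or 'alert'."""
    status = {}
    for name, value in metrics.items():
        levels = THRESHOLDS[name]
        if value >= levels["alert"]:
            status[name] = "alert"  # critical: page on-call
        elif value >= levels["warn"]:
            status[name] = "warn"   # informational: review during work hours
        else:
            status[name] = "ok"
    return status

print(evaluate({"disk_used_pct": 97, "cluster_used_pct": 75}))
# {'disk_used_pct': 'alert', 'cluster_used_pct': 'warn'}
```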
Exploratory Analytics
Extremely critical to diagnosing root causes quickly
Not many organizations do this
Takeaways
Everything is going to fail!
- Use replication for single server failures
- Use fast recovery for multi-server failures (when you don’t want to set up another data center)
- Use multi-datacenter replication when availability really matters
- Alerting, monitoring, and exploratory analysis are critical
Thanks!
@implydata @druidio @fangjin
imply.io druid.io