Operating Multi-Tenant Kafka Services for Developers Data Council SF 2019 Ali Hamidi - Heroku Data
Agenda
• Intro
• Motivation
• Single Tenant Dedicated
• Multi-tenancy
• Configuration & Tuning
• Testing
• Automation
• Limitations

Intro
I am… Ali Hamidi, an engineer on the Heroku Data team at Salesforce.
Heroku is… a cloud platform that lets companies build, deliver, monitor and scale apps.
Heroku Data is… the team that provides secure, scalable data services on the Heroku Platform.

Apache Kafka
• Distributed Streaming Platform
• Publish/Subscribe (=> Produce/Consume)
• Durable message store (commit log)
• Highly available

Apache Kafka on Heroku
• Fully Managed Service
• Opinionated
• Configured for best practices for most users*

Use Cases
• Decompose a monolithic app
• Process high volume, real-time data streams
• Power a real-time, event-driven architecture

SHIFT Commerce
• Decompose a monolithic app

Quoine
• QUOINE is a leading global fintech company that provides trading, exchange, and next generation financial services powered by blockchain technology
• Consume real-time cryptocurrency pricing data from individual markets and exchanges

Caesars Entertainment
• Ingest, aggregate, and process customer data in real-time to provide the best customer experience
• Real-time, event-driven architecture

The Motivation

Why Multi-tenant Kafka?
• More accessible
• Additional use cases
    • Development
    • Testing
    • Low volume production

Single Tenant Dedicated

Multi-tenancy

Multi-tenancy
• Resource isolation
    • Security
    • Performance
    • Safety
• Parity
    • Feature
    • Behaviour
• Compatibility
• Costs
    • Resources
    • Operational

Security
A tenant should not be able to access another tenant’s data.

Security
• Access Control Lists (ACLs)
    • User A can carry out action B on resource C
• Namespacing
    • wabash-58779.events

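The combination of ACLs and namespacing above can be sketched as a prefix check: each tenant's topics live under a tenant-specific prefix, and an ACL grants a user a set of operations on resources under that prefix. This is only an illustrative model, not Heroku's implementation or Kafka's authorizer API; `PrefixAcl`, `namespaced`, and `is_allowed` are hypothetical names, and only the `wabash-58779.events` shape comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixAcl:
    """Hypothetical rule: `user` may perform `operations` on resources under `prefix`."""
    user: str
    operations: frozenset
    prefix: str


def namespaced(tenant_prefix: str, name: str) -> str:
    """Build a tenant-scoped resource name, e.g. 'wabash-58779' + 'events'."""
    return f"{tenant_prefix}.{name}"


def is_allowed(acls, user: str, operation: str, resource: str) -> bool:
    """User A can carry out action B on resource C only if some ACL says so."""
    return any(
        acl.user == user
        and operation in acl.operations
        and resource.startswith(acl.prefix)
        for acl in acls
    )
```

Including the trailing dot in the ACL prefix (`wabash-58779.`) matters in this sketch: without it, a tenant whose prefix merely starts with the same characters would also match.
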
Performance
A tenant should not adversely affect another tenant’s performance.

Performance
• Quotas
    • Produce
    • Consume

Safety
A tenant should not jeopardise the stability of the cluster.

Safety
• Limits
    • Topics
    • Partitions
    • Consumer Groups
    • Storage Capacity
    • Throughput
• Monitoring
• Limit enforcement!

Capacity = Message Throughput * Retention * Replication

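As a worked instance of the capacity formula, the sketch below multiplies throughput, retention, and replication to get the storage a tenant could occupy. The 7-day retention and replication factor of 3 match the control plane values shown later in the deck; the 100 KB/s throughput is an assumed figure for illustration, and `required_capacity_bytes` is a hypothetical helper.

```python
def required_capacity_bytes(throughput_bytes_per_sec: int,
                            retention_seconds: int,
                            replication_factor: int) -> int:
    """Capacity = Message Throughput * Retention * Replication."""
    return throughput_bytes_per_sec * retention_seconds * replication_factor


SEVEN_DAYS = 7 * 24 * 3600  # retention in seconds

# Assumed sustained throughput of 100 KB/s, retained 7 days, replicated 3 times.
capacity = required_capacity_bytes(100_000, SEVEN_DAYS, 3)
print(f"{capacity / 1e9:.2f} GB")  # 181.44 GB
```

Even a modest sustained write rate adds up once retention and replication are applied, which is why storage capacity appears among the enforced limits.
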
Parity
For the service to be useful, it needs to behave like a normal cluster.

Parity
• Access to a standard cluster
• ...but with some limitations

Compatibility
The service needs to support standard clients.
No vendor lock-in.

Compatibility
• Open Source Apache Kafka
• Not a fork
• No custom code required
• Use standard clients

Costs
The service needs to be financially feasible.

Resource Costs
• Packing Density
• Utilization
• Cluster size?
• No over-provisioning
• Seamless upgrading
• Can’t move tenants (can’t migrate message offsets)

Operational Costs
• Minimal operational burden
• Minimize impact/blast radius
• Safe defaults
• Similar clusters to our dedicated
• Automation (kind of our thing)
• Testing (lots)

Configuration & Tuning
• Partitions
• Quotas
• Topics & Consumer Groups
• Guard Rails

Partitions
• Lots of partitions
    • 48,000
• Max file descriptors
    • 500,000
• Max mmap count
    • 500,000

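One way to see why these limits need raising: Kafka keeps each partition as a series of log segments, and each segment's offset index and time index are memory-mapped, so the number of mapped regions grows with partitions times segments. The rough estimate below assumes all 48,000 partition replicas sit on one broker and average 5 segments each; both figures are assumptions for illustration, and `estimate_mmap_regions` is a hypothetical helper, not a Kafka or Heroku tool.

```python
def estimate_mmap_regions(partition_replicas: int,
                          segments_per_partition: int,
                          mmapped_files_per_segment: int = 2) -> int:
    """Rough count of memory-mapped index files (offset index + time index per segment)."""
    return partition_replicas * segments_per_partition * mmapped_files_per_segment


MAX_MMAP_COUNT = 500_000  # limit quoted on the slide

# Assumed: 48,000 replicas on one broker, 5 segments each.
estimate = estimate_mmap_regions(48_000, 5)
print(estimate, estimate <= MAX_MMAP_COUNT)  # 480000 True
```

Under these assumptions the estimate lands just under the 500,000 limit, and far above the 65,530 that Linux commonly uses as the default for `vm.max_map_count`.
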
Quotas
• Per Broker!
• Counterintuitive enforcement

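Because Kafka enforces a client's quota separately on each broker, the aggregate rate a tenant can reach depends on how many brokers lead its partitions. The sketch below illustrates that effect with made-up numbers; `max_aggregate_rate`, the broker names, and the 64 KB/s quota are assumptions for illustration and do not reflect Heroku's actual quota values.

```python
def max_aggregate_rate(quota_bytes_per_sec_per_broker: int,
                       leader_counts_by_broker: dict) -> int:
    """Upper bound on a tenant's total rate when the quota applies per broker.

    Only brokers that lead at least one of the tenant's partitions serve
    its produce/consume traffic, so only those contribute to the bound.
    """
    brokers_in_use = sum(1 for count in leader_counts_by_broker.values() if count > 0)
    return quota_bytes_per_sec_per_broker * brokers_in_use


# Same quota, different partition placement, different effective ceiling.
spread = max_aggregate_rate(64_000, {"broker-0": 2, "broker-1": 1, "broker-2": 0})
packed = max_aggregate_rate(64_000, {"broker-0": 3, "broker-1": 0, "broker-2": 0})
print(spread, packed)  # 128000 64000
```

The same configured quota therefore yields a different ceiling for a single-partition topic than for one spread across the cluster, which is one plausible reading of why the slide calls enforcement counterintuitive.
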
Topics & Consumer Groups
• Explicit Topic creation
• Explicit Consumer Group creation

Guard Rails
• Limit potential bad usage
• “Customers don’t make mistakes, we make bad tools”

# Heroku Data Control Plane
min_retention_time = 24.hours
max_retention_time = 7.days
default_replication_factor = 3

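A guard rail like the retention bounds above could be applied when a tenant requests a topic. The sketch below clamps the requested retention into the allowed range and converts it to milliseconds, the unit used by Kafka's `retention.ms` topic setting. Whether the real control plane clamps or rejects out-of-range requests is not stated in the slides, so clamping is an assumption here, and `clamp_retention` and `to_retention_ms` are hypothetical helpers.

```python
from datetime import timedelta

# Values taken from the control plane settings on the slide.
MIN_RETENTION = timedelta(hours=24)
MAX_RETENTION = timedelta(days=7)
DEFAULT_REPLICATION_FACTOR = 3


def clamp_retention(requested: timedelta) -> timedelta:
    """Assumed behaviour: pull an out-of-range request back inside the bounds."""
    return max(MIN_RETENTION, min(requested, MAX_RETENTION))


def to_retention_ms(retention: timedelta) -> int:
    """Convert to milliseconds, the unit of Kafka's retention.ms topic config."""
    return int(retention.total_seconds() * 1000)


print(to_retention_ms(clamp_retention(timedelta(days=30))))  # 604800000
```
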