Multi-Tenancy & Isolation Bogdan Munteanu - Dropbox
Overview • What is Edgestore? • Workloads & API • Multi-tenancy & Isolation • Lessons Learned
What is Edgestore • Distributed Metadata Store built on top of MySQL • Highly Available, Scalable, Durable • Abstract away sharding and caching • Reduce operational burden • Flexible schemas • Multi-Region Setup
Architecture
Architecture cont’d • 2048 Shards • 8 Shards per Engine (and MySQL cluster) • 1 Master - 2 Slaves (semi-sync) • Multi-region setup
MYSQL vs EDGESTORE

MySQL tables:

Team table:
  Id | Company | Size
  1  | Expedia | 5000
  2  | NatGeo  | 500
  3  | Intuit  | 2000
  4  | Spotify | 600

User table:
  Id | Email   | Name  | Type
  1  | jondoe@ | Jon   | Free
  2  | jenny@  | Jenny | Pro

Edgestore Edgedata (Edge type | Gid | Data):
  Team         | 10:1 | Company:Expedia; Size:5000
  Team         | 20:1 | Company:NatGeo; Size:500
  Team         | 30:7 | Company:Intuit; Size:2000
  Team         | 35:3 | Company:Spotify; Size:600
  User         | 15:1 | Email:jondoe@; Name:Jon; Type:Free
  User         | 20:2 | Email:jenny@; Name:Jenny; Type:Pro
  Photo Entity | ?    | Name:SF.jpg; Size:64
  Photo Entity | ?    | Name:Hawaii; Size:64
  Photo Entity | ?    | Name:Tahoe.jpg; Size:128
  Photo Entity | ?    | Name:Office.jpg; Size:1024
Shard the table

Shard 1:
  Team | 10:1 | Company:Expedia; Size:5000
  Team | 20:4 | Company:NatGeo; Size:500
Shard 2:
  Team | 30:1 | Company:Intuit; Size:2000
  User | 40:2 | Email:jondoe@; Name:Jon; Type:Free
Shard n:
  Team | 50:2 | Company:Spotify; Size:600
  User | 60:1 | Email:jenny@; Name:Jenny; Type:Pro
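The id-to-shard routing above can be sketched as follows. This is an illustration only, not Edgestore's actual routing logic: I assume a gid whose shard is derived by simple modulo over the 2048 shards, and Engines each owning 8 contiguous shards (the counts come from the architecture slide; the modulo scheme is my assumption).

```go
package main

import "fmt"

const (
	totalShards     = 2048 // from the architecture slide
	shardsPerEngine = 8    // 8 shards per Engine (and MySQL cluster)
)

// shardFor maps a global id to one of the 2048 logical shards.
// Modulo is an assumption for illustration; the real mapping may
// embed the shard id directly in the gid.
func shardFor(gid uint64) uint64 { return gid % totalShards }

// engineFor maps a logical shard to the Engine (and MySQL cluster)
// that owns it, assuming contiguous ranges of 8 shards per Engine.
func engineFor(shard uint64) uint64 { return shard / shardsPerEngine }

func main() {
	for _, gid := range []uint64{10, 30, 50} {
		s := shardFor(gid)
		fmt.Printf("gid %d -> shard %d -> engine %d\n", gid, s, engineFor(s))
	}
}
```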
Restricted API • Create/Update/Delete • single and batch • Compare and Set semantics • Reads: • Read(Id, ) • List(Id, *) • Count(Id, *) • List(Id, condition=[equals, prefix, range]) • ReadLog(Id) • ListLog(Id, *) • Acquire Read/Write Lock • Commit/Rollback • Strong consistency semantics
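The Compare-and-Set semantics mentioned above can be sketched with a toy in-memory store. All names here (Store, Read, Update, the version counter) are hypothetical; the real Edgestore API is far richer (batches, locks, List/Count, log reads) and this is only meant to show why CAS matters for concurrent writers.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Store is a toy in-memory key-value store with versioned writes.
type Store struct {
	mu   sync.Mutex
	data map[uint64]string // gid -> serialized attributes
	ver  map[uint64]uint64 // gid -> version used for Compare-and-Set
}

var ErrConflict = errors.New("compare-and-set conflict: stale version")

func NewStore() *Store {
	return &Store{data: map[uint64]string{}, ver: map[uint64]uint64{}}
}

// Read returns the current value and the version to use in a later Update.
func (s *Store) Read(id uint64) (string, uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[id], s.ver[id]
}

// Update succeeds only if the caller's version matches the stored one:
// concurrent writers with a stale view fail loudly instead of silently
// clobbering each other's writes.
func (s *Store) Update(id uint64, val string, expectVer uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ver[id] != expectVer {
		return ErrConflict
	}
	s.data[id] = val
	s.ver[id]++
	return nil
}

func main() {
	s := NewStore()
	_, v := s.Read(1)
	fmt.Println(s.Update(1, "Company:Expedia;Size:5000", v)) // <nil>
	fmt.Println(s.Update(1, "stale write", v))               // conflict error
}
```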
Workloads • 10 million QPS • 600k Writes / second • 9.4 million Reads / second • 90% of Reads are cache hits • 1.5 million QPS to the Engine fleet
Workloads cont’d • Batch sizes from 1 to 10,000 • Some read requests return 1 row • Some return 100,000 rows • Rows range from a few bytes to several MB • 500+ unique Schemas
Engine • Proto -> SQL Query • Query Result -> Proto • Connection Pooling • Control / Reduce load to MySQL
Workloads cont’d • High QPS • Write / Read • Large / expensive requests: • Write - large transactions • Read - large number of rows, or large rows • Multi-Read / Multi-Write
Single Request - 1 token [Engine diagram: the Request Handler acquires 1 token from the Resource Pool per request]
Batch (parallel) Request - n tokens [Engine diagram: the Request Handler fans out one goroutine per Id (Id1, Id2, Id3), each taking a token from the Resource Pool]
Batch (sequential) Request - n tokens [Engine diagram: the Request Handler walks the batch in chunks (Id 1 - Id 10, Id 11 - Id 20, Id 21 - Id 30) against the Resource Pool]
More Isolation breakdowns • Type of Traffic: • Live traffic: Front Ends - user traffic, sync-related traffic • Offline traffic: Scripts / Async processing / Offline processing • Type of Request: • Write (Insert, Delete, Update, Create Ids, Acquire Read/Write Locks) • Read (Single read, multi read, list, count, listLog)
Layer Resource Pools [Engine diagram: the Request Handler routes to four pools: Write Live Resource Pool, Read Live Resource Pool, Write Offline Resource Pool, Read Offline Resource Pool]
Breakdown by tenant • What is a tenant? • Source Machine Tag (e.g. front-end) • Source ServiceName (e.g. FileSync) • Source Schema (e.g. Team) • Source Handler (e.g. Thumbnail generator) • Source Script (e.g. backfill-albums)
Examples • “frontend:www:TeamEvent” • “async-worker:async_task_wrapper:Contacts” • “email:emailservice.py:UserEmailEvent” • “taskrunner-node- quota:update_team_usage.py:User”
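Judging from the example keys above, a tenant identifier composes the source dimensions from the previous slide with `:` separators. A minimal sketch; the function name and parameter names are assumptions, only the key shape comes from the slide.

```go
package main

import (
	"fmt"
	"strings"
)

// tenantKey composes a tenant identifier from the source dimensions on
// the slide: machine tag, service/script/handler name, and schema. The
// exact composition rule is inferred from the example keys shown.
func tenantKey(machineTag, source, schema string) string {
	return strings.Join([]string{machineTag, source, schema}, ":")
}

func main() {
	fmt.Println(tenantKey("frontend", "www", "TeamEvent"))
	fmt.Println(tenantKey("email", "emailservice.py", "UserEmailEvent"))
}
```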
[Diagram] Engine resources: CPU, Memory, Network, Storage / Disk IO. MySQL resources: Threads connected, Threads running, CPU / Disk IO, Semi-sync.
Resources • QPS is not a good metric, as requests vary considerably • # Connections used (mapping to tokens in the resource pool) • Better: connections used * time • A 200-connection pool = 200 * 60 = 12,000 connection-seconds / min: • 1 connection held for the whole minute = 60 connection-seconds / min • 60 connections held for 1 second = 60 connection-seconds / min
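The connection-seconds accounting above can be written out as a tiny integration over per-second samples. A sketch of the arithmetic from the slide, not Edgestore's actual accounting code; the sampling granularity and function name are assumptions.

```go
package main

import "fmt"

// connSeconds integrates connection usage over a window: sample i says
// how many connections a tenant held during second i. The sum is the
// tenant's connection-seconds for the window.
func connSeconds(perSecondConns []int) int {
	total := 0
	for _, c := range perSecondConns {
		total += c
	}
	return total
}

func main() {
	// Pool capacity over one minute: 200 connections * 60 s = 12,000 conn-seconds.
	fmt.Println(200 * 60) // 12000

	// 1 connection held every second for a minute = 60 conn-seconds / min.
	steady := make([]int, 60)
	for i := range steady {
		steady[i] = 1
	}
	fmt.Println(connSeconds(steady)) // 60

	// 60 connections held for a single second = 60 conn-seconds / min.
	burst := make([]int, 60)
	burst[0] = 60
	fmt.Println(connSeconds(burst)) // 60
}
```

Both usage patterns cost the same 60 connection-seconds even though their peak connection counts differ by 60x, which is exactly why this metric is fairer than QPS or peak connections.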
Write Live - 1 minute snapshot
  Tenant                       | ConnSec Used | Connections | Errors
  frontend:rpc:User            | 20%          | 5           | 0
  frontend:www:FileId          | 3%           | 90          | 0
  taskrunner:growth:team_quota | 0.5%         | 4           | 0
  email:UserEmail              | 1%           | 1           | 0
  Total                        | 24.5%        | 100         | 0
[Chart slides: per-tenant usage, Percentage (0-100) over Time (10:00 - 10:05)]
Throttle mechanism • Auto-throttle heuristics based on history of resource usage per tenant • No predefined quota • Steady state usage by tenant varies wildly 0.001% - 20% • Triggering event -> find “bad” tenant -> decide how much to throttle them -> throttle “bad” tenant • Disabled the auto-throttling mechanism • We have learned a lot
Timer [Diagram: a timer (1-9) tracks phases over the Engine Conn: Start -> Acquire Lock -> Read -> Write -> Commit]
Resources • Used Time -> Execution Time • Bytes In/Out
Write Live - 1 minute snapshot
  Tenant                       | Used  | Execution | MB Read | Conns | Errors
  frontend:rpc:User            | 20%   | 1%        | 1       | 5     | 0
  frontend:www:FileId          | 3%    | 3%        | 30      | 90    | 0
  taskrunner:growth:team_quota | 0.5%  | 0.5%      | 5       | 4     | 0
  email:UserEmail              | 1%    | 0.5%      | 4       | 1     | 0
  Total                        | 24.5% | 5%        | 40      | 100   | 0
Layer: write_live, NumTenants: 360 Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0 Throttled Tenants: [] Period 1: | Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants | 6.48% | 60.15% | 2.58% | 34057 | 0 | 19 | Aggregated stats Top 5 Sources sorted by Used: | 0.79% | 94.11% | 0.36% | 423 | 0 | 0 | offline:bluemail:Email | 0.76% | 93.68% | 0.20% | 437 | 0 | 0 | frontend:rpc:UserEntity | 0.45% | 19.92% | 0.08% | 4922 | 0 | 4 | cape-sfj:cape_dispatcher:CursorEntity | 0.42% | 52.50% | 0.07% | 2783 | 0 | 0 | filejournal:fj_server_bin:FileID | 0.36% | 93.80% | 0.02% | 252 | 0 | 0 | frontend:www:ActivityEntity Layer: write_live, NumTenants: 360 Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0 Throttled Tenants: [] Period 2: | Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants | 100% | 60.15% | 52.58% | 34057 | 20000 | 19 | Aggregated stats Top 5 Sources sorted by Used: | 93.79% | 0.11% | 50.36% | 600 | 300 | 0 | offline:bluemail:Email | 0.76% | 93.68% | 0.20% | 437 | 254 | 0 | frontend:rpc:UserEntity | 0.45% | 19.92% | 0.08% | 4922 | 1293 | 4 | cape-sfj:cape_dispatcher:CursorEntity | 0.42% | 52.50% | 0.07% | 2783 | 2913 | 0 | filejournal:fj_server_bin:FileID | 0.36% | 93.80% | 0.02% | 252 | 23 | 0 | frontend:www:ActivityEntity
edgestore_throttle --tenant=offline:bluemail:Email --tokens=30 --host=abc-de-fg --layer=write_live Layer: write_live, NumTenants: 360 Throttle Controls: State: throttled, TokensPrimaryPool: 270, TokensThrottledPool: 30 Throttled Tenants: [offline:bluemail:Email] Period 3: | Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants | 16.20% | 60.15% | 7.58% | 34057 | 1900 | 19 | Aggregated stats Top 5 Sources sorted by Used: | 10.79% | 0.11% | 5.36% | 600 | 1900 | 0 | offline:bluemail:Email | 0.76% | 93.68% | 0.20% | 437 | 0 | 0 | frontend:rpc:UserEntity | 0.45% | 19.92% | 0.08% | 4922 | 0 | 4 | cape-sfj:cape_dispatcher:CursorEntity | 0.42% | 52.50% | 0.07% | 2783 | 0 | 0 | filejournal:fj_server_bin:FileID | 0.36% | 93.80% | 0.02% | 252 | 0 | 0 | frontend:www:ActivityEntity
Impact • Reduce MTTR • Availability event: • 1. Detection • 2. Investigation • 3. Containment • 4. Short term fix • 5. Long term fix
Findings • Expensive queries • Abusable APIs • Query optimizer • Inconsistencies • Insufficient documentation • Bugs • Perf optimization
Lessons Learned
• Throttle mechanism:
  • There is such a thing as automating too soon: we moved from auto-throttle heuristics to a manual throttle tool (Query / Throttle / Unthrottle), plus an aggregate tool that queries and filters all engines, to isolate the error and limit the blast radius while investigating, root-causing and fixing the underlying problem. There was a time when we shut down scripts manually, not knowing who was causing the problem.
  • Silently throttling is bad
  • Throttling should be a temporary state
  • Not having pre-defined quotas works
• 1 deployment to rule them all works: it surfaced issues with the API, bugs, poorly documented clients, and best practices
• Future work (in progress): multiple isolation breakdowns (by user, by table, by tenant, by request type (Read/Write), by traffic type (Live vs Offline))
What’s next • Control Plane “brain” • continuously query all Engines • automatically throttle tenants when the system is degraded • detect trends • Per logical micro-shard (and per Id) granularity for throttling
Credits • Zviad Metreveli • Rati Gelashvili • Robert Verkuil • Alex Degtiar • Jonathan Lee