Taking Storage for a Ride René W. Schmidt, Storage Platform NOVEMBER 4, 2016
About me
Uber for almost 3 years, scaling out our storage infrastructure across the planet. VMware for 10+ years; part of the team that released Virtual Center 1.0 and many vSphere releases since then. Sun Microsystems for 4 years; part of the team that shipped the Java HotSpot Virtual Machine 1.0 and Java Web Start 1.0.
Problem Statement
[Diagram: Marketplace (all ongoing trips) and Backend services (billing, payouts, user accounts, trip histories, fraud, etc.) producing operational data that feeds the Data Warehouse]
Not so long ago…
[Diagram: Marketplace and Backend services]
We were there
Storage bottleneck (as of early 2014)
[Diagram: a single datastore holding Trips, Users, and Misc data; pain points: latency, scalability, availability, development agility]
The new world…
[Diagram: Marketplace and Backend services on top of the Schemaless Storage System]
The Schemaless Storage System
[Architecture diagram, three layers:]
- Developer abstraction: a service layer (Trips, Billing, Ratings, …), each service with its own datastores (Trips, Receipts, Ratings, …)
- Self-service, self-healing operational infrastructure: Schemaless instances making up the Schemaless Storage System
- Physical layout: instances span zones within datacenters across regions (US-West, US-East, CN-East, CN-West)
Status after 2 years in production
More than 80% of Uber's operational data is in Schemaless. From a single datastore (trips) to 300+ datastores. From 48 MySQL hosts to many thousands of MySQL instances.
Schemaless Architecture
Requirements
- API & features (make developers efficient and happy)
- Scalability and efficiency (QPS, capacity, $/GB, trust in operation)
- Availability (four 9's, zero-downtime operations, hide failures)
- Time to market
Easy to replace Postgres
- SQL-like secondary indexes
- Fast queries

select uuid from trips
where user_uuid = ? and status = ?
  and request_at > ? and request_at < ?
Support batch operations
[Example trip record with columns BASE, ROUTE, FARE, BILLING, PAYOUT:]
{ trip_uuid: …, rider_uuid: …, driver_uuid: …, gps_points: […], payment_info: …, driver_rating: …, client_rating: …, receipt: …, payout: … }
Microservices
[Diagram: Trips datastore] 1000+ services (most are stateless). Each service can request its own storage.
Durability
Scalability & Reliability
[Diagram: scale figures — 128 → 512, 2 TB → 8 TB, 1 PB, 1 PB w/ redundancy]
Ledger-style API
Tracking of real-life interactions:
- put_cell(uuid, column_key, ts, data)
- get_cell(uuid, column_key, ts)
- get_cell_latest(uuid, column_key)

[Table: rows keyed by UUID (12AB, F4CD, …) with columns BASE, ROUTE, FARE, RATING; each cell is a JSON dict, versioned by ts]

Simple, proven, schemaless data model. Append-only: each cell can only be written once.
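The API above can be sketched as a tiny in-memory model. This is not Uber's implementation, only an illustration of the semantics; the method names mirror the slide, and the class name is made up.

```python
from collections import defaultdict

class SchemalessSketch:
    """In-memory sketch of the ledger-style cell API (illustrative only)."""

    def __init__(self):
        # (row_uuid, column_key) -> {ts: json_dict}
        self.cells = defaultdict(dict)

    def put_cell(self, uuid, column_key, ts, data):
        # Append-only: each (uuid, column, ts) cell can only be written once.
        key = (uuid, column_key)
        if ts in self.cells[key]:
            raise ValueError("cell already written")
        self.cells[key][ts] = data

    def get_cell(self, uuid, column_key, ts):
        # Fetch one specific version of a cell.
        return self.cells[(uuid, column_key)].get(ts)

    def get_cell_latest(self, uuid, column_key):
        # Fetch the highest-timestamped version of a cell.
        versions = self.cells[(uuid, column_key)]
        return versions[max(versions)] if versions else None
```

Because cells are immutable once written, every historical version stays readable, which is what makes the ledger usable for auditing real-life interactions.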
Physical storage layout
- Logical model: row UUID (e.g. A007) with JSON-dict cells
- Distribution layer: sharding function, hash of the UUID % 4096
- Fixed set of shards (0 … 4095) mapped onto an expandable set of MySQL clusters
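A minimal sketch of this two-level mapping, assuming an MD5-based hash and a contiguous range assignment of shards to clusters (both details are assumptions; the slide only states "sharding function % 4096"):

```python
import hashlib

NUM_SHARDS = 4096  # fixed set of shards, as on the slide

def shard_for(row_uuid: str) -> int:
    # Hash the row UUID and reduce it modulo the fixed shard count.
    digest = hashlib.md5(row_uuid.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def cluster_for(shard: int, clusters: list) -> str:
    # Shards map onto an expandable set of MySQL clusters; a contiguous
    # range assignment is used here purely for illustration.
    shards_per_cluster = NUM_SHARDS // len(clusters)
    return clusters[min(shard // shards_per_cluster, len(clusters) - 1)]
```

Keeping the shard count fixed means a row's shard never changes; growing capacity only changes which cluster a shard lives on, which is what makes the cluster splits described later possible.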
Efficient indexes
Index definition (defined on columns):
  name: CLIENT_INDEX
  column: BASE
  fields:
    - name: client_id
    - name: fare
- Scalable: partitioned across shards
- Fast queries: only need to query a single shard
- Can be added / removed dynamically

Example: put_cell(100, 'BASE', { client_id: 10, driver_id: 437, fare: 10 }) and put_cell(121, 'BASE', { client_id: 10, driver_id: 217, fare: 15 }) each write a cell plus an index entry onto a shard.
The duality of Schemaless: Log and Key-Value Store
- Internally organized as an ordered log (append-only datastore)
- B-Tree index for (row, col, ts) lookups
- Efficient scanning for changes over time
[Diagram: put_cells land as recent inserts at the head of each shard's log (shards 0–3) over time]
Data driven triggers
partition_read(partition, columns, offset_vector) -> cells

All backend processing is triggered by data being written:
- (BASE, ROUTE) -> FARE
- (BASE, FARE) -> CLIENT_BILLING
- (BASE, FARE) -> DRIVER_PAYOUT

Functional programming paradigm. Robust in case of failures. Eliminates out-of-band message queues.
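A trigger consumer built on partition_read might look like the sketch below: read every cell written after a stored offset, hand each to a processing function, and persist the new offset. The function and the fake reader are illustrative, not the real API surface.

```python
def drain_partition(partition, columns, offset_vector, read_fn, handler):
    """One pass of a data-driven trigger (illustrative sketch).

    read_fn stands in for partition_read(partition, columns, offset_vector);
    handler is the backend step, e.g. (BASE, ROUTE) -> compute FARE.
    """
    cells, new_offset = read_fn(partition, columns, offset_vector)
    for cell in cells:
        handler(cell)
    # Persist new_offset durably so processing resumes after a crash;
    # re-processing is safe because downstream writes are append-only cells.
    return new_offset
```

Running this in a loop per partition replaces an out-of-band message queue: the shard-local log itself is the queue, and the offset vector is the consumer's position in it.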
Schemaless features
• Bigtable-style API for storing JSON dictionaries
• Horizontally scalable in both IO and capacity
• Append-only to track real-time interactions
• Fast secondary indexes
• Batch processing using triggers / partition_read API
• No downtime for changing schemas or indexes
• Built on a solid MySQL foundation
Schemaless Availability
"The difference between a high-quality product and a low-quality product is how well it works when stressed"
What happens when a database dies?
[Diagram: put_cell flows through the Distribution Layer to a master, which replicates to its slaves; two master/slaves clusters shown]
What happens when a database dies?
Slave failure: replace the slave.
[Diagram: one slave marked failed; put_cell continues through the Distribution Layer to the master and remaining slaves]
What happens when a database dies?
Master failure: hinted handoff. A slave is promoted to master. While promotion is in flight, two options:
1) Fail the write (fine for batch)
2) Buffer the write and retry later
Consistency guarantees
Log and retry: commutative operations make it simple.
put_cell(row, column, ts, json, buffered=true)
The trigger API hides the buffering from the application programmer.
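Why retries are safe here: each cell is written exactly once (append-only), and writes to distinct cells commute, so blindly replaying a buffered write cannot corrupt anything. A sketch of the client side, with made-up names and a simple exponential backoff standing in for the real buffering machinery:

```python
import time

def put_cell_buffered(put_cell, uuid, column_key, ts, data,
                      retries=5, backoff=0.01):
    """Illustrative sketch of a buffered write during master failover.

    Safe to retry because the cell (uuid, column_key, ts) is written at
    most once and writes to different cells commute.
    """
    for attempt in range(retries):
        try:
            put_cell(uuid, column_key, ts, data)
            return True
        except ConnectionError:
            # Master may be mid-promotion; back off and try again.
            time.sleep(backoff * 2 ** attempt)
    return False  # give up for now; a durable buffer would replay later
```

In Schemaless itself the application never writes this loop: buffered=true plus the trigger API hide it.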
Handling growth: database cluster splits
[Diagram: one cluster holding shards 1–4, receiving lots of writes]
Handling growth: database cluster splits
Set up a read-only shadow cluster in the background.
[Diagram: shards 1–4 mirrored onto the shadow cluster]
Handling growth: database cluster splits
Make the shadow cluster writable, then delete the extra shards from each cluster.
[Diagram: shards 1–2 remain on the original cluster, shards 3–4 on the new one]
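Since the shadow cluster already holds a full copy of the data, the split itself is only a routing change: half the shard range is repointed at the new cluster. A sketch of that map update (the even/odd-free "second half moves" policy is an illustrative choice, not the documented one):

```python
def split_shard_map(shard_map, old, new):
    """Repoint half of `old`'s shards at `new` (illustrative sketch).

    shard_map: dict of shard number -> cluster name. The shadow cluster
    `new` already replicates everything; after the map change, each
    cluster deletes the shards it no longer owns.
    """
    mine = sorted(s for s, c in shard_map.items() if c == old)
    for s in mine[len(mine) // 2:]:  # move the upper half of the range
        shard_map[s] = new
    return shard_map
```

Because rows hash to a fixed shard, no row ever moves between shards; only the shard-to-cluster assignment changes, so clients just refresh the map.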
Key operations
How fast can we add a MySQL slave? How fast can a slave be promoted to master?
Physical restore limits
The network is not infinitely fast. With a 10 Gbps NIC: ~3.6 TB in 1 hour, or ~512 GB in 30 min at 25% network saturation. The typical restore SLA is less than 30 minutes.
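The slide's figures are back-of-envelope NIC arithmetic; assuming decimal units, 3.6 TB/hour corresponds to roughly 80% sustained utilization of a 10 Gbps link, and the 30-minute figure to 25%:

```python
GBPS = 1e9 / 8  # bytes per second per Gbps (decimal units)

nic = 10 * GBPS                       # 10 Gbps NIC: 1.25 GB/s raw
at_80pct = nic * 0.80 * 3600 / 1e12   # TB moved in 1 hour at 80% -> 3.6
at_25pct = nic * 0.25 * 1800 / 1e9    # GB moved in 30 min at 25% -> 562.5
```

562.5 GB at 25% saturation lines up with the slide's "512 GB in 30 min" figure once protocol and disk overheads are accounted for, which is why a sub-30-minute restore SLA forces parallelism rather than a single stream.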
Partition data within hosts
Split an 8 TB host into chunks and restore the chunks in parallel.
Operating Storage at Scale
The Schemaless Storage System (recap)
[Architecture diagram shown again: service layer with per-service datastores, self-healing infrastructure of Schemaless instances, spanning zones within datacenters across regions US-West, US-East, CN-East, CN-West]
Do more with less
[Example manual runbook: a 1236-word setup and a 1557-word execution]
What gets hard with scale? Drift, drift, and drift
• Host and rack failures
• Upgrading the OS or MySQL across all boxes
• Performance tuning and debugging
• Somebody ran manual commands on a host (!)
• Creating new instances & indexes
• Manual steps applied inconsistently
• Running out of disk space
• Capacity planning
Pets vs Cattle
"Pets":
- Unique names ("fluffy", "biscuit", etc.)
- You know their address by heart
- Nurtured when they become ill
"Cattle":
- Enumerated names (cow0235)
- Arbitrary address
- Replaced when they become ill
So what does that mean?
Pets world view / Cattle world view:
- The desired state is in your head / The desired state is codified
- Driven by you / Driven by software
- Making changes is cool / Making changes is a non-event
- Operating directly on hosts / Changing the model
- Runbooks / Autonomous
- Brittle / Robust
- Operation oriented / Goal-state oriented
What is a goal state…?
• Less than 80% disk space used
• Hosts use Linux kernel version X
• At least one database is backed up in each cluster
• A cluster has the desired number of slaves
• An instance has X clusters
• Instance X exists
The goal-state engine
[Diagram: operator input defines the Goal State; the system continuously compares it against the Actual State, evaluates the current drift, and takes an action]
Idempotent, robust, restartable, continuous, self-healing, simple.
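The evaluate-drift-then-act cycle can be sketched as a single reconciliation pass; running it in a loop gives the continuous, restartable behavior. The data shapes and action callback here are illustrative, not Uber's actual engine:

```python
def reconcile(goal_state, actual_state, apply_action):
    """One pass of a goal-state engine (illustrative sketch).

    goal_state / actual_state: dicts of item name -> desired/observed
    properties. Every item whose actual state differs from its goal
    produces one corrective action. Actions must be idempotent, so the
    pass is safe to restart or repeat at any time.
    """
    drift = {name: goal for name, goal in goal_state.items()
             if actual_state.get(name) != goal}
    for name, goal in drift.items():
        apply_action(name, goal)  # e.g. set role, flip read-only flag
    return drift
```

The key property is that the engine never tracks "what I did last time": it only compares desired against observed, which is why it is robust to crashes, partial failures, and operators changing the goal underneath it.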
Example: Goal-State engine on a host
[A per-host Opsless Docker Agent pulls the goal state and reports the actual state]

DB       | Goal role | Sync_From | Actual role | Read-only | Issues
foo-db1  | master    | -         | master      | no        | none
bar-db2  | slave     | bar-db1   | slave       | yes       | none
baz-db10 | slave     | baz-db9   | slave       | yes       | none
Let's promote foo-db2 to be the new master

DB      | Host   | Goal role | Sync_From | Actual role | Read-only | Issues
foo-db1 | Host A | master    | -         | master      | no        | none
foo-db2 | Host B | slave     | foo-db1   | slave       | yes       | none

[Each host runs an Opsless Docker Agent holding its goal state: foo-db1 {role: master}, foo-db2 {role: slave, …}]
We just have to change the goal-state

DB      | Host   | Goal role | Sync_From | Actual role | Read-only | Issues
foo-db1 | Host A | idle      | -         | master      | no        | Wrong role
foo-db2 | Host B | master    | -         | slave       | yes       | Wrong role

[The agents observe the "Wrong role" drift and converge each host to its new goal]
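In this model, promoting a database is nothing but an edit to the goal-state data; no operator touches a host. A sketch of that edit, using dicts shaped like the slide's goal-state labels (the helper name and "idle" retirement role follow the slide; the dict layout is mine):

```python
# Before: foo-db1 is master, foo-db2 replicates from it (as on the slide).
goal = {
    "foo-db1": {"role": "master", "sync_from": None},
    "foo-db2": {"role": "slave", "sync_from": "foo-db1"},
}

def promote(goal, new_master, old_master):
    """Promotion is only a goal-state change (illustrative sketch).

    The per-host agents detect the resulting 'Wrong role' drift and
    reconfigure MySQL on their own; nothing here touches a host.
    """
    goal[new_master] = {"role": "master", "sync_from": None}
    goal[old_master] = {"role": "idle", "sync_from": None}
    return goal
```

Usage: promote(goal, "foo-db2", "foo-db1") yields exactly the goal state in the table above, and the agents do the rest.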