
  1. Taking Storage for a Ride. René W. Schmidt, Storage Platform. November 4, 2016.

  2. About me: Uber for almost 3 years, working on scaling out our storage infrastructure across the planet. VMware for 10+ years; part of the team that released Virtual Center 1.0, and many vSphere releases since then. Sun Microsystems for 4 years; part of the team that shipped the Java HotSpot Virtual Machine 1.0 and Java Web Start 1.0.

  3. Problem Statement

  4. [Diagram] Operational data: the Marketplace (all ongoing trips) and the Backend (billing, payouts, user accounts). The Data Warehouse holds trip histories, fraud data, etc.

  5. Not so long ago… [Diagram: a single Marketplace talking to a single Backend]

  6. We were there

  7. Storage bottleneck, as of early 2014: trips, users, and miscellaneous data all funneled into one store, hurting latency, scalability, availability, and development agility.

  8. The new world… [Diagram: Marketplace and Backend sitting on top of the Schemaless storage system]

  9. The Schemaless storage system, top to bottom:
  • Service layer: Trips, Billing, Ratings, … services
  • Developer abstraction: datastores (Trips, Receipts, Ratings, …)
  • Self-service, operational, self-healing infrastructure: Schemaless instances
  • Instances span zones; zones make up datacenters; datacenters make up regions (US-West, US-East, CN-East, CN-West)

  10. Status after 2 years in production: more than 80% of Uber's operational data is in Schemaless; from a single datastore (trips) to 300+ datastores; from 48 MySQL hosts to many thousands of MySQL instances.

  11. Schemaless Architecture

  12. Requirements:
  • API & features (make developers efficient and happy)
  • Scalability and efficiency (QPS, capacity, $/GB, trust in operation)
  • Availability (four 9's, zero-downtime operations, hide failures)
  • Time to market

  13. Easy to replace Postgres: SQL-like secondary indexes and fast queries, e.g. select uuid from trips where user_uuid = ? and status = ? and request_at > ? and request_at < ?

  14. Support batch operations. A trip is one row whose columns (BASE, ROUTE, FARE, BILLING, PAYOUT) hold JSON fields such as trip_uuid, rider_uuid, driver_uuid, gps_points: […], payment_info, driver_rating, client_rating, receipt, and payout.

  15. Microservices: 1000+ services (most are stateless), e.g. Trips with its datastore. Each service can request its own storage.

  16. Durability

  17. Scalability & reliability: 512 hosts × 2 TB, or 128 hosts × 8 TB, is 1 PB of data (1 PB with redundancy).

  18. Ledger-style API for tracking real-life interactions. Rows are UUIDs (12AB, F4CD, …) and columns (BASE, ROUTE, FARE, RATING) hold versioned JSON-dict cells ({ json dict …, ts: 0 }, { json dict …, ts: 2 }). The calls:
  put_cell(uuid, column_key, ts, data)
  get_cell(uuid, column_key, ts)
  get_cell_latest(uuid, column_key)
  A simple, proven, schemaless data model. Append-only: each cell can only be written once.
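
To make those calls concrete, here is a minimal in-memory sketch of the cell API in Python; the CellStore class is illustrative, not Uber's implementation, but it captures the append-only contract that a (uuid, column_key, ts) cell is written exactly once.

```python
class CellStore:
    """Toy in-memory model of the append-only cell API (illustrative)."""

    def __init__(self):
        self._cells = {}  # (row_uuid, column_key, ts) -> JSON-like dict

    def put_cell(self, uuid, column_key, ts, data):
        key = (uuid, column_key, ts)
        if key in self._cells:
            raise ValueError("append-only: cell %r already written" % (key,))
        self._cells[key] = data

    def get_cell(self, uuid, column_key, ts):
        return self._cells[(uuid, column_key, ts)]

    def get_cell_latest(self, uuid, column_key):
        latest = max(ts for (u, c, ts) in self._cells
                     if (u, c) == (uuid, column_key))
        return self._cells[(uuid, column_key, latest)]

store = CellStore()
store.put_cell("12AB", "BASE", 0, {"status": "requested"})
store.put_cell("12AB", "BASE", 1, {"status": "completed"})
assert store.get_cell_latest("12AB", "BASE")["status"] == "completed"
```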

  19. Physical storage layout. Logical model: row A007 with JSON-dict cells. Distribution layer: a sharding function (% 4096) maps each row to one of a fixed set of 4096 shards, which are spread across an expandable set of MySQL clusters.
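
A hedged sketch of that two-level mapping: the hash function (MD5 here) and the shard-to-cluster assignment are assumptions, since the slide only fixes the shard count at 4096.

```python
import hashlib

NUM_SHARDS = 4096      # fixed set of shards, per the slide
NUM_CLUSTERS = 4       # illustrative; the cluster set is expandable

def shard_for(row_uuid):
    """Sharding function: hash the row UUID onto one of 4096 shards.
    (MD5 is an assumption; the slide only specifies '% 4096'.)"""
    return int(hashlib.md5(row_uuid.encode()).hexdigest(), 16) % NUM_SHARDS

def cluster_for(shard, num_clusters=NUM_CLUSTERS):
    """Shard-to-cluster assignment. Growing capacity means moving whole
    shards onto new clusters; rows are never re-hashed."""
    return shard % num_clusters

s = shard_for("A007")
print("row A007 -> shard %d -> cluster %d" % (s, cluster_for(s)))
```

Fixing the shard count up front is the key design choice: capacity grows by relocating shards, so the row-to-shard mapping never changes.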

  20. Efficient indexes. An index is defined on columns, e.g. name: CLIENT_INDEX, column: BASE, fields: client_id, fare. Scalable: partitioned across shards. Fast queries: a lookup only needs to touch a single shard. Indexes can be added and removed dynamically. Writing a cell, e.g. put_cell(100, 'BASE', { client_id: 10, driver_id: 437, fare: 10 }) or put_cell(121, 'BASE', { client_id: 10, driver_id: 217, fare: 15 }), also writes the matching index entries on their shards.
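
The mechanics might look like the following sketch; the dict layout of the definition and the index_entry helper are hypothetical, but they show why a query on the leading field lands on a single shard.

```python
NUM_SHARDS = 4096

# Hypothetical shape of an index definition; names mirror the slide.
CLIENT_INDEX = {
    "name": "CLIENT_INDEX",
    "column": "BASE",                # defined on one column's JSON
    "fields": ["client_id", "fare"]  # indexed fields inside that JSON
}

def index_entry(index, row_key, column_key, data):
    """On put_cell, derive the entry this write adds to the index.
    Partitioning by the leading field keeps a query such as
    'all trips for client_id = 10' on a single shard."""
    if column_key != index["column"]:
        return None
    values = tuple(data[f] for f in index["fields"])
    shard = hash(values[0]) % NUM_SHARDS
    return (shard, values, row_key)

# Both cells below share client_id = 10, so their index entries
# land on the same shard.
print(index_entry(CLIENT_INDEX, 100, "BASE",
                  {"client_id": 10, "driver_id": 437, "fare": 10}))
print(index_entry(CLIENT_INDEX, 121, "BASE",
                  {"client_id": 10, "driver_id": 217, "fare": 15}))
```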

  21. The duality of Schemaless: log and key-value store. Internally organized as an ordered log (append-only datastore) with a B-tree index for (row, col, ts) lookups, so recent inserts sit at the head of each shard and scanning for changes over time is efficient.

  22. Data-driven triggers: partition_read(partition, columns, offset_vector) returns cells. All backend processing is triggered by data being written: (BASE, ROUTE) -> FARE; (BASE, FARE) -> CLIENT_BILLING; (BASE, FARE) -> DRIVER_PAYOUT. A functional-programming paradigm: robust in case of failures, and it eliminates out-of-band message queues. (A consumer sketch follows below.)
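
A minimal consumer sketch of that trigger model. Only the partition_read signature comes from the slide; the stub body, the Cell shape, and compute_fare are illustrative assumptions.

```python
from collections import namedtuple

Cell = namedtuple("Cell", "row_uuid column_key ts shard offset data")

def partition_read(partition, columns, offset_vector):
    """Stand-in for the Schemaless call on the slide; a real client
    returns cells written after the offsets in offset_vector."""
    return []   # illustrative: pretend there are no new cells

def compute_fare(base, route):
    """Illustrative placeholder for the (BASE, ROUTE) -> FARE function."""
    return {"fare": 10.0}

def run_fare_trigger(partition, offsets, put_cell, batches=1):
    for _ in range(batches):            # a real trigger loops forever
        by_row = {}
        for cell in partition_read(partition, ["BASE", "ROUTE"], offsets):
            by_row.setdefault(cell.row_uuid, {})[cell.column_key] = cell
            offsets[cell.shard] = cell.offset   # resume point on restart
        for row_uuid, cols in by_row.items():
            if "BASE" in cols and "ROUTE" in cols:
                # Pure function of its inputs: after a crash, re-reading
                # from the last offsets recomputes the exact same cell.
                fare = compute_fare(cols["BASE"].data, cols["ROUTE"].data)
                put_cell(row_uuid, "FARE", 0, fare)

run_fare_trigger("trips-partition-0", {}, lambda *args: None)
```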

  23. Schemaless features
  • Bigtable-style API for storing JSON dictionaries
  • Horizontally scalable in both IO and capacity
  • Append-only, to track real-life interactions
  • Fast secondary indexes
  • Batch processing using triggers / the partition_read API
  • No downtime for changing schemas or indexes
  • Built on a solid MySQL foundation

  24. Schemaless Availability

  25. “The difference between a high-quality product and a low-quality product is how well it works when stressed”

  26. What happens when a database dies? [Diagram: put_cell flows through the distribution layer to a master, which replicates to its slaves; a second master with slaves sits alongside]

  27. What happens when a database dies? Slave failure: replace it. [Diagram: one slave is crossed out; the master keeps serving put_cell while a replacement slave is brought up]

  28. What happens when a database dies? Master failure: hinted handoff. A slave is promoted to master; while the promotion is in flight there are two options: 1) fail the write (fine for batch), or 2) buffer the write and retry later.

  29. Consistency guarantees: log and retry. Commutative operations make it simple: put_cell(row, column, ts, json, buffered=true). The trigger API hides the buffering from the application programmer.
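
A sketch of the log-and-retry path, assuming a hypothetical send_to_master transport; the queue-based buffer is illustrative. Because cells are immutable and keyed by (row, column, ts), replays commute: draining the buffer in any order converges to the same state.

```python
import queue

write_buffer = queue.Queue()

def send_to_master(row, column, ts, json_data):
    """Hypothetical transport to the current master; raises
    ConnectionError while a failover is in flight."""
    pass

def put_cell(row, column, ts, json_data, buffered=False):
    try:
        send_to_master(row, column, ts, json_data)
    except ConnectionError:
        if not buffered:
            raise                       # option 1: fail the write
        write_buffer.put((row, column, ts, json_data))  # option 2: buffer

def drain_buffer():
    # Each cell is written at most once under its (row, column, ts)
    # key, so replaying buffered writes, in any order, any number of
    # times, converges to the same state: the operations commute.
    while not write_buffer.empty():
        send_to_master(*write_buffer.get())

put_cell("12AB", "BASE", 0, {"status": "requested"}, buffered=True)
drain_buffer()
```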

  30. Handling growth: database cluster splits. [Diagram: one cluster holding shards 1-4 is taking lots of writes]

  31. Handling growth: database cluster splits. Set up a read-only shadow cluster in the background, replicating shards 1-4.

  32. Handling growth: database cluster splits. Make the shadow cluster writable, then delete the extra shards from each cluster (sketched below).
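
The three slides above, as a hedged simulation; the real split also has to replicate data and wait for replication catch-up before the cutover, which this toy Cluster class skips.

```python
class Cluster:
    def __init__(self, name, shards, writable=True):
        self.name, self.shards, self.writable = name, list(shards), writable

def split_cluster(source, new_name):
    # 1. Stand up a read-only shadow cluster that replicates every
    #    shard in the background (replication itself is elided here).
    shadow = Cluster(new_name, source.shards, writable=False)
    # 2. Once caught up, make the shadow writable and move half the
    #    shards' traffic onto it.
    moved = source.shards[len(source.shards) // 2:]
    shadow.writable = True
    # 3. Delete the extra shards each cluster no longer owns.
    shadow.shards = moved
    source.shards = source.shards[:len(source.shards) // 2]
    return shadow

a = Cluster("cluster-a", [1, 2, 3, 4])
b = split_cluster(a, "cluster-b")
print(a.shards, b.shards)   # [1, 2] [3, 4]
```

Because the row-to-shard mapping is fixed, a split never moves individual rows; it only changes which cluster owns which shards.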

  33. Key operations How fast can we add a MySQL slave? How fast can a slave DB be promoted to a master DB?

  34. Physical restore limits: the network is not infinitely fast. A 10 Gbps NIC moves at most 1.25 GB/s, so roughly 3.6 TB in 1 hour, and only about 512 GB in 30 min at 25% network saturation. The typical restore SLA is less than 30 minutes.

  35. Partition data within hosts: split an 8 TB host into chunks and restore the chunks in parallel (see the sketch below).
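
A hedged sketch of the parallel restore; the chunk count and the restore_chunk stand-in are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

PARTITIONS = 16   # e.g. 16 x 512 GB chunks instead of 1 x 8 TB stream

def restore_chunk(partition_id):
    """Stand-in for copying one partition's backup from a peer host."""
    return partition_id

def restore_host():
    # Each chunk can stream from a different source host, so the
    # bottleneck becomes the restoring host's NIC rather than a
    # single sender's disk or uplink.
    with ThreadPoolExecutor(max_workers=PARTITIONS) as pool:
        return list(pool.map(restore_chunk, range(PARTITIONS)))

print(len(restore_host()), "chunks restored")
```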

  36. Operating Storage at Scale

  37. The Schemaless storage system (the architecture from slide 9 again): the service layer on top of datastore abstractions, on top of self-service, self-healing Schemaless instances that span zones, datacenters, and regions (US-West, US-East, CN-East, CN-West).

  38. Do more with less

  39. 1236 words of setup; 1557 words of execution.

  40. What gets hard with scale? Drift, drift, and drift.
  • Host and rack failures
  • Upgrading the OS or MySQL across all boxes
  • Performance tuning and debugging
  • Somebody ran manual commands on a host (!)
  • Creating new instances & indexes
  • Manual steps applied inconsistently
  • Running out of disk space
  • Capacity planning

  41. Pets vs. cattle.
  Pets: unique names (“fluffy”, “biscuit”, etc.); you know their address by heart; nurtured when they become ill.
  Cattle: enumerated names (cow0235); arbitrary addresses; replaced when they become ill.

  42. So what does that mean?
  Pets world view: the desired state is in your head; driven by you; making changes is cool; operations are made directly on hosts; runbooks; brittle; operation oriented.
  Cattle world view: the desired state is codified; driven by software; making changes is a non-event; you change the model; autonomous; robust; goal-state oriented.

  43. What is a goal state…?
  • Less than 80% disk space used
  • Hosts use Linux kernel version X
  • At least one database is backed up in each cluster
  • A cluster has the desired number of slaves
  • An instance has X clusters
  • Instance X exists

  44. The goal-state engine. Operator input sets the goal state; the system reports its actual state; the engine evaluates the current drift between the two and takes action. Idempotent, robust, restartable, continuous, self-healing, simple. (A reconciliation-loop sketch follows below.)
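
A minimal reconciliation-loop sketch under the stated properties; evaluate, converge, and the flat key/value goal format are illustrative, not the engine's actual interface.

```python
import time

def evaluate(goal, actual):
    """Drift = every property whose actual value differs from the goal."""
    return {k: v for k, v in goal.items() if actual.get(k) != v}

def converge(goal, read_actual_state, apply_action, poll_secs=30, rounds=1):
    # Idempotent and restartable: every pass re-reads reality, computes
    # drift, and acts; a crash loses nothing because no state lives
    # outside goal + actual.
    for _ in range(rounds):             # a real engine loops forever
        actual = read_actual_state()
        for key, wanted in evaluate(goal, actual).items():
            apply_action(key, wanted)
        time.sleep(poll_secs)

goal = {"role": "master", "read_only": False}
state = {"role": "slave", "read_only": True}
converge(goal, lambda: dict(state),
         lambda k, v: state.__setitem__(k, v), poll_secs=0, rounds=2)
assert state == goal
```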

  45. Example: goal-state engine on a host. An Opsless Docker agent on each host pulls the goal state and updates the actual state.
  Goal state: foo-db1 master (sync from: none); bar-db2 slave of bar-db1; baz-db10 slave of baz-db9.
  Actual state: foo-db1 role master, read-only no, issues none; bar-db2 role slave, read-only yes, issues none; baz-db10 role slave, read-only yes, issues none.

  46. Let’s promote foo-db2 to be the new master. Current goal state: foo-db1 (Host A) master; foo-db2 (Host B) slave, syncing from foo-db1. The actual state matches: foo-db1 is master (read-only no), foo-db2 is slave (read-only yes), no issues. Each host’s Opsless Docker agent holds its goal state, e.g. Goalstate: { foo-db1: {role: master}} on Host 1 and Goalstate: { foo-db2: { role: slave, …}} on Host 2.

  47. We just have to change the goal state. The new goal makes foo-db1 idle and foo-db2 master; the actual state still shows foo-db1 as master and foo-db2 as a read-only slave, so both hosts report the issue “Wrong role” until their agents converge on the new roles.
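
Expressed as data, the promotion is just a goal-state edit; the dict keys mirror the slide's Goalstate snippets, but the promote helper is hypothetical tooling, not the actual interface.

```python
goal_state = {
    "foo-db1": {"role": "master"},
    "foo-db2": {"role": "slave", "sync_from": "foo-db1"},
}

def promote(goal, new_master, old_master):
    goal[old_master] = {"role": "idle"}     # as on the slide
    goal[new_master] = {"role": "master"}
    # No commands run here: each host's agent sees the "Wrong role"
    # drift on its next pass and converges on its own.

promote(goal_state, "foo-db2", "foo-db1")
print(goal_state)
```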
