Taking Storage for a Ride René W. Schmidt, Storage Platform NOVEMBER 4, 2016
About me
Uber for almost 3 years, scaling out our storage infrastructure across the planet. VMware for 10+ years; part of the team that released Virtual Center 1.0 and many vSphere releases since then. Sun Microsystems for 4 years; part of the team that shipped the Java HotSpot Virtual Machine 1.0 and Java Web Start 1.0.
Problem Statement
[Diagram: Marketplace (all ongoing trips) and Backend services (billing, payouts, user accounts, trip histories, fraud, etc.) producing operational data that feeds the Data Warehouse]
Not so long ago…
[Diagram: Marketplace and Backend services]
We were there
Storage bottleneck (as of early 2014)
[Diagram: a single datastore holding Trips, Users, and Misc data; pain points: latency, scalability, availability, development agility]
The new world…
[Diagram: Marketplace and Backend services on top of the Schemaless Storage System]
The Schemaless Storage System
[Architecture diagram, three layers:]
- Developer abstraction: a service layer (Trips, Billing, Ratings, …), each service with its own datastores (Trips, Receipts, Ratings, …)
- Self-service, self-healing operational infrastructure: Schemaless instances making up the Schemaless Storage System
- Physical layout: instances span zones within datacenters across regions (US-West, US-East, CN-East, CN-West)
Status after 2 years in production
More than 80% of Uber's operational data is in Schemaless. From a single datastore (trips) to 300+ datastores. From 48 MySQL hosts to many thousands of MySQL instances.
Schemaless Architecture
Requirements
- API & features (make developers efficient and happy)
- Scalability and efficiency (QPS, capacity, $/GB, trust in operation)
- Availability (four 9's, zero-downtime operations, hide failures)
- Time to market
Easy to replace Postgres
- SQL-like secondary indexes
- Fast queries

select uuid from trips
where user_uuid = ? and status = ?
  and request_at > ? and request_at < ?
Support batch operations
[Example trip record with columns BASE, ROUTE, FARE, BILLING, PAYOUT:]
{ trip_uuid: …, rider_uuid: …, driver_uuid: …, gps_points: […], payment_info: …, driver_rating: …, client_rating: …, receipt: …, payout: … }
Microservices
[Diagram: Trips datastore] 1000+ services (most are stateless). Each service can request its own storage.
Durability
Scalability & Reliability
[Diagram: scale figures — 128 → 512, 2 TB → 8 TB, 1 PB, 1 PB w/ redundancy]
Ledger-style API
Tracking of real-life interactions:
- put_cell(uuid, column_key, ts, data)
- get_cell(uuid, column_key, ts)
- get_cell_latest(uuid, column_key)

[Table: rows keyed by UUID (12AB, F4CD, …) with columns BASE, ROUTE, FARE, RATING; each cell is a JSON dict, versioned by ts]

Simple, proven, schemaless data model. Append-only: each cell can only be written once.
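The API above can be sketched as a tiny in-memory model. This is not Uber's implementation, only an illustration of the semantics; the method names mirror the slide, and the class name is made up.

```python
from collections import defaultdict

class SchemalessSketch:
    """In-memory sketch of the ledger-style cell API (illustrative only)."""

    def __init__(self):
        # (row_uuid, column_key) -> {ts: json_dict}
        self.cells = defaultdict(dict)

    def put_cell(self, uuid, column_key, ts, data):
        # Append-only: each (uuid, column, ts) cell can only be written once.
        key = (uuid, column_key)
        if ts in self.cells[key]:
            raise ValueError("cell already written")
        self.cells[key][ts] = data

    def get_cell(self, uuid, column_key, ts):
        # Fetch one specific version of a cell.
        return self.cells[(uuid, column_key)].get(ts)

    def get_cell_latest(self, uuid, column_key):
        # Fetch the highest-timestamped version of a cell.
        versions = self.cells[(uuid, column_key)]
        return versions[max(versions)] if versions else None
```

Because cells are immutable once written, every historical version stays readable, which is what makes the ledger usable for auditing real-life interactions.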
Physical storage layout
- Logical model: row UUID (e.g. A007) with JSON-dict cells
- Distribution layer: sharding function, hash of the UUID % 4096
- Fixed set of shards (0 … 4095) mapped onto an expandable set of MySQL clusters
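A minimal sketch of this two-level mapping, assuming an MD5-based hash and a contiguous range assignment of shards to clusters (both details are assumptions; the slide only states "sharding function % 4096"):

```python
import hashlib

NUM_SHARDS = 4096  # fixed set of shards, as on the slide

def shard_for(row_uuid: str) -> int:
    # Hash the row UUID and reduce it modulo the fixed shard count.
    digest = hashlib.md5(row_uuid.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def cluster_for(shard: int, clusters: list) -> str:
    # Shards map onto an expandable set of MySQL clusters; a contiguous
    # range assignment is used here purely for illustration.
    shards_per_cluster = NUM_SHARDS // len(clusters)
    return clusters[min(shard // shards_per_cluster, len(clusters) - 1)]
```

Keeping the shard count fixed means a row's shard never changes; growing capacity only changes which cluster a shard lives on, which is what makes the cluster splits described later possible.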
Efficient indexes
Index definition (defined on columns):
  name: CLIENT_INDEX
  column: BASE
  fields:
    - name: client_id
    - name: fare
- Scalable: partitioned across shards
- Fast queries: only need to query a single shard
- Can be added / removed dynamically

Example: put_cell(100, 'BASE', { client_id: 10, driver_id: 437, fare: 10 }) and put_cell(121, 'BASE', { client_id: 10, driver_id: 217, fare: 15 }) each write a cell plus an index entry onto a shard.
The duality of Schemaless: Log and Key-Value Store
- Internally organized as an ordered log (append-only datastore)
- B-Tree index for (row, col, ts) lookups
- Efficient scanning for changes over time
[Diagram: put_cells land as recent inserts at the head of each shard's log (shards 0–3) over time]
Data driven triggers
partition_read(partition, columns, offset_vector) -> cells

All backend processing is triggered by data being written:
- (BASE, ROUTE) -> FARE
- (BASE, FARE) -> CLIENT_BILLING
- (BASE, FARE) -> DRIVER_PAYOUT

Functional programming paradigm. Robust in case of failures. Eliminates out-of-band message queues.
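A trigger consumer built on partition_read might look like the sketch below: read every cell written after a stored offset, hand each to a processing function, and persist the new offset. The function and the fake reader are illustrative, not the real API surface.

```python
def drain_partition(partition, columns, offset_vector, read_fn, handler):
    """One pass of a data-driven trigger (illustrative sketch).

    read_fn stands in for partition_read(partition, columns, offset_vector);
    handler is the backend step, e.g. (BASE, ROUTE) -> compute FARE.
    """
    cells, new_offset = read_fn(partition, columns, offset_vector)
    for cell in cells:
        handler(cell)
    # Persist new_offset durably so processing resumes after a crash;
    # re-processing is safe because downstream writes are append-only cells.
    return new_offset
```

Running this in a loop per partition replaces an out-of-band message queue: the shard-local log itself is the queue, and the offset vector is the consumer's position in it.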
Schemaless features
• Bigtable-style API for storing JSON dictionaries
• Horizontally scalable in both IO and capacity
• Append-only to track real-time interactions
• Fast secondary indexes
• Batch processing using triggers / partition_read API
• No downtime for changing schemas or indexes
• Built on a solid MySQL foundation
Schemaless Availability
"The difference between a high-quality product and a low-quality product is how well it works when stressed"
What happens when a database dies?
[Diagram: put_cell flows through the Distribution Layer to a master, which replicates to its slaves; two master/slaves clusters shown]
What happens when a database dies?
Slave failure: replace the slave.
[Diagram: one slave marked failed; put_cell continues through the Distribution Layer to the master and remaining slaves]
What happens when a database dies?
Master failure: hinted handoff. A slave is promoted to master. While promotion is in flight, two options:
1) Fail the write (fine for batch)
2) Buffer the write and retry later
Consistency guarantees
Log and retry: commutative operations make it simple.
put_cell(row, column, ts, json, buffered=true)
The trigger API hides the buffering from the application programmer.
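Why retries are safe here: each cell is written exactly once (append-only), and writes to distinct cells commute, so blindly replaying a buffered write cannot corrupt anything. A sketch of the client side, with made-up names and a simple exponential backoff standing in for the real buffering machinery:

```python
import time

def put_cell_buffered(put_cell, uuid, column_key, ts, data,
                      retries=5, backoff=0.01):
    """Illustrative sketch of a buffered write during master failover.

    Safe to retry because the cell (uuid, column_key, ts) is written at
    most once and writes to different cells commute.
    """
    for attempt in range(retries):
        try:
            put_cell(uuid, column_key, ts, data)
            return True
        except ConnectionError:
            # Master may be mid-promotion; back off and try again.
            time.sleep(backoff * 2 ** attempt)
    return False  # give up for now; a durable buffer would replay later
```

In Schemaless itself the application never writes this loop: buffered=true plus the trigger API hide it.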
Handling growth: database cluster splits
[Diagram: one cluster holding shards 1–4, receiving lots of writes]
Handling growth: database cluster splits
Set up a read-only shadow cluster in the background.
[Diagram: shards 1–4 mirrored onto the shadow cluster]
Handling growth: database cluster splits
Make the shadow cluster writable, then delete the extra shards from each cluster.
[Diagram: shards 1–2 remain on the original cluster, shards 3–4 on the new one]
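Since the shadow cluster already holds a full copy of the data, the split itself is only a routing change: half the shard range is repointed at the new cluster. A sketch of that map update (the even/odd-free "second half moves" policy is an illustrative choice, not the documented one):

```python
def split_shard_map(shard_map, old, new):
    """Repoint half of `old`'s shards at `new` (illustrative sketch).

    shard_map: dict of shard number -> cluster name. The shadow cluster
    `new` already replicates everything; after the map change, each
    cluster deletes the shards it no longer owns.
    """
    mine = sorted(s for s, c in shard_map.items() if c == old)
    for s in mine[len(mine) // 2:]:  # move the upper half of the range
        shard_map[s] = new
    return shard_map
```

Because rows hash to a fixed shard, no row ever moves between shards; only the shard-to-cluster assignment changes, so clients just refresh the map.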
Key operations
How fast can we add a MySQL slave? How fast can a slave be promoted to master?
Physical restore limits
The network is not infinitely fast. With a 10 Gbps NIC: ~3.6 TB in 1 hour, or ~512 GB in 30 min at 25% network saturation. The typical restore SLA is less than 30 minutes.
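The slide's figures are back-of-envelope NIC arithmetic; assuming decimal units, 3.6 TB/hour corresponds to roughly 80% sustained utilization of a 10 Gbps link, and the 30-minute figure to 25%:

```python
GBPS = 1e9 / 8  # bytes per second per Gbps (decimal units)

nic = 10 * GBPS                       # 10 Gbps NIC: 1.25 GB/s raw
at_80pct = nic * 0.80 * 3600 / 1e12   # TB moved in 1 hour at 80% -> 3.6
at_25pct = nic * 0.25 * 1800 / 1e9    # GB moved in 30 min at 25% -> 562.5
```

562.5 GB at 25% saturation lines up with the slide's "512 GB in 30 min" figure once protocol and disk overheads are accounted for, which is why a sub-30-minute restore SLA forces parallelism rather than a single stream.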
Partition data within hosts
Split an 8 TB host into chunks and restore the chunks in parallel.
Operating Storage at Scale
The Schemaless Storage System (recap)
[Architecture diagram shown again: service layer with per-service datastores, self-healing infrastructure of Schemaless instances, spanning zones within datacenters across regions US-West, US-East, CN-East, CN-West]
Do more with less
[Example manual runbook: a 1236-word setup and a 1557-word execution]
What gets hard with scale? Drift, drift, and drift
• Host and rack failures
• Upgrading the OS or MySQL across all boxes
• Performance tuning and debugging
• Somebody ran manual commands on a host (!)
• Creating new instances & indexes
• Manual steps applied inconsistently
• Running out of disk space
• Capacity planning
Pets vs Cattle
"Pets":
- Unique names ("fluffy", "biscuit", etc.)
- You know their address by heart
- Nurtured when they become ill
"Cattle":
- Enumerated names (cow0235)
- Arbitrary address
- Replaced when they become ill
So what does that mean?
Pets world view / Cattle world view:
- The desired state is in your head / The desired state is codified
- Driven by you / Driven by software
- Making changes is cool / Making changes is a non-event
- Operating directly on hosts / Changing the model
- Runbooks / Autonomous
- Brittle / Robust
- Operation oriented / Goal-state oriented
What is a goal state…?
• Less than 80% disk space used
• Hosts use Linux kernel version X
• At least one database is backed up in each cluster
• A cluster has the desired number of slaves
• An instance has X clusters
• Instance X exists
The goal-state engine
[Diagram: operator input defines the Goal State; the system continuously compares it against the Actual State, evaluates the current drift, and takes an action]
Idempotent, robust, restartable, continuous, self-healing, simple.
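The evaluate-drift-then-act cycle can be sketched as a single reconciliation pass; running it in a loop gives the continuous, restartable behavior. The data shapes and action callback here are illustrative, not Uber's actual engine:

```python
def reconcile(goal_state, actual_state, apply_action):
    """One pass of a goal-state engine (illustrative sketch).

    goal_state / actual_state: dicts of item name -> desired/observed
    properties. Every item whose actual state differs from its goal
    produces one corrective action. Actions must be idempotent, so the
    pass is safe to restart or repeat at any time.
    """
    drift = {name: goal for name, goal in goal_state.items()
             if actual_state.get(name) != goal}
    for name, goal in drift.items():
        apply_action(name, goal)  # e.g. set role, flip read-only flag
    return drift
```

The key property is that the engine never tracks "what I did last time": it only compares desired against observed, which is why it is robust to crashes, partial failures, and operators changing the goal underneath it.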
Example: Goal-State engine on a host
[A per-host Opsless Docker Agent pulls the goal state and reports the actual state]

DB       | Goal role | Sync_From | Actual role | Read-only | Issues
foo-db1  | master    | -         | master      | no        | none
bar-db2  | slave     | bar-db1   | slave       | yes       | none
baz-db10 | slave     | baz-db9   | slave       | yes       | none
Let's promote foo-db2 to be the new master

DB      | Host   | Goal role | Sync_From | Actual role | Read-only | Issues
foo-db1 | Host A | master    | -         | master      | no        | none
foo-db2 | Host B | slave     | foo-db1   | slave       | yes       | none

[Each host runs an Opsless Docker Agent holding its goal state: foo-db1 {role: master}, foo-db2 {role: slave, …}]
We just have to change the goal-state

DB      | Host   | Goal role | Sync_From | Actual role | Read-only | Issues
foo-db1 | Host A | idle      | -         | master      | no        | Wrong role
foo-db2 | Host B | master    | -         | slave       | yes       | Wrong role

[The agents observe the "Wrong role" drift and converge each host to its new goal]
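In this model, promoting a database is nothing but an edit to the goal-state data; no operator touches a host. A sketch of that edit, using dicts shaped like the slide's goal-state labels (the helper name and "idle" retirement role follow the slide; the dict layout is mine):

```python
# Before: foo-db1 is master, foo-db2 replicates from it (as on the slide).
goal = {
    "foo-db1": {"role": "master", "sync_from": None},
    "foo-db2": {"role": "slave", "sync_from": "foo-db1"},
}

def promote(goal, new_master, old_master):
    """Promotion is only a goal-state change (illustrative sketch).

    The per-host agents detect the resulting 'Wrong role' drift and
    reconfigure MySQL on their own; nothing here touches a host.
    """
    goal[new_master] = {"role": "master", "sync_from": None}
    goal[old_master] = {"role": "idle", "sync_from": None}
    return goal
```

Usage: promote(goal, "foo-db2", "foo-db1") yields exactly the goal state in the table above, and the agents do the rest.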