Percona Live Europe 2016 Launching Vitess Anthony Yeh, Dan Rogart Amsterdam, Netherlands | October 3 – 5, 2016
Overview http://vitess.io
Why Vitess?
[Diagram: three stacks side by side. Their App on Their Sharding Magic on MySQL; YouTube on Vitess Sharding Magic on MySQL; Your App on Vitess Sharding Magic on MySQL.]
Why not Vitess?
Vitess is...
● an opinionated cluster
○ Many ways to scale; this is one.
○ More on those opinions next.
● a powerful tool
○ Huge problems get easier.
○ Simple things get more complex.
Vitess is not...
● a proxy
○ Understands the query.
○ Generates queries of its own.
● plug-and-play
○ ... yet.
○ This talk is about the gaps.
Launching Vitess http://vitess.io/user-guide/launching.html
Scalability Philosophy
Horizontal Scaling
Small Instances
● Many instances per host
● Faster replication, backup/restore
● Less contention, outages isolated
● Improves HW utilization
Cluster Orchestration
● Containers isolate ports, files, compute
● Scheduling for resilience
Self-Healing, Automation
● Health checks
● Ops work should be O(1)
Durability and Consistency
Durability through replication
● Disk is not durable
○ sync_binlog off
● Data must be on multiple machines
○ semisync
○ lossless failover
○ routine reparent
Sharded consistency model
● Single-shard transactions
○ Same guarantees as MySQL
● Cross-shard transactions
○ May fail partially across shards
○ Work in progress on 2PC
● Cross-shard reads
○ Even with 2PC, may read from shards in different states
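To make the "may fail partially across shards" caveat concrete, here is a minimal Python sketch of a best-effort commit across two shards with no 2PC. It is a toy model, not Vitess code: the `Shard` class, the `commit_all` helper, and the shard names are all hypothetical.

```python
class Shard:
    """Toy stand-in for one shard's master; not a Vitess API."""

    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.pending = {}     # writes inside the open transaction
        self.committed = {}   # durable state after commit

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        if self.fail_on_commit:
            self.pending.clear()  # this shard rolls back
            raise RuntimeError(f"commit failed on {self.name}")
        self.committed.update(self.pending)
        self.pending.clear()


def commit_all(shards):
    """Commit shard by shard with no coordination (no 2PC).

    Returns the names of the shards that committed before the first failure.
    """
    done = []
    for shard in shards:
        try:
            shard.commit()
        except RuntimeError:
            break  # shards committed earlier stay committed: a partial failure
        done.append(shard.name)
    return done


s0 = Shard("shard 0")
s1 = Shard("shard 1", fail_on_commit=True)
s0.write("user:1", "debit 10")
s1.write("user:2", "credit 10")
result = commit_all([s0, s1])
print(result, s0.committed, s1.committed)
```

In this model the debit is durable on the first shard while the credit is lost on the second, which is the state an application has to detect and repair when it chooses cross-shard transactions without 2PC.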
Globally Distributed
Multi-Cell Deployment
● Cell = Zone | Availability Zone
○ Possible shared fate within cell
○ But failures shouldn't propagate
● Multi-Region
○ Survive fiber cuts, regional outages
○ Lower regional read latency
● Single-Master
○ Writes redirected at frontend
○ Only one inter-cell roundtrip
○ DB writes intra-cell
Cluster Metadata ("Topology")
● Distributed, consistent, highly available key-value store
○ e.g. etcd, ZooKeeper
● Global Topology Store
○ Quorum across multiple cells
○ Survives any given cell death
● Local Topology Store
○ Quorum within a single cell
○ Independent of any other cell
Production Planning
Testing
Integration Tests
● Run app tests against Vitess
○ Use real schema
○ Test sharding
● py/vttest
○ Small footprint to run on 1 machine
○ Emulate a full cluster for tests
○ Loads schema from .sql files
○ 1 vtcombo = all Vitess servers
○ 1 mysqld = all shards
Query Compatibility
● Bind Variables
○ Client-side prepared statements
○ Vitess query plan cache
● Tablet Types
○ master: writes, read-after-write
○ replica: live site read traffic
○ rdonly: batch jobs, backups
● Query Support
○ Vitess SQL parser is incomplete
○ Report important use cases
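The link between bind variables and the query plan cache can be illustrated with a small Python sketch. It assumes a cache keyed by the SQL text, which is a simplification; `PlanCache` is a hypothetical class, not the Vitess implementation, and the `:id` placeholder is used only for illustration.

```python
class PlanCache:
    """Toy query plan cache keyed by SQL text; not Vitess's implementation."""

    def __init__(self):
        self.plans = {}
        self.misses = 0

    def get_plan(self, sql):
        if sql not in self.plans:
            self.misses += 1  # stands in for an expensive parse-and-plan step
            self.plans[sql] = f"plan for: {sql}"
        return self.plans[sql]


literal_cache = PlanCache()
bound_cache = PlanCache()

for user_id in range(100):
    # Literal values make every query text unique, so nothing is reused.
    literal_cache.get_plan(f"SELECT name FROM users WHERE id = {user_id}")
    # A bind variable keeps the text constant; the value travels separately.
    bound_cache.get_plan("SELECT name FROM users WHERE id = :id")

print(literal_cache.misses, bound_cache.misses)
```

Under this assumption, inlined literals cause one cache miss per distinct value, while the bind-variable form is planned once and reused.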
Replication
Binary Logging
● Enabled everywhere (slaves too)
● Statement-based
○ Rewrite to PK lookups
● GTID required
● Used for master management, resharding, update stream, schema swap, etc.
Side Effects
● Triggers
● Stored procedures
● Foreign key constraints
● These can break resharding
Monitoring
Status URLs (vtgate, vttablet, etc.)
● /debug/status
● /debug/vars
○ Prometheus, InfluxDB
● /healthz
● /queryz
● /schemaz
Coming soon...
○ Realtime fleet-wide health map
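As a rough sketch of how /debug/vars could feed a metrics system, the Python below fetches the page and keeps only top-level numeric values. It assumes the endpoint returns a JSON object, as Go's expvar package does. The `base_url` argument, the helper names, and the sample payload are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_vars(base_url, timeout=5.0):
    """Fetch and decode /debug/vars from one server (assumed to return JSON)."""
    with urlopen(f"{base_url}/debug/vars", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def numeric_vars(payload):
    """Keep only top-level numeric values, which are easy to export as metrics."""
    return {
        name: value
        for name, value in payload.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


# Made-up sample payload, so the example runs without a live server.
sample = json.loads(
    '{"ExampleCounter": 42, "ExampleLabel": "abc", "ExampleMap": {"a": 1}}'
)
metrics = numeric_vars(sample)
print(metrics)
```

Against a running server, `numeric_vars(fetch_vars(base_url))` would be the equivalent call; nested maps are dropped here only to keep the sketch short.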
Backups
Built-in Backups
● Part of cloning, schema swap
○ Restores every day
● Storage Plugins
○ Filesystem (NFS, etc.)
○ Google Cloud Storage
○ Amazon S3
○ Ceph
● Needs to be triggered periodically
Migration Strategies Tribute
Migration
New Workloads
● Getting Started + Launch Guide
Offline Migration
● Import data to Vitess
Online Migration
● Run Vitess above existing MySQL
● Previously Unsharded
● Already Sharded
○ Custom Vindex
YouTube Production Dan Rogart, YouTube SRE
Run Vitess the SRE Way!
• Cattle, not pets
• Systemic failure is more important than individual failure
• Failure is constant
• Automate responses to failure when appropriate
• Or detect and alert a human if required
• The atomic unit is a MySQL instance, for durability, availability, and replacement
"If I have seen further than others, it is by standing upon the shoulders of giants" -- Isaac Newton
• s/seen/scaled/
• Vitess runs on MySQL...
• MySQL runs on Borg (Google's container cloud)...
• Borg runs on Google datacenters and networks...
• Each level is supported by amazing teams and we rely heavily upon their work
Vitess runs on MySQL on Borg
• YouTube/Vitess did not fully migrate into Borg until 2013
• So it is actually a pretty good example of how a Vitess integration with an existing MySQL stack went (pretty well, so far)
• MoB (MySQL on Borg) had a lot of mature tools that Vitess leveraged:
- Backups
- Failover
- Schema Management
Decider
[Diagram of one shard. Labels: vtctld, decider, vtgate (x2), and vttablet + mysqld pairs for the master, the replicas, and the batch replicas.]
Decider... (vastly simplified):
• Polls all MySQL instances every n seconds
• If the old master is unhealthy, it elects a new master from the replica pool
• It re-masters all the other replicas to properly replicate from the new master
• Is the reason TabletExternallyReparented exists in Vitess
• Total failover times for YouTube Vitess are around 5 seconds
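The poll, elect, and re-master steps above can be sketched in Python as follows. This is a toy model, not the real decider: the data layout and function names are hypothetical, and preferring the most caught-up replica is an assumption, since the slide only says a new master is elected from the replica pool.

```python
def elect_new_master(instances, current_master):
    """Return the master to use after one poll.

    `instances` maps instance name -> {'healthy': bool, 'gtid_pos': int},
    where 'gtid_pos' is a toy stand-in for replication position.
    """
    if instances[current_master]["healthy"]:
        return current_master  # nothing to do
    candidates = [
        name
        for name, state in instances.items()
        if name != current_master and state["healthy"]
    ]
    if not candidates:
        raise RuntimeError("no healthy replica to promote")
    # Assumption: prefer the most caught-up healthy replica.
    return max(candidates, key=lambda name: instances[name]["gtid_pos"])


def remaster(instances, new_master):
    """Point every other healthy instance at the new master."""
    return {
        name: new_master
        for name, state in instances.items()
        if name != new_master and state["healthy"]
    }


fleet = {
    "db-1": {"healthy": False, "gtid_pos": 120},  # old master, now unhealthy
    "db-2": {"healthy": True, "gtid_pos": 118},
    "db-3": {"healthy": True, "gtid_pos": 120},
}
new_master = elect_new_master(fleet, "db-1")
sources = remaster(fleet, new_master)
print(new_master, sources)
```

Because a failover like this happens outside Vitess, the external tool then has to tell Vitess which tablet is the new master, which is the role the slide gives to TabletExternallyReparented.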
Schema Management (small changes)
• Autoschema
• A "small" change is basically an ALTER against a table with < 2M rows
• When executed on a replica it won't block the replication stream
• Defined paths in source control are monitored
• When a peer-reviewed file containing SQL is submitted...
• ...autoschema will validate the change and apply it to all masters in a cluster
Schema Management (big changes)
• Pivot
• A "big" change is basically an ALTER that will block traffic for too long on the master or block replication too long when executed on a slave
• Defined paths in source control are monitored
• When a peer-reviewed file containing SQL is submitted...
• ...an SRE will start a pivot
• The ALTER is applied to a single replica and a seed backup is taken
• All other replicas are restarted such that they restore from the backup that contains the change
• Finally, the master is done last and a replica with the change is promoted
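The ordering of a pivot can be written down as a short Python sketch. It only builds a list of steps. The step names and the `pivot_plan` helper are hypothetical, and promoting the seed replica before handling the old master is an assumption about how the last bullet plays out.

```python
def pivot_plan(master, replicas):
    """Return the ordered steps of a pivot as (action, instance) pairs."""
    if not replicas:
        raise ValueError("a pivot needs at least one replica")
    seed, rest = replicas[0], replicas[1:]
    steps = [
        ("apply_alter", seed),       # ALTER runs on a single replica
        ("take_seed_backup", seed),  # the backup now contains the change
    ]
    # Every other replica restarts and restores from the seed backup.
    steps += [("restore_from_seed_backup", name) for name in rest]
    # The master goes last: promote a replica that has the change,
    # then bring the old master up from the same backup.
    steps.append(("promote", seed))
    steps.append(("restore_from_seed_backup", master))
    return steps


plan = pivot_plan("db-1", ["db-2", "db-3", "db-4"])
for action, instance in plan:
    print(action, instance)
```

The point of the ordering is that no instance serving traffic ever runs the blocking ALTER itself; only the seed replica pays that cost.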
Schema Management
• Autoschema changes take minutes
• Pivots take days
• At YouTube all schema changes must be forwards and backwards compatible with code. Enforced with extensive automated tests.
• Sometimes dangerous: a common example is removing a column using a pivot. This can break replication, so we have to block access.
• Sometimes confusing for our developers: they shouldn't really care about how a change happens
• Open source pivot is coming.
Resharding Automation
• Online copy of data performed n times
• Final offline copy of data to sync to a GTID
• Filtered replication
• Traffic redirect
• ???
• Profit!
Resharding Automation (online copy)
[Diagram: unsharded source, shard 0, and shard 1, each with a master and replicas (vttablet + mysqld pairs), plus a vtworker.]
• Replication running
• Read chunks from source
• Read chunks from target
• Reconcile and write diff to target
• Adaptive throttle
Resharding Automation (offline copy)
[Diagram: unsharded source, shard 0, and shard 1, each with a master and replicas (vttablet + mysqld pairs), plus a vtworker.]
• Replication stopped
• Read chunks from source
• Read chunks from target
• Reconcile and write diff to target
• Adaptive throttle
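Both copy phases share the same read, reconcile, and write loop. A minimal Python sketch of the reconcile step for one chunk is shown below. Rows are modelled as dicts keyed by primary key, and the function names are hypothetical, not vtworker's.

```python
def reconcile_chunk(source_rows, target_rows):
    """Compare one chunk and return the writes needed on the target.

    Both arguments map primary key -> row value for the same key range.
    """
    # Rows missing on the target or different from the source get rewritten.
    upserts = {
        pk: row for pk, row in source_rows.items() if target_rows.get(pk) != row
    }
    # Rows that exist only on the target are stale and get removed.
    deletes = [pk for pk in target_rows if pk not in source_rows]
    return upserts, deletes


def apply_diff(target_rows, upserts, deletes):
    """Write the diff to the target so it matches the source chunk."""
    target_rows.update(upserts)
    for pk in deletes:
        del target_rows[pk]


source = {1: "a", 2: "b2", 3: "c"}
target = {1: "a", 2: "b", 4: "stale"}
upserts, deletes = reconcile_chunk(source, target)
apply_diff(target, upserts, deletes)
print(upserts, deletes, target)
```

Writing only the diff is what makes repeated online passes cheap: each pass touches only the rows that changed since the previous one, and the final offline pass, with replication stopped, has little left to do.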
Resharding Automation (filtered repl)
[Diagram: unsharded source, shard 0, and shard 1, each with a master and replicas (vttablet + mysqld pairs).]
• Target master tablets connect to a source replica
• Parse binlogs and apply statements that belong in that shard
• GTID is stored and replicated on target to survive restarts
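The "apply statements that belong in that shard" step can be sketched in Python as a key-range filter. This is an illustration only. It assumes each binlog event carries a keyspace ID as bytes and that a shard owns a half-open range [start, end), with empty bytes meaning unbounded. The event tuples and function names are hypothetical.

```python
def in_key_range(keyspace_id, start, end):
    """True if start <= keyspace_id < end; empty bytes mean unbounded."""
    if start and keyspace_id < start:
        return False
    if end and keyspace_id >= end:
        return False
    return True


def filter_events(events, start, end):
    """Keep only the statements whose keyspace ID falls in this shard's range."""
    return [
        sql for keyspace_id, sql in events if in_key_range(keyspace_id, start, end)
    ]


# Toy binlog stream from the unsharded source: (keyspace_id, statement).
events = [
    (bytes.fromhex("10"), "UPDATE t SET v = 1 WHERE id = 1"),
    (bytes.fromhex("80"), "UPDATE t SET v = 2 WHERE id = 2"),
    (bytes.fromhex("f0"), "UPDATE t SET v = 3 WHERE id = 3"),
]
shard_0 = filter_events(events, b"", bytes.fromhex("80"))
shard_1 = filter_events(events, bytes.fromhex("80"), b"")
print(shard_0)
print(shard_1)
```

Every statement lands on exactly one target because the ranges are half-open and adjacent. Persisting the last applied GTID alongside the data, as the slide notes, is what would let such a filter resume after a restart.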
Resharding Automation (redirection)
• Finally, application traffic is redirected:
- vtctl-prod MigrateServedTypes keyspace_name/0 replica
- (^^^^ sends replica traffic from unsharded to sharded)
- vtctl-prod MigrateServedTypes keyspace_name/0 master
- (^^^^ master cutover, point of no return)
• < 5s of downtime during master cutover (faster than a normal decider failover, since only the Vitess layer is touched)
Regression Testing
• We use the Yahoo Cloud Serving Benchmark (YCSB)
• Allows for comparison of Vitess to other storage solutions using the same workloads
• A daily Vitess/YCSB sandbox is run to measure qps per core and latency
• Deviations from previous results (positive or negative) are noted and investigated
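A check for deviations in either direction might look like the Python sketch below. The metric names, the sample numbers, and the 5% tolerance are invented for illustration and are not YouTube's values.

```python
def deviations(baseline, current, tolerance=0.05):
    """Return metrics whose relative change from baseline exceeds tolerance.

    Both improvements and regressions are flagged, since an unexplained
    improvement can also point at a broken benchmark.
    """
    flagged = {}
    for name, old in baseline.items():
        change = (current[name] - old) / old
        if abs(change) > tolerance:
            flagged[name] = change
    return flagged


baseline = {"qps_per_core": 1000.0, "p99_latency_ms": 20.0}
today = {"qps_per_core": 1020.0, "p99_latency_ms": 25.0}
flagged = deviations(baseline, today)
print(flagged)
```

With these made-up numbers the 2% qps change stays under the tolerance and only the latency change is flagged for investigation.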