managing mysql at scale
play

Managing MySQL at Scale Pradeep Nayak & Junyi (Luke) Lu - PowerPoint PPT Presentation

Managing MySQL at Scale Pradeep Nayak & Junyi (Luke) Lu Production Engineers - MySQL Infra Agenda 1 Terminology 2 Lifecycle of a MySQL instance 3 How do we migrate MySQL instance 4 How do we migrate shard 5 Balancing 6 Testing


  1. Managing MySQL at Scale Pradeep Nayak & Junyi (Luke) Lu Production Engineers - MySQL Infra

  2. Agenda 1 Terminology 2 Lifecycle of a MySQL instance 3 How do we migrate MySQL instance 4 How do we migrate shard 5 Balancing 6 Testing Infrastructure for automations

  3. Terminology

  4. Terminology What’s a instance ? What’s a shard ? What’s a replicaset ?

  5. Instance foobar.prn:3307 foobar.prn foobar.prn:3309 3306 foobar.prn:3306 3307 3309

  6. Shard db.helloworld db.12345 db.44365

  7. Replicaset Service WWW Discovery Replicaset 1 Replicaset 2 Replicaset 3 (0-99) (100-199) (200-299) M1 M2 M3 S1 S2 S3

  8. Service Discovery Shard ID Replicaset Master Slave 0-99 Replicaset 1 db1234.prn1:3306 db1234.frc1:3306 100-199 Replicaset 2 db4567.ftw1:3306 db4567.prn1:3307 200-299 Replicaset 3 db1234.atn1:3306 db1234.frc2:3308

  9. Service Discovery Shard ID Replicaset Master Slave 0-99 Replicaset 1 db1234.prn1:3306 db1234.frc1:3306 100-199 Replicaset 2 db4567.ftw1:3306 db1234.prn1:3309 200-299 Replicaset 3 db1234.frc2:3308 db1234.atn1:3309

  10. host host123.prn host245.ftw host567.frc instance 3306 3309 3309 3306 3306 3307 replicaset mysql.replicaset.6789 mysql.replicaset.12345 shard db.helloworld db.foobar

  11. Lifecycle of an instance

  12. Lifecycle of an instance ● States are production, spare, spare allocated, spare deallocated, reimage, drained ● Metadata includes instance properties like name, port, mysql rpm version, state etc ● A mysql shard hosts metadata of all instances in the fleet

  13. Lifecycle of an instance pre-prod checks success reimage spare s s e c c u s s k c e prod checks fail allocated for copy h c d o r p - e r p copy failed spare allocated spare deallocated copy success copy success undrain copy success production drained

  14. Lifecycle of an instance ● Each state has its own processor to do the work ● Each state has a queue where work is queued ● Runs constantly scanning the fleet

  15. Instance Migration

  16. Clone an instance Use case of cloning a production MySQL instance - Replace a broken instance/host - Move data around for maintenance - Balancing host utilization

  17. MPS Copy A workflow system that manages the requests for cloning MySQL instances ● spare allocation ● set up MySQL config ● copy data ● replication ● validation ● bring it online & remove the old instance if necessary

  18. MPS Copy - Allocation Choose the best slot for the instance based on its footprint ● Disk usage ● CPU utilization ● Failure domain Allocation Setup Migration Replication Validation Registeration

  19. MPS Copy - Setup Turn up an empty instance using the right configuration - Install the right RPM version - Bootstrap the correct directory - Generate the right my.cnf based on its use case - Make sure the empty instance is connectable Allocation Setup Migration Replication Validation Registeration

  20. MPS Copy - Data Migration We support three different ways of cloning a production instance ● Physical copy: xtrabackup and myrocks_hotbackup ● Logical copy: ○ mysqldump ○ Restore from backup Allocation Setup Migration Replication Validation Registeration

  21. MPS Copy - Replication ● Setup replication ○ From current production master ○ From Binlog Server ● Catchup Allocation Setup Migration Replication Validation Registeration

  22. MPS Copy - Validation If the data migration is a logical one, we will use snapshot based checksum to verify the correctness of data by comparing to its current master Allocation Setup Migration Replication Validation Registeration

  23. MPS Copy - Service Registration Register the new instance in our service discovery system so that the MySQL users will be able to notice this new instance that has been recently turned up Allocation Setup Migration Replication Validation Registeration

  24. Online Shard Migration

  25. Online Shard Migration Another fundamental piece of our infra to control the growth of each MySQL instance - Instance can grow beyond the host level limit - Too big - Too hot

  26. Online Shard Migration (OLM) Key concept: Move the data of a shard into other smaller/cooler instance through logical migration and register the new address into the service discovery system

  27. OLM host host245.ftw host567.ftw host123.prn host345.prn 3306 3307 3306 3307 instance replicaset mysql.replicaset.1234 mysql.replicaset.5678 shard db.2 db.1 db.3

  28. OLM db.1 = 100G binlogs table1.csv table2.csv 100G db.1 db.2 = 300G db.3 = 50G table3.csv mysql.replicaset.1 mysql.replicaset.2 Shard Replicaset mysql.replicaset.2 db.1 mysql.replicaset.1 db.2 mysql.replicaset.1 db.3 mysql.replicaset.3

  29. OLM Processor Workflow management for massive OLM operations - Conflict solver - Picking the best destination replicaset - Kickoff the actual move - Proper retry and cleanup

  30. Balancing

  31. Balancing Find the right place for the workload in order to achieve maximum sustainable resource utilization

  32. Poor Stacking RS4: 200G RS2: 300G Host: 500G Host: 500G Host: 500G Host: 500G RS1: 400G RS3: 100G Total: 1000G

  33. Poor Stacking Free: 100G Host 2 Host 1 Host 3 Free: 500G Free: 500G Free: 500G RS1: 400G RS1: 400G

  34. Poor Stacking Free: 100G Free: 200G Host 2 Host 1 Host 3 Free: 500G Free: 500G Free: 500G RS1: 400G RS2: 300G RS2: 300G

  35. Poor Stacking Free: 100G Free: 100G Free: 200G RS3: 100G Host 2 Host 1 Host 3 Free: 500G Free: 500G Free: 500G RS1: 400G RS2: 300G RS3: 100G

  36. Poor Stacking Free: 100G Free: 100G Free: 200G RS3: 100G Free: 300G Host 2 Host 1 Host 3 Free: 500G Free: 500G Free: 500G RS1: 400G RS2: 300G RS4: 200G RS4: 200G

  37. Proper Stacking Free: 100G RS3: 100G Free: 200G Host 2 Host 1 Host 3 Free: 500G Free: 500G Free: 500G RS1: 400G RS2: 300G RS3: 100G

  38. Proper Stacking Free: 100G RS3: 100G RS4: 200G Free: 200G Host 2 Host 1 Host 3 Free: 500G Free: 500G Free: 500G RS1: 400G RS2: 300G RS4: 200G

  39. Carve the Shape Host: 500G Host: 500G Host: 500G RS1: 400G Host: 500G RS2: 200G RS3: 200G RS4: 200G Total: 1000G

  40. Poor Shape Free: 100G Free: 100G Free: 300G Free: 300G RS3: 200G Host 3 Free: 500G RS1: 400G RS2: 200G RS4: 200G X3

  41. Carve the Shape 100G Host: 500G 100G Host: 500G 100G Host: 500G RS1: 400G RS1: 400G RS2: 300G 100G 100G 100G Host: 500G 100G RS2: 200G RS3: 200G RS3: 200G RS4: 200G 100G 100G 100G RS4: 100G 100G Total: 1000G

  42. Rebalancer Goal: Find the best slot for hosting the given workload profile and reduce the imbalance score across the fleet to be minimum Maintenance Rebalancer Moves Replace Broken Host

  43. Rebalancer - Challenges Multiple balancing factors - CPU/Memory/Disk usage - Fault domain spreading - MySQL vs LBU anti affinity

  44. Testing Infrastructure

  45. Testing Infra for Automations ● Lots of Automation code handling critical components of Infra ● UnitTests are good but mock backend connection ● Need to test end to end

  46. Testing Infra Goals ● build and canary packages based on the change ● provide signals at diff time for developer ● production like setup, but isolated environment ● iterate quickly with confidence

  47. deallocate Run the Testing Infra deallocate & test code Service allocate/deallocate refresh metadata trigger test access test signal Virtual test assets Diff metadata prod snapshot developer

  48. Q&A

Recommend


More recommend