
One System To Fit Them All: Shared MySQL Hosting At Facebook



  1. One System To Fit Them All: Shared MySQL Hosting At Facebook • Andrew Regner • Production Engineer | MySQL Infrastructure

  2. Data choices @Facebook • Everyone has data to persist • Also have: • ZippyDB, ODS, Scuba, HBase, Scribe, RocksDB, TAO • MySQL is the most mature

  3. XDB "Anything" Database • Larger and/or older use cases of MySQL will have their own *db • XDB is supposed to be the answer for everyone else

  4. Our History c. 2004 • Started with "CDB" allocating resources manually and with little logic behind it. • In the last few years, we've grown the MySQL Teams by a few engineers, but the company has grown > 10x.

  5. Things that store data • Hundreds of Teams • Thousands of Shards • video encoding (queue) • data warehouse (metadata) • job scheduling • server management • internal tools (tasks, wiki) • hack-a-thon toys • visitor sign-in system • backup systems (more than I can count) • qualitative analysis of search results • machine learning models

  6. Terminology

  7. [Diagram: one server runs two instances, and each instance holds many shards]

  8. [Diagram: a Master replicates to two Slaves]

  9. Replica Set [Diagram: the shards tasks, burger, pong_1, feed, videos, tags, and wiki live on one instance and are replicated to two more instances, each holding a copy of every shard]

  10. [Diagram: replication links]

  11. Philosophy of FB Infrastructure Move Fast • Enable engineers to do what they need, when they need it. • They understand the importance and scope of their product the best.

  12. Philosophy of FB MySQL Infra Build Stable Infrastructure • K.I.S.S. works at scale, too • No: foreign keys, views, events, triggers, procedures, replication lag • Yes: good indexes, sharding, planning

  13. Story Time

  14. Robert's New Feature xdb.profile_events

  15. Robert's New Feature xdb.profile_events

  16. Robert's New Feature xdb.profile_events Lands a configuration change to cause all records in his database to be re-processed. Everything is still in test mode, so not worried about the impact to anything.

  17. xdb.profile_events (master) [Graph: Connected, Running, Lock time]

  18. Robert's New Feature xdb.profile_events His system limits concurrency to 30 at once. Didn't realize that the script opens a new connection for each of 100 queries per execution. :-(
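The arithmetic behind slide 18 can be sketched as follows. This is an illustrative simulation only: it counts connection opens rather than talking to MySQL, and the function name `connections_opened` is hypothetical. The two constants come from the slide; treating all 30 executions as one "wave" is an assumption.

```python
# Hypothetical simulation: counts connection opens, does not talk to MySQL.
CONCURRENCY = 30             # concurrent executions allowed by the job system (from the slide)
QUERIES_PER_EXECUTION = 100  # queries issued by each execution (from the slide)


def connections_opened(concurrency: int, queries: int, reuse: bool) -> int:
    """Return how many connections one wave of concurrent executions opens."""
    # With reuse, each execution opens a single connection for all its queries.
    per_execution = 1 if reuse else queries
    return concurrency * per_execution


per_query = connections_opened(CONCURRENCY, QUERIES_PER_EXECUTION, reuse=False)
reused = connections_opened(CONCURRENCY, QUERIES_PER_EXECUTION, reuse=True)
print(per_query, reused)
```

Under these assumptions the concurrency limit of 30 bounds the number of executions, not the number of connections, which is why the limit did not protect the master.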

  19. Robert's New Feature xdb.profile_events • Query comments to know where the job is • Use internal UIs to kill the job, then page Robert
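A minimal sketch of the query-comment idea from slide 19, assuming the application builds its SQL as strings. The helper `tag_query`, the `job=`/`owner=` comment layout, and the `processed` column are all hypothetical; the slides do not show the format Facebook actually uses.

```python
def tag_query(sql: str, job: str, owner: str) -> str:
    """Prefix a SQL statement with a comment naming the job and its owner."""
    # Strip comment terminators so a value cannot close the comment early.
    safe_job = job.replace("*/", "")
    safe_owner = owner.replace("*/", "")
    return f"/* job={safe_job} owner={safe_owner} */ {sql}"


tagged = tag_query(
    "UPDATE profile_events SET processed = 0",  # hypothetical statement
    job="reprocess_all",
    owner="robert",
)
print(tagged)
```

With a comment like this in place, a statement seen in the process list or a slow-query sample can be traced back to the job that issued it and the person to page.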

  20. Remembering All The Tickets xdb.ticket_processing

  21. Remembering All The Tickets xdb.ticket_processing Every time something changes with a Help Center support ticket, some metadata is created. Have to hold onto parts of it for legal reasons.

  22. xdb.ticket_processing [Screenshot: alarm, 1 day ago]

  23. Remembering All The Tickets xdb.ticket_processing • Guidance on maximum ideal shard size • Use cases vary too much to enforce

  24. Remembering All The Tickets: Forgetting some things • A tool that an intern (now a full-time employee) wrote deletes large amounts of data, selected by arbitrary SQL, in chunks with range queries. • OSC on a larger host to reclaim disk space, then replace all the instances.
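The chunked delete described on slide 24 might look roughly like the sketch below. It is not the internal tool: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere, and the table `ticket_metadata`, the `stale` column, and the function `delete_in_chunks` are hypothetical. It assumes an integer primary key named `id`.

```python
import sqlite3


def delete_in_chunks(conn, table, where_sql, chunk_size=1000):
    """Delete rows matching where_sql, one primary-key range at a time.

    table and where_sql are interpolated directly, so they must be trusted.
    """
    lo, hi = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}").fetchone()
    deleted = 0
    if lo is None:
        return deleted  # empty table, nothing to do
    start = lo
    while start <= hi:
        end = start + chunk_size  # half-open range [start, end)
        cur = conn.execute(
            f"DELETE FROM {table} WHERE id >= ? AND id < ? AND ({where_sql})",
            (start, end),
        )
        conn.commit()  # commit per chunk so each transaction stays small
        deleted += cur.rowcount
        start = end
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket_metadata (id INTEGER PRIMARY KEY, stale INTEGER)")
conn.executemany(
    "INSERT INTO ticket_metadata (id, stale) VALUES (?, ?)",
    [(i, i % 2) for i in range(1, 10001)],
)
conn.commit()

removed = delete_in_chunks(conn, "ticket_metadata", "stale = 1")
remaining = conn.execute("SELECT COUNT(*) FROM ticket_metadata").fetchone()[0]
print(removed, remaining)
```

Walking the primary key in fixed ranges means each statement touches a bounded number of rows, at the cost of issuing some statements that delete nothing when the matching rows are sparse.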

  25. Remembering All The Tickets What we learned • Existing automation is very good at hiding this • Put shard sizes in front of the user ASAP • Proactively reach out to owners of the top 5% by shard size • Automatically notify owners when their growth looks "dangerous"
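One way to act on the last two bullets of slide 25 is sketched here. The slides do not define "dangerous" growth, so the linear projection in `days_until_full` is an assumption, as are the function names, the daily sampling interval, and the sample shard names and sizes.

```python
import math


def top_percent_by_size(sizes_gb, percent=5.0):
    """Return the shard names in the largest `percent` of shards by size."""
    count = max(1, math.ceil(len(sizes_gb) * percent / 100))
    ranked = sorted(sizes_gb, key=sizes_gb.get, reverse=True)
    return ranked[:count]


def days_until_full(history_gb, limit_gb):
    """Estimate days until limit_gb from daily size samples (linear growth)."""
    if len(history_gb) < 2:
        return None  # not enough data to estimate growth
    daily_growth = (history_gb[-1] - history_gb[0]) / (len(history_gb) - 1)
    if daily_growth <= 0:
        return None  # flat or shrinking, no projected fill date
    return max(0.0, (limit_gb - history_gb[-1]) / daily_growth)


# Hypothetical data: 40 shards sized 1 GB to 40 GB.
sizes = {f"xdb.shard_{i}": float(i) for i in range(1, 41)}
biggest = top_percent_by_size(sizes)
eta = days_until_full([100.0, 110.0, 120.0, 130.0], limit_gb=200.0)
print(biggest, eta)
```

A notification rule could then flag any shard whose estimate falls below some threshold of days, with that threshold chosen per environment.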

  26. Cleaning up some data xdb.analytics

  27. Cleaning up some data xdb.analytics Engineer somewhere in the world wants to clean up some stale records • DELETE FROM table_one WHERE table_one.id IN (SELECT foreign_id FROM table_two WHERE some_random_thing = 'foobar')

  28. xdb.analytics

  29. Cleaning up some data xdb.analytics • The only response after it happens is to not do it again • Replace instances if we need to • If we catch it earlier, we can kill the query on the master

  30. The nicest bad neighbor ever xdb.scheduler / xdb.looks_like

  31. The nicest bad neighbor ever xdb.scheduler / xdb.looks_like • Shared pool of general purpose XDB shards • A bunch of little shards on a few replica sets for a queue workload • A large shard on the same replica set for archival

  32. xdb.scheduler / xdb.looks_like [Graphs: Running Threads, History List Length]

  33. Current Tools

  34. dba replblame: finding the cause of lag

      ===============================================================
      Instance: xdb0123.prn1:3307
      Report UUID: 91a5d53e-2fcf-49e8-8546-26df7fda31d1
      Time started: 2016-09-28 08:56:11
      Length: 30s
      Reason: dbstatus disabled instance for lag
      ===============================================================
      Total sampled queries: 2753
      myservice_data (2750):
        2019 LOAD DATA INFILE ? REPLACE INTO TABLE `all_the_data` FIELDS TERMINATED BY ? ENCLOSED BY '?\\?\n?
         420 LOAD DATA INFILE ? IGNORE INTO TABLE `some_more_data` FIELDS TERMINATED BY ? ENCLOSED BY '?\\?\n?
         127 UPDATE `sig_tw_jobs` t SET t.status = ? WHERE t.shard_id = ? AND t.handle = ? AND t.status = ?
          10 UPDATE `sig_model_snapshot` s, `sig_model` m SET s.removed = ?, m.active_snapshot_id = ? WHERE s.model_snapshot_id = ? AND m.model_...70 more bytes
           2 UPDATE `sigrid_model_snapshot` SET model_output = ? WHERE model_snapshot_id = ?
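The report on slide 34 groups sampled queries by shape, with literals replaced by `?`. A rough sketch of that kind of grouping is below. It is not how `dba replblame` is implemented, which the slides do not show: the regex-based `normalize` is a simplification, the function names are hypothetical, and the `model_id` column in the sample data is invented for illustration.

```python
import re
from collections import Counter


def normalize(sql: str) -> str:
    """Replace string and numeric literals with '?' and collapse whitespace."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # single-quoted strings
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)   # standalone numbers
    return re.sub(r"\s+", " ", sql).strip()


def blame(samples):
    """Count sampled queries per normalized shape, most frequent first."""
    return Counter(normalize(q) for q in samples).most_common()


samples = [
    "UPDATE `sig_tw_jobs` t SET t.status = 'done' WHERE t.shard_id = 7",
    "UPDATE `sig_tw_jobs` t SET t.status = 'failed' WHERE t.shard_id = 12",
    "UPDATE `sig_model` SET active_snapshot_id = 3 WHERE model_id = 9",
]
report = blame(samples)
for shape, count in report:
    print(count, shape)
```

Sorting shapes by sample count puts the statements that dominated the sampling window at the top, which is the reading the slide's report invites.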

  35. xdb task: tell someone there's a problem

      $ xdb task xdb.fb_learning_mysql --template size
      assigned_to=1369320034
      tags=[u'dba', u'xdb', u'oncall', u'xdb_enforcement', u'disk_space']
      title=XDB xdb.learning_mysql exceeding allowed disk space
      desc=An xdb that you own ( xdb.learning_mysql ) has exceeded disk space limits. Please cleanup some data immediately (see https://wiki.fb.com/out_of_space ).
      Instance sizes: https://ods.fb.com/455023229
      Table sizes: https://ods.fb.com/455023234
      Table sizes (information_schema):
      Schema          Table                        Size(GB)
      learning_mysql  channels                      607.990
      learning_mysql  workflow_runs                 157.080
      learning_mysql  operator_plans                133.090
      learning_mysql  retention                      60.550
      learning_mysql  operator_runs                  31.510
      learning_mysql  job_instance_status            28.230
      learning_mysql  job_instance_status_updates    24.980
      learning_mysql  operator_run_outputs           19.290

  36. DB Portal

  37. xdb.myservice_data DB Portal

  38. Looking Forward

  39. Some random thoughts from the roadmap • Automatic tasks / pages for shard owners (share our monitoring subscriptions) • Automatic detection (& killing) of bad queries • Stricter enforcement of quotas / capacity

  40. One System To Fit Them All: Shared MySQL Hosting At Facebook • Andrew Regner • Production Engineer | MySQL Infrastructure | aregner@fb.com
