One System To Fit Them All: Shared MySQL Hosting At Facebook Andrew Regner Production Engineer | MySQL Infrastructure
Data choices @Facebook • Everyone has data to persist • Also have: • ZippyDB, ODS, Scuba, HBase, Scribe, RocksDB, TAO • MySQL is the most mature
XDB "Anything" Database • Larger and/or older use cases of MySQL will have their own *db • XDB is supposed to be the answer for everyone else
Our History c. 2004 • Started with "CDB" allocating resources manually and with little logic behind it. • In the last few years, we've grown the MySQL Teams by a few engineers, but the company has grown > 10x.
Things that store data • Hundreds of Teams • Thousands of Shards
• video encoding (queue) • data warehouse (metadata) • job scheduling • server management • internal tools (tasks, wiki) • hack-a-thon toys • visitor sign-in system • backup systems (more than I can count) • qualitative analysis of search results • machine learning models
Terminology
[Diagram: a server hosts multiple instances; each instance hosts many shards]
[Diagram: a Master replicates to two Slaves]
[Diagram: a Replica Set; the shards tasks, burger, pong_1, feed, videos, tags, and wiki on the master are replicated to each slave]
Philosophy of FB Infrastructure Move Fast • Enable engineers to do what they need, when they need it. • They understand the importance and scope of their product the best.
Philosophy of FB MySQL Infra Build Stable Infrastructure • K.I.S.S. works at scale, too • No: foreign keys, views, events, triggers, procedures, replication lag • Yes: good indexes, sharding, planning
Story Time
Robert's New Feature xdb.profile_events
Robert's New Feature xdb.profile_events Robert lands a configuration change that causes all records in his database to be re-processed. Everything is still in test mode, so he is not worried about the impact on anything.
[Graph: Connected, Running, and Lock time for xdb.profile_events (master)]
Robert's New Feature xdb.profile_events His system limits concurrency to 30 at once. Didn't realize that the script opens a new connection for each of 100 queries per execution. :-(
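The arithmetic behind this surprise can be sketched in a few lines. The `CountingPool` class and both runner functions below are hypothetical stand-ins, not Robert's script or any Facebook client library; they only count how many connections each pattern opens for 30 executions of 100 queries.

```python
class CountingPool:
    """Hypothetical stand-in for a database client; it only counts connections."""

    def __init__(self):
        self.opened = 0

    def connect(self):
        self.opened += 1
        return object()  # placeholder for a real connection handle


def run_per_query(pool, n_queries):
    # Pattern from the story: a fresh connection for every query.
    for _ in range(n_queries):
        conn = pool.connect()
        # ... run one query on conn, then discard it ...


def run_shared(pool, n_queries):
    # Alternative: one connection reused for the whole execution.
    conn = pool.connect()
    for _ in range(n_queries):
        pass  # ... run each query on the same conn ...


def connections_opened(runner, executions=30, queries_per_execution=100):
    pool = CountingPool()
    for _ in range(executions):
        runner(pool, queries_per_execution)
    return pool.opened
```

Under these assumptions the per-query pattern opens 3000 connections where the shared pattern opens 30, so a concurrency limit on executions says little about connection load.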
Robert's New Feature xdb.profile_events • Query comments tell us where the job is • Use internal UIs to kill the job, then page Robert
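A minimal sketch of the query-comment idea, assuming the comment is simply prepended to the SQL text as `key=value` pairs. The `tag_query` helper and the `job` and `owner` keys are illustrative assumptions; the talk does not show the actual comment format.

```python
def tag_query(sql, **context):
    """Prepend a /* key=value ... */ comment that identifies the caller."""
    parts = []
    for key in sorted(context):
        # Strip comment delimiters so a value cannot close the comment early.
        value = str(context[key]).replace("*/", "").replace("/*", "")
        parts.append(f"{key}={value}")
    return f"/* {' '.join(parts)} */ {sql}"


tagged = tag_query("SELECT 1", job="reprocess_profile_events", owner="robert")
```

With a tag like this, whoever is looking at a process list can see which job issued a query and whom to page.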
Remembering All The Tickets xdb.ticket_processing
Remembering All The Tickets xdb.ticket_processing Every time something changes with a Help Center support ticket, some metadata is created. Have to hold onto parts of it for legal reasons.
[Graph: xdb.ticket_processing; annotation: alarm, 1 day ago]
Remembering All The Tickets xdb.ticket_processing • Guidance on maximum ideal shard size • Use cases vary too much to enforce
Remembering All The Tickets Forgetting some things • A tool that an intern (now a full-time employee) wrote deletes large amounts of data matching arbitrary SQL, in chunks, using range queries. • OSC (online schema change) on a larger host to reclaim disk space, then replace all the instances.
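The chunked, range-based delete can be sketched as follows. This is not the tool from the talk: it is a minimal illustration that uses Python's built-in `sqlite3` module so it runs anywhere, and it assumes an integer primary key named `id`. The function name and parameters are hypothetical.

```python
import sqlite3


def delete_in_chunks(conn, table, where_sql, chunk_size=1000):
    """Delete rows matching where_sql, one primary-key range at a time.

    table and where_sql are interpolated into the statement, so they must
    come from a trusted source.
    """
    lo, hi = conn.execute(f"SELECT MIN(id), MAX(id) FROM {table}").fetchone()
    if lo is None:  # empty table
        return 0
    deleted = 0
    start = lo
    while start <= hi:
        end = start + chunk_size  # half-open range [start, end)
        cur = conn.execute(
            f"DELETE FROM {table} WHERE id >= ? AND id < ? AND ({where_sql})",
            (start, end),
        )
        conn.commit()  # one short transaction per chunk
        deleted += cur.rowcount
        start = end
    return deleted
```

Each statement touches a bounded slice of the key space, so no single delete runs for long; a real tool would probably also pause between chunks when replicas fall behind.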
Remembering All The Tickets What we learned • Existing automation is very good at hiding this • Put shard sizes in front of the user ASAP • Proactively reach out to owners of the top 5% by shard size • Automatically notify owners when their growth looks "dangerous"
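One way to make "dangerous" growth concrete is to project when a shard will hit its size limit. The sketch below assumes one size sample per day and a simple linear trend; the function names, the 30-day horizon, and the sample numbers are illustrative assumptions, not Facebook's actual rule.

```python
def days_until_full(daily_sizes_gb, limit_gb):
    """Linear projection from the first to the last sample.

    Returns None when there are too few samples or the shard is not growing.
    """
    if len(daily_sizes_gb) < 2:
        return None
    growth_per_day = (daily_sizes_gb[-1] - daily_sizes_gb[0]) / (len(daily_sizes_gb) - 1)
    if growth_per_day <= 0:
        return None
    remaining = max(limit_gb - daily_sizes_gb[-1], 0)
    return remaining / growth_per_day


def looks_dangerous(daily_sizes_gb, limit_gb, horizon_days=30):
    """True when the projected time to reach the limit is within the horizon."""
    days = days_until_full(daily_sizes_gb, limit_gb)
    return days is not None and days <= horizon_days
```

For example, a shard that grew from 100 GB to 120 GB over two days, against a 200 GB limit, projects to be full in 8 days and would be flagged.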
Cleaning up some data xdb.analytics
Cleaning up some data xdb.analytics Engineer somewhere in the world wants to clean up some stale records • DELETE FROM table_one WHERE table_one.id IN (SELECT foreign_id FROM table_two WHERE some_random_thing = 'foobar')
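A common alternative to a single `DELETE ... IN (SELECT ...)` statement is to read the matching ids first and then delete them in small batches. The sketch below reuses the table and column names from the slide and runs against Python's built-in `sqlite3` module purely for illustration; the `delete_by_lookup` helper and the batch size are assumptions, and the two steps are not atomic with respect to concurrent writers.

```python
import sqlite3


def delete_by_lookup(conn, batch_size=500):
    """Two-step cleanup: collect ids first, then delete in small batches."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT foreign_id FROM table_two WHERE some_random_thing = ?",
            ("foobar",),
        )
    ]
    deleted = 0
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        cur = conn.execute(
            f"DELETE FROM table_one WHERE id IN ({placeholders})", batch
        )
        conn.commit()  # each batch is its own short transaction
        deleted += cur.rowcount
    return deleted
```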
xdb.analytics
Cleaning up some data xdb.analytics • The only response after it happens is to not do it again • Replace instances if we need to • If we catch it early enough, we can kill the query on the master
The nicest bad neighbor ever xdb.scheduler / xdb.looks_like
The nicest bad neighbor ever xdb.scheduler / xdb.looks_like • Shared pool of general purpose XDB shards • A bunch of little shards on a few replica sets for a queue workload • A large shard on the same replica set for archival
[Graph: Running Threads and History List Length for xdb.scheduler / xdb.looks_like]
Current Tools
dba replblame: finding the cause of lag
===============================================================
Instance: xdb0123.prn1:3307
Report UUID: 91a5d53e-2fcf-49e8-8546-26df7fda31d1
Time started: 2016-09-28 08:56:11
Length: 30s
Reason: dbstatus disabled instance for lag
===============================================================
Total sampled queries: 2753
myservice_data (2750):
  2019 LOAD DATA INFILE ? REPLACE INTO TABLE `all_the_data` FIELDS TERMINATED BY ? ENCLOSED BY '?\\?\n?
   420 LOAD DATA INFILE ? IGNORE INTO TABLE `some_more_data` FIELDS TERMINATED BY ? ENCLOSED BY '?\\?\n?
   127 UPDATE `sig_tw_jobs` t SET t.status = ? WHERE t.shard_id = ? AND t.handle = ? AND t.status = ?
    10 UPDATE `sig_model_snapshot` s, `sig_model` m SET s.removed = ?, m.active_snapshot_id = ? WHERE s.model_snapshot_id = ? AND m.model_...70 more bytes
     2 UPDATE `sigrid_model_snapshot` SET model_output = ? WHERE model_snapshot_id = ?
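The report above groups sampled queries by a normalized form in which literals are replaced with `?`. How replblame does this internally is not shown in the talk; the sketch below is only one simple way to produce similar counts, using hypothetical `normalize` and `top_patterns` helpers and deliberately rough regular expressions.

```python
import re
from collections import Counter

# Rough literal matchers; a real tool would use a proper SQL tokenizer.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def normalize(sql):
    """Replace string and numeric literals with ? and collapse whitespace."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split())


def top_patterns(samples, n=5):
    """Count sampled queries by normalized form, most frequent first."""
    return Counter(normalize(s) for s in samples).most_common(n)
```

Digits inside identifiers such as `pong_1` are left alone because the number pattern only matches whole tokens.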
xdb task: tell someone there's a problem
$ xdb task xdb.fb_learning_mysql --template size
assigned_to=1369320034
tags=[u'dba', u'xdb', u'oncall', u'xdb_enforcement', u'disk_space']
title=XDB xdb.learning_mysql exceeding allowed disk space
desc=An xdb that you own ( xdb.learning_mysql ) has exceeded disk space limits. Please cleanup some data immediately (see https://wiki.fb.com/out_of_space ).
Instance sizes: https://ods.fb.com/455023229
Table sizes: https://ods.fb.com/455023234
Table sizes (information_schema):
Schema          Table                        Size(GB)
learning_mysql  channels                     607.990
learning_mysql  workflow_runs                157.080
learning_mysql  operator_plans               133.090
learning_mysql  retention                     60.550
learning_mysql  operator_runs                 31.510
learning_mysql  job_instance_status           28.230
learning_mysql  job_instance_status_updates   24.980
learning_mysql  operator_run_outputs          19.290
DB Portal
xdb.myservice_data DB Portal
Looking Forward
Some random thoughts from the roadmap
Automatic tasks / pages for shard owners
  • Share our monitoring subscriptions
Automatic detection (& killing) of bad queries
Stricter enforcement of quotas / capacity
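As a rough illustration of what automatic detection of bad queries could look like, the sketch below filters rows shaped like MySQL `SHOW PROCESSLIST` output and emits `KILL QUERY` statements for anything running longer than a threshold. The `kill_candidates` function, the 60-second threshold, and the protected-user list are assumptions made for the example, not the roadmap's actual design.

```python
def kill_candidates(processlist, max_seconds=60, protected_users=("system user",)):
    """Return KILL QUERY statements for long-running client queries.

    processlist is a list of dicts keyed like SHOW PROCESSLIST columns
    (Id, User, Command, Time, ...). Nothing is executed here.
    """
    statements = []
    for row in processlist:
        if row["Command"] != "Query":
            continue  # sleeping or otherwise idle connections
        if row["User"] in protected_users:
            continue  # e.g. replication threads
        if row["Time"] > max_seconds:
            statements.append(f"KILL QUERY {int(row['Id'])}")
    return statements
```

Returning the statements instead of running them keeps the decision reviewable; a real system would need a richer definition of "bad" than elapsed time alone.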
One System To Fit Them All: Shared MySQL Hosting At Facebook Andrew Regner Production Engineer | MySQL Infrastructure | aregner@fb.com