errant gtids breaking replication
play

Errant GTIDs breaking replication Dieter Adriaenssens Ghent - PowerPoint PPT Presentation

Errant GTIDs breaking replication Dieter Adriaenssens Ghent University Who am I? Dieter Adriaenssens Linux System Administrator MySQL DBA Works at Ghent University Open Source : former phpMyAdmin team member Lives in


  1. Errant GTIDs breaking replication Dieter Adriaenssens Ghent University

  2. Who am I? Dieter Adriaenssens • Linux System Administrator • MySQL DBA • Works at Ghent University • Open Source : former phpMyAdmin team member • Lives in Ghent, Belgium • Climber • E-mail : dieter.adriaenssens@ugent.be • Twitter : @dcadriaenssens • Pictures : (c) Thomas Kindermans https://en.wikipedia.org/wiki/File:Gent_Graslei_zonsondergang.JPG − (c) Rouslan Placella − 2

  3. Errant GTIDs breaking replication How to detect and avoid them

  4. Context Replicating MySQL cluster ‘[..] the master has purged binary • Primary with some replicas • Master failover logs containing GTIDs that the • Replication breaks from new primary slave requires.’ (previous replica) to remaining replicas • Where did those transactions on the new primary come from? 4

  5. Overview • Replication and GTID • Errant GTIDs on a replica • Detection • Where do they come from? • How to avoid them • How to fix them • Demo 5

  6. Introduction Replication, GTID, data consistency

  7. Replication MySQL server replication : Replic • Primary (master) a • Several replicas (slaves) • Identical state across several servers Primar y Replic a 7

  8. Replication : Why • High availability • Master failover • Disaster recovery • Scaling load • Regional distribution • ... 8

  9. Replication : How MySQL master-slave replication : • All nodes start from an identical state • Replicas apply transactions from the primary node’s binary logs • Transaction in correct order • Asynchronous : replica initiates • Maintain identical state Read/write on primary ● Replicas should be readonly ● 9

  10. Replication : How Keeping track of binlog position : • Several binlogfiles : mysql_bin.xxxxxx • Rotation (size based, restart server) • Every transaction has a unique (incremental) position • Replication start : Binary log file name ● position in the binlog ● • Problematic multiple primaries ● Master failover ● Restore from backup ● 10

  11. GTID Global Transaction identifier • Unique identifier of a transaction Server_uuid:transaction_id • GTID = Origin ID + transaction ID 3E11FA47-71CA-11E1-9E33-C80AA9429562:23 • Origin ID : server UUID • Transaction ID : incremental number SELECT @@GLOBAL.server_uuid; 3E11FA47-71CA-11E1-9E33-C80AA9429562 11

  12. GTID set Executed GTID set SHOW SLAVE STATUS; SHOW MASTER STATUS; • Set of applied GTIDs on a server SELECT @@GLOBAL.gtid_executed; • Defines a state of data consistency • Easy to compare between nodes Executed_Gtid_Set: • Replica applies GTIDs of a primary 27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, • Simplifies failovers (no need to remember binlog position) 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-80192973, c004c0eb-c84e-11e6-8efc-aa00009002fd: 1-6084195:6140951-6141015 12

  13. ROW based replication • Changes in rows are replicated • <-> transaction-based : actual queries • In combination with GTID => data consistency across the cluster nodes 13

  14. Replicating cluster with GTID orchestrator -c topology –alias=demo node1:3306 [0s,ok,5.7.25,rw,ROW,>>,GTID] MySQL Server cluster + node2:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID] • Primary (R/W) + node3:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID] • Replicas (R/O) Orchestrator • GTID enabled • Row based replication • ProxySQL to redirect traffic to correct cluster node Node 2 • Orchestrator managing the cluster nodes Prox Nod (automatic master failover) y e1 SQL Node 3 14

  15. Errant GTID Definition, consequences, detection, examples, how to avoid, fixes

  16. Errant GTID • Unreplicated transaction existing only on a replica • Data is not the same on all nodes • Cluster is no longer in a consistent state 16

  17. Errant GTID : consequences • Everything is fine? • Inconsistent state between nodes Split brain ● Different data when reading from ● that replica • Unexpected behaviour when a replica is ‘[..] the master has purged binary promoted to master logs containing GTIDs that the Replication might fail ● slave requires.’ If GTID is purged from binlog, on ● master failover → replication stops 17

  18. Errant GTID detection

  19. Errant GTID detection • Compare executed GTID sets between primary node and replica nodes • Replica has more GTIDs than primary => errant GTID 19

  20. GTID executed set # primary SELECT @@GLOBAL.gtid_executed; 27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-75521 20 # replica SELECT @@GLOBAL.gtid_executed; 27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 20

  21. GTID subset SELECT GTID_SUBSET('<gtid_executed_replica>', '<gtid_executed_primary>'); SELECT GTID_SUBSET( '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113', '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset; +-----------+ | is_subset | +-----------+ | 1 | +-----------+ 1 row in set (0.00 sec) Replica GTID set is a subset of primary GTID set : OK 21

  22. GTID subset SELECT GTID_SUBSET('<gtid_executed_replica>', '<gtid_executed_primary>'); SELECT GTID_SUBSET( '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 ', '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset; +-----------+ | is_subset | +-----------+ | 0 | +-----------+ 1 row in set (0.00 sec) Replica GTID set is NOT a subset of primary GTID set => Errant GTID on replica 22

  23. Find errant GTIDs SELECT GTID_SUBTRACT('<gtid_executed_replica>', '<gtid_executed_primary>'); SELECT GTID_SUBTRACT( '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113, 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 ', '27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS errant_gtid; +------------------------------------------+ | errant_gtid | +------------------------------------------+ | 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 | +------------------------------------------+ 1 row in set (0.00 sec) Result is subset of errant GTIDs 23

  24. Errant GTID : automatic detection • Icinga monitoring check • Orchestrator • Other tools : Mysqlfailover ● Mysqlrpladmin ● 24

  25. Errant GTID : detection Monitoring check • Automate checking for errant GTID • Icinga compatible output format • Uses orchestrator for cluster info • https://github.com/UGent-DICT/check_mysql_gtid 25

  26. Errant GTID : monitoring check ./check_mysql_gtid <clustername> ./check_mysql_gtid demo MYSQL_CLUSTER_GTID OK - GTIDs on all nodes are replicated in the cluster! Cluster demo (primary : node1) : - node2 : OK - node3 : OK Everything is fine! 26

  27. Errant GTID : monitoring check ./check_mysql_gtid <clustername> ./check_mysql_gtid demo MYSQL_CLUSTER_GTID WARNING : replicas containing unreplicated GTIDs : node2 Cluster demo (primary : node1) : - node2 : GTIDs only exist on the replica : 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 - node3 : OK Check reports unreplicated (errant) GTIDs on one of the nodes 27

  28. Errant GTID : detection Orchestrator • Reports errant GTIDs (>= v3.0.13) ● Provides ways to fix them (>= v3.0.14) 28

  29. Errant GTID Find transaction • Look for GTID in binary logs • Each binlog mentions the executed GTID set (initial state) • Select relevant binlog • Find transaction in that binlog 29

  30. Errant GTID : find transaction mysqlbinlog --base64-output=DECODE-ROWS --verbose node2-bin.00* | grep 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1 -A6 SET @@SESSION.GTID_NEXT= '50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1'/*!*/; # at 23890384 #180305 6:25:02 server id 590100 end_log_pos 23890464 CRC32 0x2fe8266b Query thread_id=3848059 exec_time=0 error_code=0 SET TIMESTAMP=1520227502/*!*/; SET @@session.sql_mode=1073741824/*!*/; FLUSH ERROR LOGS /*!*/; 30

  31. Errant GTID : Examples Transactions on a replica • Manual transactions (by accident on a replica) • Scripted maintenance tasks (eg. config management) User creation ● Database creation ● • Master failover gone bad : Split brain ● Writes redirected to a replica (eg. host is r/w by ● accident, or after a restart) • Log flushes 31

  32. Intermezzo : flush-logs • FLUSH LOGS; → not added to binlog (1) • FLUSH ERROR LOGS; → added to binlog (2) • FLUSH [LOCAL | NO_WRITE_TO_BINLOG] ERROR LOGS; → not added to binlog • mysqladmin flush-logs executes FLUSH LOGS; • mysqladmin flush-logs error (3) executes FLUSH ERROR LOGS; (1) expected behaviour, according to documentation : “FLUSH LOGS, FLUSH BINARY LOGS, FLUSH ● TABLES WITH READ LOCK (with or without a table list), and FLUSH TABLES tbl_name ... FOR EXPORT are not written to the binary log in any case because they would cause problems if replicated to a slave.“ https://dev.mysql.com/doc/refman/8.0/en/flush.html • (2) Related bug report : https://bugs.mysql.com/bug.php?id=88720 • (3) introduced in MySQL 5.7.4 32

  33. Avoid errant GTIDs

  34. Avoid errant GTIDs Avoid transactions on replica nodes: • Don’t execute (write) statements on a replica 34

Recommend


More recommend