Understanding Percona XtraDB Cluster 5.7: Operation and Key Algorithms
Krunal Bauskar, PXC Product Lead (Percona Inc.)
Objective
● “I want to use Percona XtraDB Cluster, but is it suitable for my needs and can it handle my workloads?”
● “As an active user of Percona XtraDB Cluster, I wonder why my transactions keep failing, and whether this workload will run correctly with the software.”
Agenda
● PXC technology
● Understanding State Snapshot Transfer (SST)
● How replication works
● How certification works
● Different types of failures
● Understanding Incremental State Transfer (IST)
● Common causes of failures with PXC
● What's new in PXC 5.7
● Introducing pxc_strict_mode
● Monitoring PXC through Performance Schema (PFS)
● Securing a PXC cluster
● Geo-distributed cluster setup
● ProxySQL compatibility
PXC technology
● PXC is a multi-master solution that provides High Availability using MySQL and the Galera replicator technology, based on a synchronous replication scheme.
● A PXC node can operate in 2 modes: standalone (Percona Server compatible) or with the Galera replicator loaded (as a PXC node). Note that the PXC binary is different from the Percona Server binary.
  ○ Layer-1: Percona Server with write-set replication plugin.
  ○ Layer-2: Galera Replicator.
  ○ Layer-3: gcomm communication channel.
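To make the layering concrete, a minimal sketch (not from the original slides) of inspecting a running node; these are standard wsrep system/status variables:

    SHOW VARIABLES LIKE 'wsrep_provider';            -- Layer-2: path to the Galera library when loaded ('none' in standalone mode)
    SHOW STATUS LIKE 'wsrep_cluster_size';           -- Layer-3: number of nodes currently seen on the gcomm channel
    SHOW STATUS LIKE 'wsrep_local_state_comment';    -- node state: Synced, Donor/Desynced, Joiner, ...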
Understanding SST
● Concept:
  ○ The initial/first PXC node is started in cluster bootstrap mode (wsrep_cluster_address=gcomm:// or --wsrep_new_cluster). This node creates a new cluster.
  ○ Follow-up nodes connect to the existing cluster and sync with the cluster state before they start processing the workload.
  ○ This gives rise to a DONOR and JOINER relationship in PXC. The DONOR donates the data (also known as write-sets); the JOINER receives the data. This process is called SST, where-in almost the complete data-dir is copied over from the DONOR.
● PXC supports 3 different ways of doing SST (see the sketch below):
  ○ rsync
  ○ mysqldump
  ○ xtrabackup (RECOMMENDED and DEFAULT)
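For illustration (not part of the original slides), how a node's SST configuration can be checked; wsrep_sst_method is normally set in my.cnf and defaults to the xtrabackup-based script:

    SHOW VARIABLES LIKE 'wsrep_sst_method';        -- rsync, mysqldump, or xtrabackup-v2 (default)
    SHOW VARIABLES LIKE 'wsrep_cluster_address';   -- gcomm:// on the bootstrap node, gcomm://node1,node2,... on joining nodes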
Understanding SST
● How does PXC detect and handle this?
  ○ Each node of the cluster locally maintains a graph/picture of the cluster.
  ○ When a new node joins the cluster, the node is made part of the cluster and the graph is updated. This is important so that newly generated write-sets can be delivered to the new node too.
  ○ The new node then detects that its local state (-1) is behind the cluster state (say N).
  ○ The node then searches for a DONOR.
  ○ The JOINER then sends a request to the DONOR for SST (that is, a complete state transfer).
  ○ The DONOR node enters the DONOR/DESYNCED state and starts servicing the request. While the DONOR node is servicing write-sets to the JOINER, it continues to receive cluster write-sets too (the node states can be observed as in the sketch below).
  ○ Once the JOINER gets the needed snapshot, it can apply the pending write-sets from its own gcache to get in sync with the CLUSTER state.
  ○ The new node is now ready to service the workload.
● Why XtraBackup?
  ○ SST is a time-consuming operation, especially when you have TBs of data.
  ○ XtraBackup is optimized to use BACKUP LOCKS, where-in the DONOR node is not paused during data transmission.
  ○ It also has an option to secure PXC SST traffic.
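A small illustrative sketch (assumption: run on the DONOR or JOINER while state transfer or catch-up is in progress) of observing the states described above:

    SHOW STATUS LIKE 'wsrep_local_state_comment';  -- 'Donor/Desynced' on the DONOR, 'Joiner' on the JOINER, 'Synced' once caught up
    SHOW STATUS LIKE 'wsrep_local_recv_queue';     -- write-sets queued on the node while it catches up
    SHOW STATUS LIKE 'wsrep_last_committed';       -- the cluster state (seqno) this node has reached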
Understanding SST (DONOR selection)
● SST DONOR selection:
  ○ Search by name (if specified by the user via --wsrep_sst_donor, as in the sketch below).
    ■ If the node's state restricts it from acting as DONOR, things get delayed:
      Member 0.0 (n3) requested state transfer from 'n2', but it is impossible to select State Transfer donor: Resource temporarily unavailable
  ○ Search for a donor by state:
    ■ Scan all nodes.
    ■ Check if a node can act as DONOR (avoid DESYNCED nodes and the arbitrator).
      ● YES: Is the node part of the same segment as the JOINER? -> Yes -> Got DONOR.
      ● NO: Keep searching, but cache this remote-segment candidate too; in the worst case it will be used.
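An illustrative sketch (node names are hypothetical) of pinning a preferred donor by name; the setting normally lives in my.cnf or on the mysqld command line:

    -- my.cnf on the JOINER:  wsrep_sst_donor = n2
    SHOW VARIABLES LIKE 'wsrep_sst_donor';   -- which donor (if any) this node will request by name
    SHOW VARIABLES LIKE 'wsrep_node_name';   -- the name other nodes would use to refer to this node as a donor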
How replication works?
● Short answer:
  ○ Using binary log events (the same unit that MySQL uses for master-slave replication).
● Transaction execution steps (see the sketch below):
  ○ The user initiates a transaction on a given node.
  ○ The node processes the transaction and keeps track of each data object being modified.
  ○ On commit, a write-set is generated: binlog events + certification data (what was modified).
  ○ This write-set is then replicated on the group channel.
  ○ All nodes (including the originating node) listen to the group channel.
  ○ If node = originating node, it acknowledges receipt of the write-set and updates counters.
  ○ If node != originating node, it consumes the write-set, certifies it, applies it and then commits it.
  ○ After the originating node submits the packet on the channel, the originating node certifies the packet and commits the transaction.
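An illustrative sketch (not from the slides; table t1 is borrowed from the certification example later) of watching the replication counters move when a transaction commits:

    -- On the originating node (autocommit assumed):
    UPDATE t1 SET i = i + 10;                  -- one write-set is generated and replicated on commit
    -- On any node:
    SHOW STATUS LIKE 'wsrep_replicated';       -- write-sets this node has replicated out
    SHOW STATUS LIKE 'wsrep_received';         -- write-sets received from the group channel
    SHOW STATUS LIKE 'wsrep_last_committed';   -- global seqno of the last committed write-set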
How replication works?
[Diagram: Node-1 processes a transaction and generates a write-set, which is published on the group channel. Node-1, Node-2 and Node-3 each certify it. Node-1 acknowledges the write-set, updates its counters and commits; Node-2 and Node-3 apply the write-set and then commit.]
How replication works? (TOI replication)
● TOI stands for Total Order Isolation.
● All DDL and MyISAM replication executes in TOI fashion. Any statement that is not allowed to fail due to a conflict is executed in TOI (for example DDL, or MyISAM statements, since MyISAM is non-transactional). A sketch follows below.
● Let's understand the TOI flow:
  ○ The user initiates a TOI statement (say a DDL).
  ○ The query is wrapped as a packet to replicate and, along with a certification key (which in this case is the db and table name), is added to the channel.
  ○ The packet is serviced using a special TOI flow path that locks the complete operation by holding the Apply and Commit monitors for the period of TOI execution. This ensures no other transaction is allowed to proceed and existing transactions finish their work.
● If there is no parallel transaction, then why do we need a certification key?
  ○ Say N1 executed DROP TABLE <table>, which is replicated to N2, while N2 is trying to insert into the same table.
  ○ Without certification, N2 would replicate, certify and proceed to commit. In the meantime, the table removal path would discover that the table is in use, which then generates a conflict.
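An illustrative sketch (table name is hypothetical) of a TOI-executed DDL; TOI is the default online-schema-upgrade method:

    SHOW VARIABLES LIKE 'wsrep_OSU_method';        -- expect TOI (the default)
    ALTER TABLE t1 ADD COLUMN note VARCHAR(32);    -- replicated up-front and executed in total order on every node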
How replication works? (Parallel replication)
● PXC can apply write-sets in parallel, thereby improving overall throughput.
● --wsrep_slave_threads controls this option (see the sketch below).
● Let's quickly understand how parallel replication works:
  ○ 3-node cluster: N1, N2, N3; all are healthy and processing write-sets.
  ○ Say N1 and N2 generated non-conflicting write-sets, with N2's going first, followed by N1's.
  ○ Let's take the view from N3's perspective. N3 has wsrep_slave_threads = 2, so it can apply both write-sets in parallel. (Note: this is just like MySQL executing 2 workloads from 2 different clients.)
  ○ While the apply can proceed in parallel, commits are ordered. That is, N2's write-set has to commit first, even if N1's replication thread is allowed to proceed. (These coordinating units are called monitors (Apply, Commit). There is also a Local monitor meant for coordinating local node actions.)
  ○ Also note that, since the apply action only executes the write-set, it is quite quick and not resource-hogging unless the query demands it (parsing, filtering, etc. have all been taken care of on the host machine that initiated the original transaction, so in general write-sets are an optimized unit for application).
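An illustrative sketch (thread count is an assumption; wsrep_slave_threads is dynamic) of tuning and observing apply parallelism:

    SET GLOBAL wsrep_slave_threads = 8;            -- number of parallel applier threads on this node
    SHOW STATUS LIKE 'wsrep_cert_deps_distance';   -- average distance between non-conflicting seqnos: a rough upper bound on useful parallelism
    SHOW STATUS LIKE 'wsrep_apply_window';         -- how many write-sets were, on average, applied concurrently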
How certification works?
● Let's now understand the most crucial part of replication: how certification works.
● Basic principles:
  ○ The ORIGINATOR NODE also certifies its own transaction.
  ○ FIRST COMMITTER TO THE GROUP CHANNEL WINS.
● Example (3 nodes: N1, N2, N3; see the sketch below):
  ○ t1 (i int, primary key pk(i)) with rows (1, 2, 3)
  ○ t2 (i int, primary key pk(i)) with rows (11, 22, 33)
  ○ N1: update t1 set i = i + 10;
  ○ N2: update t1 set i = i + 100;
  ○ N3: update t2 set i = i + 10;
  ○ Each node replicates its write-set on the group channel; N2-wset is delivered before N1-wset.
  ○ N1-wset keys: {db.t1.r1, db.t1.r2, db.t1.r3}
  ○ N2-wset keys: {db.t1.r1, db.t1.r2, db.t1.r3}
  ○ N3-wset keys: {db.t2.r1, db.t2.r2, db.t2.r3}
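A minimal sketch (illustrative; the statements must run concurrently on the two nodes, and the exact error text is an assumption) of how the example above looks from the losing node:

    -- Setup mirroring the slide's example:
    CREATE TABLE t1 (i INT PRIMARY KEY);
    CREATE TABLE t2 (i INT PRIMARY KEY);
    -- On N2 (its write-set reaches the group channel first, so it wins):
    UPDATE t1 SET i = i + 100;
    -- On N1 (touches the same rows, certifies after N2's write-set): the commit is rejected
    -- with a deadlock-style error, e.g.
    --   ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
    UPDATE t1 SET i = i + 10;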
How certification works? (view from N1)
● N2-wset certifies first: there are no conflicting entries, so N1 records db.t1.r1 -> N2, db.t1.r2 -> N2, db.t1.r3 -> N2 in its certification vector (CCV). N2-writeset certified.
● N1-wset certifies next and is rejected: db.t1.r1, db.t1.r2 and db.t1.r3 are already mapped to N2 in the CCV (N2 != N1), so conflicts are reported on all three keys. N1-writeset rejected.
● N3-wset certifies last and passes: it touches only db.t2.*, so db.t2.r1 -> N3, db.t2.r2 -> N3, db.t2.r3 -> N3 are added to the CCV. N3-writeset certified.
How certification works? (view from N2)
● N2 runs the same deterministic certification against its own copy of the CCV and reaches the same outcome: N2-writeset certified (db.t1.r1/r2/r3 -> N2), N1-writeset rejected (all three keys already mapped to N2, and N2 != N1), N3-writeset certified (db.t2.r1/r2/r3 -> N3).