
Untangling and Restructuring CTDB

Martin Schwenke <martin@meltin.net>
Samba Team
IBM (Australia Development Laboratory, Linux Technology Center)

What are we talking about?


Twelve months of untangling: NAT gateway
  - Had daemon support: NAT gateway master capability
  - “ctdb natgwlist” calculated the NAT gateway master node
  - Capability unset on a node indicated “slave-only”
  - Observed that NAT gateway nodes file lines could be augmented with a “slave-only” keyword
  - No capability needed, so no daemon support!
  - New helper script: “ctdb_natgw master|list|status”
  - “ctdb natgw master|list|status” runs the helper
  - NAT gateway event script also calls out to the helper
  - NAT gateway support now reduced to 2 non-core scripts
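
Because the “slave-only” marker lives in the nodes file, choosing a NAT gateway master needs no daemon capability: a standalone script can read the file and combine it with node health. Below is a minimal Python sketch of that idea. It assumes one IP address per line, optionally followed by the slave-only keyword, allows '#' comments, and uses a simplified, hypothetical rule (the first healthy node not marked slave-only wins). The function names and the health dictionary are illustrative only, not ctdb_natgw's actual interface or selection rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NatgwNode:
    ip: str
    slave_only: bool


def parse_natgw_nodes(text: str) -> List[NatgwNode]:
    """Parse nodes file lines of the (assumed) form "<ip> [slave-only]"."""
    nodes = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # '#' comments: an assumption
        if not line:
            continue
        fields = line.split()
        if len(fields) == 1:
            nodes.append(NatgwNode(fields[0], False))
        elif len(fields) == 2 and fields[1] == "slave-only":
            nodes.append(NatgwNode(fields[0], True))
        else:
            raise ValueError(f"line {lineno}: unexpected entry {raw!r}")
    return nodes


def choose_master(nodes: List[NatgwNode],
                  healthy: Dict[str, bool]) -> Optional[str]:
    """Hypothetical rule: first listed node that is healthy and not slave-only."""
    for node in nodes:
        if not node.slave_only and healthy.get(node.ip, False):
            return node.ip
    return None  # no eligible master
```

For example, with nodes 10.0.0.1 (slave-only), 10.0.0.2 (unhealthy) and 10.0.0.3 (healthy), choose_master returns "10.0.0.3". Everything here is plain file parsing plus a status lookup, which is why this kind of logic can live in a script rather than in the daemon.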

Twelve months of untangling: LVS
  - Had daemon support: LVS capability, single public IP
  - “ctdb lvsmaster” calculated the LVS master node
  - Re-implemented using the same model as NAT gateway
  - New helper script: “ctdb_lvs master|list|status”
  - LVS support reduced to 2 non-core scripts
  - Simplified IP takeover code due to the absence of the single public IP

Twelve months of untangling: TCP connection killing
  - Was a combination of “ctdb killtcp” and daemon support
  - Daemon side validated the server-side IP address
  - Daemon also sent “tickle ACKs”, listened for responses and sent RSTs
  - No need to validate the server-side IP address
  - New helper ctdb_killtcp reads connections from stdin
  - Much faster than talking to the daemon
  - SOCK_PACKET drops packets...
  - Bidirectional killing: packets got mixed up!
  - Some internal filtering and tuning needed
  - Helper invoked directly from the event script
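
The bookkeeping side of a connection-killing helper can be sketched without any raw sockets. This minimal Python sketch assumes connections arrive on stdin as one "client server" pair per line, each written as IPv4 ADDR:PORT, and that a tickle ACK for (client, server) is sent as server to client, so the client's reply travels client to server. With bidirectional killing, our own tickle for one direction has the same addresses as a reply for the other direction; to filter these, the sketch ASSUMES tickles are sent with zero sequence and acknowledgement numbers. The function names and the input format are hypothetical, not ctdb_killtcp's actual code, and packet capture and sending are omitted.

```python
from typing import Iterable, Optional, Set, Tuple

Endpoint = Tuple[str, int]        # (IPv4 address, port)
Conn = Tuple[Endpoint, Endpoint]  # (client, server) as read from stdin


def parse_endpoint(text: str) -> Endpoint:
    """Parse "A.B.C.D:PORT" (IPv4 only in this sketch)."""
    host, sep, port = text.rpartition(":")
    if not sep or not host:
        raise ValueError(f"bad endpoint {text!r}")
    return host, int(port)


def read_connections(lines: Iterable[str]) -> Set[Conn]:
    """Read one "client server" pair per line, e.g. from sys.stdin."""
    conns: Set[Conn] = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        conns.add((parse_endpoint(fields[0]), parse_endpoint(fields[1])))
    return conns


def tickle_for(conn: Conn) -> Tuple[Endpoint, Endpoint]:
    """(source, destination) of the tickle ACK sent for a connection."""
    client, server = conn
    return server, client


def match_reply(conns: Set[Conn], src: Endpoint, dst: Endpoint,
                seq: int, ack: int) -> Optional[Conn]:
    """Return the tracked connection a captured packet answers, if any.

    Replies to the tickle for (client, server) travel client -> server,
    so they are looked up as (src, dst). Packets with seq == ack == 0 are
    ASSUMED to be our own tickles and are dropped, so that with both
    directions tracked they are not mistaken for replies.
    """
    if seq == 0 and ack == 0:
        return None
    conn = (src, dst)
    return conn if conn in conns else None
```

In a real helper, read_connections(sys.stdin) would build the table once, and each matched reply would supply the sequence numbers needed to send a valid RST.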

Twelve months of untangling: Recovery lock
  - fcntl(2) lock on cluster filesystem
  - Lock is taken on first recovery...
    ... and released on election loss
  - Combination of “cluster master lock” and “recovery lock”
  - Want to split this...
    ... and allow other forms of cluster mutex than an fcntl(2) lock
  - New helper ctdb_mutex_fcntl_helper
  - Or: CTDB_RECOVERY_LOCK="!/my/cluster/mutex/helper args ..."
  - Recovery lock not split yet
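
A minimal Python sketch of the two halves of a pluggable cluster mutex: turning a CTDB_RECOVERY_LOCK value into a helper command line (a leading "!" selects a custom helper, anything else is a lock file for the default fcntl(2) helper), and the core of an fcntl-style helper that tries a non-blocking lock and reports a one-character status. The status characters ("0" held, "1" contended, "3" error) are an ASSUMPTION modelled on CTDB's cluster mutex helper convention and should be checked against the CTDB documentation; DEFAULT_HELPER is a hypothetical install path. Unix only, since it uses the fcntl module.

```python
import errno
import fcntl
import os
import shlex
from typing import List, Optional, Tuple

# Hypothetical install path for the default fcntl(2) helper.
DEFAULT_HELPER = "/usr/libexec/ctdb/ctdb_mutex_fcntl_helper"


def reclock_helper_argv(setting: str) -> List[str]:
    """Turn a CTDB_RECOVERY_LOCK value into a helper command line.

    "!cmd args ..." runs a custom cluster mutex helper; any other value
    is treated as a lock file path for the default fcntl(2) helper.
    """
    if setting.startswith("!"):
        argv = shlex.split(setting[1:])
        if not argv:
            raise ValueError("empty helper command")
        return argv
    return [DEFAULT_HELPER, setting]


def try_fcntl_lock(path: str) -> Tuple[str, Optional[int]]:
    """Try a non-blocking fcntl(2) write lock on path.

    Returns (status, fd). Status characters are ASSUMED:
    "0" = lock held (keep fd open to keep holding it),
    "1" = contended, "3" = error.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    except OSError:
        return "3", None
    try:
        # Python's lockf() is a wrapper around fcntl() record locking.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return "1", None  # another process holds the lock
        return "3", None
    return "0", fd
```

A complete helper would write the status character to stdout and then keep running with the descriptor open until it is told to release the lock; that loop is omitted. Note that fcntl(2) locks do not conflict within a single process, so contention only shows up between separate helper processes, which is one reason to run the lock in its own helper.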

Twelve months of untangling: Monitoring in recovery daemon
  - Recovery daemon runs main_loop at 1 second intervals
  - Cluster leadership/elections, node states/flags, database recovery, IP failover & monitoring are all intertwined
  - Continuously revisit and improve...

The pattern?
  Helpers! Helpers! Helpers! Call-outs! Helpers!

The pattern? Helpers for incremental re-write
  - Helpers can be used for writing shiny new code...
    ... to replace self-contained parts of the code
  - Works well for infrequently executed code
  - Most of the code we want to move out is (relatively) infrequently executed...
  - A lot of it needs to be made more self-contained first!
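
The call-out pattern from the caller's side can be sketched in a few lines: run a separate helper program, feed it input on stdin, read its answer from stdout, and treat a non-zero exit, a timeout or a failure to start as an error. Because the caller depends only on the command line and the text protocol, the helper can be rewritten independently. This is a minimal Python sketch using subprocess.run; call_helper and HelperError are hypothetical names, not CTDB APIs, and the 10 second default timeout is an arbitrary choice.

```python
import subprocess
from typing import List, Optional


class HelperError(RuntimeError):
    """Raised when a helper fails, times out or cannot be started."""


def call_helper(argv: List[str], stdin_text: str = "",
                timeout: Optional[float] = 10.0) -> str:
    """Run an external helper and return its stdout.

    The caller knows only the command line and the text protocol, so the
    helper can be replaced or rewritten without touching the caller.
    """
    try:
        result = subprocess.run(argv, input=stdin_text, capture_output=True,
                                text=True, timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired) as exc:
        raise HelperError(f"{argv[0]}: {exc}") from exc
    if result.returncode != 0:
        raise HelperError(f"{argv[0]} exited {result.returncode}: "
                          f"{result.stderr.strip()}")
    return result.stdout
```

Starting a process per call is relatively expensive, which fits the observation above: the pattern works best for infrequently executed code.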

What’s next?
  - Split the recovery lock
  - Drop support for “ctdb setreclock ...”
  - What do you do when it fails?
  - Split recovery lock into separate cluster & recovery locks
  - Split out election code
  - Drop recovery lock?
