Exposing Design Flaws in Shared-Clock Systems using TLA+

Russell Mull, Auxon Corporation. TLA+ Conf, September 12, 2019.


  1. Exposing Design Flaws in Shared-Clock Systems using TLA+
     Russell Mull, Auxon Corporation
     TLA+ Conf, September 12, 2019

  2. About Me
     Russell Mull
     Software Engineer, Auxon Corporation

  7. How safe electronic systems are designed
     ● Decide what matters (safety requirements)
     ● Decide how much it matters (assign a Safety Integrity Level, SIL)
     ● Analyze the parts of the system that matter (Fault Tree Analysis)
     ● Not good enough? Add redundancy.

  12. Example: Industrial Press
      ● Safety requirement: Turn off press with emergency stop button
      ● SIL: 4
      ● Fault tree: the actuator is only SIL 3
      ● Redundancies: use two, design a SIL 4 failover mechanism

  13. Functional Safety
      ● IEC 61508
      ● Power plants, chemical plants, cars, trains, heavy machinery, etc.

  14. This works well, until…

  18. In software, shared clock failures are lumpy and unpredictable

  19. The story of a system made from lots of computers, sensors, actuators, and clocks

  23. A client project for ██████████
      ● Can’t say anything specific
      ● Relies fundamentally on a common timebase
      ● Appeared to be vulnerable to drift

  24. My Goal: Demonstrate the problem

  31. A naïve model

          VARIABLES node_clock, system

          Nodes == { "A", "B", "C" }

          Init ==
              /\ node_clock = [ n \in Nodes |-> 0 ]
              /\ system = ...

          Next ==
              \/ \E node \in DOMAIN node_clock:
                  /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
                  /\ UNCHANGED << system >>
              \/ /\ SystemStep(system)
                 /\ UNCHANGED << node_clock >>
              \/ SyncClocks(node_clock, system)

          SystemStep(s) == ...
          SyncClocks(cs,s) == ...

  34. This approach is not great.
      ● Massive state explosion
      ● Customer doesn’t care about the sync protocol

  35. Model the drift, not the sync

  40. Drift Modeling (1)

          CONSTANTS SIMULATED_CYCLES, BOUNDED_DRIFT
          VARIABLES global_clock, node_clock, system

          Nodes == { "A", "B", "C" }

          Init ==
              /\ global_clock = 0
              /\ node_clock = [ n \in Nodes |-> 0 ]

          Next ==
              \/ ClockStep /\ UNCHANGED system
              \/ SystemStep /\ UNCHANGED global_clock /\ UNCHANGED node_clock

          SystemStep == ...

  43. Drift Modeling (2)

          ClockStep ==
              \* Tick the global clock
              \/ /\ global_clock' = global_clock + 1
                 /\ UNCHANGED << node_clock >>
                 /\ ClockDriftInBounds(global_clock', node_clock)
              \* Tick a node clock
              \/ \E node \in DOMAIN node_clock:
                  /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
                  /\ UNCHANGED << global_clock >>
                  /\ ClockDriftInBounds(global_clock, node_clock')

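  44. Drift Modeling (3)

          ClockDriftInBounds(g, n) ==
              \* Stop the simulation once the global clock passes the cycle limit
              /\ g <= SIMULATED_CYCLES
              /\ \A node \in DOMAIN n :
                  /\ n[node] <= SIMULATED_CYCLES
                  \* Each node clock stays within BOUNDED_DRIFT ticks of the global clock
                  /\ Abs(n[node] - g) <= BOUNDED_DRIFT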
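The fragments on the Drift Modeling slides can be assembled into a single module for TLC. What follows is a minimal sketch rather than the model delivered to the client. It makes several assumptions: the elided system variable and SystemStep are left out entirely, Abs is defined locally because the slides do not show where it comes from, the module name DriftModel is invented, and PairwiseSkewBounded is a hypothetical invariant added only to give the checker something to verify.

```tla
---------------------------- MODULE DriftModel ----------------------------
EXTENDS Integers

CONSTANTS SIMULATED_CYCLES, BOUNDED_DRIFT

\* Assumption: the elided system variable is left out of this sketch.
VARIABLES global_clock, node_clock

Nodes == { "A", "B", "C" }

\* Assumption: Abs is defined locally; the slides do not show its definition.
Abs(x) == IF x < 0 THEN -x ELSE x

\* True when no clock has passed the cycle limit and every node clock
\* is within BOUNDED_DRIFT ticks of the global clock.
ClockDriftInBounds(g, n) ==
    /\ g <= SIMULATED_CYCLES
    /\ \A node \in DOMAIN n :
        /\ n[node] <= SIMULATED_CYCLES
        /\ Abs(n[node] - g) <= BOUNDED_DRIFT

Init ==
    /\ global_clock = 0
    /\ node_clock = [ n \in Nodes |-> 0 ]

ClockStep ==
    \* Tick the global clock
    \/ /\ global_clock' = global_clock + 1
       /\ UNCHANGED << node_clock >>
       /\ ClockDriftInBounds(global_clock', node_clock)
    \* Tick a node clock
    \/ \E node \in DOMAIN node_clock :
        /\ node_clock' = [node_clock EXCEPT ![node] = @ + 1]
        /\ UNCHANGED << global_clock >>
        /\ ClockDriftInBounds(global_clock, node_clock')

\* With no system variable in this sketch, every step is a clock step.
Next == ClockStep

\* Hypothetical invariant: two node clocks that are each within BOUNDED_DRIFT
\* of the global clock can differ from each other by at most twice that bound.
PairwiseSkewBounded ==
    \A a, b \in Nodes :
        Abs(node_clock[a] - node_clock[b]) <= 2 * BOUNDED_DRIFT
=============================================================================
```

Because every clock stops at SIMULATED_CYCLES, the model eventually has no enabled step, so TLC reports a deadlock unless deadlock checking is turned off (for example with the `-deadlock` command-line flag). Small constants such as SIMULATED_CYCLES = 4 and BOUNDED_DRIFT = 1 are a reasonable starting point.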

  47. This works better
      ● Narrower state space
      ● Directly addresses relevant failure domain
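A rough count suggests why bounding the drift narrows the state space. This is an illustration, not a figure from the talk. The symbols N, D, and k are introduced here and stand for SIMULATED_CYCLES, BOUNDED_DRIFT, and the number of nodes. It compares k node clocks that range independently over 0..N with k node clocks that must each stay within D ticks of a global clock:

```latex
\underbrace{(N+1)^{k}}_{\text{independent node clocks}}
\qquad\text{versus}\qquad
\underbrace{(N+1)\,(2D+1)^{k}}_{\text{upper bound with bounded drift}}
```

With N = 100, D = 1, and k = 3 that is 1,030,301 clock valuations against at most 2,727. The count covers clock values only and ignores the system variable. The naïve model as shown puts no upper limit on its clocks at all, so its clock space is unbounded before any sync-protocol state is added.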

  48. The system was more vulnerable to drift than previously thought

  49. Delivering a Model
      ● Literate PDF
      ● Makefile / .cfg file
      ● Config Instructions
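For readers who have not seen one, a TLC .cfg file for a model of this shape might look like the sketch below. The constant values are placeholders chosen for illustration, not the ones delivered to the client. The only names used are those that appear on the slides (SIMULATED_CYCLES, BOUNDED_DRIFT, Init, Next).

```cfg
\* Hypothetical values: small enough for a quick first run.
CONSTANTS
    SIMULATED_CYCLES = 4
    BOUNDED_DRIFT = 1

\* Initial-state predicate and next-state relation, as named on the slides.
INIT Init
NEXT Next
```

The file sits next to the .tla module, and the config instructions can then tell the customer which constants are safe to change.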

  50. TLA+ is tricky to use this way
      ● Difficult setup
      ● Easier development
      ● Easier delivery

  51. Give models to your customers

  52. Extending the technique
      ● Asymmetric Drift
      ● Action on Tick
      ● Cyclical Clock
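The slides do not show how these extensions were modeled. As one possible reading of "Asymmetric Drift", the single BOUNDED_DRIFT constant could be split into separate limits for running ahead of and behind the global clock. The module name AsymmetricDrift, the constants MAX_AHEAD and MAX_BEHIND, and the operator name below are all hypothetical.

```tla
------------------------- MODULE AsymmetricDrift -------------------------
EXTENDS Integers

\* MAX_AHEAD and MAX_BEHIND are hypothetical constants that replace
\* the single BOUNDED_DRIFT constant from the slides.
CONSTANTS SIMULATED_CYCLES, MAX_AHEAD, MAX_BEHIND

\* Same shape as ClockDriftInBounds, but a node clock may run ahead of
\* the global clock by a different amount than it may lag behind.
AsymmetricDriftInBounds(g, n) ==
    /\ g <= SIMULATED_CYCLES
    /\ \A node \in DOMAIN n :
        /\ n[node] <= SIMULATED_CYCLES
        /\ n[node] - g <= MAX_AHEAD
        /\ g - n[node] <= MAX_BEHIND
=============================================================================
```

Setting MAX_AHEAD and MAX_BEHIND to the same value gives the same constraint as the symmetric bound, so this version could stand in for ClockDriftInBounds in ClockStep without changing existing results.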

  57. Closing Thoughts
      ● Fake a real clock
      ● Bound the drift
      ● Give models to your customers
      ● I owe Hillel Wayne a great debt

  58. Russell Mull
      @mullr
      russell@auxon.io
