
Randomized Testing of Distributed Systems, Burcu Kulahcioglu Ozkan



  1. Randomized Testing of Distributed Systems. Burcu Kulahcioglu Ozkan, TU Kaiserslautern. Programming Distributed Systems, Summer Term 2019.

  2. Distributed systems are prone to bugs!
     ‣ Distribution
     ‣ Asynchrony
     ‣ Replication
     ‣ …
     They are difficult to test!
     ‣ Many components, many sources of nondeterminism

  3. Testing is a practical approach
     ‣ Systematic testing: infeasible
     ‣ Random testing: no guarantees

  4. Randomized Testing with Probabilistic Guarantees (joint work with Rupak Majumdar, Filip Niksic, Simin Oraee, Mitra Tabaei Befrouei, Georg Weissenbacher)
     ‣ We propose a randomized scheduling algorithm:
       - for arbitrary partially ordered sets of events, revealed online as the program is being executed
       - guaranteeing a lower bound on the probability of exposing a bug

  5. PCTCP on an example
     Upgrowing poset (figure): processes Handler, Logger, Terminator; events Request, Log, Terminate, Flush, Flushed.
     Buggy if: Flush executes before Log!
     Online chain partitioning: the program is decomposed into causally dependent chains of events:
     ‣ C1 = Request, Log
     ‣ C2 = Terminate, Flush, Flushed
     priority(C1) > priority(C2)

  6. PCTCP on an example
     Upgrowing poset (figure): processes Handler, Logger, Terminator; events Request, Log, Terminate, Flush, Flushed.
     Buggy if: Flush executes before Log!
     Online chain partitioning:
     ‣ C1 = Request, Log
     ‣ C2 = Terminate, Flush, Flushed
     priority(C2) > priority(C1)
     The bug is detected with probability:
     ‣ PCTCP: 1/2
     ‣ Random walk: 1/4
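
The example on slides 5 and 6 can be replayed with a small sketch. It assumes hypothetical causal edges (Request before Log, Terminate before Flush, Flush before Flushed), a simple greedy partitioning rule, and helper names invented for illustration; none of this is taken from the authors' implementation. Under these assumptions, one of the two possible chain priority orders runs Flush before Log.

```python
import itertools

# Hypothetical causal edges for the five events (an assumption, not from the slides).
PREDS = {
    "Request": set(),
    "Log": {"Request"},
    "Terminate": set(),
    "Flush": {"Terminate"},
    "Flushed": {"Flush"},
}

def partition_into_chains(arrival_order, preds):
    """Greedy online partitioning: append each new event to the first chain
    whose last event is a direct predecessor, otherwise open a new chain."""
    chains = []
    for event in arrival_order:
        for chain in chains:
            if chain[-1] in preds[event]:
                chain.append(event)
                break
        else:
            chains.append([event])
    return chains

def run_with_priorities(chains, preds):
    """Repeatedly execute the enabled head event of the highest-priority
    chain. `chains` is given in descending priority order."""
    pending = [list(chain) for chain in chains]
    done, schedule = set(), []
    while any(pending):
        for chain in pending:
            if chain and preds[chain[0]] <= done:
                event = chain.pop(0)
                done.add(event)
                schedule.append(event)
                break
    return schedule

chains = partition_into_chains(
    ["Request", "Terminate", "Log", "Flush", "Flushed"], PREDS)
print(chains)

outcomes = {}
for order in itertools.permutations(chains):
    schedule = run_with_priorities(order, PREDS)
    buggy = schedule.index("Flush") < schedule.index("Log")
    outcomes[tuple(schedule)] = buggy
    print(schedule, "-> buggy" if buggy else "-> ok")
```

The sketch only enumerates the two chain priority orders; it does not model the random walk baseline or priority change points.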

  7. Bug depth: size of the minimum tuple of events needed to expose the bug
     ‣ $d = 2$: $\langle e_1, e_2 \rangle$, e.g. order violation
     ‣ $d = 3$: $\langle e_1, e_2, e_3 \rangle$, e.g. atomicity violation
     ‣ $d = n$: $\langle e_1, \ldots, e_n \rangle$, more complicated bugs
     Figure: bug in Cassandra 2.0.0 (img. from Leesatapornwongsa et al., ASPLOS'16)

  8. Coverage: strong $d$-hitting families of schedules
     A schedule $\alpha$ strongly hits $\langle e_0, \ldots, e_{d-1} \rangle$ if for all $e \in P$: $e \geq_\alpha e_i$ implies $e \geq e_j$ for some $j \geq i$.
     Example (figure: poset over the events a, b, c, d, e, f, g):
     ‣ $\alpha_1 = a, b, c, d, f, e, g$ strongly hits the 1-tuple $\langle g \rangle$ and the 2-tuple $\langle e, g \rangle$
     ‣ $\alpha_2 = a, b, c, d, f, g, e$ strongly hits the 1-tuple $\langle e \rangle$, the 2-tuple $\langle g, e \rangle$, and the 3-tuple $\langle d, g, e \rangle$
     For each $d$-tuple, a strong $d$-hitting family has a schedule which strongly hits it.
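
The definition on slide 8 translates directly into a checker. The sketch below is illustrative only: the function name, the `leq` callback, and the three-event poset (a < b, with c incomparable to both) are assumptions chosen to keep the example small, and they are not the poset from the slide's figure.

```python
def strongly_hits(schedule, events, leq):
    """Return True if `schedule` strongly hits the tuple `events`.

    `leq(x, y)` is the poset order x <= y. Following the definition, every
    event scheduled at or after e_i must lie above some e_j with j >= i.
    """
    position = {e: k for k, e in enumerate(schedule)}
    for i, e_i in enumerate(events):
        at_or_after = [e for e in schedule if position[e] >= position[e_i]]
        for e in at_or_after:
            if not any(leq(e_j, e) for e_j in events[i:]):
                return False
    return True

# Hypothetical poset: a < b, and c is incomparable to both.
STRICT = {("a", "b")}

def leq(x, y):
    return x == y or (x, y) in STRICT

schedule = ["a", "c", "b"]
print(strongly_hits(schedule, ["b"], leq))       # True: nothing follows b
print(strongly_hits(schedule, ["c", "b"], leq))  # True: only b follows c
print(strongly_hits(schedule, ["a", "b"], leq))  # False: c follows a but is not above a or b
```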

  9. Challenge: how to sample uniformly at random from a strong $d$-hitting family for distributed systems?
     ‣ Events in a distributed message-passing system form an upgrowing poset, revealed during execution
     ‣ The poset and the schedule depend on each other:
       - build a schedule online
       - for an arbitrary ordering
     Use combinatorial results for posets!
     Example (figure: poset over the events a, b, c, d, e, f, g): Schedule: a d e b f c g

  10. Realizer and dimension of a poset
      ‣ A realizer of $P$ is a set of linear orders $F_R = \{L_1, L_2, \ldots, L_n\}$ such that $L_1 \cap L_2 \cap \cdots \cap L_n = P$
      ‣ The dimension of $P$ is the minimum size of a realizer
      Example (figure: poset over the events a, b, c, d, e, f, g):
      ‣ $L_1 = a\ d\ e\ b\ f\ c\ g$
      ‣ $L_2 = c\ a\ d\ e\ b\ g\ f$
      ‣ $L_3 = c\ b\ g\ f\ a\ d\ e$
      $\dim(P) = 3$
      A realizer of size $\dim(P)$ covers all pairwise orderings!
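
To make the intersection in the realizer definition concrete, the following sketch intersects the three linear orders of the example, read as a, d, e, b, f, c, g; then c, a, d, e, b, g, f; then c, b, g, f, a, d, e. The transcription of the orders and the helper names are assumptions; the code only computes which ordered pairs all three orders agree on.

```python
from itertools import combinations

# The three linear orders of the example, as transcribed (an assumption).
REALIZER = [
    "a d e b f c g".split(),
    "c a d e b g f".split(),
    "c b g f a d e".split(),
]

def order_pairs(linear_order):
    """All pairs (x, y) such that x comes before y in the linear order."""
    return set(combinations(linear_order, 2))

def intersect(linear_orders):
    """Strict order relation on which every linear order agrees."""
    return set.intersection(*(order_pairs(order) for order in linear_orders))

poset = intersect(REALIZER)
print(sorted(poset))
```

Every pair outside the printed relation is ordered one way by some linear order and the other way by another, which is the sense in which a realizer covers the pairwise orderings.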

  11. Adaptive chain covering ~ Strong 1-hitting family ~ Online dimension algorithm [Felsner'97, Kloch'07]
      ‣ Decompose P into chains (C1, C2, C3 in the figure)
      ‣ Compute linear extensions of P, extended online as the events a, b, c, d, e, f, g arrive; final state:
        - L1 = c b g f a d e
        - L2 = c a d e b g f
        - L3 = a d e b f c g
      This is a strong 1-hitting family!
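
The claim that the three final linear extensions form a strong 1-hitting family can be checked mechanically. In the sketch below, the strict order `LESS` is an assumption: it is the relation obtained by intersecting the three linear extensions as transcribed (c b g f a d e, c a d e b g f, a d e b f c g). The function names are illustrative.

```python
# Strict order of the example poset (assumed: the intersection of the
# three linear extensions listed on the slide).
LESS = {("a", "d"), ("a", "e"), ("d", "e"),
        ("b", "f"), ("b", "g"), ("c", "g")}

FAMILY = [
    "c b g f a d e".split(),  # L1
    "c a d e b g f".split(),  # L2
    "a d e b f c g".split(),  # L3
]

def strongly_hits_event(schedule, event):
    """A schedule strongly hits the 1-tuple <event> if every event scheduled
    after `event` is strictly above it in the poset."""
    tail = schedule[schedule.index(event) + 1:]
    return all((event, later) in LESS for later in tail)

def is_strong_1_hitting(family, events):
    """Every event must be strongly hit by at least one schedule."""
    return all(any(strongly_hits_event(s, e) for s in family) for e in events)

events = sorted(FAMILY[0])
print(is_strong_1_hitting(FAMILY, events))

# For each event, the first linear extension (1-based) that strongly hits it.
witness = {e: next(i for i, s in enumerate(FAMILY, start=1)
                   if strongly_hits_event(s, e))
           for e in events}
print(witness)
```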

  12. Strong $d$-hitting family ~ Adaptive chain covering
      ‣ [Felsner, Kloch] Strong 1-hitting family ~ adaptive chain covering: $\mathrm{hit}(w) = \mathrm{adapt}(w)$
      ‣ [Our main result] Strong $d$-hitting family ~ adaptive chain covering: $\mathrm{hit}_d(w, n) \leq \mathrm{adapt}(w) \binom{n}{d-1} (d-1)!$, where $n$ is the number of events and $d$ is the bug depth
      Index the schedules in the strong $d$-hitting family by $\langle \lambda, n_1, n_2, \ldots, n_{d-1} \rangle$: the chain id $\lambda$ and the steps $n_1, n_2, \ldots, n_{d-1}$ in which $e_1, e_2, \ldots, e_{d-1}$ were added. The schedule with this index strongly hits $e_0 \in \mathrm{Chain}(\lambda)$ and $e_1, e_2, \ldots, e_{d-1}$.
      Sample from this set of schedules!

  13. PCTCP: PCT + Chain Partitioning
      Randomly generates a schedule index $\langle \lambda, n_1, n_2, \ldots, n_{d-1} \rangle$:
      ‣ Randomly generate a $(d-1)$-tuple $\langle n_1, n_2, \ldots, n_{d-1} \rangle$
      ‣ Partition $P$ into chains online
      ‣ Assign random distinct initial priorities $> d$
      ‣ Reduce priority at $e_1, e_2, \ldots, e_{d-1}$: to $(d - i - 1)$ for $e_i$
      The resulting schedule strongly hits $e_0 \in \mathrm{Chain}(\lambda)$ and $e_1, e_2, \ldots, e_{d-1}$.
      Figure: chains $C_1, C_2, \ldots, C_{k-1}$ and $C_k = \lambda$ ordered by priority, with the events $e_0, e_1, e_2, e_3, \ldots$ marked.
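
The four steps on slide 13 can be sketched as a toy scheduler. This is a simplified illustration under stated assumptions, not the authors' implementation: the whole poset is passed in up front as `preds` instead of being discovered during execution, chains are built with a naive greedy rule instead of an adaptive chain covering algorithm, and the event names and edges in the usage part are hypothetical.

```python
import random

def pctcp_schedule(preds, arrival_order, d, seed=None):
    """Toy PCTCP-style scheduler.

    preds: event -> set of direct predecessors (assumed known up front).
    arrival_order: order in which events are revealed, each one listed
        after its predecessors.
    d: bug depth; requires d - 1 <= number of events.
    """
    rng = random.Random(seed)
    n = len(arrival_order)

    # Partition into chains with a naive greedy rule (an assumption).
    chains = []
    for event in arrival_order:
        for chain in chains:
            if chain[-1] in preds[event]:
                chain.append(event)
                break
        else:
            chains.append([event])

    # Random distinct initial priorities, all greater than d.
    initial = list(range(d + 1, d + 1 + len(chains)))
    rng.shuffle(initial)
    priority = dict(enumerate(initial))

    # Random (d-1)-tuple of steps n_1..n_{d-1}; the event revealed at
    # step n_i lowers its chain's priority to d - i - 1.
    steps = rng.sample(range(1, n + 1), d - 1)
    reduce_to = {arrival_order[step - 1]: d - i - 1
                 for i, step in enumerate(steps, start=1)}

    heads = [0] * len(chains)
    done, schedule = set(), []
    while len(schedule) < n:
        enabled = [c for c in range(len(chains))
                   if heads[c] < len(chains[c])
                   and preds[chains[c][heads[c]]] <= done]
        best = max(enabled, key=priority.get)
        event = chains[best][heads[best]]
        if event in reduce_to:
            # Lower the chain's priority first, then pick again.
            priority[best] = reduce_to.pop(event)
            continue
        heads[best] += 1
        done.add(event)
        schedule.append(event)
    return schedule

# Hypothetical poset for the Handler/Logger/Terminator example.
PREDS = {"Request": set(), "Log": {"Request"}, "Terminate": set(),
         "Flush": {"Terminate"}, "Flushed": {"Flush"}}
ARRIVAL = ["Request", "Terminate", "Log", "Flush", "Flushed"]

runs = 1000
buggy = 0
for seed in range(runs):
    s = pctcp_schedule(PREDS, ARRIVAL, d=2, seed=seed)
    buggy += s.index("Flush") < s.index("Log")
print(f"{buggy} of {runs} sampled schedules run Flush before Log")
```

Lowering the priority before the marked event executes mirrors the way PCT handles priority change points; whether the original tool applies the change at exactly this moment is an assumption of the sketch.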

  14. The prob. of hitting a bug: generalizes the PCT result
      $\mathrm{hit}_d(w, n) \leq \mathrm{adapt}(w) \binom{n}{d-1} (d-1)! \leq \mathrm{adapt}(w)\, n^{d-1}$, where $\mathrm{adapt}(w)$ is the online width of the poset of width $w$.
      ‣ In general, it is not possible to partition $P$ of width $w$ into $w$ chains online (figure: online chain assignments using C1, C2, C3)
      ‣ [Felsner, 95] The best possible online partitioning algorithm partitions an upgrowing $P$ of width $w$ into $\binom{w+1}{2}$ chains!
      We sample from at most $w^2 n^{d-1}$ schedules, hitting a bug of depth $d$ with a probability of at least $\frac{1}{w^2 n^{d-1}}$ ($n$: number of events, $d$: bug depth).
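
A short calculation shows how the bounds on slide 14 relate to each other. The values w = 3 and n = 7 are chosen only for illustration, and the function names are hypothetical; the formulas are the ones stated on the slide, with adapt(w) taken to be the number of chains of the best online partitioner.

```python
from fractions import Fraction
from math import comb, factorial

def adapt(w):
    """Chains used by the best online partitioner for width w: C(w+1, 2)."""
    return comb(w + 1, 2)

def hitting_family_bound(w, n, d):
    """Upper bound on hit_d(w, n): adapt(w) * C(n, d-1) * (d-1)!."""
    return adapt(w) * comb(n, d - 1) * factorial(d - 1)

def probability_lower_bound(w, n, d):
    """Lower bound 1 / (w^2 * n^(d-1)) on the probability of hitting a bug."""
    return Fraction(1, w**2 * n**(d - 1))

w, n = 3, 7  # illustrative values, not taken from the experiments
for d in (1, 2, 3):
    tight = hitting_family_bound(w, n, d)
    loose = adapt(w) * n**(d - 1)
    coarse = w**2 * n**(d - 1)
    assert tight <= loose <= coarse
    print(d, tight, loose, coarse, probability_lower_bound(w, n, d))
```

The chain of inequalities holds because $\binom{n}{d-1}(d-1)! \leq n^{d-1}$ and $\binom{w+1}{2} \leq w^2$ for $w \geq 1$.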

  15. Experimental results - Cassandra

      | Algorithm   | # Event Labels (d) | Max # Events (n) | Avg of Max # Chains | Max # Chains | # Runs | # Buggy | Time (s) |
      |-------------|--------------------|------------------|---------------------|--------------|--------|---------|----------|
      | Random Walk | -                  | 54               | 6.97                | 11           | 1000   | 0       | 481.95   |
      | PCTCP       | d = 4              | 54               | 5.65                | 11           | 1000   | 0       | 505.73   |
      | PCTCP       | d = 5              | 54               | 5.73                | 11           | 1000   | 1       | 503.81   |
      | PCTCP       | d = 6              | 54               | 5.80                | 11           | 1000   | 1       | 512.00   |

      Figure: bug in Cassandra 2.0.0 (img. from Leesatapornwongsa et al., ASPLOS'16)
      Source code at:
      ‣ https://gitlab.mpi-sws.org/fniksic/PSharp
      ‣ https://gitlab.mpi-sws.org/burcu/pctcp-cass
      ‣ https://gitlab.mpi-sws.org/rupak/hitmc

  16. Experimental results - ZooKeeper
      Figure: ZooKeeper event traces over Start(i), Msg(i, j), and Crash(i) events for nodes 1, 2, 3.
      Source code at:
      ‣ https://gitlab.mpi-sws.org/fniksic/PSharp
      ‣ https://gitlab.mpi-sws.org/burcu/pctcp-cass
      ‣ https://gitlab.mpi-sws.org/rupak/hitmc

  17. Related Work
      ‣ PCT for multithreaded programs, linear orders [Burckhardt, Kothari, Musuvathi, Nagarakatte, 2010]
        - Our method hits a bug with a prob. $\frac{1}{\mathrm{adapt}(w)\, n^{d-1}}$
        - Generalizes the PCT result $\frac{1}{t\, n^{d-1}}$
      ‣ d-Hitting families of schedules, trees [Chistikov, Majumdar, Niksic, 2016]
        - Our method samples from hitting families for any arbitrary upgrowing poset
