Randomized Testing of Distributed Systems
Burcu Kulahcioglu Ozkan, TU Kaiserslautern
Programming Distributed Systems, Summer Term 2019

Distributed systems are prone to bugs!
‣ Distribution
‣ Asynchrony
‣ Replication
‣ …
They are difficult to test!
‣ Many components, many sources of nondeterminism

Testing is a practical approach
‣ Systematic testing: infeasible
‣ Random testing: no guarantees

Randomized Testing with Probabilistic Guarantees
(joint work with Rupak Majumdar, Filip Niksic, Simin Oraee, Mitra Tabaei Befrouei, Georg Weissenbacher)
‣ We propose a randomized scheduling algorithm:
  - for arbitrary partially ordered sets of events, revealed online as the program is being executed
  - guaranteeing a lower bound on the probability of exposing a bug

PCTCP on an example
Upgrowing poset: [Figure: processes Handler, Logger, and Terminator with the events Request, Log, Terminate, Flush, Flushed]
Buggy if: Flush executes before Log!
Online chain partitioning: the program is decomposed into causally dependent chains of events:
‣ $C_1 = \text{Request}, \text{Log}$
‣ $C_2 = \text{Terminate}, \text{Flush}, \text{Flushed}$
Case shown: $\mathit{priority}(C_1) > \mathit{priority}(C_2)$

PCTCP on an example (continued)
Buggy if: Flush executes before Log!
Online chain partitioning:
‣ $C_1 = \text{Request}, \text{Log}$
‣ $C_2 = \text{Terminate}, \text{Flush}, \text{Flushed}$
Case shown: $\mathit{priority}(C_2) > \mathit{priority}(C_1)$
The bug is detected with probability:
‣ PCTCP: 1/2
‣ Random walk: 1/4
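
The 1/2 figure for PCTCP can be checked by enumerating the two possible priority orders of the chains. The sketch below is illustrative only: it assumes that events depend only on earlier events of their own chain, that the higher-priority chain therefore runs to completion first, and that no priority change points are used. The names `CHAINS`, `run_by_priority`, and `is_buggy` are hypothetical.

```python
from itertools import permutations

# Chains as named on the slide. Assumption: dependencies exist only
# between consecutive events of the same chain.
CHAINS = {
    "C1": ["Request", "Log"],
    "C2": ["Terminate", "Flush", "Flushed"],
}

def run_by_priority(order):
    """Run the chains to completion, highest-priority chain first."""
    schedule = []
    for chain in order:
        schedule.extend(CHAINS[chain])
    return schedule

def is_buggy(schedule):
    """The bug from the slide: Flush executes before Log."""
    return schedule.index("Flush") < schedule.index("Log")

# Every priority order is equally likely, so the bug probability is the
# fraction of orders whose schedule is buggy.
orders = list(permutations(CHAINS))
buggy = [o for o in orders if is_buggy(run_by_priority(o))]
bug_probability = len(buggy) / len(orders)
print(bug_probability)  # 0.5
```

Only the order in which C2 outranks C1 is buggy, which is one of two equally likely orders.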

Bug depth: the size of the smallest tuple of events that must occur in a given order to expose the bug
‣ $d = 2$: $\langle e_1, e_2 \rangle$, e.g. order violation
‣ $d = 3$: $\langle e_1, e_2, e_3 \rangle$, e.g. atomicity violation
‣ $d = n$: $\langle e_1, \dots, e_n \rangle$, more complicated bugs
[Figure: bug in Cassandra 2.0.0 (img. from Leesatapornwongsa et al., ASPLOS'16)]

Coverage: strong $d$-hitting families of schedules
A schedule $\alpha$ strongly hits $\langle e_0, \dots, e_{d-1} \rangle$ if for all $e \in P$:
$e \geq_\alpha e_i$ implies $e \geq e_j$ for some $j \geq i$
[Figure: example poset with two schedules $\alpha_1$ and $\alpha_2$ and the 1-, 2-, and 3-tuples they strongly hit]
For each $d$-tuple, a strong $d$-hitting family has a schedule that strongly hits it.
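
The definition of "strongly hits" can be turned into a direct check. The following is a minimal sketch, not the authors' implementation: the poset is given as a hypothetical list of `(lower, upper)` edges, the schedule is assumed to be a linear extension listing every event once, and the small example poset (a < b, with c unrelated) is made up for illustration.

```python
def below_or_equal(edges, events):
    """Map each event to the set of events that are <= it in the poset."""
    below = {e: {e} for e in events}
    changed = True
    while changed:  # propagate along edges until a fixpoint is reached
        changed = False
        for lower, upper in edges:
            new = below[lower] - below[upper]
            if new:
                below[upper] |= new
                changed = True
    return below

def strongly_hits(schedule, edges, tup):
    """Check: e >=_alpha e_i implies e >= e_j for some j >= i."""
    below = below_or_equal(edges, schedule)
    pos = {e: k for k, e in enumerate(schedule)}
    for i, e_i in enumerate(tup):
        for e in schedule:
            if pos[e] >= pos[e_i]:  # e is scheduled at or after e_i
                if not any(e_j in below[e] for e_j in tup[i:]):
                    return False
    return True

# Hypothetical poset: a < b, and c is incomparable to both.
edges = [("a", "b")]
print(strongly_hits(["a", "b", "c"], edges, ("c",)))  # True
print(strongly_hits(["c", "a", "b"], edges, ("c",)))  # False
```

Intuitively, a schedule strongly hits a 1-tuple when that event is delayed as long as possible, so that only its successors run after it.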

Challenge: how to sample uniformly at random from a strong $d$-hitting family for distributed systems?
‣ Events in a distributed message-passing system: an upgrowing poset, revealed during execution
‣ The poset and the schedule depend on each other
  - Build a schedule online
  - For an arbitrary ordering
Use combinatorial results for posets!
[Figure: example poset over the events a, b, c, d, e, f]
Schedule: $a\ d\ e\ b\ f\ c$

Realizer and dimension of a poset
A realizer of $P$ is a set of linear orders $F_R = \{L_1, L_2, \dots, L_n\}$ such that $L_1 \cap L_2 \cap \dots \cap L_n = P$.
The dimension of $P$ is the minimum size of a realizer.
[Figure: example poset over the events a, b, c, d, e, f]
‣ $L_1 = a\ d\ e\ b\ f\ c$
‣ $L_2 = c\ a\ d\ e\ b\ f$
‣ $L_3 = c\ b\ f\ a\ d\ e$
$\dim(P) = 3$
A realizer of size $\dim(P)$ covers all pairwise orderings!
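
The realizer condition (the intersection of the linear orders equals the poset) can be tested mechanically. This is a small sketch under stated assumptions: linear orders are Python lists, the poset is given as a transitively closed set of strict pairs, and the three-event poset used here is hypothetical rather than the one on the slide.

```python
def order_pairs(linear_order):
    """All pairs (x, y) with x strictly before y in the linear order."""
    return {
        (x, y)
        for i, x in enumerate(linear_order)
        for y in linear_order[i + 1:]
    }

def intersection_of(linear_orders):
    """Pairs that are ordered the same way in every linear order."""
    pairs = order_pairs(linear_orders[0])
    for order in linear_orders[1:]:
        pairs &= order_pairs(order)
    return pairs

def is_realizer(linear_orders, strict_pairs):
    """strict_pairs must be the transitively closed strict order of P."""
    return intersection_of(linear_orders) == set(strict_pairs)

# Hypothetical poset: a < b, and c is incomparable to both.
P = {("a", "b")}
print(is_realizer([["a", "b", "c"], ["c", "a", "b"]], P))  # True
print(is_realizer([["a", "b", "c"]], P))                   # False
```

A single linear order is not a realizer here because it also orders c against a and b; the second order reverses those pairs, so they drop out of the intersection.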

Adaptive chain covering ~ Strong 1-hitting family ~ Online dimension algorithm [Felsner'97, Kloch'07]
‣ Decompose $P$ into chains
‣ Compute linear extensions of $P$
[Figure: animation over the chains C1, C2, C3 in which the linear extensions $L_1$, $L_2$, $L_3$ grow as events are revealed]
This is a strong 1-hitting family!
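
Decomposing an upgrowing poset into chains online can be sketched with a greedy first-fit rule. This is an illustration only and not the algorithm of Felsner or Kloch: it makes no claim about the number of chains used. The function name `partition_online` and the input encoding (each event arrives with the full set of its predecessors) are assumptions, and the dependencies in the example are assumed to exist only within the chains of the earlier Request/Log example.

```python
def partition_online(events):
    """Greedy first-fit online chain partitioning.

    events: iterable of (name, predecessors) in the order they are
    revealed, where predecessors is the full (transitively closed) set
    of events below the new one. Each new event is appended to the first
    chain whose last element precedes it; otherwise it opens a new chain.
    """
    chains = []
    for name, preds in events:
        for chain in chains:
            if chain[-1] in preds:
                chain.append(name)
                break
        else:
            chains.append([name])
    return chains

# Assumed dependencies: Request < Log and Terminate < Flush < Flushed.
revealed = [
    ("Request", set()),
    ("Terminate", set()),
    ("Log", {"Request"}),
    ("Flush", {"Terminate"}),
    ("Flushed", {"Terminate", "Flush"}),
]
print(partition_online(revealed))
# [['Request', 'Log'], ['Terminate', 'Flush', 'Flushed']]
```

Because a chain's last element dominates every earlier element of that chain, checking only the last element is enough to keep each chain totally ordered.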

Strong $d$-hitting family ~ Adaptive chain covering
[Felsner, Kloch] Strong 1-hitting family ~ Adaptive chain covering:
$\mathit{hit}(w) = \mathit{adapt}(w)$
[Our main result] Strong $d$-hitting family ~ Adaptive chain covering:
$\mathit{hit}_d(w, n) \leq \mathit{adapt}(w) \binom{n}{d-1} (d-1)!$
‣ $n$: number of events
‣ $d$: bug depth
Index the schedules in the strong $d$-hitting family by $\langle \lambda, n_1, n_2, \dots, n_{d-1} \rangle$:
‣ $\lambda$: chain id
‣ $n_1, n_2, \dots, n_{d-1}$: steps in which $e_1, e_2, \dots, e_{d-1}$ were added
The schedule with index $\langle \lambda, n_1, n_2, \dots, n_{d-1} \rangle$ strongly hits $e_0 \in \mathit{Chain}(\lambda)$ and $e_1, e_2, \dots, e_{d-1}$.
Sample from this set of schedules!

PCTCP: PCT + Chain Partitioning
Randomly generates a schedule index $\langle \lambda, n_1, n_2, \dots, n_{d-1} \rangle$, which strongly hits $e_0 \in \mathit{Chain}(\lambda)$ and $e_1, e_2, \dots, e_{d-1}$:
‣ Randomly generate a $(d-1)$-tuple $n_1, n_2, \dots, n_{d-1}$
‣ Partition $P$ into chains online
‣ Assign random distinct initial priorities $> d$
‣ Reduce the priority at $e_1, e_2, \dots, e_{d-1}$: to $(d - i - 1)$ for $e_i$
[Figure: chains $C_1, C_2, \dots, C_{k-1}, C_k = \lambda$ ordered by priority, with the events $e_0, e_1, e_2, e_3, \dots$ marking the priority changes]
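
The four steps above can be sketched as a small simulation. This is a hedged illustration, not the authors' implementation: the function `pctcp_schedule`, the `preds` encoding (event mapped to its full predecessor set), the greedy first-fit chain partitioning, and the choice to lower a chain's priority at the moment the $n_i$-th event is revealed are all assumptions made for the sketch.

```python
import random

def pctcp_schedule(preds, n, d, seed=None):
    """Produce one schedule in the style of PCTCP (illustrative sketch).

    preds: dict event -> set of all its predecessors (transitively closed)
    n: upper bound on the number of events, d: bug depth (d - 1 <= n)
    """
    rng = random.Random(seed)
    # Randomly generate the (d-1)-tuple n_1, ..., n_{d-1}.
    change_points = rng.sample(range(1, n + 1), d - 1)
    lowered_to = {step: d - i - 1
                  for i, step in enumerate(change_points, start=1)}

    chains, priority, chain_of = [], [], {}
    executed, schedule = set(), []
    revealed = 0
    while len(schedule) < len(preds):
        # Reveal events whose predecessors have all executed and place
        # them into chains online (greedy first-fit, an assumption).
        for event, before in preds.items():
            if event in chain_of or not before <= executed:
                continue
            revealed += 1
            for idx, chain in enumerate(chains):
                if chain[-1] in before:
                    break
            else:
                chains.append([])
                # Random initial priority strictly greater than d.
                priority.append(d + 1 + rng.random())
                idx = len(chains) - 1
            chains[idx].append(event)
            chain_of[event] = idx
            if revealed in lowered_to:  # priority change point n_i
                priority[idx] = lowered_to[revealed]
        # Execute the enabled event on the highest-priority chain.
        enabled = [e for e in chain_of if e not in executed]
        nxt = max(enabled, key=lambda e: priority[chain_of[e]])
        executed.add(nxt)
        schedule.append(nxt)
    return schedule

# Assumed dependencies for the Request/Log example.
example = {
    "Request": set(), "Log": {"Request"},
    "Terminate": set(), "Flush": {"Terminate"},
    "Flushed": {"Terminate", "Flush"},
}
print(pctcp_schedule(example, n=5, d=2, seed=0))
```

Every returned schedule is a linear extension of the input poset; which one is produced depends on the random priorities and change points.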

The probability of hitting a bug: generalizes the PCT result
$\mathit{hit}_d(w, n) \leq \mathit{adapt}(w) \binom{n}{d-1} (d-1)! \leq \mathit{adapt}(w)\, n^{d-1}$
where $\mathit{adapt}(w)$ is the online width of a poset of width $w$.
‣ It is not possible to partition $P$ of width $w$ into $w$ chains online in general.
  [Figure: an upgrowing poset that forces an online partitioner from the chains C1, C2 to a third chain C3]
‣ The best possible online partitioning algorithm partitions an upgrowing $P$ of width $w$ into $\binom{w+1}{2}$ chains! [Felsner, 95]
We sample from at most $w^2 n^{d-1}$ schedules, hitting a bug of depth $d$ with a probability of at least $\frac{1}{w^2 n^{d-1}}$.
‣ $n$: number of events
‣ $d$: bug depth
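
The chain of bounds can be evaluated numerically. The sketch below takes $\mathit{adapt}(w) = \binom{w+1}{2}$ from the slide; the function names and the sample values of $w$, $n$, and $d$ are hypothetical and chosen only to show the arithmetic.

```python
from math import comb, factorial

def adapt(w):
    """Online width for upgrowing posets of width w: C(w + 1, 2)."""
    return comb(w + 1, 2)

def hit_upper_bound(w, n, d):
    """Upper bound on the size of a strong d-hitting family."""
    return adapt(w) * comb(n, d - 1) * factorial(d - 1)

def pctcp_probability_lower_bound(w, n, d):
    """Guaranteed probability of hitting a bug of depth d."""
    return 1 / (w**2 * n ** (d - 1))

# Hypothetical values: width 3, 10 events, bug depth 2.
w, n, d = 3, 10, 2
print(hit_upper_bound(w, n, d))                # 60
print(adapt(w) * n ** (d - 1))                 # 60
print(w**2 * n ** (d - 1))                     # 90
print(pctcp_probability_lower_bound(w, n, d))  # 1/90, about 0.0111
```

The steps rely on $\binom{n}{d-1}(d-1)! = n!/(n-d+1)! \leq n^{d-1}$ and on $\binom{w+1}{2} = w(w+1)/2 \leq w^2$ for $w \geq 1$.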

Experimental results - Cassandra
Algorithm     # Event Labels (d)   Max # Events (n)   Avg of Max # Chains   Max # Chains   # Runs   # Buggy   Time (s)
Random Walk   -                    54                 6.97                  11             1000     0         481.95
PCTCP         4                    54                 5.65                  11             1000     0         505.73
PCTCP         5                    54                 5.73                  11             1000     1         503.81
PCTCP         6                    54                 5.80                  11             1000     1         512.00
[Figure: bug in Cassandra 2.0.0 (img. from Leesatapornwongsa et al., ASPLOS'16)]
Source code at: https://gitlab.mpi-sws.org/fniksic/PSharp
Source code at: https://gitlab.mpi-sws.org/burcu/pctcp-cass
Source code at: https://gitlab.mpi-sws.org/rupak/hitmc

Experimental results - ZooKeeper
[Figure: ZooKeeper event traces built from Start(i), Msg(i,j), and Crash(i) events]

Related Work
‣ PCT for multithreaded programs, linear orders [Burckhardt, Kothari, Musuvathi, Nagarakatte, 2010]
  - Our method hits a bug with a probability of at least $\frac{1}{\mathit{adapt}(w)\, n^{d-1}}$
  - Generalizes the PCT result $\frac{1}{k\, n^{d-1}}$
‣ $d$-hitting families of schedules, trees [Chistikov, Majumdar, Niksic, 2016]
  - Our method samples from hitting families for arbitrary upgrowing posets
[Figure: example linear orders and an example tree]