ISHCS 2016 (International Symposium on High Confidence Software), PKU, Beijing, Dec. 18, 2016
Probabilistic Detection and Sampling of Concurrency Bugs
Yan Cai (蔡彦) ycai.mail@gmail.com
State Key Lab. of Computer Science, Institute of Software, Chinese Academy of Sciences
Radius-aware Probabilistic Deadlock detection ASE’16 Yan Cai and Zijiang Yang
Locks and Deadlocks
[Figure: two panels in which Thread 1 and Thread 2 read and write shared data; one panel is labeled Deadlock.]
Thread t1: acq(m); acq(n)
Thread t2: acq(n); acq(m)
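The opposite acquisition orders of t1 and t2 can be made explicit with a lock-order graph. The sketch below is only an illustration and is not taken from the talk: the trace format and the names `lock_order_edges` and `has_cycle` are assumptions. It also ignores which thread contributed each edge, so it only flags a potential deadlock.

```python
from collections import defaultdict

def lock_order_edges(trace):
    """Collect edges (held_lock, requested_lock) from a trace.

    trace: list of (thread, op, lock) with op in {"acq", "rel"} (assumed format).
    """
    held = defaultdict(list)   # thread -> locks currently held
    edges = set()
    for thread, op, lock in trace:
        if op == "acq":
            for h in held[thread]:
                edges.add((h, lock))
            held[thread].append(lock)
        else:
            held[thread].remove(lock)
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the lock-order graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
    visiting, done = set(), set()

    def dfs(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(dfs(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(dfs(node) for node in list(graph))

# Slide example: t1 takes m then n, t2 takes n then m.
slide_trace = [
    ("t1", "acq", "m"), ("t1", "acq", "n"), ("t1", "rel", "n"), ("t1", "rel", "m"),
    ("t2", "acq", "n"), ("t2", "acq", "m"), ("t2", "rel", "m"), ("t2", "rel", "n"),
]
# Hypothetical fix: both threads take m before n.
ordered_trace = [
    ("t1", "acq", "m"), ("t1", "acq", "n"), ("t1", "rel", "n"), ("t1", "rel", "m"),
    ("t2", "acq", "m"), ("t2", "acq", "n"), ("t2", "rel", "n"), ("t2", "rel", "m"),
]
```

A cycle m → n → m appears only for the slide's trace; with a consistent lock order the graph is acyclic.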
Deadlock Testing
• Random testing
– OS scheduling + random manipulation
– Stress testing
– Heuristic-directed random testing
– Systematic scheduling
No guarantee of finding a concurrency bug (e.g., a deadlock)
PCT – Probabilistic Concurrency Testing
• PCT algorithm
– Mathematical randomness with a probabilistic guarantee: $\frac{1}{n \times k^{d-1}}$ (n: #threads, k: #events, d: bug depth)
Example (k = 8, n = 2, d = 2): $\frac{1}{2 \times 8^{2-1}} = 1/16$
Thread t1: s01 acq(m); s02 acq(n); s03 rel(n); s04 rel(m)
Thread t2: s05 acq(n); s06 acq(m); s07 rel(m); s08 rel(n)
PCT – Probabilistic Concurrency Testing
• PCT:
– Intuition of the guaranteed probability:
1. Satisfy the 1st order by assigning the thread the largest priority ($1/n$).
2. Select d − 1 priority change points at the remaining d − 1 order positions ($1/k \times 1/k \times \dots \times 1/k = 1/k^{d-1}$) ⇒ $\frac{1}{n \times k^{d-1}}$
Example (k = 8, n = 2, d = 2): $\frac{1}{2 \times 8^{2-1}} = 1/16$
Thread t1: s01 acq(m); s02 acq(n); s03 rel(n); s04 rel(m)
Thread t2: s05 acq(n); s06 acq(m); s07 rel(m); s08 rel(n)
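The two steps above can be checked on the eight-event example by enumerating every random choice a priority-based scheduler could make. This is a simplified sketch of a PCT-style scheduler, not the authors' or the original PCT implementation; the names `T1`, `T2`, `run_pct` and `deadlock_probability` are hypothetical, and the rule that a thread's priority drops to d − i right after it executes the i-th change point is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import permutations, product

T1 = [("acq", "m"), ("acq", "n"), ("rel", "n"), ("rel", "m")]
T2 = [("acq", "n"), ("acq", "m"), ("rel", "m"), ("rel", "n")]

def run_pct(programs, priorities, change_points):
    """Run one priority-driven schedule; return True if it deadlocks."""
    prio = list(priorities)
    pc = [0] * len(programs)      # next event index per thread
    owner = {}                    # lock -> thread holding it
    steps = 0
    while True:
        unfinished = [t for t in range(len(programs)) if pc[t] < len(programs[t])]
        if not unfinished:
            return False
        # A thread is disabled if its next event acquires a held lock.
        enabled = [t for t in unfinished
                   if not (programs[t][pc[t]][0] == "acq"
                           and programs[t][pc[t]][1] in owner)]
        if not enabled:
            return True           # unfinished threads, none can run
        t = max(enabled, key=lambda i: prio[i])
        op, lock = programs[t][pc[t]]
        if op == "acq":
            owner[lock] = t
        else:
            del owner[lock]
        pc[t] += 1
        steps += 1
        # Assumed rule: after the i-th change point, lower priority to d - i.
        for i, point in enumerate(change_points, start=1):
            if steps == point:
                prio[t] = len(change_points) + 1 - i

def deadlock_probability(programs, k, d):
    """Exact deadlock probability over all priority orders and change points."""
    n = len(programs)
    hits = total = 0
    for prios in permutations(range(d, d + n)):
        for points in product(range(1, k + 1), repeat=d - 1):
            total += 1
            hits += run_pct(programs, prios, points)
    return Fraction(hits, total)

probability = deadlock_probability([T1, T2], k=8, d=2)   # Fraction(1, 8)
```

In this sketch the deadlock appears in 2 of the 16 equally likely configurations (either thread may go first, as long as the change point is the first event), so the observed 1/8 sits above the guaranteed lower bound of 1/16.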
PCT – Probabilistic Concurrency Testing
• Provides a guarantee (a probability): $\frac{1}{n \times k^{d-1}}$ (n: #threads, k: #events, d: bug depth)
[Figure: threads t1, t2, …, tn over an execution; (a) Uniform distribution]
But …
• It is a theoretical model that does not consider thread interaction: real executions do not follow the designed executions.
• The guaranteed probability decreases exponentially as the bug depth increases, due to the factor $1/k^{d-1}$.
RPro – Radius aware
• Our approach: RPro – Radius-aware Probabilistic testing
• Considers thread interaction
• The guaranteed probability decreases by a factor of $1/r$ (not $1/k$; $r \ll k$)
PCT vs. RPro: $\frac{1}{n \times k^{d-1}}$ vs. $\frac{1}{n \times k \times r^{d-2}}$
[Figure: threads t1, t2, …, tn over an execution; (a) Uniform distribution]
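To see how much the radius matters, the two guaranteed probabilities can be evaluated side by side. The helper names `pct_bound` and `rpro_bound` are hypothetical. The inputs n = 3, k = 5,050, d = 3 are the JDBC-2 row of Table 1, and plugging that row's best radius (3) in as r is an assumption made purely for illustration.

```python
from fractions import Fraction

def pct_bound(n, k, d):
    """PCT guaranteed probability: 1 / (n * k^(d-1))."""
    return Fraction(1, n * k ** (d - 1))

def rpro_bound(n, k, d, r):
    """RPro guaranteed probability: 1 / (n * k * r^(d-2)), for d >= 2."""
    if d < 2:
        raise ValueError("this sketch only covers bug depth d >= 2")
    return Fraction(1, n * k * r ** (d - 2))

# Slide example (n = 2, k = 8, d = 2): both bounds coincide at 1/16.
small_pct = pct_bound(2, 8, 2)
small_rpro = rpro_bound(2, 8, 2, r=3)

# JDBC-2 sizes from Table 1, with r = 3 as an illustrative assumption.
jdbc2_pct = pct_bound(3, 5050, 3)          # 1 / 76,507,500
jdbc2_rpro = rpro_bound(3, 5050, 3, r=3)   # 1 / 45,450
gain = jdbc2_rpro / jdbc2_pct              # k / r = 5050 / 3
```

For d = 2 the radius has no effect; from d = 3 on, the ratio between the two bounds is $(k/r)^{d-2}$, which is why r ≪ k is the interesting regime.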
RPro – Radius aware
• RPro: theoretical guarantee
[Figure: probability (y-axis) against bug radius r (x-axis, marked at 0, r_bug − 1, r_bug, and r = k). Curves: PCT guaranteed probability $\frac{1}{n \times k^{d-1}}$; RPro guaranteed probability $\frac{1}{n \times k \times r^{d-2}}$; RPro probability in practice.]
How to find r_bug?
Experiment
• Results
[Figure: ten panels plotting probability against radius r (x-axis 0 to 150, or 0 to 300 for MySQL-1 and MySQL-3), with legend entries PCT and RPro. Each panel labels RPro's best point (r, p) and a second value p, presumably the PCT baseline:
(a) JDBC-1: r = 17, p = 0.0439; p = 0.0020
(b) JDBC-2: r = 3, p = 0.0632; p = 0.0385
(c) JDBC-3: r = 11, p = 0.0229; p = 0.0005
(d) JDBC-4: r = 5, p = 0.1123; p = 0.0680
(e) Hawknl: r = 2, p = 0.453; p = 0.1755
(f) SQLite: r = 2, p = 0.6863; p = 0.4326
(g) MySQL-1: r = 47, p = 0.0022; p = 0.0004
(h) MySQL-2: r = 27, p = 0.0256; p = 0.0088
(i) MySQL-3: r = 114, p = 0.0039; p = 0.0000
(j) MySQL-4: r = 20, p = 0.0062; p = 0.0000]

Table 1. The best radius (r_best) of each benchmark.
Benchmark   # events   # threads   Bug depth   r_best*   r_best / # events   Probability
Hawknl            28           3           3         2                   -        0.4530
SQLite            16           3           3         2                   -        0.6863
JDBC-2         5,050           3           3         3              0.059%        0.0632
JDBC-4         5,090           3           3         5              0.098%        0.1123
JDBC-3         5,080           3           3        11              0.217%        0.0229
JDBC-1         5,088           3           3        17              0.334%        0.0439
MySQL-4      444,621          19           3        20              0.005%        0.0062
MySQL-2       15,066          17           3        27              0.179%        0.0256
MySQL-1       19,300          16           3        47              0.244%        0.0022
MySQL-3      406,117          22           6       114              0.028%        0.0039
(* All rows are sorted on the data in this column.)
Deployable Data Race Sampling FSE’16 Yan Cai, Jian Zhang, Lingwei Cao, and Jian Liu
Concurrency bugs
• Difficult to detect
– Non-determinism (state-space explosion)
– Inadequate test inputs
– …
• Even after software release, concurrency bugs may still occur
Concurrency bugs
• It is necessary to detect concurrency bugs in deployed products
• Challenges: the detector must not disturb normal executions
– Lightweight: <5% overhead
– …
Sample user executions
Existing works
• Data race: two threads concurrently access the same memory location, and at least one access is a write.
• Happens-before race (HB race): an access pair not ordered by the happens-before relation (HBR)
Example 1: threads t1 and t2 each execute x++; outside an empty sync(m){} block. Value of x: +1 or +2?
Example 2: threads t1 and t2 each execute sync(m){x++;}. Value of x: +2.
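The happens-before check behind these two examples can be sketched with vector clocks. This is a generic textbook-style illustration, not the implementation of Pacer or of any detector named in the talk; the trace format and the names `join` and `find_hb_races` are assumptions, and x++ is modelled as a single write.

```python
def join(a, b):
    """Component-wise maximum of two vector clocks (dicts: thread -> int)."""
    return {t: max(a.get(t, 0), b.get(t, 0)) for t in a.keys() | b.keys()}

def find_hb_races(trace):
    """Report access pairs that are not ordered by happens-before.

    trace: list of (thread, op, target), op in {"acq", "rel", "read", "write"}.
    """
    clocks = {}     # thread -> vector clock
    released = {}   # lock -> vector clock at its last release
    accesses = {}   # variable -> [(thread, is_write, thread's own clock value)]
    races = []
    for thread, op, target in trace:
        vc = clocks.setdefault(thread, {thread: 1})
        if op == "acq":
            # Acquire: learn everything known at the last release of the lock.
            clocks[thread] = join(vc, released.get(target, {}))
        elif op == "rel":
            released[target] = dict(vc)
            vc[thread] += 1
        else:
            is_write = op == "write"
            for other, other_write, other_time in accesses.get(target, []):
                ordered = other_time <= vc.get(other, 0)
                if other != thread and (is_write or other_write) and not ordered:
                    races.append((target, other, thread))
            accesses.setdefault(target, []).append((thread, is_write, vc[thread]))
    return races

# x++ without a common lock around it: an HB race.
unsynchronized = [("t1", "write", "x"), ("t2", "write", "x")]
# sync(m){x++;} in both threads: ordered by the release/acquire of m.
synchronized = [
    ("t1", "acq", "m"), ("t1", "write", "x"), ("t1", "rel", "m"),
    ("t2", "acq", "m"), ("t2", "write", "x"), ("t2", "rel", "m"),
]
```

Every acquire in this sketch joins whole vector clocks, which is the kind of O(n) work (n threads) that full happens-before tracking has to pay.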
Existing works
• Happens-before races
– Track the full happens-before relation
• Incurs many O(n) operations
Even at a 0% sampling rate: ~30% overhead (Pacer, PLDI’10); ~15% in our experiment
Insight 1: Do not track the full happens-before relation
Existing works
• Hardware-based (e.g., DataCollider, OSDI’10)
– Code breakpoints and data breakpoints (or watchpoints)
– Collision races
• A data race: two accesses
– Select a memory address => set a data breakpoint => wait for the breakpoint to fire
– The waiting time directly increases the sampling overhead
Insight 2: Do not directly delay executions
Existing works
• …
• See our paper for more insights
Our Proposal
• Clock Race
– For data race sampling purposes
• CRSampler
– To detect clock races
Clock Race
• Clock race
– Thread-local clock: an integer for each thread, increased on each synchronization operation.
– Two accesses (at least one of them a write) form a clock race if at least one thread-local clock is not changed between the two accesses.
[Figure: two timelines of Thread 1 and Thread 2 with accesses at time 1 and time 2 and sync operations as time elapses. In one, Thread 1's clock is not changed between time 1 and time 2; the other is labeled "No clock races".]
Clock Race
• A quick demonstration (maintaining thread-local clocks)
Operation                                  t1.clock   t2.clock
(initial)                                        10          8
t1: acquire(l); onSync();
t2: acquire(k); onSync();                        11          9
t1: x = 0; sample(x);  (sampled access)          11          9
…                                                11          9
t2: release(k); onSync();                        11         10
t2: x++;                                         11         10
t1: release(l); onSync();                        12
On the read in x++, t1.clock remains 11, so a clock race on x is reported.
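The demonstration can be replayed with a small checker. This is a minimal sketch of the clock-race definition as stated on the slides, not CRSampler itself: the class `ClockRaceSampler` and the method `on_access` are hypothetical, `on_sync` and `sample` mirror the slide's onSync() and sample(x), and snapshotting all clocks at sampling time is an assumption of this sketch.

```python
class ClockRaceSampler:
    """Toy checker: report a clock race when a thread-local clock is unchanged."""

    def __init__(self, clocks):
        self.clocks = dict(clocks)   # thread -> thread-local clock
        self.samples = {}            # variable -> (thread, is_write, clock snapshot)
        self.races = []

    def on_sync(self, thread):
        # Every synchronization operation advances the thread's own clock.
        self.clocks[thread] += 1

    def sample(self, thread, var, is_write):
        # Remember the sampled access together with all clocks at that moment.
        self.samples[var] = (thread, is_write, dict(self.clocks))

    def on_access(self, thread, var, is_write):
        if var not in self.samples:
            return
        first, first_write, snapshot = self.samples[var]
        if thread == first or not (is_write or first_write):
            return
        # Clock race: the clock of at least one of the two threads is unchanged.
        unchanged = [t for t in (first, thread) if self.clocks[t] == snapshot[t]]
        if unchanged:
            self.races.append((var, first, thread))

sampler = ClockRaceSampler({"t1": 10, "t2": 8})
sampler.on_sync("t1")                        # t1: acquire(l) -> t1.clock = 11
sampler.on_sync("t2")                        # t2: acquire(k) -> t2.clock = 9
sampler.sample("t1", "x", is_write=True)     # t1: x = 0; sample(x)
sampler.on_sync("t2")                        # t2: release(k) -> t2.clock = 10
sampler.on_access("t2", "x", is_write=True)  # t2: x++ while t1.clock is still 11
sampler.on_sync("t1")                        # t1: release(l) -> t1.clock = 12
```

No thread is paused at any point: the check only compares integers when the second access happens, which is the property the talk contrasts with breakpoint-based waiting.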
Clock Race
• Clock race
– Race checking does not need to delay any thread.
– But: after e1 appears, how much time should be allowed for checking the two accesses?
• If the time is too short, it is not enough to trap the second access.
• If the time is too long, all threads’ local clocks will have changed.
[Figure: Thread 1 accesses at time 1 and Thread 2 at time 2; time elapsed: one second, or …; Thread 1's clock is not changed between time 1 and time 2.]
Setup
• Implementation
– Jikes RVM
– Sampling: Java class load time
– Memory accesses
[Figure: architecture spanning user space (JikesRVM, user-site agent) and kernel space (Linux kernel, kernel-site core of DC/CR), communicating via Netlink; other labels: Execution, CPU, Set breakpoints, On firing.]
• Benchmarks
– Dacapo benchmark suite
Setup
• Comparisons
– Sampling rate: 0.1% to 1.0%
– Pacer (PLDI’10)
– DataCollider (OSDI’10): DC15, DC30 (15 ms, 30 ms)
– CRSampler: CR15, CR30
• ThinkPad workstation
– i7-4710MQ CPU, four cores, 16 GB memory, 250 GB SSD