Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures
Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader
Challenges of Design Verification
• Contemporary hardware designs require millions of lines of RTL code
  – More lines of code are written for verification than for the implementation itself
• Tradeoff between performance and design complexity
  – Speculative execution, shared caches, instruction reordering
  – Performance wins out
Performance vs. Design Complexity
• Programmer burden
  – Requires correct usage of synchronization
• Time to market
  – Earlier remediation of bugs is less costly
  – Re-spins after tapeout are expensive
• Significant time spent on verification
  – Verification techniques are often NP-complete
Memory Consistency Models
• Contract between SW and HW regarding the semantics of memory operations
• Classic example: Sequential Consistency (SC)
  – All processors observe the same ordering of operations serviced by memory
  – Too strict for modern optimizations/architectures
• Nomenclature
  – ST[A] → 1: "Wrote a value of 1 to location A"
  – LD[B] ← 2: "Read a value of 2 from location B"
ARM Idiosyncrasies
• Our focus: ARMv8
• Speculative execution is allowed
• Threads can reorder reads and writes
  – Assuming no dependency exists
• Writes are not guaranteed to be simultaneously visible to other cores
Problem Setup
• Given an instruction trace from a simulator, RTL, or silicon:
1. Construct an initial graph
  – Vertices represent load, store, and barrier instructions
  – Edges represent memory ordering, based on architectural rules
2. Iteratively infer additional edges to the graph
  – Based on existing relationships
3. Check for cycles (sketched below)
  – If one exists: contradiction!
• Example trace: CPU 0 issues ST[B] → 90, ST[B] → 92; CPU 1 issues LD[B] ← 92, LD[A] ← 2, LD[B] ← 92, LD[B] ← 93
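The cycle check in step 3 is a standard graph computation. Below is a minimal C++ sketch, assuming an adjacency-list representation of the instruction graph; the names and structure are illustrative, not the authors' implementation.

    // Depth-first search cycle detection over the directed instruction graph.
    #include <vector>

    enum class Color { White, Gray, Black };

    // Returns true if a cycle (an ordering contradiction) is reachable from v.
    static bool dfsHasCycle(int v, const std::vector<std::vector<int>>& adj,
                            std::vector<Color>& color) {
        color[v] = Color::Gray;                            // v is on the current DFS path
        for (int w : adj[v]) {
            if (color[w] == Color::Gray) return true;      // back edge -> cycle
            if (color[w] == Color::White && dfsHasCycle(w, adj, color)) return true;
        }
        color[v] = Color::Black;                           // fully explored
        return false;
    }

    bool graphHasCycle(const std::vector<std::vector<int>>& adj) {
        std::vector<Color> color(adj.size(), Color::White);
        for (int v = 0; v < static_cast<int>(adj.size()); ++v)
            if (color[v] == Color::White && dfsHasCycle(v, adj, color))
                return true;                               // contradiction found
        return false;
    }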
TSOtool
• Hangal et al., ISCA '04
  – Designed for SPARC, but portable to ARM
• Each store writes a unique value to memory
  – Easily map a load to the store that wrote its data (see the sketch below)
• Tradeoff between accuracy and runtime
  – Polynomial time, but false positives are possible
  – If a cycle is found, a bug indeed exists
  – If no cycles are found, the execution appears consistent
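As a rough illustration of the unique-store-value idea, the sketch below maps a load back to the store that produced its data by keying stores on their (unique) data values; the types and helper names are assumptions for illustration, not TSOtool's actual interfaces.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Inst { int id; bool isStore; uint64_t addr; uint64_t data; };

    // Because every store writes a value no other store writes, a load's
    // observed value identifies exactly one store.
    std::unordered_map<uint64_t, int> buildStoreIndex(const std::vector<Inst>& trace) {
        std::unordered_map<uint64_t, int> storeOfValue;    // data value -> store id
        for (const Inst& i : trace)
            if (i.isStore) storeOfValue[i.data] = i.id;
        return storeOfValue;
    }

    int storeReadBy(const Inst& load, const std::unordered_map<uint64_t, int>& idx) {
        auto it = idx.find(load.data);
        return it == idx.end() ? -1 : it->second;          // -1: load saw the initial memory value
    }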
Need for Scalability
• Must run many tests to maximize coverage
  – Stress different portions of the memory subsystem
• Longer tests put supporting logic in more interesting states
  – Many instructions are required to build history in an LRU cache, for instance
• Using a CPU cluster does not suffice
  – The results of one set of tests dictate the structure of the ensuing tests
  – Faster tests help with interactivity!
• Solution: Efficient algorithms and parallelism
Inferred Edge Insertions (Rule 6)
• S can reach X
• X does not load its data from S
• S comes before W, the node that stored X's data
• Example: S: ST[A] → 1, W: ST[A] → 2, X: LD[A] ← 2
Inferred Edge Insertions (Rule 7)
• S can reach X
• Loads read their data from S, not X
• The loads came before X
• Example: S: ST[A] → 1, L: LD[A] ← 1, M: LD[A] ← 1, X: ST[A] → 2
Initial Algorithm for Inferring Edges

    for_each(store vertex S) {
      for_each(reachable vertex X from S) {  // Getting this set is expensive!
        if (location[S] == location[X]) {
          if ((type[X] == LD) && (data[S] != data[X])) {
            // Add Rule 6 edge from S to W, the store that X read from
          } else if (type[X] == ST) {
            for_each(load vertex L that reads data from S) {
              // Add Rule 7 edge from L to X
            }
          }  // End if instruction type is store
        }  // End if location
      }  // End for each reachable vertex
    }  // End for each store
Virtual Processors (vprocs)
• Split instructions from physical to virtual processors
• Each vproc is sequentially consistent
  – Program order ↔ Memory order
• Example figure: CPU 0's instructions (ST[B] → 91, ST[A] → 1, ST[B] → 92, LD[A] ← 2) are split across VPROC 0, VPROC 1, and VPROC 2
Reverse Time Vector Clocks (RTVC)
• Example trace: CPU 0 issues ST[B] → 90, ST[B] → 92; CPU 1 issues LD[B] ← 92, LD[A] ← 2, LD[B] ← 92, LD[B] ← 93
• Consider the RTVC of ST[B] → 90, one entry per vproc (identified by color in the slide's figure):
  – Purple: ST[B] → 92; Blue: NULL; Green: LD[B] ← 92; Orange: LD[B] ← 92
• Track the earliest successor from each vertex to each vproc (see the sketch below)
  – Captures transitivity
• Complexity of inferring edges: O(n^2 p^2 d_max)
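A minimal sketch of how an RTVC might be laid out and queried, assuming each vertex records the program-order position of the earliest successor it reaches on every vproc (as described above); the representation and the reachability test are illustrative.

    #include <limits>
    #include <vector>

    constexpr int kNone = std::numeric_limits<int>::max();   // "NULL" entry

    struct RTVC {
        // entry[v][vp] = position of the earliest vertex on vproc vp reachable
        // from v; kNone if v reaches nothing on that vproc.
        std::vector<std::vector<int>> entry;
    };

    // S can reach X (which sits on vproc xVproc at position xPos) iff the earliest
    // vertex S reaches on that vproc is at or before X; transitivity is folded in.
    bool canReach(const RTVC& r, int s, int xVproc, int xPos) {
        return r.entry[s][xVproc] != kNone && r.entry[s][xVproc] <= xPos;
    }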
Updating RTVCs
• Computing RTVCs once is fast (sketched below)
  – Process vertices in the reverse order of a topological sort
  – Check neighbors directly, then their RTVCs
• Every time a new edge is inserted, RTVC values need to change
  – # of edge insertions ≈ m
• TSOtool implements both vprocs and RTVCs
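A sketch of the one-shot RTVC computation described above: process vertices in the reverse order of a topological sort and, for each vertex, check its neighbors directly and then merge their RTVCs. The per-vertex vproc and program-order position arrays are assumed inputs; this is not the authors' code.

    #include <algorithm>
    #include <limits>
    #include <vector>

    constexpr int kNone = std::numeric_limits<int>::max();

    std::vector<std::vector<int>> computeRTVC(const std::vector<std::vector<int>>& adj,
                                              const std::vector<int>& topoOrder,
                                              const std::vector<int>& vprocOf,
                                              const std::vector<int>& posOf,
                                              int numVprocs) {
        const int n = static_cast<int>(adj.size());
        std::vector<std::vector<int>> rtvc(n, std::vector<int>(numVprocs, kNone));
        // Reverse topological order: every successor is finalized before its predecessors.
        for (auto it = topoOrder.rbegin(); it != topoOrder.rend(); ++it) {
            int v = *it;
            for (int w : adj[v]) {
                // Check the neighbor directly...
                rtvc[v][vprocOf[w]] = std::min(rtvc[v][vprocOf[w]], posOf[w]);
                // ...then fold in everything the neighbor already reaches.
                for (int vp = 0; vp < numVprocs; ++vp)
                    rtvc[v][vp] = std::min(rtvc[v][vp], rtvc[w][vp]);
            }
        }
        return rtvc;
    }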
Facilitating Parallelism
• Repeatedly updating RTVCs is expensive
  – For k edge insertions, RTVC updates take O(kpn) time
    • k = O(n^2), but is usually a small multiple of n
• Idea: Update RTVCs once per iteration rather than once per edge insertion (sketched below)
  – For i iterations, RTVC updates take O(ipn) time
    • i ≪ k (less than 10 for all test cases)
  – Less communication between threads
• Complexity of inferring edges: O(n^2 p)
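The overall schedule might look like the skeleton below: infer edges for every store against a fixed (possibly stale) RTVC snapshot, then rebuild the RTVCs once per iteration. The graph, RTVC, and rule-checking code are elided and only hinted at in comments; this is a structural sketch, not the actual implementation.

    #include <vector>

    struct Edge { int src, dst; };

    void inferAllEdges(int numStores) {
        bool changed = true;
        while (changed) {                              // i iterations in total, i << k
            // 1. Rebuild the RTVCs once for this iteration (reverse topological pass).
            // 2. Infer edges for every store using the snapshot.
            std::vector<Edge> newEdges;
            for (int s = 0; s < numStores; ++s) {
                // applyRules6And7(s, rtvcSnapshot, newEdges);   // placeholder
            }
            // 3. Insert the whole batch; staleness is safe because edges only accumulate.
            changed = !newEdges.empty();
        }
    }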
Correctness
• The inferred edges found by our approach will not be the same as the edges found by TSOtool
  – Might not infer an edge that TSOtool does
    • The RTVC in TSOtool can change mid-iteration
  – Might infer an edge that TSOtool does not
    • Our approach will have "stale" RTVC values
• Both approaches make forward progress
  – The number of edges monotonically increases
• Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1]
• If TSOtool finds a cycle, we will also find a cycle [Thm 2]
Parallel Implementations
• OpenMP (sketched below)
  – Each thread keeps its own partition of added edges
  – After each iteration of inferring edges, reduce
• CUDA
  – Assign threads to each store instruction
  – Threads independently traverse the vprocs of this store
  – Atomically add edges to a preallocated array in global memory
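A minimal OpenMP sketch of the per-thread edge partitioning and end-of-iteration reduction described above; the per-store inference call is a placeholder for the Rule 6/7 logic.

    #include <omp.h>
    #include <vector>

    struct Edge { int src, dst; };

    std::vector<Edge> inferIterationOMP(int numStores /* plus graph and RTVC state */) {
        std::vector<std::vector<Edge>> perThread(omp_get_max_threads());
        #pragma omp parallel
        {
            std::vector<Edge>& mine = perThread[omp_get_thread_num()];
            (void)mine;   // consumed by the placeholder call below
            #pragma omp for schedule(dynamic)
            for (int s = 0; s < numStores; ++s) {
                // inferEdgesFrom(s, ..., mine);   // each thread appends to its own buffer
            }
        }
        std::vector<Edge> merged;                  // reduction after the iteration
        for (const auto& buf : perThread)
            merged.insert(merged.end(), buf.begin(), buf.end());
        return merged;
    }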
Experimental Setup
• Intel Core i7-2600K CPU
  – Quad core, 3.4 GHz, 8 MB LLC, 16 GB DRAM
• NVIDIA GeForce GTX Titan
  – 14 SMs, 837 MHz base clock, 6 GB DRAM
• ARM system under test
  – Cortex-A57, quad core
• Instruction graphs range from n = 2^18 to n = 2^22 vertices, with n ≈ m
  – Sparse, high-diameter, low-degree
  – Tests vary by their distribution of LD/ST/DMB instructions, # of vprocs, and instruction dependencies
Importance of Scaling
• 512K instructions per core
• 2M total instructions
Speedup over TSOtool (Application)

  Graph Size       # of tests   Lazy RTVC   OMP 2    OMP 4    GPU
  64K*4  = 256K    27           5.64x       7.62x    9.43x    10.79x
  128K*4 = 512K    27           5.31x       7.12x    8.90x    10.76x
  256K*4 = 1M      23           6.30x       9.05x    12.13x   15.47x
  512K*4 = 2M      10           3.68x       6.41x    10.81x   24.55x
  1M*4   = 4M      2            3.05x       5.58x    9.97x    37.64x

• The GPU is always best and scales much better to larger tests
• Extreme case: 9 hours using TSOtool → under 10 minutes using our GPU approach
• Avg. parallel speedups over our improved sequential approach:
  – 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)
Summary
• Relaxing the updates to RTVCs led to a better sequential approach and facilitated parallel implementations
  – Trade-off between redundant work and parallelism
• Faster execution leads to interactive bug-finding
• The GPU scales well to larger problem instances
  – Helpful for corner-case bugs that slip through pre-silicon verification
• For the twelve largest test cases, our GPU implementation achieves a 26.36x average application speedup
Acknowledgments
• Shankar Govindaraju and Tom Hart for their help in understanding NVIDIA's implementation of TSOtool for ARM
Questions
"To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science." – Albert Einstein
Backup
Sequential Consistency Examples
• Valid: ST[x] → 1 is handled by memory before ST[x] → 2
    P1: ST[x] → 1
    P2: LD[x] ← 1  LD[x] ← 2
    P3: LD[x] ← 1  LD[x] ← 2
    P4: ST[x] → 2
    (time slots t = 0, 1, 2)
• Invalid: writes propagate to P2 and P3 in a different order
    P1: ST[x] → 1
    P2: LD[x] ← 1  LD[x] ← 2
    P3: LD[x] ← 2  LD[x] ← 1
    P4: ST[x] → 2
    (time slots t = 0, 1, 2)
  – Valid for weaker memory models