
Compiler Construction of Idempotent Regions and Applications in Architecture Design. Marc de Kruijf. Advisor: Karthikeyan Sankaralingam. PhD Defense, 07/20/2012. Example source code: int sum(int *array, int len) { int x = 0;


  1. ISA Sensitivity — you might be curious: does the ISA matter? (1) two-address (e.g. x86) vs. three-address (e.g. ARM); (2) register-memory (e.g. x86) vs. register-register (e.g. ARM); (3) number of available registers. The short version: the impact of (1) and (2) is not significant (+/- 2% overall), and even less significant as regions grow larger; the impact of (3): to get the same performance with idempotence, increase registers by 0% (large regions) to ~60% (small regions).

  2. Compiler Design & Evaluation — a summary. Design and implementation: static analysis algorithms, modular and performing well; code-gen algorithms, modular and performing well; LLVM implementation, source code available.* Findings: pressure-related performance overheads range from 0% (large regions) to ~15% (small regions); the greatest opportunity is loop-intensive applications; ISA effects are insignificant. * http://research.cs.wisc.edu/vertical/iCompiler

  3. Overview — ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

  4. Architecture Recovery: It’s Real — safety first; speed first (safety second).

  5. Architecture Recovery: It’s Real — lots of sharp turns. [Pipeline diagram: 1 Fetch → 2 Decode → 3 Execute → 4 Write-back; a second version, “closer to the truth,” with the stages interleaved.]

  6. Architecture Recovery: It’s Real — lots of interaction. [Pipeline diagram: a problem (“!!!”) arises in the 1 Fetch → 2 Decode → 3 Execute → 4 Write-back flow and is caught too late.]

  7. Architecture Recovery: It’s Real — bad stuff can happen. Mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc. Hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. Exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

  8. Architecture Recovery: It’s Real — bad stuff can happen. [Cost diagram: detection latency, register pressure, re-execution time.] Mis-speculation: (a) branch mis-prediction, (b) memory re-ordering, (c) transaction violation, etc. Hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc. Exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc.

  9. Architecture Recovery: It’s Real — bad stuff can happen (repeats the taxonomy of slide 7: mis-speculation, hardware faults, exceptions).

  10. Architecture Recovery: It’s Real — bad stuff can happen. Integrated GPUs and low-power CPUs face exceptions: (g) page fault, (h) divide-by-zero, (i) mis-aligned access, etc. High-reliability systems face hardware faults: (d) wear-out fault, (e) particle strike, (f) voltage spike, etc.

  11. GPU Exception Support

  12. GPU Exception Support — why would we want it? GPU/CPU integration: a unified address space requires support for demand paging, with numerous secondary benefits as well…

  13. GPU Exception Support — why is it hard? The CPU solution: pipeline buffers and registers.

  14. GPU Exception Support — why is it hard? CPU: 10s of registers/core. GPU: 10s of registers/thread × 32 threads/warp × 48 warps per “core” = 10,000s of registers/core.
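The arithmetic behind the register-count claim can be made concrete. A minimal sketch assuming 32 registers per thread — the slide says only “10s,” and the real count varies per kernel:

```python
# Hypothetical per-thread register count; the slide says only "10s".
regs_per_thread = 32
threads_per_warp = 32
warps_per_core = 48

# Register state a GPU "core" (SM) would have to preserve to take
# a precise exception, vs. 10s of registers on a CPU core.
regs_per_core = regs_per_thread * threads_per_warp * warps_per_core
print(regs_per_core)  # 49152 -- i.e. "10,000s of registers/core"
```

This is why the CPU trick of buffering and checkpointing registers does not scale to GPUs.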

  15. GPU Exception Support — idempotence on GPUs. GPUs hit the sweet spot: (1) extractably large regions (low compiler overheads); (2) detection latencies long or hard to bound (large regions are good); (3) exceptions are infrequent (low re-execution overheads). [Cost diagram: detection latency, register pressure, re-execution time.]

  16. GPU Exception Support — idempotence on GPUs. GPUs hit the sweet spot: (1) extractably large regions (low compiler overheads); (2) detection latencies long or hard to bound (large regions are good); (3) exceptions are infrequent (low re-execution overheads). GPU design topics (details): compiler flow; hardware support; exception live-lock; bonus: fast context switching.

  17. GPU Exception Support — evaluation methodology. Compiler: LLVM targeting ARM. Simulation: gem5 for ARM, simple dual-issue in-order (e.g. Fermi), 10-cycle page fault detection latency. Benchmarks: Parboil GPU benchmarks for CPUs, modified. Measurement: performance overhead in execution cycles.

  18. GPU Exception Support — evaluation results. [Bar chart: performance overhead for cutcp, fft, histo, mri-q, sad, tpacf; gmean 0.54%.]

  19. CPU Exception Support

  20. CPU Exception Support — why is it a problem? The CPU solution: pipeline buffers and registers.

  21. CPU Exception Support — why is it a problem? [Pipeline diagram, before: Fetch → Decode, Rename & Issue → RF → Integer, Branch, Integer Multiply, Load/Store, and FP units with bypass; IEEE FP; replay queue; Flush? Replay? control.]

  22. CPU Exception Support — why is it a problem? [Pipeline diagram, after: Fetch → Decode & Issue → RF → Integer, Branch, Integer Multiply, Load/Store, and FP units.]

  23. CPU Exception Support — idempotence on CPUs. CPU design simplification: in an ARM Cortex-A8 (dual-issue in-order) we can remove the bypass/staging register file, the replay queue, the rename pipeline stage, the IEEE-compliant floating point unit, pipeline flushes for exceptions and replays, and all associated control logic (details). Leaner hardware; bonus: cheap (but modest) OoO issue.

  24. CPU Exception Support — evaluation methodology. Compiler: LLVM targeting ARM, minimizing pressure (take 2/3). Simulation: gem5 for ARM, aggressive dual-issue in-order (e.g. A8), stalling on a potential in-flight exception. Benchmarks: SPEC 2006 & PARSEC suites (unmodified). Measurement: performance overhead in execution cycles.

  25. CPU Exception Support — evaluation results. [Bar chart: performance overhead for SPEC INT, SPEC FP, PARSEC; overall 9.1%.]

  26. Hardware Fault Tolerance

  27. Hardware Fault Tolerance — what is the opportunity? Reliability trends: CMOS reliability is a growing problem, and future CMOS alternatives are no better. Architecture trends: hardware power and complexity are at a premium; there is a desire for simple hardware + efficient recovery. Application trends: emerging workloads consist of large idempotent regions, with increasing levels of software abstraction.

  28. Hardware Fault Tolerance — design topics. Fault detection capability: fine-grained in hardware (e.g. Argus, MICRO ’07) or fine-grained in software (e.g. instruction/region DMR). Hardware organizations: homogeneous (idempotence everywhere), statically heterogeneous (e.g. accelerators), or dynamically heterogeneous (adaptive cores). Fault model (aka ISA semantics): similar to pipeline-based (e.g. ROB) recovery.

  29. Hardware Fault Tolerance — evaluation methodology. Compiler: LLVM targeting ARM (compiled to minimize pressure). Simulation: gem5 for ARM, simple dual-issue in-order; DMR detection; compared against checkpoint/log and TMR. Benchmarks: SPEC 2006, PARSEC, and Parboil suites (unmodified). Measurement: performance overhead in execution cycles.

  30. Hardware Fault Tolerance — evaluation results. [Bar chart: performance overhead — idempotence 9.1%, checkpoint/log 22.2%, TMR 29.3%.]

  31. Overview — ❶ Idempotence Models in Architecture ❷ Compiler Design & Evaluation ❸ Architecture Design & Evaluation

  32. Related Work, Conclusions

  33. Conclusions — idempotence: not good for everything. Small regions are expensive: preserving register state is difficult with limited flexibility. Large regions are cheap: preserving register state is easy with the amortization effect, and preserving memory state is mostly “for free.” Idempotence: synergistic with modern trends — programmability (for GPUs), low power (for everyone), high-level software, efficient recovery (for everyone).

  34. The End

  35. Back-Up: Chronology — [Timeline from prelim to defense: SELSE ’09: MapReduce for CELL; ISCA ’10: Relax; DSN ’10: TS model; MICRO ’11: Idempotent Processors; PLDI ’12: Static Analysis and Compiler Design; ISCA ’12: iGPU; CGO ??: Code Gen; TACO ??: Models; Synergy.]

  36. Choose Your Own Adventure Slides

  37. Idempotence Analysis — is this idempotent? [code example] Yes.

  38. Idempotence Analysis — how about this? [code example] No.

  39. Idempotence Analysis — maybe this? [code example] Yes.

  40. Idempotence Analysis — it’s all about the data dependences. Operation sequence → idempotent? write → Yes; read, write → No; write, read, write → Yes.

  41. Idempotence Analysis — it’s all about the data dependences. Operation sequence → idempotent? write, read → Yes; read, write → No (a clobber antidependence: an antidependence with an exposed read); write, read, write → Yes.
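The sequences above can be checked mechanically: a region is non-idempotent exactly when some location sees a write after an exposed read (a read not preceded by a write inside the region). A minimal sketch for a single location; the function name is mine, not the thesis’s:

```python
def is_idempotent(ops):
    """Classify a sequence of 'read'/'write' accesses to one location.

    Non-idempotent iff a write clobbers an exposed read: a read that
    was not preceded by a write inside the region.
    """
    exposed_read = False
    seen_write = False
    for op in ops:
        if op == "write":
            if exposed_read:
                return False  # clobber antidependence
            seen_write = True
        elif op == "read" and not seen_write:
            exposed_read = True  # read of the region's input
    return True

# Matches the table: write,read -> Yes; read,write -> No;
# write,read,write -> Yes.
```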

  42. Semantic Idempotence — two types of program state. (1) Local (“pseudoregister”) state: can be renamed to remove clobber antidependences, so it does not semantically constrain idempotence. (2) Non-local (“memory”) state: cannot be “renamed” to avoid clobber antidependences, so it semantically constrains idempotence. Semantic idempotence = no non-local clobber antidependences; preserve local state by renaming and careful allocation.
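Point (1) — that local state can always be renamed — is what SSA-style construction exploits. A toy sketch of versioned renaming over straight-line (dst, srcs) instructions; the representation and names are mine:

```python
def rename_locals(instrs):
    """Give every definition a fresh versioned name so that no local
    is ever redefined -- redefinitions (clobbers) become new names."""
    version = {}  # base name -> current version number

    def cur(name):
        return f"{name}{version.get(name, 0)}"

    out = []
    for dst, srcs in instrs:
        new_srcs = [cur(s) for s in srcs]        # read current versions
        version[dst] = version.get(dst, 0) + 1   # fresh version for dst
        out.append((cur(dst), new_srcs))
    return out

# x = a + b; y = x; x = c  -- the second "x =" clobbers the read of x.
prog = [("x", ["a", "b"]), ("y", ["x"]), ("x", ["c"])]
# After renaming: x1 = a0 + b0; y1 = x1; x2 = c0 -- no clobber remains.
```

Memory has no such escape hatch: a store to [x] overwrites the one and only [x], which is why only non-local state constrains idempotence.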

  43. Region Partitioning Algorithm — steps one, two, and three. Step 1: transform the function (remove artificial dependences, remove non-clobbers). Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG). Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior).

  44. Step 1: Transform — not one, but two transformations. Transformation 1: SSA for pseudoregister antidependences. But we still have a problem: region identification depends on clobber antidependences, and clobber antidependences depend on region boundaries.

  45. Step 1: Transform — not one, but two transformations. Transformation 1: SSA for pseudoregister antidependences. Transformation 2: scalar replacement of memory variables: [x] = a; b = [x]; [x] = c; becomes [x] = a; b = a; [x] = c; — non-clobber antidependences… GONE!
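Transformation 2 is, in essence, store-to-load forwarding: a load whose address was just stored to can reuse the stored value, turning the memory read into a use of a renameable local. A minimal sketch assuming straight-line code and exact, alias-free addresses (representation mine):

```python
def scalar_replace(instrs):
    """Forward stored values to later loads of the same address.

    Instructions: ('store', addr, src) or ('load', addr, dst).
    Assumes straight-line code with no aliasing between addresses."""
    last_store = {}  # addr -> most recently stored value name
    out = []
    for ins in instrs:
        if ins[0] == "store":
            _, addr, src = ins
            last_store[addr] = src
            out.append(ins)
        else:  # load
            _, addr, dst = ins
            if addr in last_store:  # reuse the stored value directly
                out.append(("copy", last_store[addr], dst))
            else:
                out.append(ins)
    return out

# [x] = a; b = [x]; [x] = c   ==>   [x] = a; b = a; [x] = c
prog = [("store", "x", "a"), ("load", "x", "b"), ("store", "x", "c")]
```

After the rewrite the read of [x] is gone, so the later store to [x] no longer forms a (non-clobber) antidependence with it.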

  46. Step 1: Transform — not one, but two transformations. Transformation 1: SSA for pseudoregister antidependences. Transformation 2: scalar replacement of memory variables. Region identification depends on clobber antidependences, and clobber antidependences depend on region boundaries.

  47. Region Partitioning Algorithm — steps one, two, and three. Step 1: transform the function (remove artificial dependences, remove non-clobbers). Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG). Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior).

  48. Step 2: Cut the CFG — cut, cut, cut… Construct regions by “cutting” non-local antidependences. [CFG diagram with an antidependence edge.]

  49. Step 2: Cut the CFG — but where to cut…? [Rough sketch: sources of overhead vs. region size; optimal region size?] Larger is (generally) better: large regions amortize the cost of input preservation.

  50. Step 2: Cut the CFG — but where to cut…? Goal: the minimum set of cuts that cuts all antidependence paths. Intuition: minimum cuts → fewest regions → large regions. Approach: a series of reductions: minimum vertex multi-cut (NP-complete) → minimum hitting set among paths → minimum hitting set among “dominating nodes” (details omitted).
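The thesis’s exact reductions are omitted on the slide, but the final hitting-set step admits a classic greedy approximation: repeatedly cut the node lying on the most still-uncut antidependence paths. A sketch with hypothetical block names, not the thesis’s implementation:

```python
def greedy_cuts(paths):
    """Greedy approximation to minimum hitting set: choose a small set
    of nodes such that every antidependence path contains a cut."""
    remaining = [set(p) for p in paths]
    cuts = set()
    while remaining:
        counts = {}
        for path in remaining:
            for node in path:
                counts[node] = counts.get(node, 0) + 1
        best = max(counts, key=counts.get)  # node hitting the most paths
        cuts.add(best)
        remaining = [p for p in remaining if best not in p]
    return cuts

# Three antidependence paths through CFG blocks (hypothetical):
paths = [["B1", "B2", "B3"], ["B2", "B4"], ["B2", "B5"]]
# B2 lies on all three paths, so one cut suffices -> few, large regions.
```

Fewer cuts mean fewer region boundaries, which is exactly the “minimum cuts → fewest regions → large regions” intuition.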

  51. Region Partitioning Algorithm — steps one, two, and three. Step 1: transform the function (remove artificial dependences, remove non-clobbers). Step 2: construct regions around antidependences (cut all non-local antidependences in the CFG). Step 3: refine for correctness & performance (account for loops, optimize for dynamic behavior).

  52. Step 3: Loop-Related Refinements — loops affect correctness and performance. Correctness: not all local antidependences are removed by SSA; loop-carried antidependences may clobber, depending on boundary placement — handled as a post-pass. Performance: loops tend to execute multiple times, so to maximize region size, place cuts outside of loops — the algorithm is modified to prefer cuts outside of loops (details omitted).

  53. Code Generation Algorithms — idempotence preservation. Background & concepts: live intervals, region intervals, and shadow intervals. Compiling for contextual idempotence: potentially variable control flow upon re-execution. Compiling for architectural idempotence: invariable control flow upon re-execution.

  54. Code Generation Algorithms — live intervals and region intervals. [Diagram: x = ...; ... = f(x); y = ... — x’s live interval alongside the region interval and its boundaries.]

  55. Code Generation Algorithms — shadow intervals. A shadow interval is the interval over which a variable must not be overwritten specifically to preserve idempotence; it differs for architectural and contextual idempotence.

  56. Code Generation Algorithms — for contextual idempotence. [Diagram: x = ...; ... = f(x); y = ... — x’s live interval, region boundaries, and x’s shadow interval.]
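Under contextual idempotence, re-execution may take a different control-flow path, so a variable defined in a region must survive from its definition to the region’s end, not just to its last use. A sketch of the resulting register-allocation conflicts — representation and names are mine, and the architectural case (where the shadow can end sooner) is not modeled:

```python
def contextual_conflicts(def_use, region_end):
    """def_use: var -> (def_point, last_use) within one region.

    Under contextual idempotence each variable is protected from its
    definition to region_end (its shadow interval extends past the
    last use), so two variables conflict -- and cannot share a
    register -- whenever those protected intervals overlap."""
    prot = {v: (d, region_end) for v, (d, u) in def_use.items()}
    pairs = set()
    names = sorted(prot)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (s1, e1), (s2, e2) = prot[a], prot[b]
            if s1 < e2 and s2 < e1:  # interval overlap test
                pairs.add((a, b))
    return pairs

# x dies at 2 and y is born at 5: the live intervals are disjoint, but
# x's shadow reaches the region end, so they cannot share a register.
conflicts = contextual_conflicts({"x": (0, 2), "y": (5, 6)}, region_end=10)
```

This is the pressure cost the earlier slides quantify: shadow intervals keep registers occupied longer than ordinary liveness would.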

  57. Code Generation Algorithms — for architectural idempotence. [Diagram: x = ...; ... = f(x); y = ... — x’s live interval, region boundaries, and x’s shadow interval.]

  58. Code Generation Algorithms — for architectural idempotence. [Diagram: x = ...; ... = f(x); y = ... — x’s live interval, x’s shadow interval, and y’s live interval, with region boundaries.]

  59. Big Regions — re: problem #2 (cuts in loops are bad). C code: for (i = 0; i < X; i++) { ... } CFG + SSA: i0 = φ(0, i1); i1 = i0 + 1; if (i1 < X)

  60. Big Regions — re: problem #2 (cuts in loops are bad). C code: for (i = 0; i < X; i++) { ... } Machine code: R0 = 0; …; R0 = R0 + 1; if (R0 < X)

  61. Big Regions — re: problem #2 (cuts in loops are bad). [Repeats slide 60: the machine-code loop R0 = 0; …; R0 = R0 + 1; if (R0 < X).]
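The trouble slides 59–61 illustrate: after register allocation, i0 and i1 share R0, so each loop iteration reads R0 before writing it — a clobber antidependence carried around the back edge, which forces a region boundary inside the loop. A small sketch over per-instruction access traces (representation mine):

```python
def clobbers_input(trace, loc):
    """True if the trace's first access to loc is a read and a write
    to loc follows -- i.e. the region clobbers its own input."""
    seen_read = False
    for op, l in trace:
        if l != loc:
            continue
        if op == "read" and not seen_read:
            seen_read = True
        elif op == "write" and seen_read:
            return True   # write after an exposed read: clobber
        elif op == "write":
            return False  # first access is a write: input never exposed
    return False

# Allocated loop body:  R0 = R0 + 1; if (R0 < X)
body = [("read", "R0"), ("write", "R0"), ("read", "R0"), ("read", "X")]

# SSA form:  i1 = i0 + 1; if (i1 < X)  -- i1's first access is a write,
# and i0 is read but never written, so neither name is clobbered.
ssa = [("read", "i0"), ("write", "i1"), ("read", "i1"), ("read", "X")]
```

In SSA the loop is clobber-free, but a finite register file cannot keep every version live — hence the Step 3 refinements for loop-carried antidependences.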
