

  1. Transparent Parallelization of Binary Code
Benoît Pradelle, Alain Ketterlin, Philippe Clauss
Université de Strasbourg, INRIA (CAMUS team, Centre Nancy Grand-Est), CNRS (LSIIT, UMR 7005)
First International Workshop on Polyhedral Compilation Techniques (IMPACT 2011), Sunday April 3, 2011

  2. Overview Raising Parallelizing Lowering Results Conclusion
Overview
[figure: original sequential program (binary) --raising--> intermediate form --transform/optimize--> parallel form --lowering--> parallel executable]
1. Bring the (x86-64) code into “something usable”
2. Apply parallelizing transformations
3. Translate back into “something executable”
4. Empirical evaluation

  3. Raising / Decompiling x86-64
[figure: recovered CFG of interp_, shown as a column of basic-block labels 0x4039f2.1 through 0x4041ea.1, with exit block 0x403e2f.1]
1. Rebuild CFG
2. Natural loops
3. Points-to to discriminate
   ◮ current stack frame
   ◮ “outer” memory
   → track stack slots
4. SSA
5. Slicing/symbolic analysis
   ◮ memory addresses
   ◮ branch conditions
6. Induction variables
   → normalized counters
7. Control dependence
   ◮ trip-counts
   ◮ block constraints
8. Loop selection

  4. Raising / Decompiling x86-64
(the same eight raising steps, illustrated on the decompiled loop nest)
    mov [rsp+0xe8], 0x2           ; -> _V_42.0
L1:
    _V_42.1 = φ(_V_42.0, _V_42.2) ; @ [rsp.1 - 0x2e0] = 2 + I
    ...
    mov rax.29, [rsp+0xf0]
L2:
    rax.30 = φ(rax.29, rax.31)    ; = ... + J*0x8
    ...
    addsd xmm1, [rax.30]          ; @ ... + 8192*I + 8*J
    ...
    add rax.31, rax.30, 0x8
    jmp L2
    add [rsp+0xe8], 0x1           ; -> _V_42.2
    ...
    jmp L1

  5. Raising / Decompiling x86-64
→ affine loop nests over a single array M
    xor ebp, ebp
    mov r11, rbp
    for (t1 = 0; -1023 + t1 <= 0; t1++)
      for (t2 = 0; -1023 + t2 <= 0; t2++) {
        mov M[23371872 + 8536*t1 + 8*t2], 0x0
        mov M[rsp.1-0x30], r11
        movsd xmm1, M[rsp.1-0x30]   // <- 0.
        for (t3 = 0; -1023 + t3 <= 0; t3++) {
          movsd xmm0, M[6299744 + 8296*t1 + 8*t3]
          mulsd xmm0, M[14794848 + 8*t2 + 8376*t3]
          addsd xmm1, xmm0
        }
        movsd M[23371872 + 8536*t1 + 8*t2], xmm1
      }
→ almost directly usable

  6. Parallelizing / Adapting to the tools...
◮ Outlining: exact instructions do not matter, shown as ⊙
◮ Array reconstruction: split memory into disjoint pieces
  Note: parametric bounds would lead to runtime checks (not really needed anymore...)
◮ Forward substitution of scalars
◮ The previous example becomes
    for (t1 = 0; -1023 + t1 <= 0; t1++)
      for (t2 = 0; -1023 + t2 <= 0; t2++) {
        A2[t1][8*t2] = 0
        xmm1 = 0
        for (t3 = 0; -1023 + t3 <= 0; t3++)
          xmm1 = xmm1 ⊙ ( A1[t1][8*t3] ⊙ A3[t3][8*t2] )
        A2[t1][8*t2] = xmm1
      }

  7. Parallelizing / Removing scalars
◮ Scalar expansion, then transformation?
◮ We don’t want this!
    for (t1 = 0; t1 <= 1023; t1++)
      for (t2 = 0; t2 <= 1023; t2++)
        xmm1[t1][t2] = 0;
    for (t1 = 0; t1 <= 1023; t1++)
      for (t2 = 0; t2 <= 1023; t2++)
        for (t3 = 0; t3 <= 1023; t3++)
          xmm1[t1][t2] = xmm1[t1][t2] ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2]);
    for (t1 = 0; t1 <= 1023; t1++)
      for (t2 = 0; t2 <= 1023; t2++)
        A2[t1][8*t2] = xmm1[t1][t2];

  8. Parallelizing / Removing scalars
◮ Instead we do “backward substitution”:
    A2[t1][8*t2] = 0
    xmm1 = 0
    for (t3 = 0; -1023 + t3 <= 0; t3++)
      xmm1 = xmm1 ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2])
    A2[t1][8*t2] = xmm1
  becomes
    A2[t1][8*t2] = 0
    for (t3 = 0; -1023 + t3 <= 0; t3++)
      A2[t1][8*t2] = A2[t1][8*t2] ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2])
    [ xmm1 = A2[t1][8*t2] ]
◮ Restrictions:
  ◮ no data dependence (we use isl)
  ◮ no complex mixing with other registers
◮ If we can’t back-substitute, we need to “freeze” the fragment

  9. Parallelizing / PLUTO
run PLUTO

  10. Lowering / Restoring semantics
◮ Identifying statements (note: some have been moved, some duplicated; we do not tolerate fusion/splitting)
◮ Thanks PLUTO for providing stable numbering
◮ The resulting nest(s) is(are) made of abstract statements
  ◮ acting on memory cells, with address expressions
  ◮ using registers for intermediate results
→ generating C is simpler than reusing the original code

  11. Lowering / Restoring semantics (cont’d)
◮ Memory addresses are cast into pointers:
    (void*)(23371872 + 8536*t4 + 8*t5)
◮ Loads and stores use intrinsic functions
    xmm0 = _mm_load_sd((double*)(6299744 + 8296*t4 + 8*t7));
    _mm_store_sd((double*)(23371872 + 8536*t4 + 8*t5), xmm1);
◮ Basic operations use intrinsics as well:
    xmm1 = _mm_add_sd(xmm1, xmm0);

  12. Lowering / Restoring semantics
#pragma omp parallel for private(t2,t3,t4,t5)
for (t2=0; t2<=1023/32; t2++)
  for (t3=0; t3<=1023/32; t3++)
    for (t4=32*t2; t4<=min(1023,32*t2+31); t4++)
      for (t5=32*t3; t5<=min(1023,32*t3+31); t5++) {
        void *tmp0 = (void*)(23371872 + 8536*t4 + 8*t5);
        asm volatile("movq $0, (%0)" :: "r"(tmp0));
      }

#pragma omp parallel for private(t2,t3,t4,t5,xmm0,xmm1)
for (t2=0; t2<=1023/32; t2++)
  for (t3=0; t3<=1023/32; t3++)
    for (t4=32*t2; t4<=min(1023,32*t2+31); t4++)
      for (t5=32*t3; t5<=min(1023,32*t3+31); t5++) {
        double tmp1 = 0.;
        xmm1 = _mm_load_sd(&tmp1);
        for (t7=0; t7<=1023; t7++) {
          xmm0 = _mm_load_sd((double*)(6299744 + 8296*t4 + 8*t7));
          __m128d tmp2 = _mm_load_sd((double*)(14794848 + 8*t5 + 8376*t7));
          xmm0 = _mm_mul_sd(xmm0, tmp2);
          xmm1 = _mm_add_sd(xmm1, xmm0);
        }
        _mm_store_sd((double*)(23371872 + 8536*t4 + 8*t5), xmm1);
      }

  13. Lowering / Monitoring execution
◮ Transformed/parallelized loop nests
  ◮ are compiled as functions with gcc
  ◮ and placed in a shared library
◮ We use run-time monitoring to replace a loop nest
  ◮ the monitoring process ptraces the child
  ◮ the child process runs the original executable
  ◮ breakpoints are set at loop entry and loop exit
  ◮ the monitor redirects (parallelized) loop executions
◮ If you think this is too complex... you’re right (we have a hidden agenda)

  14. Results / Coverage
◮ On polybench 1.0, compiled with gcc -O2 (4.4.5)

    Benchmark      Parallelized   In source   Rate
    2mm                  7             7      100%
    3mm                 10            10      100%
    atax                 2             2      100%
    bicg                 2             2      100%
    correlation          3             5       60%
    doitgen              3             3      100%
    gemm                 4             4      100%
    gemver               3             4       75%
    gramschmidt          1             2       50%
    lu                   1             2       50%
    Sum                 36            41      87.8%

  15. Results / Speedup
[figure: bar chart of speedups (y-axis 0 to 8), comparing “source” vs “binary” parallelization on 2mm, 3mm, atax, bicg, correlation, doitgen, gemm, gemver, gramschmidt, lu]

  16. Results / Speedup
◮ Intel Xeon W 3520, 4 cores
[figure: bar chart of speedups (y-axis 0 to 8), comparing source/OpenMP, source/PLuTo, and binary/PLuTo across the benchmarks, including jacobi-2d-imper]

  17. Conclusion
What about “real” programs?
◮ Parameters everywhere:
  ◮ loop bounds
  ◮ access functions
  ◮ block constraints
  → conservative dependence analysis, and runtime tests
◮ Address expressions like:
    0x8 + 8*rbx.3 + 8*rdx.3 + 1*rsi.1 + 8*K + 8*rbx.3*J + 8*rdx.3*I
◮ “Fake” non-static control (on floating point values)
  → what exactly is a statement?

  18. Shameless advertisement!
Benoît PRADELLE (b.pradelle@gmail.com)
Expert in:
◮ Runtime selection of parallel schedules
◮ Parallelization of binary code
◮ and more
Will graduate December 2011; available January 2012 for a post-doc position
