Transparent Parallelization of Binary Code
Benoît Pradelle, Alain Ketterlin, Philippe Clauss
Université de Strasbourg, INRIA (CAMUS team, Centre Nancy Grand-Est), CNRS (LSIIT, UMR 7005)
First International Workshop on Polyhedral Compilation Techniques (IMPACT 2011)
Sunday, April 3, 2011
Overview

[Pipeline diagram: the original sequential binary is raised into an intermediate form, transformed/optimized into a parallel form, then lowered into a parallel executable.]

1. Bring the (x86-64) code into “something usable”
2. Apply parallelizing transformations
3. Translate back into “something executable”
4. Empirical evaluation
Raising / Decompiling x86-64

1. Rebuild the CFG
2. Natural loops
3. Points-to analysis to discriminate
   ◮ current stack frame
   ◮ “outer” memory
   → track stack slots
4. SSA
5. Slicing/symbolic analysis
   ◮ memory addresses
   ◮ branch conditions
6. Induction variables
   → normalized counters
7. Control dependence
   ◮ trip counts
   ◮ block constraints
8. Loop selection

[Figure: CFG of the interp_ function, ~40 basic blocks labeled by address, from 0x4039f2.1 down to the exit block 0x403e2f.1.]
Raising / Decompiling x86-64

The same analysis steps, illustrated on a code fragment (SSA names, symbolic annotations):

        mov [rsp+0xe8], 0x2              ; -> _V_42.0
    L1:
        _V_42.1 = φ(_V_42.0, _V_42.2)    ; @ [rsp.1 - 0x2e0] = 2 + I
        ...
        mov rax.29, [rsp+0xf0]
    L2:
        rax.30 = φ(rax.29, rax.31)       ; = ... + J*0x8
        ...
        addsd xmm1, [rax.30]             ; @ ... + 8192*I + 8*J
        ...
        add rax.31, rax.30, 0x8
        jmp L2

        add [rsp+0xe8], 0x1              ; -> _V_42.2
        ...
        jmp L1
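To make steps 5–6 concrete, here is a minimal C sketch, not the tool's output (the function names and the running-sum example are made up): a pointer bumped by 8 bytes per iteration, like rax above, is re-expressed through a fresh normalized counter, so that every address becomes an affine function of loop counters.

    #include <stddef.h>

    /* Before: the address is carried by a pointer-typed induction variable. */
    double sum_ptr(double *base, size_t n) {
        double s = 0.0;
        for (double *p = base; p != base + n; p++)   /* p plays the role of rax */
            s += *p;
        return s;
    }

    /* After: a normalized counter J, addresses rewritten as base + 8*J
       (assuming 8-byte doubles, as in the x86-64 code above). */
    double sum_norm(double *base, size_t n) {
        double s = 0.0;
        for (size_t J = 0; J < n; J++)
            s += *(double *)((char *)base + 8 * J);
        return s;
    }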
Raising / Decompiling x86-64

→ affine loop nests over a single array M

    xor ebp, ebp
    mov r11, rbp
    for (t1 = 0; -1023 + t1 <= 0; t1++)
      for (t2 = 0; -1023 + t2 <= 0; t2++) {
        mov M[23371872 + 8536*t1 + 8*t2], 0x0
        mov M[rsp.1-0x30], r11
        movsd xmm1, M[rsp.1-0x30]            // <- 0.
        for (t3 = 0; -1023 + t3 <= 0; t3++) {
          movsd xmm0, M[6299744 + 8296*t1 + 8*t3]
          mulsd xmm0, M[14794848 + 8*t2 + 8376*t3]
          addsd xmm1, xmm0
        }
        movsd M[23371872 + 8536*t1 + 8*t2], xmm1
      }

→ almost directly usable
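For orientation only (this reading is an assumption, not part of the raised code): with C, A and B standing for the regions based at 23371872, 6299744 and 14794848, and ignoring the row padding implied by the strides 8536, 8296 and 8376, the nest above is a plain 1024×1024 matrix product.

    /* Hypothetical source-level equivalent of the raised nest (illustrative). */
    void matmul1024(double C[1024][1024],
                    const double A[1024][1024],
                    const double B[1024][1024]) {
        for (int t1 = 0; t1 < 1024; t1++)
            for (int t2 = 0; t2 < 1024; t2++) {
                C[t1][t2] = 0.0;                    /* mov M[...], 0x0        */
                double acc = 0.0;                   /* xmm1 loaded with 0.    */
                for (int t3 = 0; t3 < 1024; t3++)   /* movsd / mulsd / addsd  */
                    acc += A[t1][t3] * B[t3][t2];
                C[t1][t2] = acc;                    /* final movsd store      */
            }
    }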
Parallelizing / Adapting to the tools...

◮ Outlining: exact instructions do not matter, shown as ⊙
◮ Array reconstruction: split memory into disjoint pieces (sketched just after the example below)
  Note: parametric bounds would lead to runtime checks (not really needed anymore...)
◮ Forward substitution of scalars
◮ The previous example becomes

    for (t1 = 0; -1023 + t1 <= 0; t1++)
      for (t2 = 0; -1023 + t2 <= 0; t2++) {
        A2[t1][8*t2] = 0
        xmm1 = 0
        for (t3 = 0; -1023 + t3 <= 0; t3++)
          xmm1 = xmm1 ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2])
        A2[t1][8*t2] = xmm1
      }
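A minimal sketch of what array reconstruction amounts to, with constants taken from the running example (the helper name A2_cell is made up): each disjoint memory region becomes a pseudo-array with a fixed base and row stride, and the flat address expressions are re-indexed against it.

    #include <stdint.h>

    #define A2_BASE 23371872UL   /* base of the region written by the nest */
    #define A2_ROW  8536UL       /* byte stride between consecutive rows   */

    /* A2[t1][8*t2] in the outlined code denotes exactly this address: */
    static inline uintptr_t A2_cell(long t1, long t2) {
        return A2_BASE + A2_ROW * (uintptr_t)t1 + 8UL * (uintptr_t)t2;
    }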
Parallelizing / Removing scalars

◮ Scalar expansion, then transformation?
◮ We don’t want this!

    for (t1 = 0; t1 <= 1023; t1++)
      for (t2 = 0; t2 <= 1023; t2++)
        xmm1[t1][t2] = 0;

    for (t1 = 0; t1 <= 1023; t1++)
      for (t2 = 0; t2 <= 1023; t2++)
        for (t3 = 0; t3 <= 1023; t3++)
          xmm1[t1][t2] = xmm1[t1][t2] ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2]);

    for (t1 = 0; t1 <= 1023; t1++)
      for (t2 = 0; t2 <= 1023; t2++)
        A2[t1][8*t2] = xmm1[t1][t2];
Parallelizing / Removing scalars

◮ Instead we do “backward substitution”:

    A2[t1][8*t2] = 0
    xmm1 = 0
    for (t3 = 0; -1023 + t3 <= 0; t3++)
      xmm1 = xmm1 ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2])
    A2[t1][8*t2] = xmm1

  becomes

    A2[t1][8*t2] = 0
    for (t3 = 0; -1023 + t3 <= 0; t3++)
      A2[t1][8*t2] = A2[t1][8*t2] ⊙ (A1[t1][8*t3] ⊙ A3[t3][8*t2])
    [ xmm1 = A2[t1][8*t2] ]

◮ Restrictions:
  ◮ no data dependence (we use isl)
  ◮ no complex mixing with other registers (see the sketch below)
◮ If we can’t back-substitute, we need to “freeze” the fragment
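A minimal C rendering of a fragment where back-substitution would be refused (hypothetical: the array B, the concrete + and * operators, and the function name are made up for illustration): the accumulator feeds two distinct memory cells, so it cannot be replaced by either of them, and the fragment would be frozen.

    void frozen_fragment(double A1[1024][1024], double A3[1024][1024],
                         double A2[1024][1024], double B[1024],
                         int t1, int t2) {
        double xmm1 = 0.0;
        for (int t3 = 0; -1023 + t3 <= 0; t3++)
            xmm1 = xmm1 + A1[t1][t3] * A3[t3][t2];
        A2[t1][t2] = xmm1;
        B[t1] = B[t1] + xmm1;   /* second consumer of the accumulator */
    }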
Parallelizing / PLUTO

run PLUTO
Lowering / Restoring semantics

◮ Identifying statements (note: some have been moved, some duplicated; we do not tolerate fusion/splitting)
◮ Thanks to PLUTO's stable statement numbering
◮ The resulting nests are made of abstract statements
  ◮ acting on memory cells, with address expressions
  ◮ using registers for intermediate results
  → generating C is simpler than reusing the original code
◮ Memory addresses are cast into pointers:
    (void*)(23371872 + 8536*t4 + 8*t5)
◮ Loads and stores use intrinsic functions:
    xmm0 = _mm_load_sd((double*)(6299744 + 8296*t4 + 8*t7));
    _mm_store_sd((double*)(23371872 + 8536*t4 + 8*t5), xmm1);
◮ Basic operations use intrinsics as well:
    xmm1 = _mm_add_sd(xmm1, xmm0);
Lowering / Restoring semantics

    #pragma omp parallel for private(t2,t3,t4,t5)
    for (t2=0; t2<=1023/32; t2++)
      for (t3=0; t3<=1023/32; t3++)
        for (t4=32*t2; t4<=min(1023,32*t2+31); t4++)
          for (t5=32*t3; t5<=min(1023,32*t3+31); t5++) {
            void *tmp0 = (void*)(23371872 + 8536*t4 + 8*t5);
            asm volatile("movq $0, (%0)" :: "r"(tmp0));
          }

    #pragma omp parallel for private(t2,t3,t4,t5,xmm0,xmm1)
    for (t2=0; t2<=1023/32; t2++)
      for (t3=0; t3<=1023/32; t3++)
        for (t4=32*t2; t4<=min(1023,32*t2+31); t4++)
          for (t5=32*t3; t5<=min(1023,32*t3+31); t5++) {
            double tmp1 = 0.;
            xmm1 = _mm_load_sd(&tmp1);
            for (t7=0; t7<=1023; t7++) {
              xmm0 = _mm_load_sd((double*)(6299744 + 8296*t4 + 8*t7));
              __m128d tmp2 = _mm_load_sd((double*)(14794848 + 8*t5 + 8376*t7));
              xmm0 = _mm_mul_sd(xmm0, tmp2);
              xmm1 = _mm_add_sd(xmm1, xmm0);
            }
            _mm_store_sd((double*)(23371872 + 8536*t4 + 8*t5), xmm1);
          }
Lowering / Monitoring execution

◮ Transformed/parallelized loop nests
  ◮ are compiled as functions with gcc
  ◮ and placed in a shared library
◮ We use run-time monitoring to replace a loop nest
  ◮ the monitoring process ptraces the child
  ◮ the child process runs the original executable
  ◮ breakpoints are set at loop entry and loop exit
  ◮ the monitor redirects (parallelized) loop executions
◮ If you think this is too complex... you’re right (we have a hidden agenda)
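A minimal sketch of the breakpoint plumbing behind this scheme, under stated assumptions: this is not the authors' tool; ./original and ENTRY_ADDR are placeholders, and the redirection into the shared-library version is only indicated by a comment.

    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define ENTRY_ADDR 0x403a99UL              /* hypothetical loop-entry address */

    int main(void) {
        pid_t child = fork();
        if (child == 0) {                      /* child: run the original program */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execl("./original", "./original", (char *)NULL);
            return 1;
        }
        waitpid(child, NULL, 0);               /* child stopped at execve */

        /* plant an int3 breakpoint at the loop entry */
        long orig = ptrace(PTRACE_PEEKTEXT, child, (void *)ENTRY_ADDR, NULL);
        long patched = (orig & ~0xffL) | 0xcc;
        ptrace(PTRACE_POKETEXT, child, (void *)ENTRY_ADDR, (void *)patched);

        ptrace(PTRACE_CONT, child, NULL, NULL);
        waitpid(child, NULL, 0);               /* child hit the breakpoint */

        /* a real monitor would now transfer control to the parallelized nest
         * (e.g. by pointing RIP at a stub from the shared library); this sketch
         * simply restores the original code and resumes where it stopped */
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        regs.rip = ENTRY_ADDR;                 /* rewind past the int3 */
        ptrace(PTRACE_SETREGS, child, NULL, &regs);
        ptrace(PTRACE_POKETEXT, child, (void *)ENTRY_ADDR, (void *)orig);

        ptrace(PTRACE_CONT, child, NULL, NULL);
        waitpid(child, NULL, 0);
        return 0;
    }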
Results / Coverage

◮ On polybench 1.0, compiled with gcc -O2 (4.4.5)

    Benchmark      Parallelized   In source    Rate
    2mm                  7             7       100%
    3mm                 10            10       100%
    atax                 2             2       100%
    bicg                 2             2       100%
    correlation          3             5        60%
    doitgen              3             3       100%
    gemm                 4             4       100%
    gemver               3             4        75%
    gramschmidt          1             2        50%
    lu                   1             2        50%
    Sum                 36            41       87.8%
Results / Speedup

[Bar chart: speedups (0–8) of the source-parallelized and binary-parallelized versions for 2mm, 3mm, atax, bicg, correlation, doitgen, gemm, gemver, gramschmidt and lu.]
Results / Speedup

◮ Intel Xeon W3520, 4 cores

[Bar chart: speedups (0–8) of source/OpenMP, source/PLuTo and binary/PLuTo across the benchmarks.]
Conclusion

What about “real” programs?
◮ Parameters everywhere:
  ◮ loop bounds
  ◮ access functions
  ◮ block constraints
  → conservative dependence analysis, and runtime tests (see the sketch below)
◮ Address expressions like:
    0x8 + 8*rbx.3 + 8*rdx.3 + 1*rsi.1 + 8*K + 8*rbx.3*J + 8*rdx.3*I
◮ “Fake” non-static control (on floating point values)
  → what exactly is a statement?
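A minimal sketch of the kind of runtime test alluded to above (the function names and the interval over-approximation are assumptions, not the paper's mechanism): when bases or bounds are parameters, the regions touched by the nest can be over-approximated by byte ranges at nest entry, and the monitor falls back to the original sequential code whenever they may overlap.

    #include <stdint.h>

    extern void run_parallel_nest(void);     /* lowered, OpenMP version */
    extern void run_sequential_nest(void);   /* original binary code    */

    /* Over-approximated byte ranges [lo, hi] of the written and read regions,
     * computed from the parametric address expressions at nest entry. */
    static int may_overlap(uintptr_t w_lo, uintptr_t w_hi,
                           uintptr_t r_lo, uintptr_t r_hi) {
        return !(w_hi < r_lo || r_hi < w_lo);
    }

    void run_nest(uintptr_t w_lo, uintptr_t w_hi,
                  uintptr_t r_lo, uintptr_t r_hi) {
        if (may_overlap(w_lo, w_hi, r_lo, r_hi))
            run_sequential_nest();   /* dependences cannot be ruled out  */
        else
            run_parallel_nest();     /* safe to run the transformed nest */
    }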
Shameless advertisement!

Benoît PRADELLE (b.pradelle@gmail.com)
Expert in:
◮ Runtime selection of parallel schedules
◮ Parallelization of binary code
◮ and more
Will graduate in December 2011; available from January 2012 for a post-doc position.