
EECS 583 Class 10 Code Generation University of Michigan October - PowerPoint PPT Presentation



  1. EECS 583 – Class 10 Code Generation University of Michigan October 6, 2014

  2. Announcements
     ❖ Reminder: HW 2
       » Due this Thursday; you should have started by now
     ❖ Class project proposals
       » Think about partners/topic!

  3. Course Project – Time to Start Thinking About This
     ❖ Mission statement: Design and implement something “interesting” in a compiler
       » LLVM preferred, but others are fine
       » Groups of 2-4 people (1 or 5 persons is possible in some cases)
       » Extend an existing research paper or go out on your own
     ❖ Topic areas
       » Dynamic optimization
       » Approximate computing
       » Memory system optimization
       » Machine learning for compilation
       » Automatic parallelization/SIMDization
       » Compiling for GPU/GPU-like architectures
       » Creating custom processors
       » Reliability
       » Energy

  4. Course Projects – Timetable
     ❖ Now
       » Start thinking about potential topics, identify group members
     ❖ Oct 20-22 (week after fall break): Project proposals
       » No class that week
       » Chang-hung and I will meet with each group; slot signups in class Oct 15
       » Ideas/proposal discussed at meeting
       » Written proposal (a paragraph or 2 plus some references) due Monday, Oct 29 from each group
     ❖ Nov 3 – Dec 3: Research presentations
       » Each group presents a research paper related to their project (20 mins + 5 mins Q&A)
     ❖ Late Nov: Project checkpoint
       » Update on your progress, what is left to do
     ❖ Dec 8-12: Project demos
       » Each group, 30 min slot: presentation/demo/whatever you like
       » Turn in a short report on your project

  5. Class Problem from Last Time → Answer
     Optimize this applying induction var str reduction.

     Original:
       r1 = 0
       r2 = 0
       (loop body)
       r5 = r5 + 1
       r11 = r5 * 2
       r10 = r11 + 2
       r12 = load(r10+0)
       r9 = r1 << 1
       r4 = r9 - 10
       r3 = load(r4+4)
       r3 = r3 + 1
       store(r4+0, r3)
       r7 = r3 << 2
       r6 = load(r7+0)
       r13 = r2 - 1
       r1 = r1 + 1
       r2 = r2 + 1
       liveout: r13, r12, r6, r10

     Answer:
       r1 = 0
       r2 = 0
       r111 = r5 * 2
       r109 = r1 << 1
       r113 = r2 - 1
       (loop body)
       r5 = r5 + 1
       r111 = r111 + 2
       r11 = r111
       r10 = r11 + 2
       r12 = load(r10+0)
       r9 = r109
       r4 = r9 - 10
       r3 = load(r4+4)
       r3 = r3 + 1
       store(r4+0, r3)
       r7 = r3 << 2
       r6 = load(r7+0)
       r13 = r113
       r1 = r1 + 1
       r109 = r109 + 2
       r2 = r2 + 1
       r113 = r113 + 1
       liveout: r13, r12, r6, r10

     Note: after copy propagation, r10 and r4 can be strength reduced as well.
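The r11 = r5 * 2 part of the answer above can be checked with a small simulation. This is only an illustrative sketch: the function names and the idea of collecting r11 into a list are assumptions made for the example, not part of the slides.

```python
def with_multiply(r5, trips):
    """Original form: a multiply inside the loop on every trip."""
    seen = []
    for _ in range(trips):
        r5 = r5 + 1
        r11 = r5 * 2
        seen.append(r11)
    return seen


def strength_reduced(r5, trips):
    """Strength-reduced form: the multiply is hoisted and replaced by an add."""
    seen = []
    r111 = r5 * 2            # computed once, before the loop
    for _ in range(trips):
        r5 = r5 + 1
        r111 = r111 + 2      # r5 grows by 1, so r5 * 2 grows by 2
        r11 = r111
        seen.append(r11)
    return seen


before = with_multiply(7, 4)
after = strength_reduced(7, 4)
print(before, after)
```

Both versions should yield the same sequence of r11 values; the second simply trades the multiply for an add.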

  6. ILP Optimization
     ❖ Traditional optimizations
       » Redundancy elimination
       » Reducing operation count
     ❖ ILP (instruction-level parallelism) optimizations
       » Increase the amount of parallelism and the ability to overlap operations
       » Operation count is secondary; often trade extra instructions for parallelism (but avoid code explosion)
     ❖ ILP increased by breaking dependences
       » True or flow = read after write dependence
       » False (anti/output) = write after read, write after write

  7. Back Substitution
     ❖ Generation of expressions by compiler frontends is very sequential
       » Account for operator precedence
       » Apply left-to-right within same precedence
     ❖ Back substitution
       » Create larger expressions
         • Iteratively substitute RHS expression for LHS variable
       » Note: may correspond to multiple source statements
       » Enable subsequent optis
     ❖ Optimization
       » Re-compute expression in a more favorable manner

     y = a + b + c - d + e - f;

       r9 = r1 + r2
       r10 = r9 + r3
       r11 = r10 - r4
       r12 = r11 + r5
       r13 = r12 - r6

     Subs r12:  r13 = r11 + r5 - r6
     Subs r11:  r13 = r10 - r4 + r5 - r6
     Subs r10:  r13 = r9 + r3 - r4 + r5 - r6
     Subs r9:   r13 = r1 + r2 + r3 - r4 + r5 - r6
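The substitution steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a compiler pass: the tuple encoding of operations, the helper names back_substitute and render, and the restriction to + and - with each register defined once are all assumptions made for the example.

```python
def back_substitute(ops, target):
    """Expand `target` into signed leaf terms by substituting definitions."""
    defs = {dest: (lhs, op, rhs) for dest, lhs, op, rhs in ops}

    def expand(reg, sign):
        # A register with no definition in `ops` is a leaf of the expression.
        if reg not in defs:
            return [(sign, reg)]
        lhs, op, rhs = defs[reg]
        # Subtraction flips the sign of everything in the right operand.
        rhs_sign = sign if op == "+" else -sign
        return expand(lhs, sign) + expand(rhs, rhs_sign)

    return expand(target, +1)


def render(terms):
    """Format signed terms as one flat expression string."""
    first_sign, first = terms[0]
    out = first if first_sign > 0 else "-" + first
    for sign, reg in terms[1:]:
        out += (" + " if sign > 0 else " - ") + reg
    return out


ops = [
    ("r9", "r1", "+", "r2"),
    ("r10", "r9", "+", "r3"),
    ("r11", "r10", "-", "r4"),
    ("r12", "r11", "+", "r5"),
    ("r13", "r12", "-", "r6"),
]
flat = render(back_substitute(ops, "r13"))
print("r13 =", flat)  # r13 = r1 + r2 + r3 - r4 + r5 - r6
```

The signed-term list is a convenient intermediate form because a later pass can regroup the terms freely.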

  8. Tree Height Reduction
     ❖ Re-compute expression as a balanced binary tree
       » Obey precedence rules
       » Essentially re-parenthesize
       » Combine literals if possible
     ❖ Effects
       » Height reduced (n terms)
         • From n-1 (assuming unit latency)
         • To ceil(log2(n))
       » Number of operations remains constant
       » Cost
         • Temporary registers “live” longer
       » Watch out for
         • Always ok for integer arithmetic
         • Floating-point – may not be!!

     original:
       r9 = r1 + r2
       r10 = r9 + r3
       r11 = r10 - r4
       r12 = r11 + r5
       r13 = r12 - r6

     after back subs:
       r13 = r1 + r2 + r3 - r4 + r5 - r6

     final code:
       t1 = r1 + r2
       t2 = r3 - r4
       t3 = r5 - r6
       t4 = t1 + t2
       r13 = t4 + t3
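A rough sketch of the re-parenthesization follows. It pairs adjacent terms level by level, which gives a balanced tree when every operation has unit latency. The function name tree_height_reduce, the (sign, name) term encoding, and the temporary names t1, t2, ... are assumptions for illustration; the sketch also assumes the final result is not a negated value.

```python
import math


def tree_height_reduce(terms):
    """Pair up signed terms level by level; return (ops, height)."""
    ops = []
    height = 0
    level = list(terms)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (s1, a), (s2, b) = level[i], level[i + 1]
            tmp = f"t{len(ops) + 1}"
            if s1 > 0 and s2 > 0:
                ops.append(f"{tmp} = {a} + {b}")
                sign = 1
            elif s1 > 0:
                ops.append(f"{tmp} = {a} - {b}")
                sign = 1
            elif s2 > 0:
                ops.append(f"{tmp} = {b} - {a}")
                sign = 1
            else:
                # Both negative: add them and keep the result negative.
                ops.append(f"{tmp} = {a} + {b}")
                sign = -1
            nxt.append((sign, tmp))
        if len(level) % 2:
            nxt.append(level[-1])  # odd term out waits for the next level
        level = nxt
        height += 1
    return ops, height


terms = [(1, "r1"), (1, "r2"), (1, "r3"), (-1, "r4"), (1, "r5"), (-1, "r6")]
ops, height = tree_height_reduce(terms)
for op in ops:
    print(op)
print("height:", height, "bound:", math.ceil(math.log2(len(terms))))
```

For the six-term example this produces five operations in three levels, the same shape as the final code on the slide (with the last temporary playing the role of r13).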

  9. Class Problem
     Assume: + = 1, * = 3

     Operand arrival times:
       r1 = 0, r2 = 0, r3 = 0, r4 = 1, r5 = 2, r6 = 0

       r10 = r1 * r2
       r11 = r10 + r3
       r12 = r11 + r4
       r13 = r12 - r5
       r14 = r13 + r6

     Back substitute
     Re-express in tree-height-reduced form
     Account for latency and arrival times
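To compare candidate orderings for this problem, a small timing helper can be handy. The sketch below only evaluates the original sequential chain; it assumes subtraction has the same latency as addition (the slide gives only + and *) and that there are no resource limits. The names LATENCY and finish_times are made up for the example.

```python
# Latencies from the slide; "-" = 1 is an assumption (same as "+").
LATENCY = {"+": 1, "-": 1, "*": 3}


def finish_times(ops, arrival):
    """Return the cycle at which each register becomes available."""
    ready = dict(arrival)
    for dest, lhs, op, rhs in ops:
        # An operation starts once both of its operands are available.
        start = max(ready[lhs], ready[rhs])
        ready[dest] = start + LATENCY[op]
    return ready


arrival = {"r1": 0, "r2": 0, "r3": 0, "r4": 1, "r5": 2, "r6": 0}
chain = [
    ("r10", "r1", "*", "r2"),
    ("r11", "r10", "+", "r3"),
    ("r12", "r11", "+", "r4"),
    ("r13", "r12", "-", "r5"),
    ("r14", "r13", "+", "r6"),
]
times = finish_times(chain, arrival)
print("r14 ready at cycle", times["r14"])
```

Feeding a re-associated operation list through the same function shows whether a given tree shape finishes earlier than the chain.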

  10. Optimizing Unrolled Loops
     Original loop:
       loop: r1 = load(r2)
             r3 = load(r4)
             r5 = r1 * r3
             r6 = r6 + r5
             r2 = r2 + 4
             r4 = r4 + 4
             if (r4 < 400) goto loop

     Unroll 3 times:
       loop:
       iter1 r1 = load(r2)
             r3 = load(r4)
             r5 = r1 * r3
             r6 = r6 + r5
             r2 = r2 + 4
             r4 = r4 + 4
       iter2 r1 = load(r2)
             r3 = load(r4)
             r5 = r1 * r3
             r6 = r6 + r5
             r2 = r2 + 4
             r4 = r4 + 4
       iter3 r1 = load(r2)
             r3 = load(r4)
             r5 = r1 * r3
             r6 = r6 + r5
             r2 = r2 + 4
             r4 = r4 + 4
             if (r4 < 400) goto loop

     Unroll = replicate loop body n-1 times.
     Hope to enable overlap of operation execution from different iterations.
     Not possible!
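As a toy illustration of "replicate loop body n-1 times", the helper below builds the unrolled instruction list as text. The function name unroll is an assumption, and the sketch assumes the trip count is a multiple of n; otherwise a remainder loop would be needed.

```python
def unroll(body, branch, n):
    """Return the loop text with `body` appearing n times before one branch."""
    return [op for _ in range(n) for op in body] + [branch]


body = [
    "r1 = load(r2)",
    "r3 = load(r4)",
    "r5 = r1 * r3",
    "r6 = r6 + r5",
    "r2 = r2 + 4",
    "r4 = r4 + 4",
]
branch = "if (r4 < 400) goto loop"
unrolled = unroll(body, branch, 3)
print("\n".join(unrolled))
```

Every copy still reads and writes the same registers, which is why unrolling alone does not let the iterations overlap.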

  11. Register Renaming on Unrolled Loop
            Before renaming          After renaming
     loop:
     iter1  r1 = load(r2)            r1 = load(r2)
            r3 = load(r4)            r3 = load(r4)
            r5 = r1 * r3             r5 = r1 * r3
            r6 = r6 + r5             r6 = r6 + r5
            r2 = r2 + 4              r2 = r2 + 4
            r4 = r4 + 4              r4 = r4 + 4
     iter2  r1 = load(r2)            r11 = load(r2)
            r3 = load(r4)            r13 = load(r4)
            r5 = r1 * r3             r15 = r11 * r13
            r6 = r6 + r5             r6 = r6 + r15
            r2 = r2 + 4              r2 = r2 + 4
            r4 = r4 + 4              r4 = r4 + 4
     iter3  r1 = load(r2)            r21 = load(r2)
            r3 = load(r4)            r23 = load(r4)
            r5 = r1 * r3             r25 = r21 * r23
            r6 = r6 + r5             r6 = r6 + r25
            r2 = r2 + 4              r2 = r2 + 4
            r4 = r4 + 4              r4 = r4 + 4
            if (r4 < 400) goto loop  if (r4 < 400) goto loop

  12. Register Renaming is Not Enough!
     ❖ Still not much overlap possible
     ❖ Problems
       » r2, r4, r6 sequentialize the iterations
       » Need to rename these
     ❖ 2 specialized renaming optis
       » Accumulator variable expansion (r6)
       » Induction variable expansion (r2, r4)

     loop:
     iter1  r1 = load(r2)
            r3 = load(r4)
            r5 = r1 * r3
            r6 = r6 + r5
            r2 = r2 + 4
            r4 = r4 + 4
     iter2  r11 = load(r2)
            r13 = load(r4)
            r15 = r11 * r13
            r6 = r6 + r15
            r2 = r2 + 4
            r4 = r4 + 4
     iter3  r21 = load(r2)
            r23 = load(r4)
            r25 = r21 * r23
            r6 = r6 + r25
            r2 = r2 + 4
            r4 = r4 + 4
            if (r4 < 400) goto loop

  13. Accumulator Variable Expansion
     ❖ Accumulator variable
       » x = x + y or x = x - y
       » where y is loop variant!!
     ❖ Create n-1 temporary accumulators
     ❖ Each iteration targets a different accumulator
     ❖ Sum up the accumulator variables at the end
     ❖ May not be safe for floating-point values

            r16 = r26 = 0
     loop:
     iter1  r1 = load(r2)
            r3 = load(r4)
            r5 = r1 * r3
            r6 = r6 + r5
            r2 = r2 + 4
            r4 = r4 + 4
     iter2  r11 = load(r2)
            r13 = load(r4)
            r15 = r11 * r13
            r16 = r16 + r15
            r2 = r2 + 4
            r4 = r4 + 4
     iter3  r21 = load(r2)
            r23 = load(r4)
            r25 = r21 * r23
            r26 = r26 + r25
            r2 = r2 + 4
            r4 = r4 + 4
            if (r4 < 400) goto loop
            r6 = r6 + r16 + r26
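The same transformation can be written at source level as a dot product. This is a sketch under stated assumptions: the function names are invented for the example, the array length is assumed to be a multiple of 3, and the inputs are integers so that reordering the additions cannot change the result.

```python
def dot_single(a, b):
    """One accumulator: every add waits on the previous one."""
    acc = 0
    for i in range(len(a)):
        acc = acc + a[i] * b[i]
    return acc


def dot_expanded(a, b):
    """Unrolled by 3 with one accumulator per copy; assumes len(a) % 3 == 0."""
    acc0 = acc1 = acc2 = 0
    for i in range(0, len(a), 3):
        acc0 = acc0 + a[i] * b[i]          # plays the role of r6
        acc1 = acc1 + a[i + 1] * b[i + 1]  # plays the role of r16
        acc2 = acc2 + a[i + 2] * b[i + 2]  # plays the role of r26
    # Sum up the accumulators after the loop, as on the slide.
    return acc0 + acc1 + acc2


a = [1, 2, 3, 4, 5, 6]
b = [6, 5, 4, 3, 2, 1]
single = dot_single(a, b)
expanded = dot_expanded(a, b)
print(single, expanded)
```

With floating-point inputs the two functions add the products in a different order, so their results may differ in the last bits; that is the safety concern the slide raises.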

  14. Induction Variable Expansion
     ❖ Induction variable
       » x = x + y or x = x - y
       » where y is loop invariant!!
     ❖ Create n-1 additional induction variables
     ❖ Each iteration uses and modifies a different induction variable
     ❖ Initialize induction variables to init, init+step, init+2*step, etc.
     ❖ Step increased to n*original step
     ❖ Now iterations are completely independent!!

            r12 = r2 + 4, r22 = r2 + 8
            r14 = r4 + 4, r24 = r4 + 8
            r16 = r26 = 0
     loop:
     iter1  r1 = load(r2)
            r3 = load(r4)
            r5 = r1 * r3
            r6 = r6 + r5
            r2 = r2 + 12
            r4 = r4 + 12
     iter2  r11 = load(r12)
            r13 = load(r14)
            r15 = r11 * r13
            r16 = r16 + r15
            r12 = r12 + 12
            r14 = r14 + 12
     iter3  r21 = load(r22)
            r23 = load(r24)
            r25 = r21 * r23
            r26 = r26 + r25
            r22 = r22 + 12
            r24 = r24 + 12
            if (r4 < 400) goto loop
            r6 = r6 + r16 + r26
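One way to convince yourself that the expanded induction variables visit the same addresses is a small simulation. The function names and the list-of-addresses output are assumptions for illustration, and the trip count is assumed to be a multiple of n.

```python
def addresses_serial(base, step, trips):
    """One induction variable: each address depends on the previous update."""
    addrs = []
    p = base
    for _ in range(trips):
        addrs.append(p)
        p += step
    return addrs


def addresses_expanded(base, step, trips, n=3):
    """n induction variables, each stepping by n*step; assumes trips % n == 0."""
    # Initialize to init, init+step, init+2*step, ...
    ptrs = [base + k * step for k in range(n)]
    addrs = []
    for _ in range(trips // n):
        for k in range(n):
            addrs.append(ptrs[k])   # copy k uses only its own variable
            ptrs[k] += n * step     # and updates only its own variable
    return addrs


serial = addresses_serial(0, 4, 6)
expanded = addresses_expanded(0, 4, 6)
print(serial, expanded)
```

Because each unrolled copy touches only its own pointer, no copy has to wait for another copy's increment.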

  15. Better Induction Variable Expansion
     ❖ With base+displacement addressing, often don’t need additional induction variables
       » Just change offsets in each iteration to reflect step
       » Change final increments to n * original step

            r16 = r26 = 0
     loop:
     iter1  r1 = load(r2)
            r3 = load(r4)
            r5 = r1 * r3
            r6 = r6 + r5
     iter2  r11 = load(r2+4)
            r13 = load(r4+4)
            r15 = r11 * r13
            r16 = r16 + r15
     iter3  r21 = load(r2+8)
            r23 = load(r4+8)
            r25 = r21 * r23
            r26 = r26 + r25
            r2 = r2 + 12
            r4 = r4 + 12
            if (r4 < 400) goto loop
            r6 = r6 + r16 + r26

  16. Homework Problem
     Original loop:
       loop: r1 = load(r2)
             r5 = r6 + 3
             r6 = r5 + r1
             r2 = r2 + 4
             if (r2 < 400) goto loop

     Unrolled loop:
       loop: r1 = load(r2)
             r5 = r6 + 3
             r6 = r5 + r1
             r2 = r2 + 4
             r1 = load(r2)
             r5 = r6 + 3
             r6 = r5 + r1
             r2 = r2 + 4
             r1 = load(r2)
             r5 = r6 + 3
             r6 = r5 + r1
             r2 = r2 + 4
             if (r2 < 400) goto loop

     Optimize the unrolled loop:
       » Renaming
       » Tree height reduction
       » Ind/Acc expansion

  17. Code Generation
     ❖ Map optimized “machine-independent” assembly to final assembly code
     ❖ Input code
       » Classical optimizations
       » ILP optimizations
       » Formed regions (superblocks, hyperblocks), applied if-conversion (if appropriate)
     ❖ Virtual → physical binding
       » 2 big steps
       » 1. Scheduling
         • Determine when every operation executes
         • Create MultiOps
       » 2. Register allocation
         • Map virtual → physical registers
         • Spill to memory if necessary
