EECS 583 – Class 10 Code Generation University of Michigan October 6, 2014

Announcements
❖ Reminder: HW 2
  » Due this Thursday. You should have started by now.
❖ Class project proposals
  » Think about partners/topic!

Course Project – Time to Start Thinking About This
❖ Mission statement: Design and implement something “interesting” in a compiler
  » LLVM preferred, but others are fine
  » Groups of 2-4 people (1 or 5 persons is possible in some cases)
  » Extend existing research paper or go out on your own
❖ Topic areas
  » Dynamic optimization
  » Approximate computing
  » Memory system optimization
  » Machine learning for compilation
  » Automatic parallelization/SIMDization
  » Compiling for GPU/GPU-like architecture
  » Creating custom processors
  » Reliability
  » Energy

Course Projects – Timetable
❖ Now
  » Start thinking about potential topics, identify group members
❖ Oct 20-22 (week after fall break): Project proposals
  » No class that week
  » Chang-hung and I will meet with each group; slot signups in class Oct 15
  » Ideas/proposal discussed at meeting
  » Written proposal (a paragraph or 2 plus some references) due Monday, Oct 29 from each group
❖ Nov 3 – Dec 3: Research presentations
  » Each group presents a research paper related to their project (20 mins + 5 mins Q&A)
❖ Late Nov: Project checkpoint
  » Update on your progress, what is left to do
❖ Dec 8-12: Project demos
  » Each group, 30 min slot: presentation/demo/whatever you like
  » Turn in short report on your project

Class Problem from Last Time → Answer

Optimize this applying induction variable strength reduction.

Problem, before the loop:
    r1 = 0
    r2 = 0
Problem, loop body:
    r5 = r5 + 1
    r11 = r5 * 2
    r10 = r11 + 2
    r12 = load(r10+0)
    r9 = r1 << 1
    r4 = r9 - 10
    r3 = load(r4+4)
    r3 = r3 + 1
    store(r4+0, r3)
    r7 = r3 << 2
    r6 = load(r7+0)
    r13 = r2 - 1
    r1 = r1 + 1
    r2 = r2 + 1
Liveout: r13, r12, r6, r10

Answer, before the loop:
    r1 = 0
    r2 = 0
    r111 = r5 * 2
    r109 = r1 << 1
    r113 = r2 - 1
Answer, loop body:
    r5 = r5 + 1
    r111 = r111 + 2
    r11 = r111
    r10 = r11 + 2
    r12 = load(r10+0)
    r9 = r109
    r4 = r9 - 10
    r3 = load(r4+4)
    r3 = r3 + 1
    store(r4+0, r3)
    r7 = r3 << 2
    r6 = load(r7+0)
    r13 = r113
    r1 = r1 + 1
    r109 = r109 + 2
    r2 = r2 + 1
    r113 = r113 + 1
Liveout: r13, r12, r6, r10

Note: after copy propagation, r10 and r4 can be strength reduced as well.
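The answer can be sanity-checked by simulating only the register arithmetic of both versions and comparing the values of r11, r9, and r13 on every iteration. The following is a minimal Python sketch, not part of the slides: the function names, the starting value of r5, and the trip count are assumptions, and the loads and stores are left out because they do not affect these three registers.

```python
def original_loop(r5, trips):
    """Arithmetic of the original loop: r11, r9, r13 are recomputed each iteration."""
    r1, r2 = 0, 0
    trace = []
    for _ in range(trips):
        r5 = r5 + 1
        r11 = r5 * 2
        r9 = r1 << 1
        r13 = r2 - 1
        trace.append((r11, r9, r13))
        r1 = r1 + 1
        r2 = r2 + 1
    return trace


def strength_reduced_loop(r5, trips):
    """Same loop after strength reduction: new registers r111, r109, r113
    are initialized before the loop and bumped by a constant inside it."""
    r1, r2 = 0, 0
    r111 = r5 * 2
    r109 = r1 << 1
    r113 = r2 - 1
    trace = []
    for _ in range(trips):
        r5 = r5 + 1
        r111 = r111 + 2
        r11 = r111
        r9 = r109
        r13 = r113
        trace.append((r11, r9, r13))
        r1 = r1 + 1
        r109 = r109 + 2
        r2 = r2 + 1
        r113 = r113 + 1
    return trace


# Hypothetical inputs: any starting r5 and trip count should agree.
same = original_loop(3, 5) == strength_reduced_loop(3, 5)
```

The multiply and the shift disappear from the loop body; each is replaced by an add on a register that tracks the same value.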

ILP Optimization
❖ Traditional optimizations
  » Redundancy elimination
  » Reducing operation count
❖ ILP (instruction-level parallelism) optimizations
  » Increase the amount of parallelism and the ability to overlap operations
  » Operation count is secondary; often trade extra instructions for parallelism (but avoid code explosion)
❖ ILP increased by breaking dependences
  » True or flow = read after write dependence
  » False (anti/output) = write after read, write after write
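The three dependence kinds can be told apart mechanically from each operation's destination and source registers. Below is a minimal Python sketch; the `(dest, sources)` tuple encoding and the function name are assumptions made for illustration, and the sample operations are taken from the dot-product loop used later in this lecture.

```python
def dependences(first, second):
    """Classify register dependences from `first` to a later op `second`.

    Each op is (dest, set_of_source_registers).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:
        kinds.add("flow")    # read after write (true dependence)
    if dest2 in srcs1:
        kinds.add("anti")    # write after read (false dependence)
    if dest1 == dest2:
        kinds.add("output")  # write after write (false dependence)
    return kinds


mul = ("r5", {"r1", "r3"})     # r5 = r1 * r3
acc = ("r6", {"r6", "r5"})     # r6 = r6 + r5
load_next = ("r1", {"r2"})     # r1 = load(r2), next iteration

flow_example = dependences(mul, acc)
anti_example = dependences(mul, load_next)
output_example = dependences(load_next, load_next)
```

Only the flow dependence reflects real data movement; the anti and output dependences come from reusing a register name, which is why renaming can remove them.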

Back Substitution
❖ Generation of expressions by compiler frontends is very sequential
  » Account for operator precedence
  » Apply left-to-right within same precedence
❖ Back substitution
  » Create larger expressions
    • Iteratively substitute RHS expression for LHS variable
  » Note: may correspond to multiple source statements
  » Enable subsequent optimizations
❖ Optimization
  » Re-compute expression in a more favorable manner

Example: y = a + b + c - d + e - f;
    r9 = r1 + r2
    r10 = r9 + r3
    r11 = r10 - r4
    r12 = r11 + r5
    r13 = r12 - r6

Subs r12: r13 = r11 + r5 - r6
Subs r11: r13 = r10 - r4 + r5 - r6
Subs r10: r13 = r9 + r3 - r4 + r5 - r6
Subs r9:  r13 = r1 + r2 + r3 - r4 + r5 - r6
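The substitution steps can be sketched in a few lines of Python. This is an illustrative model only: the `(dest, left, op, right)` encoding and the function names are assumptions, and it handles just `+` and `-` on temporaries that are each defined once.

```python
def back_substitute(ops, target):
    """Expand `target` into a flat list of (sign, operand) terms.

    ops: list of (dest, left, op, right) with op in {'+', '-'}.
    """
    defs = {dest: (left, op, right) for dest, left, op, right in ops}

    def expand(name, sign):
        if name not in defs:          # a leaf operand, not a temporary
            return [(sign, name)]
        left, op, right = defs[name]
        right_sign = sign if op == "+" else -sign
        return expand(left, sign) + expand(right, right_sign)

    return expand(target, +1)


def show(terms):
    """Render signed terms as a flat expression string."""
    text = ""
    for i, (sign, name) in enumerate(terms):
        if i == 0:
            text += name if sign > 0 else "-" + name
        else:
            text += (" + " if sign > 0 else " - ") + name
    return text


OPS = [
    ("r9", "r1", "+", "r2"),
    ("r10", "r9", "+", "r3"),
    ("r11", "r10", "-", "r4"),
    ("r12", "r11", "+", "r5"),
    ("r13", "r12", "-", "r6"),
]
flat = show(back_substitute(OPS, "r13"))
# flat == "r1 + r2 + r3 - r4 + r5 - r6"
```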

Tree Height Reduction
❖ Re-compute expression as a balanced binary tree
  » Obey precedence rules
  » Essentially re-parenthesize
  » Combine literals if possible
❖ Effects
  » Height reduced (n terms, assuming unit latency)
    • From n-1
    • To ceil(log2(n))
  » Number of operations remains constant
  » Cost
    • Temporary registers “live” longer
  » Watch out for
    • Always ok for integer arithmetic
    • Floating-point: may not be!!

Original:
    r9 = r1 + r2
    r10 = r9 + r3
    r11 = r10 - r4
    r12 = r11 + r5
    r13 = r12 - r6

After back substitution:
    r13 = r1 + r2 + r3 - r4 + r5 - r6

Balanced tree:
    r13 = ((r1 + r2) + (r3 - r4)) + (r5 - r6)

Final code:
    t1 = r1 + r2
    t2 = r3 - r4
    t3 = r5 - r6
    t4 = t1 + t2
    r13 = t4 + t3
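One simple way to build the balanced tree is to pair adjacent terms level by level. The Python sketch below does this for signed integer terms; the function name and the `(sign, name)` encoding are assumptions, and it ignores operand arrival times and non-unit latencies.

```python
import math


def tree_height_reduce(terms, dest):
    """Pair up signed terms level by level; return (code, height).

    terms: list of (sign, name) with sign +1 or -1 (integer arithmetic assumed).
    At least one term must be positive so the final result needs no negation.
    """
    if not any(sign > 0 for sign, _ in terms):
        raise ValueError("sketch needs at least one positive term")
    code, level, height, counter = [], list(terms), 0, 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (s1, a), (s2, b) = level[i], level[i + 1]
            counter += 1
            # The very last operation writes the real destination.
            t = dest if len(level) == 2 else f"t{counter}"
            if s1 > 0 and s2 > 0:
                code.append(f"{t} = {a} + {b}")
                sign = +1
            elif s1 > 0:
                code.append(f"{t} = {a} - {b}")
                sign = +1
            elif s2 > 0:
                code.append(f"{t} = {b} - {a}")
                sign = +1
            else:
                code.append(f"{t} = {a} + {b}")  # -(a + b), keep sign negative
                sign = -1
            nxt.append((sign, t))
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd term waits for the next level
        level = nxt
        height += 1
    return code, height


TERMS = [(1, "r1"), (1, "r2"), (1, "r3"), (-1, "r4"), (1, "r5"), (-1, "r6")]
code, height = tree_height_reduce(TERMS, "r13")
# height == 3 == ceil(log2(6)); the sequential chain had height 5
```

For the six-term example this reproduces the final code on the slide, with three levels instead of five.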

Class Problem

Assume latencies: + = 1, * = 3

Operand arrival times:
    r1 = 0, r2 = 0, r3 = 0, r4 = 1, r5 = 2, r6 = 0

    r10 = r1 * r2
    r11 = r10 + r3
    r12 = r11 + r4
    r13 = r12 - r5
    r14 = r13 + r6

❖ Back substitute
❖ Re-express in tree-height reduced form
❖ Account for latency and arrival times
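When latencies and arrival times differ, a balanced tree is no longer automatically best. One plausible heuristic (an assumption here, not necessarily the intended class answer) is to keep the multiply as a single term and then repeatedly combine the two terms that become ready earliest. The Python sketch below only computes finish times; the function names are hypothetical, and subtraction is assumed to have the same latency as addition.

```python
import heapq


def chain_finish_time(first_ready, other_arrivals, add_latency=1):
    """Finish time of the original left-to-right chain of adds/subtracts."""
    t = first_ready
    for arrival in other_arrivals:
        t = max(t, arrival) + add_latency
    return t


def greedy_finish_time(ready_times, add_latency=1):
    """Always combine the two earliest-ready terms (Huffman-like greedy)."""
    heap = list(ready_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + add_latency)
    return heap[0]


# r1 * r2: both operands arrive at 0, multiply latency 3 -> ready at time 3.
product_ready = max(0, 0) + 3
arrivals = [0, 1, 2, 0]  # r3, r4, r5, r6

sequential = chain_finish_time(product_ready, arrivals)        # 7
reduced = greedy_finish_time([product_ready] + arrivals)       # 4
```

Under these assumptions the original chain finishes at time 7, while combining r3, r6, r4, and r5 while the multiply is in flight finishes at time 4, which is also a lower bound since the product is ready at 3 and still needs one add.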

Optimizing Unrolled Loops

Unroll = replicate loop body n-1 times. Hope to enable overlap of operation execution from different iterations. Not possible!

Original loop:
    loop: r1 = load(r2)
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 4
          r4 = r4 + 4
          if (r4 < 400) goto loop

Unrolled 3 times:
    loop: r1 = load(r2)        iter1
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 4
          r4 = r4 + 4
          r1 = load(r2)        iter2
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 4
          r4 = r4 + 4
          r1 = load(r2)        iter3
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 4
          r4 = r4 + 4
          if (r4 < 400) goto loop

Register Renaming on Unrolled Loop

            Before renaming                 After renaming
    iter1   loop: r1 = load(r2)             loop: r1 = load(r2)
                  r3 = load(r4)                   r3 = load(r4)
                  r5 = r1 * r3                    r5 = r1 * r3
                  r6 = r6 + r5                    r6 = r6 + r5
                  r2 = r2 + 4                     r2 = r2 + 4
                  r4 = r4 + 4                     r4 = r4 + 4
    iter2         r1 = load(r2)                   r11 = load(r2)
                  r3 = load(r4)                   r13 = load(r4)
                  r5 = r1 * r3                    r15 = r11 * r13
                  r6 = r6 + r5                    r6 = r6 + r15
                  r2 = r2 + 4                     r2 = r2 + 4
                  r4 = r4 + 4                     r4 = r4 + 4
    iter3         r1 = load(r2)                   r21 = load(r2)
                  r3 = load(r4)                   r23 = load(r4)
                  r5 = r1 * r3                    r25 = r21 * r23
                  r6 = r6 + r5                    r6 = r6 + r25
                  r2 = r2 + 4                     r2 = r2 + 4
                  r4 = r4 + 4                     r4 = r4 + 4
                  if (r4 < 400) goto loop         if (r4 < 400) goto loop

Register Renaming is Not Enough!
❖ Still not much overlap possible
❖ Problems
  » r2, r4, r6 sequentialize the iterations
  » Need to rename these
❖ 2 specialized renaming optimizations
  » Accumulator variable expansion (r6)
  » Induction variable expansion (r2, r4)

    loop: r1 = load(r2)        iter1
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 4
          r4 = r4 + 4
          r11 = load(r2)       iter2
          r13 = load(r4)
          r15 = r11 * r13
          r6 = r6 + r15
          r2 = r2 + 4
          r4 = r4 + 4
          r21 = load(r2)       iter3
          r23 = load(r4)
          r25 = r21 * r23
          r6 = r6 + r25
          r2 = r2 + 4
          r4 = r4 + 4
          if (r4 < 400) goto loop

Accumulator Variable Expansion
❖ Accumulator variable
  » x = x + y or x = x - y
  » where y is loop variant!!
❖ Create n-1 temporary accumulators
❖ Each iteration targets a different accumulator
❖ Sum up the accumulator variables at the end
❖ May not be safe for floating-point values

          r16 = r26 = 0
    loop: r1 = load(r2)        iter1
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 4
          r4 = r4 + 4
          r11 = load(r2)       iter2
          r13 = load(r4)
          r15 = r11 * r13
          r16 = r16 + r15
          r2 = r2 + 4
          r4 = r4 + 4
          r21 = load(r2)       iter3
          r23 = load(r4)
          r25 = r21 * r23
          r26 = r26 + r25
          r2 = r2 + 4
          r4 = r4 + 4
          if (r4 < 400) goto loop
          r6 = r6 + r16 + r26
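The same transformation can be written at source level. The Python sketch below is illustrative only: the function names are hypothetical, and the expanded version assumes the array length is a multiple of 3. It also shows why the slide warns about floating point: splitting the sum changes the order of additions, and floating-point addition is not associative.

```python
def dot_single(a, b):
    """One accumulator: every add depends on the previous one."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_expanded(a, b):
    """Unrolled by 3 with three independent accumulators.

    Assumes len(a) is a multiple of 3 (no remainder loop in this sketch).
    """
    acc0 = acc1 = acc2 = 0
    for i in range(0, len(a), 3):
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
    return acc0 + acc1 + acc2  # sum the accumulators after the loop


# Integers: both orders give the same result.
ints_a = [1, 2, 3, 4, 5, 6]
ints_b = [6, 5, 4, 3, 2, 1]
int_match = dot_single(ints_a, ints_b) == dot_expanded(ints_a, ints_b)

# Floats: the reordering changes the rounding, so the results differ.
flt_a = [1e16, 1.0, 0.0, -1e16, 0.0, 0.0]
flt_b = [1.0] * 6
flt_single = dot_single(flt_a, flt_b)      # 0.0 (the 1.0 is absorbed by 1e16)
flt_expanded = dot_expanded(flt_a, flt_b)  # 1.0
```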

Induction Variable Expansion
❖ Induction variable
  » x = x + y or x = x - y
  » where y is loop invariant!!
❖ Create n-1 additional induction variables
❖ Each iteration uses and modifies a different induction variable
❖ Initialize induction variables to init, init+step, init+2*step, etc.
❖ Step increased to n * original step
❖ Now iterations are completely independent!!

          r12 = r2 + 4, r22 = r2 + 8
          r14 = r4 + 4, r24 = r4 + 8
          r16 = r26 = 0
    loop: r1 = load(r2)        iter1
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r2 = r2 + 12
          r4 = r4 + 12
          r11 = load(r12)      iter2
          r13 = load(r14)
          r15 = r11 * r13
          r16 = r16 + r15
          r12 = r12 + 12
          r14 = r14 + 12
          r21 = load(r22)      iter3
          r23 = load(r24)
          r25 = r21 * r23
          r26 = r26 + r25
          r22 = r22 + 12
          r24 = r24 + 12
          if (r4 < 400) goto loop
          r6 = r6 + r16 + r26

Better Induction Variable Expansion
❖ With base+displacement addressing, often don’t need additional induction variables
  » Just change offsets in each iteration to reflect step
  » Change final increments to n * original step

          r16 = r26 = 0
    loop: r1 = load(r2)        iter1
          r3 = load(r4)
          r5 = r1 * r3
          r6 = r6 + r5
          r11 = load(r2+4)     iter2
          r13 = load(r4+4)
          r15 = r11 * r13
          r16 = r16 + r15
          r21 = load(r2+8)     iter3
          r23 = load(r4+8)
          r25 = r21 * r23
          r26 = r26 + r25
          r2 = r2 + 12
          r4 = r4 + 12
          if (r4 < 400) goto loop
          r6 = r6 + r16 + r26
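A small simulation can confirm that the base+displacement version computes the same sum as the rolled loop. The Python sketch below models memory as a dictionary from byte address to value; the function names, the addresses, and the data are assumptions, and the trip count is chosen to be a multiple of 3 because the unrolled loop has no remainder handling.

```python
def dot_rolled(mem, r2, r4, limit):
    """Original loop: one element per trip, pointers step by 4."""
    r6 = 0
    while True:
        r6 += mem[r2] * mem[r4]
        r2 += 4
        r4 += 4
        if not (r4 < limit):
            break
    return r6


def dot_unrolled(mem, r2, r4, limit):
    """Unrolled by 3: displacements 0/4/8, one pointer bump of 12 per trip,
    and three accumulators summed after the loop."""
    r6 = r16 = r26 = 0
    while True:
        r6 += mem[r2] * mem[r4]
        r16 += mem[r2 + 4] * mem[r4 + 4]
        r26 += mem[r2 + 8] * mem[r4 + 8]
        r2 += 12
        r4 += 12
        if not (r4 < limit):
            break
    return r6 + r16 + r26


# Hypothetical layout: 12 words at address 0 (walked by r4)
# and 12 words at address 1000 (walked by r2).
mem = {}
for i in range(12):
    mem[4 * i] = i + 1
    mem[1000 + 4 * i] = 2 * i + 1

rolled = dot_rolled(mem, 1000, 0, 48)
unrolled = dot_unrolled(mem, 1000, 0, 48)
```

Within one trip of the unrolled loop, no load depends on a pointer update, so all six loads and three multiplies are free to be scheduled together.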

Homework Problem

Original loop:
    loop: r1 = load(r2)
          r5 = r6 + 3
          r6 = r5 + r1
          r2 = r2 + 4
          if (r2 < 400) goto loop

Unrolled loop:
    loop: r1 = load(r2)
          r5 = r6 + 3
          r6 = r5 + r1
          r2 = r2 + 4
          r1 = load(r2)
          r5 = r6 + 3
          r6 = r5 + r1
          r2 = r2 + 4
          r1 = load(r2)
          r5 = r6 + 3
          r6 = r5 + r1
          r2 = r2 + 4
          if (r2 < 400) goto loop

Optimize the unrolled loop:
❖ Renaming
❖ Tree height reduction
❖ Ind/Acc expansion

Code Generation
❖ Map optimized “machine-independent” assembly to final assembly code
❖ Input code
  » Classical optimizations
  » ILP optimizations
  » Formed regions (superblocks, hyperblocks), applied if-conversion (if appropriate)
❖ Virtual → physical binding
  » 2 big steps
  » 1. Scheduling
    • Determine when every operation executes
    • Create MultiOps
  » 2. Register allocation
    • Map virtual → physical registers
    • Spill to memory if necessary
