ILP: COMPILER-BASED TECHNIQUES
Mahdi Nazm Bojnordi
Assistant Professor
School of Computing
University of Utah
CS/ECE 6810: Computer Architecture

Overview
- Announcements
  - Homework 2 submission deadline: Feb. 13th
  - Homework 1 solutions will be released soon
- This lecture
  - Program execution
  - Loop optimization
  - Superscalar pipelines
  - Software pipelining

Big Picture
- Goal: improving performance
  - Performance = (IPC x F) / IC
- Increasing IPC:
  1. Improve ILP
  2. Exploit more ILP
- Increasing F:
  1. Deeper pipeline
  2. Faster technology
[Figure: system layers labelled Software (ILP and IC), Code gen., Architecture, Circuit/Device, and Hardware (IPC), above the pipeline stages Inst. Fetch, Inst. Decode, Execute, Memory Access, Write back]
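
The performance relation on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names `performance` and `execution_time` and the sample numbers (IPC, clock frequency, instruction count) are made-up assumptions, not values from the lecture.

```python
def performance(ipc, freq_hz, inst_count):
    """Programs completed per second: (IPC x F) / IC."""
    return (ipc * freq_hz) / inst_count


def execution_time(ipc, freq_hz, inst_count):
    """Seconds per program: the reciprocal of performance."""
    return inst_count / (ipc * freq_hz)


# Hypothetical workload: 1e9 instructions on a 2 GHz clock.
base = performance(ipc=0.5, freq_hz=2e9, inst_count=1e9)
better_ipc = performance(ipc=1.0, freq_hz=2e9, inst_count=1e9)
print(better_ipc / base)  # ratio of the two performance values
```

With frequency and instruction count held fixed, performance scales linearly with IPC, which is why the rest of the lecture concentrates on removing stall cycles.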

Big Picture
- Goal: improving performance
- Architectural techniques:
  - Deep pipelining
    - Ideal speedup = n times
  - Exploiting ILP
    - Dynamic scheduling (HW)
    - Static scheduling (SW)
[Figure: system layers labelled Software (ILP and IC) and Hardware (IPC), above the pipeline stages Inst. Fetch, Inst. Decode, Execute, Memory Access, Write back]

Processor Pipeline
- Necessary stall cycles between dependent instructions

    Producer   Consumer   Stalls
    Load       Any        1
    fp.ALU     Any        3
    fp.ALU     Store      2
    int.ALU    Branch     1

Program
- Loop book-keeping overheads
- Goal: adding s to all of the array elements

    do {
        m[i] = m[i] + s;
        i = i - 1;
    } while(i>0)

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

[Figure: array m with elements 0, 1, 2, …, 999, and scalar s]

    Producer   Consumer   Stalls
    Load       Any        1
    fp.ALU     Any        3
    fp.ALU     Store      2
    int.ALU    Branch     1

Execution Schedule
- Diverse impact of stall cycles on performance

Original loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Schedule 1:

    Loop: L.D    F0, 0(R1)
          stall
          ADD.D  F4, F0, F2
          stall
          stall
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall
          BNE    R1, R2, Loop
          stall

Schedule 1:
- 5 stall cycles
- 3 loop body instructions
- 2 loop counter instructions

    Producer   Consumer   Stalls
    Load       Any        1
    fp.ALU     Any        3
    fp.ALU     Store      2
    int.ALU    Branch     1

Loop Optimization

Loop Optimization
- Re-ordering and changing immediate values

Schedule 1:

    Loop: L.D    F0, 0(R1)
          stall
          ADD.D  F4, F0, F2
          stall
          stall
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall
          BNE    R1, R2, Loop
          stall

- 5 stall cycles
- 3 loop body instructions
- 2 loop counter instructions

Schedule 2:

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          stall
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)

- 1 stall cycle
- 3 loop body instructions
- 2 loop counter instructions
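
The stall counts of Schedule 1 and Schedule 2 can be reproduced with a small in-order issue model. This is a sketch under stated assumptions, not the lecture's own tool: the stall table is taken from the "Processor Pipeline" slide; one extra stall is charged when a branch is the last instruction of the loop body (inferred from the stall after BNE in Schedule 1, and assumed to be filled when an instruction follows the branch); integer results are assumed to reach non-branch consumers without a stall. The names `STALLS`, `stalls_between`, `count_stalls` and the tuple encoding of instructions are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer
# (from the "Processor Pipeline" table). "any" is the fallback consumer.
STALLS = {
    ("load", "any"): 1,
    ("fp", "any"): 3,
    ("fp", "store"): 2,
    ("int", "branch"): 1,
}


def stalls_between(producer, consumer):
    """Look up the specific pair first, then the producer's 'any' entry."""
    if (producer, consumer) in STALLS:
        return STALLS[(producer, consumer)]
    return STALLS.get((producer, "any"), 0)


def count_stalls(schedule):
    """Count stall cycles for one loop iteration issued in program order.

    Each instruction is (kind, dest_register_or_None, source_registers).
    """
    produced = {}  # register -> (issue cycle, kind) of its latest writer
    cycle = 0
    stalls = 0
    for kind, dest, sources in schedule:
        earliest = cycle
        for reg in sources:
            if reg in produced:
                issued, producer = produced[reg]
                needed = issued + 1 + stalls_between(producer, kind)
                earliest = max(earliest, needed)
        stalls += earliest - cycle
        if dest is not None:
            produced[dest] = (earliest, kind)
        cycle = earliest + 1
    # Assumption: a branch with nothing after it leaves its delay slot empty.
    if schedule[-1][0] == "branch":
        stalls += 1
    return stalls


schedule_1 = [
    ("load", "F0", ["R1"]),          # L.D    F0, 0(R1)
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4, F0, F2
    ("store", None, ["F4", "R1"]),   # S.D    F4, 0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1, R1, #-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1, R2, Loop
]
schedule_2 = [
    ("load", "F0", ["R1"]),          # L.D    F0, 0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1, R1, #-8
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4, F0, F2
    ("branch", None, ["R1", "R2"]),  # BNE    R1, R2, Loop
    ("store", None, ["F4", "R1"]),   # S.D    F4, 8(R1)
]
print(count_stalls(schedule_1), count_stalls(schedule_2))  # prints: 5 1
```

The model places the single stall of Schedule 2 just before the store rather than before the branch as drawn on the slide; the total number of stall cycles is the same.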

Loop Unrolling
- Reducing loop overhead by unrolling
- Goal: adding s to all of the array elements

    do {
        m[i-0] = m[i-0] + s;
        m[i-1] = m[i-1] + s;
        m[i-2] = m[i-2] + s;
        m[i-3] = m[i-3] + s;
        i = i-4;
    } while(i>0)

Schedule 2 (before unrolling):

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          stall
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)

- 1 stall cycle
- 3 loop body instructions
- 2 loop counter instructions

Unrolled loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

[Figure: array m with elements 0, 1, 2, …, 999, and scalar s]
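
The source-level effect of unrolling by four can be sketched in Python. The function names, the 0-based downward index, and the cleanup loop for lengths that are not a multiple of four are assumptions added for illustration; the slide's array has 1000 elements, so the cleanup loop would not run there.

```python
def add_scalar(m, s):
    """Original loop: one element and one counter update per iteration."""
    i = len(m) - 1
    while i >= 0:
        m[i] = m[i] + s
        i = i - 1
    return m


def add_scalar_unrolled4(m, s):
    """Unrolled by four: four elements per counter update and branch."""
    i = len(m) - 1
    while i >= 3:
        m[i - 0] = m[i - 0] + s
        m[i - 1] = m[i - 1] + s
        m[i - 2] = m[i - 2] + s
        m[i - 3] = m[i - 3] + s
        i = i - 4
    # Cleanup (assumption): handles lengths that are not a multiple of 4.
    while i >= 0:
        m[i] = m[i] + s
        i = i - 1
    return m


data = [float(x) for x in range(1000)]
same = add_scalar(list(data), 2.5) == add_scalar_unrolled4(list(data), 2.5)
print(same)
```

Both versions do the same work on the array; the unrolled one executes the counter update and the loop test a quarter as often.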

Loop Unrolling
- Reducing loop overhead by unrolling

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

Schedule 3:
- 14 stall cycles
- 12 loop body instructions
- 2 loop counter instructions

Instruction Reordering
- Eliminating stall cycles by unrolling and scheduling

Unrolled loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

Unrolled and scheduled loop:

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)

IPC Limit
- Eliminating stall cycles by unrolling and scheduling

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)

Schedule 4:
- 0 stall cycles
- 12 loop body instructions
- 2 loop counter instructions

+ IPC = 1
- more instructions
- more registers

IPC > 1?
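
Schedules 1 to 4 can be compared by cycles per array element, using only the instruction and stall counts stated on the slides. The sketch assumes every instruction issues in one cycle and every stall costs one cycle; `SCHEDULES` and `summarize` are hypothetical names.

```python
# (instructions per loop iteration, stall cycles, array elements per iteration)
SCHEDULES = {
    "Schedule 1": (5, 5, 1),
    "Schedule 2": (5, 1, 1),
    "Schedule 3": (14, 14, 4),
    "Schedule 4": (14, 0, 4),
}


def summarize(instructions, stalls, elements):
    """Return (cycles per element, IPC) for one loop iteration."""
    cycles = instructions + stalls
    return cycles / elements, instructions / cycles


for name, counts in SCHEDULES.items():
    cycles_per_element, ipc = summarize(*counts)
    print(f"{name}: {cycles_per_element:.1f} cycles/element, IPC = {ipc:.2f}")
```

Under these assumptions, unrolling alone (Schedule 3) costs 7 cycles per element, which is worse than the reordered single-iteration loop (Schedule 2, 6 cycles); only unrolling combined with scheduling (Schedule 4) reaches 3.5 cycles per element and IPC = 1.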

Summary of Scalar Pipelines
- Upper bound on throughput
  - IPC <= 1
- Unified pipeline for all functional units
  - Underutilized resources
- Inefficient freeze policy
  - A stall cycle delays all the following cycles
- Pipeline hazards
  - Stall cycles result in limited throughput

Superscalar Pipelines

Superscalar Pipelines
- Separate integer and floating point pipelines
  - An instruction packet is fetched every cycle
    - Very large instruction word (VLIW)
  - Inst. packet has one fp. slot and one int. slot
  - Compiler's job is to find instructions for the slots
  - IPC <= 2
[Figure: integer pipeline with stages i.IF, i.ID, i.EX, i.MA, i.WB; floating-point pipeline with stages fp.IF, fp.ID, fp.EX, fp.WB]

Superscalar Pipelines
- Forming instruction packets

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)

- Floating-point operations: the four ADD.D instructions

Superscalar Pipelines
- Ideally, the number of empty slots is zero

          int. slot                fp. slot
    Loop: L.D    F0, 0(R1)         NOP
          L.D    F6, -8(R1)        NOP
          L.D    F10, -16(R1)      ADD.D F4, F0, F2
          L.D    F14, -24(R1)      ADD.D F8, F6, F2
          DADDUI R1, R1, #-32      ADD.D F12, F10, F2
          S.D    F4, 32(R1)        ADD.D F16, F14, F2
          S.D    F8, 24(R1)        NOP
          S.D    F12, 16(R1)       NOP
          BNE    R1, R2, Loop      NOP
          S.D    F16, 8(R1)        NOP

Schedule 5:
- 0 stall cycles
- 8 loop body packets
- 2 loop overhead cycles
- IPC = 1.4
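
The IPC = 1.4 figure for Schedule 5 follows from counting the slots that hold a real instruction. The sketch below transcribes the ten packets from the slide and assumes one packet issues per cycle; `PACKETS`, `packet_ipc` and `empty_slots` are hypothetical names.

```python
# Each packet is (int. slot, fp. slot), transcribed from Schedule 5.
PACKETS = [
    ("L.D F0, 0(R1)", "NOP"),
    ("L.D F6, -8(R1)", "NOP"),
    ("L.D F10, -16(R1)", "ADD.D F4, F0, F2"),
    ("L.D F14, -24(R1)", "ADD.D F8, F6, F2"),
    ("DADDUI R1, R1, #-32", "ADD.D F12, F10, F2"),
    ("S.D F4, 32(R1)", "ADD.D F16, F14, F2"),
    ("S.D F8, 24(R1)", "NOP"),
    ("S.D F12, 16(R1)", "NOP"),
    ("BNE R1, R2, Loop", "NOP"),
    ("S.D F16, 8(R1)", "NOP"),
]


def packet_ipc(packets):
    """Useful instructions issued per cycle, one packet per cycle."""
    issued = sum(op != "NOP" for packet in packets for op in packet)
    return issued / len(packets)


def empty_slots(packets):
    """Number of slots the compiler could not fill."""
    return sum(op == "NOP" for packet in packets for op in packet)


print(packet_ipc(PACKETS), empty_slots(PACKETS))  # prints: 1.4 6
```

Fourteen instructions in ten packets give IPC = 1.4; the six NOPs are the gap between this schedule and the ideal of zero empty slots (IPC = 2).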

Software Pipelining

Software Pipelining
[Figure: Iter. 1 and Iter. 2, each executing LD, ADD, SD, ADDI, BNE]

    Loop: L.D    F0, 0(R1)
          stall
          ADD.D  F4, F0, F2
          stall
          stall
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall
          BNE    R1, R2, Loop
          stall

Software Pipelining
[Figure: Iter. 1 to Iter. 6, each executing LD, ADD, SD, ADDI, BNE; the new loop body takes SD from iteration 1, ADD from iteration 2, and LD from iteration 3]

    Loop: S.D    F4, 0(R1)         SD (1)
          ADD.D  F4, F0, F2        ADD (2)
          L.D    F0, -16(R1)       LD (3)
          DADDUI R1, R1, #-8       ADDI
          BNE    R1, R2, Loop      BNE

Software Pipelining
[Figure: Iter. 1 to Iter. 6, each executing LD, ADD, SD, ADDI, BNE]
- Prologue and Epilogue?
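
One way to see why a prologue and an epilogue are needed is to write the software-pipelined loop out at source level. The following is a sketch of one possible arrangement, not the lecture's answer: the kernel mirrors the SD / ADD / LD order of the pipelined loop body, the variables `f0` and `f4` stand in for registers F0 and F4, and the function name and the fallback for arrays shorter than two elements are assumptions.

```python
def add_scalar_pipelined(m, s):
    """Add s to every element, overlapping three iterations in the kernel."""
    n = len(m)
    if n < 2:
        # Assumption: too short to overlap iterations, use the plain loop.
        return [x + s for x in m]
    out = list(m)

    # Prologue: start the first two iterations so the kernel has work in flight.
    f0 = out[n - 1]      # LD  for element n-1
    f4 = f0 + s          # ADD for element n-1
    f0 = out[n - 2]      # LD  for element n-2

    # Kernel: each pass finishes one element and advances two others.
    for i in range(n - 1, 1, -1):
        out[i] = f4      # SD  for element i
        f4 = f0 + s      # ADD for element i-1
        f0 = out[i - 2]  # LD  for element i-2

    # Epilogue: drain the two iterations still in flight.
    out[1] = f4          # SD  for element 1
    f4 = f0 + s          # ADD for element 0
    out[0] = f4          # SD  for element 0
    return out


print(add_scalar_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 0.5))
```

The prologue issues the loads and the add that the first kernel pass depends on, and the epilogue completes the stores that the last kernel pass leaves unfinished; without them the first and last elements would be wrong.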
