CS356 : Discussion #13 Review for Final Exam Illustrations from - PowerPoint PPT Presentation

CS356 : Discussion #13 Review for Final Exam Illustrations from CS:APP3e textbook

Processor Organization

Pipeline Hazards: Stalling and Forwarding Stalling Forwarding

Structural Hazard: Load for next instruction ld 8(%rdx), %rax add %rax, %rcx While ld is saving %rdx into a register (phase M), add is already using its input to compute a result in phase E. Forwarding is not enough! We need the output of D-Cache, not the input... ● Use stalling and forwarding together . ● add is stalled by 1 phase ○ ld passes back the new value of %rdx during phase WB ○

2-way Very Large Instruction Word Machine No forwarding between instructions of an “issue packet” ● Full forwarding to instructions behind in the pipeline ● ● Stall 1 cycle at “load for next instruction”

2-way VLIW Machine: Scheduling Example Unoptimized Schedule (no gain wrt single pipeline) void incr5 ( int *a, int n) { for (; n != 0; n--, a++) === INTEGER SLOT === === LD/ST SLOT === *a += 5; ld 0(%rdi), %r9 } add $-1 , %esi add $5 , %r9 incr5: st %r9 , 0(%rdi) .L1: add $4 , %rdi ld 0(%rdi), %r9 jne $0 , %esi, .L1 // nop required here add $5 , %r9 st %r9 , 0(%rdi) add $4 , %rdi Optimized Schedule (move up increase of si / di ) add $-1 , %esi jne $0 , %esi, .L1 === INTEGER SLOT === === LD/ST SLOT === add $-1 , %esi ld 0(%rdi), %r9 add $4 , %rdi add $5 , %r9 jne $0 , %esi, .L1 st %r9 , -4(%rdi) From 6/6 = 1 instructions per cycle to 6/4 = 1.5

Loop Unrolling Sometimes we don’t have enough instruction for parallel pipelines. Idea: copy body k times and iterate only n / k times (assume n multiple of k ) ● Different copies of body can run in parallel. void incr5 ( int *a, int n) { incr5: old-incr5: for (; n != 0; n-= 4, a+=4) { .L1: .L1: *a += 5; 0 ld 0(%rdi), %r9 0 ld 0(%rdi), %r9 *(a+1) += 5; 0 add $5 , %r9 0 add $5 , %r9 *(a+2) += 5; 0 st %r9 , 0(%rdi) 0 st %r9 , 0(%rdi) *(a+3) += 5; 1 ld 4(%rdi), %r9 add $4 , %rdi } 1 add $5 , %r9 add $-1 , %esi } 1 st %r9 , 4(%rdi) jne $0 , %esi, .L1 2 ld 8(%rdi), %r9 2 add $5 , %r9 Still can’t run in parallel: all 2 st %r9 , 8(%rdi) copies use the register %r9 3 ld 12(%rdi), %r9 3 add $5 , %r9 ⇒ Read-After-Write (RAW) 3 st %r9 , 12(%rdi) ⇒ Register renaming add $16 , %rdi add $-4 , %esi jne $0 , %esi, .L1

Loop Unrolling and Register Renaming Optimized Schedule incr5: === INTEGER SLOT === === LD/ST SLOT === .L1: ld 0(%rdi), %r9 0 ld 0(%rdi), %r9 add $-4 , %esi ld 4(%rdi), %r10 0 add $5 , %r9 add $5 , %r9 ld 8(%rdi), %r11 0 st %r9 , 0(%rdi) add $5 , %r10 ld 12(%rdi), %r12 1 ld 4(%rdi), %r10 add $5 , %r11 st %r9 , 0(%rdi) 1 add $5 , %r10 add $5 , %r12 st %r10 , 4(%rdi) 1 st %r10 , 4(%rdi) add $16 , %rdi st %r11 , 8(%rdi) 2 ld 8(%rdi), %r11 jne $0 , %esi, .L1 st %r12 , -4(%rdi) 2 add $5 , %r11 2 st %r11 , 8(%rdi) 3 ld 12(%rdi), %r12 3 add $5 , %r12 3 st %r12 , 12(%rdi) IPC = 15/8 add $16 , %rdi add $-4 , %esi jne $0 , %esi, .L1

Exercise: 2-way VLIW Scheduling Unoptimized Schedule void f1 ( int *A, int *B, int N) { for ( ; N != 0 ; A--, B--, N--) { === INTEGER SLOT === === LD/ST SLOT === int temp = *A; ld (%rdi),%eax *A = temp + *B + 9 ; ld (%rsi), %ebx *B = temp; } add %eax , %ebx } add $9 , %ebx st %ebx ,(%rdi) .L1: st %eax ,(%rsi) ld (%rdi),%eax ; load temp=*A add $-4 ,%rdi ld (%rsi),%ebx ; load *B add $-4 ,%rsi add %eax,%ebx ; add temp+*B add $-1 ,%rdx add $9 ,%ebx ; add 9 st %ebx,(%rdi) ; store *A jne $0 ,%rdx, .L1 st %eax,(%rsi) ; store *B add $-4 ,%rdi ; dec. A ptr. add $-4 ,%rsi ; dec. B ptr. You can move or modify code, but cannot apply add $-1 ,%rdx loop unrolling or register renaming. jne $0 ,%rdx, .L1 ; loop

Solution: 2-way VLIW Scheduling Unoptimized Schedule Move Up and Modify Offsets === INTEGER SLOT === === LD/ST SLOT === === INTEGER SLOT === === LD/ST SLOT === ld (%rdi),%eax add $-4 ,%rdi ld (%rdi),%eax ld (%rsi), %ebx add $-4 ,%rsi ld (%rsi), %ebx // nop add $-1 ,%rdx add %eax , %ebx add %eax , %ebx add $9 , %ebx add $9 , %ebx st %ebx ,(%rdi) st %ebx , 4 (%rdi) st %eax ,(%rsi) st %eax , 4 (%rsi) add $-4 ,%rdi add $-4 ,%rsi jne $0 ,%rdx, .L1 add $-1 ,%rdx jne $0 ,%rdx, .L1

Solution: 2-way VLIW Scheduling Can we move more instructions up? Yes! === INTEGER SLOT === === LD/ST SLOT === === INTEGER SLOT === === LD/ST SLOT === add $-4 ,%rdi ld (%rdi),%eax add $-4 ,%rdi ld (%rdi),%eax add $-4 ,%rsi ld (%rsi), %ebx add $-4 ,%rsi ld (%rsi), %ebx add $-1 ,%rdx add $-1 ,%rdx st %eax ,4(%rsi) add %eax , %ebx add %eax , %ebx add $9 , %ebx add $9 , %ebx st %ebx ,4(%rdi) jne $0 ,%rdx, .L1 st %ebx ,4(%rdi) st %eax ,4(%rsi) jne $0 ,%rdx, .L1 IPC = 10 instructions / 6 clocks = 1.67 Note: intermediate instruction between load into %ebx and its use by add Next Exercise: Unroll the loop once (2 total iterations) with register renaming.

Unrolling the loop with register renaming Loop Unrolling void f1 ( int *A, int *B, int N) { for ( ; N != 0 ; A--, B--, N--) { .L1: int temp = *A; ld (%rdi),%eax ; load temp=*A *A = temp + *B + 9 ; ld (%rsi),%ebx ; load *B *B = temp; add %eax,%ebx ; add temp+*B } add $9 ,%ebx ; add 9 } st %ebx,(%rdi) ; store *A st %eax,(%rsi) ; store *B .L1: ld -4 (%rdi),%eax ; 2nd iter ld (%rdi),%eax ; load temp=*A ld -4 (%rsi),%ebx ; ld (%rsi),%ebx ; load *B add %eax,%ebx ; add %eax,%ebx ; add temp+*B add $9 ,%ebx ; add $9 ,%ebx ; add 9 st %ebx, -4 (%rdi) ; st %ebx,(%rdi) ; store *A st %eax, -4 (%rsi) ; st %eax,(%rsi) ; store *B add $-8 ,%rdi ; dec. A ptr. add $-4 ,%rdi ; dec. A ptr. add $-8 ,%rsi ; dec. B ptr. add $-4 ,%rsi ; dec. B ptr. add $-2 ,%rdx add $-1 ,%rdx jne $0 ,%rdx, .L1 ; loop jne $0 ,%rdx, .L1 ; loop

Unrolling the loop with register renaming Loop Unrolling Loop Unrolling / Register Renaming .L1: .L1: ld (%rdi),%eax ; load temp=*A ld (%rdi),%eax ; load temp=*A ld (%rsi),%ebx ; load *B ld (%rsi),%ebx ; load *B add %eax,%ebx ; add temp+*B add %eax,%ebx ; add temp+*B add $9 ,%ebx ; add 9 add $9 ,%ebx ; add 9 st %ebx,(%rdi) ; store *A st %ebx,(%rdi) ; store *A st %eax,(%rsi) ; store *B st %eax,(%rsi) ; store *B ld -4 (%rdi),%eax ; 2nd iter ld -4 (%rdi), %r8d ; 2nd iter ld -4 (%rsi),%ebx ; ld -4 (%rsi), %r9d ; add %eax,%ebx ; add %r8d , %r9d ; add $9 ,%ebx ; add $9 , %r9d ; st %ebx, -4 (%rdi) ; st %r9d , -4 (%rdi) ; st %eax, -4 (%rsi) ; st %r8d , -4 (%rsi) ; add $-8 ,%rdi ; dec. A ptr. add $-8 ,%rdi ; dec. A ptr. add $-8 ,%rsi ; dec. B ptr. add $-8 ,%rsi ; dec. B ptr. add $-2 ,%rdx add $-2 ,%rdx jne $0 ,%rdx, .L1 ; loop jne $0 ,%rdx, .L1 ; loop

Unrolling the loop with register renaming Loop Unrolling / Register Renaming Unoptimized Schedule .L1: === INTEGER SLOT === === LD/ST SLOT === ld (%rdi),%eax ; load temp=*A ld (%rdi),%eax ld (%rsi),%ebx ; load *B ld (%rsi),%ebx add %eax,%ebx ; add temp+*B //nop add $9 ,%ebx ; add 9 add %eax,%ebx st %ebx,(%rdi) ; store *A add $9 ,%ebx st %eax,(%rsi) ; store *B st %ebx,(%rdi) ld -4 (%rdi), %r8d ; 2nd iter st %eax,(%rsi) ld -4 (%rsi), %r9d ; ld -4 (%rdi), %r8d add %r8d , %r9d ; ld -4 (%rsi), %r9d add $9 , %r9d ; //nop st %r9d , -4 (%rdi) ; add %r8d , %r9d st %r8d , -4 (%rsi) ; add $9 , %r9d add $-8 ,%rdi ; dec. A ptr. st %r9d , -4 (%rdi) add $-8 ,%rsi ; dec. B ptr. st %r8d , -4 (%rsi) add $-2 ,%rdx add $-8 ,%rdi jne $0 ,%rdx, .L1 ; loop add $-8 ,%rsi add $-2 ,%rdx jne $0 ,%rdx, .L1

Unrolling the loop with register renaming Unoptimized Schedule Step 1 === INTEGER SLOT === === LD/ST SLOT === === INTEGER SLOT === === LD/ST SLOT === ld (%rdi),%eax add $-8 ,%rdi ld (%rdi),%eax ld (%rsi),%ebx add $-8 ,%rsi ld (%rsi),%ebx //nop add $-2 ,%rdx add %eax,%ebx add %eax,%ebx add $9 ,%ebx add $9 ,%ebx st %ebx,(%rdi) st %ebx, 8 (%rdi) Increased st %eax,(%rsi) st %eax, 8 (%rsi) Offset ld -4 (%rdi), %r8d ld 4 (%rdi), %r8d ld -4 (%rsi), %r9d ld 4 (%rsi), %r9d //nop //nop add %r8d , %r9d add %r8d , %r9d add $9 , %r9d add $9 , %r9d st %r9d , -4 (%rdi) st %r9d , 4 (%rdi) st %r8d , -4 (%rsi) st %r8d , 4 (%rsi) add $-8 ,%rdi jne $0 ,%rdx, .L1 add $-8 ,%rsi add $-2 ,%rdx jne $0 ,%rdx, .L1

Unrolling the loop with register renaming Step 1 Step 2 === INTEGER SLOT === === LD/ST SLOT === === INTEGER SLOT === === LD/ST SLOT === add $-8 ,%rdi ld (%rdi),%eax add $-8 ,%rdi ld (%rdi),%eax add $-8 ,%rsi ld (%rsi),%ebx add $-8 ,%rsi ld (%rsi),%ebx add $-2 ,%rdx add $-2 ,%rdx ld 4 (%rdi), %r8d add %eax,%ebx add %eax,%ebx ld 4 (%rsi), %r9d Reversed add $9 ,%ebx add $9 ,%ebx st %eax, 8 (%rsi) %rsi / %rdi st %ebx, 8 (%rdi) st %ebx, 8 (%rdi) st %eax, 8 (%rsi) ld 4 (%rdi), %r8d ld 4 (%rsi), %r9d //nop //nop add %r8d , %r9d add %r8d , %r9d add $9 , %r9d add $9 , %r9d st %r9d , 4 (%rdi) st %r9d , 4 (%rdi) st %r8d , 4 (%rsi) st %r8d , 4 (%rsi) jne $0 ,%rdx, .L1 jne $0 ,%rdx, .L1

Unrolling the loop with register renaming Step 2 Step 3 === INTEGER SLOT === === LD/ST SLOT === === INTEGER SLOT === === LD/ST SLOT === add $-8 ,%rdi ld (%rdi),%eax add $-8 ,%rdi ld (%rdi),%eax add $-8 ,%rsi ld (%rsi),%ebx add $-8 ,%rsi ld (%rsi),%ebx add $-2 ,%rdx ld 4 (%rdi), %r8d add $-2 ,%rdx ld 4 (%rdi), %r8d Increased add %eax,%ebx ld 4 (%rsi), %r9d add %eax,%ebx ld 4 (%rsi), %r9d Offset add $9 ,%ebx st %eax, 8 (%rsi) add $9 ,%ebx st %eax, 8 (%rsi) st %ebx, 8 (%rdi) add %r8d , %r9d st %ebx, 8 (%rdi) Reversed add $9 , %r9d st %r8d , 4 (%rsi) %rsi / %rdi jne $0 ,%rdx, .L1 st %r9d , 4 (%rdi) //nop IPC = 16 instructions / 8 clocks = 2 add %r8d , %r9d add $9 , %r9d Note: intermediate instructions between st %r9d , 4 (%rdi) st %r8d , 4 (%rsi) loads and uses of a register. jne $0 ,%rdx, .L1

CS356 : Discussion #13 Review for Final Exam Illustrations from - PowerPoint PPT Presentation

CS356 : Discussion #13 Review for Final Exam Illustrations from CS:APP3e textbook Processor Organization Pipeline Hazards: Stalling and Forwarding Stalling Forwarding Structural

Introduction to CS356 CS356 Object-Oriented Design and Programming http://cs356.yusun.io

SOLID: Principles of OOD CS356 Object-Oriented Design and Programming http://cs356.yusun.io

CS356 : Discussion #5 Assembly Procedures and Arrays Procedures Functions are a key abstraction

CS356 : Discussion #9 Cache Lab & Review for Midterm II Illustrations from CS:APP3e textbook

CS356 : Discussion #10 Dynamic Memory and Cache Lab Illustrations from CS:APP3e textbook Cache

CS356 : Discussion #11 Dynamic Memory, Allocation Lab and Linking Illustrations from CS:APP3e

CS356 : Discussion #4 Assembly Instructions & Debugging with GDB Last week: Operand Forms

CS356 : Discussion #3 Assembly Instructions What about programs that operate on data? Integer

CS356 : Discussion #14 Processor Architecture Marco Paolieri (paolieri@usc.edu) Illustrations

CS356 : Discussion #15 Review for Final Exam Marco Paolieri (paolieri@usc.edu) Illustrations

CS356 : Discussion #2 Integer Operations & Floating-Point Operations Integers in C (64-bit

CS356 Unit 4 Intro to x86 Instruction Set 4.2 Why Learn Assembly To understand something of

CS356 Unit 5 x86 Control Flow 5.2 JUMP/BRANCHING OVERVIEW 5.3 Concept of Jumps/Branches

CS356 Unit 10 Memory Allocation & Heap Management 10.2 BASIC OS CONCEPTS & TERMINOLOGY

CS356 Unit 6 x86 Procedures Basic Stack Frames 6.2 Review of Program Counter (IP register)

CS356 Unit 9 Virtual Memory & Address Translation 9.2 Indirection Indirection means

Asynchronous Pattern Matching Amihood Amir Yonatan Aumann, Gary Benson,Tzvika Hartman, Oren

A Simple Streaming Bit-parallel Algorithm for Swap Pattern Matching V aclav Bla zej (joint

Stable Matching Problem Goal. Given a set of preferences among colleges and high school

Proof of Novelty A distributed consensus mechanism for securing content novelty Daniel Severo

Tuples, Records, Need a way to build pairs and a way to

Matching and Modifying with Generics Neil Brown and Adam Sampson Computing Laboratory University

Global Shape Matching Section 3.3: Articulated Matching using Graph Cuts Global Shape Matching:

Data Matching Overview of Computer Science Methods and Research at the ANU Peter Christen

CS356 : Discussion #13 Review for Final Exam Illustrations from - PowerPoint PPT Presentation

CS356 : Discussion #13 Review for Final Exam Illustrations from CS:APP3e textbook Processor Organization Pipeline Hazards: Stalling and Forwarding Stalling Forwarding Structural

Introduction to CS356 CS356 Object-Oriented Design and Programming http://cs356.yusun.io

SOLID: Principles of OOD CS356 Object-Oriented Design and Programming http://cs356.yusun.io

CS356 : Discussion #5 Assembly Procedures and Arrays Procedures Functions are a key abstraction

CS356 : Discussion #9 Cache Lab &amp; Review for Midterm II Illustrations from CS:APP3e textbook

CS356 : Discussion #10 Dynamic Memory and Cache Lab Illustrations from CS:APP3e textbook Cache

CS356 : Discussion #11 Dynamic Memory, Allocation Lab and Linking Illustrations from CS:APP3e

CS356 : Discussion #4 Assembly Instructions &amp; Debugging with GDB Last week: Operand Forms

CS356 : Discussion #3 Assembly Instructions What about programs that operate on data? Integer

CS356 : Discussion #14 Processor Architecture Marco Paolieri (paolieri@usc.edu) Illustrations

CS356 : Discussion #15 Review for Final Exam Marco Paolieri (paolieri@usc.edu) Illustrations

CS356 : Discussion #2 Integer Operations &amp; Floating-Point Operations Integers in C (64-bit

CS356 Unit 4 Intro to x86 Instruction Set 4.2 Why Learn Assembly To understand something of

CS356 Unit 5 x86 Control Flow 5.2 JUMP/BRANCHING OVERVIEW 5.3 Concept of Jumps/Branches

CS356 Unit 10 Memory Allocation &amp; Heap Management 10.2 BASIC OS CONCEPTS &amp; TERMINOLOGY

CS356 Unit 6 x86 Procedures Basic Stack Frames 6.2 Review of Program Counter (IP register)

CS356 Unit 9 Virtual Memory &amp; Address Translation 9.2 Indirection Indirection means

Asynchronous Pattern Matching Amihood Amir Yonatan Aumann, Gary Benson,Tzvika Hartman, Oren

A Simple Streaming Bit-parallel Algorithm for Swap Pattern Matching V aclav Bla zej (joint

Stable Matching Problem Goal. Given a set of preferences among colleges and high school

Proof of Novelty A distributed consensus mechanism for securing content novelty Daniel Severo

Tuples, Records, Need a way to build pairs and a way to

Matching and Modifying with Generics Neil Brown and Adam Sampson Computing Laboratory University

Global Shape Matching Section 3.3: Articulated Matching using Graph Cuts Global Shape Matching:

Data Matching Overview of Computer Science Methods and Research at the ANU Peter Christen

CS356 : Discussion #9 Cache Lab & Review for Midterm II Illustrations from CS:APP3e textbook

CS356 : Discussion #4 Assembly Instructions & Debugging with GDB Last week: Operand Forms

CS356 : Discussion #2 Integer Operations & Floating-Point Operations Integers in C (64-bit

CS356 Unit 10 Memory Allocation & Heap Management 10.2 BASIC OS CONCEPTS & TERMINOLOGY

CS356 Unit 9 Virtual Memory & Address Translation 9.2 Indirection Indirection means