12b.1
CS356 Unit 12b
Advanced Processor Organization

12b.2
Goals
• Understand the terms and ideas used in a modern, high-performance processor
• Different systems use different kinds of processors, and you should understand the pros and cons of each kind
• Terms to listen for and understand:
  – Superscalar/multiple issue, loop unrolling, register renaming, out-of-order execution, speculation, and branch prediction

12b.3
INSTRUCTION LEVEL PARALLELISM

12b.4
A New Instruction
• In x86, we often perform
  – cmp %rax, %rbx
  – je L1 or jne L1
• Many instruction sets have a single instruction that both compares and jumps (limited to registers only)
  – je %rax, %rbx, L1
  – jne %rax, %rbx, L1
• Let us assume x86 supports such an instruction in our subsequent discussion

12b.5
Have We Hit The Limit
• Under ideal circumstances, pipelining allows us to achieve a throughput (IPC = instructions per clock) of 1
• Can we do better? Can we execute more than one instruction per clock?
  – Not with a single pipeline
  – But what if we had multiple pipelines
  – What if we fetched multiple instructions per clock and let them run down the pipeline in parallel
• Let's exploit _______________!

12b.6
Exploiting Parallelism
• With the increasing transistor budgets of modern processors (i.e. they can do more things at the same time), the question becomes how do we find enough useful tasks to increase performance or, put another way, what is the most effective way of exploiting parallelism?
• Many types of parallelism are available
  – Instruction Level Parallelism (ILP): Overlapping instructions within a single process/thread of execution
  – Thread Level Parallelism (TLP): Overlap execution of multiple processes/threads
  – Data Level Parallelism (DLP): Overlap an operation (instruction) that is to be applied to multiple data values (usually in an array)
    • for(i=0; i < MAX; i++) { A[i] = A[i] + 5; }
• We'll focus on ILP in this unit

12b.7
Basic Blocks
• Basic Block (def.) = Sequence of instructions that will ________ be executed ___________
  – No conditional branches out
  – No branch targets coming in
  – Also called "straight-line" code
  – Average size: ______ instrucs.
• Instructions in a basic block can be overlapped if there are no data dependencies
• Control dependences really limit our window of possible instructions to overlap
  – Without extra hardware, we can only overlap execution of instructions within a basic block

        ld  0(%r8),%r9
        and %r10,%r11
    L1: add %r8,%r12
        or  %r11,%r13
        sub %r14,%r10
        je  %r12,%r14,L1
        xor %r10,%r15

  [Annotation on the code from L1: through je:] This is a basic block (starts w/ _______, ends with _______)

12b.8
Instruction Level Parallelism (ILP)
• Although a program defines a sequential ordering of instructions, in reality many instructions can be executed in parallel.
• ILP refers to the process of finding instructions from a single program/thread of execution that can be executed in parallel
• Data flow (data dependencies) is what truly _________ ordering
  – We call these dependencies _________________________ Hazards
• Independent instructions can be __________________
• Control hazards also provide ordering constraints

    ld  0(%r8), %r9
    and %r10, %r11      write %r11
    or  %r11, %r13      read %r11
    sub %r14, %r15      write %r15
    add %r10, %r12      write %r12
    je  $0,%r12,L1      read %r12
    xor %r15, %rax      read %r15

  Dependency Graph (one row per level):
    LD   AND   SUB   ADD
    OR   JE
    XOR

  Cycle 1:    /    /    /
  Cycle 2:    /    /    /
  Cycle 3:    /    /    /
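
The ILP slide's seven-instruction example can be worked through mechanically. The sketch below is a minimal illustration, not the course's method: it assumes unlimited execution units, a one-cycle latency for every instruction, no speculation past a branch, and it tracks only read-after-write dependencies. The `Instr` record, the register numbering, and the function name `earliest_cycles` are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REG (-1)

/* Hypothetical, simplified instruction record (not from the slides).
   x86-style two-operand form: the destination is usually also a source. */
typedef struct {
    const char *text;  /* assembly text, for display only */
    int src[2];        /* registers read (NO_REG if unused) */
    int dst;           /* register written (NO_REG if none) */
    int is_branch;     /* nonzero for a conditional branch */
} Instr;

/* Nonzero if `later` reads the register that `earlier` writes. */
static int raw_dependent(const Instr *earlier, const Instr *later) {
    if (earlier->dst == NO_REG) return 0;
    return later->src[0] == earlier->dst || later->src[1] == earlier->dst;
}

/* Fills cycle[i] with the earliest 1-based cycle in which instruction i
   could execute, and returns the total number of cycles.
   Assumptions: unlimited units, 1-cycle latency, no speculation, so an
   instruction waits for every earlier producer and every earlier branch. */
int earliest_cycles(const Instr *code, size_t n, int *cycle) {
    assert(code != NULL && cycle != NULL);
    int last = 0;
    for (size_t i = 0; i < n; i++) {
        int c = 1;
        for (size_t j = 0; j < i; j++) {
            int must_wait = raw_dependent(&code[j], &code[i]) || code[j].is_branch;
            if (must_wait && cycle[j] + 1 > c) c = cycle[j] + 1;
        }
        cycle[i] = c;
        if (c > last) last = c;
    }
    return last;
}

/* The example sequence; register numbers are arbitrary IDs (rax = 0). */
const Instr SLIDE_CODE[7] = {
    {"ld 0(%r8), %r9",  { 8, NO_REG},  9,      0},
    {"and %r10, %r11",  {10, 11},      11,     0},
    {"or %r11, %r13",   {11, 13},      13,     0},
    {"sub %r14, %r15",  {14, 15},      15,     0},
    {"add %r10, %r12",  {10, 12},      12,     0},
    {"je $0,%r12,L1",   {12, NO_REG},  NO_REG, 1},
    {"xor %r15, %rax",  {15, 0},       0,      0},
};
```

Under these assumptions the sequence needs three cycles, with LD, AND, SUB, and ADD in the first, OR and JE in the second, and XOR in the third, which matches the rows of the dependency graph. XOR's data dependence on SUB alone would allow the second cycle; it is the control dependence on JE that pushes it to the third.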

12b.9
Superscalar
• When airplanes broke the sound barrier we said they were super-sonic
• When a processor (HW) can complete more than one instruction per clock cycle we say it is superscalar
• Problem: The HW can execute 2 or more instructions during the same cycle, but the SW may be written and compiled assuming 1 instruction executing at a time.
• Solutions
  – ______________ the code and rely on the compiler to safely order instructions that can be run in parallel (static scheduling)
  – Build the HW to be smart, _______ instructions on the fly while guaranteeing correctness (dynamic scheduling)

12b.10
Superscalar (Multiple Issue)
• Multiple "pipelines" that can fetch, decode, and potentially execute more than 1 instruction per clock
  – k-way superscalar = Ability to complete up to k instructions per clock cycle
• Benefits
  – Theoretical throughput greater than 1 (IPC > 1)
• Problems
  – Hazards
    • Dependencies between instructions limiting parallelism
    • Branch/jump requires flushing all pipelines
  – Finding enough parallel instructions

12b.11
Compiler-based solutions
STATIC MULTIPLE ISSUE MACHINES

12b.12
Data Flow and Dependency Graphs
• The compiler produces a sequential order of instructions
• Modern processors will transform the sequential order to execute instructions in parallel
• Instructions can be executed in any valid __________________ of the dependency graph

    ld  0(%r8), %r9
    and %r9, %r11
    or  %r11, %r13
    sub %r14, %r15
    add %r10, %r12
    je  $0,%r12,L1
    xor %r15, %r9

  Dependency Graph (one row per level):
    LD
    AND   SUB   ADD
    OR    JE
    XOR
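
The idea that instructions may run in any order that respects the dependency graph can be checked with a few lines of code. The following is a sketch under stated assumptions, not the hardware's or compiler's actual algorithm: the `Dep` edge list is hand-derived from the Data Flow slide's code (read-after-write edges, one write-after-read edge on %r9, and a control edge from the branch, assuming no register renaming and no speculation), and `is_valid_order` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSTRS 64

/* Edge "from must execute before to"; indices are program-order positions. */
typedef struct { int from; int to; } Dep;

/* Returns 1 if `order` is a permutation of 0..n-1 in which every
   dependency's producer appears before its consumer, else 0. */
int is_valid_order(const Dep *deps, size_t n_deps, const int *order, size_t n) {
    assert(deps != NULL && order != NULL);
    int pos[MAX_INSTRS];
    if (n > MAX_INSTRS) return 0;
    for (size_t i = 0; i < n; i++) pos[i] = -1;
    for (size_t i = 0; i < n; i++) {
        int id = order[i];
        if (id < 0 || (size_t)id >= n || pos[id] != -1) return 0; /* not a permutation */
        pos[id] = (int)i;
    }
    for (size_t d = 0; d < n_deps; d++) {
        int from = deps[d].from, to = deps[d].to;
        if (from < 0 || to < 0 || (size_t)from >= n || (size_t)to >= n) return 0;
        if (pos[from] >= pos[to]) return 0; /* consumer placed before producer */
    }
    return 1;
}

/* Indices: 0 ld, 1 and, 2 or, 3 sub, 4 add, 5 je, 6 xor.
   Hand-derived edges (an assumption of this sketch, not slide content). */
const Dep SLIDE_DEPS[7] = {
    {0, 1},  /* and reads %r9 written by ld */
    {1, 2},  /* or reads %r11 written by and */
    {4, 5},  /* je reads %r12 written by add */
    {3, 6},  /* xor reads %r15 written by sub */
    {0, 6},  /* xor reads %r9 (its destination is also a source) */
    {1, 6},  /* xor overwrites %r9, which and must read first (WAR) */
    {5, 6},  /* xor follows the branch (control dependence) */
};
```

For instance, the order SUB, ADD, LD, JE, AND, OR, XOR (indices 3, 4, 0, 5, 1, 2, 6) passes the check, while any order that places AND before LD fails it.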

12b.13
Static Multiple Issue
• The compiler is responsible for finding and packaging instructions that can execute in parallel into issue packets
  – Only certain combinations of instructions can be in a packet together
  – Instruction packet example:
    • (1) Integer/Branch instruction
    • (1) LD/ST instruction
    • (1) FP operation
• An issue packet is often thought of as a LONG instruction containing multiple instructions (a.k.a. Very Long Instruction Word)
  – Intel's Itanium used this technique (static multiple issue) but called it EPIC (Explicitly Parallel Instruction Computing)

12b.14
Example 2-way VLIW machine
• One issue slot for INT/BRANCH operations & another for LD/ST instructions
• I-Cache reads out an entire issue packet (more than 1 instruction)
• HW is added to allow many registers to be accessed at one time
  – Just more multiplexers
• Address Calculation Unit (just a simple adder)
[Figure: datapath of the 2-way VLIW machine with PC, I-Cache, Reg. File (4 Read, 2 Write), an Integer Slot (INT/BRANCH) with ALU, and a LD/ST Slot with Addr. Calc. and D-Cache. Example issue packet: add %rcx,%rax in the Integer Slot and ld 8(%rdi),%rdx in the LD/ST Slot. Issue Packet = More than 1 instruction]

12b.15
Sample Scheduling
• Schedule the following loop body on our 2-way static issue machine

    void f1(int* A, int n) {
      for( ; n != 0; n--, A++)
        *A += 5;
    }

    # %rdi = A
    # %esi = n = # of iterations
    L1: ld  0(%rdi),%r9
        add $5,%r9
        st  %r9,0(%rdi)
        add $4,%rdi
        add $-1,%esi
        jne $0,%esi,L1

• w/o modifying original code but with code movement
    Int./Branch Slot | LD/ST Slot
    IPC = ___ instrucs. / ___ cycles = _____
• w/ modifications and code movement
    Int./Branch Slot | LD/ST Slot
    IPC = ___ instrucs. / ___ cycles = _____

12b.16
2-way VLIW Scheduling
• 1.) No forwarding w/in an issue packet (between instructions in a packet)
• 2.) Full forwarding to previous instructions
  – Those behind in the pipeline
• 3.) Still 1 stall cycle necessary when LD is followed by a dependent instruction
[Figure: timing diagrams (time running downward, Int./Branch Slot and LD/ST Slot columns) illustrating rules 1 to 3 on the VLIW (issue packet) datapath, using the instructions sub %rax,%rbx, add %rcx,%rax, st %rax,0(%rdi), or %rcx,%rdx, and ld 0(%rdi),%rcx]
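
The three scheduling rules can be turned into a small legality checker for a proposed schedule of the f1 loop body. This is a sketch under several assumptions that go beyond the slides: the branch must sit in the last packet, write-after-write and memory dependencies are ignored, a write-after-read pair may share a packet (because nothing is forwarded within a packet), and the register IDs, the `Instr` record, and `check_schedule` are hypothetical names. The candidate schedule in the note below is only one legal schedule under these rules, not necessarily the intended answer to the exercise.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REG (-1)

typedef enum { INT_OP, BRANCH_OP, LOAD_OP, STORE_OP } Kind;

/* Hypothetical instruction record for this sketch. */
typedef struct {
    Kind kind;
    int src[2];  /* registers read (NO_REG if unused) */
    int dst;     /* register written (NO_REG if none) */
} Instr;

static int uses_mem_slot(Kind k) { return k == LOAD_OP || k == STORE_OP; }

static int reads(const Instr *in, int reg) {
    return reg != NO_REG && (in->src[0] == reg || in->src[1] == reg);
}

/* cycle[i] is the 1-based packet in which instruction i (program order)
   issues. Returns the number of packets if the schedule is legal, else 0. */
int check_schedule(const Instr *code, size_t n, const int *cycle) {
    assert(code != NULL && cycle != NULL);
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        if (cycle[i] < 1) return 0;
        if (cycle[i] > total) total = cycle[i];
        for (size_t j = 0; j < i; j++) {
            /* One INT/BRANCH slot and one LD/ST slot per packet. */
            if (uses_mem_slot(code[i].kind) == uses_mem_slot(code[j].kind)
                && cycle[i] == cycle[j]) return 0;
            /* RAW: rules 1 and 2 need a later packet; rule 3 adds a stall
               after a load. */
            if (reads(&code[i], code[j].dst)) {
                int gap = (code[j].kind == LOAD_OP) ? 2 : 1;
                if (cycle[i] - cycle[j] < gap) return 0;
            }
            /* WAR: a later writer may not issue before an earlier reader. */
            if (reads(&code[j], code[i].dst) && cycle[i] < cycle[j]) return 0;
        }
    }
    for (size_t i = 0; i < n; i++)  /* assumption: branch ends the loop body */
        if (code[i].kind == BRANCH_OP && cycle[i] != total) return 0;
    return total;
}

/* Loop body of f1; register IDs are arbitrary (rdi = 7, esi = 6, r9 = 9). */
const Instr LOOP_BODY[6] = {
    {LOAD_OP,   {7, NO_REG}, 9},       /* ld  0(%rdi),%r9 */
    {INT_OP,    {9, NO_REG}, 9},       /* add $5,%r9      */
    {STORE_OP,  {9, 7},      NO_REG},  /* st  %r9,0(%rdi) */
    {INT_OP,    {7, NO_REG}, 7},       /* add $4,%rdi     */
    {INT_OP,    {6, NO_REG}, 6},       /* add $-1,%esi    */
    {BRANCH_OP, {6, NO_REG}, NO_REG},  /* jne $0,%esi,L1  */
};
```

As a usage example, the assignment {1, 3, 4, 4, 1, 5} (ld and add $-1 in packet 1, add $5 in packet 3, st and add $4 in packet 4, jne in packet 5) is accepted and spans 5 packets. Placing add $5 in the packet right after the load, or placing st in the same packet as add $5, is rejected.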