Unit 5: Pipelining (PowerPoint PPT presentation)
  1. This Unit: Pipelining
     • Single-cycle & multi-cycle datapaths
     • Latency vs. throughput & performance
     • Basic pipelining
     • Data hazards
     • Bypassing
     • Load-use stalling
     • Pipelined multi-cycle operations
     • Control hazards
     • Branch prediction
     (Slide figure: App / App / App over System software over Mem / CPU / I/O layering diagram.)
     Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
     CIS 501: Comp. Arch. | Prof. Milo Martin | Pipelining

     Readings
     • Chapter 2.1 of MA:FSPTCM

     In-Class Exercise
     • You have a washer, dryer, and "folder"; each takes 30 minutes per load
       • How long for one load in total?
       • How long for two loads of laundry?
       • How long for 100 loads of laundry?
     • Now assume: washing takes 30 minutes, drying 60 minutes, and folding 15 minutes
       • How long for one load in total?
       • How long for two loads of laundry?
       • How long for 100 loads of laundry?
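The laundry questions above can be sanity-checked with a tiny pipelined-laundry model (a sketch; the function name and the perfectly-overlapped assumption are mine): the first load pays the full latency, and each additional load finishes one bottleneck-stage delay after the previous one.

```python
def laundry_time(stage_minutes, loads):
    """Total minutes for `loads` loads through a pipelined laundry.

    The first load pays the full latency (sum of all stages); every
    later load finishes one bottleneck-stage delay after the previous.
    """
    latency = sum(stage_minutes)       # one load, start to finish
    bottleneck = max(stage_minutes)    # slowest stage sets the rate
    return latency + (loads - 1) * bottleneck

# Equal stages: wash = dry = fold = 30 minutes
print(laundry_time([30, 30, 30], 1))    # 90
print(laundry_time([30, 30, 30], 100))  # 3060

# Unequal stages: wash 30, dry 60, fold 15 minutes
print(laundry_time([30, 60, 15], 100))  # 6045
```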

  2. In-Class Exercise Answers
     • Washer, dryer, and "folder", each 30 minutes per load:
       • One load in total: 90 minutes
       • Two loads of laundry: 90 + 30 = 120 minutes
       • 100 loads of laundry: 90 + 30*99 = 3060 minutes
     • Washing 30 minutes, drying 60 minutes, folding 15 minutes:
       • One load in total: 105 minutes
       • Two loads of laundry: 105 + 60 = 165 minutes
       • 100 loads of laundry: 105 + 60*99 = 6045 minutes

     Datapath Background

     Recall: The Sequential Model
     • Basic structure of all modern ISAs
       • Often called von Neumann, but present in the ENIAC before that
     • Program order: total order on dynamic insns
       • Order and named storage define computation
     • Convenient feature: program counter (PC)
       • The insn itself is stored in memory at the location pointed to by the PC
       • The next PC is the next insn unless the insn says otherwise
     • Processor logically executes a fetch/execute loop
     • Atomic: an insn finishes before the next insn starts
       • Implementations can break this constraint physically
       • But must maintain the illusion to preserve correctness

     Recall: Maximizing Performance
     • Execution time = (instructions/program) * (cycles/instruction) * (seconds/cycle)
       • Example: (1 billion instructions) * (1 cycle per insn) * (1 ns per cycle) = 1 second
     • Instructions per program:
       • Determined by program, compiler, instruction set architecture (ISA)
     • Cycles per instruction: "CPI"
       • Typical range today: 2 to 0.5
       • Determined by program, compiler, ISA, micro-architecture
     • Seconds per cycle: "clock period", the same each cycle
       • Typical range today: 2 ns to 0.25 ns
       • Reciprocal is frequency: 0.5 GHz to 4 GHz (1 Hz = 1 cycle per second)
       • Determined by micro-architecture, technology parameters
     • For minimum execution time, minimize each term
       • Difficult: the terms often pull against one another
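The execution-time identity above can be written as a one-line function (a sketch; the names are mine, not from the slides):

```python
def execution_time_s(insns, cpi, clock_period_ns):
    """Iron law: (insns/program) * (cycles/insn) * (seconds/cycle)."""
    return insns * cpi * clock_period_ns * 1e-9  # ns -> seconds

# The slides' example: 1 billion insns, CPI = 1, 1 ns per cycle -> 1 second
print(execution_time_s(1_000_000_000, 1, 1))
```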

  3. Single-Cycle Datapath / Multi-Cycle Datapath
     (Slide figures: datapaths of PC → Insn Mem → Register File → ALU → Data Mem; the single-cycle delay T_singlecycle spans T_insn-mem, T_regfile, T_ALU, T_data-mem, T_regfile.)
     • Single-cycle datapath: true "atomic" fetch/execute loop
       • Fetch, decode, execute one complete instruction every cycle
       + Takes 1 cycle to execute any instruction by definition ("CPI" is 1)
       – Long clock period: must accommodate the slowest instruction
         (worst-case delay through the circuit; must wait this long every time)
     • Multi-cycle datapath: attacks the slow clock
       • Fetch, decode, execute one complete insn over multiple cycles
       • Allows insns to take different numbers of cycles
       + Opposite of single-cycle: short clock period (less "work" per cycle)
       – Multiple cycles per instruction (higher "CPI")

     Recap: Single-cycle vs. Multi-cycle
     Single-cycle: insn0.fetch,dec,exec | insn1.fetch,dec,exec
     Multi-cycle:  insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
     • Single-cycle datapath: one complete instruction every cycle
       + Low CPI: 1 by definition
       – Long clock period: to accommodate the slowest instruction
     • Multi-cycle datapath: one complete insn over multiple cycles
       ± Opposite of single-cycle: short clock period, high CPI (think: CISC)

     Single-cycle vs. Multi-cycle Performance
     • Single-cycle: clock period = 50 ns, CPI = 1
       • Performance = 50 ns/insn
     • Multi-cycle has the opposite performance split of single-cycle
       + Shorter clock period
       – Higher CPI
     • Multi-cycle example: branch 20% (3 cycles), load 20% (5 cycles), ALU 60% (4 cycles)
       • Clock period = 11 ns, CPI = (20%*3) + (20%*5) + (60%*4) = 4
       • Why is the clock period 11 ns and not 10 ns? Overheads
       • Performance = 44 ns/insn
     • Aside: CISC makes perfect sense in a multi-cycle datapath
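The multi-cycle CPI above is just a weighted average over the instruction mix; a short sketch (function and variable names are mine):

```python
def avg_cpi(mix):
    """Average CPI from an instruction mix given as (fraction, cycles) pairs."""
    return sum(frac * cycles for frac, cycles in mix)

# Slide's mix: branch 20% at 3 cycles, load 20% at 5, ALU 60% at 4
mix = [(0.20, 3), (0.20, 5), (0.60, 4)]
cpi = avg_cpi(mix)
print(cpi)        # cycles per insn
print(cpi * 11)   # ns per insn at the 11 ns multi-cycle clock
```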

  4. Recall: Latency vs. Throughput
     • Latency (execution time): time to finish a fixed task
     • Throughput (bandwidth): number of tasks in a fixed time
     • Different: exploit parallelism for throughput, not latency (e.g., bread)
     • Often contradictory (latency vs. throughput)
       • Will see many examples of this
     • Choose the definition of performance that matches your goals
       • Scientific program? Latency. Web server? Throughput.
     • Example: move people 10 miles
       • Car: capacity = 5, speed = 60 miles/hour
       • Bus: capacity = 60, speed = 20 miles/hour
       • Latency: car = 10 min, bus = 30 min
       • Throughput: car = 15 PPH (counting the return trip), bus = 60 PPH
     • Fastest way to send 1 TB of data? (at 100+ Mbits/second)

     Pipelined Datapath

     Latency versus Throughput
     Single-cycle: insn0.fetch,dec,exec | insn1.fetch,dec,exec
     Multi-cycle:  insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
     • Can we have both low CPI and a short clock period?
       • Not if the datapath executes only one insn at a time
     • Latency and throughput: two views of performance:
       (1) at the program level and (2) at the instruction level
     • Single-instruction latency
       • Doesn't matter: programs comprise billions of instructions
       • Difficult to reduce anyway
     • Goal is to make programs, not individual insns, go faster
       • Instruction throughput → program latency
     • Key: exploit inter-insn parallelism

     Pipelining
     Multi-cycle: insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
     Pipelined:   insn0.fetch | insn0.dec  | insn0.exec
                               insn1.fetch | insn1.dec  | insn1.exec
     • Important performance technique
       • Improves instruction throughput rather than instruction latency
     • Begin with the multi-cycle design
       • When an insn advances from stage 1 to 2, the next insn enters stage 1
       • Form of parallelism: "insn-stage parallelism"
     • Maintains the illusion of a sequential fetch/execute loop
       • An individual instruction takes the same number of stages
       + But instructions enter and leave at a much faster rate
     • Laundry analogy
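The car-vs-bus numbers above follow from two small formulas (a sketch; function names are mine):

```python
def latency_min(distance_mi, speed_mph):
    """One-way trip time in minutes: the latency one rider experiences."""
    return distance_mi / speed_mph * 60

def throughput_pph(capacity, distance_mi, speed_mph):
    """People delivered per hour, counting the empty return trip."""
    round_trip_hr = 2 * distance_mi / speed_mph
    return capacity / round_trip_hr

# Car: 5 seats at 60 mph; bus: 60 seats at 20 mph; 10-mile trip
print(latency_min(10, 60), latency_min(10, 20))               # 10 vs 30 min
print(throughput_pph(5, 10, 60), throughput_pph(60, 10, 20))  # 15 vs 60 PPH
```

The bus loses on latency but wins on throughput, the same trade-off the multi-cycle and pipelined datapaths make against the single-cycle design.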

  5. 5-Stage Multi-Cycle Datapath / 5-Stage Pipeline: Inter-Insn Parallelism
     (Slide figures: the PC → Insn Mem → Register File → ALU → Data Mem datapath cut by pipeline latches, with stage delays T_insn-mem, T_regfile, T_ALU, T_data-mem, T_regfile vs. T_singlecycle.)
     • Pipelining: cut the datapath into N stages (here 5)
       • One insn in each stage in each cycle
       + Clock period = MAX(T_insn-mem, T_regfile, T_ALU, T_data-mem)
       + Base CPI = 1: an insn enters and leaves every cycle
       – Actual CPI > 1: the pipeline must often "stall"
       • Individual insn latency increases (pipeline overhead); that is not the point

     5-Stage Pipelined Datapath
     • Five stages: Fetch, Decode, eXecute, Memory, Writeback
     • Latches (pipeline registers) named by the stages they begin: PC, D, X, M, W

     More Terminology & Foreshadowing
     • Scalar pipeline: one insn per stage per cycle
       • Alternative: "superscalar" (later)
     • In-order pipeline: insns enter the execute stage in order
       • Alternative: "out-of-order" (later)
     • Pipeline depth: number of pipeline stages
       • Nothing magical about five
       • Contemporary high-performance cores have ~15-stage pipelines (the Pentium 4 had 22 stages!)
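The clock-period rule above (the slowest stage sets the clock, plus latch overhead) can be sketched as follows; the stage delays below are hypothetical illustration values, not from the slides:

```python
def pipeline_clock_ns(stage_delays_ns, latch_overhead_ns=0.0):
    """Pipeline clock period: limited by the slowest stage, plus latch overhead."""
    return max(stage_delays_ns) + latch_overhead_ns

def pipelined_time_ns(n_insns, n_stages, clock_ns):
    """Base-CPI-1 pipeline (no stalls): fill the pipeline once,
    then one insn completes every cycle."""
    return (n_stages + n_insns - 1) * clock_ns

# Hypothetical delays (ns) for the five F/D/X/M/W stages
stages = [10, 8, 10, 10, 8]
clk = pipeline_clock_ns(stages, latch_overhead_ns=1)   # 11 ns
print(clk)
print(pipelined_time_ns(1_000_000, len(stages), clk))  # ns for 1M insns
```

Note how the per-insn cost approaches one clock period as the insn count grows, while a single insn's latency (5 stages * 11 ns) is worse than an unpipelined pass through the same logic.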
