CS252 Graduate Computer Architecture
Lecture 2
Review of Instruction Sets, Pipelines, and Caches

Prof. David Culler
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~culler/courses/cs252-s05

1/20/05 CS252-S05 Lec2

Review, #1
• Technology is changing rapidly:

              Capacity         Speed
  Logic       2x in 3 years    2x in 3 years
  DRAM        4x in 3 years    2x in 10 years
  Disk        4x in 3 years    2x in 10 years
  Processor   (n.a.)           2x in 1.5 years

• What was true five years ago is not necessarily true now.
• Execution time is the REAL measure of computer performance!
  – Not clock rate, not CPI
• “X is n times faster than Y” means:

$$\frac{\mathrm{ExTime}(Y)}{\mathrm{ExTime}(X)} = \frac{\mathrm{Performance}(X)}{\mathrm{Performance}(Y)}$$

Today: Quick review of everything you should have learned

Amdahl’s Law

$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times \left[ (1 - \mathrm{Fraction}_{enhanced}) + \frac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}} \right]$$

$$\mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}}$$

Best you could ever hope to do:

$$\mathrm{Speedup}_{maximum} = \frac{1}{1 - \mathrm{Fraction}_{enhanced}}$$

Computer Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

The three factors are the instruction count, the CPI, and the cycle time.

                Inst Count   CPI    Clock Rate
  Program       X
  Compiler      X            (X)
  Inst. Set.    X            X
  Organization               X      X
  Technology                        X

Cycles Per Instruction (Throughput)
“Average Cycles per Instruction”

$$\mathrm{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \mathrm{CPI}_j \times I_j$$

“Instruction Frequency”

$$\mathrm{CPI} = \sum_{j=1}^{n} \mathrm{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}}$$
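The Amdahl's Law and CPI formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture. The function names are hypothetical, and the 40% / 10x enhancement and the 1 ns cycle time are assumed values chosen only for the example. The instruction mix mirrors the base-machine mix in the lecture's bottom-up CPI example.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup: 1 / ((1 - F) + F / S)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced):
    """Upper bound on speedup as Speedup_enhanced grows without limit."""
    if not 0.0 <= fraction_enhanced < 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1)")
    return 1.0 / (1.0 - fraction_enhanced)


def weighted_cpi(mix):
    """CPI = sum_j CPI_j * F_j, with mix mapping op -> (F_j, CPI_j)."""
    total_frequency = sum(freq for freq, _ in mix.values())
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix.values())


def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x cycle time (seconds)."""
    return instruction_count * cpi * cycle_time_s


# Assumed enhancement: 40% of the old execution time runs 10x faster.
overall = amdahl_speedup(0.4, 10.0)
bound = max_speedup(0.4)

# Instruction mix as (frequency, cycles) pairs.
base_mix = {
    "alu": (0.5, 1),
    "load": (0.2, 2),
    "store": (0.1, 2),
    "branch": (0.2, 2),
}
cpi = weighted_cpi(base_mix)

# Assumed workload: one million instructions at a 1 ns cycle time.
seconds = cpu_time(1_000_000, cpi, 1e-9)

print(f"overall speedup {overall:.4f}, upper bound {bound:.4f}")
print(f"CPI {cpi:.2f}, CPU time {seconds:.6f} s")
```

Even with a 10x enhancement, the overall speedup stays below the bound set by the unenhanced 60% of the execution time, which is the point of the "best you could ever hope to do" formula.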
Example: Calculating CPI bottom up
Run benchmark and collect workload characterization (simulate, machine counters, or sampling)

Base Machine (Reg / Reg)

  Op       Freq   Cycles   CPI(i)   (% Time)
  ALU      50%    1        .5       (33%)
  Load     20%    2        .4       (27%)
  Store    10%    2        .2       (13%)
  Branch   20%    2        .4       (27%)
  Total                    1.5

Typical Mix of instruction types in program

Design guideline: Make the common case fast
MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.

Example: Branch Stall Impact
• Assume CPI = 1.0 ignoring branches (ideal)
• Assume solution was stalling for 3 cycles
• If 30% branch, Stall 3 cycles on 30%

  Op       Freq   Cycles   CPI(i)   (% Time)
  Other    70%    1        .7       (37%)
  Branch   30%    4        1.2      (63%)

⇒ new CPI = 1.9
• New machine is 1/1.9 ≈ 0.53 times faster (i.e. slow!)

SPEC: System Performance Evaluation Cooperative
• First Round 1989
  – 10 programs yielding a single number (“SPECmarks”)
• Second Round 1992
  – SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
    » Compiler Flags unlimited. March 93 of DEC 4000 Model 610:
      spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
      wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
      nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third Round 1995
  – new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  – “benchmarks useful for 3 years”
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
• Fourth Round 2000: 26 apps
  – analysis and simulation programs
  – Compression: bzip2, gzip,
  – Integrated circuit layout, ray tracing, lots of others

SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
[Figure: bar chart with one bar per benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv); x-axis "Benchmark"; y-axis ticks from 0 to 800.]

Integrated Circuits Costs

$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

$$\text{Die cost} = \frac{\text{Wafer cost}}{\text{Dies per Wafer} \times \text{Die yield}}$$

$$\text{Dies per wafer} = \frac{\pi \times (\text{Wafer\_diam}/2)^2}{\text{Die\_Area}} - \frac{\pi \times \text{Wafer\_diam}}{\sqrt{2 \cdot \text{Die\_Area}}} - \text{Test\_Die}$$

$$\text{Die Yield} = \text{Wafer\_yield} \times \left( 1 + \frac{\text{Defect\_Density} \times \text{Die\_area}}{\alpha} \right)^{-\alpha}$$

Die cost goes roughly with (die area)^4

A "Typical" RISC
• 32-bit fixed format instruction (3 formats)
• 32 32-bit GPR (R0 contains zero, DP take pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Delayed branch

see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
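The integrated-circuit cost formulas above translate directly into code. The sketch below is only an illustration: the function and parameter names are hypothetical, and every numeric input (wafer cost, wafer diameter, defect density, alpha, testing and packaging costs) is an assumed value, not data from the lecture.

```python
import math


def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Wafer area over die area, minus edge loss, minus test dies."""
    if die_area_cm2 <= 0:
        raise ValueError("die area must be positive")
    return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
            - test_dies)


def die_yield(wafer_yield, defect_density, die_area_cm2, alpha):
    """Wafer_yield x (1 + Defect_Density x Die_area / alpha) ** -alpha."""
    return wafer_yield * (1 + defect_density * die_area_cm2 / alpha) ** (-alpha)


def die_cost(wafer_cost, dies, yield_fraction):
    """Wafer cost spread over the good dies on the wafer."""
    return wafer_cost / (dies * yield_fraction)


def ic_cost(die, testing, packaging, final_test_yield):
    """(Die cost + testing cost + packaging cost) / final test yield."""
    return (die + testing + packaging) / final_test_yield


# All inputs below are assumed, illustrative values.
WAFER_COST = 1000.0      # dollars per wafer
WAFER_DIAM = 20.0        # cm
DEFECTS = 1.0            # defects per cm^2
ALPHA = 3.0

for area in (0.5, 1.0, 2.0):   # die area in cm^2
    dies = dies_per_wafer(WAFER_DIAM, area)
    good = die_yield(1.0, DEFECTS, area, ALPHA)
    cost = die_cost(WAFER_COST, dies, good)
    total = ic_cost(cost, testing=1.0, packaging=2.0, final_test_yield=0.95)
    print(f"area {area:.1f} cm^2: {dies:.0f} dies, yield {good:.2f}, "
          f"die ${cost:.2f}, IC ${total:.2f}")
```

Doubling the die area both reduces the number of dies per wafer and lowers the yield, so the die cost grows much faster than linearly with area under these assumptions.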
Example: MIPS (≈ DLX)

Register-Register
  31-26   25-21   20-16   15-11   10-6   5-0
  Op      Rs1     Rs2     Rd             Opx

Register-Immediate
  31-26   25-21   20-16   15-0
  Op      Rs1     Rd      immediate

Branch
  31-26   25-21   20-16     15-0
  Op      Rs1     Rs2/Opx   immediate

Jump / Call
  31-26   25-0
  Op      target

Datapath vs Control
[Figure: Datapath and Controller blocks; Control Points run from the Controller to the Datapath, signals run from the Datapath to the Controller.]
• Datapath: Storage, FU, interconnect sufficient to perform the desired functions
  – Inputs are Control Points
  – Outputs are signals
• Controller: State machine to orchestrate operation on the data path
  – Based on desired function and signals

Approaching an ISA
• Instruction Set Architecture
  – Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
• Meaning of each instruction is described by RTL on architected registers and memory
• Given technology constraints assemble adequate datapath
  – Architected storage mapped to actual storage
  – Function units to do all the required operations
  – Possible additional storage (e.g., MAR, MBR, …)
  – Interconnect to move information among regs and FUs
• Map each instruction to sequence of RTLs
• Collate sequences into symbolic controller state transition diagram (STD)
• Lower symbolic STD to control points
• Implement controller

5 Steps of DLX Datapath
Figure 3.1, Page 130
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: datapath with Next PC, Next SEQ PC, Adder (+4), Inst Memory, Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, Zero?, ALU, Data Memory (Address, LMD), WB Data.]
  IR <= mem[PC];
  PC <= PC + 4
  Reg[IR_rd] <= Reg[IR_rs] op_IRop Reg[IR_rt]

5 Steps of DLX Datapath
Figure 3.4, Page 134
Stages: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between the stages; RD is carried forward through the pipeline registers.]
  IR <= mem[PC];
  PC <= PC + 4
  A <= Reg[IR_rs];
  B <= Reg[IR_rt]
  rslt <= A op_IRop B
  WB <= rslt
  Reg[IR_rd] <= WB

Inst. Set Processor Controller
[Figure: controller state diagram. Ifetch leads to opFetch-DCD, which dispatches to one path per instruction class: JSR, JR, ST, jmp, br, LD, RI, RR.]
  Ifetch:       IR <= mem[PC]; PC <= PC + 4
  opFetch-DCD:  A <= Reg[IR_rs]; B <= Reg[IR_rt]
  br:           if bop(A,b) PC <= PC+IR_im
  jmp:          PC <= IR_jaddr
  RR:           r <= A op_IRop B;     WB <= r;       Reg[IR_rd] <= WB
  RI:           r <= A op_IRop IR_im; WB <= r;       Reg[IR_rd] <= WB
  LD:           r <= A + IR_im;       WB <= Mem[r];  Reg[IR_rd] <= WB
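The controller states above can be traced with a small interpreter. The sketch below is an illustration under stated assumptions, not the lecture's implementation. The `Instr` record, its field names, and the operation tables are hypothetical. The branch condition is assumed to compare A with B. Only the classes whose RTL is listed (RR, RI, LD, br, jmp) are modeled, and R0 is held at zero as in the "typical RISC" description.

```python
import operator
from dataclasses import dataclass


@dataclass
class Instr:
    """Hypothetical decoded instruction; not an actual DLX encoding."""
    kind: str            # "RR", "RI", "LD", "br", or "jmp"
    op: str = "add"      # ALU or branch operation name
    rs: int = 0
    rt: int = 0
    rd: int = 0
    imm: int = 0
    jaddr: int = 0


ALU_OPS = {"add": operator.add, "sub": operator.sub}
BRANCH_OPS = {"beq": operator.eq, "bne": operator.ne}


def write_reg(reg, rd, value):
    """Reg[IR_rd] <= WB, keeping R0 hardwired to zero."""
    if rd != 0:
        reg[rd] = value


def step(pc, reg, imem, dmem):
    """Run one instruction through the controller states; return next PC."""
    # Ifetch: IR <= mem[PC]; PC <= PC + 4
    ir = imem[pc]
    pc = pc + 4
    # opFetch-DCD: A <= Reg[IR_rs]; B <= Reg[IR_rt]
    a = reg[ir.rs]
    b = reg[ir.rt]
    if ir.kind == "RR":          # r <= A op B; WB <= r
        write_reg(reg, ir.rd, ALU_OPS[ir.op](a, b))
    elif ir.kind == "RI":        # r <= A op IR_im; WB <= r
        write_reg(reg, ir.rd, ALU_OPS[ir.op](a, ir.imm))
    elif ir.kind == "LD":        # r <= A + IR_im; WB <= Mem[r]
        write_reg(reg, ir.rd, dmem[a + ir.imm])
    elif ir.kind == "br":        # if bop(A, B): PC <= PC + IR_im
        if BRANCH_OPS[ir.op](a, b):
            pc = pc + ir.imm
    elif ir.kind == "jmp":       # PC <= IR_jaddr
        pc = ir.jaddr
    else:
        raise NotImplementedError(f"no RTL modeled for {ir.kind}")
    return pc


def run(imem, dmem):
    """Execute from address 0 until the PC leaves instruction memory."""
    reg = [0] * 32
    pc = 0
    while pc in imem:
        pc = step(pc, reg, imem, dmem)
    return pc, reg


# Assumed demo program, keyed by byte address.
program = {
    0: Instr("RI", "add", rs=0, rd=1, imm=5),     # R1 = 5
    4: Instr("LD", rs=1, rd=2, imm=3),            # R2 = Mem[8]
    8: Instr("RR", "add", rs=1, rt=2, rd=3),      # R3 = R1 + R2
    12: Instr("br", "beq", rs=0, rt=0, imm=4),    # always taken, skips 16
    16: Instr("RI", "add", rs=3, rd=3, imm=100),  # skipped
}
final_pc, regs = run(program, {8: 37})
print(final_pc, regs[1], regs[2], regs[3])
```

Because the PC is incremented in Ifetch, the branch offset is applied to the address of the following instruction, which matches the order of the RTL in the state list.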