Pentium 4
• Deeply pipelined processor supporting multiple issue with speculation and multithreading
  – 2004 version: 31 clock cycles from fetch to retire, 3.2 GHz clock rate (the deep pipeline allows a higher clock rate)
• Front-end decoder translates each IA-32 instruction into a series of RISC-like micro-operations called uops
• Uops are executed by a dynamically scheduled speculative pipeline
Section 2.10
Pentium 4, continued
• Uops are stored in an execution trace cache
  – Stores sequences of instructions to be executed, including nonadjacent instructions
  – Accessed using branch prediction bits and the address of the first instruction in the trace
  – Has its own branch target buffer for predicting the outcome of uop branches
  – Very high hit rate, so IA-32 instruction fetch is rarely needed
Pentium 4, continued
• Uops are executed by an out-of-order speculative pipeline that uses register renaming rather than a reorder buffer
• Up to three uops per clock cycle can be renamed and dispatched to the functional unit queues
• Up to three uops can commit each clock cycle
• Up to six uops can be dispatched to functional units each clock cycle
Figure 2.26
About Figure 2.26
• Front-end BTB: predicts the next IA-32 instruction to fetch; only accessed on a miss in the execution trace cache
• Execution trace cache: holds uops
• Trace cache BTB: predicts the next uop
• Registers for renaming: 128; up to 128 uops can be in execution simultaneously
• Functional units: 7 (the simple ones run at twice the clock rate and accept up to two uops every clock cycle)
About Figure 2.26
• L1 data cache: supports up to 8 outstanding misses; integer load latency is 4 cycles; FP load latency is 12 cycles
• L2 cache: 18-cycle access time
Pentium 4
• The deep pipeline makes speculation and branch prediction very important for high performance
• The cost of a cache miss is also very high, because the queues fill up while waiting for the miss to be handled
Pentium 4: Branch misprediction
• Figure 2.28 (next slide) shows the branch-misprediction rate per 1000 instructions
  – Top five are integer benchmarks (average 186 branches per 1000 instructions)
  – Bottom five are FP benchmarks (average 48 branches per 1000 instructions)
  – Misprediction rate for the integer benchmarks is 8 times higher than for the FP benchmarks
Figure 2.28
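The per-1000-instruction rates above mix two effects: how often branches occur and how often each one is mispredicted. The sketch below separates them by converting a per-1000-instruction count into a per-branch rate. The branch densities (186 and 48) come from the slide; the misprediction counts passed in at the bottom are hypothetical values chosen only to illustrate the conversion and are not read from Figure 2.28.

```python
# Branch densities quoted on the slide (branches per 1000 instructions).
INT_BRANCHES_PER_KI = 186
FP_BRANCHES_PER_KI = 48


def per_branch_misprediction_rate(mispredicts_per_ki: float,
                                  branches_per_ki: float) -> float:
    """Fraction of executed branches that were mispredicted."""
    if branches_per_ki <= 0:
        raise ValueError("branches_per_ki must be positive")
    return mispredicts_per_ki / branches_per_ki


# Hypothetical misprediction counts (per 1000 instructions), for illustration only.
example_int = per_branch_misprediction_rate(9.3, INT_BRANCHES_PER_KI)
example_fp = per_branch_misprediction_rate(1.2, FP_BRANCHES_PER_KI)

print(f"integer: {example_int:.1%} of branches mispredicted")
print(f"FP:      {example_fp:.1%} of branches mispredicted")
```

With the slide's densities, integer code executes about 3.9 times as many branches per instruction as FP code (186/48), so a gap measured per 1000 instructions is larger than the gap in per-branch predictor accuracy.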
Pentium 4: Misspeculation
• A misprediction causes wrong instructions to be executed (misspeculated instructions), which requires recovery time and wastes energy
• Figure 2.29 (next slide) shows the percentage of issued uops that are misspeculated
• Note that Figure 2.29 closely matches Figure 2.28
Figure 2.29
Pentium 4: Cache misses
• Trace cache miss rates are almost negligible for the SPEC benchmarks
• L1 and L2 miss rates are more significant
• Figure 2.30 (next slide) shows misses per 1000 instructions for the L1 and L2 caches
• L1 misses are more frequent, but the miss penalty for L2 is higher, so both affect performance
Figure 2.30
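To see why a frequent-but-cheap L1 miss and a rare-but-expensive L2 miss can matter about equally, here is a minimal sketch that turns misses per 1000 instructions into stall cycles per instruction. It assumes no overlap between misses and useful work, so it is an upper bound for an out-of-order machine. The 18-cycle L2 access time is from the earlier slide; the main-memory penalty and both miss counts are hypothetical placeholders, not values from Figure 2.30.

```python
L2_ACCESS_CYCLES = 18            # from the slide: L2 cache access time
ASSUMED_MEMORY_PENALTY = 200     # hypothetical main-memory penalty, in cycles


def miss_stall_cpi(misses_per_ki: float, penalty_cycles: float) -> float:
    """Stall cycles per instruction from one cache level, assuming no overlap."""
    return misses_per_ki / 1000.0 * penalty_cycles


# Hypothetical miss counts per 1000 instructions, for illustration only.
l1_stall = miss_stall_cpi(20.0, L2_ACCESS_CYCLES)
l2_stall = miss_stall_cpi(2.0, ASSUMED_MEMORY_PENALTY)

print(f"L1 miss contribution to CPI: {l1_stall:.2f}")
print(f"L2 miss contribution to CPI: {l2_stall:.2f}")
```

In this made-up case the L1 misses are ten times as frequent, yet the two levels add a similar number of stall cycles per instruction because the L2 penalty is so much larger.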
Pentium 4: CPI
• Figure 2.31 (next slide) shows cycles per instruction for these same 10 SPEC benchmarks
• Note that mcf has the worst misspeculation rate and the worst L1 and L2 miss rates, and also the highest CPI
• Note that swim has high L1 and L2 miss rates and is the lowest-performing FP benchmark
Figure 2.31
Comparing Pentium 4 to AMD Opteron
• Both use a dynamically scheduled, speculative pipeline capable of issuing three IA-32 instructions per clock cycle
• Both have two levels of on-chip cache, but the Opteron L1 instruction cache is not a trace cache
• The biggest difference is that the Pentium 4 is more deeply pipelined
• The Pentium 4 has a higher CPI (Figure 2.32), which is expected given its deeper pipeline
Figure 2.32
Comparing Pentium 4 to AMD Opteron
• Deeper pipelining allows an increase in clock rate
  – Will this increase make up for the increase in CPI?
• Figure 2.33 (next slide) compares a 2.8 GHz AMD Opteron with a 3.8 GHz Intel Pentium 4
• Note that the Opteron has higher performance, so the Pentium 4's higher clock rate is insufficient to overcome its higher CPI
Figure 2.33
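The clock-rate versus CPI question above can be made concrete with the usual relation, execution time = instruction count × CPI / clock rate. The sketch below assumes both processors execute the same number of instructions (plausible when both run the same IA-32 binary, but an assumption here). The clock rates are the ones quoted for Figure 2.33; the CPI values in the example call are hypothetical.

```python
OPTERON_GHZ = 2.8     # clock rate quoted for Figure 2.33
PENTIUM4_GHZ = 3.8    # clock rate quoted for Figure 2.33


def relative_performance(cpi_a: float, ghz_a: float,
                         cpi_b: float, ghz_b: float) -> float:
    """Performance of A relative to B (> 1 means A is faster).

    Assumes equal instruction counts, so time is proportional to CPI / clock.
    """
    return (cpi_b / cpi_a) * (ghz_a / ghz_b)


# The Pentium 4 is faster only while CPI_P4 / CPI_Opteron stays below this.
break_even_cpi_ratio = PENTIUM4_GHZ / OPTERON_GHZ
print(f"break-even CPI ratio: {break_even_cpi_ratio:.3f}")

# Hypothetical CPIs: a Pentium 4 CPI 1.5x the Opteron's loses despite the clock.
print(relative_performance(1.5, PENTIUM4_GHZ, 1.0, OPTERON_GHZ))
```

Under these assumptions the Pentium 4's clock advantage covers a CPI up to about 1.36 times the Opteron's; beyond that, the Opteron comes out ahead, which is the outcome the slide reports.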
Comparing Pentium 4 to IBM Power5
• Sophisticated multiple-issue pipelines usually have slower clock rates than simple pipelines
• A faster clock rate will win in the presence of limited ILP
• IBM Power5: designed for high-performance integer and FP (two processor cores, each capable of sustaining four instructions per clock cycle); 1.9 GHz clock rate
Comparing Pentium 4 to IBM Power5
• Pentium 4: single processor with multithreading; very deep pipeline; can sustain three instructions per clock cycle; higher clock rate (3.8 GHz)
• Figure 2.34 (next slide) compares the performance of these machines
  – Note that the Power5 often does better on the FP benchmarks (fewer branches, more parallelism)
  – The Pentium 4 does better on the integer benchmarks (higher clock rate)
Figure 2.34
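As a rough sanity check on the trade-off between issue width and clock rate, the sketch below multiplies the sustained instructions per clock quoted on the slides by each clock rate. It yields peak rates only, for a single Power5 core versus the Pentium 4. The two machines use different instruction sets, so instruction counts are not directly comparable, and real programs with limited ILP or cache misses fall well short of these peaks.

```python
def peak_ginstr_per_sec(instr_per_clock: float, clock_ghz: float) -> float:
    """Peak instruction rate in billions of instructions per second."""
    return instr_per_clock * clock_ghz


# Widths and clock rates as quoted on the slides.
power5_core_peak = peak_ginstr_per_sec(4, 1.9)   # one of the two Power5 cores
pentium4_peak = peak_ginstr_per_sec(3, 3.8)

print(f"Power5 (per core): {power5_core_peak:.1f} G instr/s peak")
print(f"Pentium 4:         {pentium4_peak:.1f} G instr/s peak")
```

The narrower but faster Pentium 4 has the higher peak on paper, which is consistent with it doing better when ILP is limited; the wider Power5 benefits when the code exposes enough parallelism to fill its issue slots, as the FP benchmarks tend to.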