Pentium 4: Deeply pipelined processor supporting multiple issue

Pentium 4
• Deeply pipelined processor supporting multiple issue with speculation and multithreading
  – 2004 version: 31 clock cycles from fetch to retire, 3.2 GHz clock rate (the deep pipeline allows a higher clock rate)
• Front-end decoder translates each IA-32 instruction into a series of RISC-like micro-operations called uops
• Uops are executed by a dynamically scheduled speculative pipeline
Section 2.10

Pentium 4, continued
• Uops are stored in an execution trace cache
  – Stores sequences of instructions to be executed, including nonadjacent instructions
  – Accessed using branch prediction bits and the address of the first instruction in the trace
  – Has its own branch target buffer for predicting the outcome of uop branches
  – Very high hit rate, so IA-32 instruction fetch is rarely needed
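The lookup described above (a trace identified by the address of its first instruction plus branch prediction bits) can be modeled with a small sketch. This is an illustrative model only: the class name `TraceCache`, its methods, and the dictionary-based storage are assumptions, not a description of the actual hardware organization.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: a trace is identified by the address of its first
# instruction plus the predicted outcomes of the branches inside it.
TraceKey = Tuple[int, Tuple[bool, ...]]


class TraceCache:
    def __init__(self) -> None:
        self._traces: Dict[TraceKey, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fill(self, start_addr: int, predictions: Tuple[bool, ...],
             uops: List[str]) -> None:
        """Store a decoded uop sequence, which may span taken branches."""
        self._traces[(start_addr, predictions)] = list(uops)

    def lookup(self, start_addr: int,
               predictions: Tuple[bool, ...]) -> Optional[List[str]]:
        """Return the stored uops on a hit, or None on a miss.

        A miss is the case where IA-32 instructions must be fetched
        and decoded again.
        """
        trace = self._traces.get((start_addr, predictions))
        if trace is None:
            self.misses += 1
        else:
            self.hits += 1
        return trace


cache = TraceCache()
cache.fill(0x1000, (True,), ["load r1", "cmp r1", "br taken", "add r2"])
hit = cache.lookup(0x1000, (True,))    # same start address, same prediction
miss = cache.lookup(0x1000, (False,))  # same address, different predicted path
```

Keying on the prediction bits as well as the address is what lets one starting address map to different traces depending on the predicted path.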

Pentium 4, continued
• Uops are executed by an out-of-order speculative pipeline that uses register renaming rather than a reorder buffer
• Up to three uops per clock cycle can be renamed and dispatched to the functional unit queue
• Up to three uops can commit each clock cycle
• Up to six uops can be dispatched to functional units each clock cycle
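Register renaming can be illustrated with a minimal sketch. The function `rename`, the `(dest, src1, src2)` tuple format, and the default of 8 architectural registers are assumptions made for illustration; the 128 physical registers match the figure quoted in these slides. The sketch never returns registers to the free list, which real hardware does at commit.

```python
from collections import deque


def rename(uops, num_arch_regs=8, num_phys_regs=128):
    """Map architectural registers to physical registers.

    uops is a list of (dest, src1, src2) architectural register numbers.
    Returns the same uops expressed in physical register numbers.
    """
    # Initially architectural register i lives in physical register i.
    mapping = list(range(num_arch_regs))
    free = deque(range(num_arch_regs, num_phys_regs))
    renamed = []
    for dest, src1, src2 in uops:
        # Sources are read under the current mapping.
        p1, p2 = mapping[src1], mapping[src2]
        # A fresh destination register removes WAR and WAW hazards.
        pd = free.popleft()
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


# r1 = r2 op r3; r2 = r1 op r3; r1 = r2 op r2
renamed = rename([(1, 2, 3), (2, 1, 3), (1, 2, 2)])
```

After renaming, the two writes to architectural register 1 target different physical registers, so only true data dependences remain.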

Figure 2.26

About Figure 2.26
• Front-end BTB: predicts the next IA-32 instruction to fetch; only accessed on a miss in the execution trace cache
• Execution trace cache: holds uops
• Trace cache BTB: predicts the next uop
• Registers for renaming: 128; supports 128 uops executing simultaneously
• Functional units: 7 (the simple ones run at twice the clock rate and accept up to two operations every clock cycle)

About Figure 2.26, continued
• L1 data cache: supports up to 8 outstanding misses; integer load latency is 4 cycles; FP load latency is 12 cycles
• L2 cache: 18-cycle access time

Pentium 4
• The deep pipeline makes speculation and branch prediction very important for high performance
• A cache miss is also very costly, because the queues fill while waiting for the miss to be handled

Pentium 4: Branch misprediction
• Figure 2.28 (next slide) shows the branch-misprediction rate per 1000 instructions
  – Top five are integer benchmarks (average 186 branches per 1000 instructions)
  – Bottom five are FP benchmarks (48 branches per 1000 instructions)
  – Misprediction rate for the integer benchmarks is 8 times higher than for the FP benchmarks
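The numbers above can be split into two effects: integer code has more branches, and each branch is harder to predict. The sketch below works through that arithmetic, assuming the factor of 8 refers to mispredictions per 1000 instructions, as the description of Figure 2.28 suggests. The variable names are illustrative.

```python
# Values quoted on the slide.
INT_BRANCHES_PER_1000 = 186
FP_BRANCHES_PER_1000 = 48
MISPREDICTS_PER_1000_RATIO = 8.0  # integer vs. FP, per 1000 instructions

# How much more often integer code executes a branch.
branch_frequency_ratio = INT_BRANCHES_PER_1000 / FP_BRANCHES_PER_1000

# What is left over must come from a higher misprediction rate per branch.
per_branch_ratio = MISPREDICTS_PER_1000_RATIO / branch_frequency_ratio

print(f"branch frequency ratio: {branch_frequency_ratio:.3f}")
print(f"per-branch misprediction ratio: {per_branch_ratio:.2f}")
```

Under that assumption, branch frequency accounts for a bit less than a factor of 4, and the remaining factor of roughly 2 reflects integer branches being less predictable.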

Pentium 4: Misspeculation
• A misprediction causes the wrong instructions to be executed (misspeculated instructions), which requires recovery time and wastes energy
• Figure 2.29 (next slide) shows the percentage of issued uops that are misspeculated
• Note that Figure 2.29 closely matches Figure 2.28

Pentium 4: Cache misses
• Trace cache miss rates are almost negligible for the SPEC benchmarks
• L1 and L2 miss rates are more significant
• Figure 2.30 (next slide) shows misses per 1000 instructions for the L1 and L2 caches
• L1 misses are more frequent, but the L2 miss penalty is higher, so both affect performance
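A rough way to see why both cache levels matter is to multiply miss frequency by miss penalty. The sketch below does this with a hypothetical helper, `stall_cycles_per_instruction`. The miss counts and the 200-cycle L2 miss penalty are made-up illustrative values; only the 18-cycle L2 access time comes from these slides. The model ignores any overlap of misses with out-of-order execution.

```python
def stall_cycles_per_instruction(misses_per_1000: float,
                                 penalty_cycles: float) -> float:
    """Average stall cycles added per instruction by one cache level."""
    return misses_per_1000 / 1000.0 * penalty_cycles


# Hypothetical workload: frequent L1 misses, rarer L2 misses.
l1_stall = stall_cycles_per_instruction(misses_per_1000=20, penalty_cycles=18)
l2_stall = stall_cycles_per_instruction(misses_per_1000=5, penalty_cycles=200)

print(f"L1 contribution: {l1_stall:.2f} cycles per instruction")
print(f"L2 contribution: {l2_stall:.2f} cycles per instruction")
```

With these assumed numbers the less frequent L2 misses contribute more stall time than the L1 misses, which is the trade-off the last bullet describes.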

Pentium 4: CPI
• Figure 2.31 (next slide) shows cycles per instruction for the same 10 SPEC benchmarks
• Note that mcf has the worst misspeculation rate and the worst L1 and L2 miss rates, and it also has the highest CPI
• Note that swim has high L1 and L2 miss rates and is the lowest-performing FP benchmark

Comparing Pentium 4 to AMD Opteron
• Both use a dynamically scheduled, speculative pipeline capable of issuing three IA-32 instructions per clock cycle
• Both have two levels of on-chip cache, but the Opteron L1 instruction cache is not a trace cache
• The biggest difference is that the Pentium 4 is more deeply pipelined
• The Pentium 4 has a higher CPI (Figure 2.32), which makes sense given its deeper pipeline

Comparing Pentium 4 to AMD Opteron
• Deeper pipelining allows an increase in clock rate
  – Will this increase make up for the increase in CPI?
• Figure 2.33 (next slide) compares a 2.8 GHz AMD Opteron with a 3.8 GHz Intel Pentium 4
• Note that the AMD has higher performance, so the higher clock rate is insufficient to overcome the higher CPI
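The trade-off between clock rate and CPI can be made concrete. For the same instruction count, execution rate is clock rate divided by CPI, so the Pentium 4 wins only if its CPI is less than 3.8/2.8 times the Opteron's. The sketch below computes that break-even ratio. The function name and the two example CPI values are hypothetical; only the clock rates come from the slide.

```python
OPTERON_GHZ = 2.8
PENTIUM4_GHZ = 3.8


def instructions_per_second(clock_ghz: float, cpi: float) -> float:
    """Instruction throughput for a given clock rate and CPI."""
    return clock_ghz * 1e9 / cpi


# The Pentium 4 is faster only while CPI_P4 / CPI_Opteron stays below this.
break_even_cpi_ratio = PENTIUM4_GHZ / OPTERON_GHZ

# Hypothetical CPIs for illustration: the Pentium 4's is 1.5x the Opteron's.
opteron_rate = instructions_per_second(OPTERON_GHZ, cpi=1.0)
pentium4_rate = instructions_per_second(PENTIUM4_GHZ, cpi=1.5)

print(f"break-even CPI ratio: {break_even_cpi_ratio:.3f}")
print(f"Opteron faster in this example: {opteron_rate > pentium4_rate}")
```

Since the break-even ratio is about 1.36, any benchmark where the Pentium 4's CPI exceeds the Opteron's by more than roughly 36% runs faster on the Opteron despite the lower clock rate.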

Comparing Pentium 4 to IBM Power5
• Sophisticated multiple-issue pipelines usually have slower clock rates than simple pipelines
• A faster clock rate will win in the presence of limited ILP
• IBM Power5: designed for high-performance integer and FP (two processor cores, each capable of sustaining four instructions per clock cycle); 1.9 GHz clock rate

Comparing Pentium 4 to IBM Power5
• Pentium 4: single processor with multithreading; very deep pipeline; can sustain three instructions per clock cycle; higher clock rate (3.8 GHz)
• Figure 2.34 (next slide) compares the performance of these machines
  – Note that the Power5 often does better on the FP benchmarks (fewer branches, more parallelism)
  – The Pentium 4 does better on the integer benchmarks (higher clock rate)
