Prediction and speculation: the role of stochastic models of program behaviour in the performance of modern computers. Roberto Innocente, 20 Nov 2005
Speculation (from the Merriam-Webster dictionary): an assumption of an unusual risk in hopes of obtaining commensurate gains
Speculative execution: A prediction is made of what work is likely to be needed soon. That work is then executed speculatively, in such a way that it can be committed if the prediction was correct or aborted otherwise.
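The commit-or-abort idea can be sketched in a few lines of Python. This is only an illustrative model, not how hardware does it: the names `run_speculatively`, `resolve` and `work` are hypothetical, the "architectural state" is a plain dictionary, and the speculative work runs sequentially on a shadow copy instead of overlapping with the resolution of the condition.

```python
import copy


def run_speculatively(state, predicted, resolve, work):
    """Run `work` on a shadow copy along the predicted path, then commit or abort.

    Returns True if the speculative result was committed, False if it was aborted.
    """
    shadow = copy.deepcopy(state)      # speculative side effects stay in the shadow
    work(shadow, predicted)            # do the work before the outcome is known
    actual = resolve(state)            # the real outcome becomes available
    if actual == predicted:
        state.clear()
        state.update(shadow)           # commit: the speculative state becomes architectural
        return True
    work(state, actual)                # abort: drop the shadow, redo along the real path
    return False


def work(s, taken):
    # Hypothetical work: the value written to y depends on the branch outcome.
    s["y"] = s["x"] * 2 if taken else -1


good = {"x": 3, "y": 0}
bad = {"x": 3, "y": 0}
committed_good = run_speculatively(good, True, lambda s: s["x"] > 0, work)
committed_bad = run_speculatively(bad, False, lambda s: s["x"] > 0, work)
print(committed_good, good)
print(committed_bad, bad)
```

In both runs the final state is the same; the only difference is whether the speculative work could be kept or had to be redone.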
von Neumann's model: Stored Program Computer. The Control Counter, today called Program Counter (PC) or Instruction Pointer (IP), keeps the address of the next instruction to be executed. The control part fetches this instruction, decodes it and executes it. At the end the PC is updated.
Linear scaling of speed, quadratic scaling of transistors: Let's look at the latest scaling step in silicon lithography, from 0.13 µm to 0.09 µm: a 0.70 linear scaling, a 0.49 scaling of surface. Gate delays scale linearly, transistors available scale quadratically. We will get much more in available complexity than in gate speed.
[Chart: gate speed and transistors available versus feature size (0.25, 0.18, 0.13, 0.09, 0.065)]
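The arithmetic behind the two scaling factors can be checked with a short sketch. The function name `scaling` is hypothetical, and reading the inverse of the area factor as "transistors per unit area" assumes ideal scaling. The exact ratio for 0.13 to 0.09 comes out slightly below the slide's rounded figures of 0.70 and 0.49.

```python
def scaling(old_feature, new_feature):
    """Return (linear factor, area factor) for a lithography shrink."""
    linear = new_feature / old_feature   # gate delay scales roughly with this
    area = linear ** 2                   # surface of a transistor scales with the square
    return linear, area


linear, area = scaling(0.13, 0.09)
density_gain = 1 / area                  # assumption: ideal scaling of transistor density
print(f"linear factor {linear:.2f}, area factor {area:.2f}")
print(f"gates about {1 / linear:.2f}x faster, about {density_gain:.2f}x more transistors per unit area")
```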
von Neumann's Projection/Collapse postulate of QM: A system can be described by any mix of states, but if you observe it you can only find it in one of the eigenstates, and you can only measure an eigenvalue. (When you look at it, Schrödinger's cat is either dead or alive.)
Modern microprocessors: Today's microprocessors take advantage of the fact that they need to present an architectural state compliant with the standard von Neumann model only from time to time, and are free for the rest of the time to proceed in whatever way they find convenient.
ILP – Instruction Level Parallelism (Fisher 1981): While obeying the standard semantics when required, try to overlap the execution of multiple instructions as much as possible. (We will see that current microprocessors can have more than 100 instructions in flight.)
Enabling technologies for ILP exploitation:
  Pipelining
  Multiple issue = Superscalar
A microprocessor in 1989 (Intel 386)
  CPI = Cycles Per Instruction
  Performance = Frequency / CPI
  Intel 386: feature size: 1 micron; frequency: 33 MHz; CPI = 5 to 6
  Performance = 33 M / 6 ≈ 5.5 Minstructions/s
Pipelining: The work to be done is divided into stages, with a clear signal interface between them. After each stage a latch memorizes the state for the next cycle. It adds some overhead, but the hope is to get 1 result per cycle, after the pipe is full.
[Figure: five pipeline stages, Fetch (F), Decode (D), eXecute (X), Memory (M), Writeback (W), with a pipeline latch after each stage]
Limits of pipelining: A latch can add 2 or 3 gate delays. The current work per instruction is around 400 gate delays. With n pipeline stages you get a result every 400/n + 3 gate delays, and you add a total overhead of 3n gate delays.
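A small sketch makes the diminishing return visible. It uses only the two figures from the slide (400 gate delays of work, 3 gate delays per latch). The function names are hypothetical, and the speedup is measured against an unpipelined 400-gate-delay datapath, which is an assumption of this sketch.

```python
WORK = 400    # gate delays of useful work per instruction (from the slide)
LATCH = 3     # gate delays added by each pipeline latch (from the slide)


def cycle_time(n_stages):
    """Gate delays between two consecutive results once the pipe is full."""
    return WORK / n_stages + LATCH


def latency(n_stages):
    """Gate delays for one instruction to traverse the whole pipeline."""
    return WORK + LATCH * n_stages


def speedup(n_stages):
    """Throughput gain over an unpipelined datapath (assumed to need WORK gate delays)."""
    return WORK / cycle_time(n_stages)


for n in (1, 5, 10, 20, 30, 100):
    print(f"{n:3d} stages: cycle {cycle_time(n):6.1f}, latency {latency(n):4d}, speedup {speedup(n):5.1f}")
```

However many stages are added, the speedup in this model can never exceed WORK / LATCH, about 133, because the latch delay does not shrink with n.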
Pipeline at work: When there is a dependency we say that the pipeline is stalled, or that a bubble is inserted, waiting for the dependency to be resolved. Here a control dependency causes a 4-cycle stall. (X marks a bubble in a pipeline stage.)

cycle  instruction fetched   F  D  X  M  W
1      add r1,r3,r4
2      mul r5,r6,r7
3      bnez loop,r1
4                            X
5                            X  X
6                            X  X  X
7                            X  X  X  X
8      div r8,r3,r6             X  X  X  X
9      add r10,r8,r9               X  X  X
10     jmp loop                       X  X
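The cost of the stall can be put in numbers with a minimal sketch. It assumes a 5-stage pipeline that fetches one instruction per cycle, and counts cycles until the last of the six instructions in the example leaves the Writeback stage. The function name `total_cycles` is hypothetical.

```python
def total_cycles(n_instructions, n_stages=5, stall_cycles=0):
    """Cycles until the last instruction leaves the pipeline (one fetch per cycle)."""
    fill = n_stages - 1                  # cycles needed to fill the pipe
    return fill + n_instructions + stall_cycles


n = 6                                    # add, mul, bnez, div, add, jmp
ideal = total_cycles(n)
stalled = total_cycles(n, stall_cycles=4)
print(f"ideal: {ideal} cycles, with the 4-cycle stall: {stalled} cycles")
print(f"cycles per instruction: {ideal / n:.2f} vs {stalled / n:.2f}")
```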
Instruction dependencies

Data dependency:
        add r1,r2,r3    ; r1<-r2+r3
        mul r1,r4,r5    ; r1<-r4*r5
  Solution: register renaming, result forwarding

Control dependency:
        bne label1,r1,r2
        add r1,r2,r3
label1:
        mul r4,r5,r6
  Solution: branch prediction

Structural dependency:
  Solution: add functional units
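Register renaming can be illustrated on the two instructions of the data-dependency example, which both write r1. The sketch below is a simplified model, not a description of any real core: the function `rename` and the physical register names `p0`, `p1`, ... are hypothetical, and it assumes an unlimited supply of physical registers, whereas real hardware draws them from a finite free list.

```python
from itertools import count


def rename(instructions, arch_regs):
    """Give every destination a fresh physical register; sources read the current map."""
    fresh = count()
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}   # initial arch -> phys map
    renamed = []
    for op, dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]    # look up sources before remapping dest
        mapping[dest] = f"p{next(fresh)}"        # new physical register for this result
        renamed.append((op, mapping[dest], s1, s2))
    return renamed


program = [("add", "r1", "r2", "r3"),
           ("mul", "r1", "r4", "r5")]
renamed = rename(program, ["r1", "r2", "r3", "r4", "r5"])
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, the two results go to different physical registers, so the two instructions no longer conflict on r1 and can execute in either order.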
Multiple issue (Superscalar) Architectures: Architectures that are able to process multiple instructions at a time. While it was already common to have multiple execution units (like an integer and an FP unit), the first superscalar architectures, e.g. IBM Power and Pentium Pro, appeared only in the '90s. These architectures require very good branch prediction.
[Figure: a 2-way superscalar, drawn as two parallel F D X M W pipelines]
Superscalar/2: Current architectures are commonly 4- or 8-way superscalars. The design of the last Alpha, canceled in its late phase, was for an 8-way superscalar. Extremely good branch prediction is needed: there can be hundreds of instructions in flight (4-way * 30 stages = 120).
Superscalar at work: The wasted cycle slots are now much more than in the pipelined-only case. (Pipeline stages F D X M W; X marks a wasted slot.)

cycle  instructions fetched and wasted slots
1      add r1,r3,r4   mul r5,r6,r7
2      bnez loop,r1   X
3      X X X
4      X X X X X
5      X X X X X X X
Real World Architectures: IBM Power5
15 years of x86

year  processor     feature size  transistor count  cycles/  frequency  pipe    FO4 gates
                    (micron)      (thousands)       instr.   (MHz)      length  per cycle
1979  8088          -             -                 12       -          -       -
1988  386dx         1             275               5        33         -       80
1991  486dx         1             1100              -        50         -       -
1993  Pentium 60    0.8           3100              -        60         5       -
1995  Pentium Pro   0.6           5500              -        150        10      -
1997  Pentium II    0.35          7500              -        233        10      -
1999  Pentium III   0.25          9500              -        450        10      -
2000  Pentium 4     0.18          42000             -        1300       20      -
2005  Pentium 4 571 0.09          130000            -        3800       30      13
Feature size, frequency, complexity
[Charts: feature size, frequency and transistor count for the 386, 486dx, P 60, P Pro, P II, P III, P 4 and P 4 571]
A microprocessor in 2005 (Intel Pentium 4)
  IPC = Instructions Per Cycle
  Performance = Frequency * IPC
  Intel Pentium 4: feature size: 90 nm; frequency: 3 GHz; IPC ~ 2 to 3 (2 for SPECint, 3 for SPECfp)
  Performance = 3 G * 2 = 6 Ginstructions/s
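The two performance formulas, Frequency / CPI for the 386 and Frequency * IPC for the Pentium 4, can be evaluated side by side. The sketch below uses the slide figures (33 MHz with CPI 6, 3 GHz with IPC 2). The function names are hypothetical, and splitting the gain into a frequency part and a per-cycle part is an illustration added here, not a claim from the slides.

```python
def perf_from_cpi(freq_hz, cpi):
    """Instructions per second when each instruction takes `cpi` cycles."""
    return freq_hz / cpi


def perf_from_ipc(freq_hz, ipc):
    """Instructions per second when `ipc` instructions complete each cycle."""
    return freq_hz * ipc


i386 = perf_from_cpi(33e6, 6)            # Intel 386: 33 MHz, CPI = 6
p4 = perf_from_ipc(3e9, 2)               # Pentium 4: 3 GHz, IPC = 2 (SPECint figure)

total_gain = p4 / i386
freq_gain = 3e9 / 33e6                   # part of the gain due to clock frequency
per_cycle_gain = total_gain / freq_gain  # part due to more work done per cycle

print(f"386: {i386 / 1e6:.1f} Minstructions/s, Pentium 4: {p4 / 1e9:.1f} Ginstructions/s")
print(f"total gain {total_gain:.0f}x = frequency {freq_gain:.0f}x * per-cycle {per_cycle_gain:.0f}x")
```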
Control transfer instructions: Some instructions, instead of simply incrementing the PC to the next instruction, change it to a different value. We distinguish:
  unconditional branches, or simply jumps
  conditional branches, or simply branches
  subroutine calls
  subroutine returns
  traps, returns from interrupts or exceptions
Assembly – Machine instructions
Only jumps or branches:
        j <label>
        j @register
        beq <label>
        bne <label>
        bz <label>
        bnz <label>
High level Language – Assembly

for(i=1;i<=4;i++) { .. }
        ld r1,1
        ld r2,5         ; leave the loop when i reaches 5
  loop: cmp r1,r2
        beq out
        ..
        add r1,r1,1
        jmp loop
  out:

if (i) { .. }
        ld r1,i
        bz next
        ..
  next:

while (--i) { .. }
  loop: sub r1,1        ; decrement, then test
        bz out
        ..
        jmp loop
  out:
SPEC (Standard Performance Evaluation Corporation) benchmarks: A well-known set of benchmarks, continuously updated, recognized as representative of possible workloads. Divided into 2 big sets:
  SPECint: integer programs (go, m88ksim, compress, li, ijpeg, perl, vortex)
  SPECfp: floating point programs (mathematical simulation programs)
http://www.spec.org
Branches by type (average from SPECint95):
  Conditional  72%
  Immediate    16%
  Returns      10%
  Indirect      2%
Branches by frequency
[Bar chart: dynamic instructions, dynamic branches and dynamic conditional branches, in millions of instructions on the y-axis, for the SPEC95 benchmarks compress, perl, gcc, go, ijpeg, m88ksim, vortex, xlisp]
Branches by taken rate (average from SPECint95)
[Pie chart: branches grouped by how often they are taken: never taken (14%), 0-5%, 5-50%, 50-95%, 95-100%, always taken (14%)]
Occurrences of branches
Occurrences of branches (conditional branches):
  SPECint 95: 1 out of 5 instructions executed (20%)
  SPECfp 95: 1 out of 10 instructions executed (10%)
Basic block is the term used for a sequence of instructions without any control transfer.
Note: this dynamic rate is different from, and much higher than, the rate of branches in the static program.
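These rates explain why the earlier slides insist on extremely good branch prediction. The sketch below estimates the average basic block length and, for the 120 instructions in flight quoted for a 4-way, 30-stage machine, the chance that every branch in flight was predicted correctly. The prediction accuracies are hypothetical values, and treating mispredictions as independent events is a simplifying assumption of this sketch.

```python
def avg_basic_block_length(branch_fraction):
    """Average number of instructions per branch, i.e. per basic block."""
    return 1 / branch_fraction


def prob_all_correct(in_flight, branch_fraction, accuracy):
    """Chance that all branches in flight are predicted correctly.

    Assumption: each prediction is right with probability `accuracy`,
    independently of the others.
    """
    branches_in_flight = in_flight * branch_fraction
    return accuracy ** branches_in_flight


print("SPECint95 basic block:", avg_basic_block_length(0.20), "instructions")
print("SPECfp95 basic block: ", avg_basic_block_length(0.10), "instructions")

for accuracy in (0.90, 0.95, 0.99):      # hypothetical predictor accuracies
    p = prob_all_correct(120, 0.20, accuracy)
    print(f"accuracy {accuracy:.2f}: all in-flight branches correct with probability {p:.2f}")
```

With about 24 branches among 120 instructions in flight, even a predictor that is right 95% of the time leaves the whole window correct less than a third of the time in this model.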