
Prediction and speculation: the role of stochastic models of program behaviour in the performance of modern computers



  1. Prediction and speculation: the role of stochastic models of program behaviour in the performance of modern computers. Roberto Innocente, 20 Nov 2005

  2. Speculation (from the Merriam-Webster dictionary): an assumption of an unusual risk in hopes of obtaining commensurate gains.

  3. Speculative execution: a prediction is made of what work is likely to be needed soon. That work is then executed speculatively, in such a way that it can be committed if the prediction was correct or aborted if it was not.
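The commit-or-abort idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `speculate`, `paths` and `resolve` are hypothetical, and real hardware uses dedicated buffers rather than copying the whole state.

```python
def speculate(state, predicted, resolve, paths):
    """Run the predicted path on a scratch copy, then commit or abort.

    All names are hypothetical, chosen for illustration only:
    state     -- dict holding the architectural state
    predicted -- the guessed branch outcome (True = taken)
    resolve   -- function computing the real outcome from the state
    paths     -- dict mapping an outcome to the work done on that path
    """
    scratch = paths[predicted](dict(state))  # speculative work on a copy
    actual = resolve(state)                  # the branch finally resolves
    if actual == predicted:
        return scratch, True                 # commit the speculative result
    # Abort: discard the scratch copy and redo the work on the right path.
    return paths[actual](dict(state)), False


paths = {
    True: lambda s: {**s, "r2": 1},    # work on the taken path
    False: lambda s: {**s, "r2": 2},   # work on the fall-through path
}
# The guess is "taken", but r1 == 0 means the branch is really not taken.
new_state, committed = speculate({"r1": 0}, True, lambda s: s["r1"] != 0, paths)
print(new_state, committed)
```

The architectural state passed in is never modified, so an abort costs only the wasted speculative work.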

  4. von Neumann's model: the stored-program computer. The Control Counter, today called the Program Counter (PC) or Instruction Pointer (IP), keeps the address of the next instruction to be executed. The control part fetches this instruction, decodes it and executes it. At the end the PC is updated.

  5. Linear scaling of speed, quadratic scaling of transistors: let's look at the last scaling step in silicon lithography, from 0.13 µ to 0.09 µ: a 0.70 linear scaling and a 0.49 scaling of surface. Gate delays scale linearly, while the transistors available scale quadratically. We will get much more in available complexity than in gate speed. [Chart: gate speed and transistors versus feature size 0.25, 0.18, 0.13, 0.09, 0.065.]
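The arithmetic behind this slide can be checked with a short sketch. The function name `scaling` is hypothetical; the only assumption is that gate delay follows the linear factor and transistor area follows its square, as the slide states.

```python
def scaling(old_feature, new_feature):
    """Return (linear factor, area factor) for a lithography shrink."""
    linear = new_feature / old_feature   # gate delay scales with this factor
    area = linear ** 2                   # area per transistor scales with the square
    return linear, area


linear, area = scaling(0.13, 0.09)
print(f"linear {linear:.2f}, area {area:.2f}")
print(f"gate speed gain {1 / linear:.2f}x, transistor density gain {1 / area:.2f}x")
```

The exact ratio is a little under the rounded 0.70 and 0.49 quoted on the slide, but the conclusion is the same: one shrink roughly doubles the transistor budget while gates get less than 1.5 times faster.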

  6. von Neumann's projection (collapse) postulate of QM: a system can be described by any mix of states, but if you observe it you can only find it in one of the eigenstates, and you can only measure an eigenvalue. (When you look at it, Schrödinger's cat is either dead or alive.)

  7. Modern microprocessors: today's microprocessors take advantage of the fact that they need to present an architectural state compliant with the standard von Neumann model only from time to time; for the rest of the time they are free to proceed in whatever way they find convenient.

  8. ILP – Instruction Level Parallelism (Fisher 1981): while obeying the standard semantics when required, try to overlap the execution of multiple instructions as much as possible. (We will see that current microprocessors can have more than 100 instructions in flight.)

  9. Enabling technologies for ILP exploitation: pipelining; multiple issue (superscalar).

  10. A microprocessor in 1989 (Intel 386)
     - CPI = Cycles Per Instruction
     - Performance = Frequency / CPI
     - Intel 386: feature size 1 micron; frequency 33 MHz; CPI = 5 to 6
     - Performance = 33 M / 6 ≈ 5.5 Minstructions/s
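As a quick check of the formula Performance = Frequency / CPI, here is a minimal sketch using the slide's figures for the 386 (33 MHz, CPI between 5 and 6). The function name `instructions_per_second` is hypothetical.

```python
def instructions_per_second(frequency_hz, cpi):
    """Performance = Frequency / CPI, in instructions per second."""
    return frequency_hz / cpi


best = instructions_per_second(33e6, 5)    # optimistic CPI
worst = instructions_per_second(33e6, 6)   # pessimistic CPI
print(f"{worst / 1e6:.1f} to {best / 1e6:.1f} million instructions/s")
```

With these inputs the result is in the millions of instructions per second.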

  11. Pipelining: the work to be done is divided into stages (Fetch, Decode, eXecute, Memory, Writeback), with a clear signal interface between them. After each stage a pipeline latch memorizes the state for the next cycle. This adds some overhead, but the hope is to get 1 result per cycle once the pipe is full.

  12. Limits of pipelining
     - A latch can add 2 or 3 gate delays.
     - The work for one instruction is currently around 400 gate delays.
     - With n stages you get a result every 400/n + 3 gate delays.
     - You add a total overhead of 3n gate delays.
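The trade-off on this slide can be tabulated with a small sketch. The function names are hypothetical, and the speedup is measured against an assumed unpipelined design that needs the full 400 gate delays per result with no latch at all.

```python
def cycle_time(n, work=400, latch=3):
    """Gate delays per result with n stages: work/n plus one latch."""
    return work / n + latch


def total_latency(n, work=400, latch=3):
    """Gate delays for one instruction to traverse the whole pipe."""
    return work + latch * n


def speedup(n, work=400, latch=3):
    """Throughput gain over the assumed unpipelined baseline."""
    return work / cycle_time(n, work, latch)


for n in (1, 5, 10, 30, 100):
    print(n, round(cycle_time(n), 1), total_latency(n), round(speedup(n), 1))
```

Because the latch delay does not shrink with n, the speedup can never exceed work/latch (about 133 here), while the latency of a single instruction keeps growing.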

  13. Pipeline at work: when there is a dependency between instructions we say that the pipeline is stalled, or that a bubble is inserted, while waiting for the dependency to resolve. Here a control dependency causes a 4-cycle stall. [Pipeline diagram, stages F D X M W over cycles 1 to 10. Instruction stream: add r1,r3,r4; mul r5,r6,r7; bnez loop,r1; four bubble cycles (X); div r8,r3,r6; add r10,r8,r9; jmp loop.]
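To see what a 4-cycle stall per branch costs, here is a rough sketch of the usual effective-CPI estimate. The function name and the `accuracy` parameter are assumptions for illustration; the 20% branch fraction is the SPECint95 figure quoted later in this deck.

```python
def effective_cpi(branch_fraction, stall_cycles, accuracy=0.0, base_cpi=1.0):
    """Average cycles per instruction when unpredicted branches stall the pipe.

    accuracy is the fraction of branches whose stall is avoided by a
    (hypothetical) predictor; 0.0 means every branch stalls.
    """
    return base_cpi + branch_fraction * (1.0 - accuracy) * stall_cycles


print(effective_cpi(0.20, 4))                 # every branch stalls 4 cycles
print(effective_cpi(0.20, 4, accuracy=0.9))   # 9 out of 10 stalls avoided
```

Under these assumptions an ideal one-result-per-cycle pipeline loses almost half of its throughput to control stalls unless most of them are avoided.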

  14. Instruction dependencies
     - Data dependency:
         add r1,r2,r3 ; r1<-r2+r3
         mul r1,r4,r5 ; r5<-r4*r5
       Solution: register renaming, result forwarding
     - Control dependency:
         bne label1,r1,r2
         add r1,r2,r3
         label1:
         mul r4,r5,r6
       Solution: branch prediction
     - Structural dependency:
       Solution: add functional units

  15. Multiple issue (superscalar) architectures: architectures that are able to process multiple instructions at a time. While it was common to have multiple execution units (such as an integer unit and an FP unit), only in the '90s did the first superscalar architectures appear, e.g. IBM Power and Pentium Pro. These architectures require very good branch prediction. [Figure: a 2-way superscalar, two parallel F D X M W pipelines.]

  16. Superscalar/2
     - Current architectures are commonly 4-way or 8-way superscalar.
     - The design of the last Alpha, canceled in its late phase, was for an 8-way superscalar.
     - Extremely good branch prediction is needed: there can be hundreds of instructions in flight (4 ways * 30 stages = 120).
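A back-of-the-envelope sketch shows why the prediction has to be extremely good. The function names and the 95% accuracy are assumptions chosen for illustration, predictions are treated as independent, and the one-branch-in-five rate is the SPECint95 figure quoted later in this deck.

```python
def in_flight(width, depth):
    """Instructions in flight in a width-way machine with depth stages."""
    return width * depth


def all_predictions_correct(accuracy, branches):
    """Probability that every in-flight branch was predicted correctly,
    assuming independent predictions (a simplifying assumption)."""
    return accuracy ** branches


window = in_flight(4, 30)      # 4 ways * 30 stages
branches = window // 5         # about 1 instruction in 5 is a branch
print(window, branches, round(all_predictions_correct(0.95, branches), 3))
```

Even with 95% of branches predicted correctly, the chance that a full window of 120 instructions contains no misprediction is below one in three under these assumptions.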

  17. Superscalar at work: the wasted cycle slots are now much more than in the pipelined-only case. [Pipeline diagram, stages F D X M W over cycles 1 to 5, 2-way issue: add r1,r3,r4 and mul r5,r6,r7 in cycle 1, bnez loop,r1 in cycle 2, followed by bubbles (X) in both issue slots.]

  18. Real world architectures: IBM Power5

  19. 15 years of x86 (feature size in microns, transistor count in thousands, frequency in MHz)

      year  processor     feature  transistor  cycles/  frequency  pipe    FO4 gates
                          size     count       instr.              length  per cycle
      1979  8088                               12
      1988  386dx         1        275         5        33                 80
      1991  486dx         1        1100                 50
      1993  Pentium 60    0.8      3100                 60         5
      1995  Pentium Pro   0.6      5500                 150        10
      1997  Pentium II    0.35     7500                 233        10
      1999  Pentium III   0.25     9500                 450        10
      2000  Pentium 4     0.18     42000                1300       20
      2005  Pentium 4 571 0.09     130000               3800       30      13

  20. Feature size, frequency, complexity. [Three charts over the processors 386, 486dx, P 60, P Pro, P II, P III, P 4, P 4 571: feature size (axis 0 to 1), frequency (axis 0 to 4000) and transistor count (axis 0 to 130000).]

  21. A microprocessor in 2005 (Intel Pentium 4)
     - IPC = Instructions Per Cycle
     - Performance = Frequency * IPC
     - Intel Pentium 4: feature size 90 nm; frequency 3 GHz; IPC ~ 2 to 3 (2 for SPECint, 3 for SPECfp)
     - Performance = 3 G * 2 = 6 Ginstructions/s
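Using only the figures quoted in this deck for the 386 (33 MHz, CPI of 6) and the Pentium 4 (3 GHz, IPC of 2), the overall gain can be split into a clock contribution and a per-cycle contribution. The variable names are hypothetical; this is a rough sketch, not a benchmark result.

```python
# 386: Performance = Frequency / CPI
perf_386 = 33e6 / 6
# Pentium 4: Performance = Frequency * IPC
perf_p4 = 3e9 * 2

total_gain = perf_p4 / perf_386
clock_gain = 3e9 / 33e6        # gain from frequency alone
per_cycle_gain = 2 * 6         # IPC of 2 versus 1/6 instruction per cycle

print(f"total {total_gain:.0f}x = clock {clock_gain:.0f}x * per-cycle {per_cycle_gain}x")
```

On these numbers the clock accounts for most of the improvement, and work done per cycle contributes roughly one further order of magnitude.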

  22. Control transfer instructions: some instructions, instead of simply incrementing the PC to the next instruction, change it to a different value. We distinguish:
     - unconditional branches, or simply jumps
     - conditional branches, or simply branches
     - subroutine calls
     - subroutine returns
     - traps, returns from interrupts or exceptions

  23. Assembly – machine instructions. Only jumps or branches:
     - j <label>
     - j @register
     - beq <label>
     - bne <label>
     - bz <label>
     - bnz <label>

  24. High level language – assembly
     - for(i=1;i<=4;i++) { .. }
           ld r1,1
           ld r2,4
         loop: cmp r1,r2
           beq out
           ..
           add r1,r1,1
           jmp loop
         out:
     - if (i) { .. }
           ld r1,i
           bz next
           ..
         next:
     - while (i--) { .. }
         loop: sub r1,1
           bz out
           ..
           jmp loop
         out:

  25. SPEC (Standard Performance Evaluation Corporation) benchmarks
     - A well-known set of benchmarks, continuously updated and recognized as representative of possible workloads.
     - Divided into 2 big sets:
       SPECint: integer programs (go, m88ksim, compress, li, ijpeg, perl, vortex)
       SPECfp: floating point programs (mathematical simulation programs)
     - http://www.spec.org

  26. Branches by type (average from SPECint95). [Pie chart: conditional 72%, immediate 16%, returns 10%, indirect 2%.]

  27. Branches by frequency (SPEC95). [Bar chart: dynamic instructions, dynamic branches and dynamic conditional branches for the benchmarks compress, perl, gcc, go, ijpeg, m88ksim, vortex and xlisp; y-axis in millions of instructions, 0 to 250.]

  28. Branches by taken rate (average from SPECint95). [Pie chart: share of branches in the classes never taken, taken 0-5%, 5-50%, 50-95%, 95-100% of the time, and always taken; the slice labels read 14%, 14%, 7%, 21%, 24% and 20%.]

  29. Occurrences of branches (conditional branches):
     - SPECint95: 1 out of 5 instructions executed (20%)
     - SPECfp95: 1 out of 10 instructions executed (10%)
     - Basic block is the term used for a sequence of instructions without any control transfer.
     Note: this dynamic rate is different from, and much higher than, the rate of branches in the static program.
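The branch rates above translate directly into an average run length between conditional branches, which bounds how much straight-line code a processor sees at a time. A minimal sketch, with the hypothetical name `avg_run_length`:

```python
def avg_run_length(branch_fraction):
    """Average number of executed instructions per conditional branch
    (the branch itself included)."""
    return 1.0 / branch_fraction


specint = avg_run_length(0.20)   # SPECint95: 1 branch in 5 instructions
specfp = avg_run_length(0.10)    # SPECfp95: 1 branch in 10 instructions
print(specint, specfp)
```

Since other control transfers (jumps, calls, returns) also end a basic block, the average basic block is at most this long.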
