
Computer Architecture for the Next Millennium, November 1, 1999



  1. Computer Architecture for the Next Millennium
     November 1, 1999
     William J. Dally
     Computer Systems Laboratory, Stanford University
     billd@csl.stanford.edu

  2. Outline
     • The Stanford Concurrent VLSI Architecture Group
     • Forces acting on computer architecture
       – applications (media)
       – technology (wire-limited)
       – techniques (explicit parallelism)
     • Example: register organization
       – distributed register files
     • Imagine a stream processor
       – 20 GFLOPS on a 0.5 cm² chip
     • Tremendous opportunities and challenges for computer architecture in the next millennium
       – it’s not a mature field yet
     WJD November 1, 1999 Computer Architecture for the Next Millennium 2

  3. The Concurrent VLSI Architecture Group
     • Architecture and design technology for VLSI
     • Routing chips
       – Torus Routing Chip, Network Design Frame, Reliable Router
       – Basis for Intel, Cray/SGI, Mercury, Avici network chips

  4. Parallel computer systems
     • J-Machine (MDP) led to Cray T3D/T3E
     • M-Machine (MAP)
       – Fast messaging, scalable processing nodes, scalable memory architecture
     [Figure labels: MDP Chip, J-Machine, Cray T3D, MAP Chip]

  5. Design technology
     • Off-chip I/O
       – Simultaneous bidirectional signaling, 1989
         • now used by Intel and Hitachi
       – High-speed signaling
         • 4Gb/s in 0.6 µm CMOS, Equalization, 1995
     • On-chip signaling
       – Low-voltage on-chip signaling
       – Low-skew clock distribution
     • Synchronization
       – Mesochronous, Plesiochronous
       – Self-Timed Design
     [Figure labels: 250ps/division; 4Gb/s CMOS I/O]

  6. What is Computer Architecture?
     [Figure labels: Computer Architect; Technology; Applications; Interfaces (API, ISA, Link, I/O Chan); Machine Organization (IR, Regs); Measurement & Evaluation]

  7. Forces Acting on Architecture
     • Applications - shifting towards media applications dealing with streams of low-precision samples
       – video, graphics, audio, DSL modems, cellular base stations
     • Technology - becoming wire-limited
       – power and delay dominated by communication, not arithmetic
       – global structures (register files and instruction issue) don’t scale
     • Technique - Micro-architecture - ILP has been mined out
       – to the point of diminishing returns on squeezing performance from sequential code
       – explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance

  8. Applications
     • Little locality of reference
       – read each pixel once
       – often non-unit stride
       – but there is producer-consumer locality
     • Very high arithmetic intensity
       – 100s of arithmetic operations per memory reference
     • Dominated by low-precision (16-bit) integer operations

  9. Wires Are Becoming Like Wet Noodles
     [Figure: minimum-width wire in a 0.35 µm process, with traces labeled 0.0mm, 2.5mm, 5.0mm, 7.5mm, and 10.0mm]

  10. Technology scaling makes communication the scarce resource
                             1999           2008
      Process                0.18 µm        0.07 µm
      DRAM                   256Mb          4Gb
      64b FP Proc            16             256
      Clock rate             500MHz         2.5GHz
      Chip edge              18mm           25mm
      Wire tracks            30,000         120,000
      Cross-chip delay       1 clock        16 clocks
      Repeaters              every 3mm      every 0.5mm
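The scaling figures on this slide lend themselves to a little arithmetic. The sketch below assumes the pairing 1999 = (18mm, 30,000 tracks, repeaters every 3mm, 1 clock) and 2008 = (25mm, 120,000 tracks, repeaters every 0.5mm, 16 clocks), and reads the millimeter figure as the chip edge. The class and method names are illustrative and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipGeneration:
    year: int
    edge_mm: float              # assumed to be the chip edge length
    tracks: int                 # wire tracks across one chip edge
    repeater_spacing_mm: float
    cross_chip_clocks: int

    def track_pitch_um(self) -> float:
        """Average wire pitch implied by the number of tracks per edge."""
        return self.edge_mm * 1000.0 / self.tracks

    def repeater_intervals_per_crossing(self) -> float:
        """Number of repeater-to-repeater segments needed to span the chip."""
        return self.edge_mm / self.repeater_spacing_mm


gen_1999 = ChipGeneration(1999, 18.0, 30_000, 3.0, 1)
gen_2008 = ChipGeneration(2008, 25.0, 120_000, 0.5, 16)

for gen in (gen_1999, gen_2008):
    print(
        f"{gen.year}: pitch {gen.track_pitch_um():.2f} um, "
        f"{gen.repeater_intervals_per_crossing():.0f} repeater intervals, "
        f"{gen.cross_chip_clocks} clock(s) to cross the chip"
    )
```

Under these assumptions the wire pitch shrinks by roughly a factor of three while the number of repeater segments per crossing grows from 6 to 50, which is one way to see why a cross-chip wire goes from one clock to sixteen.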

  11. Care and Feeding of ALUs
      • ‘Feeding’ structure dwarfs ALU
      [Figure labels: IP, Instr. Cache, IR (Instruction Bandwidth); Regs (Data Bandwidth)]

  12. What Does This Say About Architecture?
      • Tremendous opportunities
        – Media problems have lots of parallelism and locality
        – VLSI technology enables 100s of ALUs per chip (1000s soon)
          • (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
      • Challenging problems
        – Locality - global structures won’t work
        – Explicit parallelism - ILP won’t keep 100 ALUs busy
        – Memory - streaming applications don’t cache well
      • It’s time to try some new approaches
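The "100s of ALUs per chip" claim follows directly from the per-adder areas quoted on this slide. A minimal sketch, assuming a hypothetical area budget for arithmetic (the 50 mm² figure below is an illustration, not a number from the talk) and ignoring wiring, registers, and control:

```python
from fractions import Fraction

# Per-unit areas from the slide (0.18 um process), kept exact as fractions.
INT_ADDER_MM2 = Fraction("0.1")
FP_ADDER_MM2 = Fraction("0.5")


def units_that_fit(area_budget_mm2, unit_area_mm2) -> int:
    """Whole units that fit in the budget, counting arithmetic area only."""
    budget = Fraction(area_budget_mm2)
    unit = Fraction(unit_area_mm2)
    if budget < 0 or unit <= 0:
        raise ValueError("budget must be non-negative and unit area positive")
    return int(budget // unit)


ARITHMETIC_BUDGET_MM2 = 50  # hypothetical budget, chosen only for illustration

print("integer adders:", units_that_fit(ARITHMETIC_BUDGET_MM2, INT_ADDER_MM2))
print("FP adders:     ", units_that_fit(ARITHMETIC_BUDGET_MM2, FP_ADDER_MM2))
```

Exact fractions are used instead of floats because floor division by a float such as 0.1 can come out one unit short.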

  13. Example Register File Organization
      • Register files serve two functions:
        – Short term storage for intermediate results
        – Communication between multiple function units
      • Global register files don’t scale well as N, the number of ALUs, increases
        – Need more registers to hold more results (grows with N)
        – Need more ports to connect all of the units (grows with N²)
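The scaling argument on this slide can be sketched as a toy area model. Everything numeric here is an assumption made for illustration: three ports per ALU (two reads, one write), eight registers per ALU, 32-bit words, and a register cell whose width and height each grow by one wire track per port. Under those assumptions the register count grows with N and the per-cell area with N², so a central file grows roughly as N³, while splitting the same ALUs into clusters keeps each file small. The model ignores the cost of communication between clusters.

```python
def register_file_area(num_alus, regs_per_alu=8, ports_per_alu=3,
                       bits=32, base_w=4, base_h=4):
    """Area, in wire-track squares, of one register file serving num_alus ALUs."""
    if num_alus < 1:
        raise ValueError("num_alus must be at least 1")
    ports = ports_per_alu * num_alus
    # Assumed cell model: each port adds one track in each dimension.
    cell_area = (base_w + ports) * (base_h + ports)
    return regs_per_alu * num_alus * bits * cell_area


def clustered_area(num_alus, clusters, **model):
    """Total area when the ALUs are split evenly across separate register files."""
    if clusters < 1 or num_alus % clusters != 0:
        raise ValueError("num_alus must divide evenly into clusters")
    return clusters * register_file_area(num_alus // clusters, **model)


for n in (8, 16, 32, 64):
    central = register_file_area(n)
    split = clustered_area(n, clusters=8)
    print(f"N={n:3d}  central={central:>12,}  8 clusters={split:>12,}  "
          f"ratio={central / split:6.1f}")
```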

  14. Register Cells are Mostly Switch
      [Figure labels: Bit Lines, Word Lines, Vdd, Gnd; cell dimensions w and h; p ports; 1 wire grid]

  15. Register Architecture for ‘wide’ Processors
      [Figure, four panels: (A) N arithmetic units; (B) C SIMD clusters of N/C arithmetic units each; (C) N arithmetic units; (D) C SIMD clusters of N/C arithmetic units each]

  16. Area of Register Organizations
      [Figure: area (0.1 to 1000) versus number of arithmetic units (1 to 1000) for the Central, SIMD, DRF, and SIMD/DRF organizations]

  17. Delay of Register Organizations
      [Figure: delay (0.1 to 1000) versus number of arithmetic units (1 to 1000) for the Central, SIMD, DRF, and SIMD/DRF organizations]

  18. Performance of Register Organizations
      [Figure, two bar charts on a 0.00 to 1.20 scale comparing Central, SIMD, SIMD/DRF, HIERARCHICAL, and STREAM: (A) Raw Performance; (B) Performance with Latency]

  19. Stubs Abstract the Communication Between Operations
      [Figure labels, in order: FU (Op 1), Write stub, RF, Data transfer, Read stub, RF, FU (Op 2)]

  20. A Communication Example
      [Figure, panels (a), (b), (c): Instruction 5 and Instruction 6 scheduled on the +, *, and L/S units, with values k, x, t1, t2 and a Pass operation]

  21. The Imagine Stream Processor
      [Figure: block diagram of the Imagine Stream Processor: four SDRAMs, Streaming Memory System, Stream Register File, Host Interface (to Host Processor), Network Interface (to Network), Microcontroller, ALU Clusters 0 through 7]

  22. Data Bandwidth Hierarchy
      [Figure: four SDRAMs, the Stream Register File, and the ALU Clusters of the Imagine Stream Processor, with bandwidths labeled 3.2GB/s, 64GB/s, and 544GB/s]
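The three bandwidth figures on this slide can be compared directly. The sketch below assumes they attach, in order, to the SDRAM interface, the stream register file, and the register files inside the ALU clusters; that mapping is read off the order of the figure labels, and the dictionary keys are illustrative names.

```python
# Assumed mapping of the slide's three figures to hierarchy levels.
BANDWIDTH_GB_S = {
    "sdram": 3.2,
    "stream_register_file": 64.0,
    "cluster_local": 544.0,
}


def ratios_to(base: str, table: dict) -> dict:
    """Bandwidth of every level expressed as a multiple of the base level."""
    return {level: bw / table[base] for level, bw in table.items()}


for level, ratio in ratios_to("sdram", BANDWIDTH_GB_S).items():
    print(f"{level:22s} {BANDWIDTH_GB_S[level]:7.1f} GB/s  {ratio:6.1f}x SDRAM")
```

Under that reading each step inward buys roughly an order of magnitude: about 20 times memory bandwidth at the stream register file and about 170 times inside the clusters.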

  23. Cluster Architecture
      • VLIW organization with shared control
      • Local register files provide high data bandwidth
      [Figure labels: Intercluster Network, Local Register File, function units (+, +, +, *, *, /), CU, Cross Point, To Stream Buffers, From Stream Buffers]

  24. Imagine is a Stream Processor
      • Instructions are Load, Store, and Operate
        – operands are streams
        – also Send and Receive for multiple-Imagine systems
      • Operate performs a compound stream operation
        – read elements from input streams
        – perform a local computation
        – append elements to output streams
        – repeat until input stream is consumed
        – (e.g., triangle transform)
      • Order of magnitude less global register bandwidth than a vector processor
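The Operate loop described on this slide can be sketched in a few lines. This is an illustration of the read, compute, append pattern only, not Imagine's instruction set: Python lists stand in for streams, and `operate`, `make_transform`, and the scale-and-offset "triangle transform" are hypothetical names and behavior chosen for the example.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Triangle = Tuple[Vertex, Vertex, Vertex]


def operate(kernel: Callable[..., Iterable], *input_streams: Sequence) -> List:
    """Run kernel over the input streams until the shortest one is consumed."""
    output_stream: List = []
    for elements in zip(*input_streams):          # read one element per input stream
        output_stream.extend(kernel(*elements))   # local compute, then append results
    return output_stream


def make_transform(scale: float, offset: Vertex) -> Callable[[Triangle], List]:
    """Build a toy kernel that scales and translates every vertex of a triangle."""
    def transform(tri: Triangle) -> List:
        moved = tuple(
            tuple(scale * c + o for c, o in zip(vertex, offset))
            for vertex in tri
        )
        return [moved]  # one output element per input element
    return transform


triangles = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
transformed = operate(make_transform(2.0, (1.0, 1.0, 1.0)), triangles)
print(transformed)
```

Intermediate values live only inside the kernel call, which is the point of the last bullet: only stream elements cross the global register file, so a kernel that returns an empty list for some inputs (a cull, for instance) simply appends nothing.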

  25. Triangle Rendering
      [Figure: triangle-rendering pipeline drawn across Memory, Stream Register File, and Arithmetic Clusters, showing Memory Bandwidth and Register Bandwidth per word and per record. Kernels: Transform, Shade, Project/Cull, Span Setup, Process Span, Sort, Compact, Z-Composite. Streams: Input Data, Triangle Records, Shaded Triangle Records, Projected Triangle Records, Span Records, Fragment Records, Image Buffer Indices, Pixel Depth & Color, Image Depth & Color]

  26. Bandwidth Demands: Transform Kernel
      References (per ∆)    Stream    Scalar        Vector
      Memory                5.5       117 (21.3)    48 (8.7)
      Global RF             48        624 (13.0)    261 (5.4)
      Local RF              372       N/A           N/A
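The parenthesized figures on this slide appear to be ratios relative to the Stream column; the slide does not say so, but the arithmetic below is consistent with that reading. The dictionary layout and function name are illustrative.

```python
# References per triangle, as given on the slide.
REFS_PER_TRIANGLE = {
    "Memory":    {"stream": 5.5, "scalar": 117, "vector": 48},
    "Global RF": {"stream": 48,  "scalar": 624, "vector": 261},
}


def ratio_to_stream(level: str, organization: str) -> float:
    """References made by an organization, as a multiple of the stream figure."""
    row = REFS_PER_TRIANGLE[level]
    return row[organization] / row["stream"]


for level in REFS_PER_TRIANGLE:
    for organization in ("scalar", "vector"):
        print(f"{level:10s} {organization:6s} "
              f"{ratio_to_stream(level, organization):5.1f}x stream")
```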

  27. Data Parallelism is easier than ILP
      Kernel             1 to 8 Cluster Speedup
      FFT (1024)         6.4
      DCT (8x8)          7.8
      Blockwarp (8x8)    7.2
      Transform (∆)      8.0
      Harmonic Mean      7.3
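The summary row on this slide can be reproduced from the four kernel speedups. A short check using the standard library; the per-kernel efficiency (speedup divided by the 8 clusters) is an added convenience and not a figure from the slide.

```python
from statistics import harmonic_mean

CLUSTERS = 8
SPEEDUP_1_TO_8 = {
    "FFT (1024)": 6.4,
    "DCT (8x8)": 7.8,
    "Blockwarp (8x8)": 7.2,
    "Transform": 8.0,
}

for kernel, speedup in SPEEDUP_1_TO_8.items():
    print(f"{kernel:16s} speedup {speedup:.1f}  efficiency {speedup / CLUSTERS:.0%}")

mean_speedup = harmonic_mean(SPEEDUP_1_TO_8.values())
print(f"harmonic mean    {mean_speedup:.1f}")
```

The harmonic mean is the appropriate average for speedups because it weights the slower kernels more heavily than an arithmetic mean would.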

  28. Conventional Approaches to Data-Dependent Conditional Execution
      [Figure labels: basic blocks A, B, C, J, K; Data-Dependent Branch on x>0 (Y/N); Speculative Loss D x W ~100s, "Whoops"; predicate y=(x>0) with blocks marked "if y" and "if ~y"; Exponentially Decreasing Duty Factor]
