
Major Challenges to Achieve Exascale Performance - Shekhar Borkar



  1. Major Challenges to Achieve Exascale Performance. Shekhar Borkar, Intel Corp. April 29, 2009. Acknowledgment: Exascale WG sponsored by Dr. Bill Harrod, DARPA (IPTO).

  2. Outline: Exascale performance goals; major challenges; potential solutions; paradigm shift; summary.

  3. Performance Roadmap. Chart: peak performance from MFLOP (1960s) through GFLOP, TFLOP, and PFLOP toward EFLOP (~2020), with the successive 1000X steps taking roughly 12, 11, and 10 years.

  4. From Giga to Exa, via Tera & Peta. Four charts (1986-2016) compare relative transistor performance, concurrency, relative energy per operation, and power across the Giga, Tera, Peta, and Exa machines: transistor performance improves only by tens of X per era, while total performance grows ~4,000X from Giga to Tera and ~2.5M X toward Peta, so the gap is filled by rapidly growing concurrency, and energy per operation must fall sharply (helped by Vcc scaling down from the 5V era) to keep power growth in check.

  5. Building with Today's Technology. A TFLOP machine built today consumes about 5 KW: ~200W compute (200 pJ per FLOP), ~150W memory (0.1 B/FLOP at 1.5 nJ per Byte), ~100W communication (100 pJ per FLOP), ~100W disk (10 TB at 1 TB per 10W disk), and ~4,450W of decode and control, translations, power supply losses, cooling, etc. Scaled naively: KW for Tera, MW for Peta, GW for Exa?
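
A back-of-envelope sketch of the slide's arithmetic. The per-operation energies and the overhead figure are taken from the slide; the linear extrapolation to Peta and Exa is exactly the naive scaling the slide is warning about.

```python
# Naive power estimate for a machine built from today's (2009) components,
# using the per-operation energy costs quoted on the slide.
FLOPS = 1e12                       # 1 TFLOP/s sustained

compute_W  = 200e-12 * FLOPS       # 200 pJ per FLOP              -> 200 W
memory_W   = 1.5e-9 * 0.1 * FLOPS  # 0.1 Byte/FLOP at 1.5 nJ/Byte -> 150 W
comm_W     = 100e-12 * FLOPS       # 100 pJ of communication/FLOP -> 100 W
disk_W     = 10 * 10               # 10 disks (1 TB each) at 10 W -> 100 W
overhead_W = 4450                  # decode/control, supply losses, cooling, ...

total_W = compute_W + memory_W + comm_W + disk_W + overhead_W
print(f"TFLOP machine: {total_W/1e3:.1f} kW")      # ~5 kW
print(f"Naive PFLOP:   {total_W*1e3/1e6:.0f} MW")  # ~5 MW
print(f"Naive EFLOP:   {total_W*1e6/1e9:.0f} GW")  # ~5 GW
```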

  6. The Power & Energy Challenge. The same TFLOP machine: today ~5 KW total, dominated by ~4.5 KW of control, power delivery, and cooling overhead on top of 100W disk, 100W communication, 150W memory, and 200W compute. Rebuilt with Exa-era technology, each component shrinks to a few watts to a couple of tens of watts (disk ~20W, communication and memory a few watts each, compute ~5W), bringing the whole TFLOP machine down to tens of watts.

  7. Starting Point: Optimistic yet Realistic. The 80-core TFLOP research chip: each 3 mm2 tile (1.5 mm x 2.0 mm) holds two floating-point MACs (FPMAC0/1), a register file (RF), instruction and data memory (IMEM/DMEM), a router with its interface block (RIB) and mesochronous interface (MSINT), and local clocking; the 12.64 mm x 21.72 mm die adds a global clock spine with clock buffers, PLL, TAP, and I/O areas. Technology: 65nm CMOS, 1 poly + 8 metal (Cu) interconnect; 100 million transistors; 275 mm2 die area; package: 1248-pin LGA, 14 layers, 343 signal pins.

  8. Scaling Assumptions (high-volume year in parentheses):
     Technology node:            45nm (2008), 32nm (2010), 22nm (2012), 16nm (2014), 11nm (2016), 8nm (2018), 5nm (2020)
     Transistor density:         1.75X each generation
     Frequency scaling:          15%, 10%, 8%, 5%, 4%, 3%, 2% per generation
     Vdd scaling:                -10%, -7.5%, -5%, -2.5%, -1.5%, -1%, -0.5% per generation
     Dimension & capacitance:    0.75X each generation
     SD leakage scaling/micron:  1X (optimistic) to 1.43X (pessimistic)
     Example core + local memory (DP FP add/multiply, integer core, RF, router; half the area is 0.35 MB of memory):
     65nm: 10 mm2 (5 mm2 memory), 3 GHz, 6 GF, 1.8 W  ->  8nm: 0.34 mm2 (~0.6 mm on a side, 0.17 mm2 memory), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W
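
A sketch that compounds the per-generation factors from this table to project the example 65nm core out to 8nm. Interpreting each column's percentage as the gain over the previous node, and modeling only dynamic power, are my assumptions; the starting values are from the slide.

```python
# Project the 65nm core (10 mm^2, 3 GHz, 6 GF, 1.8 W) to the 8nm node by
# compounding the per-generation factors from the scaling-assumptions table.
freq_gain = [1.15, 1.10, 1.08, 1.05, 1.04, 1.03]    # 65->45->32->22->16->11->8nm
vdd_scale = [0.90, 0.925, 0.95, 0.975, 0.985, 0.99]
dim_scale = 0.75                                     # linear dimension & capacitance per gen

freq_ghz, area_mm2, power_w = 3.0, 10.0, 1.8
for f, v in zip(freq_gain, vdd_scale):
    freq_ghz *= f
    area_mm2 *= dim_scale ** 2            # area shrinks with the square of dimension
    power_w  *= dim_scale * v * v * f     # dynamic power ~ C * Vdd^2 * f (no leakage term)

print(f"8nm core: {freq_ghz:.2f} GHz, {2*freq_ghz:.1f} GF, "
      f"{area_mm2:.2f} mm^2, ~{power_w:.2f} W dynamic")
# -> roughly 4.61 GHz, 9.2 GF, ~0.32 mm^2, ~0.28 W; in line with the
#    0.34 mm^2 and 0.24-0.46 W on the slide (that range reflects leakage).
```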

  9. Processor Chip. Charts: projected chip performance (GFLOPs) and chip power (W) from 65nm to 5nm. At the 2018, 8nm technology node, a 20 mm x 20 mm (400 mm2) module holds 1,150 cores with 400 MB of total local memory at 4.61 GHz: 10.6 TF peak, 300-600W, 34-18 GF/Watt, implying 30-60 MW for Exascale.

  10. Processor Node. One processor chip plus four 128 GB DRAM stacks (256 GB/s over a 64b interface each): 10.6 TF peak, 512 GB total DRAM capacity, 1 TB/s total DRAM bandwidth (0.1 B/FLOP), ~800W DRAM power*, 1,100-1,400W total, 9.5-8 GF/Watt, implying 110-140 MW for Exascale. (*Assumes 5% Vdd scaling each technology generation and 140 pJ of energy consumed per accessed bit.)
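
A quick sanity check of the chip-level and node-level Exascale projections, assuming (my assumption) that peak performance and power simply scale linearly with the number of nodes:

```python
# Scale the 8nm chip and node projections up to 1 EFLOP (1e9 GF) peak.
EXA_GF = 1e9
chip_gf, chip_w = 10.6e3, (300, 600)   # 10.6 TF chip at 300-600 W
node_w = (1100, 1400)                  # chip + 4 DRAM stacks (~800 W of DRAM)

chips = EXA_GF / chip_gf               # ~94,000 chips/nodes for an exaflop
for label, (lo, hi) in [("chip-only", chip_w), ("full node", node_w)]:
    print(f"{label}: {chips*lo/1e6:.0f}-{chips*hi/1e6:.0f} MW "
          f"({chip_gf/hi:.0f}-{chip_gf/lo:.0f} GF/W)")
# -> chip-only ~28-57 MW (18-35 GF/W), full node ~104-132 MW (8-10 GF/W),
#    close to the 30-60 MW and 110-140 MW figures quoted on the slides.
```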

  11. Node Power Breakdown. For the ~10 TF, ~1 KW node, power is split among compute, fabric, and DRAM. Mitigations, roughly one per slice: aggressive voltage scaling for compute; hierarchical, heterogeneous topologies for the fabric; efficient signaling and repartitioning for the DRAM.

  12. Voltage Scaling. Chart (normalized Vdd from ~0.3 to ~0.9): when a design is built to voltage scale, frequency and total power fall steeply as Vdd drops while energy efficiency rises, with leakage becoming a growing fraction of the total at the lowest voltages.

  13. Energy Efficiency with Vdd Scaling. Chart: compute energy efficiency (GF/W) from 65nm to 5nm at nominal Vdd scaled by 0.7X and 0.5X, reaching well over 100 GF/W at the deepest nodes. Vdd scaling buys roughly 3X compute energy efficiency.
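
A first-order model (my assumption, not from the slides: an alpha-power frequency law plus a fixed leakage fraction at nominal Vdd) illustrating why scaling Vdd toward threshold trades frequency for roughly a 3X gain in energy efficiency:

```python
# Relative energy efficiency vs. supply voltage under a simple alpha-power model.
def rel_efficiency(v, vt=0.3, alpha=1.3, leak0=0.1):
    """Energy efficiency at supply v relative to v = 1.0 (normalized units)."""
    f  = (v - vt) ** alpha / v        # alpha-power frequency model
    f1 = (1 - vt) ** alpha / 1.0      # frequency at nominal Vdd
    e_dyn  = v ** 2                   # dynamic energy per op ~ C * V^2
    e_leak = leak0 * v * f1 / f       # leakage energy per op = P_leak/f, P_leak ~ V
    return (1 + leak0) / (e_dyn + e_leak)

for v in (1.0, 0.7, 0.5):
    print(f"Vdd {v:.1f}x -> {rel_efficiency(v):.1f}x energy efficiency")
# -> 1.0x, 1.9x, 2.9x: roughly the 3X gain the slide attributes to Vdd scaling,
#    with leakage limiting the benefit as Vdd approaches Vt.
```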

  14. On-die Mesh Interconnect. Die plots: roughly 70, 123, 214, and 375 cores fit on a 20 mm x 20 mm die as the process scales from 45nm through 32nm and 22nm to 16nm. Chart of chip power (network vs. compute) from 65nm to 5nm: on-die network (mesh) power is high, and worse still if the link width scales up each generation.

  15. Mesh: A Retrospective.
     Bus: good at the board level, but does not extend well.
     • Transmission line issues: loss and signal integrity, limited frequency
     • Width is limited by pins and board area
     • Broadcast, simple to implement
     Point-to-point busses: fast signaling over longer distances.
     • Board level, between boards, and between racks
     • High frequency, narrow links
     • 1D ring, 2D mesh, and torus to reduce latency
     • Higher complexity and latency in each node
     Hence the emergence of packet-switched networks. But does a point-to-point packet-switched network belong on a chip?

  16. Interconnect Delay & Energy. Chart (65nm, 3 GHz): wire delay (ps) and energy (pJ/bit) versus link length (0-20 mm), with router delay marked for comparison; both delay and energy per bit grow substantially with distance.

  17. Bus: The Other Extreme. Issues: slow (< 300 MHz), shared, limited scalability? Solutions: repeaters to increase frequency, wide busses for bandwidth, multiple busses for scalability. Benefits: power? simpler cache coherency. Move away from frequency, embrace parallelism.

  18. Hierarchical & Heterogeneous. Diagram: clusters of cores (C) connected by short local busses, with routers (R) and a 2nd-level bus linking the clusters. Use busses to connect over short distances, in a hierarchy of busses, or use hierarchical circuit- and packet-switched networks.
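
An illustrative model of why a hierarchy can beat a flat mesh on communication energy. All of the per-hop and per-bus energy numbers below are assumptions chosen for illustration, not figures from the slides; only the qualitative argument (hop count grows with die-wide meshes, locality keeps it small) comes from the talk.

```python
# Average energy to move one byte between two random cores on a flat 2D mesh
# versus a two-level hierarchy of small busses joined by a global network.
import math

N = 1024                 # cores on the die
E_ROUTER = 4.0           # pJ/byte per router traversal (assumed)
E_LINK   = 2.0           # pJ/byte per inter-tile link   (assumed)
E_BUS    = 6.0           # pJ/byte per short local bus    (assumed)

# Flat k x k mesh: uniform random traffic averages ~2k/3 hops.
k = int(math.sqrt(N))
mesh_hops = 2 * k / 3
mesh_pj = mesh_hops * (E_ROUTER + E_LINK)

# Hierarchy: 16-core bus clusters, a small mesh of 64 clusters on top.
cluster = 16
kc = int(math.sqrt(N // cluster))
hier_pj = E_BUS + (2 * kc / 3) * (E_ROUTER + E_LINK) + E_BUS  # local, global, local

print(f"flat mesh: ~{mesh_pj:.0f} pJ/byte ({mesh_hops:.1f} hops)")
print(f"hierarchy: ~{hier_pj:.0f} pJ/byte")
# Real traffic is mostly local, so the hierarchical case is usually even
# cheaper than this uniform-random-traffic estimate suggests.
```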

  19. Revise DRAM Architecture. The energy cost of a DRAM access today is roughly 175 pJ/bit, split across signaling, control, and the DRAM array. Traditional DRAM (RAS/CAS addressing) activates many pages, reads and writes (refreshes) far more bits than needed, uses only a small fraction of the data it reads, and requires only a small number of pins. A new DRAM architecture would activate few pages, read and write (refresh) only what is needed, use all of the data it reads, and rely on a large number of I/Os (e.g. 3D stacking).
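
A rough illustration of why page over-fetch matters. The page size, transfer size, and energy split below are assumptions for illustration, not figures from the slide; the point is the ratio, not the absolute numbers.

```python
# Energy per *useful* bit when a whole DRAM page is activated but only one
# cache line of it is consumed, vs. an architecture that touches only what
# it needs and moves it over wide, short-reach (3D) I/O.
PAGE_BITS  = 8 * 1024 * 8    # 8 KB page opened per activation (assumed)
LINE_BITS  = 64 * 8          # 64 B cache line actually used   (assumed)
E_ARRAY_PJ = 2.0             # pJ per bit touched in the array (assumed)
E_IO_PJ    = 10.0            # pJ per bit crossing the interface (assumed)

# Traditional: pay array energy for the full page, I/O energy for the line.
trad = (PAGE_BITS * E_ARRAY_PJ + LINE_BITS * E_IO_PJ) / LINE_BITS
# Revised: activate and move only the line; assume 3D I/O is ~5x cheaper per bit.
new = (LINE_BITS * E_ARRAY_PJ + LINE_BITS * (E_IO_PJ / 5)) / LINE_BITS

print(f"traditional: ~{trad:.0f} pJ per useful bit")
print(f"revised:     ~{new:.0f} pJ per useful bit")
```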

  20. Data Locality. Rough communication energy costs: core-to-core on the chip ~10 pJ per Byte; chip-to-chip ~100 pJ per Byte; chip-to-memory ~150 pJ to ~1.5 nJ per Byte. Data movement is expensive, so keep it local, in order of preference: (1) core to core, (2) chip to chip, (3) memory.
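
A toy calculation showing how the locality hierarchy drives the power budget for the ~10 TF node at 0.1 B/FLOP of traffic. The two traffic mixes are my assumptions; the per-byte energies are the slide's numbers.

```python
# Power spent moving data on a 10 TF node at 0.1 Byte/FLOP, for two traffic
# mixes across the locality hierarchy.
FLOPS, BYTES_PER_FLOP = 10e12, 0.1
E_PJ = {"core-to-core": 10, "chip-to-chip": 100, "memory": 1500}   # pJ/Byte

def comm_power(mix):
    bytes_per_s = FLOPS * BYTES_PER_FLOP
    return sum(frac * bytes_per_s * E_PJ[lvl] * 1e-12 for lvl, frac in mix.items())

local  = {"core-to-core": 0.8, "chip-to-chip": 0.15, "memory": 0.05}
remote = {"core-to-core": 0.2, "chip-to-chip": 0.3,  "memory": 0.5}

print(f"mostly local traffic:  {comm_power(local):.0f} W")    # ~100 W
print(f"mostly remote traffic: {comm_power(remote):.0f} W")   # ~780 W
```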

  21. Impact of Exploding Parallelism. Chart: millions of cores per EFLOP, from 65nm to 5nm, at 1x, 0.7x, and 0.5x Vdd. The curves are almost flat across technology nodes because Vdd sits close to Vt, and scaling Vdd to 0.5x roughly quadruples the number of cores required (from ~100M toward ~400M): more parallelism, more communication and its energy, more hardware, and more unreliability. Two consequences: (1) strike a balance between communication and computation; (2) resiliency against gradual, intermittent, and permanent faults.
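
A rough check of the core counts. The per-core peak at nominal Vdd comes from the earlier slides (9.2 GF at 4.61 GHz, 8nm); the frequency-vs-Vdd relation reuses the alpha-power assumption from the voltage-scaling sketch above, so the slope differs somewhat from the slide's own model.

```python
# Cores needed for 1 EFLOP at the 8nm node, at nominal and reduced Vdd.
EXA_GF, CORE_GF_NOMINAL, VT, ALPHA = 1e9, 9.2, 0.3, 1.3

def cores_needed(vdd):
    f_rel = ((vdd - VT) ** ALPHA / vdd) / ((1 - VT) ** ALPHA)   # frequency vs. Vdd
    return EXA_GF / (CORE_GF_NOMINAL * f_rel)

for v in (1.0, 0.7, 0.5):
    print(f"Vdd {v:.1f}x -> ~{cores_needed(v)/1e6:.0f} million cores")
# -> ~109M, ~157M, ~277M: the same order as the ~100M-450M range plotted on
#    the slide (the slide's steeper ~4X growth reflects its own Vdd model).
```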

  22. Road to Unreliability? What the step from Peta to Exa brings, and the reliability issues that follow:
     • 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors.
     • Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and to dynamic temperature variations; exacerbated intermittent faults (soft errors).
     • Deeply scaled technologies: aging-related faults; lack of burn-in?; dramatically increased variability.
     Resiliency will be the cornerstone.

  23. Resiliency. Fault classes and examples: permanent faults (stuck-at 0 & 1), gradual faults (variability, temperature), intermittent faults (soft errors, voltage droops), aging faults (degradation). Faults cause errors in both data and control: datapath errors are detected by parity/ECC, silent data corruption needs hardware hooks, and control errors mean control is lost (blue screen). Resiliency should come at minimal overhead: error detection, fault isolation, fault confinement, reconfiguration, and recovery & adaptation, spanning applications, system software, the programming system, microcode and platform, microarchitecture, and circuit & design.
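
As a concrete, greatly simplified illustration of the detect / confine / recover flow, here is a minimal checkpoint-and-retry sketch. The fault model and the redundant-execution check are invented for illustration and are not from the talk.

```python
# Minimal detect-and-recover loop: run each step twice (a software stand-in
# for parity/ECC or residue checks), commit only on agreement, and roll back
# to the last checkpoint and retry on mismatch.
import random

def step(state, p_soft_error=0.05):
    """The computation, with an injected intermittent (soft-error) fault."""
    new = [x + 1 for x in state]
    if random.random() < p_soft_error:
        new[random.randrange(len(new))] ^= 0x4      # flip a bit
    return new

def resilient_step(checkpoint, max_retries=5):
    for _ in range(max_retries):
        a, b = step(checkpoint), step(checkpoint)   # redundant execution
        if a == b:                                  # error detection
            return a                                # commit past the checkpoint
    raise RuntimeError("unrecoverable: escalate (reconfigure, recover & adapt)")

state = [0] * 8
for _ in range(100):
    state = resilient_step(state)                   # state doubles as checkpoint
print("completed 100 steps:", state)
```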

  24. Needs a Paradigm Shift. Past and present priorities: single-thread performance, frequency, programming productivity, legacy and compatibility, architecture features for productivity, with constraints of (1) cost and (2) reasonable power/energy. Future priorities: throughput performance, parallelism, power/energy, architecture features for energy, simplicity, with constraints of (1) programming productivity and (2) cost. Evaluate each old architecture feature against the new priorities.

  25. Summary. Von Neumann computing and CMOS technology remain the basis (nothing else is in sight). Voltage scaling reduces power and energy, but:
     • It explodes parallelism
     • The cost of communication vs. computation becomes a critical balance
     • Resiliency is needed to combat the side-effects and unreliability
     A programming system for extreme parallelism, and system software to harmonize all of the above, complete the picture.
