
Efficiency and Programmability: Enablers for ExaScale. Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford


  1. Efficiency and Programmability: Enablers for ExaScale. Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford

  2. Scientific Discovery and Business Analytics Driving an Insatiable Demand for More Computing Performance

  3. HPC and Analytics: Memory & Compute, Communication, Storage

  4. The End of Historic Scaling. Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

  5. “Moore’s Law gives us more transistors… Dennard scaling made them useful.” – Bob Colwell, DAC 2013, June 4, 2013

  6. TITAN: 18,688 NVIDIA Tesla K20X GPUs; 27 petaflops peak (90% of performance from GPUs); 17.59 petaflops sustained on Linpack; numerous real science applications; 2.14 GF/W – most efficient accelerator


  8. You Are Here. 2013: 20 PF, ~18,000 GPUs, 10 MW, 2 GFLOPS/W, ~10^7 threads. 2020: 1,000 PF (50x), 72,000 HCNs (4x), 20 MW (2x), 50 GFLOPS/W (25x), ~10^10 threads (1000x)
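
  A quick sanity check, using only the numbers above: 1,000 PF in a 20 MW envelope is 10^18 FLOPS / (2 × 10^7 W) = 50 GFLOPS/W, i.e. an energy budget of 1 / (50 × 10^9) J ≈ 20 pJ per flop for the whole machine, arithmetic, memory, and communication included. Today's 20 PF at 10 MW is the same calculation at 2 GFLOPS/W, or 500 pJ per flop.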

  9. Efficiency Gap. [Chart: GFLOPS/W vs. year, 2013–2020; the “Needed” curve rises to 50 GFLOPS/W by 2020, far above what process scaling alone delivers.]

  10. [Chart: GFLOPS/W (log scale) vs. year, 2013–2020; process scaling contributes 2.2x and circuits a further 3x toward the “Needed” line.]

  11. Simpler Cores = Energy Efficiency Source: Azizi [PhD 2010]

  12. CPU (Westmere, 32 nm): 1690 pJ/flop, optimized for latency, caches. GPU (Kepler, 28 nm): 140 pJ/flop, optimized for throughput, explicit management of on-chip memory.
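
  Taken at face value, 1690 pJ/flop versus 140 pJ/flop is roughly a 12x energy-per-flop advantage for the throughput-optimized design, even before adjusting for the different process nodes (32 nm vs. 28 nm).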

  13. [Chart: GFLOPS/W (log scale) vs. year, 2013–2020; process 2.2x, circuits 3x, and architecture 4x stack up to reach the “Needed” line.]
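
  These factors multiply: 2.2 (process) × 3 (circuits) × 4 (architecture) ≈ 26x, which is how the roughly 25x efficiency gain needed by 2020 (2 GFLOPS/W to 50 GFLOPS/W) is meant to be covered.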

  14. Programmers, Tools, and Architecture Need to Play Their Positions: Programmer | Tools | Architecture

  15. Programmers, Tools, and Architecture Need to Play Their Positions. Programmer: algorithm, all of the parallelism, abstract locality. Tools: mapping, combinatorial optimization, selection of mechanisms. Architecture: fast mechanisms, exposed costs.

  16. [Diagram: chip floorplan with a large array of SMs grouped around NOC routers, a row of LOCs (latency-optimized cores), an L2 and crossbar (XBAR) to on-chip memory banks, lanes, and DRAM I/O and network (NW) I/O around the perimeter.]

  17. An Enabling HPC Network: <1 µs latency; scalable bandwidth; small messages (50% @ 32B); global adaptive routing; PGAS; collectives & atomics; MPI offload

  18. An Open HPC Network Ecosystem. Common interfaces: software–NIC API, NIC–router channel. Spanning processor/NIC vendors, system vendors, and networking vendors

  19. Power: 25x efficiency with 2.2x from process. Programming: parallelism, heterogeneity, hierarchy. Programmer | Tools | Architecture

  20. “Super” Computing: From Super Computers to Super Phones

  21. Backup

  22. In The Past, Demand Was Fueled by Moore’s Law Source: Moore, Electronics 38(8) April 19, 1965

  23. ILP Was Mined Out. [Chart: perf (ps/Inst) vs. a linear trend (ps/Inst), log scale, 1980–2020; the gap was 30:1 in 2001 and projects to 1,000:1 and then 30,000:1.] Source: Dally et al., “The Last Classical Computer”, ISAT Study, 2001

  24. Voltage Scaling Ended in 2005 Source: Moore, ISSCC Keynote, 2003

  25. Summary: Moore’s law is alive and well, but… instruction-level parallelism (ILP) was mined out in 2001; voltage scaling (Dennard scaling) ended in 2005; most power is spent on communication. What does this mean to you?

  26. The End of Historic Scaling. Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

  27. In the Future: all performance is from parallelism; machines are power limited (efficiency IS performance); machines are communication limited (locality IS performance)

  28. Two Major Challenges. Energy efficiency: 25x in 7 years (~2.2x from process). Programming: parallel (10^10 threads), hierarchical, heterogeneous.

  29. [Chart: GFLOPS/W (log scale) vs. year, 2013–2020; “Needed” vs. “Process” alone.]

  30. How Is Power Spent in a CPU? [Two pie charts: an in-order embedded CPU (Dally [2008]) and an out-of-order high-performance CPU, the Alpha 21264 (Natarajan [2003]). Categories include instruction supply/fetch, data supply, register file, rename, issue, clock + control logic or clock + pins, and ALU; in both designs only a few percent (ALU roughly 4–6%) goes to arithmetic, with the bulk spent on instruction/data supply, clock, and control.]

  31. Energy Shopping List

      Processor technology              40 nm          10 nm
        Vdd (nominal)                   0.9 V          0.7 V
        DFMA energy                     50 pJ          7.6 pJ
        64b 8 KB SRAM read              14 pJ          2.1 pJ
        Wire energy (256 bits, 10 mm)   310 pJ         174 pJ
      (FP op lower bound = 4 pJ)

      Memory technology                 45 nm          16 nm
        DRAM interface pin bandwidth    4 Gbps         50 Gbps
        DRAM interface energy           20–30 pJ/bit   2 pJ/bit
        DRAM access energy              8–15 pJ/bit    2.5 pJ/bit

      Sources: Keckler [Micro 2011], Vogelsang [Micro 2010]
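
  One ratio worth pulling out of this table: at 10 nm a double-precision fused multiply-add costs 7.6 pJ, while moving 256 bits just 10 mm across the chip costs 174 pJ, about 23x the arithmetic itself. That is the quantitative basis for the recurring claim that communication, not computation, dominates power.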

  32. [Chart: GFLOPS/W (log scale) vs. year, 2013–2020; process scaling contributes 2.2x and circuits a further 3x toward the “Needed” line.]

  33. Throughput-Optimized Core (TOC) vs. Latency-Optimized Core (LOC). [Block diagrams: the LOC surrounds its register file and ALUs with branch prediction, PC select, register rename, an instruction window, and a reorder buffer; the TOC needs little more than PCs, I$, a register file, and ALUs.]

  34. Streaming Multiprocessor (SM): the main register file (32 banks) is 15% of SM energy; warp scheduler; SIMT lanes (ALU, SFU, MEM, TEX); 32 KB shared memory

  35. Hierarchical Register File. [Charts: percent of all values produced, broken down by read count (read 0, 1, 2, or >2 times) and by lifetime (1, 2, 3, or >3); most values are read only once or twice and have short lifetimes.]

  36. Register File Caching (RFC). [Diagram: MRF, 4x128-bit banks (1R1W) → operand buffering → operand routing → RFC, 4x32-bit banks (3R1W) → execution units (ALU, SFU, MEM, TEX).]

  37. Energy Savings from RF Hierarchy: 54% energy reduction. Source: Gebhart et al. (MICRO 2011)

  38. Two Major Challenges. Energy efficiency: 25x in 7 years (~2.2x from process). Programming: parallel (10^10 threads), hierarchical, heterogeneous.

  39. Skills on LinkedIn (source: linkedin.com/skills, as of Jun 11, 2013)

      Mainstream programming              Size (approx)   Growth (rel)
        C++                               1,000,000       -8%
        Javascript                        1,000,000       -1%
        Python                              429,000        7%
        Fortran                              90,000       -11%
      Parallel and assembly programming
        MPI                                  21,000       -3%
        x86 Assembly                         17,000       -8%
        CUDA                                 14,000        9%
        Parallel programming                 13,000        3%
        OpenMP                                8,000        2%
        TBB                                     389       10%
        6502 Assembly                           256      -13%

  40. Parallel Programming is Easy

      forall molecule in set:                        # 1E6 molecules
          forall neighbor in molecule.neighbors:     # 1E2 neighbors each
              forall force in forces:                # several forces
                  molecule.force += force(molecule, neighbor)   # reduction
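
  As a concrete, hypothetical rendering of that forall nest, the CUDA sketch below maps one thread to each molecule and keeps the neighbor and force loops sequential inside the thread. The Molecule struct, MAX_NEIGHBORS, NUM_FORCES, and force_term() are illustrative placeholders, not code from the talk.

      #include <cuda_runtime.h>

      #define MAX_NEIGHBORS 128   // ~1e2 neighbors per molecule
      #define NUM_FORCES    4     // "several forces"

      struct Molecule {
          float3 pos;
          float3 force;
          int    neighbor_count;
          int    neighbors[MAX_NEIGHBORS];   // indices into the molecule array
      };

      // Placeholder pairwise force term so the sketch is self-contained;
      // a real code would evaluate its actual force models here.
      __device__ float3 force_term(int which, const Molecule& a, const Molecule& b)
      {
          float3 d  = make_float3(b.pos.x - a.pos.x, b.pos.y - a.pos.y, b.pos.z - a.pos.z);
          float  r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-6f;
          float  s  = (float)(which + 1) / r2;
          return make_float3(s * d.x, s * d.y, s * d.z);
      }

      // One thread per molecule (~1e6 threads); each thread accumulates into a
      // local register and writes its molecule's force once at the end.
      __global__ void accumulate_forces(Molecule* set, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= n) return;

          float3 acc = make_float3(0.f, 0.f, 0.f);
          for (int k = 0; k < set[i].neighbor_count; ++k) {       // ~1e2 neighbors
              const Molecule& nb = set[set[i].neighbors[k]];
              for (int f = 0; f < NUM_FORCES; ++f) {              // several forces
                  float3 t = force_term(f, set[i], nb);
                  acc.x += t.x; acc.y += t.y; acc.z += t.z;       // reduction
              }
          }
          set[i].force = acc;
      }

      // Launch example: accumulate_forces<<<(n + 255) / 256, 256>>>(d_set, n);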

  41. We Can Make It Hard

      pid = fork();                  // explicitly managing threads
      lock(struct.lock);             // complicated, error-prone synchronization
      // manipulate struct
      unlock(struct.lock);
      code = send(pid, tag, &msg);   // partition across nodes

  42. Programmers, Tools, and Architecture Need to Play Their Positions: Programmer | Tools | Architecture

  43. Programmers, Tools, and Architecture Need to Play Their Positions. Programmer: algorithm, all of the parallelism, abstract locality. Tools: mapping, combinatorial optimization, selection of mechanisms. Architecture: fast mechanisms, exposed costs.

  44. OpenACC: Easy and Portable

      Serial code (SAXPY):

         do i = 1, 20*128
            do j = 1, 5000000
               fa(i) = a * fa(i) + fb(i)
            end do
         end do

      OpenACC:

         !$acc parallel loop
         do i = 1, 20*128
            !dir$ unroll 1000
            do j = 1, 5000000
               fa(i) = a * fa(i) + fb(i)
            end do
         end do
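
  For comparison, here is a rough sketch of the kind of CUDA C kernel that the !$acc parallel loop directive spares the programmer from writing by hand. This is not the talk's code, just an illustration assuming the same loop bounds (20*128 parallel outer iterations, 5,000,000 inner repetitions).

      #include <cuda_runtime.h>

      #define N    (20 * 128)    // outer (parallel) loop bound
      #define REPS 5000000       // inner (sequential) work loop

      // One thread per outer iteration i; each thread repeats the SAXPY-style
      // update REPS times, matching the Fortran inner loop.
      __global__ void saxpy_repeat(float a, float* fa, const float* fb)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i >= N) return;

          float x = fa[i];
          float y = fb[i];
          for (int j = 0; j < REPS; ++j)
              x = a * x + y;
          fa[i] = x;
      }

      // Launch example: saxpy_repeat<<<(N + 127) / 128, 128>>>(a, d_fa, d_fb);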

  45. Conclusion

  46. The End of Historic Scaling. Source: C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

  47. Parallelism is the source of all performance; power limits all computing; communication dominates power

  48. Two Challenges. Power: 25x efficiency with 2.2x from process. Programming: parallelism, heterogeneity, hierarchy.
