A Scalable Runtime for the ECOSCALE Heterogeneous Exascale Hardware Platform. Paul Harvey, Konstantin Bakanov, Ivor Spence, Dimitrios S. Nikolopoulos. Looking to discuss and share ideas: no implementation, no results, just design!


  1. A Scalable Runtime for the ECOSCALE Heterogeneous Exascale Hardware Platform • Paul Harvey, Konstantin Bakanov, Ivor Spence, Dimitrios S. Nikolopoulos

  2. Looking To Discuss and Share Ideas • No implementation • No results • Just design! • Intro & Context • Hardware • Language • Runtime Architecture

  3. Exascale: Money • America: ~$1500 million • Europe: €700 million • China: 5000 million CNY • Japan: 110 billion JPY [bar chart: Exascale Spending (£), in millions, for America, China, Europe, Japan] Sources: http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/3-BDEC2015-ishikawa.pdf http://www.hpcwire.com/2016/02/12/obama-budget-reveals-new-elements-exascale-program/ http://www.scientific-computing.com/news/news_story.php?news_id=2732 http://www.exascale.org/mediawiki/images/b/b8/Talk25-zjin.pdf

  4. Exascale: Brains

  5. Exascale: Problems http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf

  6. Exascale: Problems http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf

  7. Ecoscale - ecoscale.eu • Funded till October 2018 • ~£4,000,000 • Building new Hardware • Exascale prototype with FPGA focus • Queen’s University working on Software

  8. FPGA • FFT • BitCoin • Matrix Mul

  9. FPGA: Floating point Intensive Calculation

     | Platform              | Time (ns) | Power (W) | Energy/Step (nJ) | Obtained By   |
     |-----------------------|-----------|-----------|------------------|---------------|
     | HD 4400 (GPU)         | 3.13      | 15        | 46.9             | Measurement   |
     | GTX 960 (GPU)         | 0.163     | 120       | 19.56            | Measurement   |
     | Quadro K4200 (GPU)    | 0.204     | 105       | 21.42            | Measurement   |
     | GTX Titan (GPU)       | 0.0389    | 375       | 14.61            | Extrapolation |
     | Virtex 7 (FPGA)       | 0.315     | 24.4      | 7.69             | Measurement   |

     • Compute-intensive, not using global memory • GPU memory bandwidth is >> FPGA memory bandwidth • GPU DDR4 ~8x more than FPGA DDR3
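The Energy/Step column in the table above appears to be the product of the Time and W columns (1 ns × 1 W = 1 nJ). The following is a minimal sketch of that arithmetic, not the authors' measurement code; the function names `energy_per_step_nj` and `energy_ratio` are hypothetical. It reproduces the measured rows to within rounding; the extrapolated GTX Titan row comes out slightly lower than the printed 14.61, presumably because the printed time is rounded.

```c
#include <assert.h>

/* Energy per step in nanojoules: time per step (ns) times power draw (W).
 * Since 1 ns * 1 W = 1 nJ, no unit conversion is needed. */
double energy_per_step_nj(double time_ns, double power_w)
{
    assert(time_ns >= 0.0 && power_w >= 0.0);
    return time_ns * power_w;
}

/* How many times more energy per step platform A uses than platform B.
 * Each platform is given as (time in ns, power in W). */
double energy_ratio(double a_time_ns, double a_power_w,
                    double b_time_ns, double b_power_w)
{
    double b = energy_per_step_nj(b_time_ns, b_power_w);
    assert(b > 0.0);
    return energy_per_step_nj(a_time_ns, a_power_w) / b;
}
```

For example, `energy_ratio(0.163, 120.0, 0.315, 24.4)` compares the GTX 960 row against the Virtex 7 row: the FPGA is slower per step but draws far less power, which is the point the table makes.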

  11. Architecture

  12. Simplified Architecture [diagram labels: Compute Node, Worker Node, Unimem, CPU, FPGA, RAM]

  13. Unimem • RDMA • PGAS Address Space • One or more single address spaces
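To make the PGAS idea concrete, here is a minimal sketch of one way a partitioned global address could encode which node owns the memory. The bit layout (16-bit node id, 48-bit local offset) and all names are assumptions for illustration only; the slides do not describe the actual Unimem address format.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED layout, not Unimem's: top 16 bits = owning node,
 * low 48 bits = offset inside that node's local memory. */
#define OFFSET_BITS 48u
#define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

typedef uint64_t global_addr_t;

/* Build a global address from an owning node and a local offset. */
global_addr_t ga_make(uint16_t node, uint64_t offset)
{
    assert(offset <= OFFSET_MASK);
    return ((uint64_t)node << OFFSET_BITS) | offset;
}

/* Node that owns the addressed memory. */
uint16_t ga_node(global_addr_t ga)
{
    return (uint16_t)(ga >> OFFSET_BITS);
}

/* Offset within the owning node's local memory. */
uint64_t ga_offset(global_addr_t ga)
{
    return ga & OFFSET_MASK;
}

/* Nonzero if the address can be served from local memory; otherwise a
 * runtime would have to issue a remote (e.g. RDMA) access. */
int ga_is_local(global_addr_t ga, uint16_t self)
{
    return ga_node(ga) == self;
}
```

The design point this illustrates is that every node sees one address space, while the address itself says whether an access is local or remote.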

  14. OpenCL

  15. Current Abstractions [diagram: Host runs the Application with three kernels and three Data blocks; Devices are CPU, FPGA and GPU, each with its own MEMORY]

  16. Current Abstractions [diagram build: one kernel shown at each of CPU, FPGA and GPU; Data blocks shown between Host and Device]

  17. Current Abstractions [diagram build: one kernel at each of CPU, FPGA and GPU; Data blocks shown at the device MEMORY]

  18. OpenCL • Simple model • Widely used outside HPC • Standardised • Lots of activity in industry and academia • Non-proprietary

  19. Extensions • (1) New abstractions of multiple hardware devices: enables the scheduler to dynamically go after performance or power • (2) New fundamental unit of scheduling: (a) better scaling across multiple compute devices; (b) enables kernels to run where a single device has insufficient resources

  20. Worker Abstraction • No change for Programmer • Scheduler control for power vs. performance [diagram: the Host/Device picture from "Current Abstractions", with kernel + Data sent to a Worker; labels: Worker, Software Device, Compute Node, Worker Node, Unimem, CPU, FPGA, RAM]

  21. Worker Abstraction • No change for Programmer • Scheduler control for power vs. performance [diagram build: same labels as slide 20]

  22. Worker Abstraction • No change for Programmer • Scheduler control for power vs. performance [diagram build: kernel + Data sent to a Worker; labels: Worker, Software Device, Library; three kernels shown]

  23. Abstraction Configurations • Worker • Logical • Aggregated FPGA • Aggregated CPU [diagram: devices numbered 1 to 8 grouped differently under each configuration]

  24. Scheduling: CPU vs. FPGA • Machine learning based on: runtime performance; kernel input data size; CPU/FPGA power consumption; data locality; number of global memory accesses; number of branches and loops • Is a cost model enough? • How do we determine a power budget? 100th of current GPU? • How do we determine a performance budget? Current best GPU?
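The question "Is a cost model enough?" can be made concrete with a toy example. The sketch below is not the proposed machine-learning scheduler; it is the simplest possible cost model, assuming that per-device time and power estimates are already available from somewhere. All names, and the weighting parameter `alpha`, are hypothetical, and adding nanoseconds to nanojoules in one score is an arbitrary choice made only to show the performance-versus-power trade-off.

```c
#include <assert.h>

typedef enum { DEV_CPU, DEV_FPGA } device_t;

/* Hypothetical per-device prediction for one kernel launch. */
typedef struct {
    double est_time_ns;  /* predicted execution time */
    double power_w;      /* predicted average power draw */
} device_estimate_t;

/* Weighted score: alpha = 1 optimises time only, alpha = 0 energy only. */
double weighted_cost(device_estimate_t e, double alpha)
{
    assert(alpha >= 0.0 && alpha <= 1.0);
    double energy_nj = e.est_time_ns * e.power_w;
    return alpha * e.est_time_ns + (1.0 - alpha) * energy_nj;
}

/* Pick the device with the lower score; ties go to the CPU. */
device_t choose_device(device_estimate_t cpu, device_estimate_t fpga,
                       double alpha)
{
    return weighted_cost(fpga, alpha) < weighted_cost(cpu, alpha)
               ? DEV_FPGA
               : DEV_CPU;
}
```

With a CPU estimate of 100 ns at 10 W and an FPGA estimate of 200 ns at 2 W, `alpha = 1.0` selects the CPU (faster) and `alpha = 0.0` selects the FPGA (less energy). A learned model would replace the fixed estimates with predictions from the features listed on the slide.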

  25. Controller • Controller: partition computation and data • Controller: schedule kernel across workers • Worker: schedule across local devices • Worker: report results and/or errors to controller • Core 1 reserved for OS [diagram: RUNTIME running on cores 1 to 4]

  26. Language – Data Partitioning • d_m1 = clCreateBuffer(context, CL_MEM_READ_WRITE, matrix_dim * matrix_dim * sizeof(double), NULL, ecoscale_partition(d_m1, REPLICATE, 0), &errcode);
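The slide shows only the `REPLICATE` mode and does not say how the runtime interprets it. As a sketch under stated assumptions, the code below shows what a runtime might compute per worker for a replicated buffer versus a block-split buffer. `PART_BLOCK`, `part_range_t` and `partition_range` are hypothetical names introduced here; they are not part of OpenCL or of the ECOSCALE design as presented.

```c
#include <assert.h>
#include <stddef.h>

/* PART_REPLICATE mirrors the slide's REPLICATE; PART_BLOCK is a
 * hypothetical second mode added for contrast. */
typedef enum { PART_REPLICATE, PART_BLOCK } part_mode_t;

/* Element range of the buffer that one worker holds. */
typedef struct {
    size_t offset;  /* first element */
    size_t length;  /* number of elements */
} part_range_t;

part_range_t partition_range(part_mode_t mode, size_t n_elems,
                             size_t worker, size_t n_workers)
{
    part_range_t r;
    assert(n_workers > 0 && worker < n_workers);

    if (mode == PART_REPLICATE) {
        /* Every worker gets a full copy. */
        r.offset = 0;
        r.length = n_elems;
        return r;
    }

    /* Block split: the first (n_elems % n_workers) workers get one
     * extra element so the whole buffer is covered exactly once. */
    size_t base = n_elems / n_workers;
    size_t extra = n_elems % n_workers;
    r.offset = worker * base + (worker < extra ? worker : extra);
    r.length = base + (worker < extra ? 1 : 0);
    return r;
}
```

Replication suits read-mostly inputs that every worker needs in full, while a block split suits data that is processed independently per element range.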

  27. Architecture [diagram: software stack on one Worker Node inside a Compute Node: Application; Ecoscale runtime; OCL Runtime and MPI/GASNet; FPGA Driver and Unimem Driver; OS; cores 1 to 4; hardware: Unimem, CPU, FPGA, RAM]

  28. [diagram: four Worker Nodes in a Compute Node, each running the stack from slide 27 (Application, Ecoscale runtime, OCL Runtime, MPI/GASNet, FPGA Driver, Unimem Driver, OS, cores 1 to 4) over Unimem, CPU, FPGA and RAM]

  29. Controller [diagram: the four-node stack from slide 28, with one node labelled Controller]

  30. Controller and Slaves [diagram: the four-node stack from slide 28, with one node labelled Controller and three labelled Slave]

  31. Resilience • Leaders & slaves • Heartbeat messages • Checkpointing
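The slides name heartbeats and leadership election but give no protocol. The following is a minimal sketch under simple assumptions: each node's last heartbeat time is known, a node counts as failed once its heartbeat is older than a timeout, and the lowest-numbered live node becomes controller. The election rule and the function names are hypothetical, chosen only because they are easy to reason about.

```c
#include <assert.h>
#include <stdint.h>

/* A node is considered alive if its last heartbeat arrived within
 * timeout_ms of the current time. */
int node_is_alive(uint64_t last_heartbeat_ms, uint64_t now_ms,
                  uint64_t timeout_ms)
{
    assert(now_ms >= last_heartbeat_ms);
    return now_ms - last_heartbeat_ms <= timeout_ms;
}

/* ASSUMED election rule: the lowest-numbered live node becomes the
 * controller. Returns the node index, or -1 if no node is alive. */
int elect_controller(const uint64_t *last_heartbeat_ms, int n_nodes,
                     uint64_t now_ms, uint64_t timeout_ms)
{
    for (int i = 0; i < n_nodes; i++) {
        if (node_is_alive(last_heartbeat_ms[i], now_ms, timeout_ms)) {
            return i;
        }
    }
    return -1;
}
```

If node 0 was the controller and stops sending heartbeats, every surviving node that applies the same rule to the same view picks the same replacement. A real system would also need to handle nodes that disagree about who is alive, which this sketch does not attempt.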

  32. [diagram: the four-node stack from slide 28, with no role labels]

  33. Leadership Election [diagram: the four-node stack from slide 28]

  34. Controller, Slave (Backup) and Slaves [diagram: the four-node stack from slide 28, with nodes labelled Controller, Slave (Backup), Slave, Slave]

  35. Accounting Log [diagram: Data A, Data B and Data C recorded in an Accounting Log; the four-node stack from slide 28, with nodes labelled Controller, Slave (Backup), Slave, Slave]
