Towards 1000x with Heterogeneous, Programmable Hardware Datacenter

  1. Towards 1000x with Heterogeneous, Programmable Hardware Datacenter ● Name: Anton Burtsev, UC Irvine ● Summary: 1 ● Related work:

  2. What will hardware look like in 10-20 years? ● Massively heterogeneous ○ Not just many-cores ■ GPUs, Xeon Phi, Tilera TILE, PowerEN ○ But also ■ Fine-grained ASIC hardware accelerators ■ Programmable hardware (FPGA)

  3. Ubiquitous, fine-grained, heterogeneous hardware-acceleration ● Execution will no longer stay on 1 CPU

  4. Ubiquitous, fine-grained, heterogeneous hardware-acceleration ● A chain of hardware accelerators (ASIC/FPGA) ○ On-chip and over PCIe ○ Co-located with storage and network devices ● A single machine is a distributed system ○ Yet you have to use it efficiently

  5. Even your memory is distributed ● Your memory is not local either ● We will see large memories ○ 6 TB is possible today (Dell R930, 96 x 64 GB DIMMs) ○ 10x higher density in the near future [Meena et al.] ■ ~100 TB of NVM on the memory bus ■ 20-80 ns access latency

  6. Big/New Ideas of 1000x ● Your biggest problem is ... ○ Latency and parallelism ■ Send a request to another core/accelerator ● 355 ns on a cache-coherent Intel HARP [Choi, DAC’16] ■ Have to find something to do while waiting… ○ Parallelism ■ Expressing and running the graph of the computation on a set of execution units
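The 355 ns round trip on slide 6 can be put in perspective with a back-of-the-envelope calculation. The sketch below is only an illustration: the 3 GHz clock and the 50 ns of useful work per request are hypothetical values that do not come from the slides. It estimates how many cycles a core would idle on a blocking request and, via Little's law (requests in flight = issue rate x latency), how many requests must be outstanding to cover the round trip.

```python
import math

ROUND_TRIP_NS = 355.0        # from the slide: cache-coherent Intel HARP
CLOCK_GHZ = 3.0              # assumption: hypothetical core clock
WORK_PER_REQUEST_NS = 50.0   # assumption: hypothetical useful work per request


def stalled_cycles(round_trip_ns: float, clock_ghz: float) -> float:
    """Cycles a core idles if it simply blocks for one round trip."""
    return round_trip_ns * clock_ghz


def requests_in_flight(round_trip_ns: float, work_ns: float) -> int:
    """Little's law: issuing one request every `work_ns` with a latency of
    `round_trip_ns` leaves round_trip_ns / work_ns requests outstanding."""
    return math.ceil(round_trip_ns / work_ns)


if __name__ == "__main__":
    print(stalled_cycles(ROUND_TRIP_NS, CLOCK_GHZ))
    print(requests_in_flight(ROUND_TRIP_NS, WORK_PER_REQUEST_NS))
```

Under these assumed numbers a blocking request costs about 1065 cycles, and roughly 8 requests need to be outstanding at any time, which is why "finding something to do" turns into a parallelism problem.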

  7. Big/New Ideas of 1000x ● You have more problems... ○ Reliability ■ A single bug can destroy your in-memory dataset ● 100 TB of non-volatile memory is cache-coherent ● Any FPGA unit or core can wipe it

  8. Indicated R&D for 1000x ● OS/VMM support for heterogeneous hardware ○ Novel execution runtime ■ Spatial scheduling, preemption, load-balancing ● Sharing across multiple users ● On one host and in a virtual datacenter ■ Unified OS platform for GPU, multi-cores, FPGA ● Proprietary stacks and device drivers should go… ● Direct (low-latency) access to hardware

  9. Indicated R&D for 1000x ● Language support ○ Programmable hardware ■ C/C++/Rust to FPGA ○ Parallelism ■ Async & delegate [Grappa, USENIX’16] ● Works well for analytical workloads ■ Streaming languages ■ Your favorite model here ● Well, MPI will work too
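The "async & delegate" model on slide 9 can be sketched as follows. This is a minimal illustration in Python's `asyncio`, not Grappa's actual API: the `Owner` class, `delegate` method, `increment` helper, and `count_words` function are all hypothetical names. Each owner holds a private slice of the data; callers ship an operation to the owner and get a future back instead of blocking, so many delegations can be in flight at once.

```python
import asyncio


class Owner:
    """Hypothetical execution unit that exclusively owns a slice of the data."""

    def __init__(self) -> None:
        self.data = {}
        self.inbox = asyncio.Queue()

    async def run(self) -> None:
        # Apply delegated operations one at a time, so the data needs no locks.
        while True:
            op, reply = await self.inbox.get()
            try:
                reply.set_result(op(self.data))
            except Exception as exc:
                reply.set_exception(exc)

    def delegate(self, op) -> asyncio.Future:
        """Ship `op` to the owner and return a future instead of blocking."""
        reply = asyncio.get_running_loop().create_future()
        self.inbox.put_nowait((op, reply))
        return reply


def increment(key):
    """Build an operation that bumps a counter inside the owner's data."""
    def op(data):
        data[key] = data.get(key, 0) + 1
        return data[key]
    return op


async def count_words(words, n_owners=4):
    owners = [Owner() for _ in range(n_owners)]
    workers = [asyncio.create_task(o.run()) for o in owners]
    # Issue every delegation without waiting, then collect all replies at once.
    pending = [owners[hash(w) % n_owners].delegate(increment(w)) for w in words]
    await asyncio.gather(*pending)
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    merged = {}
    for o in owners:
        merged.update(o.data)  # each key lives in exactly one owner
    return merged


if __name__ == "__main__":
    print(asyncio.run(count_words(["a", "b", "a", "c", "a"])))
```

The design choice illustrated here is that the operation moves to the data rather than the data moving to the caller, and the caller overlaps the waiting time of many such requests.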

  10. Questions for the Software Institute ● Analyze potential performance gains for HEP workloads ○ Assume a clean-slate, ideal software stack ○ Only hardware limitations ○ Can we get to 1000x? ○ What are the bottlenecks?
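One standard way to frame "Can we get to 1000x?" is Amdahl's law. The slides do not mention it; it is used here only as an illustrative sketch, and the fractions in the usage lines are hypothetical, not measurements of any HEP workload.

```python
def amdahl_speedup(accelerated_fraction: float, unit_speedup: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the original
    runtime is sped up by a factor of `unit_speedup`."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / unit_speedup)


def max_serial_fraction(target_speedup: float) -> float:
    """Largest non-accelerated fraction that still permits `target_speedup`,
    even if the accelerated part took no time at all."""
    return 1.0 / target_speedup


if __name__ == "__main__":
    # Hypothetical workload: 99% of the runtime is offloaded to an accelerator.
    print(amdahl_speedup(0.99, 1e9))
    print(max_serial_fraction(1000.0))
```

With 99% of the runtime accelerated, the overall speedup stays below 100x no matter how fast the accelerator is; a 1000x target requires that at most 0.1% of the original runtime remains non-accelerated, which makes the residual software path a likely bottleneck to examine.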

  11. Questions for the Software Institute ● Encouraging example: ○ D.E. Shaw Anton/Anton 2 molecular dynamics simulation machine ■ Custom ASIC ■ 1000x speedup ● Same acceleration is possible for HEP
