
Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications Toshihiro Hanawa Information Technology Center, The University of Tokyo Taisuke Boku Center for Computational Sciences, University of Tsukuba


1. Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications
Toshihiro Hanawa, Information Technology Center, The University of Tokyo
Taisuke Boku, Center for Computational Sciences, University of Tsukuba
In collaboration with Yuetsu Kodama, Mitsuhisa Sato, and Masayuki Umemura (CCS, Univ. of Tsukuba); Hitoshi Murai (RIKEN AICS); and Hideharu Amano (Keio Univ.)
GPU Technology Conference 2015, Mar. 19, 2015

2. Agenda
- Background
- HA-PACS / AC-CREST Project
- Introduction of HA-PACS / TCA
- Organization of TCA: Concept; PEACH2 Board designed for TCA
- Evaluation of Basic Performance
- Collective Communications: Implementation Examples; Performance Evaluation
- Application Examples: QUDA (QCD); FFTE (FFT)
- Introduction of XcalableACC: Code Examples; Evaluations
- Summary

3. Current Trend of HPC Using GPU Computing
Advantageous features:
- High peak performance / cost ratio
- High peak performance / power ratio
Examples of HPC systems:
- GPU clusters and MPPs in the TOP500 (Nov. 2014):
  - 2nd: Titan (NVIDIA K20X, 27 PF)
  - 6th: Piz Daint (NVIDIA K20X, 7.8 PF)
  - 10th: Cray CS-Storm (NVIDIA K40, 6.1 PF)
  - 15th: TSUBAME2.5 (NVIDIA K20X, 5.6 PF)
  - 48 systems use NVIDIA GPUs.
- GPU clusters in the Green500 (Nov. 2014) ("greenest" supercomputers ranked in the TOP500):
  - 3rd: TSUBAME-KFC (NVIDIA K20X, 4.4 GF/W)
  - 4th: Cray Storm1 (NVIDIA K40, 3.9 GF/W)
  - 7th: HA-PACS/TCA (NVIDIA K20X, 3.5 GF/W)
  - 8 systems in the Top 10 use NVIDIA GPUs.

4. Issues of GPU Computing
- Data I/O performance limitation. Example: K20X on PCIe Gen2 x16 peaks at 8 GB/s (I/O) versus 1.3 TFLOPS (computation); a rough estimate of this gap follows below. The communication bottleneck becomes significant in multi-GPU applications.
- Strong scaling on GPU clusters: it is important to shorten the turn-around time of production runs, and communication latency has a heavy impact.
- Ultra-low latency between GPUs is therefore important for next-generation HPC.
- Our target is to develop a direct communication system between GPUs in different nodes as a feasibility study for future accelerated computing ⇒ the "Tightly Coupled Accelerators (TCA)" architecture.
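To put rough numbers on the gap quoted above (a back-of-the-envelope estimate only, using the slide's 1.3 TFLOPS and 8 GB/s figures and assuming 8-byte double-precision operands):

\[
\frac{1.3\times10^{12}\ \mathrm{FLOP/s}}{(8\times10^{9}\ \mathrm{B/s})\,/\,(8\ \mathrm{B/operand})}
\approx 1.3\times10^{3}\ \mathrm{FLOP\ per\ operand\ moved\ over\ PCIe}
\]

In other words, a kernel must perform on the order of a thousand floating-point operations per word exchanged just to stay compute-bound, which is why inter-GPU communication dominates as the per-GPU working set shrinks under strong scaling.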

5. HA-PACS Project
- HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
- 8th generation of the PAX/PACS series of supercomputers at the University of Tsukuba
- FY2011-2013 project, operation until FY2016(?)
- Promotes computational science applications in key areas of CCS, Univ. of Tsukuba
- Target fields: QCD, astrophysics, QM/MM (quantum mechanics / molecular mechanics, bioscience)
HA-PACS is not only a "commodity GPU cluster" but also an experiment platform:
- HA-PACS base cluster: for developing GPU-accelerated code for the target fields and for production runs; in operation since Feb. 2012
- HA-PACS/TCA (TCA = Tightly Coupled Accelerators): for elementary research on direct communication technology for accelerated computing; our original communication chip, "PEACH2", is installed in each node; in operation since Nov. 2013

6. AC-CREST Project
- Project: "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" (AC-CREST)
- Supported by JST-CREST "Development of System Software Technologies for post-Peta Scale High Performance Computing"
- Objectives:
  - Realization of high-performance (direct) communication among accelerators
  - Development of system software supporting communication among accelerators
  - Development of a parallel language and compilers: higher productivity; highly optimized offload and communication
  - Development of practical applications

7. What Is "Tightly Coupled Accelerators (TCA)"?
Concept:
- Direct connection between accelerators (GPUs) across nodes without CPU assistance
- Eliminate extra memory copies to the host
- Reduce latency and improve strong scaling with small data sizes
- Enable hardware support for complicated communication patterns

8. Communication on TCA Architecture
- PCIe is used as the communication link between accelerators across nodes.
- Direct device P2P communication is available through PCIe.
- PEACH2 (PCI Express Adaptive Communication Hub ver. 2): the implementation of the PCIe interface and data transfer engine for TCA.
(Figure: two nodes, each with CPU, host memory, a PCIe switch, and a GPU with its memory; the PEACH2 chips of the two nodes are connected to each other over PCIe.)
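The slide is about the inter-node case through PEACH2; as a point of reference, the same P2P idea inside a single node is already exposed by the CUDA runtime. A minimal sketch (the device IDs 0 and 1 and the 1 MiB buffer size are illustrative assumptions):

```c
/* Minimal sketch: intra-node GPU-to-GPU copy over PCIe using CUDA peer access.
 * Assumes two GPUs (devices 0 and 1) that can reach each other via P2P. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { printf("P2P not available\n"); return 1; }

    size_t bytes = 1 << 20;            /* illustrative 1 MiB buffer */
    void *src, *dst;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  /* allow device 0 <-> device 1 traffic */
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    /* The copy flows over PCIe directly between the two GPUs, without host staging. */
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    printf("peer copy done\n");
    return 0;
}
```

PEACH2 extends this picture across nodes by forwarding the PCIe traffic through its external ports, so no host staging is introduced at the node boundary either.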

9. GPU Communication with Traditional MPI
- Traditional MPI over InfiniBand requires three data copies (a code sketch of these steps follows below).
- The data copies between CPU and GPU (steps 1 and 3) have to be performed manually.
  1. Copy from GPU memory to CPU memory through PCI Express (PCIe)
  2. Data transfer over IB
  3. Copy from CPU memory to GPU memory through PCIe
(Figure: on each node, GPU memory - GPU - PCIe switch - CPU/host memory - IB HCA; the two HCAs are linked by InfiniBand.)
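A hedged sketch of what the three steps look like in application code with a non-CUDA-aware MPI (the function name, message tag, and use of blocking calls are illustrative; error checking is omitted):

```c
/* Sketch of the three-step transfer with a non-CUDA-aware MPI:
 * 1) GPU -> host over PCIe, 2) host -> host over InfiniBand, 3) host -> GPU over PCIe. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

void exchange_staged(double *d_buf, size_t n, int peer, int rank) {
    double *h_buf = malloc(n * sizeof(double));                               /* host staging buffer */
    if (rank < peer) {
        cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost); /* step 1 */
        MPI_Send(h_buf, (int)n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);         /* step 2 (send side) */
    } else {
        MPI_Recv(h_buf, (int)n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                          /* step 2 (recv side) */
        cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice); /* step 3 */
    }
    free(h_buf);
}
```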

10. GPU Communication with IB/GDR
- With GPUDirect RDMA (GDR), the InfiniBand controller reads and writes GPU memory directly.
- The temporary host-side copy is eliminated, giving lower latency than the previous method.
- Protocol conversion is still needed.
  1. Direct data transfer (PCIe -> IB -> PCIe)
(Figure: same node layout as before; the transfer now goes GPU memory - PCIe - IB - PCIe - GPU memory without host staging.)
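With GPUDirect RDMA and a CUDA-aware MPI library (an assumption; the slide does not name a specific MPI), the same exchange passes the device pointer straight to MPI and the staging copies disappear, leaving only the PCIe -> IB -> PCIe protocol conversion in the network path:

```c
/* Sketch assuming a CUDA-aware MPI built with GPUDirect RDMA support:
 * d_buf is a device pointer, and the HCA reads/writes GPU memory directly. */
#include <mpi.h>

void exchange_gdr(double *d_buf, int n, int peer, int rank) {
    if (rank < peer)
        MPI_Send(d_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(d_buf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```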

11. GPU Communication with TCA (PEACH2)
- TCA does not need protocol conversion: direct data copy using GDR.
- Much lower latency than InfiniBand.
  1. Direct data transfer (PCIe -> PCIe -> PCIe)
(Figure: GPU memory - PCIe switch - TCA (PEACH2) on one node, connected directly over PCIe to TCA - PCIe switch - GPU memory on the other node.)

12. TCA Node Structure Example
- Similar to an ordinary GPU cluster configuration except for PEACH2.
- 80 PCIe lanes are required.
- PEACH2 can access all GPUs; the NVIDIA Kepler architecture plus "GPUDirect Support for RDMA" are required.
- Single PCI address space.
- Connect among 3 nodes using the remaining PEACH2 ports.
(Block diagram: two Xeon E5 v2 CPUs linked by QPI; four NVIDIA K20X GPUs on PCIe Gen2 x16; a 2-port QDR InfiniBand HCA; PEACH2 attached via PCIe Gen2 x8 with external Gen2 x8 ports.)

13. TCA Node Structure Example (continued)
- Similar to an ordinary GPU cluster configuration except for PEACH2; 80 PCIe lanes are required.
- In practice, performance across QPI is miserable.
- PEACH2 is therefore available for GPU0 and GPU1.
- Note that InfiniBand with GPUDirect for RDMA is available only for GPU2 and GPU3 (a rank-to-GPU mapping sketch follows below).
(Block diagram: same node layout as the previous slide; GPU: NVIDIA K20X.)
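Because of this split (PEACH2 reaches GPU0/GPU1, the IB HCA with GPUDirect RDMA reaches GPU2/GPU3, and QPI crossings should be avoided), each MPI process typically binds to the GPU that shares a PCIe root with the interconnect it will use. A minimal sketch, assuming an MVAPICH2-style launcher that exports MV2_COMM_WORLD_LOCAL_RANK (other launchers use different variables, and the rank-to-GPU policy shown here is only one possible choice, not the HA-PACS/TCA site configuration):

```c
/* Sketch: map each local MPI rank to one of the node's four K20X GPUs so that
 * ranks using PEACH2 land on GPU0/GPU1 and ranks using IB+GDR land on GPU2/GPU3.
 * The environment variable name is launcher-specific (MVAPICH2-style assumed). */
#include <cuda_runtime.h>
#include <stdlib.h>

int select_gpu_for_local_rank(void) {
    const char *s = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : 0;
    int device = local_rank % 4;      /* 4 GPUs per node on HA-PACS/TCA */
    cudaSetDevice(device);            /* ranks 0,1 -> PEACH2 side; ranks 2,3 -> IB side */
    return device;
}
```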

14. Design of PEACH2
- Implemented on an FPGA with four PCIe Gen2 IP blocks (Altera Stratix IV GX), allowing prototyping and flexible enhancement.
- Sufficient communication bandwidth: PCI Express Gen2 x8 for each port (40 Gbps = IB QDR).
- Latency reduction: hardwired logic; low-overhead routing mechanism; efficient address mapping into the PCIe address space using unused bits; a simple comparator decides the output port.
- Sophisticated DMA controller: chaining DMA and a block-stride transfer function (illustrated below).
- It is not only a proof-of-concept implementation; it is also usable for production runs on a GPU cluster.
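As an illustration of what the block-stride function replaces in software, the loop below is only a reference model of the access pattern (it is not PEACH2 code): a block-stride descriptor describes all of the fixed-size blocks at a constant stride at once, so the DMA engine walks the pattern in hardware instead of the host issuing one transfer per block, and chaining lets several such descriptors run back-to-back.

```c
/* Reference model of a block-stride pattern: nblocks blocks of block_bytes each,
 * spaced stride_bytes apart on both source and destination. One chained
 * block-stride descriptor lets the PEACH2 DMA engine perform this whole pattern. */
#include <string.h>
#include <stddef.h>

void block_stride_copy(char *dst, const char *src,
                       size_t block_bytes, size_t stride_bytes, size_t nblocks) {
    for (size_t i = 0; i < nblocks; i++)
        memcpy(dst + i * stride_bytes, src + i * stride_bytes, block_bytes);
}
```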

15. PEACH2 Board (Production Version for HA-PACS/TCA)
- Main board + sub board; PCI Express x8 card edge.
- FPGA: Altera Stratix IV 530GX; most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz).
- DDR3 SDRAM; power supply for the various voltages.
- PCIe x8 cable connector and PCIe x16 cable connector.
(Photo of the board with these parts labeled.)

16. HA-PACS/TCA Compute Node
The PEACH2 board is installed here.
(Photos: front view, 8 nodes per rack; rear view; 3U height.)

17. Inside of the HA-PACS/TCA Compute Node
(Photo of the node internals.)

18. Spec. of HA-PACS Base Cluster & HA-PACS/TCA

                      Base cluster (Feb. 2012)             TCA (Nov. 2013)
  Node                CRAY GreenBlade 8204                 CRAY 3623G4-SM
  MotherBoard         Intel Washington Pass                SuperMicro X9DRG-QF
  CPU                 Intel Xeon E5-2670 x 2 sockets       Intel Xeon E5-2680 v2 x 2 sockets
                      (SandyBridge-EP, 2.6 GHz, 8 cores)   (IvyBridge-EP, 2.8 GHz, 10 cores)
  Memory              DDR3-1600, 128 GB                    DDR3-1866, 128 GB
  GPU                 NVIDIA M2090 x 4                     NVIDIA K20X x 4
  # of Nodes (Racks)  268 (26)                             64 (10)
  Interconnect        Mellanox InfiniBand QDR x2           Mellanox InfiniBand QDR x2
                      (ConnectX-3)                         + PEACH2
  Peak Perf.          802 TFlops                           364 TFlops
  Power               408 kW                               99.3 kW

In total, HA-PACS is over a 1 PFlops system!
