Tightly Coupled Accelerators with Proprietary Interconnect and Its Programming and Applications

Toshihiro Hanawa, Information Technology Center, The University of Tokyo
Taisuke Boku, Center for Computational Sciences, University of Tsukuba

In collaboration with Yuetsu Kodama, Mitsuhisa Sato, Masayuki Umemura (CCS, Univ. of Tsukuba), Hitoshi Murai (RIKEN AICS), and Hideharu Amano (Keio Univ.)

GPU Technology Conference 2015, Mar. 19, 2015
Agenda

- Background
  - HA-PACS / AC-CREST Project
- Introduction of HA-PACS / TCA
  - Organization of TCA
  - PEACH2 Board designed for TCA
  - Basic Evaluations
- Collective Communications
  - Implementation Examples
  - Performance Evaluation
- Application Examples
  - QUDA (QCD)
  - FFTE (FFT)
- Introduction of XcalableACC
  - Concept
  - Code Examples
  - Evaluation of Performance
- Summary
Current Trend of HPC using GPU Computing

Advantageous features
- High peak performance / cost ratio
- High peak performance / power ratio

Examples of HPC systems: GPU clusters and MPPs in the TOP500 (Nov. 2014)
- 2nd: Titan (NVIDIA K20X, 27 PF)
- 6th: Piz Daint (NVIDIA K20X, 7.8 PF)
- 10th: Cray CS-Storm (NVIDIA K40, 6.1 PF)
- 15th: TSUBAME2.5 (NVIDIA K20X, 5.6 PF)
- 48 systems use NVIDIA GPUs.

GPU clusters in the Green500 (Nov. 2014) ("greenest" supercomputers ranked in the TOP500)
- 3rd: TSUBAME-KFC (NVIDIA K20X, 4.4 GF/W)
- 4th: Cray Storm1 (NVIDIA K40, 3.9 GF/W)
- 7th: HA-PACS/TCA (NVIDIA K20X, 3.5 GF/W)
- 8 systems in the Top 10 use NVIDIA GPUs.
Issues of GPU Computing

- Data I/O performance limitation
  - Ex) K20X: PCIe Gen2 x16 peak 8 GB/s (I/O) vs. 1.3 TFLOPS (computation)
  - The communication bottleneck becomes significant in multi-GPU applications
- Strong scaling on GPU clusters
  - Important to shorten the turn-around time of production runs
  - Communication latency has a heavy impact
- Ultra-low latency between GPUs is important for next-generation HPC

Our target is to develop a direct communication system between GPUs in different nodes, as a feasibility study for future accelerated computing.
=> "Tightly Coupled Accelerators (TCA)" architecture
HA-PACS Project

- HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
  - 8th generation of the PAX/PACS series supercomputers at the University of Tsukuba
  - FY2011-2013, operation until FY2016(?)
  - Promotion of computational science applications in key areas of CCS, Tsukuba
  - Target fields: QCD, astrophysics, QM/MM (quantum mechanics / molecular mechanics, bioscience)
- HA-PACS is not only a "commodity GPU cluster" but also an experimental platform
  - HA-PACS base cluster
    - For development of GPU-accelerated code for the target fields and for production runs
    - In operation since Feb. 2012
  - HA-PACS/TCA (TCA = Tightly Coupled Accelerators)
    - For elementary research on direct communication technology for accelerated computing
    - Our original communication chip, "PEACH2", is installed in each node
    - In operation since Nov. 2013
AC-CREST Project

- Project "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" (AC-CREST)
  - Supported by JST-CREST "Development of System Software Technologies for post-Peta Scale High Performance Computing"
- Objectives
  - Realization of high-performance (direct) communication among accelerators
  - Development of system software supporting the communication system among accelerators
  - Development of parallel languages and compilers: higher productivity, highly optimized (offload, communication)
  - Development of practical applications
What is "Tightly Coupled Accelerators (TCA)"?

Concept: direct connection between accelerators (GPUs) across nodes, without CPU assistance
- Eliminates extra memory copies to the host
- Reduces latency and improves strong scaling for small data sizes
- Enables hardware support for complicated communication patterns
Communication on TCA Architecture

- PCIe is used as the communication link between accelerators across nodes
- Direct device-to-device (P2P) communication is available through PCIe
- PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - Implements the PCIe interface and the data transfer engine for TCA

(Figure: two nodes, each with CPU, memory, GPU, and a PCIe switch; the PEACH2 boards of the two nodes are linked directly by a PCIe cable.)
GPU Communication with Traditional MPI

Traditional MPI over InfiniBand requires three data copies, and the copies between CPU and GPU (steps 1 and 3) must be issued manually (a minimal code sketch follows):
1. Copy from GPU memory to CPU memory through PCI Express (PCIe)
2. Data transfer over InfiniBand
3. Copy from CPU memory to GPU memory through PCIe

(Figure: data path GPU mem -> CPU mem -> IB -> CPU mem -> GPU mem, through the PCIe switch on each node.)
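The staged path above can be written down directly with the CUDA runtime and MPI. The following is a minimal sketch, not code from the talk; the function name, buffer names, and the use of MPI_Sendrecv_replace are illustrative assumptions.

```c
/* Minimal sketch of the traditional 3-copy exchange (illustrative only). */
#include <mpi.h>
#include <cuda_runtime.h>

/* d_send/d_recv are device buffers, h_buf is a host staging buffer. */
void exchange_staged(const float *d_send, float *d_recv, float *h_buf,
                     int n, int peer)
{
    /* 1: copy from GPU memory to CPU memory through PCIe */
    cudaMemcpy(h_buf, d_send, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* 2: transfer the host buffer over InfiniBand via MPI */
    MPI_Sendrecv_replace(h_buf, n, MPI_FLOAT, peer, 0, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* 3: copy from CPU memory back to GPU memory through PCIe */
    cudaMemcpy(d_recv, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
}
```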
GPU Communication with IB/GDR

- With GPUDirect RDMA (GDR), the InfiniBand controller reads and writes GPU memory directly (see the sketch below)
- The temporary host copy is eliminated
- Lower latency than the previous method
- Protocol conversion (PCIe -> IB -> PCIe) is still needed

(Figure: direct data transfer PCIe -> IB -> PCIe between the GPU memories of the two nodes.)
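Assuming a CUDA-aware MPI library built with GPUDirect RDMA support, the same exchange simply passes device pointers to MPI and the staging copies disappear. This is a sketch under that assumption, not the talk's code.

```c
/* Minimal sketch assuming a CUDA-aware MPI with GPUDirect RDMA:
 * device pointers are handed to MPI and the HCA accesses GPU memory
 * directly, so steps 1 and 3 of the previous slide are gone. */
#include <mpi.h>

void exchange_gdr(const float *d_send, float *d_recv, int n, int peer)
{
    MPI_Sendrecv(d_send, n, MPI_FLOAT, peer, 0,
                 d_recv, n, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```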
GPU Communication with TCA (PEACH2)

- TCA needs no protocol conversion (PCIe -> PCIe -> PCIe)
- Direct data copy using GDR (see the sketch below)
- Much lower latency than InfiniBand

(Figure: direct data transfer between the GPU memories of two nodes via the TCA link.)
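The TCA software stack exposes one-sided transfers over PEACH2, but the slides do not show its API; the handle type and function names below (tca_handle_t, tca_put, tca_wait) are therefore hypothetical placeholders used only to illustrate the host-bypassing put.

```c
/* Illustrative one-sided put over PEACH2 (hypothetical API names). */
#include <stddef.h>

typedef struct tca_handle tca_handle_t;  /* opaque handle to a registered remote GPU buffer */

/* hypothetical prototypes, for illustration only */
void tca_put(tca_handle_t *remote, size_t remote_off,
             const void *d_src, size_t size);
void tca_wait(tca_handle_t *remote);

void halo_put(tca_handle_t *remote, size_t remote_off,
              const void *d_src, size_t size)
{
    /* DMA from local GPU memory into the peer GPU's memory:
     * PCIe -> PEACH2 -> PCIe, with no host copy and no protocol conversion */
    tca_put(remote, remote_off, d_src, size);
    tca_wait(remote);                    /* block until the DMA completes */
}
```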
TCA Node Structure Example

- Similar to an ordinary GPU cluster configuration, except for PEACH2
  - 80 PCIe lanes are required
- PEACH2 can access all GPUs
  - NVIDIA Kepler architecture + "GPUDirect Support for RDMA" are required
  - Single PCI address space
- Connect among 3 nodes using the remaining PEACH2 port

(Figure: node block diagram with two Xeon E5 v2 CPUs linked by QPI, four NVIDIA K20X GPUs on PCIe Gen2 x16, a 2-port InfiniBand QDR HCA on Gen2 x8, and the PEACH2 board with Gen2 x8 links.)
TCA Node Structure Example (cont'd)

- Similar to an ordinary GPU cluster configuration, except for PEACH2
  - 80 PCIe lanes are required
- Actually, performance across QPI is miserable (see the sketch below)
  - PEACH2 is available for GPU0 and GPU1
  - Note that InfiniBand with GPUDirect for RDMA is available only for GPU2 and GPU3

(Figure: the same node block diagram as the previous slide.)
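Which GPU pairs can use PCIe peer access on a node can be probed at runtime with the CUDA runtime; the short sketch below is illustrative and assumes the device numbering of the figure.

```c
/* Sketch: report peer-access capability between all GPU pairs on a node.
 * Even where peer access is reported across the QPI link, the slide's point
 * is that its performance is poor, so PEACH2 is paired with GPU0/GPU1 and
 * IB+GDR with GPU2/GPU3 on this node layout. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; i++) {
        for (int j = 0; j < ndev; j++) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("GPU%d -> GPU%d: peer access %s\n",
                   i, j, ok ? "supported" : "not supported");
        }
    }
    return 0;
}
```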
Design of PEACH2

- Implemented on an FPGA with four PCIe Gen2 IPs
  - Altera Stratix IV GX
  - Prototyping, flexible enhancement
- Sufficient communication bandwidth
  - PCI Express Gen2 x8 for each port (40 Gbps = IB QDR)
- Latency reduction
  - Hardwired logic
  - Low-overhead routing mechanism: efficient address mapping in the PCIe address area using unused bits; a simple comparator selects the output port
- Sophisticated DMA controller
  - Chaining DMA, block-stride transfer function (see the sketch below)
- It is not only a proof-of-concept implementation; it is also usable for production runs in a GPU cluster.
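The chaining DMA and block-stride transfer can be pictured as a linked list of descriptors handed to the PEACH2 DMA engine. The struct below is a hypothetical illustration; the real descriptor format and field widths are not given in the slides.

```c
/* Hypothetical DMA descriptor sketching chaining + block-stride transfer.
 * Field names and widths are illustrative, not the real register layout. */
#include <stdint.h>

typedef struct dma_desc {
    uint64_t src_addr;      /* source PCIe address (e.g. local GPU memory)     */
    uint64_t dst_addr;      /* destination PCIe address on the remote node     */
    uint32_t block_size;    /* bytes per contiguous block                      */
    uint32_t src_stride;    /* byte distance between blocks at the source      */
    uint32_t dst_stride;    /* byte distance between blocks at the destination */
    uint32_t block_count;   /* number of blocks (e.g. a halo/strided pattern)  */
    struct dma_desc *next;  /* next descriptor in the chain; NULL terminates   */
} dma_desc_t;
```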
PEACH2 Board (Production Version for HA-PACS/TCA)

(Photo of the board, main board + sub board, with the following parts labeled:)
- PCI Express x8 card edge
- FPGA (Altera Stratix IV 530GX): most parts operate at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
- DDR3-SDRAM
- Power supply for various voltages
- PCIe x8 cable connector
- PCIe x16 cable connector
HA-PACS/TCA Compute Node

(Photos: front view (8 nodes / rack) and rear view of the 3U-height chassis, with a callout marking where the PEACH2 board is installed.)
Inside of the HA-PACS/TCA Compute Node

(Photo of the node interior.)
Spec. of HA-PACS Base Cluster & HA-PACS/TCA

                     Base cluster (Feb. 2012)                  TCA (Nov. 2013)
Node                 CRAY GreenBlade 8204                      CRAY 3623G4-SM
Motherboard          Intel Washington Pass                     SuperMicro X9DRG-QF
CPU                  Intel Xeon E5-2670 x 2 sockets            Intel Xeon E5-2680 v2 x 2 sockets
                     (SandyBridge-EP, 2.6 GHz, 8 cores)        (IvyBridge-EP, 2.8 GHz, 10 cores)
Memory               DDR3-1600, 128 GB                         DDR3-1866, 128 GB
GPU                  NVIDIA M2090 x 4                          NVIDIA K20X x 4
# of nodes (racks)   268 (26)                                  64 (10)
Interconnect         Mellanox InfiniBand QDR x2 (ConnectX-3)   Mellanox InfiniBand QDR x2 + PEACH2
Peak perf.           802 TFlops                                364 TFlops
Power                408 kW                                    99.3 kW

In total, HA-PACS is over a 1 PFlops system!