Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap

Rajesh Nishtala (1), Paul Hargrove (2), Dan Bonachea (1), and Katherine Yelick (1,2)
(1) University of California, Berkeley
(2) Lawrence Berkeley National Laboratory

(to appear at IEEE IPDPS 2009)
Observations

• Performance gains are now delivered through increasing concurrency rather than rising clock rates
  • Application scalability is essential for future performance improvements
  • 100,000s of processors will be the norm in the very near future
• Maximize the use of available resources
  • Leverage communication/communication and communication/computation overlap
• Systems will favor many slower, power-efficient processors over fewer faster, power-inefficient ones
  • Light-weight communication and runtime systems minimize software overhead
  • Close semantic match to the underlying hardware
Overview

• Discuss our new port of GASNet, the communication subsystem for the Berkeley UPC compiler, to the BlueGene/P
• Outline the key differences between one-sided and two-sided communication and their applicability to modern networks
• Show how the microbenchmark performance advantages translate to real applications
  • We chose the communication-bound NAS FT benchmark as the case study
• Thesis statement: the one-sided communication model found in GASNet is a better semantic fit to modern highly concurrent systems because it better leverages features such as RDMA, and thus allows applications to realize better scaling.
BlueGene/P Overview

• Representative example of future highly concurrent systems
• Compute node: 4 cores running at 850 MHz with 2 GB of RAM and 13.6 GB/s of bandwidth between main memory and the cores
• Total cores = (4 cores/node) x (32 nodes/node card) x (32 node cards/rack) x (up to 72 racks) = up to 294,912 cores
• Different networks for different tasks
  • 3D Torus for general point-to-point communication (5.1 GB/s per node)
  • Global Interrupt network for barriers (1.3 us across 72 racks)
  • Global Collective Network for one-to-many broadcasts or many-to-one reductions (0.85 GB/s per link)

[Figure: BlueGene/P packaging hierarchy, from chip to compute card to node card to rack to the full 72-rack system. Figure and data from "IBM System Blue Gene Solution: Blue Gene/P Application Development" by Carlos Sosa and Brant Knudson, IBM Redbooks, Dec. 2008, ISBN 0738432113.]
Partitioned Global Address Space (PGAS) Languages

• Programming model suitable for both shared and distributed memory systems
• The language presents a logically shared memory
  • Any thread may directly read/write data located on a remote processor
• The address space is partitioned so each processor has affinity to a memory region
  • Accesses to "local" memory are potentially much faster

[Figure: a shared address space spanning threads P0-P3, with each thread also holding a private address space]
Data Transfers in UPC

• Example: send P0's version of A to P1

• MPI code:

    double A;
    MPI_Status stat;
    if (myrank == 0) {
      A = 42.0;
      MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (myrank == 1) {
      MPI_Recv(&A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    }

• UPC code:

    shared [1] double A[4];
    if (MYTHREAD == upc_threadof(&A[0])) {
      A[0] = 42.0;
      upc_memput(&A[1], &A[0], sizeof(double));
    }

[Figure: shared array A distributed one element per thread across P0-P3]
One-Sided versus Two-Sided Communication

• A one-sided put/get can directly transfer data without interrupting the host cores
  • The message itself carries the remote address, so the network knows where to put the data
  • The CPU need not be involved if the NIC supports Remote Direct Memory Access (RDMA)
  • Synchronization is decoupled from the data movement
• Two-sided send/recv requires a rendezvous with the host cores to agree on where the data needs to be placed before RDMA can be used
  • Bounce buffers can be used for small enough messages, but the extra serial copying can make them prohibitively expensive
• Most modern networks provide RDMA functionality, so why not just use it directly?

[Figure: a one-sided put (e.g. GASNet) carries the destination address with the data payload and is deposited directly into memory by the NIC; a two-sided send/recv (e.g. MPI) carries a message id and must be matched against a pre-posted receive on the host cores]
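To make the decoupling of data movement and synchronization concrete, below is a minimal sketch of a one-sided put using GASNet's explicit-handle nonblocking API. The function and variable names are illustrative, and the usual setup (gasnet_init/gasnet_attach, segment registration) is assumed to have happened elsewhere.

    #include <stddef.h>
    #include <gasnet.h>

    /* Minimal sketch: one-sided put with GASNet's explicit-handle API. */
    void put_example(gasnet_node_t peer, void *remote_addr,
                     double *local_buf, size_t nbytes) {
      /* Initiate the transfer: no matching receive is posted on the peer;
         the NIC can deposit the data directly with RDMA. */
      gasnet_handle_t h = gasnet_put_nb(peer, remote_addr, local_buf, nbytes);

      /* ... independent computation can overlap with the transfer here ... */

      /* Synchronization is decoupled from initiation: block only when
         completion actually matters (e.g. before reusing local_buf). */
      gasnet_wait_syncnb(h);
    }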
GASNet Overview

• Portable, high-performance runtime system for many different PGAS languages
• Projects: Berkeley UPC, GCC-UPC, Titanium, Rice Co-Array Fortran, Cray Chapel, Cray UPC & Co-Array Fortran, and many other experimental projects
• Supported networks: BlueGene/P (DCMF), InfiniBand (VAPI and IBV), Cray XT (Portals), Quadrics (Elan), Myrinet (GM), IBM LAPI, SHMEM, SiCortex (soon to be released), UDP, MPI
• 100% open source under a BSD license
• Features:
  • Multithreaded (works in VN, Dual, or SMP modes)
  • Provides efficient nonblocking puts and gets
    • Often just a thin wrapper around hardware puts and gets
  • Support for Vector, Indexed, and Strided (VIS) operations
  • Provides a rich Active Message API
  • Provides nonblocking collective communication
    • Collectives will soon be automatically tuned
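As an illustration of the Active Message API mentioned above, here is a minimal ping/pong sketch: a short request runs a handler on the target, which bounces a short reply back. The handler indices, names, and completion flag are illustrative assumptions, and initialization and error checking are omitted.

    #include <gasnet.h>

    #define PING_HANDLER 128
    #define PONG_HANDLER 129

    static volatile int pong_arrived = 0;

    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
      /* Runs on the target node; send a short reply back to the requester. */
      gasnet_AMReplyShort1(token, PONG_HANDLER, arg0);
    }

    static void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
      (void)token; (void)arg0;
      pong_arrived = 1;   /* reply handler runs back on the initiator */
    }

    /* Handler table registered via gasnet_attach(htable, 2, segsize, minheapoffset). */
    gasnet_handlerentry_t htable[] = {
      { PING_HANDLER, (void (*)()) ping_handler },
      { PONG_HANDLER, (void (*)()) pong_handler },
    };

    void ping(gasnet_node_t peer) {
      gasnet_AMRequestShort1(peer, PING_HANDLER, 42);
      GASNET_BLOCKUNTIL(pong_arrived);  /* poll the network until the reply lands */
    }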
GASNet Latency Performance

• GASNet is implemented on top of the Deep Computing Messaging Framework (DCMF)
  • Lower level than MPI
  • Provides puts, gets, AMSend, and collectives
• Point-to-point ping-ack latency performance
  • N-byte transfer with a 0-byte acknowledgement
  • GASNet takes advantage of DCMF remote completion notification
    • The minimum semantics needed to implement the UPC memory model
• Almost a factor of two difference up to 32 bytes
  • An indication of a better semantic match to the underlying communication system

[Figure: roundtrip latency (microseconds) vs. transfer size (1 to 512 bytes) for MPI Send/Recv, GASNet Get + sync, and GASNet Put + sync; lower is better]
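A rough sketch of how the "put + sync" latency in the plot can be measured: time a loop of blocking puts and report the average per iteration. The setup (attached segment, choice of peer) follows the usual microbenchmark pattern; the names here are illustrative, not the paper's actual test code.

    #include <stddef.h>
    #include <sys/time.h>
    #include <gasnet.h>

    double put_latency_us(gasnet_node_t peer, void *remote_addr,
                          void *local_buf, size_t nbytes, int iters) {
      struct timeval t0, t1;
      gettimeofday(&t0, NULL);
      for (int i = 0; i < iters; i++) {
        /* gasnet_put blocks until the put is complete, so each iteration
           pays the full put-plus-acknowledgement cost. */
        gasnet_put(peer, remote_addr, local_buf, nbytes);
      }
      gettimeofday(&t1, NULL);
      double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
      return us / iters;   /* average latency per put, in microseconds */
    }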
GASNet Multilink Bandwidth

• Each node has six 850 MB/s* bidirectional links
• Vary the number of links used from 1 to 6
• Initiate a series of nonblocking puts on the links (round-robin)
  • Communication/communication overlap
• Both MPI and GASNet asymptote to the same bandwidth
• GASNet outperforms MPI at midrange message sizes
  • Lower software overhead implies more efficient message injection
  • GASNet avoids the rendezvous and leverages RDMA directly

* Kumar et al. showed that the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak bandwidth. See "The Deep Computing Messaging Framework: generalized scalable message passing on the Blue Gene/P supercomputer", Kumar et al., ICS 2008.

[Figure: flood bandwidth (MB/s, 1 MB = 2^20 bytes) vs. transfer size (512 bytes to 2 MB) for GASNet and MPI over 1, 2, 4, and 6 links, with the one-link and six-link peaks marked]
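The flood-bandwidth measurement boils down to injecting many nonblocking puts before synchronizing once, so injection is never stalled waiting for completion. Below is a hedged sketch using GASNet's implicit-handle API; choosing destinations that exercise different torus links (the round-robin above) is assumed to be encoded in the peers[] array, and the names are illustrative.

    #include <stddef.h>
    #include <gasnet.h>

    void flood_puts(int nlinks, gasnet_node_t *peers, void **remote_addrs,
                    void *local_buf, size_t nbytes, int msgs_per_link) {
      for (int i = 0; i < msgs_per_link; i++) {
        for (int l = 0; l < nlinks; l++) {
          /* Implicit-handle nonblocking put: returns as soon as the message
             is injected, allowing communication/communication overlap. */
          gasnet_put_nbi_bulk(peers[l], remote_addrs[l], local_buf, nbytes);
        }
      }
      /* Wait once for all outstanding implicit-handle puts to complete. */
      gasnet_wait_syncnbi_puts();
    }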
Case Study: NAS FT Benchmark

• Perform a large 3D FFT
  • Used in many areas of computational science: molecular dynamics, CFD, image processing, signal processing, astrophysics, etc.
• Representative of a class of communication-intensive algorithms
  • Requires parallel many-to-many communication
  • Stresses the communication subsystem
  • Limited by the bandwidth (namely the bisection bandwidth) of the network
• Building on our previous work, we perform a 2D partition of the domain
  • Requires two rounds of communication rather than one
  • Each processor communicates with O(sqrt(T)) threads in each round

[Figure: 1D partition vs. 2D partition of the FFT domain across the processor grid]
Our Terminology

• The domain is NX columns by NY rows by NZ planes
• We overlay a TY x TZ processor grid (i.e. NX is the only contiguous dimension)
• Plane: NX columns by NY rows, shared amongst a team of TY processors
• Slab: NX columns by NY/TY rows of elements, residing entirely on one thread
  • Each thread owns NZ/TZ slabs
• Packed slab: NX columns by NY/TY rows by NZ/TZ planes
  • All the data a particular thread owns

[Figure: the 2D partition, showing a plane, a slab, and a packed slab within the NX x NY x NZ domain]
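The per-thread data sizes implied by these definitions can be written down directly. The sketch below only restates the terminology in code, with illustrative names; it is not code from the benchmark.

    #include <stddef.h>

    typedef struct {
      size_t plane;        /* NX * NY              : shared by a team of TY threads */
      size_t slab;         /* NX * (NY/TY)         : one thread's piece of a plane  */
      size_t packed_slab;  /* NX * (NY/TY) * (NZ/TZ): all the data one thread owns  */
    } ft_sizes_t;

    ft_sizes_t ft_sizes(size_t NX, size_t NY, size_t NZ, size_t TY, size_t TZ) {
      ft_sizes_t s;
      s.plane = NX * NY;
      s.slab = NX * (NY / TY);
      s.packed_slab = s.slab * (NZ / TZ);  /* each thread owns NZ/TZ slabs */
      return s;
    }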
3D-FFT Algorithm

• Perform a 3D FFT (as part of NAS FT) across a large rectangular prism
• Perform an FFT in each of the 3 dimensions
  • With a 2D processor layout, a team exchange is needed for 2 of the 3 dimensions
• Performance is limited by the bisection bandwidth of the network
• Algorithm:
  • Perform FFTs across the rows

[Figure: four planes A-D, each a 4x4 grid of blocks; each processor owns a row of 4 squares (16 processors in the example)]
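The talk's overlap theme applies naturally to this row phase: once the row FFTs of one slab are finished, that slab can be injected with a nonblocking put while the next slab is computed. The sketch below illustrates the idea with placeholder helpers (fft_rows_of_slab, slab_dest_node, etc.); it is not the benchmark's actual code.

    #include <stddef.h>
    #include <gasnet.h>

    void fft_rows_of_slab(int slab);                /* placeholder compute step   */
    extern gasnet_node_t slab_dest_node(int slab);  /* placeholder destination map */
    extern void *slab_dest_addr(int slab);
    extern void *slab_src_addr(int slab);
    extern size_t slab_bytes;

    void row_phase_with_overlap(int num_slabs) {
      for (int s = 0; s < num_slabs; s++) {
        fft_rows_of_slab(s);
        /* Communication for slab s overlaps with computation of slab s+1. */
        gasnet_put_nbi_bulk(slab_dest_node(s), slab_dest_addr(s),
                            slab_src_addr(s), slab_bytes);
      }
      gasnet_wait_syncnbi_puts();  /* drain all outstanding puts */
    }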