Basic features of the API: memory allocation and sample API calls - PowerPoint presentation


  1. - DMAPP in context
     - Basic features of the API
     - Memory allocation and sample API calls
     - Preliminary Gemini performance measurements

  2. The Distributed Memory Application (DMAPP) API
     - Supports features of the Gemini Network Interface
     - Used by higher levels of the software stack: PGAS compiler runtimes and the SHMEM library
     - A balance between portability and hardware intimacy
     - Intended to be used by system software developers; application developers should use SHMEM

  3. [Software stack diagram: at user level, applications, MPICH2, Cray SHMEM and PGAS compilers sit above user-level GNI and DMAPP; at kernel level, the Linux core sits above kernel-level GNI and the Gemini hardware abstraction layer; the hardware is the Gemini network processor]

  4. - Distributed memory model
     - One-sided model for participating (SPMD) processes launched by the ALPS aprun command
     - Each PE has local memory, but also has one-sided access (PUT/GET) to remote memory
     - Remote memory must be in an accessible memory segment

  5. [Diagram: a PUT from a source buffer on one PE to a destination buffer on another PE]
     - The network supports direct remote get/put from user process to user process
     - Mechanisms: Block Transfer Engine (BTE) and Fast Memory Access (FMA), including Atomic Memory Operations (AMOs)

  6. [Diagram: a remote operation targeting the segments of a remote process]
     - The remote source or destination lies in either the data or symmetric-heap segment
     - Symmetry means we can use local address information in the remote context

  7. dmapp_init
     - Sets up access to the data and symmetric-heap segments (exports memory)
     - Acts as a barrier
     - You can set or read available resource limits
     dmapp_get_jobinfo
     - Returns a structure with useful information: the number of PEs, the index of this PE, and pointers to the data and symmetric-heap segments (required in other calls)

  8. dmapp_put(*target_addr, *target_seg, target_pe, *source_addr, nelems, type)
     - Remote locations are defined by: address, segment, PE
     - This is a blocking operation
     - type can be DMAPP_{BYTE,DW,QW,DQW} for 1-, 4-, 8- and 16-byte elements
     - There is an analogous get call

  9. - Blocking (no suffix): dmapp_put, dmapp_get
     - Non-blocking explicit (_nb suffix): dmapp_put_nb(..., syncid)
     - Non-blocking implicit (_nbi suffix): no handle to test for completion
     - Synchronization (memory completion/visibility): you can wait on a specific syncid, or wait for all implicit operations to complete

 10. [Diagram: iput copying local elements to strided remote locations]
     - Strided calls: dmapp_iput..., dmapp_iget...
     - Additional arguments define the source and destination strides in terms of elements

 11. [Diagram: ixput scattering contiguous local data to indexed remote locations]
     - Scatter/gather calls: dmapp_ixput..., dmapp_ixget...
     - Local data is contiguous
     - Remote data is distributed as defined by an array of offsets

 12. [Diagram: put_ixpe sending nelems = 3 to the same address on PEs 0, 1 and 2]
     - Put with indexed PE-stride calls: dmapp_put_ixpe..., dmapp_get_ixpe...
     - Local data is contiguous
     - Remote data is distributed (as defined by an array of PE offsets) to the same address on each PE
     - Use for small amounts of data
     - These are not collective operations

 13. [Diagram: scatter_ixpe sending nelems = 1 to PEs 2, 4 and 6]
     - Scatter/gather with indexed PE-stride calls: dmapp_scatter_ixpe, dmapp_gather_ixpe
     - Local data is contiguous
     - The source is scattered to (or gathered from) the listed PEs, nelems elements at a time

 14. Atomic operations on 8-byte (QW) remote data:

     Command   Operation
     AADD      Atomic ADD
     AAND      Atomic AND
     AOR       Atomic OR
     AXOR      Atomic exclusive OR
     AFADD     Atomic fetch and ADD
     AFAND     Atomic fetch and AND
     AFOR      Atomic fetch and OR
     AFXOR     Atomic fetch and XOR
     AFAX      Atomic fetch AND-exclusive-OR
     ACSWAP    Compare and SWAP

 15. [Diagram: AADD and AFADD operating on a remote value]
     - Direct support in the NIC
     - Be careful to read such values only via the DMAPP API

 16. - Some calls return a syncid (_nb); you can test or wait for completion:
       dmapp_syncid_wait(*syncid)
       dmapp_syncid_test(*syncid, *flag)
     - For implicit non-blocking operations (_nbi):
       dmapp_gsync_wait()
       dmapp_gsync_test(*flag)
     - Use for many small messages

 17. DMAPP applications can allocate memory in the symmetric heap:

       double *a;
       a = (double *) dmapp_sheap_malloc(N * sizeof(double));

     - There are associated realloc and free calls
     - The application is responsible for maintaining the symmetry of allocations

 18. DMAPP exports the data and symmetric-heap segments for you. This means:
     - For C: file-scope data, statics inside functions, and anything allocated in the symmetric heap
     - For Fortran (no API yet, but if there were one): SAVEd data and data in COMMON blocks

 19. [Diagram: each PE atomically adds 1 to a barrier counter on the master PE]
     - Atomic add to a counter on the master PE (fetch-and-add where the value must be tested)
     - The master compares the counter with npes - 1 and swaps it with 0
     - ... then the master releases the other PEs

 20.  static uint64_t barrier_counter, bc;

      if (mype == master) {
          do {   /* wait until the counter reaches npes-1, then swap it with 0 */
              dmapp_acswap_qw(&bc, (void *)&barrier_counter,
                              seg_data, mype, npes - 1, 0);
          } while (bc != (npes - 1));
      } else {
          dmapp_aadd_qw((void *)&barrier_counter, seg_data, master, 1);
      }
      /* ... now release the barrier */

 21. SHMEM
     - Has the same SPMD model
     - Requires the use of symmetric memory
     - The original interface is blocking; non-standard extensions add non-blocking put/get
     - Varying-sized data items via a typed API
     - Get/put with strided and scatter/gather variants
     - Barrier and collective operations on sets of PEs
     - Has the same atomic memory operations
     - SHMEM is implemented using DMAPP on Gemini systems

 22. - Data measured on a prototype system during Q1 2010
     - 2100 MHz Opteron processors
     - 2400 MHz HyperTransport interface
     - Dual-node tests run between PEs on neighbouring Gemini routers

 23. [Chart: time in microseconds (y-axis 0.0-2.5) vs. transfer size in bytes (8-1024) for three series: PUT ping-pong, PUT measured at source, and GET]

 24. [Chart: bandwidth in MB/s (y-axis 0-7000) vs. element size (8 bytes to 64 KB) for PPN = 1, 2 and 4]

 25. [Chart: bandwidth in MB/s (y-axis 0-3000) vs. number of non-blocking puts (1-64) for 8-, 64- and 256-byte elements]

 26. [Chart: rate in millions of elements/sec (y-axis 0-160) vs. stride in 64-bit words (2-4096) for vector lengths 16, 64 and 4096]

 27. [Chart: AMO rate in millions (y-axis 0-120) vs. number of processes (0-1024) for 1 AMO and 8192 AMOs]

 28. - Latency (~1 us) far better than SeaStar
     - Good aggregate bandwidths on small transfers
     - High AMO rates, especially when multiple processes target the same variables
     - Strided puts are an important case for CAF
     - Ongoing optimization effort (for example, reducing the number of FMA descriptor updates)

 29. - What is DMAPP and where does it fit?
     - Basic features of the API
     - Memory allocation and sample API calls
     - Preliminary Gemini performance data
