
Berkeley UPC (PowerPoint PPT Presentation)
Yili Zheng, Lawrence Berkeley National Laboratory, Berkeley UPC Team
Workshop on Programming Environments for Emerging Parallel Systems, 6/22/2010



  1. Yili Zheng Lawrence Berkeley National Laboratory

  2. Berkeley UPC Team
  • Project Lead: Katherine Yelick
  • Team members: Filip Blagojevic, Dan Bonachea, Paul Hargrove, Costin Iancu, Seung-Jai Min, Yili Zheng
  • Former members: Christian Bell, Wei Chen, Jason Duell, Parry Husbands, Rajesh Nishtala, Mike Welcome
  • A joint project of LBNL and UC Berkeley

  3. Motivation
  • Scalable systems have either distributed memory or shared memory without cache coherency
    – Clusters: Ethernet, InfiniBand, Cray XT, IBM BlueGene
    – Hybrid nodes: CPU + GPU or other kinds of accelerators
    – SoC: IBM Cell, Intel Single-chip Cloud Computer (SCC)
  • Challenges of message-passing programming models
    – Difficult data partitioning for irregular applications
    – Memory space starvation due to data replication
    – Performance overheads from two-sided communication semantics

  4. Partitioned Global Address Space
  [Figure: four threads, each with its own private segment and a shared segment; the shared segments together form the global address space]
  • Global data view abstraction for productivity
  • Vertical partitions among threads for locality control
  • Horizontal partitions between shared and private segments for data placement optimizations
  • Friendly to non-cache-coherent architectures
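  A minimal UPC sketch of this partitioning (not code from the talk; the array and variable names are illustrative). Each thread's element of the shared array lives in that thread's shared segment, while ordinary C objects stay in the private segment:

    #include <upc.h>
    #include <stdio.h>

    shared int counters[THREADS];  /* one element in each thread's shared segment */
    int scratch;                   /* ordinary C object: lives in the private segment */

    int main(void) {
        counters[MYTHREAD] = MYTHREAD;   /* write my own (local) shared element */
        scratch = 2 * MYTHREAD;          /* private data, invisible to other threads */
        upc_barrier;                     /* make all shared writes visible */
        if (MYTHREAD == 0) {
            int i, sum = 0;
            for (i = 0; i < THREADS; i++)
                sum += counters[i];      /* reads of other threads' elements are one-sided gets */
            printf("sum of thread ids = %d (threads = %d)\n", sum, THREADS);
        }
        return 0;
    }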

  5. PGAS Example: Global Matrix Distribution
  [Figure: a 4x4 matrix with elements 1-16, shown both in its global (logical) view and in its distributed storage, where each thread holds one block of the matrix in its shared segment]
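  A hedged sketch of how such a distribution can be declared in UPC, assuming compilation with a fixed (static) thread count so the layout qualifier is a compile-time constant, and assuming the block size stays below UPC_MAX_BLOCK_SIZE. For simplicity the sketch uses a row-blocked layout rather than the 2x2-block layout in the figure; names and sizes are illustrative:

    #include <upc.h>

    #define N  256
    #define NB (N / THREADS)        /* rows per thread; assumes N is divisible by THREADS */

    /* Row-blocked layout: thread t holds rows t*NB .. (t+1)*NB-1 in its shared segment */
    shared [NB * N] double A[N][N];

    int main(void) {
        int i, j;
        /* The affinity expression &A[i][0] makes each thread touch only its own rows,
           so this initialization involves no remote communication */
        upc_forall (i = 0; i < N; i++; &A[i][0]) {
            for (j = 0; j < N; j++)
                A[i][j] = (i == j) ? 1.0 : 0.0;
        }
        upc_barrier;
        return 0;
    }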

  6. UPC Overview
  • PGAS dialect of ISO C99
  • Distributed shared arrays
  • Dynamic shared-memory allocation
  • One-sided shared-memory communication (see the sketch below)
  • Synchronization: barriers, locks, memory fences
  • Collective communication library
  • Parallel I/O library
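  A small sketch combining two of the features above, dynamic shared-memory allocation and one-sided communication, using the standard UPC library calls (not code from the talk; buffer names and sizes are illustrative):

    #include <upc.h>

    #define NPER 64   /* elements with affinity to each thread */

    int main(void) {
        /* Collective allocation: THREADS blocks of NPER ints, block i with affinity to thread i */
        shared [NPER] int *data = (shared [NPER] int *) upc_all_alloc(THREADS, NPER * sizeof(int));

        int local[NPER];
        int i;
        for (i = 0; i < NPER; i++)
            local[i] = MYTHREAD * NPER + i;

        /* One-sided put: copy the private buffer into the next thread's block of the shared array */
        upc_memput(&data[((MYTHREAD + 1) % THREADS) * NPER], local, NPER * sizeof(int));

        upc_barrier;   /* synchronize; all puts are now visible */
        if (MYTHREAD == 0)
            upc_free((shared void *) data);   /* a single thread frees the collective allocation */
        return 0;
    }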

  7. Key Components for Scalability
  • One-sided communication and active messages
  • Efficient resource sharing for multi-core systems
  • Non-blocking collective communication

  8. Berkeley UPC Software Stack
  [Diagram, top to bottom; the upper layers are language dependent, the lower layers hardware dependent:]
  • UPC Applications
  • UPC-to-C Translator
  • Translated C code with runtime calls
  • UPC Runtime
  • GASNet Communication Library
  • Network Drivers and OS Libraries

  9. Berkeley UPC Features
  • Data transfers for complex data types (vector, indexed, strided)
  • Non-blocking memory copy (see the sketch below)
  • Point-to-point synchronization
  • Remote atomic operations
  • Active messages
  • Extensions to the UPC collectives
  • Portable timers
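  A sketch of how the non-blocking memory copy extension can be used to overlap communication with computation. The identifiers here (bupc_handle_t, bupc_memput_async, bupc_waitsync) and the header name are recalled from the Berkeley UPC extensions and should be checked against the installed release; buffer names are illustrative:

    #include <upc.h>
    #include <bupc_extensions.h>   /* Berkeley UPC extensions (header name assumed) */

    #define NELEM 1024
    shared [NELEM] double dst[NELEM * THREADS];

    int main(void) {
        double src[NELEM];
        int i;
        for (i = 0; i < NELEM; i++) src[i] = (double) i;

        /* Start a non-blocking one-sided put into the next thread's block of dst ... */
        bupc_handle_t h = bupc_memput_async(&dst[((MYTHREAD + 1) % THREADS) * NELEM],
                                            src, NELEM * sizeof(double));

        /* ... overlap independent local computation here ... */

        bupc_waitsync(h);   /* block until the put is locally complete */
        upc_barrier;
        return 0;
    }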

  10. One-Sided vs. Two-Sided Messaging
  [Figure: a two-sided message (e.g., MPI) carries a message ID plus the data payload and must be matched by the host CPU on the target; a one-sided put (e.g., UPC) carries the destination address plus the data payload, so the network interface can deposit it directly into memory]
  • Two-sided messaging
    – The message does not contain the final destination address; it must be looked up on the target node
    – Point-to-point synchronization is implied with every transfer
  • One-sided messaging
    – The message contains the final destination address
    – Synchronization is decoupled from data movement
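  In UPC this decoupling looks roughly as follows: the put itself names the destination address, and the programmer chooses separately when to synchronize. A minimal sketch (buffer names are illustrative):

    #include <upc.h>

    #define N 256
    shared [N] char inbox[N * THREADS];   /* N bytes with affinity to each thread */

    int main(void) {
        char msg[N];
        int peer = (MYTHREAD + 1) % THREADS;
        int i;
        for (i = 0; i < N; i++) msg[i] = (char) MYTHREAD;

        /* One-sided put: the call carries the destination address directly;
           no matching receive is posted on the target thread */
        upc_memput(&inbox[peer * N], msg, N);

        /* Synchronization is a separate, explicit step */
        upc_barrier;

        /* After the barrier, every thread can safely read what landed in its inbox */
        return 0;
    }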

  11. Active Messages
  [Figure: node A sends a request to node B; a request handler runs on B and sends back a reply; a reply handler runs on A]
  • Active messages = data + action
  • Key enabling technology for both one-sided and two-sided communication
    – Software implementation of Put/Get
    – Eager and rendezvous protocols
  • Remote procedure calls
    – Facilitate "owner computes"
    – Spawn asynchronous tasks
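  A hedged sketch of the request/reply pattern using the GASNet-1 active-message API. The handler indices and names are illustrative, error checking is omitted, and the segment arguments to gasnet_attach should be adapted to the installed GASNet:

    #include <gasnet.h>
    #include <stdio.h>

    #define HIDX_PING 201   /* client handler indices must lie in GASNet's client range */
    #define HIDX_PONG 202

    static volatile int done = 0;

    /* Request handler: runs on the target node when the request arrives */
    static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
        /* ... perform the "action" on data owned by this node ... */
        gasnet_AMReplyShort1(token, HIDX_PONG, arg0);
    }

    /* Reply handler: runs back on the requesting node */
    static void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg0) {
        printf("request %d acknowledged\n", (int) arg0);
        done = 1;
    }

    static gasnet_handlerentry_t htable[] = {
        { HIDX_PING, (void (*)()) ping_handler },
        { HIDX_PONG, (void (*)()) pong_handler },
    };

    int main(int argc, char **argv) {
        gasnet_init(&argc, &argv);
        gasnet_attach(htable, 2, GASNET_PAGESIZE, 0);   /* minimal segment, for illustration */

        gasnet_node_t peer = (gasnet_mynode() + 1) % gasnet_nodes();
        gasnet_AMRequestShort1(peer, HIDX_PING, 42);    /* data (42) + action (ping_handler) */

        GASNET_BLOCKUNTIL(done);   /* poll the network until our reply handler has run */
        gasnet_exit(0);
        return 0;
    }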

  12. GASNet Bandwidth on BlueGene/P
  [Chart: flood bandwidth of consecutive non-blocking puts issued round-robin over 1 to 6 torus links, GASNet vs. MPI]
  • Torus network
    – Each node has six 850 MB/s* bidirectional links
    – The number of links used is varied from 1 to 6
  • Consecutive non-blocking puts are issued on the links (round-robin)
  • Similar bandwidth for large messages
  • GASNet outperforms MPI for mid-size messages
    – Lower software overhead
    – More overlap
  * Kumar et al. showed that the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak bandwidth. See "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer", Kumar et al., ICS 2008.
  See "Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap", Rajesh Nishtala, Paul Hargrove, Dan Bonachea, and Katherine Yelick, IPDPS 2009.

  13. GASNet Bandwidth on Cray XT4
  [Chart: bandwidth of non-blocking put (MB/s, up is good) vs. payload size from 2 KB to 2 MB, comparing the portals-conduit Put, the OSU MPI bandwidth test, and the mpi-conduit Put]
  Slide source: "Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT", Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009.

  14. GASNet Latency on Cray XT4
  [Chart: latency of blocking put (µs, down is good) vs. payload size from 1 byte to 1 KB, comparing the mpi-conduit Put, MPI ping-ack, and the portals-conduit Put]
  Slide source: "Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT", Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009.

  15. Execution Models on Multi-core: Process vs. Thread
  [Figure: UPC threads mapped to OS processes, each with its own virtual address space over the physical shared memory, vs. UPC threads mapped to Pthreads sharing a single virtual address space]

  16. Point-to-Point Performance: Process vs. Thread
  [Chart: InfiniBand bandwidth (MB/s) vs. message size from 8 bytes to 128 KB for the process/thread mixes 1T-16P, 2T-8P, 4T-4P, 8T-2P, 16T-1P, and MPI]

  17. Application Performance: Process vs. Thread
  [Chart: normalized performance of the fine-grained communication benchmarks GUPS, MCOP, and SOBEL for the process/thread mixes 1T-16P, 2T-8P, 4T-4P, 8T-2P, and 16T-1P]

  18. NAS Parallel Benchmarks: Process vs. Thread
  [Chart: normalized execution time of the NPB Class C benchmarks EP, CG, IS, MG, FT, LU, BT-256, and SP-256 on 1 to 16 cores, broken down into communication, fence, critical-section, and computation time]

  19. Collective Communication for PGAS
  • Communication patterns similar to MPI: broadcast, reduce, gather, scatter, and all-to-all
  • The global address space enables one-sided collectives
  • Flexible synchronization modes provide more opportunities to overlap communication and computation (see the sketch below)
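  As an illustration of the synchronization modes, a minimal sketch using the standard UPC collectives interface (upc_collective.h); array names and sizes are illustrative, not code from the talk:

    #include <upc.h>
    #include <upc_collective.h>

    #define NELEMS 16

    shared [NELEMS] int dst[NELEMS * THREADS];  /* NELEMS ints land on every thread */
    shared [NELEMS] int src[NELEMS];            /* single block, affinity to thread 0 */

    int main(void) {
        int i;
        if (MYTHREAD == 0)
            for (i = 0; i < NELEMS; i++) src[i] = i;
        upc_barrier;   /* the source on thread 0 is ready */

        /* Relaxed entry synchronization (UPC_IN_NOSYNC) lets the library start early;
           strict exit synchronization (UPC_OUT_ALLSYNC) guarantees completion on return */
        upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                          UPC_IN_NOSYNC | UPC_OUT_ALLSYNC);

        /* Every thread now holds a copy in its own block: dst[MYTHREAD*NELEMS ..] */
        return 0;
    }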

  20. Collective Communication Topologies
  [Figure: four 16-node communication topologies: a binary tree, a binomial tree, a fork tree, and a radix-2 dissemination pattern]
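  For reference, the binomial-tree structure in the figure can be generated with a few lines of C. This is a sketch of the standard construction, not code from the talk:

    #include <stdio.h>

    /* Binomial tree rooted at rank 0 over n ranks:
       the parent of r clears r's lowest set bit; the children of r are r + 2^k
       for every power of two below r's lowest set bit (all powers, for the root). */
    static int parent(int r) { return r & (r - 1); }

    static void print_children(int r, int n) {
        int low = (r == 0) ? n : (r & -r);   /* lowest set bit of r, or "all" for the root */
        int k;
        for (k = 1; k < low && r + k < n; k <<= 1)
            printf("rank %d -> child %d\n", r, r + k);
    }

    int main(void) {
        int n = 16, r;
        for (r = 0; r < n; r++) {
            if (r > 0) printf("rank %d has parent %d\n", r, parent(r));
            print_children(r, n);
        }
        return 0;
    }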

  21. GASNet Module Organization
  [Diagram, top to bottom:]
  • UPC Collectives and other PGAS collectives
  • GASNet Collectives API
  • Auto-tuner of algorithms and parameters
  • Portable Collectives, Native Collectives, Shared-Memory Collectives
  • Point-to-point and collective communication drivers
  • Interconnect / Memory

  22. Auto-tuning Collective Communication
  Performance-influencing factors:
  • Hardware: CPU, memory system, interconnect
  • Software: application, system software
  • Execution: process/thread layout, input data set, system workload
  Tuning space:
  • Algorithm selection: eager vs. rendezvous, put vs. get, a collection of well-known algorithms
  • Communication topology: tree type, tree fan-out
  • Implementation-specific parameters: pipelining depth, dissemination radix
  Offline tuning:
  • Optimize for common characteristics of the platform
  • Minimize runtime tuning overhead
  Online tuning:
  • Optimize for the application's runtime characteristics
  • Refine the offline tuning results

  23. Broadcast Performance
  [Chart: non-blocking broadcast performance on a Cray XT4 with 1024 cores]
