Yili Zheng, Lawrence Berkeley National Laboratory
Workshop on Programming Environments for Emerging Parallel Systems, June 22, 2010
Berkeley UPC Team
• Project Lead: Katherine Yelick
• Team members: Filip Blagojevic, Dan Bonachea, Paul Hargrove, Costin Iancu, Seung-Jai Min, Yili Zheng
• Former members: Christian Bell, Wei Chen, Jason Duell, Parry Husbands, Rajesh Nishtala, Mike Welcome
• A joint project of LBNL and UC Berkeley
Motivation
• Scalable systems have either distributed memory or shared memory without cache coherence
  – Clusters: Ethernet, InfiniBand, Cray XT, IBM BlueGene
  – Hybrid nodes: CPU + GPU or other kinds of accelerators
  – SoC: IBM Cell, Intel Single-chip Cloud Computer (SCC)
• Challenges of message-passing programming models
  – Data partitioning is difficult for irregular applications
  – Memory starvation due to data replication
  – Performance overheads from two-sided communication semantics
Partitioned Global Address Space
[Figure: Threads 1–4 each own a shared segment and a private segment; the shared segments together form the global address space]
• Global data view abstraction for productivity
• Vertical partitioning among threads for locality control
• Horizontal partitioning between shared and private segments for data placement optimization
• Friendly to non-cache-coherent architectures
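To make the shared/private split concrete, here is a minimal UPC sketch (not from the talk) that places one element per thread in the shared segments and an accumulator in each thread's private segment; the array name and sizes are illustrative only.

```c
#include <upc_relaxed.h>

/* One element of `counters` resides in each thread's shared segment
   (cyclic distribution by default), so any thread can read or write
   any element through the global view. */
shared int counters[THREADS];

int main(void) {
    int local_sum = 0;               /* private: one independent copy per thread */

    counters[MYTHREAD] = MYTHREAD;   /* write the element with affinity to me */
    upc_barrier;

    /* Any thread may read remote elements directly. */
    for (int t = 0; t < THREADS; t++)
        local_sum += counters[t];

    upc_barrier;
    return 0;
}
```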
PGAS Example: Global Matrix Distribution
[Figure: a 4x4 matrix in the global view on the left and its distributed storage on the right; each thread stores one 2x2 block of the logical matrix in its shared segment]
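A hedged sketch of how such a distribution is declared in UPC (not the deck's code): the layout qualifier below gives each thread one contiguous row, which is the 1-D blocked form UPC supports directly; the 2x2 tiling drawn on the slide would be expressed with explicit index arithmetic on top of this. Array name and sizes are illustrative.

```c
#include <upc_relaxed.h>

#define N 16

/* Block-distributed shared matrix: the layout qualifier [N] places each
   block of N consecutive doubles in one thread's shared segment, so
   row i has affinity to thread i. */
shared [N] double A[THREADS][N];

int main(void) {
    int i, j;

    /* upc_forall runs each iteration on the thread that owns row i,
       so every thread initializes exactly its own row. */
    upc_forall (i = 0; i < THREADS; i++; &A[i][0])
        for (j = 0; j < N; j++)
            A[i][j] = (double)(i * N + j);
    upc_barrier;

    /* Thread 0 can read any element through the global view;
       the remote access needs no explicit message. */
    if (MYTHREAD == 0) {
        double corner = A[THREADS - 1][N - 1];
        (void)corner;
    }
    return 0;
}
```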
UPC Overview
• PGAS dialect of ISO C99
• Distributed shared arrays
• Dynamic shared-memory allocation
• One-sided shared-memory communication
• Synchronization: barriers, locks, memory fences
• Collective communication library
• Parallel I/O library
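The following small UPC program (illustrative, not from the deck) touches several of these features in one place: collective dynamic allocation, a one-sided put, and barrier synchronization. The constant NPER and the neighbor pattern are assumptions made for the example.

```c
#include <upc_relaxed.h>

#define NPER 4   /* ints per thread in the shared buffer (illustrative) */

int main(void) {
    /* Dynamic shared-memory allocation: a collective call that returns
       THREADS blocks of NPER ints, one block per thread. */
    shared [NPER] int *buf =
        (shared [NPER] int *) upc_all_alloc(THREADS, NPER * sizeof(int));
    if (buf == NULL)
        upc_global_exit(1);

    int local[NPER] = { MYTHREAD, MYTHREAD, MYTHREAD, MYTHREAD };
    int peer = (MYTHREAD + 1) % THREADS;

    /* One-sided communication: deposit my data into the block that has
       affinity to the next thread; that thread takes no action. */
    upc_memput(&buf[peer * NPER], local, NPER * sizeof(int));

    /* Synchronization: barrier before anyone reads the deposited data. */
    upc_barrier;

    if (MYTHREAD == 0)
        upc_free(buf);   /* a single thread frees the collective allocation */
    return 0;
}
```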
Key Components for Scalability
• One-sided communication and active messages
• Efficient resource sharing for multi-core systems
• Non-blocking collective communication
Berkeley UPC Software Stack
[Figure: layered stack, from language-dependent at the top to hardware-dependent at the bottom]
• UPC applications
• UPC-to-C translator (language-dependent)
• Translated C code with runtime calls
• UPC runtime
• GASNet communication library
• Network driver and OS libraries (hardware-dependent)
Berkeley UPC Features
• Data transfers for complex data types (vector, indexed, strided)
• Non-blocking memory copy
• Point-to-point synchronization
• Remote atomic operations
• Active messages
• Extensions to the UPC collectives
• Portable timers
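A sketch of the non-blocking memory copy extension, assuming the Berkeley-specific bupc_memput_async / bupc_waitsync interface and a bupc_extensions.h header; the exact names and header may differ between releases, so treat this as illustrative rather than the definitive API.

```c
#include <upc_relaxed.h>
#include <bupc_extensions.h>   /* Berkeley UPC extensions (assumed header name) */

#define N 1024

shared [N] double dst[THREADS][N];

int main(void) {
    double src[N], work[N];
    for (int i = 0; i < N; i++) { src[i] = i; work[i] = 0.0; }

    int peer = (MYTHREAD + 1) % THREADS;

    /* Start a non-blocking put to the neighbor's block ... */
    bupc_handle_t h = bupc_memput_async(&dst[peer][0], src, sizeof(src));

    /* ... overlap independent computation with the transfer ... */
    for (int i = 0; i < N; i++) work[i] += 1.0;

    /* ... then wait for completion before reusing the source buffer. */
    bupc_waitsync(h);
    upc_barrier;
    return 0;
}
```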
One-Sided vs. Two-Sided Messaging
[Figure: a two-sided message (e.g., MPI) carries a message id that the host CPU must match against a posted receive, while a one-sided put (e.g., UPC) carries the destination address so the network interface can deposit the payload directly into memory]
• Two-sided messaging
  – The message does not identify its final destination in memory; the target node must look it up (receive/tag matching)
  – Point-to-point synchronization is implied with every transfer
• One-sided messaging
  – The message carries its final destination address
  – Synchronization is decoupled from data movement
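The decoupling is easiest to see in code. The UPC fragment below (illustrative; the inbox layout is an assumption) performs a one-sided put into a neighbor's block and synchronizes separately with a single barrier, whereas a two-sided version would need a matching receive and per-message synchronization on the target.

```c
#include <upc_relaxed.h>

#define N 512

shared [N] double inbox[THREADS][N];   /* one N-element inbox per thread */

int main(void) {
    double msg[N];
    for (int i = 0; i < N; i++) msg[i] = MYTHREAD + i;

    int peer = (MYTHREAD + 1) % THREADS;

    /* One-sided put: the initiator already knows the destination address,
       so no matching receive or tag lookup is needed on the target; the
       network interface can deposit the payload directly. */
    upc_memput(&inbox[peer][0], msg, sizeof(msg));

    /* Synchronization is decoupled from data movement: one barrier (or a
       point-to-point flag) signals completion when the application needs
       it, rather than once per message as with a send/recv pair. */
    upc_barrier;
    return 0;
}
```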
Active Messages
[Figure: node A sends a request to node B; B's request handler runs and issues a reply; A's reply handler completes the exchange]
• Active message = data + action
• Key enabling technology for both one-sided and two-sided communication
  – Software implementation of put/get
  – Eager and rendezvous protocols
• Remote procedure calls
  – Facilitate "owner computes"
  – Spawn asynchronous tasks
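A rough sketch of an active-message round trip, assuming the classic GASNet-1 gasnet.h interface (gasnet_AMRequestShort1, gasnet_AMReplyShort1, GASNET_BLOCKUNTIL); the handler indices and the ping/pong semantics are made up for illustration.

```c
#include <gasnet.h>
#include <stdio.h>

#define PING_HANDLER 200   /* client handler indices live in 128..255 */
#define PONG_HANDLER 201

static volatile int got_pong = 0;

/* Request handler: runs on the target node when the request arrives.
   The "action" here is simply to send a reply carrying arg+1. */
static void ping_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    gasnet_AMReplyShort1(token, PONG_HANDLER, arg + 1);
}

/* Reply handler: runs back on the initiating node. */
static void pong_handler(gasnet_token_t token, gasnet_handlerarg_t arg) {
    (void)token; (void)arg;
    got_pong = 1;
}

int main(int argc, char **argv) {
    gasnet_handlerentry_t htable[] = {
        { PING_HANDLER, (void (*)())ping_handler },
        { PONG_HANDLER, (void (*)())pong_handler },
    };

    gasnet_init(&argc, &argv);
    gasnet_attach(htable, sizeof(htable) / sizeof(htable[0]),
                  gasnet_getMaxLocalSegmentSize(), GASNET_PAGESIZE);

    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        gasnet_AMRequestShort1(1, PING_HANDLER, 42);
        GASNET_BLOCKUNTIL(got_pong);   /* poll until the reply handler ran */
        printf("node 0 received reply\n");
    }

    gasnet_exit(0);
    return 0;
}
```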
GASNet Bandwidth on BlueGene/P
[Figure: GASNet and MPI bandwidth vs. message size, using 1 to 6 torus links; up is good]
• Torus network
  – Each node has six 850 MB/s* bidirectional links
  – Vary the number of links from 1 to 6
• Consecutive non-blocking puts issued round-robin over the links
• Similar bandwidth for large messages
• GASNet outperforms MPI for mid-size messages
  – Lower software overhead
  – More overlap
* Kumar et al. showed that the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak bandwidth. See "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer", Kumar et al., ICS 2008, and "Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap", Rajesh Nishtala, Paul Hargrove, Dan Bonachea, and Katherine Yelick, IPDPS 2009.
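The measurement pattern above (many non-blocking puts issued round-robin) can be sketched with GASNet's explicit-handle non-blocking puts (gasnet_put_nb, gasnet_wait_syncnb_all). The function below is a simplified illustration, not the benchmark code; it assumes initialization and segment setup have already happened, and the buffer layout and window size are invented.

```c
#include <gasnet.h>

/* Keep a window of non-blocking puts in flight, issuing them round-robin
   over the destination nodes, then sync the remainder.  Assumes
   gasnet_init/gasnet_attach have run and that dest[i] points into node
   i's attached segment (e.g., obtained via gasnet_getSegmentInfo). */
void scatter_chunks(void *dest[], gasnet_node_t ndests,
                    char *src, size_t chunk_bytes, size_t nchunks) {
    enum { WINDOW = 64 };              /* illustrative in-flight cap */
    gasnet_handle_t handles[WINDOW];
    size_t inflight = 0;

    for (size_t c = 0; c < nchunks; c++) {
        gasnet_node_t peer = (gasnet_node_t)(c % ndests);   /* round-robin */
        void *rdst = (char *)dest[peer] + (c / ndests) * chunk_bytes;

        handles[inflight++] = gasnet_put_nb(peer, rdst,
                                            src + c * chunk_bytes, chunk_bytes);
        if (inflight == WINDOW) {        /* cap the number of puts in flight */
            gasnet_wait_syncnb_all(handles, inflight);
            inflight = 0;
        }
    }
    if (inflight > 0)
        gasnet_wait_syncnb_all(handles, inflight);   /* drain the rest */
}
```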
GASNet Bandwidth on Cray XT4
[Figure: bandwidth of non-blocking put (MB/s, up is good) vs. payload size from 2 KB to 2 MB, comparing portals-conduit Put, the OSU MPI bandwidth test, and mpi-conduit Put]
Slide source: "Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT", Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009.
GASNet Latency on Cray XT4
[Figure: latency of blocking put (µs, down is good) vs. payload size from 1 byte to 1 KB, comparing mpi-conduit Put, MPI ping-ack, and portals-conduit Put]
Slide source: "Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT", Dan Bonachea, Paul Hargrove, Michael Welcome, Katherine Yelick, CUG 2009.
Execution Models on Multi-core: Process vs. Thread
[Figure: UPC threads mapped to OS processes communicating through physical shared memory vs. UPC threads mapped to pthreads sharing a single virtual address space]
Point-to-Point Performance: Process vs. Thread
[Figure: InfiniBand bandwidth (MB/s) vs. message size from 8 bytes to 128 KB for the thread/process mixes 1T-16P, 2T-8P, 4T-4P, 8T-2P, and 16T-1P, and for MPI]
Application Performance: Process vs. Thread
[Figure: normalized performance of the fine-grained communication benchmarks GUPS, MCOP, and SOBEL for the thread/process mixes 16T-1P, 8T-2P, 4T-4P, 2T-8P, and 1T-16P]
NAS Parallel Benchmarks: Process vs. Thread
[Figure: NPB Class C time breakdown into Comm, Fence, Critical Section, and Comp for EP, CG, IS, MG, FT, LU, BT-256, and SP-256 across the configurations 1, 2, 4, 8, and 16]
Collective Communication for PGAS
• Communication patterns similar to MPI: broadcast, reduce, gather, scatter, and all-to-all
• The global address space enables one-sided collectives
• Flexible synchronization modes provide more opportunities to overlap communication and computation
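For reference, a minimal use of the standard UPC collective interface; the synchronization-mode flags in the last argument are where the flexible synchronization modes mentioned above come in. The sizes and array names are illustrative, not from the talk.

```c
#include <upc_relaxed.h>
#include <upc_collective.h>

#define NELEMS 64

shared [NELEMS] long dst[NELEMS * THREADS];  /* NELEMS-element block per thread */
shared []       long src[NELEMS];            /* whole source lives on thread 0 */

int main(void) {
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i;
    upc_barrier;

    /* UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC is the simplest, fully synchronized
       mode; looser modes such as UPC_IN_MYSYNC / UPC_OUT_MYSYNC let the
       library overlap the broadcast with surrounding computation. */
    upc_all_broadcast(dst, src, NELEMS * sizeof(long),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    /* Every thread now holds the NELEMS values in its own block of dst. */
    return 0;
}
```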
Collective Communication Topologies
[Figure: example topologies over 16 nodes rooted at 0: a binary tree, a binomial tree, a fork tree, and a radix-2 dissemination pattern]
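A small, self-contained C sketch (not the GASNet implementation) of how two of these topologies are enumerated: the binomial-tree children of a rank for a broadcast rooted at 0, and the per-round peers of a radix-2 dissemination exchange.

```c
#include <stdio.h>

/* Children of 'rank' in a binomial tree over nprocs nodes, rooted at 0:
   in round i every rank below 2^i forwards to rank + 2^i, so the
   children of 'rank' are rank + 2^i for all 2^i > rank. */
static void binomial_children(int rank, int nprocs) {
    for (int mask = 1; mask < nprocs; mask <<= 1)
        if (mask > rank && rank + mask < nprocs)
            printf("rank %d -> child %d\n", rank, rank + mask);
}

/* Peers contacted by 'rank' in a radix-2 dissemination exchange:
   round i pairs rank with (rank + 2^i) mod nprocs. */
static void dissemination_peers(int rank, int nprocs) {
    int round = 0;
    for (int dist = 1; dist < nprocs; dist <<= 1, round++)
        printf("round %d: rank %d sends to rank %d\n",
               round, rank, (rank + dist) % nprocs);
}

int main(void) {
    int nprocs = 16;
    binomial_children(0, nprocs);     /* root's children: 1, 2, 4, 8 */
    dissemination_peers(3, nprocs);   /* rank 3's peers: 4, 5, 7, 11 */
    return 0;
}
```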
GASNet Module Organization
[Figure: layered module organization]
• UPC collectives and other PGAS collectives
• GASNet collectives API
• Auto-tuner of algorithms and parameters
• Portable collectives, native collectives, and shared-memory collectives
• Point-to-point and collective communication drivers
• Interconnect / memory
Auto-tuning Collective Communication Performance
• Offline tuning
  – Optimize for common platform characteristics
  – Minimize runtime tuning overhead
• Online tuning
  – Optimize for application runtime characteristics
  – Refine offline tuning results
• Performance-influencing factors
  – Hardware: CPU, memory system, interconnect
  – Software: application, system software
  – Execution: process/thread layout, input data set, system workload
• Tuning space
  – Algorithm selection: eager vs. rendezvous, put vs. get, collection of well-known algorithms
  – Communication topology: tree type, tree fan-out, dissemination radix
  – Implementation-specific parameters: pipelining depth
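A hedged sketch of the kind of selection logic such a tuner can drive. All type and function names below (tuning_entry_t, pick_broadcast_algorithm, the algorithm enum) are hypothetical placeholders, not the GASNet auto-tuner's API; the table stands in for offline-tuned defaults that an online tuner would refine by timing a few candidates for the actual message size, thread layout, and system workload.

```c
#include <stddef.h>

/* Hypothetical algorithm choices for a broadcast. */
typedef enum { ALG_BINOMIAL_EAGER, ALG_BINARY_RENDEZVOUS, ALG_FLAT_PUT } coll_algorithm_t;

typedef struct {
    size_t max_msg_bytes;    /* entry applies to messages up to this size    */
    int    max_threads;      /* ... and to thread counts up to this size     */
    coll_algorithm_t alg;    /* best algorithm found offline for this range  */
    int    tree_fanout;      /* implementation parameter chosen by the tuner */
} tuning_entry_t;

/* Offline-tuned defaults (invented numbers); an online tuner would
   refine these entries at runtime. */
static const tuning_entry_t table[] = {
    {        512, 1 << 20, ALG_BINOMIAL_EAGER,    2 },
    {      65536, 1 << 20, ALG_BINARY_RENDEZVOUS, 2 },
    { (size_t)-1, 1 << 20, ALG_FLAT_PUT,          8 },
};

/* Pick the first entry whose message-size and thread-count bounds cover
   the requested collective. */
static const tuning_entry_t *pick_broadcast_algorithm(size_t nbytes, int nthreads) {
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (nbytes <= table[i].max_msg_bytes && nthreads <= table[i].max_threads)
            return &table[i];
    return &table[sizeof(table) / sizeof(table[0]) - 1];
}
```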
Broadcast Performance Cray XT4 Nonblocking Broadcast (1024 Cores) 6/22/2010 Workshop on Programming Environments for Emerging Parallel Systems 23