

  1. Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems
  Akio SHIMADA, Research and Development Group, Hitachi, Ltd.
  LENS INTERNATIONAL WORKSHOP 2015

  2. Background
  • A large number of parallel processes can be invoked within a node on a many-core system since the appearance of multi-core processors
  • MPI and some PGAS language runtimes invoke multiple processes within a node and try to accelerate intra-node communication on many-core systems (e.g. hybrid MPI)
  • Fast intra-node communication is required
  • Many studies have proposed a variety of intra-node communication schemes (e.g. KNEM, LiMIC)
  [Figure: communication among a handful of processes on a multi-core node vs. communication among many processes on a many-core node]

  3. Conventional Intra-node Communication Schemes
  • There are address space boundaries among processes, so overheads for crossing them are produced
  • Shared memory: a double copy via an intermediate shared-memory buffer is required for every communication (sketched below)
  • OS kernel assistance (KNEM, LiMIC, etc.): a system call overhead is produced for every memory copy
  [Figure: shared-memory path, where the sender copies into an intermediate buffer and the receiver copies out, vs. kernel-assisted path, where the OS kernel copies directly from the send buffer to the receive buffer]
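For reference, the double-copy path can be sketched with plain POSIX shared memory: the sender copies its send buffer into an intermediate shared segment, and the receiver copies it out again. This is only an illustrative sketch; the segment name, fixed message size, and flag-based synchronization are assumptions for this example, not the actual Open MPI SM BTL code.

```c
/* Minimal sketch of double-copy intra-node transfer via POSIX shared memory.
 * The segment name, fixed size, and flag-based synchronization are
 * illustrative; a real BTL uses its own FIFOs, fragments, and barriers. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MSG_SIZE 4096

struct shm_slot {
    volatile int ready;        /* set by the sender when the data is valid */
    char data[MSG_SIZE];       /* intermediate buffer in shared memory */
};

static struct shm_slot *map_slot(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    (void)ftruncate(fd, sizeof(struct shm_slot));
    return mmap(NULL, sizeof(struct shm_slot),
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

void send_msg(const char *send_buf)          /* copy #1: by the sender */
{
    struct shm_slot *slot = map_slot();
    memcpy(slot->data, send_buf, MSG_SIZE);
    slot->ready = 1;                         /* barriers omitted in this sketch */
}

void recv_msg(char *recv_buf)                /* copy #2: by the receiver */
{
    struct shm_slot *slot = map_slot();
    while (!slot->ready)                     /* spin until the data arrives */
        ;
    memcpy(recv_buf, slot->data, MSG_SIZE);
}
```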

  4. Proposal
  • Partitioned Virtual Address Space (PVAS): a new task model for efficient parallel processing
  • PVAS makes it possible for parallel processes within the same node to run in the same address space
  • PVAS can remove the overheads for crossing address space boundaries from intra-node communication

  5. Address Space Layout
  • PVAS partitions a single address space into multiple segments (PVAS partitions) and assigns them to parallel processes (PVAS tasks)
  • Parallel processes use the same page table for managing memory mapping information
  • A PVAS task can use only its own PVAS partition as its local memory (it cannot allocate memory within a PVAS partition assigned to another PVAS task)
  • A PVAS task is almost the same as a normal process, except that it shares the same address space with other processes
  [Figure: normal task model, where Process 0 and Process 1 each have their own address space (TEXT, DATA&BSS, HEAP, STACK, KERNEL), vs. PVAS task model, where PVAS Task 0 and PVAS Task 1 occupy Partition 0 and Partition 1 of a single shared address space]

  6. PVAS Features
  • All memory of a PVAS task is exposed to the other PVAS tasks within the same node
  • A PVAS task can access the memory of the other PVAS tasks with load/store instructions (there are no address space boundaries among them)
  • A pair of PVAS tasks can exchange data without the overheads for crossing an address space boundary

  7. Optimizing Open MPI with PVAS
  • The PVAS BTL component is implemented in the Byte Transfer Layer (BTL) of Open MPI
  • SM BTL: supports double-copy communication via shared memory, and single-copy communication with OS kernel assistance (using KNEM)
  • PVAS BTL (developed on the basis of the SM BTL): copies the data from the send buffer to the receive buffer without OS kernel assistance by using the PVAS facility

  8. PVAS BTL
  • MPI processes are invoked as PVAS tasks
  • The data is copied directly from the send buffer to the receive buffer:
    ① the sender posts the pointer to the send buffer
    ② the receiver copies the data from the send buffer (sketched below)
  • No overhead for crossing an address space boundary is produced
  • Single-copy communication (avoiding the extra memory copy)
  • OS kernel assistance is not necessary (avoiding the system call overhead)
  [Figure: MPI Process 0 (PVAS Task 0) posts a pointer to its send buffer; MPI Process 1 (PVAS Task 1) copies the data into its receive buffer]
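A minimal sketch of this single-copy protocol, assuming the shared address space that PVAS provides: the sender publishes the address of its send buffer in a mailbox, and the receiver performs the one memcpy itself. The mailbox layout and flag-based synchronization below are illustrative assumptions, not the actual PVAS BTL implementation.

```c
/* Sketch of single-copy transfer under a shared address space: both tasks
 * can dereference each other's pointers directly, so no intermediate buffer
 * or system call is needed. Mailbox layout and synchronization are assumed. */
#include <stddef.h>
#include <string.h>

struct mailbox {
    const void  *src;           /* address of the sender's send buffer */
    size_t       len;           /* message length in bytes */
    volatile int posted;        /* 1 while a message is posted */
};

/* (1) The sender posts a pointer to its send buffer. */
void pvas_style_send(struct mailbox *mb, const void *send_buf, size_t len)
{
    mb->src = send_buf;
    mb->len = len;
    mb->posted = 1;             /* a real implementation needs a memory barrier here */
}

/* (2) The receiver copies the data straight from the send buffer. */
void pvas_style_recv(struct mailbox *mb, void *recv_buf)
{
    while (!mb->posted)         /* wait for the sender's post */
        ;
    memcpy(recv_buf, mb->src, mb->len);
    mb->posted = 0;             /* hand the mailbox back to the sender */
}
```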


  9. Evaluation Environment
  • Intel Xeon Phi 5110P
    • 1.083 GHz, 60 cores (4 HT)
    • 32 KB L1 cache, 512 KB L2 cache
    • 8 GB of main memory
  • OS: Intel MPSS Linux 2.6.38.8 with the PVAS facility
  • MPI: Open MPI 1.8 with the PVAS BTL

  10. Latency Evaluation
  • Ping-pong communication latency was measured by running the Intel MPI Benchmarks (a stripped-down equivalent appears below)
  • The PVAS BTL outperforms the others regardless of the message size
  • The latency of the SM BTL (KNEM) is higher than that of the SM BTL when the message size is small because of the system call overhead
  [Figure: latency (usec, log scale) vs. message size from 64 B to 32 MB for the SM, SM (KNEM), and PVAS BTLs]
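The measurement follows the standard IMB PingPong pattern; a stripped-down equivalent with two ranks on the same node might look like the following (the message size and iteration count are arbitrary choices, and the BTL selection shown in the comment is one possible Open MPI invocation):

```c
/* Minimal ping-pong latency microbenchmark (same pattern as IMB PingPong).
 * Run with two ranks on the same node, e.g.:
 *   mpirun -np 2 ./pingpong
 * With Open MPI, a specific BTL can be selected via --mca btl (e.g. self,sm). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int size  = 4096;                  /* message size in bytes */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                           /* one-way latency = round trip / 2 */
        printf("%d bytes: %.2f usec\n", size, (t1 - t0) * 1e6 / (2 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```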

  11. NAS Parallel Benchmarks (NPB)
  • NPB is run on a single node
  • Number of processes: 225 (SP, BT), 128 (MG, CG, FT, IS, LU)
  • Problem sizes: CLASS A, B, C (A < B < C); SP with CLASS C is N/A
  • The PVAS BTL improves benchmark performance by up to 28%
  [Figure: performance improvement (%) with the SM (KNEM) and PVAS BTLs for MG, CG, FT, IS, LU, SP, and BT, one chart each for CLASS A, B, and C]

  12. Optimizing Non-contiguous Data Transfer Using Derived Data Types
  • The sender and receiver exchange the pointers to their data type information
  • When using the PVAS facility, an MPI process can access the MPI internal objects of the other MPI process
  • The sender and receiver copy the data from the send buffer to the receive buffer, consulting the data type information
  • The sender and receiver copy the data in parallel (a minimal derived-datatype example appears below)
  [Figure: the SM BTL copies through an intermediate buffer in shared memory, one copy by the sender and one by the receiver; the PVAS BTL exchanges pointers to the data type information and lets the sender and receiver copy parts of the data in parallel, directly between the send and receive buffers]
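The non-contiguous data involved is described with MPI derived datatypes; for instance, a strided column of a matrix can be sent with a single MPI_Type_vector, as in the minimal sketch below (the array dimensions and the two-rank setup are arbitrary choices for illustration).

```c
/* Minimal example of sending non-contiguous data with an MPI derived
 * datatype (MPI_Type_vector). Array dimensions are arbitrary. */
#include <mpi.h>

#define ROWS 100
#define COLS 100

int main(int argc, char **argv)
{
    double a[ROWS][COLS];
    MPI_Datatype column;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One element per row with a stride of COLS doubles: describes column 0. */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&a[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```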

  13. Latency Evaluation Using DDTBench (1/2)
  • DDTBench [Timo Schneider et al., EuroMPI'12] mimics the communication patterns of MPI applications by using derived data types
  • MPI processes send and receive the non-contiguous data that appears in WRF, MILC, NPB, LAMMPS, and SPECFEM3D
  [Figure: latency (usec) vs. data size for the SM and PVAS BTLs in the WRF_x_vec, WRF_x_sa, WRF_y_sa, WRF_y_vec, NAS_MG_x, NAS_MG_y, NAS_MG_z, and MILC_su3_zd patterns]
