Efficient Intra-node Communication on Intel MIC Clusters

Sreeram Potluri, Akshay Venkatesh, Devendar Bureddy, Krishna Kandalla, Dhabaleswar K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
Outline

• Introduction
• Problem Statement
• Hybrid MPI Communication Runtime
• Performance Evaluation
• Conclusion and Future Work
Many Integrated Core (MIC) Architecture

[Figure: evolution of hybrid architectures from single-core to dual-, quad-, oct- and twelve-core processors]

• Hybrid system architectures with graphics processors have become common - high compute density and high performance per watt
• Intel introduced the Many Integrated Core (MIC) architecture geared for HPC
• x86 compatibility - applications and libraries can run out-of-the-box or with minor modifications
• Many low-power processor cores, hardware threads and wide vector units
• MPI continues to be a predominant programming model in HPC
Programming Models on Clusters with MIC

[Figure: spectrum of models from multi-core centric to many-core centric - Host-only (MPI on Xeon), Offload (MPI on Xeon, computation offloaded to Xeon Phi), Symmetric (MPI on both Xeon and Xeon Phi) and MIC-only (MPI on Xeon Phi)]

• Xeon Phi, the first commercial product based on the MIC architecture
• Flexibility in launching MPI jobs on clusters with Xeon Phi
MPI Communication on a Node with a Xeon Phi

[Figure: node with Intel Xeon, Intel Xeon Phi, PCIe and IB HCA, showing four communication paths: Intra-Host, Intra-MIC, Host-to-MIC and MIC-to-Host]

• Various paths for MPI communication on a node with Xeon Phi
Symmetric Communication Stack with MPSS

[Figure: software stack on Xeon Phi and Host - POSIX calls over SHM, the SCIF API over SCIF, and the IB Verbs API over IB and IB-SCIF, layered on PCIe and the IB HCA]

• MPSS – Intel Manycore Platform Software Stack
  – Shared Memory
  – Symmetric Communication InterFace (SCIF) – over PCIe
  – IB Verbs – through the IB adapter
  – IB-SCIF – IB Verbs over SCIF
Problem Statement

What are the performance characteristics of the different communication channels available on a node with Xeon Phi?
How can an MPI communication runtime take advantage of the different channels?
Can a low-latency and high-bandwidth hybrid communication channel be designed, leveraging all of the channels?
What is the impact of such a hybrid communication channel on the performance of benchmarks and applications?
Outline

• Introduction
• Problem Statement
• Hybrid MPI Communication Runtime
• Performance Evaluation
• Conclusion and Future Work
MVAPICH2/MVAPICH2-X Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 165,000 downloads directly from the OSU site
  – Empowering many TOP500 clusters
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede (9 PFlop) system
Intra-MIC Communication

[Figure: MVAPICH2 stack on the Xeon Phi - the CH3 layer over the SHM-CH3 and SCIF-CH3 channels, on top of SCIF]

• Shared Memory Interface (CH3-SHM)
  – POSIX Shared Memory API
  – Small messages: pair-wise memory regions between processes
  – Large messages: buffer pool per process; data is divided into chunks (8KB) to pipeline copy-in and copy-out (a sketch follows this list)
  – MPSS offers two implementations of memcpy
    • multi-threaded copy
    • DMA-assisted copy: offers low latency for large messages
  – We use 64KB chunks to trigger the use of DMA-assisted copies for large messages
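To make the copy-in/copy-out pipeline concrete, below is a minimal C sketch of a chunked transfer through a shared buffer pool. The slot count, the chunk-size macro and all function names are illustrative assumptions, not MVAPICH2 internals.

```c
/* A minimal sketch (assumed code, not the MVAPICH2 source) of a pipelined
 * chunked copy through a shared-memory buffer pool. Slot count, chunk size
 * and all names here are illustrative. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK_SZ (64 * 1024)  /* 64KB chunks trigger the DMA-assisted memcpy */
#define NSLOTS   4            /* small pool so copy-in and copy-out overlap  */

typedef struct {
    atomic_size_t filled[NSLOTS];         /* bytes valid in each slot */
    char          data[NSLOTS][CHUNK_SZ]; /* the shared buffer pool   */
} pool_t;

/* Sender: copy the message into pool slots, stalling while a slot is full. */
static void chunked_send(pool_t *p, const char *src, size_t len) {
    for (size_t off = 0, slot = 0; off < len; slot = (slot + 1) % NSLOTS) {
        while (atomic_load(&p->filled[slot]) != 0)
            ;                                     /* receiver still draining */
        size_t n = len - off < CHUNK_SZ ? len - off : CHUNK_SZ;
        memcpy(p->data[slot], src + off, n);      /* copy-in */
        atomic_store(&p->filled[slot], n);
        off += n;
    }
}

/* Receiver: drain slots in order; copy-out overlaps the sender's copy-in. */
static void chunked_recv(pool_t *p, char *dst, size_t len) {
    for (size_t off = 0, slot = 0; off < len; slot = (slot + 1) % NSLOTS) {
        size_t n;
        while ((n = atomic_load(&p->filled[slot])) == 0)
            ;                                     /* wait for the sender */
        memcpy(dst + off, p->data[slot], n);      /* copy-out */
        atomic_store(&p->filled[slot], 0);
        off += n;
    }
}

int main(void) {
    size_t len = 8 * 1024 * 1024;
    /* MAP_ANONYMOUS memory is zero-filled, so every slot starts out empty. */
    pool_t *p = mmap(NULL, sizeof(pool_t), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (fork() == 0) {                            /* child plays the receiver */
        char *dst = malloc(len);
        chunked_recv(p, dst, len);
        return 0;
    }
    char *src = malloc(len);
    memset(src, 7, len);
    chunked_send(p, src, len);
    wait(NULL);
    printf("transferred %zu bytes in %d-byte chunks\n", len, CHUNK_SZ);
    return 0;
}
```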
Intra-MIC Communication

[Figure: MVAPICH2 stack on the Xeon Phi - the CH3 layer over the SHM-CH3 and SCIF-CH3 channels, on top of SCIF]

• SCIF Channel (CH3-SCIF)
  – Control of the DMA engine is exposed to the user
  – API for remote memory access:
    • Registration: scif_register
    • Initiation: scif_writeto/scif_readfrom
    • Completion: scif_fence_signal
  – We use a write-based rendezvous protocol (a sender-side sketch follows this list)
    • Sender sends a Request-To-Send (RTS)
    • Receiver responds with a Ready-To-Receive (RTR) carrying the registered buffer offset and flag offset
    • Sender issues scif_writeto followed by scif_fence_signal
    • Both processes poll for the flag to be set
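A minimal C sketch of the sender side of this rendezvous, using the SCIF calls named on the slide. It assumes an already-connected endpoint, a page-aligned buffer, and an RTR wire format invented here for illustration; error handling is simplified.

```c
/* Sender side of the write-based rendezvous, sketched with the SCIF calls
 * from <scif.h> (MPSS). The endpoint is assumed already connected, buf and
 * len page-aligned (as scif_register requires), and rtr_msg is an invented
 * wire format. Link with -lscif. */
#include <scif.h>
#include <stdint.h>

struct rtr_msg {           /* assumed layout of the receiver's RTR reply */
    off_t remote_buf_off;  /* registered offset of the receive buffer    */
    off_t remote_flag_off; /* registered offset of the completion flag   */
};

int rndv_send(scif_epd_t epd, void *buf, size_t len) {
    /* Registration: expose the source buffer in the registered space. */
    off_t loff = scif_register(epd, buf, len, 0,
                               SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
    if (loff == SCIF_REGISTER_FAILED)
        return -1;

    /* RTS/RTR handshake over SCIF's two-sided messaging path. */
    char rts = 'R';
    struct rtr_msg rtr;
    scif_send(epd, &rts, sizeof rts, SCIF_SEND_BLOCK);
    scif_recv(epd, &rtr, sizeof rtr, SCIF_RECV_BLOCK);

    /* Initiation: DMA the payload into the receiver's registered buffer. */
    if (scif_writeto(epd, loff, len, rtr.remote_buf_off, 0) != 0)
        return -1;

    /* Completion: fence the write, then set the receiver's flag to 1 so
     * its poll on the flag observes that the data has fully landed. */
    if (scif_fence_signal(epd, 0, 0, rtr.remote_flag_off, 1,
                          SCIF_FENCE_INIT_SELF | SCIF_SIGNAL_REMOTE) != 0)
        return -1;

    scif_unregister(epd, loff, len);
    return 0;
}
```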
Host-MIC Communication

[Figure: MVAPICH2 on the Xeon Phi and the Host - OFA-IB-CH3 and SCIF-CH3 channels over IB Verbs (mlx4_0/scif0) and SCIF, across PCIe and the IB HCA]

• IB Channel (OFA-IB-CH3)
  – Uses IB verbs
  – Selection of the IB network interface switches between IB and IB-SCIF (a sketch follows this list)
• SCIF-CH3
  – Can be used for communication between Xeon Phi and Host
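A minimal sketch of that interface selection with standard libibverbs calls: opening "mlx4_0" reaches the physical HCA (IB), while "scif0" gives IB verbs over SCIF (IB-SCIF). The helper name is an assumption; the verbs calls are the standard API.

```c
/* Minimal sketch of switching channels by IB interface name. Everything
 * above the verbs layer stays the same; only the device name differs.
 * Standard libibverbs calls; link with -libverbs. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <string.h>

struct ibv_context *open_hca_by_name(const char *want) {
    int n;
    struct ibv_device **list = ibv_get_device_list(&n);
    struct ibv_context *ctx = NULL;
    if (!list)
        return NULL;
    for (int i = 0; i < n && !ctx; i++)
        if (strcmp(ibv_get_device_name(list[i]), want) == 0)
            ctx = ibv_open_device(list[i]);  /* same verbs API either way */
    ibv_free_device_list(list);
    return ctx;
}

int main(void) {
    /* The runtime only has to pick a name to pick a channel. */
    struct ibv_context *ib     = open_hca_by_name("mlx4_0"); /* IB      */
    struct ibv_context *ibscif = open_hca_by_name("scif0");  /* IB-SCIF */
    printf("mlx4_0: %s, scif0: %s\n",
           ib ? "opened" : "absent", ibscif ? "opened" : "absent");
    return 0;
}
```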
Host-MIC Communication: Host-Initiated SCIF

[Figure: symmetric mode (Host-to-MIC and MIC-to-Host DMA each initiated locally) vs. host-initiated mode (both directions driven by the host)]

• DMA can be initiated by the host or by the Xeon Phi
• But performance is not symmetric
  – Host-initiated DMA delivers better performance
• Host-initiated mode takes advantage of this (a sketch follows this list)
  – Write-based transfer from Host to Xeon Phi
  – Read-based transfer from Xeon Phi to Host
• Symmetric mode to maximize resource utilization on host and Xeon Phi
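A minimal C sketch of the direction choice on the host side: the host's DMA engine drives both sends and receives. The function names are assumptions, and the offsets are assumed to come from a rendezvous exchange like the one sketched earlier.

```c
/* Minimal sketch of host-initiated transfers: the host initiates the DMA
 * in both directions. epd and the registered offsets are assumed to come
 * from a prior rendezvous exchange. */
#include <scif.h>

/* Host-to-MIC: the host initiates a write into the MIC-side window. */
int host_send(scif_epd_t epd, off_t host_off, size_t len, off_t mic_off) {
    return scif_writeto(epd, host_off, len, mic_off, 0);
}

/* MIC-to-Host: the host initiates a read from the MIC-side window,
 * rather than letting the slower MIC-initiated DMA push the data. */
int host_recv(scif_epd_t epd, off_t host_off, size_t len, off_t mic_off) {
    return scif_readfrom(epd, host_off, len, mic_off, 0);
}
```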
Outline

• Introduction
• Problem Statement
• Hybrid MPI Communication Runtime
• Performance Evaluation
• Conclusion and Future Work
Experimental Setup

• TACC Stampede node
  – Host
    • Dual-socket oct-core Intel Sandy Bridge (E5-2680 @ 2.70GHz)
    • CentOS release 6.3 (Final)
  – MIC
    • SE10P (B0-KNC)
    • 61 cores @ 1085.854 MHz, 4 hardware threads/core
    • OS 2.6.32-279.el6.x86_64, MPSS 2.1.4346-16
  – Compiler: Intel Composer_xe_2013.2.146
  – Network adapter: IB FDR MT 4099 HCA
  – Enhanced MPI based on MVAPICH2 1.9
Intra-MIC Point-to-Point Communication

[Figure: osu_latency (usec), osu_bw and osu_bibw (MB/sec) for 4K-4M message sizes, comparing the shared-memory designs against SCIF]

• Default chunk size severely limits performance
• Tuned block size alleviates it, but shared-memory performance is still low
• Using SCIF works around these limitations – 75% improvement in latency, 4.0x improvement in bandwidth over SHM-TUNED
Host-MIC Point-to-Point Communication

[Figure: osu_latency (usec) for small (0-2K) and large (4K-4M) messages, comparing IB, IB-SCIF, SCIF and Host-Initiated SCIF]

• IB provides a low-latency path – 4.7usec for 4-byte messages
• IB-SCIF pays overheads due to SCIF and the additional software layer
• The SCIF designs are already hybrid: they use IB for small messages (a threshold-based selection is sketched below)
• SCIF outperforms IB for large messages – 72% improvement for 4MB messages
• Host-Initiated SCIF takes advantage of the faster host-side DMA – 33% improvement over SCIF for 64KB messages
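A minimal C sketch of the hybrid choice these results imply: IB for small messages (latency), host-initiated SCIF for large ones (bandwidth). The 16KB threshold and the names are assumptions, not MVAPICH2's tuned values.

```c
/* Minimal sketch of threshold-based channel selection. The crossover
 * point and all names are illustrative assumptions. */
#include <stddef.h>

#define HYBRID_THRESHOLD (16 * 1024)  /* assumed crossover point */

typedef enum { CH_IB, CH_SCIF_HOST_INITIATED } channel_t;

static channel_t pick_channel(size_t msg_len) {
    /* Small: IB's 4.7usec path wins. Large: SCIF's DMA bandwidth wins. */
    return msg_len < HYBRID_THRESHOLD ? CH_IB : CH_SCIF_HOST_INITIATED;
}
```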
Host-MIC Point-to-Point Communication

[Figure: osu_bw (MB/sec) mic-to-host and host-to-mic, and osu_bibw, for 4K-4M message sizes]

• IB bandwidth is limited mic-to-host due to a peer-to-peer limitation on Sandy Bridge
• SCIF works around this, and host-initiated DMA delivers better bandwidth still – 6.6x improvement over IB
• Host-initiated SCIF is worse than SCIF in bidirectional bandwidth due to wasted resources
Collective Communication

[Figure: osu_gather (root on host) and osu_alltoall latency (usec) for 4K-1M message sizes]

• 16 processes on host + 16 processes on MIC
• Host-initiated SCIF or symmetric SCIF is selected at the collective level, based on the communication pattern and message size
• Gather, a rooted collective, uses host-initiated SCIF – 75% improvement at 1MB
• All-to-all uses symmetric SCIF – 78% improvement at 1MB
Performance of 3D Stencil Communication Benchmark

[Figure: time per step (msec) for 4+4, 8+8 and 16+16 (Host + MIC) process counts, showing a 67% improvement]

• Near-neighbor communication – up to 6 neighbors – 64KB messages
• 67% improvement in time per step
Performance of P3DFFT Library

[Figure: time per loop (sec) for 256x256x256 and 512x512x512 problem sizes, showing 16% and 19% improvements]

• (MPI + OpenMP) version of a popular library for 3D Fast Fourier Transforms – the test performs a forward transform and a backward transform in each iteration
• 2 processes on Host (8 threads/process) + 8 processes on MIC (8 threads/process)
• Uses symmetric SCIF because of the MPI_Alltoall
• Up to 19% improvement using SCIF-ENHANCED