Improving Application Performance and Predictability using Multiple Virtual Lanes in Modern Multi-Core InfiniBand Clusters
Hari Subramoni, Ping Lai, Sayantan Sur and Dhabaleswar K. Panda
Department of Computer Science & Engineering, The Ohio State University
Outline
• Introduction & Motivation
• Problem Statement
• Design
• Performance Evaluation and Results
• Conclusions and Future Work
Introduction & Motivation
• Supercomputing clusters growing in size and scale
• MPI – predominant programming model for HPC
• High performance interconnects like InfiniBand have increased network capacity
• Compute capacity outstrips network capacity with the advent of multi-/many-core processors
• Gets aggravated as jobs get assigned to random nodes and share links
(Figure: cluster fabric built from line card switches and fabric card switches)
Analysis of Traffic Pattern in a Supercomputer
• Traffic flow in the Ranger supercomputer at Texas Advanced Computing Center (TACC) shows heavy link sharing
  – http://www.tacc.utexas.edu
(Figure: Ranger fabric traffic map, courtesy TACC)
  Color of dot – Green: network elements; Black: line card switches; Red: fabric card switches
  Color of link (number of streams) – Black: 1; Blue: 2; Red: 3–4; Orange: 5–8; Green: > 8
Possible Issue with Link Sharing
• Few communicating peers – no problem
• Packets get backed up as the number of communicating peers increases
• Results in delayed arrival of packets at the destination
(Figure: compute nodes communicating through a shared switch)
Frequency Distribution of Inter-Arrival Times
• Packet size – 2 KB (results are similar for 1 KB to 16 KB)
• Inter-arrival time is directly proportional to the load on the links
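Not part of the slides: a minimal verbs-level sketch of how completion inter-arrival gaps like these could be collected. The actual numbers come from a modified OFED perftest; now_us() and record_interarrival() below are hypothetical helpers, shown only to make the measurement idea concrete.

```c
/* Sketch (not the modified OFED perftest used in the study): timestamp
 * each receive completion as it is reaped from the completion queue and
 * record the gap to the previous completion.  A histogram of these gaps
 * gives an inter-arrival distribution like the one on the slide.
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <time.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Poll 'expected' completions from cq and print the inter-arrival gaps. */
static void record_interarrival(struct ibv_cq *cq, int expected)
{
    struct ibv_wc wc;
    double prev = -1.0;

    while (expected > 0) {
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0)
            break;                     /* polling error */
        if (n == 0)
            continue;                  /* nothing completed yet */
        double t = now_us();
        if (wc.status == IBV_WC_SUCCESS && prev >= 0.0)
            printf("inter-arrival: %.2f us\n", t - prev);
        prev = t;
        expected -= n;
    }
}
```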
Introduction & Motivation (Cont.)
• Can modern networks like InfiniBand alleviate this?
InfiniBand Architecture
• An industry standard for low latency, high bandwidth System Area Networks
• Multiple features
  – Two communication types
    • Channel semantics
    • Memory semantics (RDMA mechanism)
  – Queue Pair (QP) based communication
  – Quality of Service (QoS) support
  – Multiple Virtual Lanes (VL)
  – QPs are associated with VLs by means of pre-specified Service Levels (SL) – see the sketch below
• Multiple communication speeds available for Host Channel Adapters (HCA)
  – 10 Gbps (SDR) / 20 Gbps (DDR) / 40 Gbps (QDR)
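As background for how a Service Level reaches the wire, the sketch below shows the standard verbs call that attaches an SL to a reliable-connected QP while moving it to the RTR state. This is a minimal sketch, not MVAPICH code: connect_qp_with_sl() is an illustrative helper, and the remote LID/QPN/PSN are assumed to have been exchanged out of band.

```c
/* Minimal sketch: assign an InfiniBand Service Level (SL) to an RC QP
 * during the INIT -> RTR transition.  The subnet manager's SL-to-VL
 * mapping later places this QP's traffic on a particular Virtual Lane.
 * Error handling and the remaining connection-setup steps are omitted.
 */
#include <infiniband/verbs.h>
#include <stdint.h>

int connect_qp_with_sl(struct ibv_qp *qp, uint8_t sl,
                       uint16_t remote_lid, uint32_t remote_qpn,
                       uint32_t remote_psn, enum ibv_mtu mtu,
                       uint8_t port_num)
{
    struct ibv_qp_attr attr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = mtu,
        .dest_qp_num        = remote_qpn,
        .rq_psn             = remote_psn,
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr            = {
            .is_global     = 0,
            .dlid          = remote_lid,
            .sl            = sl,      /* Service Level selects the VL */
            .src_path_bits = 0,
            .port_num      = port_num,
        },
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}
```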
InfiniBand Network Buffer Architecture
• Buffers in most IB HCAs and switches are grouped into two classes
  – Common buffer pool, and
  – Private VL buffers
• Most current generation MPIs only use one VL
• Inefficient use of available network resources
• Why not use more VLs?
• Possible con – would it take more time to poll all the VLs?
(Figure: HCA buffer organization – a common buffer pool plus private buffers for Virtual Lanes 0–15, feeding the physical link through a VL arbiter)
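Whether extra VLs are even available can be checked at runtime by querying the port attributes. The sketch below is illustrative (data_vls_supported() is a hypothetical helper); it decodes max_vl_num following the InfiniBand VLCap convention.

```c
/* Sketch: discover how many data Virtual Lanes the local HCA port
 * supports before deciding to spread QPs across them.  max_vl_num holds
 * the IB VLCap encoding (1 -> VL0 only, 2 -> VL0-1, 3 -> VL0-3,
 * 4 -> VL0-7, 5 -> VL0-14).
 */
#include <infiniband/verbs.h>

static int data_vls_supported(struct ibv_context *ctx, uint8_t port_num)
{
    struct ibv_port_attr pattr;

    if (ibv_query_port(ctx, port_num, &pattr))
        return -1;                      /* query failed */

    switch (pattr.max_vl_num) {         /* VLCap encoding from PortInfo */
    case 1: return 1;
    case 2: return 2;
    case 3: return 4;
    case 4: return 8;
    case 5: return 15;
    default: return -1;
    }
}
```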
Outline
• Introduction & Motivation
• Problem Statement
• Design
• Performance Evaluation and Results
• Conclusions and Future Work
Problem Statement
• Can multiple virtual lanes be used to improve the performance of HPC applications?
• How can we integrate this design into an MPI library so that end applications benefit?
Outline
• Introduction & Motivation
• Problem Statement
• Design
• Performance Evaluation and Results
• Conclusions and Future Work
Proposed Framework and Goals
• No change to the application
• Re-design the MPI library to use multiple VLs
• Need new methods to take advantage of multiple VLs
  – Traffic Distribution
    • Load balance traffic across multiple VLs
  – Traffic Segregation
    • Ensure one kind of traffic does not disturb the other
• Distinguish between
  – Low & high priority traffic
  – Small & large messages
(Figure: proposed framework – Job Scheduler and Application above an MPI Library containing Traffic Distribution and Traffic Segregation components, running over the InfiniBand network)
Proposed Design
• Re-design the MPI library to use multiple VLs
  – Multiple Virtual Lanes configured with different characteristics
    • Transmit fewer packets at high priority
    • Transmit more packets at lower priority, etc.
  – Multiple Service Levels (SL) defined to match the VLs
  – Queue Pairs (QPs) assigned the proper SL at QP creation time
• Multiple ways to assign Service Levels to applications (see the sketch below)
  – Assign SLs with similar characteristics in a round robin fashion
    • Traffic Distribution
  – Assign SLs with the desired characteristics based on the type of application
    • Traffic Segregation
  – Other designs being explored
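To make the two assignment policies concrete, here is an illustrative sketch, not the MVAPICH2 implementation: sl_round_robin() spreads QPs across a pool of identically configured SLs for Traffic Distribution, while sl_for_class() pins a traffic class to a dedicated SL for Traffic Segregation. NUM_SLS, the helper names and the specific SL numbers are assumptions; the chosen SL would feed the ah_attr.sl field shown earlier.

```c
/* Illustrative SL-selection policies (hypothetical helpers).  Assumes
 * the subnet manager maps SLs 0..NUM_SLS-1 onto distinct Virtual Lanes.
 */
#include <stdint.h>

#define NUM_SLS 8   /* assumed pool of identically configured SLs */

/* Traffic Distribution: hand each newly created QP the next SL in turn. */
static uint8_t sl_round_robin(void)
{
    static uint8_t next;               /* per-process counter */
    uint8_t sl = next;
    next = (uint8_t)((next + 1) % NUM_SLS);
    return sl;
}

/* Traffic Segregation: keep latency-sensitive traffic on one SL and
 * bulk traffic on another, so one does not disturb the other. */
enum traffic_class { TC_HIGH_PRIORITY_SMALL, TC_LOW_PRIORITY_BULK };

static uint8_t sl_for_class(enum traffic_class tc)
{
    /* assumed SL numbering: SL 0 given higher arbitration weight */
    return (tc == TC_HIGH_PRIORITY_SMALL) ? 0 : 1;
}
```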
Proposed Design (Cont.)
(Figure: the Job Scheduler, MPI Library and Application are each assigned a Service Level; the Service Levels map onto Virtual Lanes 0–15, which share the physical link through the VL arbiter)
Outline
• Introduction & Motivation
• Problem Statement
• Design
• Performance Evaluation & Results
• Conclusions and Future Work
Experimental Testbed
• Compute platform – Intel Nehalem
  – Intel Xeon E5530 dual quad-core processors operating at 2.40 GHz
  – 12 GB RAM, 8 MB cache
  – PCIe 2.0 interface
• Network equipment
  – MT26428 QDR ConnectX HCAs
  – 36-port Mellanox QDR switch used to connect all the nodes
• Red Hat Enterprise Linux Server release 5.3 (Tikanga)
• OFED-1.4.2
• OpenSM version 3.1.6
• Benchmarks
  – Modified version of OFED perftest for verbs level tests
  – MPIBench collective benchmark
  – CPMD used for application level evaluation
MVAPICH / MVAPICH2 Software
• High Performance MPI Library for IB and 10GE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Used by more than 1,255 organizations in 59 countries
  – More than 44,500 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 11th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with software stacks of many IB, 10GE and server vendors, including the Open Fabrics Enterprise Distribution (OFED)
  – Also supports a uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/
Verbs Level Performance
• Tests use 8 communicating pairs
• One QP per pair
• Packet size – 2 KB (results are similar for 1 KB to 16 KB)
• Traffic distribution using multiple VLs results in a more predictable inter-arrival time
• Slight increase in average latency
MPI Level Point-to-Point Performance
(Charts: latency (us), bandwidth (MBps) and message rate (millions of messages/s) vs. message size from 1 KB to 64 KB, comparing 1 VL against 8 VLs)
• Tests use 8 communicating pairs
• One QP per pair
• Traffic distribution using multiple VLs results in better overall performance
• 13% performance improvement over the case with just one VL
MPI Level Collective Performance
(Charts: Alltoall latency (us) vs. message size from 1 KB to 16 KB – Traffic Distribution compares 1 VL against 8 VLs; Traffic Segregation compares two concurrent Alltoalls with and without segregation against a single Alltoall)
• For a 64-process Alltoall, Traffic Distribution and Traffic Segregation through the use of multiple VLs result in better performance
• 20% performance improvement seen with Traffic Distribution
• 12% performance improvement seen with Traffic Segregation
Application Level Performance
(Chart: normalized time for total execution and time spent in Alltoall, 1 VL vs. 8 VLs)
• CPMD application, 64 processes
• Traffic Distribution through the use of multiple VLs results in better performance
• 11% improvement in Alltoall performance
• 6% improvement in overall performance
Outline
• Introduction & Motivation
• Problem Statement
• Design
• Performance Evaluation & Results
• Conclusions & Future Work
Conclusions & Future Work
• Explored the use of Virtual Lanes to improve the predictability and performance of HPC applications
• Integrated our scheme into the MVAPICH2 MPI library and conducted performance evaluations at various levels
• Observed a consistent increase in performance in the verbs, MPI and application level evaluations
• Future work: explore advanced schemes to improve performance using multiple virtual lanes
• The proposed solution will be available in future MVAPICH2 releases
Thank you!
{subramon, laipi, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu/