NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement
Taeuk Kim, Awais Khan, Youngjae Kim, Sungyong Park, Scott Atchley
PDSW-DISCS 17 WIP session, November 13, 2017, Denver, USA
LABORATORY FOR ADVANCED SYSTEM SOFTWARE, SOGANG UNIVERSITY

Need for Data Coupling over ESnet
• Data coupling across HPC facilities
  - Nuclear interaction datasets generated at NERSC are needed at the OLCF for peta-scale simulation
  - Climate simulations run at ALCF and OLCF are validated with BER datasets at ORNL data centers
[Figure labels: "Coupling Data"; "Example: Moving Large Data Sets"]

Terabit Network Environment
• Terabit network improvements contributed only to the network transfer rate.
[Diagram labels: Streaming Neutron Experiment Data; Simulation Data; two Data Movers; IB, WAN, Ethernet; Verbs; UDP WAN or LinuxEthDirect; UDP LAN or Myricom]
But the data sets are stored on slow storage systems!

LADS: Layout-Aware Data Scheduling [FAST’15]
• LADS offers an end-to-end data transfer optimization.
[Diagram labels: Streaming Neutron Experiment Data; Simulation Data; two Data Movers; IB, WAN, Ethernet; Verbs; UDP WAN or LinuxEthDirect; UDP LAN or Myricom]
LADS solved the impedance mismatch problem between the faster network and the slower storage system!

What Memory Bottleneck Occurs in LADS?
[Diagram: two DTNs connected over IB, WAN, or Ethernet, moving Simulation Data. Each DTN has two CPU sockets (0 and 1) linked by QPI and four NUMA nodes (0 to 3), each with its own DRAM and cores (CPU0 to CPU3). Components shown: Scheduler, Communicator, I/O thread, RMA buffer.]

Architectural Overview for LADS
• NUMA-based DTN architecture in source and sink
[Diagram: CPU sockets 0 and 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; a single RMA buffer.]

Memory Bottleneck with Single RMA Buffer
• NUMA-based DTN architecture in source and sink
• Remote memory accesses: CPU socket 1 accesses the RMA buffer hosted by CPU socket 0.
[Diagram: CPU sockets 0 and 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; a single RMA buffer.]

Multiple RMA Buffers
• Distribute the RMA buffer across all CPU sockets
  - Reduces accesses to the remote socket’s memory
The chance of accessing the remote socket’s memory is reduced!
[Diagram: CPU sockets 0 and 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; two RMA buffers.]
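The effect of distributing the RMA buffer can be illustrated with a small counting model. This is only a sketch under stated assumptions: the node-to-socket mapping, the thread spread, and the names `NODE_TO_SOCKET` and `count_accesses` are hypothetical and do not come from the slides or from LADS itself.

```python
from collections import Counter

# Hypothetical topology for illustration: 2 sockets, 2 NUMA nodes per socket.
# The node-to-socket mapping is an assumption, not taken from the slides.
NODE_TO_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}


def count_accesses(thread_nodes, buffer_sockets):
    """Classify each thread's RMA buffer access as local or remote.

    thread_nodes: NUMA node that each I/O thread runs on.
    buffer_sockets: set of sockets that host an RMA buffer.
    A thread uses the buffer on its own socket when one exists;
    otherwise it has to cross the inter-socket link.
    """
    counts = Counter(local=0, remote=0)
    for node in thread_nodes:
        socket = NODE_TO_SOCKET[node]
        counts["local" if socket in buffer_sockets else "remote"] += 1
    return dict(counts)


# Eight I/O threads spread evenly over the four NUMA nodes.
threads = [i % 4 for i in range(8)]

single = count_accesses(threads, buffer_sockets={0})       # one buffer on socket 0
multiple = count_accesses(threads, buffer_sockets={0, 1})  # one buffer per socket
print(single)    # {'local': 4, 'remote': 4}
print(multiple)  # {'local': 8, 'remote': 0}
```

In this toy model, half of the threads reach a single buffer over the inter-socket link, while a buffer on every socket removes those remote accesses, provided each thread really uses its own socket's buffer.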

Memory-aware Thread Scheduling (MTS)
• Bind all threads to the in-socket RMA buffer
• Balance load among the in-socket NUMA nodes
Result: local memory accesses and load balancing.
[Diagram: CPU sockets 0 and 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; two RMA buffers.]
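The two MTS rules above can be sketched as a placement function. This is a minimal illustration, not the LADS implementation: the socket layout `SOCKET_NODES`, the function `place_threads`, and the round-robin policy are all assumptions chosen to show the idea.

```python
from collections import Counter
from itertools import cycle

# Hypothetical layout: socket id -> NUMA nodes on that socket (an assumption).
SOCKET_NODES = {0: [0, 1], 1: [2, 3]}


def place_threads(num_threads):
    """Assign each thread a socket, an in-socket NUMA node, and a buffer.

    Rule 1: a thread always uses the RMA buffer on its own socket.
    Rule 2: threads on a socket are spread round-robin over that
            socket's NUMA nodes to balance load.
    """
    sockets = sorted(SOCKET_NODES)
    next_node = {s: cycle(SOCKET_NODES[s]) for s in sockets}
    placement = []
    for tid in range(num_threads):
        socket = sockets[tid % len(sockets)]  # alternate between sockets
        node = next(next_node[socket])        # round-robin inside the socket
        placement.append(
            {"thread": tid, "socket": socket, "node": node, "buffer": socket}
        )
    return placement


plan = place_threads(8)
per_node = Counter(p["node"] for p in plan)
print(sorted(per_node.items()))  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

The sketch only computes a plan. On Linux, actual pinning could be done with a facility such as `os.sched_setaffinity` or libnuma, which is outside what the slides describe.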

Test-bed Configuration
• Data Transfer Nodes (DTNs)
  - 2 CPU sockets, 4 NUMA nodes, 24 cores
  - 128GB memory
  - InfiniBand EDR (100Gb/s)
• Workloads
  - 8x3GB files (Big file workload)
  - 24,000x1MB files (Small file workload)
• Storage Systems
  - We used the memory file system (tmpfs) to eliminate storage bottlenecks.
[Diagram: source DTN and sink DTN, each backed by a memory file system (tmpfs), connected by IB QDR (40Gb/s).]
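As a quick sanity check on the workload sizes, the arithmetic below computes the total volume of each workload and a lower bound on transfer time at the two link rates that appear on the slide. It assumes decimal units (1 GB = 1000 MB) and the raw signalling rate with no protocol or encoding overhead; the helper names are hypothetical.

```python
def total_mb(num_files, file_size_mb):
    """Total workload size in MB."""
    return num_files * file_size_mb


def ideal_seconds(size_mb, link_gbps):
    """Lower bound on transfer time at the raw link rate.

    Assumes 1 Gb/s = 1000 Mb/s and 8 bits per byte, with no
    protocol or encoding overhead (a simplification).
    """
    link_mb_per_s = link_gbps * 1000 / 8
    return size_mb / link_mb_per_s


big = total_mb(8, 3 * 1000)   # 8 files of 3 GB each
small = total_mb(24_000, 1)   # 24,000 files of 1 MB each
print(big, small)             # 24000 24000

for rate in (40, 100):        # the two link rates listed on the slide
    print(rate, "Gb/s:", ideal_seconds(big, rate), "s")
# 40 Gb/s: 4.8 s
# 100 Gb/s: 1.92 s
```

Both workloads move the same total volume, so under these assumptions any throughput difference between them would come from per-file overhead rather than data size.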

Evaluation
[Chart: throughput (MB/s, axis from 0 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64). Series: Baseline (Single RMA Buffer) and MTS (NUMA-aware Scheduling).]
Throughput increased by an average of 24.3%!
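One way an average improvement of this kind can be computed is as the mean of the per-configuration relative gains. The slides do not say which averaging method was used, so this is an assumption, and the numbers below are made-up placeholders, not the measured results.

```python
def mean_relative_gain(baseline, improved):
    """Mean of per-point relative gains, in percent.

    baseline, improved: throughput values measured at the same
    thread counts, in the same order.
    """
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need two non-empty series of equal length")
    gains = [(new - old) / old * 100.0 for old, new in zip(baseline, improved)]
    return sum(gains) / len(gains)


# Placeholder values for illustration only (NOT the measured throughput).
baseline = [100.0, 200.0, 400.0]
improved = [125.0, 250.0, 500.0]
print(mean_relative_gain(baseline, improved))  # 25.0
```

Averaging per-point gains weights every thread count equally. Comparing the means of the two series instead would weight the high-throughput configurations more heavily.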

Q&A
Contact: Taeuk Kim (taugi323@sogang.ac.kr)
Department of Computer Science and Engineering
Sogang University, Seoul, Republic of Korea
