NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement
Taeuk Kim, Awais Khan, Youngjae Kim, Sungyong Park, Scott Atchley
PDSW-DISCS’17 WIP session, November 13, 2017, Denver, USA
LABORATORY FOR ADVANCED SYSTEM SOFTWARE, SOGANG UNIVERSITY

Need for Data Coupling over ESnet
• Data coupling across HPC facilities
  • Nuclear interaction datasets generated at NERSC are needed at the OLCF for peta-scale simulation
  • Climate simulations run at ALCF and OLCF are validated with BER datasets at ORNL data centers
• Coupling data example: moving large data sets

Terabits Network Environment
• Terabit network improvements contributed only to the network transfer rate.
• But data sets are stored on slow storage systems!
[Figure: streaming neutron experiment data and simulation data moved between two data movers over IB, WAN, and Ethernet. Transport labels: Verbs, UDP WAN or LinuxEthDirect, UDP LAN or Myricom.]

LADS: Layout-Aware Data Scheduling [FAST’15]
• LADS offers an end-to-end data transfer optimization.
• LADS solved the impedance mismatch problem between the faster network and the slower storage system!
[Figure: streaming neutron experiment data and simulation data moved between two data movers over IB, WAN, and Ethernet. Transport labels: Verbs, UDP WAN or LinuxEthDirect, UDP LAN or Myricom.]

What Memory Bottleneck Occurs in LADS?
[Figure: two DTNs connected over IB, WAN, and Ethernet. The DTN detail shows CPU socket 0 and CPU socket 1 linked by QPI, and NUMA nodes 0 to 3, each with DRAM and cores (CPU0 to CPU3). Labeled components: Scheduler, Communicator, I/O thread, RMA buffer, Simulation Data.]

Architectural Overview for LADS
• NUMA-based DTN architecture in source and sink
[Figure: CPU socket 0 and CPU socket 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; a single RMA buffer in one node's DRAM.]

Memory Bottleneck with Single RMA Buffer
• NUMA-based DTN architecture in source and sink
• Remote memory accesses: CPU socket 1 accesses the RMA buffer hosted by CPU socket 0.
[Figure: CPU socket 0 and CPU socket 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; a single RMA buffer on CPU socket 0.]

Multiple RMA Buffers
• Distribute the RMA buffer across all CPU sockets to reduce accesses to the remote socket's memory.
• The possibility of accessing the remote socket's memory is reduced!
[Figure: CPU socket 0 and CPU socket 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; one RMA buffer on each CPU socket.]

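The effect of distributing the RMA buffer can be illustrated with a small model. This is only a sketch, not the LADS implementation: the node-to-socket mapping `SOCKET_OF_NODE`, the function `remote_fraction`, and the even spread of eight threads are all assumptions made for illustration.

```python
# Assumed mapping of NUMA node -> CPU socket (hypothetical; the slides do not
# state which nodes belong to which socket).
SOCKET_OF_NODE = {0: 0, 1: 0, 2: 1, 3: 1}


def remote_fraction(thread_nodes, buffer_sockets):
    """Return the fraction of threads that must cross sockets to reach an RMA buffer.

    thread_nodes:   NUMA node that each I/O thread runs on.
    buffer_sockets: set of CPU sockets that host an RMA buffer.
    A thread counts as local when its own socket hosts a buffer.
    """
    remote = sum(
        1 for node in thread_nodes if SOCKET_OF_NODE[node] not in buffer_sockets
    )
    return remote / len(thread_nodes)


# Eight I/O threads spread evenly over the four NUMA nodes (assumed placement).
threads = [0, 1, 2, 3, 0, 1, 2, 3]

single = remote_fraction(threads, {0})       # one RMA buffer, on socket 0
multiple = remote_fraction(threads, {0, 1})  # one RMA buffer per socket
print(single, multiple)
```

Under these assumptions, half of the threads cross the socket interconnect with a single buffer and none do with one buffer per socket. The model counts only cross-socket accesses and ignores bandwidth and latency.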
Memory-aware Thread Scheduling (MTS)
• Binding all threads to the in-socket RMA buffer
• Load balancing among in-socket NUMA nodes
• Result: local memory accesses and load balancing
[Figure: CPU socket 0 and CPU socket 1 linked by QPI; NUMA nodes 0 to 3 with DRAM; one RMA buffer on each CPU socket.]

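The two MTS rules (bind each thread to its in-socket RMA buffer, balance load among in-socket NUMA nodes) can be sketched as a placement function. This is a hedged illustration rather than the authors' scheduler: `TOPOLOGY`, `schedule_threads`, the alternating socket choice, and the least-loaded-node policy are assumptions introduced here.

```python
from collections import defaultdict

# Assumed layout: CPU socket -> NUMA nodes on that socket (hypothetical mapping).
TOPOLOGY = {0: [0, 1], 1: [2, 3]}


def schedule_threads(num_threads, topology=TOPOLOGY):
    """Place each thread on a NUMA node and bind it to its own socket's RMA buffer."""
    load = defaultdict(int)  # number of threads already placed on each node
    sockets = sorted(topology)
    placement = []
    for tid in range(num_threads):
        # Alternate sockets so both RMA buffers receive a similar share of threads.
        socket = sockets[tid % len(sockets)]
        # Within the socket, pick the least-loaded NUMA node (ties go to the lower id).
        node = min(topology[socket], key=lambda n: (load[n], n))
        load[node] += 1
        # The thread uses the RMA buffer hosted by its own socket, never a remote one.
        placement.append(
            {"thread": tid, "socket": socket, "node": node, "rma_buffer": socket}
        )
    return placement


for entry in schedule_threads(8):
    print(entry)
```

The sketch only computes a placement. On Linux, actually pinning a thread to the chosen cores could be done with `os.sched_setaffinity`, and placing the buffer memory on a given node would need a NUMA library; neither step is shown here.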
Test-bed Configuration
• Data Transfer Nodes (DTNs)
  • 2 CPU sockets, 4 NUMA nodes, 24 cores
  • 128GB memory
  • InfiniBand EDR (100Gb/s)
• Workloads
  • 8x3GB files (big file workload)
  • 24,000x1MB files (small file workload)
• Storage Systems
  • We used the memory file system (tmpfs) to eliminate storage bottlenecks.
[Figure: DTN source and DTN sink, each with a memory file system (tmpfs), connected by a link labeled IB QDR (40Gb/s).]

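As a quick sanity check on the workload sizes, the arithmetic below totals each workload and computes the time an ideal link would need to move it. It is a sketch under stated assumptions: decimal units (1 GB = 10^9 bytes), no protocol overhead, and the helper name `ideal_seconds` is hypothetical.

```python
GB = 10**9  # decimal gigabyte (assumption; the slides do not say GB vs GiB)
MB = 10**6  # decimal megabyte (same assumption)

big_file_bytes = 8 * 3 * GB         # 8 files of 3 GB each
small_file_bytes = 24_000 * 1 * MB  # 24,000 files of 1 MB each


def ideal_seconds(total_bytes, link_gbps):
    """Time to move total_bytes over a link of link_gbps Gb/s with zero overhead."""
    return total_bytes * 8 / (link_gbps * 10**9)


print(big_file_bytes / GB, small_file_bytes / GB)
print(ideal_seconds(big_file_bytes, 100), ideal_seconds(big_file_bytes, 40))
```

Under these assumptions both workloads carry the same total volume, so differences between them would come from per-file overhead rather than data size.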
Evaluation
[Figure: throughput (MB/s, axis from 0 to 4000) versus number of I/O threads (2, 4, 8, 16, 32, 64). Series: Baseline (single RMA buffer) and MTS (NUMA-aware scheduling).]
• Throughput increased by 24.3% on average!

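The slides do not say how the average improvement was computed. One plausible reading, sketched below, is the mean of the per-configuration relative gains over the tested I/O thread counts. The function name `average_gain_percent` and this averaging method are assumptions, and no measured values from the slides are reproduced here.

```python
def average_gain_percent(baseline, improved):
    """Mean relative throughput gain, in percent, across paired measurements.

    baseline: throughput of the single-RMA-buffer runs, one per thread count.
    improved: throughput of the MTS runs, in the same order.
    """
    if not baseline or len(baseline) != len(improved):
        raise ValueError("need two non-empty lists of equal length")
    gains = [(new - old) / old * 100.0 for old, new in zip(baseline, improved)]
    return sum(gains) / len(gains)
```

Averaging per-configuration gains weights every thread count equally. Comparing the two mean throughputs instead would give more weight to the configurations with higher throughput.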
Q&A
Contact: Taeuk Kim (taugi323@sogang.ac.kr)
Department of Computer Science and Engineering
Sogang University, Seoul, Republic of Korea