
NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement (presentation transcript)



  1. NUMA-Aware Thread and Resource Scheduling for Terabit Data Movement. Taeuk Kim, Awais Khan, Youngjae Kim, Sungyong Park, Scott Atchley. PDSW-DISCS '17 WIP session, November 13, 2017, Denver, USA. Laboratory for Advanced System Software, Sogang University.

  2. Need for Data Coupling over ESnet
  • Data coupling across HPC facilities:
    - Nuclear interaction datasets generated at NERSC are needed at the OLCF for peta-scale simulation.
    - Climate simulations run at ALCF and OLCF are validated with BER datasets at ORNL data centers.
  • Example of coupling data: moving large data sets between facilities.

  3. Terabits Network Environment
  • The terabit network improvement only raised the network transfer rate.
  • But the data sets are stored on slow storage systems!
  [Figure: streaming neutron experiment data from a source data mover over IB / WAN Ethernet (Verbs, UDP WAN/LAN, LinuxEthDirect or Myricom) to a sink data mover feeding simulation data.]

  4. LADS: Layout-Aware Data Scheduling [FAST '15]
  • LADS offers end-to-end data transfer optimization.
  • LADS solved the impedance mismatch problem between the faster network and the slower storage system!
  [Figure: the same streaming-data transfer path as on the previous slide, with LADS running on the data movers.]

  5. What Memory Bottleneck Occurs in LADS?
  [Figure: a pair of DTNs connected over IB / WAN Ethernet. Each DTN has 2 CPU sockets and 4 NUMA nodes with DRAM per node (cores of CPU0/CPU1 on socket 0, CPU2/CPU3 on socket 1), linked by QPI. LADS threads (scheduler, communicator, I/O threads) and the RMA buffer are placed on these nodes.]

  6. Architectural Overview for LADS
  • NUMA-based DTN architecture at both source and sink.
  [Figure: one DTN with 2 CPU sockets and 4 NUMA nodes (DRAM per node) connected by QPI; the RMA buffer resides in the DRAM of one NUMA node on CPU socket 0.]

  7. Memory Bottleneck with Single RMA Buffer
  • NUMA-based DTN architecture at both source and sink.
  [Figure: same DTN layout as the previous slide, with a single RMA buffer hosted in a NUMA node on CPU socket 0.]

  8. Memory Bottleneck with Single RMA Buffer
  • Remote memory accesses: threads running on CPU socket 1 access the RMA buffer hosted by CPU socket 0, so every access crosses the QPI interconnect.
  [Figure: same DTN layout, highlighting socket 1's accesses to the socket-0 RMA buffer over QPI.]
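
  The placement behind this bottleneck is easy to observe outside of LADS. The minimal sketch below (not part of the presentation; the buffer name, size, and build command are illustrative) uses the Linux move_pages(2) query mode to report which NUMA node holds a buffer's pages. Under the default first-touch policy, a single RMA buffer typically lands entirely on the node of the thread that first writes it, so threads on the other socket must cross QPI for every access.

```c
/* Hedged sketch (not from the slides): query which NUMA node a buffer's
 * pages live on, using move_pages(2) in query mode (nodes == NULL).
 * Build with: gcc -o rma_node_check rma_node_check.c -lnuma
 */
#define _GNU_SOURCE
#include <numaif.h>   /* move_pages() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t page_sz = (size_t)sysconf(_SC_PAGESIZE);
    size_t buf_sz  = 64 * 1024 * 1024;          /* e.g., a 64 MB RMA buffer */
    char *rma_buf  = aligned_alloc(page_sz, buf_sz);
    if (!rma_buf) return 1;

    /* Touch the pages so they are actually allocated (first-touch policy). */
    for (size_t off = 0; off < buf_sz; off += page_sz)
        rma_buf[off] = 0;

    /* Ask the kernel which node holds the first page of the buffer. */
    void *page = rma_buf;
    int status = -1;
    if (move_pages(0 /* self */, 1, &page, NULL, &status, 0) == 0)
        printf("RMA buffer page 0 resides on NUMA node %d\n", status);

    free(rma_buf);
    return 0;
}
```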

  9. Multiple RMA Buffers
  • Distribute the RMA buffer across all CPU sockets to reduce remote-socket memory accesses.
  [Figure: DTN layout with one RMA buffer per CPU socket.]

  10. Multiple RMA Buffers
  • The likelihood of accessing a remote socket's memory is reduced!
  [Figure: same per-socket RMA buffer layout as the previous slide.]
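
  One way to realize this per-socket layout is to pin one RMA buffer to a NUMA node of each socket with libnuma, as in the sketch below. This is a minimal illustration under assumed details: the node-to-socket mapping, buffer size, and variable names are hypothetical, since the slides do not show the LADS allocation code.

```c
/* Hedged sketch: allocate one RMA buffer per CPU socket so that threads on
 * either socket can use a socket-local buffer.  Node numbering and buffer
 * size are illustrative, not taken from the LADS implementation.
 * Build with: gcc -o rma_per_socket rma_per_socket.c -lnuma
 */
#include <numa.h>     /* numa_available(), numa_alloc_onnode(), numa_free() */
#include <stdio.h>

#define RMA_BUF_SIZE (256UL * 1024 * 1024)   /* 256 MB per socket (example) */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    /* Assume the testbed layout from the slides: 2 sockets, 4 NUMA nodes,
     * with nodes {0,1} on socket 0 and nodes {2,3} on socket 1 (assumed
     * mapping).  Pin one RMA buffer to one node of each socket. */
    int socket_home_node[2] = { 0, 2 };
    void *rma_buf[2];

    for (int s = 0; s < 2; s++) {
        rma_buf[s] = numa_alloc_onnode(RMA_BUF_SIZE, socket_home_node[s]);
        if (!rma_buf[s]) {
            fprintf(stderr, "allocation on node %d failed\n",
                    socket_home_node[s]);
            return 1;
        }
        printf("RMA buffer for socket %d placed on NUMA node %d\n",
               s, socket_home_node[s]);
    }

    /* ... I/O and communication threads on socket s would use rma_buf[s] ... */

    for (int s = 0; s < 2; s++)
        numa_free(rma_buf[s], RMA_BUF_SIZE);
    return 0;
}
```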

  11. Memory-aware Thread Scheduling (MTS)
  • Bind all threads to the in-socket RMA buffer.
  • Load-balance among in-socket NUMA nodes.
  [Figure: per-socket RMA buffer layout with threads bound to the nodes of their own socket.]

  12. Memory-aware Thread Scheduling (MTS)
  • Result: local memory accesses and load balancing.
  [Figure: same layout; accesses stay socket-local and are spread across the in-socket NUMA nodes.]
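
  A plausible way to implement MTS on Linux is to pin each I/O thread to the CPUs of a NUMA node on the socket that hosts its RMA buffer, rotating over that socket's nodes for load balancing. The sketch below uses libnuma and pthread affinity; the helper names, thread count, and node-to-socket mapping are assumptions for illustration, not the authors' implementation.

```c
/* Hedged sketch of memory-aware thread scheduling (MTS): bind an I/O thread
 * to the CPUs of one NUMA node on the socket that hosts its RMA buffer,
 * round-robining over the socket's nodes for load balancing.  Helper names
 * and the node-to-socket mapping are illustrative.
 * Build with: gcc -o mts_sketch mts_sketch.c -lnuma -lpthread
 */
#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Assumed layout: 2 sockets x 2 NUMA nodes each. */
static const int socket_nodes[2][2] = { { 0, 1 }, { 2, 3 } };

/* Pin the calling thread to all CPUs of the given NUMA node. */
static int bind_thread_to_node(int node)
{
    struct bitmask *node_cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, node_cpus) != 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned cpu = 0; cpu < node_cpus->size; cpu++)
        if (numa_bitmask_isbitset(node_cpus, cpu))
            CPU_SET(cpu, &set);
    numa_free_cpumask(node_cpus);

    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *io_thread(void *arg)
{
    int node = *(int *)arg;
    if (bind_thread_to_node(node) == 0)
        printf("I/O thread bound to CPUs of NUMA node %d\n", node);
    /* ... read file chunks into the socket-local RMA buffer here ... */
    return NULL;
}

int main(void)
{
    if (numa_available() < 0) return 1;

    enum { NTHREADS = 4 };
    pthread_t tid[NTHREADS];
    int target_node[NTHREADS];
    int socket = 0;                  /* socket whose RMA buffer these serve */

    for (int i = 0; i < NTHREADS; i++) {
        /* Round-robin threads over the in-socket nodes for load balancing. */
        target_node[i] = socket_nodes[socket][i % 2];
        pthread_create(&tid[i], NULL, io_thread, &target_node[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```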

  13. Test-bed Configuration
  • Data Transfer Nodes (DTNs): 2 CPU sockets, 4 NUMA nodes, 24 cores, 128 GB memory, InfiniBand EDR (100 Gb/s).
  • Workloads: 8 x 3 GB files (big-file workload); 24,000 x 1 MB files (small-file workload).
  • Storage systems: a memory file system (tmpfs) on both DTNs to eliminate storage bottlenecks.
  [Figure: source DTN and sink DTN, each backed by tmpfs, connected by IB QDR (40 Gb/s).]

  14. Evaluation
  [Chart: throughput (MB/s, 0 to 4000) versus number of I/O threads (2 to 64) for the single-RMA-buffer baseline.]

  15. Evaluation
  [Chart: throughput (MB/s) versus number of I/O threads (2 to 64), comparing the single-RMA-buffer baseline with NUMA-aware MTS scheduling.]

  16. Evaluation
  • Throughput increased by an average of 24.3% with NUMA-aware MTS scheduling over the single-RMA-buffer baseline.
  [Chart: same comparison as the previous slide, annotated with the average improvement.]

  17. Q&A
  Contact: Taeuk Kim (taugi323@sogang.ac.kr), Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea.
