NUMA Implications for Storage I/O Throughput in Modern Servers. Shoaib Akram, Manolis Marazakis, and Angelos Bilas. Computer Architecture and VLSI Laboratory, FORTH-ICS, Greece.
Outline Introduction Motivation Previous Work The “Affinity” Space Exploration Evaluation Methodology Results Conclusions
Introduction The number of processors on a single motherboard is increasing. Each processor has faster access to local memory than to remote memory, leading to the well-known NUMA problem. Much effort in the past has been devoted to NUMA-aware memory management and scheduling. There is a similar "non-uniform latency of access" relation between processors and I/O devices. In this work, we quantify the combined impact of non-uniform latency in accessing memory and I/O devices.
Motivation - Trends in I/O Subsystems Storage-related features: fast point-to-point interconnects, multiple storage controllers, and arrays of storage devices with high bandwidth. Result: storage bandwidth is catching up with memory bandwidth. The anatomy of a modern server: a machine with 2 NUMA domains. Today, NUMA affinity is a problem for the entire system.
Motivation - Trends in Processor Technology and Applications
[Figure: throughput (GB/s) of typical workloads (Backend, Data Stores, OLTP) in today's data centres, measured for 16 cores and projected with the growth of cores up to 4096 cores.]
Today: low device throughput, high per-I/O cycle overhead, low server utilization in data centres.
In future: many GB/s of device throughput, low per-I/O cycle overhead as system stacks improve, high server utilization in data centres through server consolidation etc.
Previous Work Multiprocessors Multicore Processors
The "Affinity" Space Exploration The movement of data on any I/O access: to/from the application buffer from/to the kernel buffer (memory copy), and to/from the kernel buffer from/to the I/O device (DMA transfer). Application and kernel buffers could be located in different NUMA domains. Kernel buffers and I/O devices could be located in different NUMA domains.
The "Affinity" Space Exploration The axes of the space are: the transfer (TR) between I/O devices and kernel buffers, and the copy (CP) from the kernel buffer to the application buffer.
Transfer (TR)   Copy (CP)    Configuration
Local (L)       Local (L)    TRLCPL
Local (L)       Remote (R)   TRLCPR
Remote (R)      Local (L)    TRRCPL
Remote (R)      Remote (R)   TRRCPR
Example scenarios: TRLCPL, TRLCPR, TRRCPR.
NUMA Memory Allocation How are application buffers allocated? They are allocated in the domain in which the process is executing. The numactl utility allows pinning application processes and their memory buffers to a particular domain. How are kernel buffers allocated? Kernel buffer allocation cannot be controlled using numactl. I/O buffers in the kernel are shared by all contexts performing I/O in the kernel. How do we maximize the possibility of keeping kernel buffers in a particular socket? Start each experiment with a clean buffer cache, and keep DRAM per socket large compared to the application datasets.
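As a rough sketch of the user-space side of this placement control, the C fragment below uses libnuma to pin a process and its application buffer to one node before issuing a buffered read. The node number and device path (/dev/md0) are illustrative assumptions, not values from the slides, and kernel-buffer placement remains outside the application's control, as noted above.

    #include <numa.h>      /* libnuma: link with -lnuma */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BUF_SIZE (1 << 20)               /* 1 MiB application buffer */

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        int app_node = 0;                    /* hypothetical node for process and buffer */
        numa_run_on_node(app_node);          /* pin execution to app_node */
        void *buf = numa_alloc_onnode(BUF_SIZE, app_node);  /* allocate buffer on app_node */
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return 1;
        }

        int fd = open("/dev/md0", O_RDONLY); /* hypothetical RAID 0 device */
        if (fd < 0) {
            perror("open");
            numa_free(buf, BUF_SIZE);
            return 1;
        }

        /* Buffered read: DMA into a kernel buffer, then a copy into buf.
           Whether this ends up as TRLCPL or TRRCPR depends on which domain
           the device and the kernel buffer reside in. */
        ssize_t n = read(fd, buf, BUF_SIZE);
        printf("read %zd bytes\n", n);

        close(fd);
        numa_free(buf, BUF_SIZE);
        return 0;
    }

Running the same program with app_node set to the other domain, against a device attached to the first domain, would approximate the remote end of the configuration space.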
Evaluation Methodology Bandwidth characterization of the evaluation testbed (two NUMA domains):
Memory bandwidth: 12 GB/s per domain, 24 GB/s total.
Inter-domain interconnect: 24 GT/s.
Storage controllers: 2 per domain, 3 GB/s per domain.
SSDs: read throughput of 250 MB/s each; 12 SSDs per domain, 3 GB/s per domain.
Total storage throughput: 6 GB/s.
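As a quick sanity check, the storage-side numbers above add up as follows (a worked sum over the figures on this slide, not an additional measurement):

\[ 12 \times 250\ \text{MB/s} = 3\ \text{GB/s per domain}, \qquad 2 \times 3\ \text{GB/s} = 6\ \text{GB/s total}, \]

i.e., the aggregate storage throughput is one quarter of the 24 GB/s aggregate memory bandwidth.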
Benchmarks, Applications and Datasets. Benchmarks and applications: zmIO, an in-house microbenchmark; fsmark, a filesystem stress benchmark; stream, a streaming workload; psearchy, a file indexing application (part of MOSBENCH); IOR, an application checkpointing workload. Workload configuration and datasets: four software RAID Level 0 devices, each on top of 6 SSDs and 1 storage controller; one workload instance per RAID device. Datasets consist of large files, with parameters chosen for high concurrency, high read/write throughput, and stress on the system stack.
Evaluation Metrics Problems with high-level application metrics: they cannot be mapped to the actual volume of data transferred, and they do not expose individual components of complex software stacks. It is important to look at individual components of execution time (user, system, idle, and iowait). Cycles per I/O: physical cycles consumed by the application during the execution time divided by the number of I/O operations. It can be used as an efficiency metric and can be converted to energy per I/O.
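Written out as a formula (our reading of the definition above; the exact set of time components counted is an assumption based on the user/system/idle/iowait breakdown mentioned on this slide):

\[ \text{cycles per I/O} = \frac{f_{\text{clk}} \cdot \left(t_{\text{user}} + t_{\text{sys}} + t_{\text{idle}} + t_{\text{iowait}}\right)}{N_{\text{I/O}}} \]

where f_clk is the core clock frequency and N_I/O is the number of I/O operations completed during the run.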
Results - mem_stress and zmIO
Remote memory accesses (mem_stress): memory throughput drops by one half. The degradation starts from one instance of mem_stress.
Remote transfers (zmIO): device throughput drops by one half. The throughput is the same for one instance; contention is a possible culprit for two and more instances. Instances are assigned to RAID devices round-robin.
[Figure: mem_stress memory throughput (GB/s), local vs. remote, for 1-6 instances; zmIO throughput (GB/s), TRLCPL vs. TRRCPR, for 1-8 instances.]
Results - fsmark and psearchy
fsmark is filesystem intensive: remote transfers result in 40% higher system time and a 130% increase in iowait time.
psearchy is both filesystem and I/O intensive: 57% increase in system time and 70% increase in iowait time.
[Figure: cycles per I/O sector (system and iowait components) for fsmark and psearchy under TRLCPL, TRRCPR, TRRCPL, and TRLCPR.]
Results - IOR
IOR is both read and write intensive. Remote transfers and memory copies cause a 15% decrease in read throughput and a 20% decrease in write throughput.
[Figure: IOR read and write throughput (MB/s) under TRLCPL, TRRCPR, TRRCPL, and TRLCPR.]
Results - stream
The 24 SSDs are divided into two domains; each set of 12 SSDs is connected to two controllers. Ideally, symmetric throughput is expected. Remote transfers result in a 27% drop in the throughput of one of the sets.
[Figure: stream throughput (MB/s) of Set A and Set B under TRLCPL, TRRCPR, TRRCPL, and TRLCPR.]
Conclusions A mix of synthetic benchmarks and applications shows the potential of NUMA affinity to hurt I/O throughput. Future systems will have increased heterogeneity, more domains, and higher bandwidths. Today, NUMA affinity is not a problem for cores within a single processor socket, but future processors with hundreds of cores will have domains within a processor chip. The range of performance degradation is important: different server configurations and runtime libraries result in throughput within a range. Partitioning of the system stack across sockets will become necessary.