  1. NUMA Implication for Storage I/O Throughput in Modern Servers Shoaib Akram, Manolis Marazakis, and Angelos Bilas Computer Architecture and VLSI Laboratory FORTH-ICS Greece

  2. Outline — Introduction — Motivation — Previous Work — The “Affinity” Space Exploration — Evaluation Methodology — Results — Conclusions

  3. Introduction — The number of processors on a single motherboard is increasing. — Each processor has faster access to local memory than to remote memory, leading to the well-known NUMA problem. — Much effort in the past has been devoted to NUMA-aware memory management and scheduling. — There is a similar “non-uniform latency of access” relation between processors and I/O devices. — In this work, we quantify the combined impact of non-uniform latency in accessing memory and I/O devices.

  4. Motivation - Trends in I/O Subsystems — Storage-related features: — Fast point-to-point interconnects. — Multiple storage controllers. — Arrays of storage devices with high bandwidth. — Result: — Storage bandwidth is catching up with memory bandwidth. — Today, NUMA affinity is a problem for the entire system. [Figure: the anatomy of a modern server, a machine with 2 NUMA domains.]

  5. Motivation - Trends in Processor Technology and Applications — Throughput of typical workloads in today’s data centres, measured for 16 cores and projected to many cores. [Figure: projected throughput (GB/s) of Backend Data Stores and OLTP versus number of cores, 16 to 4096.]

  6. Motivation - Trends in Processor Technology and Applications — Today: — Low device throughput. — High per-I/O cycle overhead. — Low server utilization in data centres. — Throughput of typical workloads in today’s data centres, measured for 16 cores and projected to many cores. [Figure: same projection of workload throughput (GB/s) versus number of cores.]

  7. Motivation - Trends in Processor Technology and Applications — Today: — Low device throughput. — High per-I/O cycle overhead. — Low server utilization in data centres. — In future: — Many GB/s of device throughput. — Low per-I/O cycle overhead as system stacks improve. — High server utilization in data centres through server consolidation etc. — Throughput of typical workloads in today’s data centres, measured for 16 cores and projected to many cores. [Figure: same projection of workload throughput (GB/s) versus number of cores.]

  8. Previous Work — Multiprocessors — Multicore Processors

  9. The “Affinity” Space Exploration — The movement of data on any I/O access:

  10. The “Affinity” Space Exploration — The movement of data on any I/O access: — to/from the application buffer from/to the kernel buffer (memory copy).

  11. The “Affinity” Space Exploration — The movement of data on any I/O access: — to/from the application buffer from/to the kernel buffer (memory copy). — to/from the kernel buffer from/to the I/O device (DMA transfer).

  12. The “Affinity” Space Exploration — The movement of data on any I/O access: — to/from the application buffer from/to the kernel buffer (memory copy). — to/from the kernel buffer from/to the I/O device (DMA transfer). — Application and kernel buffers could be located in different NUMA domains. — Kernel buffers and I/O devices could be located in different NUMA domains.
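
  To make the two data movements concrete, here is a minimal sketch, assuming a Linux system and a hypothetical file at /data/testfile: a buffered read() first triggers a DMA transfer from the device into kernel page-cache buffers, then a memory copy from those buffers into the application buffer, and either step may cross a NUMA domain boundary.

```c
/* Sketch of the two data movements behind a buffered read on Linux.
 * The file path and buffer size are illustrative, not from the paper's testbed. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;              /* 1 MiB application buffer */
    char *app_buf = malloc(len);             /* may land in any NUMA domain */
    if (!app_buf)
        return 1;

    int fd = open("/data/testfile", O_RDONLY);   /* hypothetical file */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Step 1 (DMA transfer): the storage controller DMAs the blocks into
     * kernel page-cache pages; their NUMA placement is chosen by the kernel.
     * Step 2 (memory copy): read() copies from those kernel pages into
     * app_buf, possibly crossing the NUMA interconnect a second time. */
    ssize_t n = read(fd, app_buf, len);
    printf("read %zd bytes through the page cache\n", n);

    close(fd);
    free(app_buf);
    return 0;
}
```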

  13. The “Affinity” Space Exploration — The axes of the space are: — Transfer (TR) between I/O devices and kernel buffers. — Copy (CP) from the kernel buffer to the application buffer. — The four resulting configurations: TR Local / CP Local = TRLCPL; TR Local / CP Remote = TRLCPR; TR Remote / CP Local = TRRCPL; TR Remote / CP Remote = TRRCPR. [Figure: example scenarios for TRLCPL, TRLCPR, and TRRCPR.]
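
  One way to classify the Transfer (TR) axis at runtime is to compare the NUMA node reported for the storage device with the node of the CPU issuing the I/O. The sketch below assumes a Linux system where the block device exposes a numa_node attribute in sysfs (common for PCIe-attached controllers, though on some systems it sits on the PCI parent further up the sysfs hierarchy); the device name sda is purely illustrative.

```c
/* Sketch: classify the Transfer (TR) axis for a block device by comparing
 * the device's NUMA node with the node of the CPU this thread runs on.
 * Build with -lnuma; the device name is illustrative. */
#define _GNU_SOURCE
#include <numa.h>        /* numa_available(), numa_node_of_cpu() */
#include <sched.h>       /* sched_getcpu() */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* NUMA node of the controller behind /dev/sda; -1 means the kernel
     * reports no affinity information for this device. */
    int dev_node = -1;
    FILE *f = fopen("/sys/block/sda/device/numa_node", "r");
    if (f) {
        if (fscanf(f, "%d", &dev_node) != 1)
            dev_node = -1;
        fclose(f);
    }

    int cpu_node = numa_node_of_cpu(sched_getcpu());

    printf("device node %d, CPU node %d -> TR%s\n",
           dev_node, cpu_node,
           (dev_node < 0 || dev_node == cpu_node) ? "L (local)" : "R (remote)");
    return 0;
}
```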

  14. NUMA Memory Allocation — How are application buffers allocated? — In the domain in which the process is executing. — The numactl utility allows pinning application processes and their memory buffers to a particular domain. — How are kernel buffers allocated? — Kernel buffer allocation cannot be controlled using numactl. — I/O buffers in the kernel are shared by all contexts performing I/O in the kernel. — How to maximize the likelihood of keeping kernel buffers in a particular socket? — Start each experiment with a clean buffer cache. — Large DRAM per socket compared to the application datasets.
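
  The following is a minimal sketch of doing in code what the slide does with numactl: pin the calling process to one NUMA domain and allocate the application buffer from that domain's memory using libnuma. The node number and buffer size are illustrative.

```c
/* Sketch: pin execution and the application buffer to one NUMA domain,
 * much like numactl --cpunodebind/--membind.  Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    const int    node = 0;                    /* illustrative domain */
    const size_t len  = 64UL << 20;           /* 64 MiB application buffer */

    numa_run_on_node(node);                   /* restrict CPUs to this node */
    char *buf = numa_alloc_onnode(len, node); /* memory from this node */
    if (!buf) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(buf, 0, len);   /* touch the pages so they are actually placed */

    /* ... perform I/O into buf here ... */

    numa_free(buf, len);
    return 0;
}
```

  As the slide notes, kernel (page-cache) buffers cannot be placed this way; the experiments instead rely on a clean buffer cache and ample DRAM per socket so that kernel buffers tend to stay in the intended domain.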

  15. Evaluation Methodology — Bandwidth characterization of the evaluation testbed: — Total memory bandwidth = 24 GB/s (12 GB/s per domain). — Inter-domain interconnect = 24 GT/s. — 2 storage controllers per domain = 3 GB/s per domain. — SSD read throughput = 250 MB/s; 12 SSDs = 3 GB/s per domain. — Total storage throughput = 6 GB/s.
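
  As a quick sanity check of these figures, the arithmetic below aggregates the per-SSD read throughput into per-domain and total storage bandwidth and compares it with the stated memory bandwidth; it is only a sketch restating the slide's numbers.

```c
/* Sketch: aggregate the testbed's storage bandwidth from per-SSD numbers
 * and compare it with the stated memory bandwidth. */
#include <stdio.h>

int main(void)
{
    const double ssd_read_mb_s   = 250.0;  /* per SSD, from the slide */
    const int    ssds_per_domain = 12;
    const int    domains         = 2;
    const double mem_bw_gb_s     = 24.0;   /* total, from the slide */

    double storage_per_domain_gb_s = ssd_read_mb_s * ssds_per_domain / 1000.0;
    double storage_total_gb_s      = storage_per_domain_gb_s * domains;

    printf("storage: %.1f GB/s per domain, %.1f GB/s total (memory: %.1f GB/s)\n",
           storage_per_domain_gb_s, storage_total_gb_s, mem_bw_gb_s);
    return 0;
}
```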

  16. Benchmarks, Applications and Datasets — Benchmarks and applications: — zmIO: in-house microbenchmark. — fsmark: a filesystem stress benchmark. — stream: a streaming workload. — psearchy: a file indexing application (part of MOSBENCH). — IOR: application checkpointing. — Workload configuration and datasets: — Four software RAID-0 devices, each on top of 6 SSDs and 1 storage controller. — One workload instance per RAID device. — Datasets consist of large files, with parameters that result in high concurrency, high read/write throughput, and stressing of the system stack.

  17. Evaluation Metrics — Problems with high-level application metrics: — Not possible to map them to the actual volume of data transmitted. — Not possible to look at individual components of complex software stacks. — It is important to look at the individual components of execution time (user, system, idle, and iowait). — Cycles per I/O: — Physical cycles consumed by the application during execution divided by the number of I/O operations. — Can be used as an efficiency metric. — Can be converted to energy per I/O.
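
  A sketch of the cycles-per-I/O metric as described above, under the assumption that it is simply the non-idle CPU cycles consumed during the run divided by the number of completed I/O operations; the core count, frequency, utilization, and I/O count below are made-up inputs, not measurements from the paper.

```c
/* Sketch: cycles per I/O = non-idle CPU cycles during the run / I/O count.
 * All inputs are illustrative values. */
#include <stdio.h>

int main(void)
{
    const double cores     = 16.0;
    const double freq_hz   = 2.4e9;   /* 2.4 GHz */
    const double elapsed_s = 120.0;   /* wall-clock run time */
    const double busy_frac = 0.65;    /* user + system fraction (non-idle) */
    const double io_ops    = 3.0e6;   /* completed I/O operations */

    double busy_cycles   = cores * freq_hz * elapsed_s * busy_frac;
    double cycles_per_io = busy_cycles / io_ops;

    printf("cycles per I/O: %.0f\n", cycles_per_io);
    /* Multiplying by energy per cycle would give energy per I/O,
     * the conversion mentioned on the slide. */
    return 0;
}
```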

  18. Results – mem_stress and zmIO — Remote memory accesses (mem_stress): — Memory throughput drops by one half. — The degradation starts from one instance of mem_stress. — Remote transfers (zmIO): — Device throughput drops by one half. — The throughput is the same for one instance. — Contention is a possible culprit for two and more instances. — Round-robin assignment of instances to RAID devices. [Figures: mem_stress throughput (GB/s), local vs. remote, for 1 to 6 instances; zmIO throughput (GB/s), TRLCPL vs. TRRCPR, for 1 to 8 instances.]

  19. Results – fsmark and psearchy — fsmark is filesystem intensive: — Remote transfers result in 40% higher system time. — 130% increase in iowait time. — psearchy is both filesystem and I/O intensive: — 57% increase in system time. — 70% increase in iowait time. [Figures: cycles per I/O sector (system and iowait) for fsmark and psearchy under TRLCPL, TRRCPR, TRRCPL, and TRLCPR.]

  20. Results - IOR — IOR is both a read- and write-intensive benchmark. — 15% decrease in read throughput due to remote transfers and memory copies. — 20% decrease in write throughput due to remote transfers and copies. [Figure: IOR read and write throughput (MB/s) under TRLCPL, TRRCPR, TRRCPL, and TRLCPR.]

  21. Results - stream — 24 SSDs are divided into two domains. — Each set of 12 SSDs is connected to two controllers. — Ideally, symmetric throughput is expected. — Remote transfers result in a 27% drop in throughput for one of the sets. [Figure: stream throughput (MB/s) for Set A and Set B under TRLCPL, TRRCPR, TRRCPL, and TRLCPR.]

  22. Conclusions — A mix of synthetic benchmarks and applications shows the potential of NUMA affinity to hurt I/O throughput. — Future systems will have increased heterogeneity, more domains, and higher bandwidths. — Today, NUMA affinity is not a problem for cores within a single processor socket; future processors with 100s of cores will have domains within a processor chip. — The range of performance degradation is important: different server configurations and runtime libraries result in throughput within a range. — Partitioning of system stacks based on sockets will become necessary.
