When NVMe over Fabrics Meets Arm: Performance and Implications


  1. When NVMe over Fabrics Meets Arm: Performance and Implications. Yichen Jia*, Eric Anger†, Feng Chen* (*Louisiana State University, †Arm Inc). MSST'19, May 23, 2019

  2. Table of Contents • Background • Experimental Setup • Experimental Results • System Implications • Conclusions

  3. Background: Arm, NVMe, and NVMe over Fabrics

  4. Background: Arm Processors • Arm processors have become dominant in IoT devices, mobile phones, etc. • Recently released 64-bit Arm CPUs are suitable for cloud and data centers • Arm-based instances have been available in Amazon AWS since November 2018 • One important application is serving as a storage server - Enhanced computing capability and power efficiency

  5. Background: NVM Express • Flash-based SSDs are becoming cheaper and more popular - High throughput and low latency - Suitable for parallel I/Os • Non-Volatile Memory Express (NVMe) - Supports deep, paired submission/completion queues - Scalable for next-generation NVM [Figure: NVMe structure*] *https://nvmexpress.org/about/nvm-express-overview/
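To make the "deep and paired queues" point concrete, here is a toy Python sketch (purely illustrative, not NVMe driver code; all names and values are assumptions) of per-core submission/completion queue pairs, the structure that lets many independent I/O streams proceed in parallel without contending on a single queue.

```python
# Toy model of NVMe's paired queues: each core (or job) gets its own
# submission queue (SQ) and completion queue (CQ), so deep, parallel I/O
# never funnels through one shared, lock-protected queue.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class QueuePair:
    sq: deque = field(default_factory=deque)   # host -> device commands
    cq: deque = field(default_factory=deque)   # device -> host completions

    def submit(self, command: str) -> None:
        self.sq.append(command)                # host enqueues a command

    def drain(self) -> None:
        while self.sq:                         # "device" consumes commands and
            self.cq.append(("ok", self.sq.popleft()))  # posts completions

# One queue pair per core/job in this toy example.
pairs = [QueuePair() for _ in range(4)]
for i, qp in enumerate(pairs):
    qp.submit(f"read lba={i * 8} len=8")
for qp in pairs:
    qp.drain()
print([list(qp.cq) for qp in pairs])
```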

  6. Background: NVMe-over-Fabrics • Direct Attached Storage (DAS) - Computing and storage in one box - Less flexible, hard to scale, etc. • Storage Disaggregation - Separated computing and storage - Reduced total cost of ownership (TCO) - Improved hardware utilization - Examples: NVMe over Fabrics, iSCSI [Figure: DAS vs. NVMe-over-Fabrics I/O stacks; the NVMeoF path spans the host side (file system, block layer, NVMe Fabrics initiator, RDMA driver, NIC with RDMA) and the target side (NVMe Fabrics target, block layer, NVMe PCIe driver, PCI Express, NVMe SSD), connected over Ethernet]
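As a rough sketch of the host-side (initiator) step implied by the figure, the snippet below attaches a remote NVMeoF subsystem over RDMA using nvme-cli via Python's subprocess; the target address and subsystem NQN are hypothetical placeholders, not values from the talk.

```python
# Minimal sketch (hypothetical address and NQN): connect the host-side
# initiator to a remote NVMe-over-Fabrics subsystem over an RDMA transport.
import subprocess

TARGET_ADDR = "192.168.1.100"                     # hypothetical target RDMA NIC address
TARGET_NQN = "nqn.2019-05.example:nvmeof-target"  # hypothetical subsystem NQN

subprocess.run(
    ["nvme", "connect",
     "-t", "rdma",        # transport: RDMA (RoCEv2 in this kind of testbed)
     "-a", TARGET_ADDR,   # target address (traddr)
     "-s", "4420",        # conventional NVMeoF service port
     "-n", TARGET_NQN],   # subsystem NQN exported by the target
    check=True)
# After connecting, the remote namespace shows up as a local block device
# (e.g., /dev/nvme1n1) and can be used like a directly attached SSD.
```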



  11. Motivations • Continuous investment in Arm-based solutions • Increasingly popular NVMe over Fabrics • Integrating Arm with NVMeoF is highly appealing • However, first-hand, comprehensive experimental data is still lacking. A thorough performance study of NVMeoF on Arm is becoming necessary.

  12. Experimental Setup

  13. Experimental Setup • Target Side: Broadcom 5880X Stingray - CPU: 8-core 3GHz ARMv8 Cortex-A72 - Memory: 48GB - Storage: Intel Data Center P3600 SSD - Network: Broadcom NetXtreme NIC • Host Side: Lenovo ThinkCentre M910s - CPU: Intel 4-core (HT) i7-6700 3.40GHz - Memory: 16GB - Network: Broadcom NetXtreme NIC • The host and target machines are connected by a Leoni ParaLink@23 cable • Link speed on both host and target sides is configured to 50Gb/s • Benchmarking tool: FIO • RoCEv2 performance (Server/Client): Arm/x86: bandwidth 45.42 Gb/s, latency 3.26 us; x86/Arm: bandwidth 45.40 Gb/s, latency 3.17 us
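Since the benchmarking tool is FIO, one of these measurements could be scripted roughly as below. This is a hedged sketch, not the authors' harness: the device path, runtime, and libaio engine are assumptions, while the 4KB sequential-read, 128-IODepth parameters mirror the workload footnotes on the result slides.

```python
# Minimal sketch (assumptions: device path, runtime, libaio engine) of one
# FIO measurement against the NVMeoF namespace exposed on the host side.
import json
import subprocess

DEVICE = "/dev/nvme1n1"   # hypothetical: remote NVMeoF namespace (use the local SSD for the DAS baseline)

def run_fio(rw="read", bs="4k", iodepth=128, numjobs=1, runtime=30):
    """Run one fio job and return the parsed per-job JSON result."""
    cmd = ["fio", "--name=nvmeof_bench",
           f"--filename={DEVICE}",
           f"--rw={rw}", f"--bs={bs}",
           f"--iodepth={iodepth}", f"--numjobs={numjobs}",
           "--ioengine=libaio", "--direct=1",
           "--time_based", f"--runtime={runtime}",
           "--group_reporting", "--output-format=json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)["jobs"][0]

if __name__ == "__main__":
    job = run_fio()
    bw_mbs = job["read"]["bw"] / 1024                 # fio reports bandwidth in KiB/s
    clat_us = job["read"]["clat_ns"]["mean"] / 1000   # completion latency reported in ns
    print(f"bandwidth: {bw_mbs:.1f} MB/s, mean completion latency: {clat_us:.1f} us")
```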

  14. Experimental Results

  15. Experiments • Effect of Parallelism • Study of Computational Cost • Effect of IODepth • Effect of Request Sizes


  17. Parallelism Feature in NVMe • Parallel I/Os play an important role in NVMe to fully exploit hardware potential • I/O parallelism will also have a great impact on NVMe-over-Fabrics [Figure: NVMe structure*] *https://nvmexpress.org/about/nvm-express-overview/

  18. Finding #1: Effect of Parallelism 1. Latency increases as the number of jobs increases 2. NVMeoF has a close or shorter tail latency than local NVMe for sequential reads 3. Bandwidth reaches a plateau when the job count reaches 4 4. CPU utilization on the target side is much lower 5. Arm is powerful enough to serve as a storage server *Sequential Read, 4KB, 128 IODepth, 1-16 Concurrent Jobs
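A hedged, standalone sketch of the corresponding parallelism sweep follows; the device path and runtime are assumptions, while the workload parameters come from the footnote above.

```python
# Sketch of the Finding #1 sweep: 4KB sequential reads at IODepth 128,
# varying the number of concurrent jobs from 1 to 16.
# The device path and runtime are assumptions, not the authors' job files.
import json
import subprocess

DEVICE = "/dev/nvme1n1"   # hypothetical NVMeoF namespace (or the local SSD for comparison)

for numjobs in (1, 2, 4, 8, 16):
    cmd = ["fio", "--name=parallelism", f"--filename={DEVICE}",
           "--rw=read", "--bs=4k", "--iodepth=128", f"--numjobs={numjobs}",
           "--ioengine=libaio", "--direct=1", "--time_based", "--runtime=30",
           "--group_reporting", "--output-format=json"]
    read = json.loads(subprocess.run(cmd, capture_output=True, text=True,
                                     check=True).stdout)["jobs"][0]["read"]
    print(f"jobs={numjobs:2d}  bw={read['bw'] / 1024:8.1f} MB/s"
          f"  mean clat={read['clat_ns']['mean'] / 1000:8.1f} us")
# Expectation from the slide: latency grows with the job count, while bandwidth
# plateaus once about 4 jobs saturate the device/link.
```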


  24. Finding #2: Computational Cost 1. NVMeoF consumes 31.5% more CPU on the host side than local NVMe 2. Kernel-level overhead is dominant (26.9%) when the request size is 4KB 3. Kernel-level overheads are amortized as the request size increases *Random Write, 4KB-128KB, 8 Concurrent Jobs, 128 IODepth
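The CPU-cost trend could be approximated with fio's own per-job CPU accounting (the usr_cpu/sys_cpu percentages in its JSON output), as in the sketch below. This is an illustrative stand-in, not the authors' measurement method, and the device path and runtime are assumptions.

```python
# Sketch: random writes from 4KB to 128KB with 8 jobs at IODepth 128,
# reporting bandwidth plus fio's user/system CPU percentages per job.
# Hypothetical harness; the paper may have measured CPU cost differently.
import json
import subprocess

DEVICE = "/dev/nvme1n1"   # hypothetical NVMeoF namespace

for bs in ("4k", "16k", "64k", "128k"):
    cmd = ["fio", "--name=cpu_cost", f"--filename={DEVICE}",
           "--rw=randwrite", f"--bs={bs}", "--iodepth=128", "--numjobs=8",
           "--ioengine=libaio", "--direct=1", "--time_based", "--runtime=30",
           "--group_reporting", "--output-format=json"]
    job = json.loads(subprocess.run(cmd, capture_output=True, text=True,
                                    check=True).stdout)["jobs"][0]
    print(f"bs={bs:>5}  bw={job['write']['bw'] / 1024:8.1f} MB/s"
          f"  usr_cpu={job['usr_cpu']:5.1f}%  sys_cpu={job['sys_cpu']:5.1f}%")
# Expectation from the slide: the kernel-side (sys) share dominates at 4KB and
# is amortized as the request size increases.
```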


  26. IODepth is important for NVMeoF [Figure: NVMe and RDMA queues]

  27. Finding #3: Effect of IODepth • When IODepth is small, local access has a shorter tail latency than remote access • When IODepth is large, remote access has a shorter tail latency than local access *Sequential Read, 4KB, 8 Concurrent Jobs, 1-128 IODepth, one Arm core
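A similar hedged sketch of the IODepth sweep, pulling the 99th-percentile completion latency from fio's JSON output; the device path and runtime are assumptions, the percentile key assumes a recent fio release, and the target-side single-core pinning from the footnote would be done separately on the target machine.

```python
# Sketch: 4KB sequential reads with 8 jobs, sweeping IODepth from 1 to 128 and
# reporting the 99th-percentile completion latency from fio's JSON output.
# Hypothetical harness; target-side single-core pinning is not shown here.
import json
import subprocess

DEVICE = "/dev/nvme1n1"   # hypothetical NVMeoF namespace

for iodepth in (1, 4, 16, 64, 128):
    cmd = ["fio", "--name=iodepth_sweep", f"--filename={DEVICE}",
           "--rw=read", "--bs=4k", f"--iodepth={iodepth}", "--numjobs=8",
           "--ioengine=libaio", "--direct=1", "--time_based", "--runtime=30",
           "--group_reporting", "--output-format=json"]
    read = json.loads(subprocess.run(cmd, capture_output=True, text=True,
                                     check=True).stdout)["jobs"][0]["read"]
    p99_us = read["clat_ns"]["percentiles"]["99.000000"] / 1000
    print(f"iodepth={iodepth:3d}  p99 completion latency={p99_us:10.1f} us")
```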

