

  1. ReFlex: Remote Flash ≈ Local Flash
     Ana Klimovic, Heiner Litz, Christos Kozyrakis
     NVMW’18 Memorable Paper Award Finalist

  2. Flash in Datacenters
     • Flash provides 1000× higher throughput and 100× lower latency than disk
     • PCIe Flash:
       – 1,000,000 IOPS
       – 70 µs read latency
     • Flash is often underutilized due to imbalanced resource requirements
     • Solution: share the SSD between remote tenants

  3. Existing Approaches
     • Remote access to disk (e.g. iSCSI)
     • Remote access to DRAM or NVMe over RDMA
     • There are two main issues:
       1. Performance overhead
       2. Interference on the shared remote Flash device

  4. Issue 1: Performance Overhead
     [Figure: p95 read latency (µs) vs. IOPS (thousands) for 4 kB random reads, comparing Local Flash, iSCSI (1 core), and libaio+libevent (1 core); the remote paths show a 4× throughput drop and a 2× latency increase over local Flash]
     • Traditional network storage protocols and Linux I/O libraries (e.g. libaio, libevent) have high overhead
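     A large part of that overhead comes from the per-request kernel crossings in the traditional path. As a rough illustration only (not code from the talk), the minimal libaio sketch below issues a single 4 kB direct read and already pays two system calls plus an interrupt-driven wakeup per I/O; the device path is hypothetical and error handling is omitted.

```c
/* Baseline Linux async I/O path the slide refers to: every request crosses
 * the kernel via io_submit() and io_getevents(). Build with -laio.
 * Hypothetical device path; error handling omitted for brevity. */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* assumed device */
    io_context_t ctx = 0;
    io_setup(128, &ctx);                                  /* kernel AIO context */

    void *buf;
    posix_memalign(&buf, 4096, 4096);                     /* O_DIRECT needs alignment */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);                 /* one 4 kB read at offset 0 */
    io_submit(ctx, 1, cbs);                               /* syscall #1 */

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);                   /* syscall #2 (+ IRQ wakeup) */
    printf("read %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}
```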

  5. Issue 2: Performance Interference
     [Figure: p95 read latency (µs) vs. total IOPS (thousands) for mixes from 50% to 100% reads; latency depends on IOPS load, and writes impact read tail latency]
     • To share Flash, we need to enforce performance isolation

  6. How does ReFlex achieve high performance? Linux vs. ReFlex
     [Diagram: the Linux remote-storage stack (application, filesystem, block I/O, device driver, interrupt-driven network and Flash hardware) next to the ReFlex stack (application over a user-space control plane and data plane with direct access to the network interface and Flash storage)]

  7. How does ReFlex achieve high performance? Linux vs. ReFlex
     • Remove software bloat by separating the control and data planes
     [Same Linux vs. ReFlex stack diagram]

  8. How does ReFlex achieve high performance? Linux vs. ReFlex
     • Direct access to hardware through DPDK (network) and SPDK (Flash)
     • One data plane per CPU core
     [Same Linux vs. ReFlex stack diagram]

  9. How does ReFlex achieve high performance? Linux vs. ReFlex
     • Polling instead of interrupts
     [Same Linux vs. ReFlex stack diagram]

  10. How does ReFlex achieve high performance? Linux vs. ReFlex
     • Run-to-completion processing
     [Same Linux vs. ReFlex stack diagram]

  11. How does ReFlex achieve high performance? Linux vs. ReFlex
     • Adaptive batching
     [Same Linux vs. ReFlex stack diagram]

  12. How does ReFlex achieve high performance? Linux vs. ReFlex
     • Zero-copy, device-to-device data transfers
     [Same Linux vs. ReFlex stack diagram, with the techniques from the preceding slides numbered on the figure]
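     Taken together, slides 6–12 describe a per-core, polling, run-to-completion data plane with adaptive batching. The sketch below shows that structure in outline; it is not the ReFlex source, and poll_nic_rx(), submit_nvme_read(), poll_nvme_cq(), and send_response() are hypothetical stand-ins (stubbed here) for the DPDK/SPDK calls a real data plane would make.

```c
/* Sketch of a per-core, run-to-completion data plane in the style of slides
 * 6-12. Not the ReFlex source: the four helpers are no-op stand-ins for
 * DPDK/SPDK driver calls so the sketch compiles. One such loop runs per core. */
#include <stdint.h>

#define MAX_BATCH 64

struct request { uint64_t lba; uint32_t len; void *payload; };

/* Hypothetical driver hooks (stubs). */
static int  poll_nic_rx(struct request *r, int max)   { (void)r; (void)max; return 0; }
static void submit_nvme_read(const struct request *r) { (void)r; }
static int  poll_nvme_cq(struct request *r, int max)  { (void)r; (void)max; return 0; }
static void send_response(const struct request *r)    { (void)r; }

void dataplane_loop(void) {
    int batch = 1;                              /* adaptive batch size */
    struct request rx[MAX_BATCH], done[MAX_BATCH];

    for (;;) {                                  /* busy-poll: no interrupts, no blocking */
        int n = poll_nic_rx(rx, batch);         /* poll the NIC for new requests         */
        for (int i = 0; i < n; i++)
            submit_nvme_read(&rx[i]);           /* run to completion: each request is    */
                                                /* fully processed before moving on      */
        int c = poll_nvme_cq(done, MAX_BATCH);  /* poll NVMe completions                 */
        for (int i = 0; i < c; i++)
            send_response(&done[i]);            /* reply without copying the payload     */

        /* Adaptive batching: grow the batch under load to amortize per-call
         * overheads, shrink it when idle to keep tail latency low. */
        if (n == batch && batch < MAX_BATCH)
            batch *= 2;
        else if (n == 0 && batch > 1)
            batch /= 2;
    }
}
```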

  13. How does ReFlex enable performance isolation?
     • Request-cost-based scheduling
     • Determine the impact of tenant A on the tail latency and IOPS of tenant B
     • The control plane assigns each tenant a quota
     • The data plane enforces quotas through throttling (see the sketch below)
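     One common way to realize "the control plane assigns a quota, the data plane throttles" is a per-tenant token bucket refilled at the quota rate and charged per request. The following is a minimal sketch of that pattern under assumed names, not the ReFlex scheduler itself.

```c
/* Per-tenant token bucket: the control plane sets rate_tokens_per_sec,
 * the data plane charges each request its token cost and defers requests
 * that would exceed the quota. Hypothetical structure, not ReFlex source. */
#include <stdint.h>
#include <stdbool.h>

struct token_bucket {
    double tokens;               /* currently available tokens          */
    double rate_tokens_per_sec;  /* quota assigned by the control plane */
    double burst;                /* maximum accumulation                */
    uint64_t last_refill_ns;
};

static void refill(struct token_bucket *tb, uint64_t now_ns) {
    double elapsed = (now_ns - tb->last_refill_ns) / 1e9;
    tb->tokens += elapsed * tb->rate_tokens_per_sec;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
    tb->last_refill_ns = now_ns;
}

/* Returns true if the request may be issued now; otherwise the data plane
 * queues it and retries on a later polling iteration (throttling). */
static bool admit_request(struct token_bucket *tb, double cost, uint64_t now_ns) {
    refill(tb, now_ns);
    if (tb->tokens < cost)
        return false;
    tb->tokens -= cost;
    return true;
}
```

     The per-request cost fed into admit_request() is where the read/write weighting described on the next slide comes in.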

  14. Request Cost Modeling
     • Compensate for read/write asymmetry; for this device, a write costs as much as 10 reads
     [Figure: two panels of p95 read latency (µs), one against weighted IOPS (×10³ tokens/s) and one against total IOPS (thousands), for read fractions from 50% to 100%]
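     Concretely, each request can be charged in tokens, with writes weighted by the measured asymmetry so that reads and writes draw from a single budget. In the sketch below, only the 10× read/write ratio comes from the slide; the 4 kB accounting unit and the rounding are assumptions for illustration.

```c
/* Weighted request cost: charge tokens per request so that reads and writes
 * share one budget. The 4 kB unit is an assumption; the weight of 10 is the
 * device-specific asymmetry quoted on the slide ("Write == 10x Read"). */
#include <stdint.h>

#define UNIT_BYTES        4096   /* cost accounting unit (assumed)              */
#define WRITE_COST_WEIGHT 10.0   /* device-specific; determined by calibration  */

enum io_op { IO_READ, IO_WRITE };

static double request_cost_tokens(enum io_op op, uint32_t bytes) {
    double units = (double)(bytes + UNIT_BYTES - 1) / UNIT_BYTES;  /* round up */
    return (op == IO_READ) ? units : units * WRITE_COST_WEIGHT;
}
```

     A scheduler like the token-bucket sketch above would subtract request_cost_tokens() from the tenant's budget for every admitted request.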

  15. Request Cost Based Scheduling

  16. Request Cost Based Scheduling
     • 1 ms tail latency SLO

  17. Request Cost Based Scheduling
     • 1 ms tail latency SLO
     • Device max IOPS within that SLO: 510K

  18. Request Cost Based Scheduling
     • 1 ms tail latency SLO
     • Device max IOPS within that SLO: 510K
     • Tenant IOPS SLO: 200K

  19. Request Cost Based Scheduling
     • 1 ms tail latency SLO
     • Device max IOPS within that SLO: 510K
     • Tenant IOPS SLO: 200K
     • Slack: 310K IOPS
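     Putting the numbers from slides 16–19 together: the device sustains 510K IOPS within the 1 ms tail-latency SLO, a latency-critical tenant has reserved 200K IOPS, and the remaining 310K IOPS of slack can be shared among best-effort tenants. A small sketch of that bookkeeping (not the ReFlex control-plane code; the number of best-effort tenants and the even split are assumptions):

```c
/* Slack computation from the slides: 510K device IOPS at the 1 ms SLO,
 * minus 200K reserved, leaves 310K IOPS to hand to best-effort tenants. */
#include <stdio.h>

int main(void) {
    long device_max_iops   = 510000;  /* max IOPS the device sustains at the SLO */
    long reserved_iops     = 200000;  /* sum of latency-critical tenant SLOs     */
    int  best_effort_count = 2;       /* assumed number of best-effort tenants   */

    long slack = device_max_iops - reserved_iops;       /* 310K IOPS  */
    long per_best_effort = slack / best_effort_count;   /* even split */

    printf("slack = %ld IOPS, %ld IOPS per best-effort tenant\n",
           slack, per_best_effort);
    return 0;
}
```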

  20. Results: Local ≈ Remote Latency
     [Figure: p95 read latency (µs) vs. IOPS (thousands) for Local-1T, ReFlex-1T, Linux-1T, and Libaio-1T; Linux reaches 75K IOPS/core while ReFlex reaches 850K IOPS/core]

  21. Results: Local ≈ Remote Latency
     [Same figure as slide 20, annotated with read latencies]
     – Local Flash: 78 µs
     – ReFlex: 99 µs
     – Linux: 200 µs

  22. Results: Local ≈ Remote Latency
     [Figure: p95 read latency (µs) vs. IOPS (thousands) for Local, ReFlex, and Linux/libaio with one and two tenants (1T/2T); ReFlex saturates the Flash device]

  23. Results: Performance Isolation
     [Figure: read p95 latency (µs) and IOPS (thousands) for Tenant A (100% rd), Tenant B (80% rd), Tenant C (95% rd), and Tenant D (25% rd), with the I/O scheduler disabled vs. enabled, shown against each tenant's latency and IOPS SLOs]
     • Tenants A & B: latency-critical; Tenants C & D: best effort

  24. Results: Performance Isolation
     [Same figure as slide 23]
     • Tenants A & B: latency-critical; Tenants C & D: best effort
     • Without the scheduler: the latency and bandwidth QoS of A and B are violated

  25. Results: Performance Isolation
     [Same figure as slide 23]
     • Tenants A & B: latency-critical; Tenants C & D: best effort
     • Without the scheduler: the latency and bandwidth QoS of A and B are violated
     • The scheduler rate-limits best-effort tenants to enforce the SLOs

  26. ReFlex Summary
     1. Enables Flash disaggregation → improves utilization
        – Performance: remote ≈ local
        – Commodity networking, low CPU overhead
     2. Guarantees QoS in shared-resource deployments
        – Quality-of-Service-aware request scheduling

  27. Impact of ReFlex
     • Open source: https://github.com/stanford-mast/reflex
     • Works on AWS i3 cloud instances with NVMe Flash
     • Integrated as a remote Flash dataplane in the Apache Crail distributed storage system (collaboration with IBM Research)
     • Broadcom is porting ReFlex to an ARM-based SoC

  28. Thank You!
     Download the source code at: https://github.com/stanford-mast/reflex
     Original paper presented at ASPLOS’17.
