Flash Storage Disaggregation
Ana Klimovic, Christos Kozyrakis, Eno Thereska, Binu John, and Sanjeev Kumar
Flash is underutilized
• Flash provides higher throughput and lower latency than disk. PCIe Flash:
  – 100,000s of IOPS
  – 10s of µs latency
• Flash is underutilized in datacenters due to imbalanced resource requirements
Datacenter Flash Use-Case
[Diagram: Application Tier clients issue get(k) / put(k,val) over TCP/IP to Datastore Tier servers running a datastore service (key-value store) on hardware with CPU, RAM, NIC, and local Flash; get(k) is served from Flash.]
Imbalanced Resource Utilization
• Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months
[Figure: CPU, Flash capacity, and Flash IOPS utilization over the 6-month period]
Imbalanced Resource Utilization
• Flash capacity and IOPS are underutilized for long periods of time
Imbalanced Resource Utilization
• CPU and Flash utilization vary with separate trends
Local Flash Architecture
[Diagram: Application Tier clients connect over TCP/IP to Datastore Tier servers running the datastore service (key-value store) on CPU, RAM, NIC, and local Flash.]
Provision Flash and CPU in a dependent manner.
Disaggregated Flash Architecture
[Diagram: Application Tier clients connect over TCP/IP to Datastore Tier servers (datastore service, key-value store, CPU, RAM, NIC), which issue read(blk) / write(blk, data) over a protocol such as iSCSI to a Flash Tier running a remote block service on its own CPU, RAM, NIC, and Flash.]
Contributions
For real applications at Facebook, we analyze:
1. What is the performance overhead of remote Flash using existing protocols?
2. What optimizations improve performance?
3. When does disaggregating Flash lead to resource efficiency benefits?
Flash Workloads at Facebook
• Analyze IO patterns of real Flash-based Facebook applications
• Applications use RocksDB, a key-value store with a log-structured merge-tree architecture (see the sketch below)

                                   IOPS/TB      IO size
  Read  (lots of random reads)     2K – 10K     10KB – 50KB
  Write (large, bursty writes)     100 – 1K     500KB – 2MB
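These IO patterns follow directly from the log-structured merge-tree design: writes are buffered in memory and flushed to Flash as large sequential files, while reads fetch small blocks at effectively random offsets. The following is a minimal, hypothetical Python sketch of that behavior, not RocksDB's actual implementation; the buffer size, block size, and file layout are illustrative assumptions.

```python
# Minimal sketch of why an LSM-tree datastore issues small random reads but
# large, bursty sequential writes (illustrative only; not RocksDB's code).
import os
import random

MEMTABLE_LIMIT = 2 * 1024 * 1024   # flush in ~2MB batches (large, bursty writes)
READ_SIZE = 16 * 1024              # point lookups read small blocks (random reads)

class TinyLSM:
    def __init__(self, data_dir="lsm_data"):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.memtable = {}          # in-memory write buffer
        self.memtable_bytes = 0
        self.sst_files = []         # flushed, immutable files on Flash

    def put(self, key, value):
        self.memtable[key] = value
        self.memtable_bytes += len(key) + len(value)
        if self.memtable_bytes >= MEMTABLE_LIMIT:
            self._flush()           # one large sequential write per flush

    def _flush(self):
        path = os.path.join(self.data_dir, f"sst_{len(self.sst_files):06d}")
        with open(path, "wb") as f:
            for k in sorted(self.memtable):
                f.write(k.encode() + b"\0" + self.memtable[k] + b"\n")
        self.sst_files.append(path)
        self.memtable, self.memtable_bytes = {}, 0

    def read_random_block(self):
        # A point lookup typically touches a small block of one flushed file.
        path = random.choice(self.sst_files)
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            f.seek(random.randrange(max(size - READ_SIZE, 1)))
            return f.read(READ_SIZE)

if __name__ == "__main__":
    db = TinyLSM()
    for i in range(200_000):
        db.put(f"key{i}", os.urandom(64))   # many small puts -> a few large flushes
    for _ in range(100):
        db.read_random_block()              # gets -> small random reads
```

Running the sketch produces a handful of multi-megabyte sequential writes alongside many small random reads, matching the shape of the table above.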
Workload Analysis
[Diagram: mutilate load-generator clients in the Application Tier drive a Datastore Tier server running RocksDB behind an SSDB server wrapper over TCP/IP; the datastore accesses a Flash Tier running a remote block service over a block protocol.]
Workload Analysis
• iSCSI is a standard network storage protocol that transports block storage commands over TCP/IP (see the sketch below)
[Diagram: same setup as before, with iSCSI as the protocol between the datastore server and the Flash Tier's remote block service.]
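As a concrete illustration of how a datastore server attaches remote Flash over iSCSI, the sketch below drives the standard open-iscsi initiator tools from Python. It assumes a Linux host with iscsiadm installed; the portal address and target IQN are placeholders, not values from this work.

```python
# Sketch: attach a remote Flash device over iSCSI from a Linux initiator,
# assuming the open-iscsi tools are installed. The portal address and IQN
# below are placeholders.
import subprocess

PORTAL = "10.0.0.42:3260"                 # hypothetical Flash-tier server
TARGET = "iqn.2016-01.example:flash0"     # hypothetical target IQN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Discover targets exported by the remote Flash server.
run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL])

# Log in; the remote device then appears as a local block device (e.g. /dev/sdX)
# that the datastore can use exactly like a local Flash drive.
run(["iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL, "--login"])

# List active sessions to confirm the connection.
run(["iscsiadm", "-m", "session"])
```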
Workload Analysis
iSCSI:
√ Transparent to the application
√ Runs on commodity network
√ Scales datacenter-wide
Workload Analysis
• Measure round-trip latency at the client
[Setup: mutilate load-generator clients; datastore server (SSDB wrapper + RocksDB) with 6 cores, 4GB RAM, and 10Gb/E; Flash Tier server with Intel P3600 PCIe Flash and 10Gb/E.]
Unloaded Latency
• Remote access with iSCSI adds 260µs to p95 latency, tolerable for our target application (latency SLO ~5ms); a measurement sketch follows
[Figure: p95 read latency, local vs. remote Flash; +260µs for iSCSI]
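One way to reproduce this kind of unloaded-latency comparison is a queue-depth-1 random-read test against both the local device and the iSCSI-attached device. The sketch below uses fio, assuming it is installed; the device paths are placeholders.

```python
# Sketch: measure unloaded (queue depth 1) random-read latency on a local and a
# remote (iSCSI-attached) Flash device with fio. Device paths are placeholders.
import subprocess

DEVICES = {"local": "/dev/nvme0n1", "remote_iscsi": "/dev/sdb"}

for name, dev in DEVICES.items():
    subprocess.run([
        "fio",
        f"--name={name}",
        f"--filename={dev}",
        "--rw=randread",
        "--bs=4k",                     # small reads approximate pure device + network latency
        "--ioengine=libaio",
        "--direct=1",                  # bypass the page cache
        "--iodepth=1",                 # unloaded: a single outstanding IO
        "--numjobs=1",
        "--runtime=30",
        "--time_based",
        "--readonly",                  # extra safety when pointing fio at a raw device
        "--percentile_list=50:95:99",  # report median, p95, p99 completion latency
    ], check=True)
```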
Application Throughput
• 45% throughput drop with “out of the box” iSCSI Flash
• Need to optimize remote Flash server for higher throughput
[Figure: client latency (ms) vs. QPS (thousands) for local Flash and the iSCSI baseline (8 processes); 45% throughput drop]
Multi-process iSCSI
• Vary the number of iSCSI processes that issue IO (see the sketch below)
• Want enough parallelism, avoid scheduling interference
[Figure: client latency vs. QPS for local Flash, 1 iSCSI process, 6 iSCSI processes (optimal), and 8 iSCSI processes (default); 12% improvement]
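The knob tuned here is the number of iSCSI processes on the Flash server, and its exact configuration depends on the target implementation. As a hypothetical stand-in, the sketch below sweeps the number of worker processes issuing random reads against a block device, which exposes the same trade-off between parallelism and scheduling interference; the device path, span, and IO counts are assumptions, and a real measurement would use O_DIRECT or fio to avoid the page cache.

```python
# Sketch: sweep the number of IO-issuing processes against a block device to
# find where added parallelism stops helping (a stand-in for tuning the number
# of iSCSI processes; device path and sizes are assumptions).
import os
import time
from multiprocessing import Process

DEVICE = "/dev/sdb"        # hypothetical iSCSI-attached block device
BLOCK = 16 * 1024          # read size, roughly matching the workload's small reads
IOS_PER_WORKER = 20_000
SPAN = 64 * 1024**3        # assume at least 64GB of the device is addressable

def worker(seed):
    import random
    rng = random.Random(seed)
    fd = os.open(DEVICE, os.O_RDONLY)
    for _ in range(IOS_PER_WORKER):
        offset = rng.randrange(SPAN // BLOCK) * BLOCK
        os.pread(fd, BLOCK, offset)   # buffered read; a real test would use O_DIRECT / fio
    os.close(fd)

if __name__ == "__main__":
    for nproc in (1, 4, 6, 8, 12):
        procs = [Process(target=worker, args=(i,)) for i in range(nproc)]
        start = time.time()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        iops = nproc * IOS_PER_WORKER / (time.time() - start)
        print(f"{nproc:2d} processes: {iops:,.0f} IOPS")
```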
NIC offloads
• Enable NIC offloads for TCP segmentation (TSO/LRO) to reduce CPU load on the Flash server and the datastore server (see the sketch below)
[Figure: client latency vs. QPS adding the NIC offload curve; 8% further improvement]
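On Linux, these offloads can be toggled with ethtool on both the datastore server and the remote Flash server. A minimal sketch, assuming ethtool is available; the interface name is a placeholder.

```python
# Sketch: enable TCP segmentation/receive offloads on the NIC (run on both the
# datastore server and the remote Flash server). Assumes a Linux host with
# ethtool; the interface name is a placeholder.
import subprocess

IFACE = "eth0"   # hypothetical 10GbE interface

# Turn on TSO (TCP segmentation offload) and LRO (large receive offload) so the
# segmentation work is done by the NIC instead of burning host CPU cycles.
subprocess.run(["ethtool", "-K", IFACE, "tso", "on", "lro", "on"], check=True)

# Verify which offloads are active.
subprocess.run(["ethtool", "-k", IFACE], check=True)
```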
Jumbo Frames
• Jumbo frames further reduce overhead by avoiding segmentation altogether (max MTU 9kB); see the sketch below
[Figure: client latency vs. QPS adding the jumbo-frame curve; 10% further improvement]
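Jumbo frames are enabled by raising the interface MTU on both hosts (and on the switch path between them). A minimal sketch using iproute2, with a placeholder interface name:

```python
# Sketch: enable jumbo frames (9000-byte MTU) on both ends of the iSCSI
# connection. Assumes a Linux host with iproute2 and a network path configured
# to pass jumbo frames; the interface name is a placeholder.
import subprocess

IFACE = "eth0"   # hypothetical 10GbE interface

# Raise the MTU so most iSCSI data transfers fit in a single frame,
# avoiding per-packet segmentation and reassembly work.
subprocess.run(["ip", "link", "set", "dev", IFACE, "mtu", "9000"], check=True)

# Confirm the new MTU.
subprocess.run(["ip", "link", "show", "dev", IFACE], check=True)
```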
Interrupt Affinity Tuning
• Steer NIC interrupts to the core handling the TCP connection and Flash interrupts to the cores issuing IO commands (see the sketch below)
[Figure: client latency vs. QPS adding the interrupt-affinity curve; 4% further improvement]
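On Linux this steering can be done by writing CPU masks to /proc/irq/&lt;irq&gt;/smp_affinity. The sketch below is one hedged way to do it; the NIC and NVMe queue name substrings and the core assignments are placeholders, and it assumes root privileges with irqbalance disabled.

```python
# Sketch: pin a device's interrupts to chosen CPU cores by writing a CPU mask
# to /proc/irq/<irq>/smp_affinity. Assumes a Linux host, root privileges, and
# irqbalance disabled; device name substrings and core choices are placeholders.
import re

def irqs_for(device_substring):
    """Find IRQ numbers whose /proc/interrupts line mentions the device."""
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            if device_substring in line:
                m = re.match(r"\s*(\d+):", line)
                if m:
                    irqs.append(int(m.group(1)))
    return irqs

def pin_irq(irq, cpu):
    """Steer one IRQ to a single CPU (mask is a hex bitmap of allowed CPUs)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(format(1 << cpu, "x"))

# Steer NIC queue interrupts to the core running the TCP/iSCSI processing...
for irq in irqs_for("eth0"):                             # hypothetical NIC name
    pin_irq(irq, cpu=0)

# ...and Flash (NVMe) interrupts to the cores issuing the IO commands.
for cpu_offset, irq in enumerate(irqs_for("nvme0q")):    # hypothetical NVMe queue names
    pin_irq(irq, cpu=2 + (cpu_offset % 4))
```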
Optimized Application Throughput
• With all optimizations combined, throughput improves by 42% over the unoptimized iSCSI baseline
[Figure: client latency vs. QPS for local Flash, interrupt affinity, jumbo frame, NIC offload, iSCSI with 6 processes, and iSCSI baseline (8 processes); 42% total improvement]
Application Throughput
• 20% drop in application throughput, on average
[Figure: average and p95 client latency vs. QPS (thousands) for local and remote Flash (local_avg, remote_avg, local_p95, remote_p95); 20% drop at the average]
Application Throughput
• At the tail, the overhead of remote access is masked by other factors like write interference on Flash
[Figure: same curves; 20% throughput drop at the average, 10% drop at the p95 tail]
Sharing Remote Flash
• Sharing Flash among 2 or more tenants leads to more write interference → degrades tail performance
[Figure: average and p95 client latency vs. QPS (thousands), local vs. shared remote Flash; 20% drop on average, 25% drop at the tail]
Disaggregation Benefits
• Make up for throughput loss by cost-effectively scaling resources with disaggregation
• Improve overall resource utilization
• Formulate cost model to quantify benefits
Resource Savings
• Resource savings of the disaggregated vs. local Flash architecture as application requirements scale
[Figure: % cost benefit of disaggregation (-10% to 40%) as a function of the compute intensity scaling factor and the storage capacity scaling factor; the region of balanced CPU & Flash utilization is marked]
Resource Savings
• When storage scales at a higher rate than compute, save resources by deploying Flash without as much CPU
[Figure: same plot; in the high storage-scaling region, deploying more Flash servers than compute servers yields the largest cost benefit]
Resource Savings
• When compute and storage demands remain balanced, there is no benefit from disaggregation (a toy version of the cost model is sketched below)
[Figure: same plot; the balanced CPU & Flash utilization region shows little to no cost benefit]
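The comparison behind these plots can be captured with a simple provisioning model: count the servers each architecture needs to meet scaled compute and storage demands and compare total cost. The sketch below is a toy version of such a model; all server specs, prices, and the assumed 25% remote-access compute overhead are made-up illustrative numbers, not the paper's actual cost model.

```python
# Sketch: a toy provisioning model contrasting local vs. disaggregated Flash as
# compute and storage demands scale independently. All server specs, prices,
# and the remote-access overhead below are illustrative assumptions.
import math

BASE_CORES = 4000        # application compute demand at scale factor 1.0
BASE_FLASH_TB = 1000     # application Flash capacity demand at scale factor 1.0
REMOTE_OVERHEAD = 1.25   # extra compute to absorb the ~20% remote-Flash throughput loss

LOCAL_SERVER = {"cores": 16, "flash_tb": 4, "cost": 12_000}   # CPU + local PCIe Flash
COMPUTE_SERVER = {"cores": 16, "cost": 7_500}                 # CPU only
FLASH_SERVER = {"flash_tb": 16, "cost": 12_000}               # dense remote-Flash box

def local_cost(cpu_scale, storage_scale):
    # Local Flash: one server type, so provision for whichever demand is larger.
    n = max(math.ceil(BASE_CORES * cpu_scale / LOCAL_SERVER["cores"]),
            math.ceil(BASE_FLASH_TB * storage_scale / LOCAL_SERVER["flash_tb"]))
    return n * LOCAL_SERVER["cost"]

def disagg_cost(cpu_scale, storage_scale):
    # Disaggregated: scale compute and Flash tiers independently, but pay the
    # remote-access overhead in extra compute.
    n_compute = math.ceil(BASE_CORES * cpu_scale * REMOTE_OVERHEAD / COMPUTE_SERVER["cores"])
    n_flash = math.ceil(BASE_FLASH_TB * storage_scale / FLASH_SERVER["flash_tb"])
    return n_compute * COMPUTE_SERVER["cost"] + n_flash * FLASH_SERVER["cost"]

for cpu_scale, storage_scale in [(1, 1), (1, 2), (1, 4), (2, 1), (4, 1)]:
    lc = local_cost(cpu_scale, storage_scale)
    dc = disagg_cost(cpu_scale, storage_scale)
    print(f"compute x{cpu_scale}, storage x{storage_scale}: "
          f"cost benefit of disaggregation = {100 * (lc - dc) / lc:+.0f}%")
```

With these placeholder numbers, the benefit grows as storage scales faster than compute and shrinks toward zero when the two stay balanced, which is the qualitative trend in the plot above.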
Implications for System Design
• Dataplane:
  – Reduce compute overhead of the network (storage) stack
    • Optimize TCP/IP processing
    • Use a light-weight protocol
  – Provide isolation mechanisms for shared remote Flash
• Control plane:
  – Policies for allocating and sharing remote Flash
    • Important to consider write IO patterns of applications
Summary
[Recap: disaggregated Flash architecture (application tier → datastore tier → remote block service over iSCSI); optimizations improve remote throughput by 42% over the iSCSI baseline, leaving a 20% drop on average and 10% at the tail vs. local Flash; the cost model shows when scaling storage independently of compute saves resources.]