FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs
Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung
Executable Summary
Latency-critical and throughput applications co-run in the datacenter, but the storage stack between them and an ultra-low-latency SSD (ULL-SSD), namely the file system, block layer, NVMe driver, and flash firmware, is unaware of the ULL-SSD and of which application is latency-critical. The resulting interference and longer I/O latency form a barrier that hides the ULL-SSD's memory-like performance. FlashShare punches through these performance barriers, reducing average turnaround response times by 22% and 99th-percentile turnaround response times by 31%.
Motivation: applications in the datacenter
Datacenters execute a wide range of latency-critical workloads:
• Driven by the market for social media and web services;
• Required to satisfy a certain level of service-level agreement (SLA);
• Sensitive to latency (i.e., turnaround response time).
A typical example is the Apache web server: the TCP/IP service places incoming HTTP requests on a queue, a monitor hands them to workers, and each worker responds with the requested data object. A key metric: user experience.
Motivation: applications in the datacenter
• Latency-critical applications exhibit varying loads over the course of a day.
• Datacenters overprovision server resources to meet the SLA.
• However, this results in low utilization (around 30% on average) and low energy efficiency.
[Figure 1: Example diurnal pattern in queries per second for a Web Search cluster.¹]
[Figure 2: CPU utilization analysis of a Google server cluster.²]
¹ Power Management of Online Data-Intensive Services.
² The Datacenter as a Computer.
Motivation: applications in the datacenter
A popular solution: co-locating latency-critical and throughput workloads on the same servers (e.g., ISCA'15, MICRO'11, EuroSys'14).
Challenge: applications in the datacenter
Experiment: Apache+PageRank vs. Apache only.
Applications:
• Apache: an online, latency-critical application;
• PageRank: an offline, throughput application.
Performance metrics:
• SSD device latency;
• Response time of the latency-critical application.
Challenge: applications in the datacenter
Experiment: Apache+PageRank vs. Apache only.
[Fig. 1: Apache's SSD latency increases due to PageRank.]
[Fig. 2: Apache's response time increases due to PageRank.]
• The throughput-oriented application drastically increases the I/O access latency of the latency-critical application.
• This latency increase deteriorates the turnaround response time of the latency-critical application.
Challenge: ULL-SSD
There are emerging Ultra-Low-Latency SSD (ULL-SSD) technologies, which can be used for faster I/O services in the datacenter:

             Optane             nvNitro     Z-NAND           XL-Flash
Vendor       Intel              Everspin    Samsung          Toshiba
Technique    Phase-change RAM   MRAM        New NAND flash   New NAND flash
Read         10us               6us         3us              N/A
Write        10us               6us         100us            N/A
Challenge: ULL-SSD
In this work, we use an engineering sample of the Z-NAND-based "Z-SSD".

Z-NAND [1]
Technology   SLC-based 3D NAND, 48 stacked word-line layers
Capacity     64Gb/die
Page size    2KB/page

[1] Cheong, Wooseong, et al. "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time." 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.
Challenge: datacenter server with ULL-SSD
Unfortunately, the short-latency characteristics of the ULL-SSD cannot be exposed to users (in particular, to the latency-critical applications).
Applications:
• Apache: an online, latency-critical application;
• PageRank: an offline, throughput application.
[Figure: device latency analysis (annotated values: 42x, 36us, 28us).]
Challenge: datacenter server with ULL-SSD
The ULL-SSD fails to deliver its short latency because of the storage stack:
• The storage stack (caching layer, file system, blkmq, NVMe driver) is unaware of the characteristics of both the latency-critical workload and the ULL-SSD.
• The current designs of the blkmq layer, the NVMe driver, and the SSD firmware can all hurt the performance of latency-critical applications.
Blkmq layer: challenge
The software queue holds latency-critical I/O requests for a long time: on submission, incoming requests are queued in the per-core software queues and merged with adjacent requests before they ever reach the hardware queue.
Blkmq layer: challenge
• The software queue holds latency-critical I/O requests for a long time.
• The hardware queue dispatches an I/O request without any knowledge of its latency-criticality: a request is dispatched only once it acquires a token, regardless of its urgency.
Blkmq layer: optimization
Our solution: bypass.
• Latency-critical I/Os bypass blkmq (no queuing, no merging) for a faster response; the penalty is small because the NVMe layer performs no I/O scheduling anyway.
• Throughput I/Os are still merged in blkmq for higher storage bandwidth.
A minimal sketch of this policy follows.
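To make the bypass concrete, here is a minimal C sketch of the submission-path decision, assuming a simplified blkmq-like layer; `submit_request`, `hwq_dispatch`, and `swq_enqueue` are hypothetical stand-ins, not Linux kernel APIs:

```c
#include <stdio.h>

/* A sketch of the bypass policy, not the actual Linux blk-mq code.
 * The queue helpers below are hypothetical stubs. */

struct request {
    long sector;
    int  lat_critical;  /* criticality hint tagged at submission time */
};

static void hwq_dispatch(struct request *rq)
{
    /* Issue directly to the NVMe hardware queue (stub). */
    printf("dispatch sector %ld immediately\n", rq->sector);
}

static void swq_enqueue(struct request *rq)
{
    /* Stage in the software queue for merging with neighbors (stub). */
    printf("queue sector %ld for merging\n", rq->sector);
}

void submit_request(struct request *rq)
{
    if (rq->lat_critical)
        hwq_dispatch(rq);   /* bypass: no queuing, no merge */
    else
        swq_enqueue(rq);    /* throughput path: merge for bandwidth */
}
```

The trade-off is explicit: latency-critical requests give up merging, which costs little since the NVMe layer does no further I/O scheduling.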
NVMe SQ: challenge (bypass alone is not enough)
In the NVMe protocol-level queue, a latency-critical I/O request can be blocked by prior I/O requests: after the host rings the SQ doorbell, the controller fetches commands from the submission queue head, so a newly arrived urgent command must wait for every command ahead of it. With two prior commands pending:

Cost = T_fetch-self + 2 x T_fetch

which amounts to more than 200% overhead. A worked reading of this cost follows.
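Here is the slide's overhead figure worked out, under the assumed simplification that fetching any command takes roughly the same time (T_fetch-self ≈ T_fetch):

```latex
% Waiting cost of a latency-critical command that arrives behind two
% pending throughput commands in the same submission queue.
T_{\mathrm{wait}} \;=\; T_{\mathrm{fetch\text{-}self}} + 2 \times T_{\mathrm{fetch}}
\;\approx\; 3 \times T_{\mathrm{fetch}}
```

Compared with the unblocked cost of T_fetch-self alone, this is roughly a 200% overhead, and it grows with every additional command queued ahead.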
NVMe SQ: optimization
Target: designing a responsiveness-aware NVMe submission path.
Key insight:
• Conventional NVMe controllers are allowed to customize the standard arbitration strategy across the NVMe protocol-level queues.
• Thus, we can make the NVMe controller decide which NVMe command to fetch next by sharing a hint about each I/O's urgency.
NVMe SQ: optimization
Our solution:
1. Double the SQs per core: a Lat-SQ for latency-critical I/Os and a Thr-SQ for throughput I/Os.
2. Double the SQ doorbell registers accordingly.
3. A new arbitration strategy that gives the highest priority to the Lat-SQ: commands in a Lat-SQ are fetched immediately, while Thr-SQ fetches are postponed.
A sketch of such an arbiter follows.
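A minimal C sketch of such an arbiter, assuming per-core Lat-SQ/Thr-SQ pairs; the structures and names are illustrative, not taken from the NVMe specification:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified submission-queue state: head advances as the controller
 * fetches, tail advances as the host submits. */
struct sq {
    unsigned head, tail;
};

static bool sq_nonempty(const struct sq *q)
{
    return q->head != q->tail;
}

/* Return the queue to fetch the next command from: any non-empty
 * Lat-SQ is always served before any Thr-SQ. */
struct sq *arbitrate(struct sq *lat_sqs, struct sq *thr_sqs, size_t ncores)
{
    for (size_t i = 0; i < ncores; i++)   /* urgent commands first */
        if (sq_nonempty(&lat_sqs[i]))
            return &lat_sqs[i];
    for (size_t i = 0; i < ncores; i++)   /* then throughput commands */
        if (sq_nonempty(&thr_sqs[i]))
            return &thr_sqs[i];
    return NULL;                          /* nothing pending */
}
```

With this priority, an urgent command is fetched immediately instead of waiting behind throughput commands in a shared queue.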
SSD firmware: challenge
The embedded (DRAM) cache provides the fastest response: a hit costs only T_CL + T_CACHE, whereas a miss costs T_CL + T_FTL + T_NAND + T_CACHE. However:
• The embedded cache cannot protect latency-critical I/O data from eviction;
• The embedded cache can be polluted by throughput requests, which evict latency-critical entries.
SSD firmware: optimization
Our design splits the internal cache space, reserving a protection region so that throughput requests cannot evict cached data belonging to latency-critical I/Os. A sketch of such a partitioned eviction policy follows.
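A minimal C sketch of the idea, assuming a set-associative embedded cache; the way count and split point are illustrative, not the actual firmware parameters:

```c
/* A sketch of way-partitioned eviction in the embedded cache. Ways
 * below PROTECTED are reserved for latency-critical data. */

#define NWAYS      4
#define PROTECTED  2   /* ways 0..1: protection region */

struct line {
    long addr;
    int  valid;
};

/* Pick a victim way for an incoming fill. Throughput fills may only
 * evict from the unprotected ways, so latency-critical lines survive
 * bursts of throughput traffic; latency-critical fills may use any way. */
int pick_victim(struct line set[NWAYS], int fill_is_lat_critical)
{
    int lo = fill_is_lat_critical ? 0 : PROTECTED;

    for (int w = lo; w < NWAYS; w++)   /* prefer an invalid way */
        if (!set[w].valid)
            return w;
    return lo;   /* else evict within the allowed region
                  * (a real design would pick the LRU way) */
}
```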
NVMe CQ: challenge
NVMe completion imposes MSI overhead on each I/O request: the controller posts a completion entry and raises an MSI interrupt, which costs a context switch into the interrupt service routine (T_CS), the routine itself (T_ISR), and another context switch back out:

Cost = 2 x T_CS + T_ISR per I/O.
NVMe CQ: optimization
Key insight: state-of-the-art Linux supports a poll mechanism for I/O completion. By polling the CQ from the blkmq layer instead of taking an MSI interrupt, the completion path avoids the interrupt service routine and both context switches, saving 2 x T_CS + T_ISR per I/O. A sketch of the polling idea follows.
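A minimal C sketch of the polling idea, assuming a simplified CQ entry with an NVMe-style phase bit; this mirrors the concept behind Linux's blk-mq poll path rather than its actual implementation:

```c
#include <stdint.h>

/* Simplified NVMe completion queue entry: bit 0 of `status` holds the
 * phase tag the controller flips on each pass through the queue. */
struct cqe {
    uint16_t cid;     /* command identifier */
    uint16_t status;  /* bit 0: phase tag */
};

struct cq {
    volatile struct cqe *entries;
    unsigned head, depth;
    unsigned phase;   /* expected phase of newly posted entries */
};

/* Spin on the CQ head until the controller posts a completion. No MSI
 * is raised and no context switch occurs, so 2xT_CS + T_ISR are saved
 * per I/O at the cost of burning CPU cycles while waiting. */
uint16_t poll_completion(struct cq *q)
{
    while ((q->entries[q->head].status & 1) != q->phase)
        ;                              /* busy-wait on the phase bit */

    uint16_t cid = q->entries[q->head].cid;
    if (++q->head == q->depth) {       /* wrap and flip expected phase */
        q->head = 0;
        q->phase ^= 1;
    }
    return cid;
}
```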