ATC 2020 Fully Hardware Automated Open Research Framework for Future Fast NVMe Device Myoungsoo Jung Computer Architecture and Memory systems Laboratory Sponsored by CAMELab
Emerging Non-Volatile Memory for SSDs
[Figure: read latency by memory type. Latency (reads): 450 us, 150 us, 25 us, 3 us, 120 ns, 50~80 ns, 60~80 ns. Memory types: MRAM, TLC, MLC, SLC, New Flash, PRAM, DRAM. Groups: Storage Class Memory (SCM), Flash Technologies.]
NVMe Internals and Interfaces
[Diagram: SSD with CTRL, CPU, and Flash ×4.]
NVMe Storage Stack
[Diagram: host stack of Applications (Processes), VFS/FS, Page cache, Block layer, and Block device driver; SSD with CTRL, CPU, and Flash ×4; labeled 1~3GB/sec.]
NVMe Storage Stack Redesign
• FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs (OSDI’18)
• De-indirection for Flash-Based SSDs with Nameless writes (FAST’12)
• Towards SLO Complying SSDs Through OPS Isolation (FAST’15)
• The case of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator (FAST’18)
• And there are many more!
Challenges #1: Most storage research relies on simulation/kernel-level emulation
[Diagram: host stack of Applications (Processes), VFS/FS, Page cache, Block layer, and Block device driver; SSD with CTRL, CPU, and Flash ×4; labeled 1~3GB/sec.]
SCM-based NVMe Storage Card
Challenges #2: SSD’s CPU can be a performance bottleneck for SCMs
[Diagram: host stack of Applications (Processes), VFS/FS, Page cache, Block layer, and Block device driver; SSD with CTRL, CPU, and SCM ×4; labeled 7GB/sec.]
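As a rough illustration of why the SSD's CPU can become the bottleneck, the sketch below converts a link bandwidth into the request rate and the per-request time budget that the firmware would have to sustain. The 4 KiB request size and the decimal reading of GB (10^9 bytes) are assumptions made for this example, not figures from the slides, and the function names are hypothetical.

```python
def required_iops(bandwidth_gb_per_s: float, request_bytes: int = 4096) -> float:
    """Requests per second needed to saturate the given bandwidth.

    Assumes 1 GB = 1e9 bytes and a fixed request size (4 KiB by default).
    """
    return bandwidth_gb_per_s * 1e9 / request_bytes


def per_request_budget_ns(bandwidth_gb_per_s: float, request_bytes: int = 4096) -> float:
    """Average time, in nanoseconds, available to handle one request."""
    return 1e9 / required_iops(bandwidth_gb_per_s, request_bytes)


if __name__ == "__main__":
    # Bandwidth labels taken from the slides: 1~3GB/sec (flash) and 7GB/sec (SCM).
    for gbps in (1.0, 3.0, 7.0):
        print(f"{gbps:.0f} GB/s: {required_iops(gbps):,.0f} IOPS, "
              f"{per_request_budget_ns(gbps):.0f} ns per request")
```

Under these assumptions, a 7GB/sec card leaves well under a microsecond per 4 KiB request for all queue, doorbell, and PRP handling combined, which is the pressure Challenges #2 points at.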
What Does SSD’s CPU Do?
[Diagram: host stack of Applications (Processes), VFS/FS, Page cache, Block layer, and Block device driver; SSD with CTRL, CPU, and SCM ×4; labeled 7GB/sec. The host memory address space holds the Submission queue (SQ), the Completion queue (CQ), and Data (PRP). The device register holds the SQ Doorbell and the CQ Doorbell.]
❶ I/O submission
❷ Ring SQ doorbell
❸ I/O fetch
❹ Data transfer
❺ I/O process
❻ I/O completion
❼ Interrupt (notification)
❽ Process completion
❾ Ring CQ doorbell
All these NVMe activities place a burden on the storage!
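The nine steps above can be sketched as a toy model of one NVMe queue pair. This is a simplified illustration, not the firmware or driver discussed in the talk: the queue depth, the dictionary-based command format, and all function names are hypothetical. Real NVMe also uses phase tags, PRP lists, and full-queue checks, which are omitted here.

```python
from dataclasses import dataclass, field

QUEUE_DEPTH = 4  # hypothetical depth, chosen only for this example


@dataclass
class QueuePair:
    # The SQ and CQ rings live in host memory; the doorbells are device registers.
    sq: list = field(default_factory=lambda: [None] * QUEUE_DEPTH)
    cq: list = field(default_factory=lambda: [None] * QUEUE_DEPTH)
    sq_tail_doorbell: int = 0  # written by the host
    cq_head_doorbell: int = 0  # written by the host
    sq_head: int = 0           # controller's fetch position
    cq_tail: int = 0           # controller's post position


def host_submit(qp, host_tail, command):
    """Steps 1-2: place the command in the SQ, then ring the SQ doorbell."""
    qp.sq[host_tail] = command
    new_tail = (host_tail + 1) % QUEUE_DEPTH
    qp.sq_tail_doorbell = new_tail
    return new_tail


def controller_service(qp, host_memory, media):
    """Steps 3-7: fetch, transfer data, process, post completion, interrupt."""
    interrupts = 0
    while qp.sq_head != qp.sq_tail_doorbell:
        cmd = qp.sq[qp.sq_head]                              # step 3: I/O fetch
        qp.sq_head = (qp.sq_head + 1) % QUEUE_DEPTH
        if cmd["op"] == "write":
            data = host_memory[cmd["prp"]]                   # step 4: data transfer
            media[cmd["lba"]] = data                         # step 5: I/O process
        else:
            # Read: access the media first, then move the data to the host buffer.
            host_memory[cmd["prp"]] = media[cmd["lba"]]
        qp.cq[qp.cq_tail] = {"cid": cmd["cid"], "status": 0}  # step 6: completion
        qp.cq_tail = (qp.cq_tail + 1) % QUEUE_DEPTH
        interrupts += 1                                      # step 7: one interrupt each
    return interrupts


def host_complete(qp, host_head, pending):
    """Steps 8-9: consume completion entries, then ring the CQ doorbell."""
    done = []
    for _ in range(pending):
        done.append(qp.cq[host_head]["cid"])
        host_head = (host_head + 1) % QUEUE_DEPTH
    qp.cq_head_doorbell = host_head
    return host_head, done


def demo():
    """Run one write and one read through the queue pair."""
    qp = QueuePair()
    host_memory = {0x1000: b"hello", 0x2000: None}  # buffers keyed by a fake PRP
    media = {}                                       # stands in for the SCM
    tail = host_submit(qp, 0, {"cid": 1, "op": "write", "lba": 7, "prp": 0x1000})
    tail = host_submit(qp, tail, {"cid": 2, "op": "read", "lba": 7, "prp": 0x2000})
    irqs = controller_service(qp, host_memory, media)
    head, done = host_complete(qp, 0, irqs)
    return qp, host_memory, media, irqs, done
```

In this model everything inside `controller_service` runs on the SSD side for every request, which is the per-I/O work the slides attribute to the SSD's CPU.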
Multi-core IP for High-Performance SSD
[Diagram labels: Backend, I-RAM ×3, PCIe Client Logic, Channel Complex, NVMe driver, PCIe, SQ, CQ, Interconnection Networks, Core0, Outbound, Inbound, PCIe, Memory Controller, SRAM, CPU.]
Component Latency Decomposition
[Figure: latency breakdown, normalized from 0.0 to 1.0, for ZNAND, TLC, MLC, SLC, PRAM, and MRAM. Legend labels: Completion, Translation, PRP, Queue/Doorbells, Fetching, NVM.]