Write Optimization of Log-structured Flash File System for Parallel I/O on Manycore Servers


  1. Write Optimization of Log-structured Flash File System for Parallel I/O on Manycore Servers. Chang-Gyu Lee, Hyunki Byun, Sunghyun Noh, Hyeongu Kang, Youngjae Kim. Department of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea. SYSTOR '19 1

  2. Data Intensive Applications – Massive data explosion in recent years, expected to keep growing: 281 EB (2007), 1.2 ZB (2010), 4.4 ZB (2013), ~44 ZB projected for 2020. – Growing capacity demands for both memory and storage. – Database applications drive parallel writes. 2

  3. Manycore CPU and NVMe SSD – [Diagram: a manycore server issues parallel writes through the OS file system (F2FS) to a high-performance NVMe SSD.] 3

  4. What are Parallel Writes? – Shared File Writes (DWOM from FxMark[ATC'16]): multiple processes issue direct I/O writes to private regions of a single shared file. – Private File Write with fsync (DWSL from FxMark[ATC'16]): multiple processes write private files, then call the fsync system call. A minimal sketch of the DWOM pattern is shown after this slide. * FxMark[ATC'16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016 4
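
A minimal C sketch of the DWOM-style shared-file write pattern, assuming a hypothetical file name, region size, and process count (error handling omitted): each process issues direct, block-aligned writes only to its own region of one shared file.

```c
/*
 * Hypothetical sketch of the DWOM-style shared-file write pattern:
 * N processes each write a private, block-aligned region of one file
 * using direct I/O. File name, region size, and process count are
 * illustrative, not taken from FxMark itself.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC      4
#define REGION_SZ  (1 << 20)   /* 1 MiB private region per process */
#define BLOCK_SZ   4096        /* alignment required by O_DIRECT    */

int main(void)
{
    for (int rank = 0; rank < NPROC; rank++) {
        if (fork() == 0) {
            int fd = open("shared.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
            void *buf;
            posix_memalign(&buf, BLOCK_SZ, BLOCK_SZ);
            memset(buf, rank, BLOCK_SZ);

            /* Each process touches only its own offset range. */
            off_t base = (off_t)rank * REGION_SZ;
            for (off_t off = 0; off < REGION_SZ; off += BLOCK_SZ)
                pwrite(fd, buf, BLOCK_SZ, base + off);

            close(fd);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    return 0;
}
```

Because every process holds its own file descriptor and touches a disjoint byte range, nothing in the workload itself forces serialization; the serialization seen later comes from the file system's inode lock.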

  5. Preliminary Results – [Charts: K IOPS vs. # of cores (1 to 120) for the DWOM and DWSL workloads.] – In the DWOM workload, performance does not scale at all. – In the DWSL workload, performance stops scaling after 42 cores. 5

  6. Contents – Introduction and Motivation – Background: F2FS – Research Problems – Parallel writes do not scale with the increased number of cores on manycore servers. – Approaches – Applying Range Locking – NVM Node Logging for file and file system metadata – Pin-Point Update to completely eliminate checkpointing – Evaluation Results – Conclusion 6

  7. F2FS: Flash Friendly File System – F2FS is a log-structured file system designed for NAND flash SSDs. – F2FS employs two types of logs to benefit from the flash device's internal parallelism and to ease garbage collection: a data log for directory entries and user data, and a node log for inodes and indirect nodes. – The Node Address Table (NAT) translates a node id (NID) to a block address. – In memory, the block address in a NAT entry is updated when the corresponding node log block is flushed. – The entire NAT is flushed to the storage device during checkpointing. – [On-disk layout: file system metadata area (CP, NAT, SIT, SSA; random writes) followed by the main log area (node log and data log; sequential writes).] The NID-to-block translation is sketched after this slide. 7
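
As a rough illustration of the NAT role described above (field names and sizes are assumptions, not the actual F2FS on-disk format):

```c
/*
 * Simplified illustration of the NAT translation described above;
 * field names and types are illustrative, not the exact F2FS
 * on-disk definitions.
 */
#include <stdint.h>

struct nat_entry {
    uint32_t nid;        /* node id                          */
    uint32_t ino;        /* inode the node belongs to        */
    uint32_t block_addr; /* current on-flash block address   */
};

/* In-memory NAT: when a node log block is flushed, only the cached
 * entry is updated; the on-disk NAT copy is rewritten at checkpoint. */
static struct nat_entry nat_cache[1024];   /* illustrative capacity */

static uint32_t nid_to_block(uint32_t nid)
{
    return nat_cache[nid].block_addr;
}

static void on_node_flush(uint32_t nid, uint32_t new_block_addr)
{
    nat_cache[nid].block_addr = new_block_addr; /* dirty until checkpoint */
}
```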

  8. Problem(1): Serialized Shared File Writes – [Diagram: processes A, B, and C write to a single file; the inode lock is granted to one writer while the others are blocked.] – Writes to a shared file are serialized on the file's inode lock. 8

  9. Problem(2): fsync Processing in F2FS – [Diagram: DRAM holds the cached NAT (node id to block address) and the dirty inode; the SSD holds the on-disk NAT, data log, and node log.] – On fsync, F2FS ❶ flushes the new data to the data log and ❷ flushes the updated inode block to the node log, then updates the in-memory NAT entry to point at the new node block. A toy sketch of this path follows. 9
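
A toy, self-contained sketch of this fsync path (in-memory arrays stand in for the on-flash logs; all names are illustrative, not F2FS code):

```c
#include <stdint.h>
#include <string.h>

#define BLK 4096

/* Toy in-memory stand-ins for the data log, node log, and cached NAT. */
static uint8_t  data_log[64][BLK], node_log[64][BLK];
static uint32_t data_head, node_head;
static uint32_t nat[256];                 /* nid -> node log block index */

struct mini_inode {
    uint32_t nid;              /* node id                                */
    uint32_t data_block_addr;  /* where this file's data block now lives */
};

static uint32_t append_to_data_log(const void *page)
{
    memcpy(data_log[data_head], page, BLK);
    return data_head++;
}

static uint32_t append_to_node_log(const struct mini_inode *inode)
{
    /* A whole 4 KiB node block is consumed even for a tiny inode change. */
    memcpy(node_log[node_head], inode, sizeof(*inode));
    return node_head++;
}

static void fsync_path(struct mini_inode *inode, const void *dirty_data)
{
    inode->data_block_addr = append_to_data_log(dirty_data);  /* step 1 */
    nat[inode->nid] = append_to_node_log(inode);              /* step 2:
        the NAT update stays in DRAM; the on-disk NAT waits for checkpoint */
}
```

The point of the sketch is that step 2 always writes a full node block even when only a few inode bytes changed, and that the NAT update remains in memory until the next checkpoint.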

  10. Problem(3): I/O Blocking during Checkpointing – [Diagram: as in the fsync path, ❶ data is flushed to the data log and ❷ the node to the node log; the NAT checkpointing step is annotated "60 Sec.".] 10

  11. Problem(3): I/O Blocking during Checkpointing – [Diagram, continued: ❸ checkpointing flushes the entire dirty NAT from DRAM to the on-disk NAT.] 11

  12. Problem(3): I/O Blocking during Checkpointing – [Diagram, continued: while the NAT is being flushed, incoming I/O requests at both the user level and the file system level are blocked.] 12

  13. Summary – We identified the causes of bottlenecks in F2FS for parallel writes as follows. 1. Serialization of parallel writes on a single file 2. High latency of fsync system call 3. I/O blocking by checkpointing of F2FS 13

  14. Approach(1): Range Locking – In F2FS, parallel writes to a single file are serialized by the inode mutex lock. We employ a range-based lock to allow parallel writes on a single file: writers A, B, and C targeting disjoint regions are each granted their range, and only overlapping ranges block each other. A sketch of such a file-level range lock follows. 14
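
A minimal user-space sketch of a file-level range lock, assuming a linked list of granted ranges guarded by a mutex and condition variable (the paper's in-kernel implementation will differ; all names are illustrative):

```c
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

struct range {
    off_t start, end;          /* [start, end) byte range */
    struct range *next;
};

struct range_lock {
    pthread_mutex_t lock;
    pthread_cond_t  released;
    struct range   *held;      /* currently granted ranges */
};
#define RANGE_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

static int overlaps(const struct range *r, off_t start, off_t end)
{
    return r->start < end && start < r->end;
}

/* Block until no granted range overlaps [start, end), then grant it. */
void range_lock_acquire(struct range_lock *rl, off_t start, off_t end)
{
    pthread_mutex_lock(&rl->lock);
retry:
    for (struct range *r = rl->held; r; r = r->next)
        if (overlaps(r, start, end)) {
            pthread_cond_wait(&rl->released, &rl->lock);
            goto retry;                  /* rescan after wake-up */
        }
    struct range *nr = malloc(sizeof(*nr));
    nr->start = start; nr->end = end;
    nr->next  = rl->held;
    rl->held  = nr;
    pthread_mutex_unlock(&rl->lock);
}

void range_lock_release(struct range_lock *rl, off_t start, off_t end)
{
    pthread_mutex_lock(&rl->lock);
    for (struct range **pp = &rl->held; *pp; pp = &(*pp)->next)
        if ((*pp)->start == start && (*pp)->end == end) {
            struct range *dead = *pp;
            *pp = dead->next;
            free(dead);
            break;
        }
    pthread_cond_broadcast(&rl->released);
    pthread_mutex_unlock(&rl->lock);
}
```

Writers acquire the lock for their byte range before writing and release it afterward; disjoint ranges are granted concurrently, only overlapping ones block.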

  15. Approach(2): High Latency of fsync Processing – When fsync is called, F2FS has to flush both data and metadata. – Even if only a small portion of the metadata has changed, a whole block has to be flushed, so the latency of fsync is dominated by block I/O latency. – To mitigate the high latency of fsync, we propose NVM Node Logging and a fine-grained inode: instead of slow block I/O to the SSD, with its write amplification, metadata goes to byte-addressable NVM for better latency. 15

  16. Approach(2): Node Logging on NVM – [Diagram: ❶ data is flushed to the data log on the SSD, while ❷ the node is flushed to a node log placed on byte-addressable NVM; the in-memory NAT entry is then updated as before.] 16

  17. Approach(3): Fine-grained inode Structure – The inode in baseline F2FS occupies a full 4 KB block holding data block addresses plus the NIDs of its direct, indirect, and double-indirect nodes. – The proposed fine-grained inode is about 0.4 KB, small enough to be written to byte-addressable NVM without flushing a whole block. An illustrative struct comparison follows. 17
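
For illustration only, a rough C comparison of the two layouts (field names and counts are assumptions, not the exact F2FS on-disk format):

```c
/*
 * Illustrative comparison only; field names and counts are assumptions,
 * not the exact F2FS on-disk layout. The point is the size difference:
 * the baseline inode fills a 4 KB block, while the fine-grained inode
 * is a few hundred bytes that can be logged to byte-addressable NVM.
 */
#include <stdint.h>

struct baseline_inode {                 /* ~4 KB node block             */
    uint8_t  metadata[128];             /* mode, size, timestamps, ...  */
    uint32_t addr[923];                 /* direct data block addresses  */
    uint32_t nid[5];                    /* (double) indirect node ids   */
    uint8_t  pad[4096 - 128 - 923*4 - 5*4];
};

struct fine_grained_inode {             /* ~0.4 KB, byte-addressable    */
    uint8_t  metadata[128];             /* same inode fields            */
    uint32_t addr[64];                  /* a small set of addresses     */
    uint32_t nid[5];                    /* node ids for the rest        */
};
```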

  18. Approach(4): Pin-Point NAT Update – Frequent fsync calls trigger checkpointing in F2FS. – However, F2FS blocks all incoming I/O requests during checkpointing. – To eliminate checkpointing, we propose Pin-Point NAT Update. 18

  19. Approach(4): Pin-Point NAT Update – In Pin-Point NAT Update, we update only the modified NAT entry, directly in NVM, when fsync is called. Therefore, checkpointing is not necessary to persist the entire NAT. – [Diagram: ❶ data flushed to the data log on the SSD, ❷ the node flushed to the node log on NVM, ❸ the corresponding NAT entry updated in place in NVM.] A toy sketch of this path follows. 19
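
A toy, self-contained sketch of the Pin-Point NAT Update fsync path; arrays stand in for the SSD data log and the NVM-resident node log and NAT, and persist() stands in for the cache-line flush and fence primitives a real NVM implementation would use (all names are illustrative, not the paper's actual code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 4096

static uint8_t  ssd_data_log[64][BLK];      /* data log stays on the SSD  */
static uint8_t  nvm_node_log[64 * 512];     /* byte-addressable node log  */
static uint32_t nvm_nat[256];               /* nid -> node log offset     */
static uint32_t data_head, node_head;

static void persist(const void *addr, size_t len)
{
    (void)addr; (void)len;   /* stand-in for cache-line flush + fence */
}

static uint32_t append_data_block(const void *page)
{
    memcpy(ssd_data_log[data_head], page, BLK);
    return data_head++;
}

/* fsync path: ❶ data block to the SSD data log, ❷ only the changed
 * inode bytes to the NVM node log, ❸ pin-point update of one NAT entry. */
static void pinpoint_fsync(uint32_t nid, const void *inode, size_t inode_len,
                           const void *dirty_data)
{
    append_data_block(dirty_data);                            /* ❶ */

    uint32_t node_off = node_head;
    memcpy(&nvm_node_log[node_off], inode, inode_len);        /* ❷ */
    persist(&nvm_node_log[node_off], inode_len);
    node_head += inode_len;

    nvm_nat[nid] = node_off;                                  /* ❸ */
    persist(&nvm_nat[nid], sizeof(nvm_nat[nid]));
    /* The entire NAT never needs to be checkpointed for durability. */
}
```

Compared with the earlier fsync sketch, only the inode bytes that changed go to the node log, and durability of the mapping needs nothing beyond the single NAT entry written in step ❸.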

  20. Approach(4): Pin-Point NAT Update – [Diagram from the previous slide, next animation step: the state after ❶ the data log write, ❷ the NVM node log write, and ❸ the in-place NAT entry update.] 20

  21. Evaluation Setup – Microbenchmark (FxMark): DWOM (shared file write) and DWSL (private file write with fsync). – Test-bed: IBM x3950 X6 manycore server; CPU: Intel Xeon E7-8870 v2 2.3 GHz, 8 CPU nodes (15 cores per node), 120 cores total; RAM: 740 GB; SSD: Intel SSD 750 Series 400 GB (NVMe), read 2200 MB/s, write 900 MB/s; NVM: 32 GB of RAM emulated as a PMEM device; OS: Linux kernel 4.14.11. * FxMark[ATC'16]: Min et al., "Understanding Manycore Scalability of File Systems", USENIX ATC 2016 21

  22. Shared File Write (DWOM Workload) – [Chart: K IOPS vs. # of cores (1 to 120) for baseline, range lock, node logging, and integrated; annotated gains of 15x and 6.8x over the baseline.] – The baseline and node logging lines overlap: node logging does not help at all here, because the DWOM workload issues no fsync calls. 22

  23. Frequent fsync (DWSL Workload) – [Chart: K IOPS vs. # of cores (1 to 120) for baseline, range lock, node logging, and integrated; annotated gain of 1.6x over the baseline.] 23

  24. Conclusion – We identified the performance bottlenecks of F2FS for parallel writes: 1. Serialization of shared file writes on a single file 2. High latency of fsync operations in F2FS 3. High I/O blocking times during checkpointing – To solve these problems, we proposed: 1. File-level Range Lock to allow parallel writes on a shared file 2. NVM Node Logging to provide lower latency for updating file and file system metadata 3. Pin-Point NAT Update to eliminate the I/O blocking times of checkpointing 24

  25. Q&A Thank you! – Contact: Changgyu Lee (changgyu@sogang.ac.kr) Department of Computer Science and Engineering Sogang University, Seoul, Republic of Korea 25
