A Semi-Preemptive Garbage Collector for Solid State Drives
Junghee Lee, Youngjae Kim, Galen M. Shipman, Sarp Oral, Feiyi Wang, and Jongman Kim
Presented by Junghee Lee
High Performance Storage Systems
• Server-centric services
– File, web & media servers, transaction processing servers
• Enterprise-scale storage systems
– Information technology focusing on storage, protection, and retrieval of data in large-scale environments
[Figure: Google's massive server farms; high performance storage systems; storage unit: hard disk drive]
Spider: A Large-scale Storage System
• Jaguar
– Peta-scale computing machine
– 25,000 nodes with 250,000 cores and over 300 TB memory
• Spider storage system
– The largest center-wide Lustre-based file system
– Over 10.7 PB of RAID 6 formatted capacity
  • 13,400 x 1 TB HDDs
– 192 Lustre I/O servers
  • Over 3 TB of memory (on Lustre I/O servers)
Emergence of NAND Flash based SSD
• NAND Flash vs. Hard Disk Drives
– Pros:
  • Semiconductor technology, no mechanical parts
  • Lower access latencies: μs for SSDs vs. ms for HDDs
  • Lower power consumption
  • Higher robustness to vibrations and temperature
– Cons:
  • Limited lifetime: 10K - 1M erases per block
  • High cost: about 8X more expensive than current hard disks
  • Performance variability
Outline
• Introduction
• Background and Motivation
– NAND Flash and SSD
– Garbage Collection
– Pathological Behavior of SSDs
• Semi-Preemptive Garbage Collection
• Evaluation
• Conclusion
NAND Flash based SSD
[Figure: the I/O path from application to flash]
• Application: processes issue fwrite (file, data) to the file system (FAT, Ext2, NTFS …)
• OS: block write (LBA, size) through the block device driver and the block interface (SATA, SCSI, etc.)
• Device: page write (bank, block, page) inside the SSD, which contains a CPU, memory, the FTL, and multiple flash chips
NAND Flash Organization
[Figure: a package contains Die 0 and Die 1; each die contains Plane 0 to Plane 3; each plane has a register and Block 0 to Block 2047; each block contains Page 0 to Page 63]
• Read (page): 0.025 ms
• Write (page): 0.200 ms
• Erase (block): 1.500 ms
Out-Of-Place Write
[Figure: physical blocks and the logical-to-physical address mapping table; page states are V (valid), I (invalid), E (erased)]
• Physical pages: P0 I, P1 V, P2 V → I, P3 E → V, P4 V, P5 V, P6 E, P7 E
• Mapping table: LPN0 → PPN1, LPN1 → PPN4, LPN2 → PPN2 → PPN3, LPN3 → PPN5
• Steps for a write to LPN2: invalidate PPN2, write to PPN3, update the table
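The out-of-place write on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TinyFTL`, its fields, and the "first erased page" allocation policy are hypothetical and are not taken from the presentation or from any real FTL.

```python
from enum import Enum


class PageState(Enum):
    ERASED = "E"
    VALID = "V"
    INVALID = "I"


class TinyFTL:
    """Hypothetical page-mapped FTL, just large enough to mirror the slide."""

    def __init__(self, states, l2p):
        self.state = list(states)   # state of each physical page (PPN index)
        self.l2p = dict(l2p)        # logical page number -> physical page number

    def write(self, lpn):
        # Assumed policy: take the first erased page; ValueError if none is left.
        new_ppn = self.state.index(PageState.ERASED)
        old_ppn = self.l2p.get(lpn)
        if old_ppn is not None:
            self.state[old_ppn] = PageState.INVALID  # invalidate the old copy
        self.state[new_ppn] = PageState.VALID        # write to the new page
        self.l2p[lpn] = new_ppn                      # update the mapping table
        return new_ppn


def slide_example():
    """Build the initial state shown on the slide (P0..P7, LPN0..LPN3)."""
    E, V, I = PageState.ERASED, PageState.VALID, PageState.INVALID
    return TinyFTL([I, V, V, E, V, V, E, E], {0: 1, 1: 4, 2: 2, 3: 5})


if __name__ == "__main__":
    ftl = slide_example()
    print("LPN2 now maps to PPN", ftl.write(2))
```

Overwriting LPN2 never touches PPN2 in place; it only marks it invalid, which is what later creates work for garbage collection.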
Garbage Collection
[Figure: physical blocks P0 to P7 before and after garbage collection]
• Select victim block
• Move valid pages
• Erase victim block
2 reads + 2 writes + 1 erase = 2*0.025 + 2*0.200 + 1.5 = 1.950 (ms) !!
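The arithmetic on this slide generalizes to any number of valid pages in the victim block. The sketch below uses the read, write, and erase latencies from the NAND Flash Organization slide; the function name `gc_cost_ms` is hypothetical, and the model assumes each valid page costs exactly one read and one write with no overlap between operations.

```python
# Latencies from the NAND Flash Organization slide, in milliseconds.
READ_MS = 0.025
WRITE_MS = 0.200
ERASE_MS = 1.500


def gc_cost_ms(valid_pages):
    """Time to collect one victim block: move each valid page, then erase.

    Assumes strictly sequential operations (no pipelining or parallelism).
    """
    if valid_pages < 0:
        raise ValueError("valid_pages must be non-negative")
    return valid_pages * (READ_MS + WRITE_MS) + ERASE_MS


if __name__ == "__main__":
    for n in (0, 2, 32):
        print(f"{n:2d} valid pages -> {gc_cost_ms(n):.3f} ms")
```

With two valid pages this reproduces the 1.950 ms on the slide; the cost grows linearly with the number of valid pages that must be moved.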
Pathological Behavior of SSDs
• Does GC have an impact on the foreground operations?
– If so, we can observe sudden bandwidth drops
– More drops with more write requests
– More drops with more bursty workloads
• Experimental Setup
– SSD devices
  • Intel (SLC) 64GB SSD
  • SuperTalent (MLC) 120GB SSD
– I/O generator
  • Used the libaio asynchronous I/O library for block-level testing
Bandwidth Drop for Write-Dominant Workloads
• Experiments
– Measured bandwidth for 1 MB sequential requests while varying the read/write ratio
[Figure: two panels, "SuperTalent MLC (SSD)" and "Intel SLC (SSD)", 1MB Sequential; x-axis: Time (Sec), 0 to 60; y-axis: MB/s; series: 80% Write 20% Read, 60% Write 40% Read, 40% Write 60% Read, 20% Write 80% Read]
Performance variability increases as the write percentage of the workload increases.
Performance Variability for Bursty Workloads
• Experiments
– Measured SSD write bandwidth at queue depths (qd) of 8 and 64
– Normalized I/O bandwidth with a Z distribution
[Figure: two panels, "Intel SLC (SSD)" and "SuperTalent MLC (SSD)"]
Performance variability increases as the arrival rate of requests increases (bursty workloads).
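The slide does not spell out the normalization. Assuming that "Z distribution" refers to the usual z-score (subtract the mean, divide by the standard deviation), bandwidth samples could be normalized as sketched below. The function name and the sample values are made up for illustration and are not measurements from the talk.

```python
from statistics import mean, pstdev


def z_scores(samples):
    """Normalize samples to zero mean and unit (population) standard deviation.

    Assumption: this is the normalization the slide refers to.
    """
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return [0.0 for _ in samples]  # all samples identical: no variability
    return [(x - mu) / sigma for x in samples]


if __name__ == "__main__":
    # Hypothetical bandwidth samples in MB/s, with one sudden drop.
    bandwidth = [200, 210, 190, 120, 205]
    for bw, z in zip(bandwidth, z_scores(bandwidth)):
        print(f"{bw:4d} MB/s -> z = {z:+.2f}")
```

Normalizing this way puts devices with different absolute bandwidths on a common scale, so the spread of the two SSDs can be compared directly.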
Lessons Learned
• From the empirical study, we learned:
– Performance variability increases as the percentage of writes in the workload increases.
– Performance variability increases with the arrival rate of write requests.
• This is because:
– Any request that arrives during GC must wait until the ongoing GC ends.
– GC is not preemptive
Outline
• Introduction
• Background and Motivation
• Semi-Preemptive Garbage Collection
– Semi-Preemption
– Further Optimization
– Level of Allowed Preemption
• Evaluation
• Conclusion
Technique #1: Semi-Preemption
[Figure: timeline of a GC sequence R x, W x, R y, W y, E with an incoming request W z. With non-preemptive GC, W z is served after E; with preemptive GC, W z is served in the middle of the GC sequence]
Legend: R x = read page x; W x = write page x; E = erase a block; data transfer; meta data update
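The benefit shown in this timeline can be estimated with a small model. The sketch below assumes that preemption points sit at the boundaries between GC operations, that GC starts at time 0, and that an operation in progress is never interrupted. It folds the data transfer and meta data update steps from the legend into the read and write operations. The function names are hypothetical; the latencies come from the NAND Flash Organization slide.

```python
from itertools import accumulate

# GC sequence from the slide with per-operation latencies in milliseconds.
GC_OPS = [("R x", 0.025), ("W x", 0.200), ("R y", 0.025), ("W y", 0.200), ("E", 1.500)]


def wait_non_preemptive(arrival_ms, ops=GC_OPS):
    """The request waits until the whole GC sequence has finished."""
    gc_end = sum(duration for _, duration in ops)
    return max(0.0, gc_end - arrival_ms)


def wait_semi_preemptive(arrival_ms, ops=GC_OPS):
    """The request waits only until the next boundary between GC operations."""
    for boundary in accumulate(duration for _, duration in ops):
        if boundary >= arrival_ms:
            return boundary - arrival_ms
    return 0.0  # GC already finished before the request arrived


if __name__ == "__main__":
    t = 0.1  # hypothetical arrival time of W z, during "W x"
    print(f"non-preemptive wait:  {wait_non_preemptive(t):.3f} ms")
    print(f"semi-preemptive wait: {wait_semi_preemptive(t):.3f} ms")
```

In this model the worst-case wait drops from the length of the whole GC sequence to the length of the longest single operation, which is the erase.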
Technique #2: Merge
[Figure: timeline of a GC sequence R x, W x, R y, W y, E with an incoming request R y, which targets the same page y that GC reads and is merged with the GC operation]
Legend: R x = read page x; W x = write page x; E = erase a block; data transfer; meta data update
Technique #3: Pipeline
[Figure: timeline of a GC sequence R x, W x, R y, W y, E with an incoming request R z, which is pipelined with the GC operations]
Legend: R x = read page x; W x = write page x; E = erase a block; data transfer; meta data update
Level of Allowed Preemption
• Drawback of PGC: the completion time of GC is delayed
– May incur a lack of free blocks
– Sometimes need to prohibit preemption
• States of PGC
State     Garbage collection   Read requests   Write requests
State 0   X
State 1   O                    O               O
State 2   O                    O               X
State 3   O                    X               X
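The state table can be read as a small lookup. The sketch below is one interpretation, not the authors' implementation: it assumes that an O in the "Garbage collection" column means GC is in progress, and that an O in a request column means requests of that kind may preempt GC. The state names and the function `served_immediately` are hypothetical. How a device moves between states is not given on the slide and is not modeled here.

```python
from enum import IntEnum


class PGCState(IntEnum):
    # Names are hypothetical; the slide only numbers the states.
    GC_NOT_RUNNING = 0        # State 0: X under "Garbage collection"
    FULL_PREEMPTION = 1       # State 1: reads O, writes O
    READ_ONLY_PREEMPTION = 2  # State 2: reads O, writes X
    NO_PREEMPTION = 3         # State 3: reads X, writes X


# Which request kinds may preempt an in-progress GC in each state.
_MAY_PREEMPT = {
    PGCState.FULL_PREEMPTION: {"read": True, "write": True},
    PGCState.READ_ONLY_PREEMPTION: {"read": True, "write": False},
    PGCState.NO_PREEMPTION: {"read": False, "write": False},
}


def served_immediately(state, kind):
    """Return True if a request of this kind need not wait for GC to finish."""
    if kind not in ("read", "write"):
        raise ValueError(f"unknown request kind: {kind!r}")
    if state == PGCState.GC_NOT_RUNNING:
        return True  # assumed: no GC in progress, so nothing to wait for
    return _MAY_PREEMPT[PGCState(state)][kind]


if __name__ == "__main__":
    for state in PGCState:
        print(state.name, {k: served_immediately(state, k) for k in ("read", "write")})
```

Under this reading, the states trade foreground latency against the risk of running out of free blocks: higher-numbered states let fewer requests delay GC.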
Outline
• Introduction
• Background and Motivation
• Semi-Preemptive Garbage Collection
• Evaluation
– Setup
– Synthetic Workloads
– Realistic Workloads
• Conclusion
Setup
• Simulator
– MSR's SSD simulator based on DiskSim
• Workloads
– Synthetic workloads
  • Used the synthetic workload generator in DiskSim
– Realistic workloads
Type             Workload    Average request size (KB)   Read ratio (%)   Arrival rate (IOP/s)
Write dominant   Financial   7.09                        18.92            47.19
Write dominant   Cello       7.06                        19.63            74.24
Read dominant    TPC-H       31.62                       91.80            172.73
Read dominant    OpenMail    9.49                        63.30            846.62
Performance Improvements for Synthetic Workloads
• Varied four parameters: request size, inter-arrival time, sequentiality, and read/write ratio
• Varied one parameter at a time while holding the others fixed
[Figure: x-axis: Request size (KB), 8 to 64; y-axes: Average response time (ms) and Standard deviation; series: NPGC, PGC, NPGC std, PGC std]
Performance Improvement for Synthetic Workloads (cont'd)
[Figure: three panels. "Bursty": x-axis Inter-arrival time (ms), 10 down to 1. "Random dominant": x-axis Probability of sequential access, 0.8 down to 0.2. "Write dominant": x-axis Probability of read access, 0.8 down to 0.2]
Performance Improvement for Realistic Workloads
• Variance of Response Times
[Figure: normalized standard deviation for Financial, Cello, TPC-H, OpenMail]
Improvement of variance of response time by 49.8% and 83.3% for Financial and Cello.
• Average Response Time
[Figure: normalized average response time for Financial, Cello, TPC-H, OpenMail]
Improvement of average response time by 6.5% and 66.6% for Financial and Cello.