
Need for a Deeper Cross-Layer Optimization for Dense NAND SSD to Improve Read Performance of Big Data Applications: A Case for Melded Pages
Arpith K and K. Gopinath, Indian Institute of Science, Bangalore


  1. Need for a Deeper Cross-Layer Optimization for Dense NAND SSD to Improve Read Performance of Big Data Applications: A Case for Melded Pages. Arpith K and K. Gopinath, Indian Institute of Science, Bangalore.

  2. Organization of a Flash Package
     • Die: Smallest unit that can independently execute commands.
     • Plane: Smallest unit that can serve an I/O request in parallel.
     • Block: Smallest unit that can be erased.
     • Page: Smallest unit that can be read or programmed.
     • Cell: A single floating-gate transistor, storing one or more bits.

  3. Floating Gate Transistors: The presence of electrons in the floating gate increases the threshold voltage of the cell.

  4. [Figure: probability density of cell threshold voltage, showing two states (state 1 and state 0) separated within the threshold window.]

  5. Reads: The number of threshold-voltage states determines how many bits a transistor can store. [Figure: threshold-voltage distributions for MLC (4 states, 2 bits per cell) and TLC (8 states, 3 bits per cell).]

  6. Reads (TLC): read reference voltages needed per page type:
     • LSB page: V3
     • CSB page: V1, V5
     • MSB page: V0, V2, V4, V6
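
A small sketch of this mapping, useful for the read-cost reasoning that follows: the number of reference voltages that must be applied grows from 1 for the LSB page to 4 for the MSB page. The voltage names are the ones on the slide; the snippet itself is only illustrative, not code from the paper.

```python
# Read reference voltages needed to resolve each logical page of a TLC wordline
# (V0..V6 are the seven reference levels separating the eight threshold states).
TLC_READ_LEVELS = {
    "LSB": ["V3"],
    "CSB": ["V1", "V5"],
    "MSB": ["V0", "V2", "V4", "V6"],
}

for page, levels in TLC_READ_LEVELS.items():
    # Each applied level is one sensing step; the MSB page needs the most.
    print(f"{page} page: {len(levels)} sensing step(s) using {', '.join(levels)}")
```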

  7. Organization of Transistors in a Block: Page (smallest unit that can be read or programmed).

  8. Organization of Transistors in a Block: [Figure: each cell on a wordline contributes one bit to the MSB page, one to the CSB page, and one to the LSB page of that wordline.]

  9. Read Latency for TLC:
     • LSB page: 58 µs
     • CSB page: 78 µs
     • MSB page: 107 µs

  10. Sources of Read Overheads:
     • Address translation
     • Accessing the wordline
     • Setting up the block that contains the requested data
     • Post-processing operations (such as detecting and correcting bit errors)
     [Figure: die organization showing Die 0 and Die 1, each with a block decoder and Blocks 0 to n-1.]

  11. Block Setup: [Figure: the read reference voltage V_read is applied to the selected wordline while the pass-through voltage V_pass is applied to all other wordlines in the block.]

  12. Sources of Read Overheads (repeated from slide 10): address translation, accessing the wordline, setting up the block that contains the requested data, and post-processing operations such as detecting and correcting bit errors.

  13. Reads:
     X → Overhead: time to address a wordline, apply the pass-through voltage to the other wordlines in that block, and post-process the data.
     Y → Time required to apply one read reference voltage and sense the cell's conductivity.
     • LSB page: X + Y = 58 µs
     • CSB page: X + 2Y = 78 µs
     • MSB page: X + 4Y = 107 µs

  14. Melded Pages: The total time to read all three pages of a wordline drops from (3X + 7Y), three separate page reads, to (X + 7Y), because the fixed overhead X is paid only once.
     • LSB page 58 µs, CSB page 78 µs, MSB page 107 µs: 243 µs for three separate reads
     • Melded page (all three pages in one operation): 166 µs
     [Figure: a melded page spans the LSB, CSB, and MSB pages of one wordline.]
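
As a rough check on these numbers, the X/Y model from slide 13 can be solved from the measured LSB and CSB latencies and then used to compare three separate reads against one melded read. This is only a back-of-the-envelope sketch based on the figures shown on the slides; the constants are derived here, not quoted from the paper.

```python
# Solve for X (fixed per-read overhead) and Y (one sensing step) using slide 13:
#   LSB: X + Y = 58 us,  CSB: X + 2Y = 78 us
lsb, csb, msb = 58.0, 78.0, 107.0
Y = csb - lsb          # ~20 us per sensing step
X = lsb - Y            # ~38 us of fixed overhead

separate = 3 * X + 7 * Y   # three independent page reads (1 + 2 + 4 sensing steps)
melded   = 1 * X + 7 * Y   # one melded read: overhead X paid once

print(f"X = {X:.0f} us, Y = {Y:.0f} us")
print(f"three separate reads (model): {separate:.0f} us  (measured: {lsb + csb + msb:.0f} us)")
print(f"one melded read (model):      {melded:.0f} us  (measured: 166 us)")
```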

  15. Melded Pages: Schedule the writes so that, later, when the data is read back, the requests for data in the LSB, CSB, and MSB pages of a wordline are all present in the read request queue at the same time. [Figure: melded page spanning the MSB, CSB, and LSB pages of one wordline.]

  16.-36. Scheduling of Writes (animation frames): [Figure sequence: the write request queue holds Req0 (12 KB) and Req1 (12 KB); each request is split into 4 KB chunks numbered 0 to 5, which are then programmed, one frame at a time, into the LSB, CSB, and MSB pages of wordlines WL0 to WL2 of a block. Two different placement orders of the same six chunks are animated, showing how the placement at write time determines whether a later read can be served with melded-page reads.] A sketch of one such placement policy follows.
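
The slides do not spell out the placement algorithm, but the intent stated on slide 15 suggests a policy along these lines: split incoming writes into page-sized chunks and place the chunks of one request onto the LSB, CSB, and MSB pages of the same wordlines, so that a later sequential read of that request maps onto whole wordlines and can be issued as melded-page reads. The sketch below is a hypothetical illustration of that idea, not the authors' scheduler; names such as WriteScheduler and Wordline are invented for the example, and real NAND program-order constraints are ignored.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4 * 1024          # 4 KB logical pages, as in the slides
PAGES_PER_WORDLINE = 3        # TLC: LSB, CSB, MSB

@dataclass
class Wordline:
    # pages[0] = LSB, pages[1] = CSB, pages[2] = MSB; None means unprogrammed
    pages: list = field(default_factory=lambda: [None] * PAGES_PER_WORDLINE)

class WriteScheduler:
    """Hypothetical placement policy: keep the chunks of one write request
    together on the same wordlines so the request can later be read back
    with melded-page reads (all three pages of a wordline in one operation)."""

    def __init__(self, num_wordlines=3):
        self.block = [Wordline() for _ in range(num_wordlines)]
        self.next_wl = 0

    def submit(self, request_id, size_bytes):
        chunks = [f"{request_id}.{i}" for i in range(size_bytes // PAGE_SIZE)]
        # Fill one wordline completely (LSB, CSB, MSB) before moving to the
        # next, so consecutive chunks of the request share a wordline.
        for i in range(0, len(chunks), PAGES_PER_WORDLINE):
            wl = self.block[self.next_wl]
            for slot, chunk in enumerate(chunks[i:i + PAGES_PER_WORDLINE]):
                wl.pages[slot] = chunk
            self.next_wl += 1

sched = WriteScheduler()
sched.submit("Req0", 12 * 1024)   # 12 KB -> three 4 KB chunks on one wordline
for idx, wl in enumerate(sched.block):
    print(f"WL{idx}: LSB={wl.pages[0]} CSB={wl.pages[1]} MSB={wl.pages[2]}")
```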

  37. It is only beneficial to use melded pages when large amounts of data need to be read. How large is large enough?

  38. SSD configuration used for the evaluation:
     • Number of channels: 8
     • Number of parallel units (LUNs) per channel: 8
     • Total number of parallel units: 64
     • Channel operating frequency: 800 MT/s
     • Page size: 4 KB
     [Figure: channels fanning out to the LUNs.]
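
One way to see why small reads cannot benefit: with 64 parallel units and 4 KB pages, a read must reach 64 × 4 KB = 256 KB before every unit holds even one page of it, and must be much larger before each unit has several full wordlines' worth of LSB, CSB, and MSB pages to meld. A rough sketch of that arithmetic, assuming the data is striped one page at a time across the units (the striping assumption is mine, not stated on the slide):

```python
NUM_CHANNELS = 8
UNITS_PER_CHANNEL = 8
PARALLEL_UNITS = NUM_CHANNELS * UNITS_PER_CHANNEL   # 64
PAGE_SIZE = 4 * 1024                                 # 4 KB
PAGES_PER_WORDLINE = 3                               # TLC

for exp in (15, 18, 21, 24, 27):
    read_size = 1 << exp
    pages = read_size // PAGE_SIZE
    pages_per_unit = pages / PARALLEL_UNITS
    wordlines_per_unit = pages_per_unit / PAGES_PER_WORDLINE
    print(f"2^{exp} B read: {pages} pages, "
          f"{pages_per_unit:.1f} pages/unit, "
          f"{wordlines_per_unit:.1f} full wordlines/unit available to meld")
```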

  39. Time to fulfill the request (µs), Normal TLC vs. Melded TLC, by read size:
     Read size   Normal TLC (µs)   Melded TLC (µs)
     2^12        63                183
     2^13        63                183
     2^14        63                183
     2^15        63                183
     2^16        69                183
     2^17        81                200
     2^18        104               218
     2^19        188               270
     2^20        364               401
     2^21        708               636
     2^22        1406              1134
     2^23        2791              2103
     2^24        5572              4068
     2^25        11124             7971
     2^26        22236             15803
     2^27        44452             31440
     [Chart: time to fulfill the request (µs) vs. read size (2^12 to 2^27) for Normal TLC and melded ("SuperPaged") TLC; annotation: improvement of 41.3%.]
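
From this table, melded reads start to win somewhere between 1 MB and 2 MB. A few lines to locate the crossover and the speedup at the top end; the numbers are copied from the slide:

```python
# (read_size_exponent, normal_tlc_us, melded_tlc_us) from slide 39
data = [
    (12, 63, 183), (13, 63, 183), (14, 63, 183), (15, 63, 183),
    (16, 69, 183), (17, 81, 200), (18, 104, 218), (19, 188, 270),
    (20, 364, 401), (21, 708, 636), (22, 1406, 1134), (23, 2791, 2103),
    (24, 5572, 4068), (25, 11124, 7971), (26, 22236, 15803), (27, 44452, 31440),
]

# First read size at which the melded read is faster than the normal read.
crossover = next(exp for exp, normal, melded in data if melded < normal)
print(f"melded TLC first beats normal TLC at 2^{crossover} bytes "
      f"({(1 << crossover) >> 20} MB)")

exp, normal, melded = data[-1]
print(f"at 2^{exp} bytes, speedup = {normal / melded:.2f}x "
      f"({(normal / melded - 1) * 100:.1f}% improvement)")
```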

  40. The same Normal TLC vs. Melded TLC latency table as slide 39, shown alongside the SSD parallel-unit diagram. [Figure: channels and LUNs serving the read.]

  41. It is only beneficial to use melded pages when large amounts of data need to be read. Problem: the decision to use melded pages has to be made in the program (write) phase. How does the scheduler know, at write time, what the read pattern will be?

  42. Directives (Hints):
     • The host provides hints to the scheduler when submitting the write request.
     • NVMe's Directives support (NVMe 1.3 and above) provides the ability to exchange extra metadata in the headers of ordinary NVMe commands.
     • The proposal is to add a new directive that lets the application declare its expected read pattern.
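
NVMe write commands carry a directive type and a directive-specific field; the proposal would define a new directive type whose directive-specific value encodes the expected read pattern. The sketch below only models that idea in plain Python: the directive type value, the hint encoding, and the WriteCommand class are all made up for illustration and are not part of the NVMe specification or the authors' implementation.

```python
from dataclasses import dataclass
from enum import IntEnum

# Hypothetical directive type for declaring expected read patterns.
# (NVMe 1.3 defines the Identify and Streams directives; this value is invented.)
DTYPE_READ_PATTERN = 0x42

class ReadPattern(IntEnum):
    UNKNOWN = 0
    SMALL_RANDOM = 1      # prefer normal page placement
    LARGE_SEQUENTIAL = 2  # prefer melded-page-friendly placement

@dataclass
class WriteCommand:
    lba: int
    length_blocks: int
    dtype: int = 0        # directive type field of the write command
    dspec: int = 0        # directive-specific field carrying the hint

def write_with_hint(lba, length_blocks, pattern):
    """Attach the expected read pattern to a write so the SSD's scheduler
    can decide whether to lay the data out for melded-page reads."""
    return WriteCommand(lba, length_blocks,
                        dtype=DTYPE_READ_PATTERN, dspec=int(pattern))

cmd = write_with_hint(lba=0x1000, length_blocks=24,
                      pattern=ReadPattern.LARGE_SEQUENTIAL)
print(cmd)
```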

  43. Generating Hints:
     • The host provides hints to the scheduler when submitting the write request.
     • These hints can be provided explicitly by the developer or generated automatically by looking at the access history.
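
One simple way such automatic hints could be generated (a guess at the mechanism, not taken from the paper): track how each file or extent has been read in the past, and if its reads are predominantly large, tag future writes to it with the LARGE_SEQUENTIAL hint from the previous sketch. The class and threshold below are hypothetical.

```python
from collections import defaultdict

LARGE_READ_THRESHOLD = 2 * 1024 * 1024   # ~2 MB, roughly where melded reads start to pay off

class HintGenerator:
    """Hypothetical history-based hint generator: remembers read sizes per file
    and recommends a melded-friendly layout for files that are usually read big."""

    def __init__(self):
        self.read_sizes = defaultdict(list)

    def record_read(self, file_id, size_bytes):
        self.read_sizes[file_id].append(size_bytes)

    def hint_for_write(self, file_id):
        sizes = self.read_sizes.get(file_id)
        if not sizes:
            return "UNKNOWN"
        large = sum(1 for s in sizes if s >= LARGE_READ_THRESHOLD)
        return "LARGE_SEQUENTIAL" if large / len(sizes) > 0.5 else "SMALL_RANDOM"

gen = HintGenerator()
gen.record_read("part-00000", 128 * 1024 * 1024)   # HDFS-style large block read
gen.record_read("part-00000", 64 * 1024 * 1024)
print(gen.hint_for_write("part-00000"))            # -> LARGE_SEQUENTIAL
```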

  44. Hadoop Distributed File System:
     • Hadoop and Spark are open-source cluster-computing frameworks for large-scale data processing.
     • The data itself is managed using HDFS.
     • HDFS is designed to store very large files across the machines of a large cluster.

  45. Hadoop Distributed File System:
     • NameNode: an HDFS cluster consists of a single NameNode, which manages metadata and maintains the mapping of blocks to DataNodes.
     • DataNodes: usually one per node in the cluster; they store the blocks of data.
