MD/RAID-456 Write Journal and Cache

MD/RAID-456 Write Journal and Cache, by Shaohua Li & Song Liu


  1. MD/RAID-456 Write Journal and Cache. Shaohua Li & Song Liu, Software Engineer, Facebook

  2. MD/RAID-456 Write Journal and Cache
      • Write holes of RAID-456
      • Hardware RAID: benefits and challenges
      • Write operation in MD/RAID-456
      • RAID-456 write journal: plug the write hole
      • RAID-456 write cache: fast fsync(), more full stripe writes
      • Examples

  3. Write Hole of RAID-456

  4. Failure Recovery of RAID-456
      • Disk failure recovery: rebuild data from parity
      • Power failure recovery: resync stripes with mismatched data and parity

  5. RAID-5: Data and Parity in Sync [diagram: D xor D xor D = P]

  6. RAID-5 Disk Failure [diagram: D D D P]

  7. RAID-5 Degraded Mode [diagram: D D P]

  8. RAID-5 Rebuild Data for Disk Failure: $D = P - D_1 - D_2$, where subtraction is XOR for RAID-5 [diagram: D1 D2 D P]
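The rebuild formula on this slide can be tried out on a few bytes. The sketch below is a minimal illustration, assuming RAID-5 style XOR parity; `xor_blocks` and the sample block values are made up for the example and are not taken from the MD code.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks and the parity that keeps the stripe in sync.
d1, d2, d3 = b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"
parity = xor_blocks(d1, d2, d3)

# Pretend the disk holding d3 failed: rebuild it from parity and the survivors.
rebuilt = xor_blocks(parity, d1, d2)
```

Because XOR is its own inverse, "subtracting" the surviving blocks from the parity is the same operation as computing the parity in the first place.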

  9. RAID-5 Power Failure [diagram]

  10. RAID-5 Power Failure [diagram]

  11. RAID-5 after Power Failure [diagram: D D D P]

  12. RAID-5 Resync after Power Failure: drop old parity [diagram: D D D]

  13. RAID-5 Resync after Power Failure: calculate new parity [diagram: D xor D xor D = P]

  14. Write Hole: Disk Failure + Power Failure [diagram: D D P]

  15. Write Hole: Rebuild Wrong Data [diagram: D D D P]
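The write hole can be reproduced with single-byte blocks. This is a toy sketch, assuming XOR parity and an update in which the new data reached its disk but the new parity did not; the values and the `xor_blocks` helper are illustrative only.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Consistent stripe on disk: three data blocks and their parity.
d1, d2, d3 = b"\x01", b"\x02", b"\x04"
parity = xor_blocks(d1, d2, d3)

# Power fails mid-update: the new d1 reached its disk, the new parity did not.
d1 = b"\x09"
stale_parity = parity

# Before a resync can repair the parity, the disk holding d3 also fails.
rebuilt_d3 = xor_blocks(stale_parity, d1, d2)
write_hole_hit = rebuilt_d3 != b"\x04"
```

The rebuilt block differs from what d3 actually held, even though d3 was never written during the interrupted update.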

  16. Hardware RAID

  17. Hardware RAID [diagram: battery-backed RAM; RAID disks: D D D P]

  18. Hardware RAID: No Write Hole [diagram: battery-backed RAM and RAID disks]

  19. Hardware RAID: No Write Hole [diagram: battery-backed RAM and RAID disks]

  20. Hardware RAID: No Write Hole [diagram: battery-backed RAM and RAID disks]

  21. Hardware RAID: Fast fsync(): return writes from RAM [diagram: battery-backed RAM and RAID disks]

  22. Hardware RAID: Fast fsync(): return writes from RAM [diagram: battery-backed RAM and RAID disks]

  23. Hardware RAID: Full Stripe Write [diagram: battery-backed RAM and RAID disks]

  24. Hardware RAID: Full Stripe Write [diagram: battery-backed RAM and RAID disks]

  25. Hardware RAID: Full Stripe Write [diagram: battery-backed RAM and RAID disks]

  26. Hardware RAID @ Facebook
      • RAID-6
        • Haystack: photo storage
        • GlusterFS: scalable network filesystem
      • RAID-0 of single HDD, for fast fsync()

  27. Challenges with Hardware RAID
      • Black box solution
      • Low transparency; low flexibility
      • Vendor specific toolset

  28. Make Software RAID Better
      • Write journal: plug write hole
      • Write cache: accelerate fsync(); more full stripe writes

  29. MD/RAID-456 Write

  30. RAID-456: Stripe Cache [diagram: stripe cache (system memory); RAID disks: D D D P]

  31. RAID-456 Write
      • Step 1: update data and parity in stripe cache
        • Option 1: Reconstruct
        • Option 2: R-M-W (read-modify-write)
      • Step 2: write data and parity to RAID disks
      • Step 3: bio_endio (complete the write to the caller)

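  32. RAID-456 Reconstruct Write [diagram: stripe cache: D; RAID disks: D D D P]

  33. RAID-456 Reconstruct Write [diagram: stripe cache: D D D; RAID disks: D D D P]

  34. RAID-456 Reconstruct Write [diagram: stripe cache: D xor D xor D = P; RAID disks: D D D P]

  35. RAID-456 Reconstruct Write [diagram: stripe cache: D D D P; RAID disks: D D D P]

  36. RAID-456 R-M-W Write [diagram: stripe cache: D; RAID disks: D D D P]

  37. RAID-456 R-M-W Write [diagram: stripe cache: D D P; RAID disks: D D D P]

  38. RAID-456 R-M-W Write: $P_{\text{new}} = P_{\text{old}} - D_{\text{old}} + D_{\text{new}}$ [diagram: stripe cache: D D P P; RAID disks: D D D P]

  39. RAID-456 R-M-W Write [diagram: stripe cache: D P; RAID disks: D D D P]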
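The two parity update options from the preceding slides should always agree on the resulting parity. The sketch below checks this for RAID-5 style XOR parity; the function names `reconstruct_parity` and `rmw_parity` are hypothetical and do not come from the MD source.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct_parity(new_data: bytes, untouched: list) -> bytes:
    """Reconstruct write: parity from the new block plus every untouched data block."""
    return xor_blocks(new_data, *untouched)

def rmw_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """R-M-W: P = P - D_old + D_new, where - and + are both XOR."""
    return xor_blocks(old_parity, old_data, new_data)

old = [b"\x10\x01", b"\x20\x02", b"\x40\x04"]
old_parity = xor_blocks(*old)
new_d0 = b"\xff\x00"  # overwrite the first data block

p_reconstruct = reconstruct_parity(new_d0, old[1:])
p_rmw = rmw_parity(old_parity, old[0], new_d0)
```

The difference between the options is which blocks must be read first: reconstruct needs the untouched data blocks, while R-M-W needs the old copy of the written block and the old parity.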

  40. MD/RAID-456 Write Journal

  41. RAID-456 Write Journal
      • Use block device (SSD, NVM, etc.) as the journal
      • No change to read path
      • All writes (data and parity) hit journal before committing to RAID array
      • For each stripe, “commit all” or “commit nothing”
      • Journal replay after power failure (no need for resync)

  42. RAID-456 Write Journal: Write Path
      • Step 1: update data and parity in stripe cache
      • Step 2: write data and parity to journal device
      • Step 3: flush journal device cache
      • Step 4: write data and parity to RAID disks
      • Step 5: bio_endio

  43. RAID-456 Write Journal: Disk Format
      • On-disk layout: meta block, data block, data block, …, data block, meta block, data block, data block, …, data block
      • Meta block header: magic, checksum, version, meta size, seq, position
      • Meta block entries (repeated):
        • type: data/parity/flush
        • flags: discard, reshape
        • size, sector, checksum

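  44. RAID-456 Write Journal [diagram: stripe cache: empty; RAID disks: D D D P; journal: empty]

  45. RAID-456 Write Journal [diagram: stripe cache: D P; RAID disks: D D D P; journal: empty]

  46. RAID-456 Write Journal [diagram: stripe cache: D P; RAID disks: D D D P; journal: D P]

  47. RAID-456 Write Journal [diagram: stripe cache: D P; RAID disks: D D D P; journal: D P]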

  48. RAID-456 Write Journal: Reclaim Path
      • Step 1: update journal device super block
      • Step 2: issue discard to journal device

  49. RAID-456 Write Journal: Recovery Path
      • For complete stripe (with data and parity) in journal
        • Replay all data/parity to RAID disks
      • For partial stripe (data only) in journal
        • Drop the journal entry

  50. After Power Failure: Replay Writes [diagram: stripe cache: empty; RAID disks: D D D P; journal: D P]

  51. After Power Failure: Replay Writes [diagram: stripe cache: empty; RAID disks: D D D P; journal: D P]

  52. After Power Failure: Drop Partial Journal [diagram: stripe cache: empty; RAID disks: D D D P; journal: D]

  53. After Power Failure: Drop Partial Journal [diagram: stripe cache: empty; RAID disks: D D D P; journal: empty]
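The journal recovery rule (replay complete stripes, drop partial ones) can be sketched as a small function. This is a simplified model under stated assumptions: the journal is a plain list of tuples, a stripe counts as complete as soon as a parity entry is present, and checksums and sequence numbers from the disk format are ignored. The `recover` function and its data layout are hypothetical.

```python
from collections import defaultdict

def recover(journal, disks):
    """Replay complete stripes (data + parity) onto disks; drop partial ones.

    journal: list of (stripe_no, kind, disk_index, payload), kind is "data" or "parity".
    disks:   dict mapping (stripe_no, disk_index) -> payload, updated in place.
    Returns the stripe numbers that were replayed.
    """
    by_stripe = defaultdict(list)
    for stripe_no, kind, disk_index, payload in journal:
        by_stripe[stripe_no].append((kind, disk_index, payload))

    replayed = []
    for stripe_no, entries in sorted(by_stripe.items()):
        if any(kind == "parity" for kind, _, _ in entries):
            for _, disk_index, payload in entries:
                disks[(stripe_no, disk_index)] = payload
            replayed.append(stripe_no)
        # Data only: the stripe never committed, so the disks stay untouched.
    return replayed

journal = [
    (0, "data", 0, b"new0"), (0, "parity", 3, b"par0"),  # complete stripe
    (1, "data", 1, b"new1"),                             # power failed before parity
]
disks = {(0, 0): b"old0", (0, 3): b"oldp", (1, 1): b"old1"}
replayed = recover(journal, disks)
```

Dropping the partial stripe is safe in journal-only mode because the write was not yet acknowledged to the caller when power failed.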

  54. MD/RAID-456 Write Cache

  55. RAID-456 Write Cache
      • Use same disk format as the write journal
      • Move bio_endio to a much earlier stage
      • Hold data in stripe cache
      • Read path must look up in stripe cache
      • Need reclaim and smarter recovery
      • Opportunity for more full stripe writes

  56. RAID-456 Write Cache: Read Path
      • Chunk-aligned read (bypass stripe cache, optimal state)
        • Step 1: look up data in stripe cache
        • Step 2: on a stripe cache miss, read from disk
        • Step 3: amend data from disk with latest data in stripe cache
      • Non-chunk-aligned read
        • No changes
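The three read steps amount to overlaying cached blocks on top of what the disks return. A minimal sketch, assuming block-granular lookups and plain dictionaries standing in for the stripe cache and the RAID disks; `read_blocks` is a hypothetical name.

```python
def read_blocks(start, count, stripe_cache, disk):
    """Read `count` blocks, preferring unflushed data held in the stripe cache.

    stripe_cache: dict block_no -> bytes, data acknowledged but not yet on the RAID disks.
    disk:         dict block_no -> bytes, what the RAID disks currently hold.
    """
    result = []
    for block_no in range(start, start + count):
        if block_no in stripe_cache:      # Step 1: hit in stripe cache
            result.append(stripe_cache[block_no])
        else:                             # Step 2: miss, read from disk
            result.append(disk[block_no])
    return result                         # Step 3: disk data amended with cached data

disk = {0: b"A0", 1: b"B0", 2: b"C0"}
stripe_cache = {1: b"B1"}  # block 1 was rewritten and acknowledged, not yet reclaimed
data = read_blocks(0, 3, stripe_cache, disk)
```

Without the lookup, a read issued right after an acknowledged write could return the stale on-disk copy of block 1.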

  57. RAID-456 Write Cache: Write Path
      • Step 1: write data to journal device
      • Step 2: flush journal device cache
      • Step 3: bio_endio
      • Step 4: update data and parity in stripe cache
      • Step 5: write parity to journal device
      • Step 6: flush journal device cache
      • Step 7: write data and parity to RAID disks

  58. RAID-456 Write Cache: Write Path [diagram: stripe cache: D; RAID disks: D D D P; journal: empty]

  59. RAID-456 Write Cache: Write Path [diagram: stripe cache: D; RAID disks: D D D P; journal: D]

  60. RAID-456 Write Cache: Write Path: bio_endio [diagram: stripe cache: D; RAID disks: D D D P; journal: D]

  61. RAID-456 Write Cache: Write Path [diagram: stripe cache: D D; RAID disks: D D D P; journal: D]

  62. RAID-456 Write Cache: Write Path [diagram: stripe cache: D D; RAID disks: D D D P; journal: D D]

  63. RAID-456 Write Cache: Write Path: bio_endio [diagram: stripe cache: D D; RAID disks: D D D P; journal: D D]

  64. RAID-456 Write Cache: Reclaim Path
      • Step 1: update data and parity in stripe cache
      • Step 2: write parity to journal device
      • Step 3: flush journal device cache
      • Step 4: write data and parity to RAID disks

  65. RAID-456 Write Cache: Reclaim Path [diagram: stripe cache: D D; RAID disks: D D D P; journal: D D]

  66. RAID-456 Write Cache: Reclaim Path [diagram: stripe cache: D D D; RAID disks: D D D P; journal: D D]

  67. RAID-456 Write Cache: Reclaim Path [diagram: stripe cache: D xor D xor D = P; RAID disks: D D D P; journal: D D]

  68. RAID-456 Write Cache: Reclaim Path [diagram: stripe cache: D D D P; RAID disks: D D D P; journal: D D P]

  69. RAID-456 Write Cache: Reclaim Path [diagram: stripe cache: D D D P; RAID disks: D D D P; journal: D D P]
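The split between the write path (acknowledge early) and the reclaim path (parity and disk writes later) can be modelled with a toy single-stripe class. Everything here is a sketch under assumptions: `ToyWriteCache` and its fields are hypothetical, journal flushes are implied rather than modelled, and parity is plain XOR.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

class ToyWriteCache:
    """Toy single-stripe model of the write cache; all names are hypothetical."""

    def __init__(self, data_blocks):
        self.disks = {"data": list(data_blocks), "parity": xor_blocks(*data_blocks)}
        self.journal = []  # stands in for the journal device
        self.cache = {}    # stripe cache: disk index -> unflushed data
        self.acked = []    # writes for which bio_endio has been called

    def write(self, index, payload):
        self.journal.append(("data", index, payload))  # write data to journal (then flush)
        self.acked.append(index)                       # bio_endio: caller sees completion
        self.cache[index] = payload                    # hold data until reclaim

    def reclaim(self):
        data = [self.cache.get(i, old) for i, old in enumerate(self.disks["data"])]
        parity = xor_blocks(*data)                     # update parity in stripe cache
        self.journal.append(("parity", None, parity))  # write parity to journal (then flush)
        self.disks = {"data": data, "parity": parity}  # write data and parity to RAID disks
        self.cache.clear()

wc = ToyWriteCache([b"\x01", b"\x02", b"\x04"])
wc.write(0, b"\x08")
acked_before_disk_update = wc.acked == [0] and wc.disks["data"][0] == b"\x01"
wc.reclaim()
```

The write is acknowledged while the RAID disks still hold the old block, which is what makes fsync() fast; the cost is that reads and recovery must account for data that exists only in the journal and the stripe cache.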

  70. RAID-456 Write Cache: Recover Path
      • Stripes with data and parity in journal
        • Replay writes of data and parity
      • Stripes with data in journal
        • Repeat full reconstruct write or R-M-W write

  71. Recover Stripe Data: Reconstruct [diagram: stripe cache: empty; RAID disks: D D D P; journal: D]

  72. Recover Stripe Data: Reconstruct [diagram: stripe cache: D; RAID disks: D D D P; journal: D]

  73. Recover Stripe Data: Reconstruct [diagram: stripe cache: D D D; RAID disks: D D D P; journal: D]

  74. Recover Stripe Data: Reconstruct [diagram: stripe cache: D xor D xor D = P; RAID disks: D D D P; journal: D]

  75. Recover Stripe Data: Reconstruct [diagram: stripe cache: D D D P; RAID disks: D D D P; journal: D P]

  76. Recover Stripe Data: Reconstruct [diagram: stripe cache: D D D P; RAID disks: D D D P; journal: D P]

  77. Recover Stripe Data: R-M-W [diagram: stripe cache: D; RAID disks: D D P; journal: D]

  78. Recover Stripe Data: R-M-W: $P_{\text{new}} = P_{\text{old}} - D_{\text{old}} + D_{\text{new}}$ [diagram: stripe cache: D D P P; RAID disks: D D P; journal: D]

  79. Recover Stripe Data: R-M-W [diagram: stripe cache: D P; RAID disks: D D P; journal: D P]

  80. Recover Stripe Data: R-M-W [diagram: stripe cache: D P; RAID disks: D D P; journal: D P]
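Unlike journal-only recovery, the write cache cannot drop a data-only stripe, because that data was already acknowledged to the caller. The sketch below shows the reconstruct variant only, assuming XOR parity and a healthy array; `recover_stripe` and its arguments are hypothetical names.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_stripe(journal_data, journal_parity, disk_data):
    """Return (data, parity) to write back for one stripe after a crash.

    journal_data:   dict disk_index -> payload found in the journal.
    journal_parity: parity payload found in the journal, or None.
    disk_data:      list of data blocks currently on the RAID disks.
    """
    # Journaled blocks override what the disks hold.
    data = [journal_data.get(i, old) for i, old in enumerate(disk_data)]
    if journal_parity is not None:
        return data, journal_parity   # data and parity journaled: replay as-is
    return data, xor_blocks(*data)    # data only: redo the reconstruct write

data, parity = recover_stripe({0: b"\x08"}, None, [b"\x01", b"\x02", b"\x04"])
```

On a degraded array the untouched data blocks may not all be readable, which is where the R-M-W variant from the slides would be needed instead.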

  81. Current Status
      • Write Journal
        • Kernel changes released with kernel 4.4
        • mdadm changes released with mdadm-3.4
      • Write Cache
        • Kernel changes in progress
        • No change required for mdadm

  82. Examples
      # create array with write journal
      mdadm --create -f /dev/md0 -c 64 --raid-devices=4 --level=5 /dev/sd[b-e] --write-journal /dev/sdf
      # check array with journal
      cat /proc/mdstat
      Personalities : [raid6] [raid5] [raid4]
      md0 : active raid5 sdf[4](J) sde[3] sdd[2] sdc[1] sdb[0]
      # add journal to existing array
      mdadm --manage /dev/md0 --add-journal /dev/sdf
