RAIDP: ReplicAtion with Intra-Disk Parity
Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor, Dan Tsafrir


  1. RAIDP: ReplicAtion with Intra-Disk Parity Eitan Rosenfeld, Aviad Zuck, Nadav Amit, Michael Factor, Dan Tsafrir Slide 1 of 41

  2. Today’s Datacenters Slide 2 of 41 Image Source: http://www.google.com/about/datacenters/gallery/#/tech/14

  3. Problem: Disks fail • So storage systems use redundancy when storing data • Two forms of redundancy: – Replication, or – Erasure codes Slide 3 of 41

  4.–11. Replication vs. Erasure Coding (animation, Slides 4–11 of 41). Two data blocks, a=2 and b=3, are stored across several disks. With replication, each block also has a copy on a second disk, so when a disk fails (marked X) its blocks survive on their replicas. With erasure coding, the extra disks instead hold parity blocks a+b=5 and a+2b=8, and after failures the lost blocks are recomputed from whatever data blocks and parities survive.
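
To make the erasure-coding side of the example concrete, here is a minimal sketch (my own illustration, reusing the slide's values a=2, b=3 and parities a+b=5, a+2b=8, not RAIDP code) of how the lost values are recomputed from the surviving parities:

```python
# Illustrative arithmetic only, using the slide's values (not RAIDP code).
a, b = 2, 3
p1, p2 = a + b, a + 2 * b      # stored parities: a+b = 5 and a+2b = 8

# Suppose the disks holding a and b both fail; rebuild them from the parities:
b_rec = p2 - p1                # (a + 2b) - (a + b) = b
a_rec = p1 - b_rec             # (a + b)  - b       = a
assert (a_rec, b_rec) == (2, 3)
```

The same idea generalizes to Reed-Solomon codes, where k data blocks plus m parity blocks can tolerate the loss of any m blocks.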

  12. Many modern systems replicate warm data • Amazon’s storage services • Google File System (GFS) • Facebook’s Haystack • Windows Azure Storage (WAS) • Microsoft’s Flat Datacenter Storage (FDS) • HDFS (open-source file-system for Hadoop) • Cassandra • ... Slide 12 of 41

  13. Why is replication advantageous for warm data? Better for reads: 1. Load balancing ✓ 2. Parallelism ✓ 3. Avoids degraded reads ✓ Better for writes: 4. Lower sync latency ✓ Better for reads and writes: 5. Increased sequentiality ✓ 6. Avoids the CPU processing used for encoding ✓ 7. Lower repair traffic ✓ Slide 13 of 41

  14.–16. Recovery in replication-based systems is efficient (animation, Slides 14–16 of 41). Numbered chunks are spread over Disks 1–4 with two replicas each; when Disk 1 fails (marked X), every chunk it held still has a replica on some surviving disk, so only the lost chunks need to be read and the copies can be fetched from several disks in parallel.
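
As a small illustration of why this recovery is cheap, the sketch below (a hypothetical chunk-to-disk map, not the talk's implementation) shows that only the failed disk's chunks need to be read, and each one can be copied from a different surviving replica, spreading the repair work across the cluster:

```python
# Hypothetical replica map: chunk id -> set of disks holding it (2 replicas each).
replicas = {1: {1, 3}, 2: {1, 4}, 3: {2, 3}, 4: {2, 4}, 5: {3, 4}, 6: {1, 2}}

def recovery_sources(failed_disk):
    """For each chunk lost with `failed_disk`, pick a surviving replica to copy from."""
    return {chunk: (disks - {failed_disk}).pop()
            for chunk, disks in replicas.items() if failed_disk in disks}

print(recovery_sources(1))   # e.g. {1: 3, 2: 4, 6: 2} -- reads spread over several disks
```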

  17.–20. Erasure coding, on the other hand… (animation, Slides 17–20 of 41). Stripes A–D each consist of three data blocks plus a parity block spread across Disks 1–4; when Disk 1 fails (marked X), rebuilding each of its blocks requires reading all surviving blocks of that stripe from the other disks, so repair traffic far exceeds the amount of data actually lost. Facebook “estimate[s] that if 50% of the cluster was Reed-Solomon encoded, the repair network traffic would completely saturate the cluster network links”.

  21. Modern replicating systems triple-replicate warm data • Amazon’s DynamoDB • Facebook’s Haystack • Google File System (GFS) • Windows Azure Storage (WAS) • Microsoft’s Flat Datacenter Storage (FDS) • HDFS (open-source file-system for Hadoop) • Cassandra • ... Slide 21 of 41

  22. Bottom Line • Replication is used for warm data only • It’s expensive! (Wastes storage, energy, network) • Erasure coding used for the rest (cold data) Our goal: Quickly recover from two simultaneous disk failures without resorting to a third replica for warm data Slide 22 of 41

  23. RAIDP - ReplicAtion with Intra-Disk Parity • Hybrid storage system for warm data with only two* copies of each data object. • Recovers quickly from a simultaneous failure of any two disks • Largely enjoys the aforementioned 7 advantages of replication Slide 23 of 41

  24. System Architecture (diagram: Disks 1–5) Slide 24 of 41

  25. System Architecture • Each of the N disks is divided into N-1 superchunks – e.g. 4GB each (Disks 1–5 shown) Slide 25 of 41

  26. System Architecture • Each of the N disks is divided into N-1 superchunks – e.g. 4GB each • 1-Mirroring: Superchunks must be 2-replicated (diagram: superchunks 1–10 spread over Disks 1–5, each stored on two disks) Slide 26 of 41

  27. System Architecture • Each of the N disks is divided into N-1 superchunks – e.g. 4GB each • 1-Mirroring: Superchunks must be 2-replicated • 1-Sharing: Any two disks share at most one superchunk (same diagram: superchunks 1–10 over Disks 1–5) Slide 27 of 41
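
A minimal sketch of one way to build such a layout (my own illustration, not necessarily the placement algorithm used by the system): create exactly one superchunk per pair of disks, which gives 2-replication and makes any two disks share exactly one superchunk.

```python
from itertools import combinations

N = 5  # number of disks
# One superchunk per unordered pair of disks: N*(N-1)/2 = 10 superchunks in total,
# and each disk ends up holding N-1 = 4 of them, matching the figure.
layout = {sc_id: pair for sc_id, pair in enumerate(combinations(range(N), 2), start=1)}

disks = {d: [sc for sc, pair in layout.items() if d in pair] for d in range(N)}
for d, scs in disks.items():
    print(f"Disk {d}: superchunks {scs}")

# 1-Sharing check: any two disks share exactly one superchunk.
for d1, d2 in combinations(range(N), 2):
    assert len(set(disks[d1]) & set(disks[d2])) == 1
```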

  28. Introducing “disk add-ons” (diagram: the add-on sits on the drive’s SATA/SAS and power path and holds the parity 1 ⨁ 2 ⨁ 3 ⨁ 4 of the drive’s superchunks) • Associated with a specific disk – Interposes all I/O to the disk – Stores an erasure code of the local disk’s superchunks – Fails separately from the associated disk Slide 29 of 41
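
A toy sketch of what the add-on could do on the write path, assuming (as the ⨁ in the figure suggests) that the erasure code is a simple XOR across the disk's superchunks; under that assumption the parity can be updated from the old and new contents alone, without reading the other superchunks. The class and method names are mine, for illustration only.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

class DiskAddOn:
    """Toy model of an add-on that interposes writes and maintains XOR parity."""
    def __init__(self, superchunk_size: int):
        self.parity = bytes(superchunk_size)   # XOR of all local superchunks, initially zero

    def on_write(self, old_data: bytes, new_data: bytes) -> None:
        # parity ^= old ^ new: removes the old contents, folds in the new ones.
        self.parity = xor_bytes(self.parity, xor_bytes(old_data, new_data))

addon = DiskAddOn(superchunk_size=4)
addon.on_write(old_data=b"\x00" * 4, new_data=b"\xab" * 4)   # first write to a region
```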

  29. RAIDP Recovery (diagram: two disks fail, marked X; each add-on holds the XOR of its disk’s superchunks: Add-on 1 = 1 ⨁ 2 ⨁ 6 ⨁ 8, Add-on 2 = 2 ⨁ 3 ⨁ 7 ⨁ 9, Add-on 3 = 3 ⨁ 4 ⨁ 8 ⨁ 10, Add-on 4 = 4 ⨁ 5 ⨁ 9 ⨁ 6, Add-on 5 = 5 ⨁ 1 ⨁ 10 ⨁ 7). Only the superchunk shared by the two failed disks, superchunk 1, loses both replicas. XOR Add-on 1 with the surviving superchunks from Disk 1 (read from their replicas on other disks): (1 ⨁ 2 ⨁ 6 ⨁ 8) ⊕ 2 ⊕ 6 ⊕ 8 = 1. Slide 30 of 41
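
And a matching sketch of the recovery step on this slide, using the layout implied by Add-on 1 (Disk 1 holds superchunks 1, 2, 6, 8); the byte contents are made up for illustration.

```python
from functools import reduce

def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

# Disk 1 held superchunks 1, 2, 6, 8; its add-on stored parity = 1 ^ 2 ^ 6 ^ 8.
superchunks = {1: b"\x11" * 4, 2: b"\x22" * 4, 6: b"\x66" * 4, 8: b"\x88" * 4}
addon_parity = reduce(xor_bytes, superchunks.values())

# Disk 1 and the disk that shares superchunk 1 with it both fail, so superchunk 1
# loses both replicas; superchunks 2, 6 and 8 still have replicas elsewhere.
survivors = [superchunks[i] for i in (2, 6, 8)]
recovered = reduce(xor_bytes, survivors, addon_parity)
assert recovered == superchunks[1]
```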

  30. (chart: schemes placed by repair traffic vs. storage capacity; triple replication of warm data uses the most storage with little repair traffic, erasure-coded cold data uses the least storage but the most repair traffic, and RAIDP sits in between, with little repair traffic after a single failure and more only after a double failure) Slide 31 of 41

  31. Lstor Feasibility Goal: Replace a third replica disk with 2 Lstors. Lstors need to be cheap, fast, and fail separately from the disk. – Storage: enough to maintain parity (~$9) [1] – Processing: microcontroller for local machine independence (~$5) [2] – Power: several hundred Amps for 2–3 min from a small supercapacitor, to read data from the Lstor. A commodity 2.5” 4TB disk for storing an additional replica costs $100: 66% more than a conservative estimate of the cost of two Lstors (roughly $60 for the pair). Slide 32 of 41

  32. Implementation in HDFS • RAIDP implemented in Hadoop 1.0.4 – Two variants: append-only and updates-in-place • 3K LOC extension to HDFS – Pre-allocated block files to simulate superchunks – Lstors simulated in memory – Added crash consistency and several optimizations Slide 33 of 41

  33. Evaluation • RAIDP vs. HDFS with 2 and 3 replicas • Tested on a 16-node cluster – Intel Xeon CPU E3-1220 V2 @ 3.10GHz – 16GB RAM – 7200 RPM disks • 10Gbps Ethernet • 6GB superchunks, ~800GB cluster capacity Slide 34 of 41

  34. Hadoop write throughput (Runtime of writing 100GB) [bar chart comparing HDFS and RAIDP; labels: HDFS-2, HDFS-3, Updates-in-place, Superchunks-only, Lstor] RAIDP completes the workload 22% faster! For updates in place: RAIDP performs 4 I/Os for each write → both replicas are read before they are overwritten Slide 35 of 41

  35. Hadoop read throughput (Runtime of reading 100GB) [bar chart comparing HDFS and RAIDP; labels: HDFS-2, HDFS-3, Updates-in-place, Superchunks-only, Lstor] Slide 36 of 41

  36. Write Runtime vs. Network Usage [two charts comparing HDFS-3 and RAIDP: runtime of writing 100GB, and network usage in GB when writing 100GB] Slide 37 of 41

  37. TeraSort Runtime vs. Network Usage [two charts comparing HDFS-3 and RAIDP: runtime of sorting 100GB, and network usage in GB when sorting 100GB] Slide 38 of 41

  38. Recovery time in RAIDP (16-node cluster, 6GB superchunks)
      System   | 1Gbps Network | 10Gbps Network
      RAIDP    | 827 s         | 125 s
      RAID-6   | 12,300 s      | 1,823 s
      RAIDP recovers 14x faster! For erasure coding, such a recovery is required for every disk failure. For RAIDP, such a recovery is only required after the 2nd failure. Slide 39 of 41

  39. Vision and Future work • Survives two simultaneous failures with only two replicas • Can be augmented to withstand more than two simultaneous failures – “Stacked” LSTORs • Building Lstors instead of simulating them • Equipping Lstors with network interfaces so that they can withstand rack failures • Experiment with SSDs Slide 40 of 41

  40. Summary • RAIDP achieves similar failure tolerance as 3-way replicated systems – Better performance when writing new data – Small performance hit during updates • Yet: – Requires 33% less storage – Uses considerably less network bandwidth for writes – Recovery is much more efficient than EC • Opens the way for storage vendors and cloud providers to use 2 (instead of 3, or more) replicas – Potential savings in size, energy, and capacity Slide 41 of 41
