A Tale of Two Erasure Codes in HDFS


  1. A Tale of Two Erasure Codes in HDFS  Mingyuan Xia* , Mohit Saxena+ , Mario Blaum+ , and David A. Pease+  *McGill University, +IBM Research Almaden  FAST '15  Presented by He Junquan, 2015-04-30

  2. Outline  Introduction & Motivation  Design  Evaluation  Conclusions  Related Work

  3. Introduction & Motivation

  4. Big Data Storage  Reliability and availability  Replication: 3-way replication  Erasure codes: Reed-Solomon (RS), LRC
     System    Code               Overhead  Year
     GFS       3-way replication  3x        2003
     GFS v2    RS                 1.5x      2012
     FB HDFS   RS                 1.4x      2011
     FB HDFS   LRC                1.66x     2013
     Azure     LRC                1.33x     2012
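The overheads in the table above all follow from one ratio: a code storing k data blocks with m parity blocks costs (k + m)/k. A minimal sketch checking the slide's figures (the code parameters are the ones cited later in the talk):

```python
# Storage overhead of an erasure code with k data blocks and m parity
# blocks: overhead = (k + m) / k.
def overhead(k, m):
    return (k + m) / k

print(overhead(1, 2))    # 3-way replication: 3.0x
print(overhead(6, 3))    # GFS v2 RS(6,3): 1.5x
print(overhead(10, 4))   # FB HDFS RS(10,4): 1.4x
print(overhead(12, 4))   # Azure LRC(12,2,2), 2 local + 2 global parities: 16/12 = 1.33x
```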

  5. Popular Erasure Code Families  Reed-Solomon (RS)  Product Code (PC)  Local Reconstruction Code (LRC)  Others
     [Figure: a PC stripe (data rows a, b with row parities h_a, h_b and column parities P_0..P_4) and an LRC stripe (data a_0..a_11 with local parities L_0..L_5 and global parities G_1, G_2)]

  6. Erasure Code  Facebook HDFS uses RS(10,4)  Computes 4 parities per 10 data blocks  All 14 blocks are stored on different storage nodes  Storage overhead: 14/10 = 1.4x
     [Figure: stripe layout D1..D10, P1..P4]

  7. Erasure Code  High degraded read latency  A read of an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode
     [Figure: a client read hits an exception in HDFS and falls back to degraded read]

  8. Erasure Code  Long reconstruction time  Facebook's cluster:  100K blocks lost per day  50 machine-unavailability events per day  Reconstruction traffic: 180TB per day
     [Figure: background reconstruction job in HDFS]

  9. Erasure Code  Recovery cost: the total number of blocks required to reconstruct a data block after a failure  Recovery cost drives both degraded read latency and reconstruction time

  10. Recovery Cost vs. Storage Overhead  Conclusion  Storage overhead and recovery cost are a tradeoff for any single erasure code
     [Figure: recovery cost vs. storage overhead for GFS 3-way replication, GFS v2 RS, FB HDFS RS, FB HDFS LRC, and Azure LRC]

  11. How to balance recovery cost against storage overhead?

  12. Data Access Skew  Conclusions  Only a small amount of data is "hot": P(freq > 10) ~= 1%  Most data is "cold": P(freq <= 10) ~= 99%

  13. Data Access Skew  Hot data  High access frequency  A small fraction of the data  A small improvement on reads yields a large gain in read performance, so for hot data: decrease the recovery cost  Cold data  Low access frequency  The major fraction of the data  Storing each block a little more compactly saves huge storage space, so for cold data: maximize storage efficiency

  14. HACFS  System State  Tracks file state: file size, last mTime, read count, and coding state  Adaptive Coding  Tracks system state  Chooses the coding scheme based on read count and mTime  Erasure Coding  Provides four coding interfaces: Encode/Decode and Upcode/Downcode

  15. Erasure Coding Algorithms  Two different erasure codes  Fast code:  Encodes the frequently accessed blocks to reduce read latency and reconstruction time  Provides overall low recovery cost  Compact code:  Encodes the less frequently accessed blocks for low storage overhead  Maintains a low and bounded overall storage overhead

  16. State Transition  Recently created files start in 3-way replication  When writes go cold, a file is erasure coded: with the Fast code if COND holds, with the Compact code if COND' holds  Files move between Fast and Compact code as conditions change  COND: read-hot and storage overhead still bounded  COND': read-cold or storage overhead not bounded
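The transition rule above can be sketched as a tiny classifier. This is illustrative only: the function name and the read-count threshold are assumptions, not the HACFS implementation.

```python
READ_HOT_THRESHOLD = 10   # illustrative cutoff; HACFS's actual policy may differ

def choose_code(read_count, overhead_bounded):
    """Pick the coding state for a write-cold file.

    COND  (read-hot and the global storage overhead is still bounded) -> fast
    COND' (read-cold or the overhead bound is exceeded)               -> compact
    """
    if read_count > READ_HOT_THRESHOLD and overhead_bounded:
        return "fast"
    return "compact"

print(choose_code(100, True))    # hot file, bound holds -> fast
print(choose_code(1, True))      # cold file -> compact
print(choose_code(100, False))   # hot file, but bound exceeded -> compact
```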

  17. Fast and Compact Product Codes (1)  Fast code: PC(2x5), storage overhead 18/10 = 1.8x, recovery cost 2  Compact code: PC(6x5), storage overhead 42/30 = 1.4x  Row parities: h_a1 = RS(a0, a1, a2, a3, a4)  Fast-code column parities: Pa_0 = XOR(a0, a5)
     [Figure: Fast PC(2x5) stripe (rows a0..a4 / a5..a9 with h_a1, h_a2, Pa_0..Pa_4, Ph_a) next to Compact PC(6x5) stripe (rows a, b, c with h parities and P_0..P_4, Ph)]

  18. Fast and Compact Product Codes (2)  Compact-code column parities span all three row groups: P_0 = XOR(a0, a5, b0, b5, c0, c5)  Fast code: PC(2x5), storage overhead 1.8x, recovery cost 2  Compact code: PC(6x5), storage overhead 1.4x, recovery cost 5
     [Figure: same Fast PC(2x5) and Compact PC(6x5) stripe layouts]
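The fast code's recovery cost of 2 comes directly from the XOR column parities: a lost block is rebuilt from its column parity plus the one other block in that column. A minimal sketch with toy integer "blocks" (real blocks are byte arrays, and the row parities h are Reed-Solomon, omitted here for brevity):

```python
# Fast Product Code PC(2x5): two data rows a[0..4] and a[5..9];
# column parity Pa_i = XOR(a_i, a_{i+5}), as on the slide.
def xor(*blocks):
    out = 0
    for b in blocks:
        out ^= b
    return out

a = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
Pa = [xor(a[i], a[i + 5]) for i in range(5)]

# Losing a_2 costs only 2 block reads: the column parity Pa_2 and the
# other block in the same column, a_7 -- the recovery cost of 2 above.
recovered = xor(Pa[2], a[7])
print(recovered == a[2])  # True
```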

  19. Fast and Compact LRC (1)  Fast code: LRC(12,6,2)  {G1, G2} = RS(a0, a1, ..., a11)  L_i = XOR(a_i, a_{i+6})  Storage overhead: 20/12 = 1.67x  Recovery cost: 2  Compact code: LRC(12,2,2)  {G1, G2} = RS(a0, a1, ..., a11)  L_0 = RS'(a0, a1, a2, a6, a7, a8) over a local group of six  Storage overhead: 16/12 = 1.33x  Recovery cost: 6
     [Figure: Fast LRC(12,6,2) stripe with six local parities L_0..L_5 next to Compact LRC(12,2,2) stripe with two local parities L_0, L_1]
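The fast LRC's recovery cost of 2 follows the same pattern as the product code: each local parity covers only two data blocks. A sketch of the local groups from the slide, again with toy integers (the global RS parities G1, G2 are omitted):

```python
# Fast LRC(12,6,2) local parities: L_i = XOR(a_i, a_{i+6}).
a = list(range(12))
L = [a[i] ^ a[i + 6] for i in range(6)]

# Recover a lost a_3 from L_3 and a_9 -- two reads, matching the
# recovery cost of 2 quoted above. The compact LRC(12,2,2) instead has
# local groups of six, hence its recovery cost of 6.
recovered = L[3] ^ a[9]
print(recovered == a[3])  # True
```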

  20. Upcoding for Product Codes  Three Fast PC(2x5) groups (a, b, c) merge into one Compact PC(6x5)  Row parities h require no re-computation  Column parities P require no data block transfer: P_i = XOR(Pa_i, Pb_i, Pc_i)  All parity updates can be done in parallel
     [Figure: three Fast PC(2x5) stripes upcoded into one Compact PC(6x5) stripe]
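The reason no data blocks move during upcode is that XOR composes: the compact column parity equals the XOR of the three fast column parities. A sketch verifying this with toy integer blocks (helper names are illustrative):

```python
import functools
import operator

# Column parity of a fast PC(2x5) group: Pa_i = XOR of row 0 and row 1.
def col_parity(top, bottom):
    return [t ^ b for t, b in zip(top, bottom)]

a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
b = [[2, 4, 6, 8, 10], [1, 3, 5, 7, 9]]
c = [[9, 8, 7, 6, 5], [4, 3, 2, 1, 0]]

Pa, Pb, Pc = (col_parity(*g) for g in (a, b, c))

# Upcode: build the compact parities from the existing fast parities only.
P = [pa ^ pb ^ pc for pa, pb, pc in zip(Pa, Pb, Pc)]

# Same result as recomputing from all six data rows -- without reading them.
P_full = [functools.reduce(operator.xor, col) for col in zip(*(a + b + c))]
print(P == P_full)  # True
```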

  21. Downcoding for Product Codes  One Compact PC(6x5) splits back into three Fast PC(2x5) groups  Pa_0 = XOR(a0, a5) and Pb_0 = XOR(b0, b5) are recomputed from data  Pc_0 = XOR(P_0, Pa_0, Pb_0): the third group's column parities need no data reads
     [Figure: one Compact PC(6x5) stripe downcoded into three Fast PC(2x5) stripes]
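The trick for the third group follows from the upcode relation P_i = XOR(Pa_i, Pb_i, Pc_i): XOR-cancelling Pa_i and Pb_i out of P_i leaves Pc_i. A sketch with toy 2x3 groups standing in for 2x5:

```python
# Column parity of one fast group (XOR of its two data rows).
def col_parity(rows):
    return [x ^ y for x, y in zip(*rows)]

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8, 9], [1, 1, 1]]
c = [[2, 2, 2], [3, 3, 3]]

Pa, Pb, Pc_direct = col_parity(a), col_parity(b), col_parity(c)
P = [pa ^ pb ^ pc for pa, pb, pc in zip(Pa, Pb, Pc_direct)]  # compact parity

# Downcode: derive the third group's parities from parities alone,
# with no reads of c's data blocks.
Pc = [p ^ pa ^ pb for p, pa, pb in zip(P, Pa, Pb)]
print(Pc == Pc_direct)  # True
```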

  22. Evaluation  Platform  CPU: Intel Xeon E5645, 24 cores, 2.4GHz  Disks: 6 x 2TB, 7.2K RPM  Memory: 96GB  Network: 1Gbps NIC  Cluster size: 11 nodes  Workloads  CC: Cloudera Customer  FB: Facebook

  23. Evaluation Metrics  Degraded read latency  Latency of foreground read requests  Reconstruction time  Background recovery after failures  Storage overhead

  24. Degraded Read Latency  Production systems: 16-21 seconds  HACFS: 10-14 seconds  The storage overhead of HACFS-LRC and HACFS-PC is bounded to 1.4x and 1.5x, respectively

  25. Reconstruction Time  A disk with 100GB of data failed  HACFS-PC takes about 10-35 minutes less than the production systems  HACFS-LRC is worse than RS(6,3) in GFS v2: to reconstruct a global parity, HACFS-LRC must read 12 blocks, while GFS v2 reads only 6

  26. System Comparison  HACFS-LRC: fast LRC(12,6,2) at 1.67x, compact LRC(12,2,2) at 1.33x  HACFS-PC: fast PC(2x5) at 1.8x, compact PC(6x5) at 1.4x  Colossus FS: RS(6,3) at 1.5x  HDFS-RAID: RS(10,4) at 1.4x  Azure: LRC(12,2,2) at 1.33x

  27. System Comparison  Recovery cost (blocks read) by lost block type:
     Lost block type  HACFS-PC          HACFS-LRC           Colossus FS  HDFS-RAID  Azure
     data block       fast: 2, comp: 5  fast: 2, comp: 6    6            10         6
     global parity    fast: 5, comp: 6  fast: 12, comp: 12  6            10         12


  29. Conclusions  Erasure codes save a large amount of storage space compared to replication  Production systems using a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well  By adapting its coding dynamically, HACFS provides both low recovery cost and low storage overhead

  30. Related Work  f4 (OSDI '14)  Separates cold and hot data by data age  XOR-based erasure codes (FAST '12)  Combine RS with XOR  Minimum-Storage-Regeneration (MSR) codes  Minimize network transfers during reconstruction  Product-Matrix-Reconstruct-By-Transfer (PM-RBT) (FAST '15)  Optimal in terms of I/O, storage, and network bandwidth

  31. Thank You!

  32. Acknowledgments  Prof. Xiong  Zigang Zhang  Biao Ma  CAS ICT Storage System Group
