A Tale of Two Erasure Codes in HDFS
Mingyuan Xia* , Mohit Saxena+, Mario Blaum+, and David A. Pease+
*McGill University, +IBM Research Almaden
FAST '15
Presenter: 何军权, 2015-04-30
Outline
- Introduction & Motivation
- Design
- Evaluation
- Conclusions
- Related Work
Introduction & Motivation
Big Data Storage: Reliability and Availability
- Replication: 3-way replication
- Erasure codes: Reed-Solomon (RS), LRC
Timeline of production deployments:
- GFS: 3-way replication, 3x, 2003
- FB HDFS: RS, 1.4x, 2011
- GFS v2: RS, 1.5x, 2012
- Azure: LRC, 1.33x, 2012
- FB HDFS: LRC, 1.66x, 2013
Popular Erasure Code Families
- Product Code (PC)
- Reed-Solomon (RS)
- Local Reconstruction Code (LRC)
- Others
[Figure: block layouts of a Product Code (data rows a_0..a_4 and b_0..b_4 with row parities h_a, h_b, local parities L_0..L_5, and column parities P_0..P_4) and an LRC (data blocks a_0..a_11 with global parities G_1, G_2)]
Erasure Code: Facebook HDFS RS(10,4)
- Computes 4 parity blocks per 10 data blocks (D1..D10, P1..P4)
- All 14 blocks are stored on different storage nodes
- Storage overhead: 1.4x (see the sketch below)
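To make the overhead arithmetic concrete, here is a minimal Python sketch (mine, not from the paper) computing the storage overhead of a code with k data blocks and m parity blocks:

```python
def storage_overhead(k: int, m: int) -> float:
    """Blocks stored per data block for k data + m parity blocks."""
    return (k + m) / k

# Facebook HDFS RS(10,4): 4 parities per 10 data blocks
print(storage_overhead(10, 4))       # 1.4
# Colossus / GFS v2 RS(6,3)
print(storage_overhead(6, 3))        # 1.5
# Azure LRC(12,2,2): 12 data + 2 global + 2 local parities
print(storage_overhead(12, 2 + 2))   # ~1.33
```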
Erasure Code: High Degraded Read Latency
- A read to an unavailable block raises a read exception at the client
- The client must then perform multiple disk reads and network transfers, and spend compute cycles to decode the block
Erasure Code: Long Reconstruction Time
Facebook's cluster:
- 100K blocks lost per day
- 50 machine-unavailability events per day
- Reconstruction traffic: 180TB per day
Erasure Code: Recovery Cost
- Recovery cost: the total number of blocks required to reconstruct a data block after a failure
- Recovery cost drives both degraded read latency and reconstruction time
Recovery Cost vs. Storage Overhead
Conclusion: storage overhead and recovery cost are a tradeoff within any single erasure code.
[Figure: recovery cost vs. storage overhead for GFS 3-way replication, FB HDFS RS, GFS v2 RS, FB HDFS LRC, and Azure LRC]
How to balance recovery cost against storage overhead?
Data Access Skew
Conclusions from production traces:
- Only a small fraction of data is "hot": P(freq > 10) ≈ 1%
- Most data is "cold": P(freq <= 10) ≈ 99%
Data Access Skew
Hot data:
- High access frequency, but only a small fraction of all data
- A small improvement to the read path yields a large read-performance gain
- Goal for hot data: decrease the recovery cost
Cold data:
- Low access frequency, but the major fraction of all data
- Storing slightly fewer blocks saves a huge amount of space
- Goal for cold data: high storage efficiency
HACFS Design
- State: tracks per-file state (file size, last modification time, read count, and coding state)
- Adaptive Coding: tracks system state and chooses a coding scheme based on read count and mtime
- Erasure Coding: provides four coding interfaces: Encode/Decode and Upcode/Downcode (sketched below)
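A hypothetical sketch of those four interfaces in Python; the class and method names are my own, not from the HACFS implementation:

```python
from abc import ABC, abstractmethod
from typing import List, Optional

Block = bytes

class AdaptiveErasureCode(ABC):
    """Hypothetical shape of HACFS's four coding interfaces."""

    @abstractmethod
    def encode(self, data: List[Block]) -> List[Block]:
        """Compute the parity blocks for a stripe of data blocks."""

    @abstractmethod
    def decode(self, stripe: List[Optional[Block]]) -> List[Block]:
        """Reconstruct the missing (None) blocks of a stripe."""

    @abstractmethod
    def upcode(self, fast_stripes: List[List[Block]]) -> List[Block]:
        """Convert fast-code stripes into one compact-code stripe (data turned cold)."""

    @abstractmethod
    def downcode(self, compact_stripe: List[Block]) -> List[List[Block]]:
        """Split a compact-code stripe back into fast-code stripes (data turned hot)."""
```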
Erasure Coding Algorithms
HACFS uses two different erasure codes:
- Fast code: encodes frequently accessed blocks to reduce read latency and reconstruction time; provides overall low recovery cost
- Compact code: encodes less frequently accessed blocks for low storage overhead; keeps the overall storage overhead low and bounded
State Transition
- Recently created files are written with 3-way replication
- Once write-cold, a file is erasure-coded:
  - COND (read-hot and storage bound not exceeded): Fast Code
  - COND' (read-cold or storage bound exceeded): Compact Code
- Files later move between Fast and Compact Code as COND/COND' change (see the sketch below)
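A minimal sketch of that policy, assuming a read-count threshold and a storage bound (both constants are my assumptions; COND and COND' follow the slide):

```python
READ_HOT_THRESHOLD = 10   # assumed; the trace slide used freq > 10
STORAGE_BOUND = 1.5       # assumed bound on system storage overhead

def next_state(state: str, write_cold: bool, read_count: int,
               system_overhead: float) -> str:
    """Return the next coding state of a file under the HACFS-style policy."""
    cond = read_count > READ_HOT_THRESHOLD and system_overhead <= STORAGE_BOUND
    if state == "replicated":
        if not write_cold:
            return "replicated"        # recently created files stay 3-way replicated
        return "fast" if cond else "compact"
    if state == "compact" and cond:
        return "fast"                  # downcode: compact -> fast
    if state == "fast" and not cond:
        return "compact"               # upcode: fast -> compact
    return state
```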
Fast and Compact Product Codes
Fast code: PC(2x5)
- Row parities: h_a1 = RS(a_0, a_1, a_2, a_3, a_4)
- Column parities: Pa_0 = XOR(a_0, a_5)
- Storage overhead: 1.8x; recovery cost: 2
Compact code: PC(6x5)
- Row parities as above; column parities span all six rows: P_0 = XOR(a_0, a_5, b_0, b_5, c_0, c_5)
- Storage overhead: 1.4x; recovery cost: 5
(see the toy recovery sketch below)
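A runnable toy of the fast code's cheap column recovery; the Reed-Solomon row parities are omitted, only the XOR column parity is shown:

```python
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))

# One PC(2x5) stripe: data rows a0..a4 and a5..a9 as toy 4-byte blocks.
a = [bytes([i] * 4) for i in range(10)]
col_parity = [xor(a[i], a[i + 5]) for i in range(5)]   # Pa_i = a_i ^ a_{i+5}

# Lose a0: rebuild it from a5 and Pa_0 alone -- recovery cost 2.
assert xor(a[5], col_parity[0]) == a[0]
```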
Fast and Compact LRC
Fast code: LRC(12,6,2)
- Global parities: {G_1, G_2} = RS(a_0, a_1, ..., a_11)
- Local parities: L_i = XOR(a_i, a_{i+6}), i = 0..5
- Storage overhead: 20/12 ≈ 1.67x; recovery cost: 2
Compact code: LRC(12,2,2)
- Global parities: {G_1, G_2} = RS(a_0, a_1, ..., a_11)
- Local parities over groups of six, e.g. L_0 = RS'(a_0, a_1, a_2, a_6, a_7, a_8)
- Storage overhead: 16/12 ≈ 1.33x; recovery cost: 6
(a toy sketch follows)
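The same idea for LRC as a toy sketch; plain XOR stands in for the coefficient-based local parities (the slide's RS'), and the global parities are omitted:

```python
def xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))

a = [bytes([i] * 4) for i in range(12)]
L = [xor(a[i], a[i + 6]) for i in range(6)]   # fast LRC(12,6,2): L_i = a_i ^ a_{i+6}

assert xor(a[9], L[3]) == a[3]                # repair a3 from 2 blocks

# Compact LRC(12,2,2): merging L0..L2 yields one parity over the group
# {a0, a1, a2, a6, a7, a8}; repairing a0 then needs 6 reads.
L0c = xor(xor(L[0], L[1]), L[2])
rebuilt = L0c
for b in (a[1], a[2], a[6], a[7], a[8]):
    rebuilt = xor(rebuilt, b)
assert rebuilt == a[0]
```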
Upcoding for Product Codes: PC(2x5) → PC(6x5)
Three fast-code stripes are merged into one compact-code stripe:
- Row parities h require no recomputation
- Column parities P require no data-block transfer: P_0 = XOR(Pa_0, Pb_0, Pc_0)
- All parity updates can be done in parallel (see the sketch below)
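A sketch of the column-parity identity behind this, with hypothetical names: the compact parity is just the XOR of the three stored fast parities, so no data blocks move over the network:

```python
def xorn(*vs: bytes) -> bytes:
    """XOR an arbitrary number of equal-length byte blocks."""
    out = bytes(len(vs[0]))
    for v in vs:
        out = bytes(p ^ q for p, q in zip(out, v))
    return out

def upcode_column(pa0: bytes, pb0: bytes, pc0: bytes) -> bytes:
    # P0 = a0^a5 ^ b0^b5 ^ c0^c5 = Pa0 ^ Pb0 ^ Pc0
    return xorn(pa0, pb0, pc0)
```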
Downcoding for Product Codes: PC(6x5) → PC(2x5)
One compact-code stripe is split into three fast-code stripes:
- Row parities h carry over unchanged
- New column parities for the first two row groups are recomputed from data, e.g. Pa_0 = XOR(a_0, a_5)
- The last group's column parities fall out by XOR: Pc_0 = XOR(P_0, Pa_0, Pb_0)
(a sketch follows)
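And the reverse direction as a sketch: two groups' parities are recomputed from data, while the third is derived from the old compact parity by XOR, saving reads of the c-row blocks:

```python
def xorn(*vs: bytes) -> bytes:
    """XOR an arbitrary number of equal-length byte blocks."""
    out = bytes(len(vs[0]))
    for v in vs:
        out = bytes(p ^ q for p, q in zip(out, v))
    return out

def downcode_column(p0: bytes, a0: bytes, a5: bytes,
                    b0: bytes, b5: bytes):
    pa0 = xorn(a0, a5)           # recomputed from data blocks
    pb0 = xorn(b0, b5)           # recomputed from data blocks
    pc0 = xorn(p0, pa0, pb0)     # derived: Pc0 = P0 ^ Pa0 ^ Pb0
    return pa0, pb0, pc0
```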
Evaluation
Platform:
- CPU: Intel Xeon E5645, 24 cores, 2.4GHz
- Disk: 6 x 2TB, 7.2K RPM
- Memory: 96GB
- Network: 1Gbps NIC
- Cluster size: 11 nodes
Workloads:
- CC: Cloudera Customer
- FB: Facebook
Evaluation Metrics
- Degraded read latency: latency of foreground read requests
- Reconstruction time: background recovery after failures
- Storage overhead
Degraded Read Latency
- Production systems: 16-21 seconds
- HACFS: 10-14 seconds
- Storage overhead bounded to 1.4x for HACFS-LRC and 1.5x for HACFS-PC
Reconstruction Time
Scenario: a disk holding 100GB of data fails.
- HACFS-PC takes about 10-35 minutes less than the production systems
- HACFS-LRC is worse than RS(6,3) in GFS v2: to reconstruct global parities, HACFS-LRC needs to read 12 blocks, while GFS v2 reads only 6
System Comparison
- HACFS-PC: fast PC(2x5), 1.8x; compact PC(6x5), 1.4x
- HACFS-LRC: fast LRC(12,6,2), 1.67x; compact LRC(12,2,2), 1.33x
- Colossus FS: RS(6,3), 1.5x
- HDFS-RAID: RS(10,4), 1.4x
- Azure: LRC(12,2,2), 1.33x
System Comparison: Recovery Cost
Recovery cost (number of blocks read) by type of lost block:

lost block type | HACFS-PC          | HACFS-LRC          | Colossus FS | HDFS-RAID | Azure
data block      | fast: 2, comp: 5  | fast: 2, comp: 6   | 6           | 10        | 6
global parity   | fast: 5, comp: 6  | fast: 12, comp: 12 | 6           | 10        | 12
Conclusions
- Erasure codes save a large amount of storage space compared to replication.
- Production systems that use a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well.
- HACFS, by adapting dynamically between two codes, provides both low recovery cost and low, bounded storage overhead.
Related Work
- f4 (OSDI '14): separates cold and hot data by data age
- XOR-based erasure codes (FAST '12): combine RS with XOR
- Minimum-Storage-Regenerating (MSR) codes: minimize network transfer during reconstruction
- Product-Matrix Reconstruct-By-Transfer (PM-RBT, FAST '15): optimal in I/O, storage, and network bandwidth
Thank You!
Acknowledgment
Prof. Xiong Zigang, Zhang Biao, Ma
CAS – ICT – Storage System Group