Compressing Intermediate Keys between Mappers and Reducers in SciHadoop
Adam Crume, Joe Buck, Carlos Maltzahn, Scott Brandt
{adamcrume,buck,carlosm,scott}@cs.ucsc.edu
University of California, Santa Cruz
November 12, 2012

Outline
1. Background
2. Semantically-informed byte-level compression
3. User-level semantic compression

MapReduce overview
[Figure: MapReduce diagram with a Scheduler, Input, several Mappers, Reducers, and Output]

Hadoop internal data flow
[Figure: diagram with the stages Mapper, Combiner, Sort, and Reducer, plus two Disk stages and a network transfer]

Array key/value pairs

Array:
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

Key/value pairs:
(0, 0) → 0     (2, 0) → 8
(0, 1) → 1     (2, 1) → 9
(0, 2) → 2     (2, 2) → 10
(0, 3) → 3     (2, 3) → 11
(1, 0) → 4     (3, 0) → 12
(1, 1) → 5     (3, 1) → 13
(1, 2) → 6     (3, 2) → 14
(1, 3) → 7     (3, 3) → 15

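The mapping on this slide can be sketched in a few lines of Python. This is an illustration only, assuming (row, column) keys emitted in row-major order as in the listing above; the name `array_to_pairs` is hypothetical and not part of SciHadoop.

```python
def array_to_pairs(array):
    """Return ((row, col), value) pairs for a 2-D list, in row-major order."""
    return [((r, c), value)
            for r, row in enumerate(array)
            for c, value in enumerate(row)]


grid = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]

pairs = array_to_pairs(grid)
for key, value in pairs:
    print(key, "->", value)
```

Every value carries its own full coordinate, which is why the keys can come to dominate the intermediate data.
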
Linear sequences

00000000  14 04 00 00 00 0d 00 00 00 03 00 00 00 00 00 00
00000010  00 00 00 00 00 01 c2 11 37 34 14 04 00 00 00 0d
00000020  00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01
00000030  9c 65 aa 33 14 04 00 00 00 0d 00 00 00 03 00 00
00000040  00 00 00 00 00 02 00 00 00 01 8d fc 61 b2 14 04
00000050  00 00 00 0d 00 00 00 03 00 00 00 00 00 00 00 03
00000060  00 00 00 01 f9 3c 62 ab 14 04 00 00 00 0d 00 00
00000070  00 03 00 00 00 00 00 00 00 04 00 00 00 01 a4 ba

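To make the linear sequence in this dump visible, the short Python sketch below locates each repetition of the `14 04` prefix and reads one byte at a fixed offset from it. The offset of 17 and the names `DUMP`, `positions`, and `counter` are choices made for this illustration; the slide does not describe the record layout.

```python
# The 128 bytes shown in the hex dump on the slide.
DUMP = bytes.fromhex(
    "14 04 00 00 00 0d 00 00 00 03 00 00 00 00 00 00 "
    "00 00 00 00 00 01 c2 11 37 34 14 04 00 00 00 0d "
    "00 00 00 03 00 00 00 00 00 00 00 01 00 00 00 01 "
    "9c 65 aa 33 14 04 00 00 00 0d 00 00 00 03 00 00 "
    "00 00 00 00 00 02 00 00 00 01 8d fc 61 b2 14 04 "
    "00 00 00 0d 00 00 00 03 00 00 00 00 00 00 00 03 "
    "00 00 00 01 f9 3c 62 ab 14 04 00 00 00 0d 00 00 "
    "00 03 00 00 00 00 00 00 00 04 00 00 00 01 a4 ba"
)

# Offsets at which the two-byte prefix 14 04 occurs.
positions = [i for i in range(len(DUMP) - 1) if DUMP[i:i + 2] == b"\x14\x04"]

# One byte at a fixed distance (17, chosen by inspection) from each prefix.
counter = [DUMP[p + 17] for p in positions]

print(positions)  # the prefix repeats at a constant stride
print(counter)    # the byte at that offset grows by one per repetition
```

The prefix recurs every 26 bytes, and the selected byte increases by one each time, which is the kind of regularity a stride-aware transform can exploit.
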
Sequence detection
Each cell holds δ;r, where δ is the increment and r is the run length.

Stride   Phase 0   Phase 1   Phase 2   Phase 3   Phase 4
1        1;0
2        2;0       3;1
3        -1;2      0;9       1;1
4        0;5       2;4       0;5       1;5
5        -1;0      -2;1      0;1       2;1       3;0

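A table of increments and run lengths per stride and phase could be maintained roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `detect_sequences` is hypothetical, phase is assumed to be the position modulo the stride, and the run length is assumed to start at 0 the first time an increment is seen.

```python
def detect_sequences(data, max_stride):
    """Return {(stride, phase): (delta, run_length)} after scanning `data`.

    For each stride, a value is compared with the value `stride` positions
    earlier. If the difference equals the last difference seen at the same
    phase, the run grows; otherwise a new run starts at length 0.
    """
    table = {}
    for stride in range(1, max_stride + 1):
        for i in range(stride, len(data)):
            phase = i % stride
            delta = data[i] - data[i - stride]
            previous = table.get((stride, phase))
            if previous is not None and previous[0] == delta:
                table[(stride, phase)] = (delta, previous[1] + 1)
            else:
                table[(stride, phase)] = (delta, 0)
    return table


table = detect_sequences([1, 1, 1, 2, 1, 3, 1, 4, 1, 5], max_stride=2)
for (stride, phase), (delta, run) in sorted(table.items()):
    print(f"stride={stride} phase={phase}: {delta};{run}")
```

A long run at some stride and phase signals that the values at those positions form a linear sequence and can be predicted.
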
Predictive coding

Keys: (1,1) (1,2) (1,3) (1,4) (1,5) (2,1) (2,2) (2,3) (2,4) (2,5)

Original:        1  1  1  2  1  3  1  4  1  5  2  1
Predictions:                       1  4  1  5  1  6
Delta (output):  1  1  1  2  1  3  0  0  0  0  1 -5

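The idea on this slide can be sketched as an encoder and decoder pair. The sketch assumes a fixed stride and a simple rule: a prediction is made only when the last three values at that stride are evenly spaced, and otherwise the value is emitted unchanged. The names `predict`, `encode`, and `decode` are hypothetical, and the real predictor may choose strides differently.

```python
def predict(history, stride):
    """Predict the next value, or return None if the last three values
    at this stride are missing or not evenly spaced."""
    n = len(history)
    if n < 3 * stride:
        return None
    a = history[n - 3 * stride]
    b = history[n - 2 * stride]
    c = history[n - stride]
    if c - b != b - a:
        return None
    return c + (c - b)


def encode(values, stride):
    """Emit value minus prediction when a prediction exists, else the value."""
    output = []
    for i, value in enumerate(values):
        guess = predict(values[:i], stride)
        output.append(value if guess is None else value - guess)
    return output


def decode(deltas, stride):
    """Invert encode() by making the same predictions from decoded values."""
    values = []
    for delta in deltas:
        guess = predict(values, stride)
        values.append(delta if guess is None else delta + guess)
    return values


original = [1, 1, 1, 2, 1, 3, 1, 4, 1, 5, 2, 1]  # keys (1,1) to (2,1), flattened
encoded = encode(original, stride=2)
print(encoded)
print(decode(encoded, stride=2) == original)
```

Because the decoder derives each prediction from values it has already reconstructed, no side information is needed, and the runs of zeros are what a general-purpose compressor such as gzip or bzip2 then shrinks.
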
Semantically-informed byte-level compression (results)
[Figure: bar chart, "File size by compression method", y-axis in megabytes. Methods shown: Original, gzip, transform + gzip, bzip2, transform + bzip2. Bar labels, largest to smallest: 100%, 13.6%, 4.27%, 0.28%, 0.0039%]
Tested on grid points from a 100 × 100 × 100 rectangle

Key redundancy
Key/value pairs are independent in MapReduce
[Figure: a Mapper passes the separate pairs 1 → 1, 2 → 4, 3 → 9, 4 → 16 to a Reducer]
Pairs are not independent conceptually
[Figure: the same data between Mapper and Reducer, drawn as a grid of keys (1 2 / 3 4) next to a grid of values (1 4 / 9 16)]

SciHadoop semantic compression

Array:
1 2
3 4

Address per cell:
(0, 0) → 1
(0, 1) → 2
(1, 0) → 3
(1, 1) → 4

vs

Address range per block:
(0, 0) - (1, 1) → { 1, 2, 3, 4 }

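The change from one address per cell to one address range per block might look like the sketch below. It assumes the cells cover a complete rectangle and that values are stored in row-major order; the name `aggregate_block` is hypothetical and not SciHadoop's API.

```python
def aggregate_block(cells):
    """Collapse {coordinate: value} pairs that cover a full rectangle into
    a single ((low corner, high corner), values) pair."""
    keys = sorted(cells)  # sorting coordinate tuples gives row-major order
    dims = range(len(keys[0]))
    low = tuple(min(k[d] for k in keys) for d in dims)
    high = tuple(max(k[d] for k in keys) for d in dims)

    # Refuse to aggregate if some cell inside the bounding box is missing.
    expected = 1
    for lo, hi in zip(low, high):
        expected *= hi - lo + 1
    if expected != len(keys):
        raise ValueError("cells do not fill a complete rectangle")

    return (low, high), [cells[k] for k in keys]


cells = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
block_key, block_values = aggregate_block(cells)
print(block_key, "->", block_values)
```

Four keys become one key with two corners, so the key overhead per value falls as blocks grow.
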
N-dimensional aggregation
Optimal choice is not obvious

Linearizing with a space-filling curve

 1  2  5  6
 3  4  7  8
 9 10 13 14
11 12 15 16

Example of collapsed ranges: 1–5, 7, 9–10, 13

Cells are numbered with a space-filling curve, and contiguous numbers are collapsed into ranges

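The numbering in this grid matches a Z-order (Morton) curve, so the two steps can be sketched as follows. The slide does not name the curve or show an implementation; the functions `z_order_number` and `collapse` are hypothetical names used for illustration.

```python
def z_order_number(row, col, bits=2):
    """1-based position of cell (row, col) along a Z-order curve,
    built by interleaving the bits of col (even) and row (odd)."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index + 1


def collapse(numbers):
    """Collapse integers into inclusive (start, end) ranges."""
    ranges = []
    for n in sorted(set(numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)  # extend the current range
        else:
            ranges.append((n, n))            # start a new range
    return ranges


grid = [[z_order_number(r, c) for c in range(4)] for r in range(4)]
for row in grid:
    print(row)

selected = collapse([1, 2, 3, 4, 5, 7, 9, 10, 13])
print(selected)
```

Linearizing first turns an N-dimensional aggregation problem into a one-dimensional one, where merging neighbors is straightforward.
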
Overlapping keys problem
Ranges are unequal, so reducer won’t reduce

Unavoidable overlap
Alignment is insufficient

Key splitting
Overlapping ranges are split on the overlap boundaries

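Splitting on overlap boundaries can be sketched for one-dimensional, inclusive ranges such as those produced by a space-filling curve. This is an assumed formulation for illustration; `split_ranges` is a hypothetical name, and the slide does not specify the algorithm.

```python
def split_ranges(ranges):
    """Split inclusive (start, end) ranges at every boundary of any range,
    so that two resulting pieces are either identical or disjoint."""
    # A cut before position c separates c - 1 from c.
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    result = []
    for start, end in ranges:
        inner = [c for c in cuts if start < c <= end]
        starts = [start] + inner
        ends = [c - 1 for c in inner] + [end]
        result.append(list(zip(starts, ends)))
    return result


pieces = split_ranges([(1, 5), (3, 8)])
print(pieces)
```

After splitting, the shared part of the two ranges appears under an identical key in both, so the framework can group those pieces at the same reducer call.
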
Effect of key aggregation
[Figure: stacked bar chart of total dataset size (MB), split into values, keys, and file overhead, for the Original and Compressed datasets. Segment labels on the chart: 3.81 MB, 15.26 MB, 3.81 MB, 25.05 KB, 1.91 MB, 5.84 KB]
Data size is reduced by 84.5% for a 100 × 100 × 100 grid of integers

Result
Query: median of a sliding 3 × 3 × 3 window in an 800 × 800 × 800 grid of integers
Cluster: 5 nodes, with 5 reducers and 10 map slots
Intermediate data (“Map output materialized bytes”) was reduced by 60.7%
Intermediate key/value pair count (“Reduce input records”) was reduced by 73.3%
Total runtime was reduced by 28.5%

Conclusion
Compression must be fast to be useful
Semantic compression has an advantage with multiple read/write cycles
Scientific processing in Hadoop is becoming more feasible

Questions?