Efficient Lightweight Compression Alongside Fast Scans
Orestis Polychroniou, Kenneth A. Ross
DaMoN 2015, Melbourne, Victoria, Australia
Databases & Compression
❖ Process data on disk
  ❖ Nearly unlimited capacity
  ❖ Affects query optimization
    ❖ Minimize # of blocks fetched
    ❖ Minimize # of random block accesses
  ❖ Compress to improve disk speed
    ❖ Focused on compression rate since disks are “slow”
[Figure: bar chart of read throughput (GB/s), axis 0 to 0.6, for HDD and SSD devices]
Databases & Compression
❖ Process data on RAM
  ❖ Always limited capacity
  ❖ Affects query optimization & query execution
    ❖ Minimize # of accesses (e.g. column stores & late materialization)
    ❖ Minimize # of random (out of CPU cache) accesses (e.g. partitioned join)
  ❖ Compress to improve RAM speed & avoid disk
    ❖ Focused on (de)compression efficiency as RAM is “fast”
[Figure: bar chart of read throughput (GB/s), axis 0 to 60, for 2-channel DDR3, 4-channel DDR3, and 4-channel DDR4]
Lightweight Compression
❖ Compression schemes
  ❖ Entropy compression
    ❖ Group nearby similar values
    ❖ e.g. run-length-encoding, frame-of-reference
[Figure: frame-of-reference example]
original values: 15 17 21 14 19 14 20 17 (8 * 32 = 256 bits)
min = 14, max = 21, b = log2(max - min + 1) = 3 bits per code
compressed: 14 21 | 1 3 7 0 5 0 6 3 (2 * 32 + 8 * b = 88 bits)
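The arithmetic in the frame-of-reference figure can be checked with a short Python sketch. The function names `for_encode` and `for_decode` are hypothetical, and the 2 * 32-bit header holding min and max is read off the slide's bit count rather than a stated storage format.

```python
def for_encode(values):
    """Frame-of-reference: keep min/max once, then store small offsets from min."""
    lo, hi = min(values), max(values)
    b = (hi - lo).bit_length()          # bits per code = ceil(log2(max - min + 1))
    offsets = [v - lo for v in values]  # every offset fits in b bits
    return lo, hi, b, offsets

def for_decode(lo, offsets):
    """Invert the encoding by adding the reference value back."""
    return [lo + o for o in offsets]

values = [15, 17, 21, 14, 19, 14, 20, 17]
lo, hi, b, offsets = for_encode(values)
original_bits = len(values) * 32              # 8 values stored as 32-bit integers
compressed_bits = 2 * 32 + len(values) * b    # assumed header (min, max) plus codes
print(offsets, b, original_bits, compressed_bits)
```

For the slide's eight values this gives 3 bits per code and 88 bits in total, against 256 bits uncompressed.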
Lightweight Compression
❖ Compression schemes
  ❖ Symbol compression
    ❖ Assign a symbol to each distinct value
    ❖ e.g. dictionary compression
[Figure: dictionary compression example]
original data (n * W bits): A C A B A D C B
dictionary with D distinct values (b = log2 D): A = 0, B = 1, C = 2, D = 3
compressed data (n * b bits): 0 2 0 1 0 3 2 1
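A minimal Python sketch of the dictionary example follows. The name `dict_encode` is hypothetical, and building the dictionary in sorted order is an assumption chosen because it reproduces the codes drawn on the slide.

```python
def dict_encode(column):
    """Map each distinct value to a small integer code."""
    dictionary = sorted(set(column))                  # assumed: sorted dictionary
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in column]
    b = (len(dictionary) - 1).bit_length()            # b = ceil(log2 D) for D >= 2
    return dictionary, codes, b

def dict_decode(dictionary, codes):
    """Look every code up in the dictionary to recover the original column."""
    return [dictionary[code] for code in codes]

column = list("ACABADCB")
dictionary, codes, b = dict_encode(column)
print(dictionary, codes, b)
```

With four distinct values each code needs 2 bits, so the eight codes take n * b = 16 bits plus the dictionary itself.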
Lightweight Compression
❖ Compression schemes
  ❖ Frequency (symbol) compression
    ❖ Compress frequent symbols with fewer bits
    ❖ e.g. Huffman coding (slow), multiple dictionaries (fast)
Lightweight Compression
❖ DBMS integration
  ❖ Decompress during execution
    ❖ In CPU cache (non-integrated) or in registers (integrated)
Lightweight Compression
❖ DBMS integration
  ❖ Process compressed data without decompressing
Bit Packing
❖ Definition
  ❖ Input code width is hardware-supported
    ❖ 8-bit, 16-bit, 32-bit, 64-bit
  ❖ Output code width b must be (almost) constant
    ❖ Either constant across the entire input
    ❖ Or constant for the next group of items (e.g. frame-of-reference)
[Figure: original data A C A B A D C B -> dictionary mapping -> codes 0 2 0 1 0 3 2 1 (not materialized) -> bit packing]
Bit Packing
❖ Layouts
  ❖ Horizontal bit packing
    ❖ Bits per code are contiguous
[Figure: horizontal packing of eight 5-bit codes]
unpacked (8 bits each): 00010 000 10100 000 11000 000 11111 000 01010 000 11001 000 10001 000 00100 000
packed: 00010101 00110001 11110101 01100110 00100100
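The horizontal layout in this figure can be reproduced with a small scalar packer. This is a sketch under stated assumptions: the bit strings are read with the least significant bit on the left, 8-bit words are used only so the output lines up with the figure, and `pack_horizontal` and `lsb_first` are hypothetical helper names.

```python
def lsb_first(word, width):
    """Render a word with its least significant bit on the left."""
    return format(word, f"0{width}b")[::-1]

def pack_horizontal(codes, b, w=8):
    """Pack b-bit codes back to back into w-bit words (assumes b <= w)."""
    word_mask = (1 << w) - 1
    words, word, filled = [], 0, 0
    for c in codes:                       # one unpacked code per iteration
        word |= (c << filled) & word_mask # low part of the code that still fits
        filled += b
        if filled >= w:                   # branch: the current word is full
            words.append(word)
            filled -= w
            word = c >> (b - filled)      # leftover high bits of a spanning code
    if filled:
        words.append(word)                # flush a partially filled last word
    return words

# The eight 5-bit codes of the figure, converted from their LSB-first strings.
codes = [8, 5, 3, 31, 10, 19, 17, 4]
packed = pack_horizontal(codes, b=5)
print([lsb_first(x, 8) for x in packed])
```

The loop stores a word only when it fills up, and a code that straddles the boundary contributes its remaining bits to the next word.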
Bit Packing
❖ Layouts
  ❖ Vertical bit packing
    ❖ Bits of codes are interleaved
[Figure: vertical packing of eight 5-bit codes]
unpacked (8 bits each): 00010 000 10100 000 11000 000 11111 000 01010 000 11001 000 10001 000 00100 000
b = 5, k = 4: 0111 0011 0101 1001 0001 0110 1100 0001 1000 0110
Bit Packing
❖ Layouts
  ❖ Vertical bit packing
[Figure: vertical packing of eight 5-bit codes with a wider interleaving factor]
unpacked (8 bits each): 00010 000 10100 000 11000 000 11111 000 01010 000 11001 000 10001 000 00100 000
b = 5, k = 8: 01110110 00111100 01010001 10011000 00010110
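The vertical layout can be sketched the same way. The sketch below assumes the figures are read LSB first and that k is the number of codes interleaved per group; `pack_vertical` and `lsb_first` are hypothetical names. Output word j of a group collects bit j of each of its k codes.

```python
def lsb_first(word, width):
    """Render a word with its least significant bit on the left."""
    return format(word, f"0{width}b")[::-1]

def pack_vertical(codes, b, k):
    """Interleave groups of k codes: b output words of k bits per group."""
    words = []
    for start in range(0, len(codes), k):
        group = codes[start:start + k]
        for j in range(b):                       # one output word per bit position
            word = 0
            for i, c in enumerate(group):
                word |= ((c >> j) & 1) << i      # bit j of code i lands in bit i
            words.append(word)
    return words

# Same eight 5-bit codes as the horizontal example.
codes = [8, 5, 3, 31, 10, 19, 17, 4]
k4 = [lsb_first(x, 4) for x in pack_vertical(codes, b=5, k=4)]
k8 = [lsb_first(x, 8) for x in pack_vertical(codes, b=5, k=8)]
print(k4)
print(k8)
```

Both settings use the same 40 bits as the horizontal layout, only arranged so that one word holds the same bit position of k different codes.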
Outline
❖ Operations
  ❖ Packing
  ❖ Unpacking
  ❖ Scanning
Outline
❖ Horizontal layouts
  ❖ Fully packed
    ❖ Fast unpacking & scanning
  ❖ Word aligned
    ❖ Faster scanning
Outline
❖ Vertical layout
  ❖ Known traits
    ❖ Fastest scanning
  ❖ New traits
    ❖ Fast packing & unpacking
Horizontal Layout
❖ Fully packed
  ❖ No space wasted
  ❖ Codes can span across 2 packed words
Horizontal Layout
❖ Fully packed
  ❖ Packing
    ❖ Process 1 unpacked code per iteration
    ❖ Branch to store output packed word
  ❖ Unpacking
    ❖ Process 1 output code per iteration
    ❖ Branch to load input packed word
[Figure: pack and unpack throughput (GB/s, axis 0 to 6) vs. number of bits (1 to 31)]
Horizontal Layout
❖ Fully packed
  ❖ Unpacking
    ❖ Can be written in SIMD !
[Figure: SIMD unpacking of 5-bit codes, bits drawn from LSB to MSB]
packed input:   00010101 00110001 11110101 01100110
8-bit -> 4-bit: 0001 0101 0011 0001 1111 0101 0110 0110
shuffle:        0001 0101 | 0101 0011 | 0011 0001 | 0001 1111
4-bit -> 8-bit: 00010101 01010011 00110001 00011111
shift (<<):     00010101 1010011 0 110001 00 11111 000
mask (&):       00010 000 10100 000 11000 000 11111 000
Based on paper by T. Willhalm et al. @ VLDB 2009 (& improved using latest SIMD ISA)
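The shuffle, shift, and mask steps in this figure can be imitated lane by lane in plain Python. This is a scalar illustration, not the SIMD kernel from the talk: `unpack_horizontal` is a hypothetical name, the 2-byte window limits it to b <= 9, and because the integers here are ordinary little-endian values the shift goes right, whereas the figure draws it as `<<` on LSB-first bit strings.

```python
def unpack_horizontal(packed, b, n):
    """Extract n b-bit codes from a fully packed byte string (assumes b <= 9)."""
    mask = (1 << b) - 1
    out = []
    for i in range(n):                     # each i plays the role of one SIMD lane
        byte, shift = divmod(i * b, 8)
        # "shuffle": gather the (at most two) bytes that hold code i
        window = int.from_bytes(packed[byte:byte + 2], "little")
        # "shift" the code down to bit 0, then "mask" away its neighbours
        out.append((window >> shift) & mask)
    return out

# Packed bytes of the eight 5-bit codes from the bit packing figures.
packed = bytes([168, 140, 175, 102, 36])
codes = unpack_horizontal(packed, b=5, n=8)
print(codes)
```

Every lane performs the same three steps with per-lane constants (byte offsets, shift amounts, one shared mask), which is what makes the loop body a candidate for SIMD.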
Horizontal Layout
❖ Fully packed
  ❖ Unpacking
    ❖ Can be written in SIMD !
[Figure: unpacking throughput (GB/s, axis 0 to 60) vs. number of bits (1 to 31), scalar vs. SIMD; annotation: up to 7X improvement from SIMD]
Horizontal Layout
❖ Fully packed
  ❖ Scanning
    ❖ Unpack the codes in CPU registers
    ❖ Evaluate selective predicates and append to bitmap
    ❖ Must unpack first thus bounded by O(n)
Horizontal Layout
❖ Fully packed
  ❖ Scanning
    ❖ Can also be written in SIMD via SIMD unpacking
[Figure: scanning for "select … where column < C …", bits drawn from LSB to MSB]
packed input:   00010101 00110001 11110101 01100110
unpacked:       00010 000 10100 000 11000 000 11111 000
compare with C: 01100 000 01100 000 01100 000 01100 000
result:         0000000 0 1111111 1 1111111 1 0000000 0
extract:        0110
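A scalar sketch of the scan for `column < C` follows: each code is unpacked, compared with the constant, and the result bit is appended to a bitmap. The name `scan_less_than` is hypothetical, the bit strings in the figure are assumed to be LSB first (so C = 01100 is the value 6), and result bit i corresponds to code i.

```python
def scan_less_than(packed, b, n, c):
    """Return a bitmap whose bit i is set when code i is smaller than c."""
    mask = (1 << b) - 1
    bits = int.from_bytes(packed, "little")   # whole input as one big integer
    bitmap = 0
    for i in range(n):                        # every code is visited: O(n)
        code = (bits >> (i * b)) & mask       # unpack code i
        bitmap |= int(code < c) << i          # evaluate predicate, append result bit
    return bitmap

# Packed bytes of the eight 5-bit codes 8, 5, 3, 31, 10, 19, 17, 4.
packed = bytes([168, 140, 175, 102, 36])
bitmap4 = scan_less_than(packed, b=5, n=4, c=6)   # the four codes shown in the figure
bitmap8 = scan_less_than(packed, b=5, n=8, c=6)   # all eight codes
print(format(bitmap4, "04b")[::-1], format(bitmap8, "08b")[::-1])
```

For the four codes in the figure only the second and third are below 6, which gives the extracted bitmap 0110.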
Horizontal Layout
❖ Fully packed
  ❖ Scanning
    ❖ Can also be written in SIMD via SIMD unpacking
[Figure: throughput (GB/s, axis 0 to 60) vs. number of bits (1 to 31) for pack (scalar), unpack (SIMD), and scan (SIMD) with predicate C1 <= column <= C2; annotation: slower than unpacking]
Horizontal Layout
❖ Word aligned
  ❖ Waste space to get alignment
  ❖ Pack b’ = w / (b+1) codes per processor word
  ❖ Extra bit per code used for scanning
[Figure: 2-bit codes, drawn in 8-bit words]
fully packed: 01 10 11 00
word aligned: 01 0 10 0 00 | 11 0 00 0 00
(one unused extra bit per code; unused high order bits per word)
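The word-aligned figure can be reproduced with the sketch below. It assumes LSB-first bit strings, 8-bit words as drawn, and integer division for w / (b+1); `pack_word_aligned` and `lsb_first` are hypothetical names. The extra bit per code is left at zero here, and how the scan uses it is not covered by this sketch.

```python
def lsb_first(word, width):
    """Render a word with its least significant bit on the left."""
    return format(word, f"0{width}b")[::-1]

def pack_word_aligned(codes, b, w=8):
    """Place w // (b + 1) codes per word; no code ever spans two words."""
    per_word = w // (b + 1)
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for i, c in enumerate(codes[start:start + per_word]):
            word |= c << (i * (b + 1))    # b code bits, then one spare bit
        words.append(word)
    return words

# The 2-bit codes 01 10 11 00 of the figure, read LSB first.
codes = [2, 1, 3, 0]
words = pack_word_aligned(codes, b=2)
print([lsb_first(x, 8) for x in words])
```

The four codes occupy two 8-bit words here instead of the single word a fully packed layout would need, which is the space traded for alignment.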