Conflict Detection-based Run-Length Encoding – AVX-512 CD Instruction Set in Action Annett Ungethüm, Johannes Pietrzyk, Patrick Damme, Dirk Habich, Wolfgang Lehner HardBD & Active'18 Workshop in Paris, France on April 16, 2018
Challenges for Data Processing Nowadays
Application Side
▪ Multiple application areas (DB, IR, ML, …)
▪ Rapidly growing data volumes to be processed efficiently
System Side
▪ Increasingly fast processors for processing the data
▪ Growing main memory to store the data
Problem
▪ Growing gap between processor speed and main memory bandwidth
▪ Main bottleneck for efficient data processing nowadays
Solution: Lightweight Compression
▪ Access fewer bytes for the same logical information
- Reduced transfer times
- Better cache utilization
- Fewer TLB misses

Lightweight Compression Techniques
Technique = abstract idea of how compression works
▪ RLE (Run Length Encoding): replace run by value & length
▪ DELTA (Differential Coding): replace data elem. by difference to predecessor
▪ FOR (Frame-of-Reference): replace data elem. by difference to reference value
▪ DICT (Dictionary Coding): replace data elem. by 0-based key in dictionary
▪ NS (Null Suppression): eliminate leading zeroes in binary representation
(Figure: small integer sequences illustrating each technique.)
Vectorization is crucial from performance perspective

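The five techniques listed above can be sketched in a few lines each. This is a minimal scalar illustration, not the vectorized code discussed in the talk. The function names are hypothetical, and three choices are assumptions made here: DELTA keeps the first element unchanged, FOR uses the minimum as reference value, and DICT assigns keys in order of first occurrence.

```python
def rle_encode(values):
    """RLE: replace each run by a (value, length) pair."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def delta_encode(values):
    """DELTA: difference to the predecessor (assumption: first element kept as is)."""
    return [v - p for v, p in zip(values, [0] + values[:-1])]


def for_encode(values, reference=None):
    """FOR: difference to a reference value (assumption: the minimum by default)."""
    if reference is None:
        reference = min(values)
    return reference, [v - reference for v in values]


def dict_encode(values):
    """DICT: replace each element by its 0-based key in a dictionary."""
    dictionary, positions, keys = [], {}, []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        keys.append(positions[v])
    return dictionary, keys


def ns_bit_width(values):
    """NS: bits needed per element once leading zeroes are dropped (non-negative ints)."""
    return max(1, max(v.bit_length() for v in values))
```

For example, `rle_encode([1200, 1200, 1200, 1200, 1000])` yields one pair for the run of four 1200s and one pair for the single 1000.
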
Vectorization using SIMD
Single Instruction Multiple Data (SIMD)
▪ Same instruction on multiple data points simultaneously
Development of Intel’s SIMD Extension
▪ Trend to larger vector registers
- 128-bit (SSE)
- 256-bit (AVX and AVX2)
- 512-bit (AVX-512)
▪ Trend to more instructions (counted using specification)

Vectorization and Lightweight Data Compression
Most algorithms have been proposed for 128-bit SIMD registers
▪ Processing 4 elements (32-bit integers) at once
Example: Run-Length Encoding
▪ View subsequent occurrences of the same value as a run
▪ Each run representable by its value and length → just two integers
RLE-SIMD
▪ Uses SIMD instructions to parallelize comparisons

RLE-SIMD: Compression
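The figure for this slide did not survive extraction. As a rough stand-in, the following sketch emulates one common way to vectorize RLE compression in plain Python: compare the current run value against the next `width` elements in one step and advance by the number of leading matches. The name `rle_simd` and the exact loop structure are assumptions made for illustration. A real implementation would use a broadcast, a vector compare, and a movemask instead of the inner Python loop.

```python
def rle_simd(data, width=4):
    """Emulated vectorized RLE: each slice stands for one vector load of `width` elements."""
    runs = []
    n = len(data)
    i = 0
    while i < n:
        value = data[i]          # first element of a new run
        length = 1
        i += 1
        while i < n:
            block = data[i:i + width]      # one (unaligned) vector load
            matched = 0
            for x in block:                # vector compare + count of leading matches
                if x != value:
                    break
                matched += 1
            length += matched
            i += matched
            if matched < len(block):       # run ended inside this vector
                break
        runs.append((value, length))
    return runs
```

Note that when a run ends inside a loaded vector, the remaining elements of that vector are loaded again for the next run.
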
Evaluation using Different Vector Sizes
Compression Speed
▪ Measured in million integers per second (mis)
Speedup
▪ Compared to baseline of 128-bit
(Figure annotations: non-well performing area, well-performing area.)

Non-Well Performing Area
Reasons
▪ For large run lengths, the number of loaded integers approaches 100% of the input, i.e. every value is processed only once.
▪ For sequences with short runs, RLE vectorization uses a significantly higher number of load operations.
▪ The redundant processing increases dramatically with increasing vector width.

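The effect described above can be made concrete with a small simulation. The sketch below counts how many integers an emulated vectorized RLE loop loads, relative to the input size. It assumes the same loop structure as a straightforward compare-and-advance implementation; `loaded_fraction` and `make_runs` are hypothetical helper names, and the numbers say nothing about real hardware timings.

```python
def loaded_fraction(data, width):
    """Loaded integers divided by input size for an emulated vectorized RLE loop."""
    loaded = 0
    n = len(data)
    i = 0
    while i < n:
        value = data[i]
        i += 1
        while i < n:
            block = data[i:i + width]   # one vector load
            loaded += len(block)
            matched = 0
            for x in block:
                if x != value:
                    break
                matched += 1
            i += matched
            if matched < len(block):    # run ended: rest of the vector is reloaded later
                break
    return loaded / n


def make_runs(run_length, n_runs):
    """Test data: n_runs runs of equal length with alternating values 0 and 1."""
    data = []
    for r in range(n_runs):
        data.extend([r % 2] * run_length)
    return data


short_4 = loaded_fraction(make_runs(1, 1000), width=4)     # close to 4x the input
short_16 = loaded_fraction(make_runs(1, 1000), width=16)   # close to 16x the input
long_16 = loaded_fraction(make_runs(1000, 4), width=16)    # close to 1x the input
```

With runs of length 1, almost every load advances by a single element, so the redundancy grows with the vector width; with long runs the fraction stays near 1.
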
SIMD – New Instruction Sets
Conflict Detection using AVX512CD
Read direction: vector position 0 is on the right
Vector position:   …  4   3   2   1   0
Input register:    …  A   A   C   B   A
_mm512_conflict_epi32(...)
Output register:   …  b4  b3  b2  b1  b0
▪ No equal previous elements → bitmasks are zero
▪ b3: A compared with previous elements C, B, A gives ≠, ≠, = → bits 0 0 1, remaining bits filled with 0’s
▪ b4: A compared with previous elements A, C, B, A gives =, ≠, ≠, = → bits 1 0 0 1, remaining bits filled with 0’s

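The semantics shown on this slide can be emulated in scalar code. The sketch below is a plain Python stand-in for `_mm512_conflict_epi32`, not a call to the intrinsic: for each lane it sets bit `j` of the result if the element at the lower position `j` holds the same value. The function name `conflict_epi32` and the integer codes for A, B, and C are chosen here for illustration.

```python
def conflict_epi32(lanes):
    """Scalar emulation: per lane, a bitmask of equal elements at lower positions."""
    masks = []
    for i, x in enumerate(lanes):
        mask = 0
        for j in range(i):
            if lanes[j] == x:
                mask |= 1 << j
        masks.append(mask)
    return masks


# The slide's input, listed from vector position 0 upwards: A, B, C, A, A.
A, B, C = 1, 2, 3
masks = conflict_epi32([A, B, C, A, A])
# masks[3] is 0b001 and masks[4] is 0b1001, matching b3 and b4 on the slide.
```
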
Step 1: Run Detection
Input: A B A A
→ Conflict detection
Resulting bitmasks: …011 …000 …001 …000
→ Count leading zeros
→ Are leading zeros descending?
Two elements are marked "New run".

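One possible reading of the "descending leading zeros" check is sketched below. It assumes that an element at lane `i` continues a run exactly when the highest set bit of its conflict mask sits at position `i - 1`, which is equivalent to a leading-zero count of `32 - i`. It also assumes that lane 0 always starts a run within the vector. Both assumptions are interpretations of the slide, and `lzcnt32` and `new_run_flags` are hypothetical names.

```python
def lzcnt32(x):
    """Leading zeros of a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()


def new_run_flags(conflict_masks):
    """True for every lane that starts a new run, given per-lane conflict masks."""
    flags = []
    for i, mask in enumerate(conflict_masks):
        # Continuing a run means the immediate predecessor is the highest equal element.
        continues = i > 0 and lzcnt32(mask) == 32 - i
        flags.append(not continues)
    return flags


# Bitmasks from the slide, listed from vector position 0 upwards.
flags = new_run_flags([0b000, 0b001, 0b000, 0b011])
```

Under these assumptions, lanes 2 and 3 are flagged as new runs, which matches the two "New run" markers on the slide; lane 0 is flagged only because of the assumption about the start of the vector.
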
Step 2: Run Length Detection
Input: A B A A
Bitmasks: …011 …000 …001 …000 (two elements marked "New run")
Bitmask: 00000000 00000000 00000000 00000001
→ sllv_epi32: 10000000 00000000 00000000 00000000
→ andnot_epi32: 01111111 11111111 11111111 11111111
→ lzcnt_epi32 + 1: 2

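The bit patterns on this slide are consistent with the following per-lane computation, sketched here in scalar Python. The assumption is that the conflict mask is shifted left by its own leading-zero count, inverted, and that the leading zeros of the result count the directly preceding equal elements. Which operands the actual implementation passes to `sllv_epi32` and `andnot_epi32` is not stated on the slide, so treat this as an interpretation. The result is only meaningful for lanes that continue a run or have an empty mask.

```python
MASK32 = 0xFFFFFFFF


def lzcnt32(x):
    """Leading zeros of a 32-bit value (32 for x == 0)."""
    return 32 - x.bit_length()


def run_length_from_mask(mask):
    """Run length ending at a lane, derived from that lane's conflict mask."""
    shift = lzcnt32(mask)
    shifted = (mask << shift) & MASK32   # highest set bit moves to bit 31 (0 stays 0)
    inverted = ~shifted & MASK32         # leading ones become leading zeros
    return lzcnt32(inverted) + 1         # equal predecessors plus the element itself
```

For the mask `0b001` from the slide this reproduces the intermediate values `0x80000000` and `0x7FFFFFFF` and the result 2.
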
Step 3: Storing
Example: A B A X, 1 2
Scatter (RLE512CD)
• Classical storage layout: (run value, run length)-pair
• Independent of vector width
Continuous (RLE512CDAligned)
• Vector wise
• Run length and run value to different memory locations

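The difference between the two layouts can be illustrated with plain lists. This sketch only shows where values and lengths end up; it does not model the scatter or vector store instructions, and the function names are hypothetical. Storing values and lengths in two separate arrays is an assumption about what "different memory locations" means here.

```python
def store_pairs(runs):
    """Classical layout: value and length interleaved as (value, length) pairs."""
    out = []
    for value, length in runs:
        out.extend((value, length))
    return out


def store_separate(runs):
    """Separate layout: all run values in one array, all run lengths in another."""
    values = [value for value, _ in runs]
    lengths = [length for _, length in runs]
    return values, lengths


runs = [(10, 2), (20, 1)]
interleaved = store_pairs(runs)        # value, length, value, length
separate = store_separate(runs)        # (values, lengths)
```
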
Evaluation – Load Instructions
Evaluation – Vector Instructions
Evaluation: Runtime Comparison
▪ Intel Xeon Phi Knights Landing Processor
▪ RLE512CD (Aligned) outperforms state-of-the-art for small average run lengths

Evaluation: Runtime Comparison
▪ Intel Xeon 6130 Processor
▪ Similar results

Summary
Development of Intel’s SIMD Extension
▪ Trend to larger vector registers
- 128-bit (SSE)
- 256-bit (AVX and AVX2)
- 512-bit (AVX-512)
▪ Trend to more instructions
Robustness vs. Maximal Performance
Run Length Encoding
▪ Proposed novel implementation using AVX512-CD functionality