A Hybrid Implementation of Hamming Weight
Enric Morancho
Computer Architecture Department, Universitat Politècnica de Catalunya, BarcelonaTech, Barcelona, Spain
enricm@ac.upc.edu
22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Torino, Italy, Feb. 12th-14th, 2014
Enric Morancho (UPC), A Hybrid Implementation of Hamming Weight, Feb 2014
Outline
1 Introduction
2 Algorithms for computing hamming weight
3 Evaluation of existing implementations
4 Proposed hybrid implementation
5 Conclusion and future work
Introduction: What is hamming weight?
- The hamming weight of a bitstring is the number of bits set to one in the bitstring
- Hamming weight is also known as population count, sideways addition or bit counting
- Applications: cryptography, chemical informatics, information theory
- Bitstring lengths of up to several thousand bits
Introduction: Algorithms for computing hamming weight
- Several algorithms have been proposed: naïve, memoization, parallel reduction, merged parallel reduction, bitslicing, ...
  - Some algorithms admit both scalar and vector implementations
- However, the existing implementations expose either scalar parallelism or vector parallelism, not both
- This work proposes a hybrid scalar-vector implementation
  - Exposes both kinds of parallelism simultaneously
  - Useful on platforms that can exploit both kinds of parallelism simultaneously
Existing algorithms: Naïve
- Iterates through the bits of the bitstring and accumulates each bit value
- Can be specialized to deal with sparse/dense bitstrings
- Poor performance because it does not exploit parallelism

    #include <stdint.h>

    /* Adds the least significant bit of w to cnt, then shifts w right; 32 iterations. */
    uint8_t hw_naive(uint32_t w) {
        uint8_t i, cnt = 0;
        for (i = 0; i < 32; i++, w = w >> 1)
            cnt += w & 0x1;
        return cnt;
    }
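The sparse specialization mentioned on this slide can be sketched as follows. This is a minimal sketch, not code from the slides: it assumes the common "clear the lowest set bit" idiom, and the name `hw_sparse` is hypothetical. The loop runs once per set bit instead of once per bit position.

```c
#include <stdint.h>

/* Hypothetical name (not from the slides).
 * Counts set bits by clearing the least significant set bit on each
 * iteration, so the loop runs hw(w) times rather than 32 times. */
uint8_t hw_sparse(uint32_t w)
{
    uint8_t cnt = 0;
    while (w != 0) {
        w &= w - 1;   /* clears the least significant set bit */
        cnt++;
    }
    return cnt;
}
```

A dense variant could plausibly apply the same loop to `~w` and subtract the result from 32.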
Existing algorithms: Memoization
- Steps:
  - Defines a subword size (e.g. 8 bits)
  - Precomputes the hamming weight of all possible subwords
  - Looks up the precomputation table for each subword of the bitstring and accumulates the results
- Admits both scalar and vector implementations
- Exposes more parallelism than the naïve implementation

    /* T8[i] holds the hamming weight of the 8-bit value i */
    uint8_t T8[256] = {0, 1, 1, 2, ..., 7, 8};

    /* One table lookup per byte of w; the four lookups are independent */
    uint8_t hw_memoization8(uint32_t w) {
        return T8[w & 0xFF] + T8[(w >> 8) & 0xFF] +
               T8[(w >> 16) & 0xFF] + T8[w >> 24];
    }
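The slide elides most of the table contents. One way to obtain a complete table is to fill it at start-up; the sketch below assumes the recurrence T8[i] = (i & 1) + T8[i >> 1], and the helper name `init_T8` is hypothetical (the slides use a statically initialized table).

```c
#include <stdint.h>

static uint8_t T8[256];   /* hamming weight of every 8-bit subword */

/* Hypothetical helper: fills T8 with T8[i] = (i & 1) + T8[i >> 1].
 * Must be called once before hw_memoization8 is used. */
void init_T8(void)
{
    T8[0] = 0;
    for (int i = 1; i < 256; i++)
        T8[i] = (uint8_t)((i & 1) + T8[i >> 1]);
}

/* Same lookup scheme as on the slide: one table access per byte of w. */
uint8_t hw_memoization8(uint32_t w)
{
    return (uint8_t)(T8[w & 0xFF] + T8[(w >> 8) & 0xFF] +
                     T8[(w >> 16) & 0xFF] + T8[w >> 24]);
}
```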
Existing algorithms: Parallel reduction at bit level
- Tree reduction of the input word in ⌈log2(bits per word)⌉ levels

    Input                          0 1 1 1 0 0 1 0
    Parallel reduction: level 1    01 10 00 01
    Parallel reduction: level 2    0011 0001
    Parallel reduction: level 3    00000100

- Admits both scalar and vector implementations

    uint32_t hw_parallel(uint32_t w) {
        w = (w & 0x55555555) + ((w >>  1) & 0x55555555); /* Level 1 */
        w = (w & 0x33333333) + ((w >>  2) & 0x33333333); /* Level 2 */
        w = (w & 0x0F0F0F0F) + ((w >>  4) & 0x0F0F0F0F); /* Level 3 */
        w = (w & 0x00FF00FF) + ((w >>  8) & 0x00FF00FF); /* Level 4 */
        w = (w & 0x0000FFFF) + ((w >> 16) & 0x0000FFFF); /* Level 5 */
        return w;
    }
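Since the evaluation later in the talk uses 64-bit words, a 64-bit counterpart of `hw_parallel` may be useful for reference. This is a sketch obtained by widening the masks and adding a sixth level (log2(64) = 6); the name `hw_parallel64` is hypothetical and the code is not taken from the slides.

```c
#include <stdint.h>

/* Hypothetical 64-bit variant of hw_parallel: six reduction levels. */
uint64_t hw_parallel64(uint64_t w)
{
    w = (w & 0x5555555555555555ULL) + ((w >>  1) & 0x5555555555555555ULL); /* Level 1 */
    w = (w & 0x3333333333333333ULL) + ((w >>  2) & 0x3333333333333333ULL); /* Level 2 */
    w = (w & 0x0F0F0F0F0F0F0F0FULL) + ((w >>  4) & 0x0F0F0F0F0F0F0F0FULL); /* Level 3 */
    w = (w & 0x00FF00FF00FF00FFULL) + ((w >>  8) & 0x00FF00FF00FF00FFULL); /* Level 4 */
    w = (w & 0x0000FFFF0000FFFFULL) + ((w >> 16) & 0x0000FFFF0000FFFFULL); /* Level 5 */
    w = (w & 0x00000000FFFFFFFFULL) + ((w >> 32) & 0x00000000FFFFFFFFULL); /* Level 6 */
    return w;
}
```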
Existing algorithms: Merged parallel reduction (or tree merging)
- Deals with bitstrings larger than a word
- Merges the intermediate results of several parallel reductions and keeps processing just the combined result
- The degree of merging is limited by the widths of the accumulators
- Admits both scalar and vector implementations
- Example: merged parallel reduction of 3 words (wa, wb, wc)

    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555); /* Level 1 */
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    wa = wa + ( wc       & 0x55555555);  /* fold even bits of wc into wa */
    wb = wb + ((wc >> 1) & 0x55555555);  /* fold odd bits of wc into wb */
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333); /* Level 2 */
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    wa = wa + wb;                        /* merge: only wa is processed from here on */
    wa = (wa & 0x0F0F0F0F) + ((wa >> 4) & 0x0F0F0F0F); /* Level 3 */
    ...
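The three-word example can be completed into a self-contained function. This is a sketch under one assumption: the steps the slide elides after level 3 are taken to be the remaining levels of `hw_parallel` applied to the merged word. The name `hw_merged3` is hypothetical. The comments track the largest value each field can hold, which is what bounds the degree of merging.

```c
#include <stdint.h>

/* Hypothetical name: hamming weight of three 32-bit words computed with
 * one merged parallel reduction. Levels 4 and 5 are an assumed completion
 * of the steps elided on the slide. */
uint32_t hw_merged3(uint32_t wa, uint32_t wb, uint32_t wc)
{
    /* Level 1: 2-bit fields hold 0..2; folding in wc raises the max to 3. */
    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555);
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    wa = wa + (wc & 0x55555555);
    wb = wb + ((wc >> 1) & 0x55555555);

    /* Level 2: 4-bit fields hold at most 6; after merging, at most 12. */
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333);
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    wa = wa + wb;

    /* Levels 3 to 5 process the merged word only. */
    wa = (wa & 0x0F0F0F0F) + ((wa >> 4) & 0x0F0F0F0F);   /* bytes hold at most 24 */
    wa = (wa & 0x00FF00FF) + ((wa >> 8) & 0x00FF00FF);   /* halves hold at most 48 */
    wa = (wa & 0x0000FFFF) + ((wa >> 16) & 0x0000FFFF);  /* total is at most 96 */
    return wa;
}
```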
Existing algorithms: Bitslicing
- Transforms a (2^n - 1)-word bitstring w_0 ... w_{2^n - 2} into n words s_0 ... s_{n-1}, such that the hamming weight of the original bitstring is the weighted sum of the hamming weights of the n words:

    \[ \sum_{i=0}^{2^n-2} hw(w_i) = \sum_{j=0}^{n-1} 2^j \cdot hw(s_j) \]

- The implementation relies on the parallel emulation of bits_per_word bit adders by using bit-wise logical instructions
- Admits both scalar and vector implementations
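For n = 3 (7 input words), the transformation can be sketched with carry-save adders built from bit-wise logic. This is an illustrative sketch, not the author's code: the names `csa`, `bitslice7`, `hw64_naive` and `hw_bitslice7` are hypothetical, and the final per-word counts use a simple loop where any of the earlier algorithms could be substituted.

```c
#include <stdint.h>

/* Carry-save adder: for every bit position independently, adds the three
 * input bits and produces a sum bit (weight 1) and a carry bit (weight 2). */
static void csa(uint64_t a, uint64_t b, uint64_t c,
                uint64_t *sum, uint64_t *carry)
{
    uint64_t t = a ^ b;
    *sum = t ^ c;
    *carry = (a & b) | (t & c);
}

/* Reduces 7 words to 3 words s[0], s[1], s[2] with weights 1, 2 and 4. */
void bitslice7(const uint64_t w[7], uint64_t s[3])
{
    uint64_t sa, ca, sb, cb, cc;
    csa(w[0], w[1], w[2], &sa, &ca);
    csa(w[3], w[4], w[5], &sb, &cb);
    csa(sa, sb, w[6], &s[0], &cc);    /* s[0]: weight-1 slice */
    csa(ca, cb, cc, &s[1], &s[2]);    /* s[1]: weight 2, s[2]: weight 4 */
}

/* Simple per-word count, used only to finish the computation. */
static uint64_t hw64_naive(uint64_t w)
{
    uint64_t cnt = 0;
    for (int i = 0; i < 64; i++, w >>= 1)
        cnt += w & 1;
    return cnt;
}

/* hw(w_0) + ... + hw(w_6) = hw(s_0) + 2*hw(s_1) + 4*hw(s_2) */
uint64_t hw_bitslice7(const uint64_t w[7])
{
    uint64_t s[3];
    bitslice7(w, s);
    return hw64_naive(s[0]) + 2 * hw64_naive(s[1]) + 4 * hw64_naive(s[2]);
}
```

Only three population counts are needed for seven input words; the rest of the work is bit-wise logic.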
Existing algorithms: Processor support
- Some processors offer a machine instruction to compute the hamming weight of a machine word
  - For instance: Mark II (1954), IBM Stretch (1961), CDC 6600 (1964), Cray 1 (1976), Sun SPARCv9 (1995), Alpha 21264A (1999), IBM Power5 (2004) and ARM Cortex-A8 (2005)
- Since 2007, x86 processors supporting SSE4.2 offer the popcnt instruction
  - Computes the hamming weight of a scalar 32-bit or 64-bit register

                                AMD 15h    AMD 15h    Intel Nehalem/Sandy Bridge/Haswell
                                32-bit     64-bit     32/64-bit
    Latency (cycles)            4          6          3
    Dispatch rate (inst/cyc)    1          0.25       1
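From C, the instruction is usually reached through a compiler builtin or intrinsic rather than inline assembly. The sketch below assumes GCC or Clang, whose `__builtin_popcountll` is expected to compile to `popcnt` when the target enables it (for example with `-mpopcnt`) and to a software routine otherwise. The function name `hw_popcnt` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name. Sums the hamming weight of nwords consecutive
 * 64-bit words. __builtin_popcountll is a GCC/Clang builtin (assumed
 * toolchain); it is not part of standard C. */
uint64_t hw_popcnt(const uint64_t *s, size_t nwords)
{
    uint64_t cnt = 0;
    for (size_t i = 0; i < nwords; i++)
        cnt += (uint64_t)__builtin_popcountll(s[i]);
    return cnt;
}
```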
Evaluation of existing implementations: Evaluation environment
- Our benchmark consists of computing the hamming weight of several randomly initialized bitstrings
- Bitstring words are located in consecutive memory locations
- We evaluate two scenarios: uncached and cached
Evaluation of existing implementations: Evaluation environment

                                 Intel Core i5-650            Intel Xeon E5-2630L
    Microarchitecture            Nehalem                      Sandy Bridge
    Frequency (max turbo)        3.2 (3.46) GHz               2 (2.5) GHz
    Cores                        2                            6
    Reorder Buffer entries       128 µ-ops                    168 µ-ops
    Scheduler entries            36 µ-ops                     54 µ-ops
    Peak dispatch rate           6 µ-ops/cycle (both platforms)
    DL1 size and assoc.          32KB, 8-way, 64Byte lines (both platforms)
    DL1 bandwidth                128 bits/cycle               256 bits/cycle
    DL1 in-flight loads          48                           64
    DL1 simult. misses           10 (both platforms)
    L2                           256KB, 8-way, 64Byte lines (both platforms)
    L3                           4MB, 16-way, 64B             15MB, 20-way, 64B
Evaluation of existing implementations: Evaluated implementations

Single-word wide implementations
    Naïve       hw_naive implementation
    Mem-8       Memoization, 2^8-entry lookup table
    Mem-16      Memoization, 2^16-entry lookup table
    Par.Red.    Parallel reduction at bit level over 64-bit words
    SSE4.2      Uses 64-bit scalar instruction popcnt

Multi-word wide implementations
    Merged      Scalar merged par.red. on 30 64-bit words at level 3
    Merged-V    Vector merged par.red. on 30 128-bit words at level 3 (SSE2)
    Slice       Scalar bit slicing on 7 64-bit words
    Slice-V     Vector bit slicing on 7 128-bit words (SSE2)
    Mem-4       Vector memoization, 2^4-entry lookup table (SSSE3)
Evaluation of existing implementations: Results on Nehalem platform, single-word wide/cached [figure not available]
Evaluation of existing implementations: Results on Nehalem platform, cached [figure not available]
Evaluation of existing implementations: Results on Sandy Bridge platform, cached [figure not available]
Evaluation of existing implementations: Results
- SSE4.2 performs best
- Multi-word wide implementations outperform single-word implementations (except SSE4.2)
- Vector implementations outperform scalar implementations of the same algorithm
Evaluation of existing implementations: Conclusions
- Although the scalar SSE4.2 implementation performs best...
  - The dispatch rate of the popcnt instruction is just 1 inst./cycle, that is, SSE4.2's peak performance is 8 bytes/cycle
  - But DL1 bandwidth is 16 bytes/cycle (Nehalem) and 32 bytes/cycle (Sandy Bridge)
  - The SSE4.2 implementation is fully scalar and cannot exploit the unused dispatch ports to dispatch vector instructions
- We ask whether the SSE4.2 implementation can be outperformed by a hybrid implementation that makes use of both vector and scalar instructions
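The bandwidth argument can be written out as a short calculation. It uses only figures quoted in the talk (one 64-bit popcnt dispatched per cycle; DL1 bandwidths of 128 and 256 bits/cycle); the symbols P and B are introduced here for illustration only.

```latex
\begin{align*}
P_{\text{popcnt}} &= 1\ \tfrac{\text{inst}}{\text{cycle}} \times 8\ \tfrac{\text{bytes}}{\text{inst}}
                   = 8\ \text{bytes/cycle} \\
B_{\text{Nehalem}} &= \tfrac{128\ \text{bits/cycle}}{8\ \text{bits/byte}} = 16\ \text{bytes/cycle}
  &&\Rightarrow\ P_{\text{popcnt}} / B_{\text{Nehalem}} = 1/2 \\
B_{\text{Sandy Bridge}} &= \tfrac{256\ \text{bits/cycle}}{8\ \text{bits/byte}} = 32\ \text{bytes/cycle}
  &&\Rightarrow\ P_{\text{popcnt}} / B_{\text{Sandy Bridge}} = 1/4
\end{align*}
```

Under these figures, a purely scalar popcnt loop can consume at most half of the DL1 bandwidth on Nehalem and a quarter on Sandy Bridge, which is the headroom a hybrid implementation would target.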