HASHI: An Application-Specific Instruction Set Extension for Hashing
Oliver Arnold, Sebastian Haas, Gerhard Fettweis, Benjamin Schlegel, Thomas Kissinger, Tomas Karnagel, Wolfgang Lehner
Technische Universität Dresden, Dresden, Germany
Motivation
Today's Database Systems
- Fat cores (area & power)
- Few HW adaptations
- CMOS scaling
Database Processors
- Processors built from scratch
- Long development cycle
- High development costs
Our Approach
- HW/SW codesign
- Customizable processor
- Hashing-specific ISA extensions
- Tool flow -> short HW development cycles
Application Scenario 1: Integer Hash Function
- Bit Extraction: selection of specific bits in a 32-bit key via an arbitrary hash mask
- Sampling: scanning a subset of the data set to choose the most efficient hash mask
[Figure: Bit Extraction. A 32-bit key passes through bit selection (32 bit -> n bit) and a shuffle network and yields an n-bit result.]
[Figure: Sampling. A histogram over bits 0 to 31 is built from the sampled 32-bit keys.]
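The sampling step can be sketched in C as follows. The slide only states that a subset of the data is scanned and a histogram is built; the selection criterion used here (pick the bit positions whose set-count is closest to half of the sample) is an assumption for illustration, and the names `bit_histogram` and `choose_mask` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count, per bit position, how many sampled keys have that bit set. */
static void bit_histogram(const uint32_t *keys, size_t n, size_t hist[32]) {
    for (int b = 0; b < 32; b++) hist[b] = 0;
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 32; b++)
            hist[b] += (keys[i] >> b) & 1u;
}

/* ASSUMED criterion (not from the slides): choose the n_bits positions whose
 * set-count is closest to n/2, i.e. the most balanced bits. Ties go to the
 * lower bit position. */
static uint32_t choose_mask(const uint32_t *keys, size_t n, int n_bits) {
    assert(n_bits >= 0 && n_bits <= 32);
    size_t hist[32];
    bit_histogram(keys, n, hist);

    uint32_t mask = 0;
    for (int picked = 0; picked < n_bits; picked++) {
        int best = -1;
        size_t best_dist = 0;
        for (int b = 0; b < 32; b++) {
            if (mask & (UINT32_C(1) << b)) continue;   /* already chosen */
            size_t twice = 2 * hist[b];                /* compare 2*count with n */
            size_t dist = twice > n ? twice - n : n - twice;
            if (best < 0 || dist < best_dist) {
                best = b;
                best_dist = dist;
            }
        }
        mask |= UINT32_C(1) << best;
    }
    return mask;
}
```

For example, sampling the keys 0 to 7 and asking for three bits yields the mask 0x7, because only bits 0 to 2 vary across the sample.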
Application Scenario 2
- CityHash32
  - Non-cryptographic hash function for strings
  - Returns 32-bit hash value
- Hash Table Operators (Insert, Lookup)
  - Operate on 32-bit keys
  - Apply integer hash function

    unsigned int CityHash32(char *s, int len){
        int hash = comp_1(s+len-20);
        int i = (len-1)/20;
        do {
            hash = comp_2(s, hash);
            s += 20;
        } while(--i != 0);
        return comp_3(hash);
    }
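A minimal sketch of the insert and lookup operators on 32-bit keys, assuming an open-addressing table with linear probing. The slides do not describe the table layout or the collision strategy, so `HashTable`, `ht_init`, `ht_insert`, `ht_lookup`, and the table size are hypothetical choices; only the use of a mask-based integer hash function follows the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 4u                      /* hypothetical: 2^4 = 16 slots */
#define TABLE_SIZE (1u << TABLE_BITS)

typedef struct {
    uint32_t keys[TABLE_SIZE];
    bool     used[TABLE_SIZE];
    uint32_t mask;                         /* hash mask with TABLE_BITS bits set */
} HashTable;

/* Integer hash: gather the key bits selected by mask into the low-order bits. */
static uint32_t extract_bits(uint32_t key, uint32_t mask) {
    uint32_t out = 0, pos = 0;
    for (uint32_t b = 0; b < 32; b++) {
        if (mask & (UINT32_C(1) << b)) {
            out |= ((key >> b) & 1u) << pos;
            pos++;
        }
    }
    return out;
}

static unsigned popcount32(uint32_t x) {
    unsigned c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

static void ht_init(HashTable *t, uint32_t mask) {
    assert(popcount32(mask) == TABLE_BITS);   /* hash must address every slot */
    for (uint32_t i = 0; i < TABLE_SIZE; i++) t->used[i] = false;
    t->mask = mask;
}

/* Returns false only if the table is full. */
static bool ht_insert(HashTable *t, uint32_t key) {
    uint32_t h = extract_bits(key, t->mask);
    for (uint32_t probe = 0; probe < TABLE_SIZE; probe++) {
        uint32_t slot = (h + probe) & (TABLE_SIZE - 1u);
        if (!t->used[slot]) {
            t->keys[slot] = key;
            t->used[slot] = true;
            return true;
        }
        if (t->keys[slot] == key) return true;   /* already present */
    }
    return false;
}

static bool ht_lookup(const HashTable *t, uint32_t key) {
    uint32_t h = extract_bits(key, t->mask);
    for (uint32_t probe = 0; probe < TABLE_SIZE; probe++) {
        uint32_t slot = (h + probe) & (TABLE_SIZE - 1u);
        if (!t->used[slot]) return false;        /* empty slot ends the probe */
        if (t->keys[slot] == key) return true;
    }
    return false;
}
```

With the mask 0xF, the keys 1 and 17 hash to the same slot; linear probing places 17 in the next free slot, and a lookup of 33 stops at the first empty slot.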
Customizable Processor Model
- Basic core: Tensilica LX5
- Instruction set: basic RISC ISA plus hash-specific ISA
- Registers: basic registers plus hash-specific registers and hash-specific states
[Figure: Processor block diagram with instruction fetch, load-store units 0 and 1, local memories (Inst., Data0, Data1), data prefetcher, and interconnection.]
Integer Hash Function: C code (pure C)
    unsigned int hash, shVal, shVal_neg;
    unsigned int mask = 0xFFFFFFFF;
    int i, j;
    for(i=0; i<keySize; i++){
        //load key, bit selection
        hash = key[i] & hashFunc;
        //extract bits
        for(j=30; j>=0; j--){
            if(!(hashFunc & (0x1<<j))){
                //partial shift right
                shVal = hash & (mask<<j);
                shVal_neg = hash & ~(mask<<j);
                hash = (shVal>>1) | shVal_neg;
            }
        }
        //store hash value
        hashValue[i] = hash;
    }
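To make the effect of the inner loop concrete, the same partial-shift logic can be wrapped in a function and checked on small inputs. This is a sketch: the function name `hash_key` and the example values are illustrative and not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Same logic as the pure C loop: mask the key, then for every unselected
 * bit position j (from 30 down to 0) shift all bits at or above j right by
 * one, so the selected bits end up packed into the low-order bits. */
static uint32_t hash_key(uint32_t key, uint32_t hash_func) {
    const uint32_t mask = 0xFFFFFFFFu;
    uint32_t hash = key & hash_func;
    for (int j = 30; j >= 0; j--) {
        if (!(hash_func & (UINT32_C(1) << j))) {
            uint32_t upper = hash & (mask << j);    /* bits at or above j */
            uint32_t lower = hash & ~(mask << j);   /* bits below j */
            hash = (upper >> 1) | lower;
        }
    }
    return hash;
}

/* Worked examples, traced by hand. */
static void hash_key_examples(void) {
    /* Mask 0xF0 selects bits 7..4 of 0xB5 (binary 1011 0101): 1011 = 0xB. */
    assert(hash_key(0xB5u, 0xF0u) == 0xBu);
    /* Mask 0x105 selects bits 8, 2, 0 of 0x104: 1, 1, 0 -> binary 110. */
    assert(hash_key(0x104u, 0x105u) == 0x6u);
    /* Selecting only bit 31 moves it down to bit 0. */
    assert(hash_key(0x80000000u, 0x80000000u) == 0x1u);
}
```

Each unselected position costs one shift step, so a key can take up to 31 iterations in software, which is the work the hash-specific instructions are meant to remove.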
Integer Hash Function: C code with new instructions
The new instructions replace the pure C loop. Each of the three instruction groups in the loop body executes in 1 cycle.
    //init pointer, variables
    init_states(key, hashValue, hashFunc);
    LD_0(); LD_1();
    //load keys, extract bits, store hash values
    for(i=0; i<(keySize/16); i++){
        LD_0(); LD_1(); HOP();      //1 cycle
        LD_0(); LD_1();             //1 cycle
        HOP(); ST_0(); ST_1();      //1 cycle
    }
    HOP();
    ST_0(); ST_1();
Integer Hash Function: ISA Extensions
[Figure: Dataflow of the ISA extensions. LD_0 loads Key_0 to Key_3 from Local Data Memory 0 through Load-Store Unit 0; LD_1 loads Key_4 to Key_7 from Local Data Memory 1 through Load-Store Unit 1. HOP applies the HASH operation with the hash function to the eight keys and produces Result_0 to Result_7, which ST stores.]
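The dataflow can be approximated by a sequential software model in C. This is only a sketch under stated assumptions: the real instructions run in parallel and pipelined, which the model ignores; the names `HashiModel`, `model_ld`, `model_hop`, `model_st`, and `model_hash_keys` are hypothetical; and a single flat key array stands in for the two local data memories.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint32_t *src;     /* next block of 8 keys (assumed flat array) */
    uint32_t *dst;           /* next block of 8 results */
    uint32_t hash_func;      /* hash mask */
    uint32_t key_reg[8];     /* models Key_0 .. Key_7 */
    uint32_t res_reg[8];     /* models Result_0 .. Result_7 */
} HashiModel;

/* Gather the key bits selected by mask into the low-order bits. */
static uint32_t extract_bits(uint32_t key, uint32_t mask) {
    uint32_t out = 0, pos = 0;
    for (uint32_t b = 0; b < 32; b++) {
        if (mask & (UINT32_C(1) << b)) {
            out |= ((key >> b) & 1u) << pos;
            pos++;
        }
    }
    return out;
}

/* Models LD_0 (unit 0, Key_0..3) and LD_1 (unit 1, Key_4..7). */
static void model_ld(HashiModel *m, int unit) {
    for (int i = 0; i < 4; i++)
        m->key_reg[4 * unit + i] = m->src[4 * unit + i];
}

/* Models HOP: one hash operation per key register. */
static void model_hop(HashiModel *m) {
    for (int i = 0; i < 8; i++)
        m->res_reg[i] = extract_bits(m->key_reg[i], m->hash_func);
}

/* Models ST_0 / ST_1 (assumed to store Result_0..3 and Result_4..7). */
static void model_st(HashiModel *m, int unit) {
    for (int i = 0; i < 4; i++)
        m->dst[4 * unit + i] = m->res_reg[4 * unit + i];
}

/* Sequential driver: load, hash, store one block of 8 keys at a time. */
static void model_hash_keys(const uint32_t *keys, uint32_t *out,
                            size_t n, uint32_t mask) {
    assert(n % 8 == 0);                     /* model handles whole blocks only */
    HashiModel m = { keys, out, mask, {0}, {0} };
    for (size_t block = 0; block < n / 8; block++) {
        model_ld(&m, 0); model_ld(&m, 1);
        model_hop(&m);
        model_st(&m, 0); model_st(&m, 1);
        m.src += 8;
        m.dst += 8;
    }
}
```

The model processes eight keys per HOP, matching the eight HASH operations in the figure, but it says nothing about cycle counts.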
Integer Hash Function: Pipeline Snippet
[Figure: Pipeline diagram over cycles n to n+8 showing overlapped LD_0/LD_1, HOP, and ST_0/ST_1 instructions. Latency: 6 cycles.]
Integer Hash Function: Throughput
$$\text{Throughput} = \frac{n_{\text{key}}}{t}$$
  n_key: number of keys
  t: time to perform the operation
[Figure: Throughput chart up to the final processor. Annotated steps: +1 load-store unit (2x), extended ISA (500x), data bus 32 -> 128 bit (2x).]
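As a small worked example of the formula, the following C sketch computes throughput from a key count and a time, and estimates the time from a steady-state loop rate. The rate of 16 keys per 3 cycles follows from the loop with new instructions (two HOPs of 8 keys each, three one-cycle groups per iteration). The clock frequency is a free parameter because the slides do not state it, and prologue, epilogue, and pipeline latency are ignored.

```c
#include <assert.h>

/* Throughput = n_key / t, with t in seconds. */
static double throughput(double n_key, double t_seconds) {
    assert(t_seconds > 0.0);
    return n_key / t_seconds;
}

/* Steady-state time estimate for n_key keys, given the keys handled and the
 * cycles spent per loop iteration. f_clk_hz is a HYPOTHETICAL clock
 * frequency supplied by the caller. */
static double steady_state_time(double n_key, double keys_per_iter,
                                double cycles_per_iter, double f_clk_hz) {
    assert(keys_per_iter > 0.0 && f_clk_hz > 0.0);
    return (n_key / keys_per_iter) * cycles_per_iter / f_clk_hz;
}
```

For instance, `throughput(n, steady_state_time(n, 16.0, 3.0, f))` evaluates to f * 16 / 3 keys per second for any key count n, i.e. about 5.33 keys per clock cycle.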
Results: Throughput
[Figure: Speedup of the final HASHI processor vs. 108Mini. Values shown: 386x, 354x, 2303x, 1288x, 125x.]
Results: Timing and Area
[Figure: Timing and area results of the final processor, including relative area consumption (HASHI).]
Results: Comparison
[Figure: Measures, HASHI vs. Intel. Values shown: 3x/7x lower, 57x/176x lower, 113x/271x lower.]
Conclusion
- Hardware/software codesign approach
- Results
  - High database throughput
  - Highly reduced area and power consumption
  - 170x less energy consumption than a high-end x86 processor (at the same performance)
- Silicon prototype
  - Tape-out April 2014
  - 28 nm LP process: Globalfoundries [1]
  - ISA: hash functions, hash table operators, etc.
[1] Nöthen et al., "A 105GOPS 36mm2 Heterogeneous SDR MPSoC with Energy-Aware Dynamic Scheduling and Iterative Detection-Decoding for 4G in 65nm CMOS," ISSCC, 2014.