Overview on Hardware Optimizations for Database Engines Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner BTW 2017, Stuttgart, Germany, 2017-03-09
Interaction DB-Engine and Hardware
Applications/Database Engines
Well-Known Challenge: Exploit hardware technology by specific data management techniques (indexing, data storage, query & transaction processing)
Modern Hardware
[Figure: two plots over the years 1970 to 2020. Main Memory: memory (KByte), log scale. CPU: #cores.]
Era of Dark Silicon
MOORE'S LAW
• Number of transistors in a dense integrated circuit doubles approximately every two years.
DARK SILICON
• We can no longer power the transistors that Moore is giving us
[Figure: #transistors (x1000) and process (nm), log scale, 1970 to 2020. Source: http://engineering.nyu.edu/garg/node/31]
HW/SW Co-Design for DB-Engines
Applications/Database Engines
Challenge: HW/SW Co-Design for Database Engines
Specialization of Hardware to overcome Dark Silicon
Modern Hardware
Outline
• HARDWARE FOUNDATION
• INTELLIGENT DMA CONTROLLER
• EXTENSIONS FOR PROCESSING ELEMENTS
Hardware Foundation
TOMAHAWK PLATFORM
Hardware Foundation – Zoom In
Hardware Foundation – Zoom In (2)
CORE MANAGER (CM) [Control-Plane]
• Extended Xtensa-LX5 from Tensilica (now Cadence)
• 32KB for code
• 64KB for data
PROCESSING ELEMENTS (PE)
• Xtensa-LX5 from Tensilica (now Cadence)
• 32KB for code
• 2x32KB for data on PE
APPLICATION CORE (APP) [Control-Plane]
• 570T core from Tensilica (now Cadence)
Outline
PART 1: EXTENSIONS OF PROCESSING ELEMENTS
Development Flow
DEVELOPMENT OF INSTRUCTION SET EXTENSIONS WITH TENSILICA TOOLS
• Tensilica Instruction Extension (TIE) language
• C/TIE compiler
• Cycle accurate simulator/debugger
• Processor generator
Plain C:
    int res = (v0 + v1 + v2) >> shift8;
With instruction set extension:
    // shift8 -> internal state
    int res = add3_shift(v0, v1, v2);
SYNTHESIS OF RTL CODE
• Synopsys Design Compiler, PrimeTime PX
• TSMC CMOS LP 65nm libraries
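The effect of the add3_shift example can be approximated in plain C. The following is only a software model under stated assumptions: the static variable standing in for the internal state and the setter `set_shift8` are hypothetical names, and on the actual processor the state and the instruction would be defined in TIE rather than in C.

```c
#include <assert.h>

/* Hypothetical stand-in for the "shift8" internal state of the extension. */
static unsigned shift8 = 0;

/* Hypothetical setter: loads the shift amount into the modelled state once. */
static void set_shift8(unsigned s) {
    assert(s < 32);              /* keep the shift within the width of int */
    shift8 = s;
}

/* Software model of the fused instruction: add three values, then shift
   right by the amount held in the state. Operands are assumed to be
   non-negative so that the right shift is well defined in portable C. */
static int add3_shift(int v0, int v1, int v2) {
    return (v0 + v1 + v2) >> shift8;
}
```

In this model the shift amount is written once and then reused by every call, which mirrors the idea of moving an operand into an internal state so that the instruction itself only needs three register operands.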
Investigated Database Primitives
Bitmap Compression and Processing (AND, OR, XOR)
• WAH
• PLWAH
• COMPAX
Hashing
• Hash + Lookup
• Hash + Insert
• Hash Keys
• Hash Sampling
• CityHash32
Sorted Set Operations
• Merge Sort
• Intersection
• Union
• Difference
• Sort-Merge Join
• Sort-Merge Aggregation (SUM)
General Approach for all Extensions
[Figure: block diagram of the Extended Tensilica LX5 Processor]
• Instruction fetch from Local Instruction Memory (64 bit)
• Data Prefetcher
• Instruction Set: Basic RISC Instruction Set, Application-Specific Instruction Set
• Register Files: Basic Registers, Application-Specific Registers
• Application-Specific States
• Load-Store Unit 0 with Local Data Memory 0 (128 bit)
• Load-Store Unit 1 with Local Data Memory 1 (128 bit)
• Interconnect
Bitmap Primitives
BITMAPS ARE A SPECIAL KIND OF INDEX
• bit length equals number of tuples
Table T with bitmap index on X:
    OID  X  |  =0  =1  =2  =3
    1    0  |   1   0   0   0
    2    1  |   0   1   0   0
    3    3  |   0   0   0   1
    4    2  |   0   0   1   0
    5    3  |   0   0   0   1
    6    3  |   0   0   0   1
    7    1  |   0   1   0   0
    8    3  |   0   0   0   1
            |  b1  b2  b3  b4
Query: select * from T where X < 2 (answered by a bit-wise OR of b1 and b2)
BITMAP COMPRESSION: WORD-ALIGNED HYBRID (WAH) CODE
• Stateless compression
• Run-length-encoding (RLE)
  - run of 0's and 1's
• WAH bitmaps contain
  - RLE compressed fills and
  - uncompressed literals
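The query on table T can be traced with a small uncompressed model. This is a minimal sketch, assuming at most 32 tuples packed into one machine word with bit 0 standing for OID 1; the function names `build_bitmap` and `select_less_than` are hypothetical, and a real index would keep the bitmaps precomputed instead of rebuilding them per query.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds the bitmap for one value of X: bit i is set if x[i] == value.
   Bit 0 corresponds to the first tuple (OID 1). */
static uint32_t build_bitmap(const uint8_t *x, size_t n, uint8_t value) {
    assert(n <= 32);             /* this sketch packs one bitmap per word */
    uint32_t bits = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] == value) {
            bits |= (uint32_t)1 << i;
        }
    }
    return bits;
}

/* Evaluates "X < bound" by OR-ing the bitmaps of all values below bound. */
static uint32_t select_less_than(const uint8_t *x, size_t n, uint8_t bound) {
    uint32_t result = 0;
    for (uint8_t v = 0; v < bound; v++) {
        result |= build_bitmap(x, n, v);
    }
    return result;
}
```

For the column X = {0, 1, 3, 2, 3, 3, 1, 3} of table T, `select_less_than(x, 8, 2)` sets bits 0, 1 and 6, which are the tuples with OID 1, 2 and 7.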
Bit-Wise OR on Compressed Bitmaps
Logical operations (AND, OR, XOR) on two compressed bitmaps:
1) Load WAH word(s)
2) Calculate output (Fill-Fill, Literal-Fill, Literal-Literal)
3) Combine output
Example (32 bit words, in hex), bit-wise OR of b1 and b2:
    b1 uncompressed: 40000380 00000000 00000000 001FFFFF ...
    b1 WAH:          40000380 (Literal)  80000002 (0 fill, 10<runlength>)  001FFFFF (Literal)
    b2 WAH:          C0000002 (1 fill, 11<runlength>)  7C0001E0 (Literal)  3FE00000 (Literal)
    b2 uncompressed: 7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000 ...
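The word layout used in the example (most significant bit 1 for a fill, the next bit for the fill value, the low 30 bits for the run length, and 31 payload bits in a literal) can be checked with a small decoder. This is a sketch under those assumptions; `wah_decode` and the macro names are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAH_FILL_FLAG 0x80000000u  /* MSB set: word is a fill */
#define WAH_FILL_BIT  0x40000000u  /* second bit: value of the filled bits */
#define WAH_RUN_MASK  0x3FFFFFFFu  /* low 30 bits: run length in 31-bit groups */
#define WAH_ALL_ONES  0x7FFFFFFFu  /* a group with all 31 payload bits set */

/* Expands WAH words into uncompressed 31-bit groups (one group per output
   word). Returns the number of groups written, or 0 if out is too small. */
static size_t wah_decode(const uint32_t *in, size_t n,
                         uint32_t *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] & WAH_FILL_FLAG) {
            uint32_t run = in[i] & WAH_RUN_MASK;
            uint32_t group = (in[i] & WAH_FILL_BIT) ? WAH_ALL_ONES : 0u;
            for (uint32_t r = 0; r < run; r++) {
                if (k == cap) return 0;
                out[k++] = group;
            }
        } else {
            if (k == cap) return 0;
            out[k++] = in[i];    /* literal: copied unchanged */
        }
    }
    return k;
}
```

Decoding the three WAH words of b1 yields the four uncompressed words shown for b1, and likewise for b2.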
C-Code
    while (Xidx != Xsize && Yidx != Ysize) {
        // new X or Y? Calculate new fill count
        …
        // Fill-Fill
        if (XisFill == 1 && YisFill == 1) { // 2 fills
            if (XfillWords < YfillWords)
                min = XfillWords;
            else
                min = YfillWords;
            writeFill(comprResultBI, &Zidx, X[Xidx] | Y[Yidx], min);
            XfillWords -= min;
            YfillWords -= min;
        }
        // Literal-Fill
        else if ((XisFill == 1 && YisFill == 0) || (XisFill == 0 && YisFill == 1)) {
            if (XisFill == 1) {
                XfillWords--;
                if ((X[Xidx] & 0xC0000000) == 0xC0000000)
                    writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
                else { comprResultBI[Zidx] = Y[Yidx]; Zidx++; }
            }
            if (YisFill == 1) {
                YfillWords--;
                if ((Y[Yidx] & 0xC0000000) == 0xC0000000)
                    writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
                else { comprResultBI[Zidx] = X[Xidx]; Zidx++; }
            }
        }
        // Literal-Literal
        else {
            result = X[Xidx] | Y[Yidx];
            if ((result & 0x7FFFFFFF) == 0x7FFFFFFF)
                writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
            else if ((result & 0x7FFFFFFF) == 0)
                writeFill(comprResultBI, &Zidx, 0x80000000, 1);
            else { comprResultBI[Zidx] = X[Xidx] | Y[Yidx]; Zidx++; }
        }
    }
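Because the slide code elides the bookkeeping for fetching new words, a self-contained variant may help to follow the data flow. The sketch below is not the code from the slide: it is a simplified model that walks both inputs one 31-bit group at a time instead of handling Fill-Fill runs in one step, and all names (`cursor_t`, `next_group`, `emit_group`, `wah_or`) are hypothetical. It assumes the same word layout as the example: MSB 1 marks a fill, the next bit is the fill value, and the low 30 bits hold the run length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FILL_FLAG 0x80000000u  /* MSB set: word is a fill */
#define FILL_TYPE 0xC0000000u  /* top two bits: 10 = 0 fill, 11 = 1 fill */
#define RUN_MASK  0x3FFFFFFFu  /* low 30 bits: run length in 31-bit groups */
#define ALL_ONES  0x7FFFFFFFu  /* group with all 31 payload bits set */

/* Cursor over a compressed bitmap that yields one 31-bit group at a time. */
typedef struct {
    const uint32_t *words;
    size_t n;       /* number of compressed words */
    size_t pos;     /* index of the next word to load */
    uint32_t cur;   /* word currently being consumed */
    uint32_t left;  /* groups still to be delivered from cur */
} cursor_t;

/* Stores the next group in *g and returns 1, or returns 0 at the end. */
static int next_group(cursor_t *c, uint32_t *g) {
    while (c->left == 0) {
        if (c->pos == c->n) return 0;
        c->cur = c->words[c->pos++];
        c->left = (c->cur & FILL_FLAG) ? (c->cur & RUN_MASK) : 1u;
    }
    c->left--;
    if (c->cur & FILL_FLAG)
        *g = ((c->cur & FILL_TYPE) == FILL_TYPE) ? ALL_ONES : 0u;
    else
        *g = c->cur;
    return 1;
}

/* Appends one group: homogeneous groups become fills and extend a
   preceding fill of the same kind, everything else is a literal. */
static void emit_group(uint32_t *out, size_t *k, uint32_t g) {
    if (g == 0u || g == ALL_ONES) {
        uint32_t type = g ? FILL_TYPE : FILL_FLAG;
        if (*k > 0 && (out[*k - 1] & FILL_TYPE) == type &&
            (out[*k - 1] & RUN_MASK) < RUN_MASK)
            out[*k - 1]++;               /* increase the fill counter */
        else
            out[(*k)++] = type | 1u;     /* start a new fill of length 1 */
    } else {
        out[(*k)++] = g;
    }
}

/* ORs two compressed bitmaps and returns the number of words written.
   out must have room for one word per group of the shorter input. */
static size_t wah_or(const uint32_t *x, size_t nx,
                     const uint32_t *y, size_t ny, uint32_t *out) {
    assert(x != NULL && y != NULL && out != NULL);
    cursor_t cx = { x, nx, 0, 0, 0 };
    cursor_t cy = { y, ny, 0, 0, 0 };
    uint32_t gx, gy;
    size_t k = 0;
    while (next_group(&cx, &gx) && next_group(&cy, &gy))
        emit_group(out, &k, gx | gy);
    return k;
}
```

Applied to the example bitmaps b1 and b2, this model produces the three words C0000002, 7C0001E0 and 3FFFFFFF. Processing group by group keeps the sketch short; the case distinction in the slide code exists to skip whole runs at once.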
Processing with PE Extension
[Figure: processing pipeline with the stages Initial Load, Load, Preprocessing, Operation, Postprocessing, Prepare Store and Store; Memory 0 and Memory 1 on the load and store side, application specific states in between]
• ldXstream(), ldYstream(): load input words from memory, aligned to 128-bit lines
• 4 x WAHinst():
  - Is word fill or literal? -> fill -> overwrite input words
  - Perform operation (OR)
  - Proceed to next word
  - Buffer result (4x)
• Write to output stream -> append or overwrite previous word with increased fill counter
[Figure: example input words 40000380, 80000002, 001FFFFF and C0000002, 7C0001E0, 3FE00000 with their bit patterns]
Bit-Wise OR on Compressed Bitmaps
Code with Extension
    do {
        ldXstream();
        ldYstream();
        WAHinst();
        WAHinst();
        WAHinst();
    } while (WAHinst());
Example (32 bit words, in hex), bit-wise OR of b1 and b2:
    b1 uncompressed: 40000380 00000000 00000000 001FFFFF
    b1 WAH:          40000380 (Literal)  80000002 (0 fill)  001FFFFF (Literal)
    b2 WAH:          C0000002 (1 fill)  7C0001E0 (Literal)  3FE00000 (Literal)
    b2 uncompressed: 7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000
Many More Extensions
Extensions:
• Bitmap Compression and Processing (AND, OR, XOR): WAH, PLWAH, COMPAX
• Hashing: Hash + Lookup, Hash + Insert, Hash Keys, Hash Sampling, CityHash32
• Sorted Set Operations: Merge Sort, Intersection, Union, Difference, Sort-Merge Join, Sort-Merge Aggregation (SUM)
Processors: BitiX, HASHI, Titan3D, Tomahawk DBA
[Table: support matrix, processors (rows) by extensions (columns), with an X for each supported extension]
Evaluation
REFERENCE PROCESSORS
• Tomahawk DBA Processor --> Set of different DB-Extensions for WAH-Compression, Hashing, and Sorted-Set Operations
Tomahawk without DBA
  Description: Basic Xtensa LX5 without instruction set extensions, 1 LSU, 32-bit memory interface
  Technology: 28 nm, A_total: 15.92 mm², f_MAX: 0.555 GHz, P_MAX @ f_MAX: 0.7 W
Tomahawk with DBA
  Description: Set of different DB-Extensions for WAH-Compression, Hashing and Sorted-Set Operations
  Technology: 28 nm, A_total: 18 mm², f_MAX: 0.5 GHz, P_MAX @ f_MAX: 0.753 W
Intel i7-6500U (comparison)
  Description: Low-power Intel 2-core processor based on Skylake architecture, 4MB L3 cache
  Technology: 14 nm, A_total: 99* mm², f_MAX: 3.1 GHz, P_MAX @ f_MAX: 25 W
Evaluation - Bitmaps
Outline
PART 2: INTELLIGENT DMA CONTROLLER
Problem Statement
[Figure: three T2 RISC cores with local memory, the APP (Tensilica 570T, with cache) and the CM (LX4-ISA_E, with local memory), connected through the NoC to the memory controller (Synopsys DWC DDR2) and the Micron DDR2 SDRAM. Latencies along the path: t_APP, t_AN, t_NA, t_NMc, t_McN, t_McM, t_MMc. Example memory contents at addresses 0 to 7: 0xCCA, 0x00B, 0x0FA, 0x1FD, 0xDE1, 0x0ED, 0x00E, 0xD0A]
Problem: Many round-trips for key lookups
Approach: "Teach B-trees to the memory controller"
Intelligent Main Memory Controller (iDMA)
[Figure: same system as before, with a Pointer Chaser placed between the NoC and the memory controller (Synopsys DWC DDR2, Micron DDR2 SDRAM). Latencies along the path: t_NC, t_CN, t_NP, t_PN, t_PMc, t_McP, t_McM, t_MMc. Example memory contents at addresses 0 to 7: 0xCC6, 0x000, 0x0F0, 0x1FD, 0xDE1, 0x0ED, 0x00E, 0xD0A]
Vision (and first simulations)
• Intelligent memory controller
• Is aware of the semantics of memory layout
• Implements core operations (e.g. lookup)
Implementation (not yet in silicon)
• 0.183 mm² PE with 200 MHz
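The round-trip argument can be illustrated with a toy cost model. This is a sketch under loud assumptions: it uses a binary search tree instead of a B-tree to stay short, it counts round trips instead of modelling the latencies from the figures, and every name in it (`node_t`, `memory_t`, `lookup_from_core`, `lookup_in_controller`) is hypothetical rather than part of the iDMA design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One tree node in main memory; child links are array indices, -1 = none. */
typedef struct { uint32_t key, value; int32_t left, right; } node_t;

/* Toy model of main memory that counts core <-> memory round trips. */
typedef struct { const node_t *mem; unsigned round_trips; } memory_t;

/* Follows the pointers from root to key; counts the visited nodes. */
static int walk(const node_t *mem, int32_t root, uint32_t key,
                uint32_t *value, unsigned *visited) {
    *visited = 0;
    for (int32_t a = root; a >= 0; ) {
        const node_t *n = &mem[a];
        (*visited)++;
        if (key == n->key) { *value = n->value; return 1; }
        a = (key < n->key) ? n->left : n->right;
    }
    return 0;
}

/* Lookup driven by the core: every visited node is fetched over the NoC,
   so each hop of the pointer chase costs one round trip. */
static int lookup_from_core(memory_t *m, int32_t root, uint32_t key,
                            uint32_t *value) {
    assert(m != NULL && value != NULL);
    unsigned visited;
    int found = walk(m->mem, root, key, value, &visited);
    m->round_trips += visited;
    return found;
}

/* Lookup delegated to the controller: the pointer chase runs next to the
   memory, so the core pays for a single request and response. */
static int lookup_in_controller(memory_t *m, int32_t root, uint32_t key,
                                uint32_t *value) {
    assert(m != NULL && value != NULL);
    unsigned visited;
    int found = walk(m->mem, root, key, value, &visited);
    m->round_trips += 1;
    return found;
}
```

For a key stored three levels deep, the core-driven lookup counts three round trips and the delegated lookup counts one. The model ignores the time the controller itself spends on the chase.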
First iDMA Design
Evaluation using Simulator
Summary
• HARDWARE FOUNDATION
• INTELLIGENT DMA CONTROLLER
• EXTENSIONS FOR PROCESSING ELEMENTS