  1. Hardware Acceleration of Database Operations
     Jared Casper and Kunle Olukotun
     Pervasive Parallelism Laboratory, Stanford University

  2. Database machines
     - Database machines from the late 1970s
       - Put some compute on the disk track/head/unit
       - Processors got faster; I/O performance did not
       - The processor could keep up with the disk: no performance left on the table
     - Today's database machines
       - Made up of general-purpose components
       - Massive amounts of memory
       - Very high-speed interconnect
       - Tables, even whole databases, fit entirely within memory

  3. Database Operation Acceleration
     - Processors cannot keep up with memory
       - Join performance is at 100s of millions of tuples per second
       - With 64-bit tuples, that is 2-3 GB/s; chips can get over 100 GB/s
       - Performance is being left on the table
     - Follow the 10x10 rule: build accelerators
     - Three acceleration blocks: selection, merge join, sort
       - Combine these to do a sort-merge join
     - Goal is to "keep up with memory"

  4. Select
     [diagram: a bitmask selects elements from an input column, packing the
      selected values contiguously into the output]
     - Software implementation uses SIMD
       - Read data into a SIMD register
       - Use a SIMD shuffle operation to move the selected data to one end of the register
       - The mask is used as an index into a table of shuffle values
       - An unaligned write appends the result to the output
     - Limited by SIMD width and the number of SIMD registers
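The mask-indexed shuffle scheme on this slide can be sketched in scalar C++. This is our own illustration, not the talk's code: a precomputed table maps each 4-bit predicate mask to the lane offsets that pack the surviving elements to the front, mirroring the SIMD shuffle-then-append approach (names like `select_by_mask` are ours).

```cpp
#include <array>
#include <cstdint>
#include <vector>

// For each of the 16 possible 4-bit masks, precompute which input lanes
// survive, in order — the scalar analogue of the SIMD shuffle-value table.
inline std::array<std::array<uint8_t, 4>, 16> make_shuffle_table() {
    std::array<std::array<uint8_t, 4>, 16> table{};
    for (int mask = 0; mask < 16; ++mask) {
        int out = 0;
        for (uint8_t lane = 0; lane < 4; ++lane)
            if (mask & (1 << lane)) table[mask][out++] = lane;
    }
    return table;
}

// Apply one predicate mask per 4-element group: "shuffle" the selected
// values to the front of the group, then append them to the output.
std::vector<int64_t> select_by_mask(const std::vector<int64_t>& col,
                                    const std::vector<uint8_t>& masks) {
    static const auto table = make_shuffle_table();
    std::vector<int64_t> out;
    for (std::size_t g = 0; g * 4 < col.size(); ++g) {
        uint8_t m = masks[g];
        int n = 0;                                  // survivors in this group
        for (int b = 0; b < 4; ++b) n += (m >> b) & 1;
        for (int i = 0; i < n; ++i)
            out.push_back(col[g * 4 + table[m][i]]);
    }
    return out;
}
```

A real SIMD version does the same per 16-byte register with a shuffle instruction; the scalar loop only shows the data movement.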

  5. Select
     [figure: select block datapath example]

  6. Merge Join
     - Scan two sorted columns, output matching values
       - Can have associated values or record IDs
       - Output the cross product when a key appears multiple times
     - Generally viewed as the "free" step after sorting
       - More an indication of how slow sorting is
     - Software implementations have bad branching behaviour
       - Limits the IPC, making it hard to keep up with memory
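The merge-join scan described above, including the cross product over runs of equal keys, can be sketched as plain scalar code (our illustration; the branchy inner loop is exactly what limits IPC in software implementations):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Scan two sorted key columns; for each run of equal keys, emit the
// cross product of matching row IDs as (left_row, right_row) pairs.
std::vector<std::pair<std::size_t, std::size_t>>
merge_join(const std::vector<int64_t>& left,
           const std::vector<int64_t>& right) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    std::size_t i = 0, j = 0;
    while (i < left.size() && j < right.size()) {
        if (left[i] < right[j]) ++i;          // advance the smaller side
        else if (left[i] > right[j]) ++j;
        else {
            int64_t key = left[i];            // find both equal-key runs
            std::size_t i_end = i, j_end = j;
            while (i_end < left.size() && left[i_end] == key) ++i_end;
            while (j_end < right.size() && right[j_end] == key) ++j_end;
            for (std::size_t a = i; a < i_end; ++a)      // cross product
                for (std::size_t b = j; b < j_end; ++b)
                    out.emplace_back(a, b);
            i = i_end;
            j = j_end;
        }
    }
    return out;
}
```

Every iteration takes a data-dependent branch, which is the bad branching behaviour the slide refers to; the hardware block avoids it with wide parallel comparisons.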

  7. Merge Join
     - Output is a bitmask of equal keys with the corresponding values
       - Ready for input into the select block

  8. Merge Sort
     Input:     4 8 | 2 1 | 5 5 | 7 0
     1st pass:  4 8 | 1 2 | 5 5 | 0 7
     2nd pass:  1 2 4 8 | 0 5 5 7
     3rd pass:  0 1 2 4 5 5 7 8
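The passes on this slide are a bottom-up merge sort: pass k merges sorted runs of length 2^(k-1) into runs of length 2^k, so n values need ceil(log2 n) passes (in the hardware design, one merge level per pass). A minimal software sketch of the same schedule, in our own code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bottom-up merge sort: each outer iteration is one "pass" from the slide,
// doubling the length of the sorted runs until the whole array is one run.
void bottom_up_merge_sort(std::vector<int64_t>& v) {
    std::vector<int64_t> buf(v.size());
    for (std::size_t run = 1; run < v.size(); run *= 2) {
        for (std::size_t lo = 0; lo < v.size(); lo += 2 * run) {
            std::size_t mid = std::min(lo + run, v.size());
            std::size_t hi  = std::min(lo + 2 * run, v.size());
            // Merge the two adjacent sorted runs [lo,mid) and [mid,hi).
            std::merge(v.begin() + lo, v.begin() + mid,
                       v.begin() + mid, v.begin() + hi,
                       buf.begin() + lo);
        }
        std::copy(buf.begin(), buf.end(), v.begin());
    }
}
```

Running this on the slide's input {4, 8, 2, 1, 5, 5, 7, 0} reproduces the three passes shown, ending at {0, 1, 2, 4, 5, 5, 7, 8}.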

  9. Merge Sort Level
     [figure: one level of the merge sort tree]

  10. High Bandwidth Sort Merge Node
      [figure: high-bandwidth sort merge node architecture]

  11. Sort Merge Join
      - The sort, merge join, and select blocks are combined to perform a full
        sort-merge join in hardware
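The composition of blocks on this slide can be illustrated end to end in software. This is our own fused sketch, not the hardware design: both key columns are sorted, then streamed through a merge-join step (duplicates matched one-to-one here for brevity; the full join emits the cross product, as on slide 6).

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sort both inputs (the sort blocks), then scan them together
// (the merge-join block), emitting matched key pairs.
std::vector<std::pair<int64_t, int64_t>>
sort_merge_join(std::vector<int64_t> l, std::vector<int64_t> r) {
    std::sort(l.begin(), l.end());
    std::sort(r.begin(), r.end());
    std::vector<std::pair<int64_t, int64_t>> out;
    std::size_t i = 0, j = 0;
    while (i < l.size() && j < r.size()) {
        if (l[i] < r[j]) ++i;
        else if (l[i] > r[j]) ++j;
        else { out.emplace_back(l[i], r[j]); ++i; ++j; }
    }
    return out;
}
```

In the hardware version the three blocks run as a pipeline, so the join streams at the rate of the slowest block rather than running the phases back to back.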

  12. Prototyping Platform: Maxeler

  13. Select Throughput
      [plot: throughput (GB/s) and % of line bandwidth vs. selection
       cardinality (%); the memory system is saturated]
      - Software achieved 7 GB/s (33% of line bandwidth)
      - STREAM achieved 12 GB/s (57%)

  14. Select Resources
      [plot: resource counts (ROM bits, 16:1 muxes, 4:1 muxes, registers) vs.
       throughput in bytes/clock, i.e. 24-132 GB/s at 400 MHz]

  15. Merge Join Throughput
      [plot: throughput (GB/s) and % of total line throughput vs. output
       ratio, for m = 1, 2, 3, 8]
      - Resources required are a quadratic function of the desired bandwidth
        - All in comparison logic; routing was the limiting factor
      - Above a 1.5x output ratio, write bandwidth dominates
        - Throughput above is input consumed

  16. Sort Throughput
      [plot: millions of values sorted per second vs. input size (375K to
       50B values), for 2 passes, 3 passes, and 3 passes (projected)]
      - Resources required are a linear function of the desired input size
        - Dominated by the memory required to hold working sets
      - Recent CPU/GPU numbers are ~300M 32-bit values per second

  17. Sort Merge Join
      - Performance limited by the intra-FPGA link
      - Total throughput is 800 million tuples/second (~6.5 GB/s)
      - 8x previous work on software joins

  18. Conclusions
      - FPGAs can be used to saturate memory bandwidth in ways that
        processors cannot
        - Make the most of every byte read
      - In some cases, address bandwidth is just as important as raw data
        bandwidth
      - Scaling a design to high bandwidths can greatly influence the
        architecture
        - Think streaming
      - Next step is to interact with the rest of the system

  19. Questions?
