Efficient Join Processing across Heterogeneous Processors
Henning Funke, Sebastian Breß, Stefan Noll, Jens Teubner
December 15, 2015
1 / 14
GPUs IN DATABASES ARE LIKE A MUSCLE CAR IN A TRAFFIC JAM
Bottleneck, really?
3 / 14
GPU – Hash Join Algorithm
Cuckoo Hashing Implementation¹
→ Strict limit on the number of probes per query key
◮ Pipeline the join probe and result compaction in shared memory (probe sketched below)

Performance: Build (GTX970)
[Figure: build throughput (GB/s, 0–10 scale) over fill factors 0.5–0.9, with PCIe bandwidth shown for reference]

¹ Based on: Alcantara, Dan Anthony Feliciano. Efficient Hash Tables on the GPU. University of California at Davis, 2011.
4 / 14
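A minimal CUDA sketch of this probe scheme, assuming a cuckoo table with MAX_PROBES hash functions; the table layout, hash constants, and the block-local result buffer are illustrative and not taken from the authors' implementation:

#include <cstdint>

#define MAX_PROBES 3            // strict limit on probes per query key
#define EMPTY_KEY  0xFFFFFFFFu  // sentinel for empty slots

// Illustrative hash functions; cuckoo hashing needs independent hashes.
__device__ uint32_t hash_i(uint32_t key, int i, uint32_t table_size) {
    const uint32_t a[MAX_PROBES] = { 0x9E3779B1u, 0x85EBCA77u, 0xC2B2AE3Du };
    return (a[i] * key) % table_size;
}

// One thread probes one outer tuple; block-local matches are compacted in
// shared memory before being flushed to the global result buffer.
__global__ void cuckoo_probe(const uint32_t *probe_keys, int n,
                             const uint32_t *ht_keys, const uint32_t *ht_vals,
                             uint32_t table_size,
                             uint2 *result, unsigned int *result_count) {
    __shared__ uint2 buf[256];            // assumes blockDim.x <= 256
    __shared__ unsigned int buf_count, base;
    if (threadIdx.x == 0) buf_count = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        uint32_t key = probe_keys[tid];
        for (int i = 0; i < MAX_PROBES; ++i) {      // bounded number of probes
            uint32_t slot = hash_i(key, i, table_size);
            if (ht_keys[slot] == key) {             // match: stage it locally
                buf[atomicAdd(&buf_count, 1u)] = make_uint2(key, ht_vals[slot]);
                break;
            }
            if (ht_keys[slot] == EMPTY_KEY) break;  // key cannot be in the table
        }
    }
    __syncthreads();

    // Reserve a contiguous range in the global result and flush the buffer.
    if (threadIdx.x == 0) base = atomicAdd(result_count, buf_count);
    __syncthreads();
    for (unsigned int i = threadIdx.x; i < buf_count; i += blockDim.x)
        result[base + i] = buf[i];
}

The bounded loop is what keeps probe cost per key predictable; the shared-memory staging is what the slide calls pipelining the probe with result compaction.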
Performance: GPU Join Probe (GTX970)
[Figure: probe throughput (GB/s, 0–20 scale) over build sizes from 10⁰ to 10⁵ KB, with PCIe bandwidth shown for reference]
5 / 14
Performance: GPU Join Probe (GTX970)
[Figure: same plot, annotated: build sizes above 100 K elements correspond to real-world data]
6 / 14
Joins on Multiple Heterogeneous Processors
Challenges
◮ Scalability to large data
◮ Communication
◮ Local and remote resources
Related work
Figure: P. Frey, R. Goncalves, M. L. Kersten, and J. Teubner. Spinning Relations: High-Speed Networks for Distributed Join Processing. In DaMoN, 2009.
7 / 14
Star Join on Heterogeneous Processors
Processing Strategy
→ Allocate the smaller (dimension) tables on all devices
→ Asynchronous data transfers at computation speed (see the sketch below)
→ Merge results into contiguous arrays
8 / 14
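A hedged sketch of the asynchronous transfer step, assuming the replicated smaller tables already reside on the device and the large probe relation is streamed in chunks over PCIe with two CUDA streams so copies overlap kernel execution; probe_chunk, the chunk size, and the double-buffering scheme are illustrative, not the authors' code:

#include <cstdint>
#include <cuda_runtime.h>

// Placeholder for the actual probe kernel (e.g. the cuckoo probe sketched
// earlier), operating on hash tables already resident in device memory.
__global__ void probe_chunk(const uint32_t *keys, size_t n) { /* ... */ }

// Stream the large outer relation chunk by chunk; while one stream copies
// the next chunk over PCIe, the other stream probes the previous one.
void stream_probe(const uint32_t *h_probe /* pinned host memory */,
                  size_t n, size_t chunk) {
    cudaStream_t stream[2];
    uint32_t *d_buf[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], chunk * sizeof(uint32_t));
    }
    for (size_t off = 0, i = 0; off < n; off += chunk, ++i) {
        int s = (int)(i % 2);                        // double buffering
        size_t len = (off + chunk < n) ? chunk : n - off;
        cudaMemcpyAsync(d_buf[s], h_probe + off, len * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, stream[s]);
        probe_chunk<<<(unsigned)((len + 255) / 256), 256, 0, stream[s]>>>(d_buf[s], len);
    }
    cudaDeviceSynchronize();                         // wait for all chunks
    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}

Because operations within a stream execute in order, each buffer is only overwritten after the kernel that reads it has finished; the two streams are what keep the PCIe copy engine and the compute units busy at the same time.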
Performance: Join Probe Across Heterogeneous Processors
Intel Xeon E5-1607 v2 and NVIDIA GeForce GTX970
[Figure: probe throughput (GB/s, 0–5 scale) over the number of CPU worker threads (0–8), for CPU alone vs. CPU + GPU]
9 / 14
Coprocessor Control
Thread Scheduling
Figure: Profiling of coprocessor kernel invocations
→ Steer control flow from the coprocessor itself (one possible pattern sketched below)
→ Increase block size
10 / 14
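One possible reading of steering control flow from the coprocessor itself is a persistent-kernel pattern, where a single long-running kernel pulls work from a device-side queue instead of the host issuing many short kernel launches; this is an illustrative interpretation, not necessarily the talk's implementation:

#include <cstdint>

// Device-side work queue: blocks claim chunks themselves, so the control
// loop runs on the coprocessor rather than as many host-side launches.
struct WorkQueue {
    unsigned int head;   // next unclaimed chunk index
    unsigned int tail;   // total number of chunks
};

__global__ void persistent_worker(WorkQueue *q,
                                  const uint32_t *data, unsigned int chunk_len) {
    __shared__ unsigned int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(&q->head, 1u);   // this block claims one chunk
        __syncthreads();
        if (item >= q->tail) return;          // queue drained: kernel ends
        // ... process data[item * chunk_len .. item * chunk_len + chunk_len) ...
        __syncthreads();                      // all threads done before next claim
    }
}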
Hardware Schema – Memory Bandwidth Utilization
[Diagram: measured bandwidths per component]
  CPU, per core (4 cores): scan 15.8 GB/s, gather 4 GB/s, even share 7.8 GB/s
  GPU, local memory (149 GB/s): prefix scan 84 GB/s, gather 33.2 GB/s
  PCIe link (via memory controller, MC): 12 GB/s
  Host main memory (MEM): 31 GB/s
◮ The PCI Express bus and main memory can become a bottleneck
◮ Take bandwidth footprint and throughput into account (see the short calculation below)
→ Instead of bulk processing, apply a dataflow perspective
11–12 / 14
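A back-of-the-envelope check of the "even share" figure, assuming the 31 GB/s value is the host main-memory bandwidth shared by the four cores:

\[
\text{even share} \;=\; \frac{31\ \text{GB/s}}{4\ \text{cores}} \;\approx\; 7.8\ \text{GB/s per core} \;<\; 15.8\ \text{GB/s per-core scan rate},
\]

so main memory rather than core count limits scan-heavy work; moving part of the probe work across the 12 GB/s PCIe link relieves exactly this resource, which is one way to read the CPU + GPU gains shown earlier.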
Improving Resource Utilization
Cache awareness
◮ Materialize tuples in the hash table → useful payload in each cache line (see the sketch below)
◮ Order probe data by hash function
GPU optimizations
◮ Pipeline data between GPU kernels
◮ Concurrent kernel execution
13 / 14
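A small sketch of the cache-awareness point, assuming narrow keys and payloads; the entry layout is illustrative only:

#include <cstdint>

// Materializing the payload next to the key means a hash-table hit delivers
// the join result from the same cache line, with no second lookup into the
// base table.
struct HtEntry {
    uint32_t key;       // join key
    uint32_t payload;   // materialized attribute needed by the query
};                      // 8 B per entry, i.e. 8 entries per 64 B cache line

// Ordering the probe input by hash value makes consecutive probes touch
// neighbouring buckets, so a fetched cache line is reused before eviction.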
Key Insights
◮ PCIe is not the dominating bottleneck for GPU joins
◮ Dataflow-oriented processing → join arbitrary outer relation sizes
◮ Move part of the probes to the coprocessor → save memory bandwidth → gain throughput
Future Work
Figure: Operator pipelines. From Leis et al., Morsel-Driven Parallelism, SIGMOD 2014.
◮ Join processing → query processing
◮ Compile pipelined operator sequences
◮ Stream arbitrary columns
Thank you!
14 / 14