Efficient Join Processing across Heterogeneous Processors
Henning Funke, Sebastian Breß, Stefan Noll, Jens Teubner
December 15, 2015
1 / 14
GPUs IN DATABASES ARE LIKE A MUSCLE CAR IN A TRAFFIC JAM
Bottleneck, really?
3 / 14
GPU – Hash Join Algorithm
Cuckoo Hashing Implementation¹
→ Strict limit on the number of probes per query key
◮ Pipeline the join probe and result compaction in shared memory (probe sketched below)

Performance: Build (GTX970)
[Figure: build throughput (GB/s, 0–10 scale) over fill factors 0.5–0.9, with PCIe bandwidth shown for reference]

¹ Based on: Alcantara, Dan Anthony Feliciano. Efficient Hash Tables on the GPU. University of California at Davis, 2011.
4 / 14
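A minimal CUDA sketch of this probe scheme, assuming a cuckoo table with MAX_PROBES hash functions; the table layout, hash constants, and the block-local result buffer are illustrative and not taken from the authors' implementation:

#include <cstdint>

#define MAX_PROBES 3            // strict limit on probes per query key
#define EMPTY_KEY  0xFFFFFFFFu  // sentinel for empty slots

// Illustrative hash functions; cuckoo hashing needs independent hashes.
__device__ uint32_t hash_i(uint32_t key, int i, uint32_t table_size) {
    const uint32_t a[MAX_PROBES] = { 0x9E3779B1u, 0x85EBCA77u, 0xC2B2AE3Du };
    return (a[i] * key) % table_size;
}

// One thread probes one outer tuple; block-local matches are compacted in
// shared memory before being flushed to the global result buffer.
__global__ void cuckoo_probe(const uint32_t *probe_keys, int n,
                             const uint32_t *ht_keys, const uint32_t *ht_vals,
                             uint32_t table_size,
                             uint2 *result, unsigned int *result_count) {
    __shared__ uint2 buf[256];            // assumes blockDim.x <= 256
    __shared__ unsigned int buf_count, base;
    if (threadIdx.x == 0) buf_count = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        uint32_t key = probe_keys[tid];
        for (int i = 0; i < MAX_PROBES; ++i) {      // bounded number of probes
            uint32_t slot = hash_i(key, i, table_size);
            if (ht_keys[slot] == key) {             // match: stage it locally
                buf[atomicAdd(&buf_count, 1u)] = make_uint2(key, ht_vals[slot]);
                break;
            }
            if (ht_keys[slot] == EMPTY_KEY) break;  // key cannot be in the table
        }
    }
    __syncthreads();

    // Reserve a contiguous range in the global result and flush the buffer.
    if (threadIdx.x == 0) base = atomicAdd(result_count, buf_count);
    __syncthreads();
    for (unsigned int i = threadIdx.x; i < buf_count; i += blockDim.x)
        result[base + i] = buf[i];
}

The bounded loop is what keeps probe cost per key predictable; the shared-memory staging is what the slide calls pipelining the probe with result compaction.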
Performance: GPU Join Probe (GTX970)
[Figure: probe throughput (GB/s, 0–20 scale) over build sizes from 10⁰ to 10⁵ KB, with PCIe bandwidth shown for reference]
5 / 14
Performance: GPU Join Probe (GTX970)
[Figure: same plot, annotated: build sizes above 100 K elements correspond to real-world data]
6 / 14
Joins on Multiple Heterogeneous Processors
Challenges
◮ Scalability to large data
◮ Communication
◮ Local and remote resources
Related work
Figure: P. Frey, R. Goncalves, M. L. Kersten, and J. Teubner. Spinning Relations: High-Speed Networks for Distributed Join Processing. In DaMoN, 2009.
7 / 14
Star Join on Heterogeneous Processors
Processing Strategy
→ Allocate the smaller (dimension) tables on all devices
→ Asynchronous data transfers at computation speed (see the sketch below)
→ Merge results into contiguous arrays
8 / 14
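A hedged sketch of the asynchronous transfer step, assuming the replicated smaller tables already reside on the device and the large probe relation is streamed in chunks over PCIe with two CUDA streams so copies overlap kernel execution; probe_chunk, the chunk size, and the double-buffering scheme are illustrative, not the authors' code:

#include <cstdint>
#include <cuda_runtime.h>

// Placeholder for the actual probe kernel (e.g. the cuckoo probe sketched
// earlier), operating on hash tables already resident in device memory.
__global__ void probe_chunk(const uint32_t *keys, size_t n) { /* ... */ }

// Stream the large outer relation chunk by chunk; while one stream copies
// the next chunk over PCIe, the other stream probes the previous one.
void stream_probe(const uint32_t *h_probe /* pinned host memory */,
                  size_t n, size_t chunk) {
    cudaStream_t stream[2];
    uint32_t *d_buf[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&d_buf[s], chunk * sizeof(uint32_t));
    }
    for (size_t off = 0, i = 0; off < n; off += chunk, ++i) {
        int s = (int)(i % 2);                        // double buffering
        size_t len = (off + chunk < n) ? chunk : n - off;
        cudaMemcpyAsync(d_buf[s], h_probe + off, len * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, stream[s]);
        probe_chunk<<<(unsigned)((len + 255) / 256), 256, 0, stream[s]>>>(d_buf[s], len);
    }
    cudaDeviceSynchronize();                         // wait for all chunks
    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
}

Because operations within a stream execute in order, each buffer is only overwritten after the kernel that reads it has finished; the two streams are what keep the PCIe copy engine and the compute units busy at the same time.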
Performance: Join Probe Across Heterogeneous Processors
Intel Xeon E5-1607 v2 and NVIDIA GeForce GTX970
[Figure: probe throughput (GB/s, 0–5 scale) over the number of CPU worker threads (0–8), for CPU alone vs. CPU + GPU]
9 / 14
Coprocessor Control
Thread Scheduling
Figure: Profiling of coprocessor kernel invocations
→ Steer control flow from the coprocessor itself (one possible pattern sketched below)
→ Increase block size
10 / 14
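One possible reading of steering control flow from the coprocessor itself is a persistent-kernel pattern, where a single long-running kernel pulls work from a device-side queue instead of the host issuing many short kernel launches; this is an illustrative interpretation, not necessarily the talk's implementation:

#include <cstdint>

// Device-side work queue: blocks claim chunks themselves, so the control
// loop runs on the coprocessor rather than as many host-side launches.
struct WorkQueue {
    unsigned int head;   // next unclaimed chunk index
    unsigned int tail;   // total number of chunks
};

__global__ void persistent_worker(WorkQueue *q,
                                  const uint32_t *data, unsigned int chunk_len) {
    __shared__ unsigned int item;
    while (true) {
        if (threadIdx.x == 0)
            item = atomicAdd(&q->head, 1u);   // this block claims one chunk
        __syncthreads();
        if (item >= q->tail) return;          // queue drained: kernel ends
        // ... process data[item * chunk_len .. item * chunk_len + chunk_len) ...
        __syncthreads();                      // all threads done before next claim
    }
}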
Hardware Schema – Memory Bandwidth Utilization
[Diagram: measured bandwidths per component]
  CPU, per core (4 cores): scan 15.8 GB/s, gather 4 GB/s, even share 7.8 GB/s
  GPU, local memory (149 GB/s): prefix scan 84 GB/s, gather 33.2 GB/s
  PCIe link (via memory controller, MC): 12 GB/s
  Host main memory (MEM): 31 GB/s
◮ The PCI Express bus and main memory can become a bottleneck
◮ Take bandwidth footprint and throughput into account (see the short calculation below)
→ Instead of bulk processing, apply a dataflow perspective
11–12 / 14
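A back-of-the-envelope check of the "even share" figure, assuming the 31 GB/s value is the host main-memory bandwidth shared by the four cores:

\[
\text{even share} \;=\; \frac{31\ \text{GB/s}}{4\ \text{cores}} \;\approx\; 7.8\ \text{GB/s per core} \;<\; 15.8\ \text{GB/s per-core scan rate},
\]

so main memory rather than core count limits scan-heavy work; moving part of the probe work across the 12 GB/s PCIe link relieves exactly this resource, which is one way to read the CPU + GPU gains shown earlier.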
Improving Resource Utilization
Cache awareness
◮ Materialize tuples in the hash table → useful payload in each cache line (see the sketch below)
◮ Order probe data by hash function
GPU optimizations
◮ Pipeline data between GPU kernels
◮ Concurrent kernel execution
13 / 14
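A small sketch of the cache-awareness point, assuming narrow keys and payloads; the entry layout is illustrative only:

#include <cstdint>

// Materializing the payload next to the key means a hash-table hit delivers
// the join result from the same cache line, with no second lookup into the
// base table.
struct HtEntry {
    uint32_t key;       // join key
    uint32_t payload;   // materialized attribute needed by the query
};                      // 8 B per entry, i.e. 8 entries per 64 B cache line

// Ordering the probe input by hash value makes consecutive probes touch
// neighbouring buckets, so a fetched cache line is reused before eviction.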
Key Insights
◮ PCIe is not the dominating bottleneck for GPU joins
◮ Dataflow-oriented processing → join arbitrary outer relation sizes
◮ Move part of the probes to the coprocessor → save memory bandwidth → gain throughput
Future Work
Figure: Operator pipelines. From Leis et al., Morsel-Driven Parallelism, SIGMOD 2014.
◮ Join processing → query processing
◮ Compile pipelined operator sequences
◮ Stream arbitrary columns
Thank you!
14 / 14