S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March 20 th 2019
RECAP JOINS Joins are implicit in a business question SQL Database Operators Business question Counts the number of select orders in a given o_orderpriority, aggregate quarter of a given year count(o_orderkey) as order_count, from in which at least one orders lineitem was received where by the customer later predicate (filter) o_orderdate >= date '[DATE]' and than its committed o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem date. The query lists join where l_orderkey = o_orderkey and the count of such predicate (filter) l_commitdate < l_receiptdate) orders for each order aggregate group by priority sorted in o_orderpriority, order by ascending priority sort o_orderpriority; order 4
TPC-H SCHEMA part (p_) PARTKEY customer (c_) NAME order (o_) lineitem (l_) CUSTKEY MFGR ORDERKEY ORDERKEY NAME CATEGORY LINENUMBER CUSTKEY ADDRESS BRAND PARTKEY ORDERDATE CITY … SUPPKEY ORDPRIORITY … COMMITDATE ORDERSTATUS supplier (s_) … RECEIPTDATE SUPPKEY … NAME … ADDRESS nation (n_) CITY NATIONKEY NATIONKEY NAME … … 5
RELATIONAL JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Foreign Key 1 after applying predicate “ l_commitdate < l_receiptdate ” 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ” 6
HASH JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Build hash table Foreign Key = Probe inputs 1 after applying predicate “ l_commitdate < l_receiptdate ” 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ” 7
JOINS & E2E PERFORMANCE CPU TPC-H Q4 execution breakdown GPU TPC-H Q4 execution breakdown join group-by join group-by 1% 1% 99% 99% 18/22 TPC-H Queries involve Joins and are the longest running ones 1 1 c.f. recently published TPC-H results at http://www.tpc.org/tpch/results/tpch_last_ten_results.asp 8
IMPLEMENTING GPU JOINS In Heterogeneous Systems Hash Table(s) If the hash table fits in Key Payload 32GB GPU memory, performance 23 5 HBM is primarily bound by 27 2 random memory access. 1 Let’s ignore CPU -GPU Build & Probe interconnect for a moment. 1TB+ DDR DB 9 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289
PERFORMANCE Peak memory Random 8B bandwidth 1 access 1 High-end CPU 120 GB/s 6GB/s (6-channel DDR4) 10x NVIDIA Tesla V100 900 GB/s 60GB/s 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289 10 http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75
PERFORMANCE VS. CAPACITY Peak memory Random 8B Memory capacity bandwidth 1 access 1 High-end CPU 120 GB/s 6GB/s 1 TB+ (6-channel DDR4) 1/32 NVIDIA Tesla V100 900 GB/s 60GB/s 32GB 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289 11 http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75
PERFORMANCE VS. CAPACITY Peak memory Random 8B Memory capacity bandwidth 1 access 1 High-end CPU 120 GB/s 6GB/s 1 TB+ (6-channel DDR4) 1/2 NVIDIA Tesla V100 900 GB/s 60GB/s 32GB NVIDIA DGX-2 16 x 900 GB/s 16x 60GB/s 512 GB (16x V100) 1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289 12 http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75
IS A SINGLE V100 FAST/LARGE ENOUGH? TPC-H query 4 @SF1000 = 1000GB data warehouse Hash table sizes GPU execution breakdown GPU execution breakdown, compressed data Query SF1K SF3K SF10K join group-by join group-by 1% 1% Q4 1.5 GB 4.5 GB 15 GB Q18 21 GB 63 GB 210 GB Q21 10.5 GB 31.5 GB 105 GB 7.0 s 3.8 s 99% 99% For further speedup or > SF 1000 need to to distribute hash table across multiple GPUs 13
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE NVIDIA DGX-2 Two GPU Boards 2 8 V100 32GB GPUs per board 6 NVSwitches per board 512GB Total HBM2 Memory NVIDIA Tesla V100 32GB 1 interconnected by Plane Card Twelve NVSwitches Eight EDR Infiniband/100 GigE 3 4 2.4 TB/sec bi-section 1600 Gb/sec Total bandwidth Bi-directional Bandwidth Two High-Speed Ethernet 8 10/25/40/100 GigE 5 Two Intel Xeon Platinum CPUs 30 TB NVME SSDs 7 Internal Storage 6 1.5 TB System Memory 14
POTENTIAL DGX-2 IMPLEMENTATION Use 2.4TB/s bisection BW to exchange FT chunks GPU GPU GPU GPU GPU GPU GPU GPU 8 9 10 11 12 13 14 15 NVSwitch Fabric GPU GPU GPU GPU GPU GPU GPU GPU 0 1 2 3 4 5 6 7 15
SCALING OF INNER JOIN 16
DISCLAIMER This investigation is ongoing For a production system some additional aspects need to be considered: - Data Skew - Cardinality estimation - Query optimizer 17
SCALING OF INNER JOIN redundant build of replicated HT (step 0) GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 18
SCALING OF INNER JOIN redundant build of replicated HT (step 1..#GPU-1) GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 19
SCALING OF INNER JOIN redundant build of replicated HT (step #GPU) GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 20
SCALING OF INNER JOIN parallel probe of replicated HT GPU 0 GPU 1 GPU 2 GPU #GPU Full HT Full HT Full HT Full HT … 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 Probe table 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 21
SCALING OF INNER JOIN Benchmark Problem randomly generated 8 bytes keys build table size = probe table size = 335544320 rows (worst case for HT creation fitting in the memory of a single GPU: 2x 2.5GiB for tables, 2x10GiB for HT + staging buffers (for strong scaling experiment)) HT occupancy = 50% selectivity = 0 for analytical purposes we will look at a real problem later build and probe tables are evenly partitioned across GPUs GPU 0 GPU 1 GPU 2 GPU #GPU Build table 0…B 1 -1 B 1 …B 2 -1 B 2 …B 3 -1 B # …B -1 … Probe table 0…P 1 -1 P 1 …P 2 -1 P 2 …P 3 -1 P # …P -1 22
SCALING OF INNER JOIN ON DGX-2 with redundant build of replicated HT 600 120% 500 100% 400 80% Parallel efficiency Runtime [ms] Runtime [ms] Build runtime [ms] 300 60% Probe runtime [ms] Parallel Efficiency build 200 40% Parallel Efficiency probe Parallel Efficiency 100 20% 0 0% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 #GPUs 23 Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)
SCALING OF INNER JOIN Basic Idea Open addressing hash table with N buckets key -> hash_value = hf(key) -> bucket_idx = hash_value%N Partition N hash table buckets equally onto GPUs: GPU 0 GPU #GPU GPU 1 Hash table 0…N 1 -1 N # ...N-1 N 1 …N 2 -1 … The bucket_idx and target HT partition can be computed locally from the key 24
SCALING OF INNER JOIN parallel build of a replicated HT (step 0 of phase 1) GPU 0 GPU #GPU GPU 1 temp HT temp HT temp HT if hash to if hash to if hash to bucket bucket bucket 0..N 1 -1 N 1 ..N 2 -1 N # ..N-1 … 0…B 1 -1 B # …B -1 B 1 …B 2 -1 0…P 1 -1 P # …P -1 P 1 …P 2 -1 25
SCALING OF INNER JOIN parallel build of a replicated HT (step 1..#GPU-1 of phase 1) GPU 0 GPU #GPU GPU 1 temp HT temp HT temp HT if hash to if hash to if hash to bucket bucket bucket 0..N 1 -1 N 1 ..N 2 -1 N # ..N-1 … 0…B 1 -1 B # …B -1 B 1 …B 2 -1 0…P 1 -1 P # …P -1 P 1 …P 2 -1 26
SCALING OF INNER JOIN parallel build of a replicated HT (step #GPU of phase 1) GPU 0 GPU #GPU GPU 1 temp HT temp HT temp HT if hash to if hash to if hash to bucket bucket bucket 0..N 1 -1 N 1 ..N 2 -1 N # ..N-1 … 0…B 1 -1 B # …B -1 B 1 …B 2 -1 0…P 1 -1 P # …P -1 P 1 …P 2 -1 27
SCALING OF INNER JOIN parallel build of a replicated HT (phase 2 – merge step) GPU 0 GPU 1 GPU 2 GPU # temp HT temp HT temp HT temp HT 0…N 1 -1 N 1 …N 2 -1 N 2 …N 3 -1 N # …N -1 28
SCALING OF INNER JOIN parallel build of a replicated HT (phase 2 – merge step) GPU 0 GPU 1 GPU 2 GPU # GPU 0 GPU 1 GPU 2 GPU # temp HT temp HT temp HT temp HT res HT res HT res HT res HT 0…N 1 -1 0…N 1 -1 0…N 1 -1 0…N 1 -1 0…N 1 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 1 …N 2 -1 N 2 …N 3 -1 N 2 …N 3 -1 N 2 …N 3 -1 N 2 …N 3 -1 N 2 …N 3 -1 N # …N -1 N # …N -1 N # …N -1 N # …N -1 N # …N -1 29
Recommend
More recommend