S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS. Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus. PowerPoint PPT Presentation.



SLIDE 1

Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March 20th 2019

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS

SLIDE 2

RECAP JOINS

Joins are implicit in a business question.

Business question: counts the number of orders in a given quarter of a given year in which at least one lineitem was received by the customer later than its committed date. The query lists the count of such orders for each order priority, sorted in ascending priority order.

Database operators: predicate (filter), join, aggregate, sort

SQL:

select
    o_orderpriority,
    count(o_orderkey) as order_count
from
    orders
where
    o_orderdate >= date '[DATE]'
    and o_orderdate < date '[DATE]' + interval '3' month
    and exists (
        select *
        from lineitem
        where l_orderkey = o_orderkey
        and l_commitdate < l_receiptdate
    )
group by
    o_orderpriority
order by
    o_orderpriority;
SLIDE 3

TPC-H SCHEMA

orders (o_):    ORDERKEY, CUSTKEY, ORDERDATE, ORDPRIORITY, ORDERSTATUS, …
lineitem (l_):  ORDERKEY, LINENUMBER, PARTKEY, SUPPKEY, COMMITDATE, RECEIPTDATE, …
customer (c_):  CUSTKEY, NAME, ADDRESS, CITY, …
supplier (s_):  SUPPKEY, NAME, ADDRESS, CITY, NATIONKEY, …
part (p_):      PARTKEY, NAME, MFGR, CATEGORY, BRAND, …
nation (n_):    NATIONKEY, NAME, …
SLIDE 4

RELATIONAL JOIN

Join Lineitem1 (foreign key l_orderkey) with Order2 (primary key o_orderkey, payload o_orderpriority) on l_orderkey = o_orderkey:

Lineitem1 foreign key l_orderkey: 23, 14, 56, 11, 39, 27, 23
Order2 (o_orderkey, o_orderpriority): (11, 1), (23, 5), (27, 2), (29, 4)
Join results (o_orderkey, o_orderpriority): (23, 5), (11, 1), (27, 2), (23, 5)

1 after applying predicate "l_commitdate < l_receiptdate" 2 after applying predicates "o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month"

SLIDE 5

HASH JOIN

Build phase: insert the Order2 primary-key rows (o_orderkey, o_orderpriority) = (11, 1), (23, 5), (27, 2), (29, 4) into a hash table.

Probe phase: probe the hash table with the Lineitem1 foreign keys l_orderkey = 23, 14, 56, 11, 39, 27, 23.

Join results (o_orderkey, o_orderpriority): (23, 5), (11, 1), (27, 2), (23, 5)

1 after applying predicate "l_commitdate < l_receiptdate" 2 after applying predicates "o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month"
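The two phases above can be sketched in a few lines of Python (a hedged, single-threaded stand-in; the talk's actual implementation runs as parallel CUDA kernels on GPU-resident tables):

```python
# Minimal sketch of a hash join: build on the primary-key side,
# probe with the foreign keys.
def hash_join(build_rows, probe_keys):
    # Build phase: hash table from primary key -> payload(s)
    ht = {}
    for key, payload in build_rows:
        ht.setdefault(key, []).append(payload)
    # Probe phase: look up each foreign key, emit one result per match
    results = []
    for key in probe_keys:
        for payload in ht.get(key, []):
            results.append((key, payload))
    return results

order = [(11, 1), (23, 5), (27, 2), (29, 4)]  # (o_orderkey, o_orderpriority)
lineitem = [23, 14, 56, 11, 39, 27, 23]       # l_orderkey (foreign key)
print(hash_join(order, lineitem))             # [(23, 5), (11, 1), (27, 2), (23, 5)]
```

The output matches the join-result column on the slide: keys 14, 56 and 39 find no match, key 23 matches twice.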

SLIDE 6

JOINS & E2E PERFORMANCE

CPU TPC-H Q4 execution breakdown: join 99%, group-by 1%
GPU TPC-H Q4 execution breakdown: join 99%, group-by 1%

18/22 TPC-H queries involve joins, and they are the longest-running ones1

1 c.f. recently published TPC-H results at http://www.tpc.org/tpch/results/tpch_last_ten_results.asp

SLIDE 7

IMPLEMENTING GPU JOINS

In Heterogeneous Systems

The DB resides in 1TB+ DDR system memory; the hash table(s) of (key, payload) pairs, e.g. (23, 5) and (27, 2), live in 32GB HBM, where build & probe run.

If the hash table fits in GPU memory, performance is primarily bound by random memory access.1 Let's ignore the CPU-GPU interconnect for a moment.

1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289

SLIDE 8

PERFORMANCE

                               Peak memory bandwidth1   Random 8B access1
High-end CPU (6-channel DDR4)  120 GB/s                 6 GB/s
NVIDIA Tesla V100              900 GB/s                 60 GB/s (10x)

1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289

http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

SLIDE 9

PERFORMANCE VS. CAPACITY

                               Peak memory bandwidth1   Random 8B access1   Memory capacity
High-end CPU (6-channel DDR4)  120 GB/s                 6 GB/s              1 TB+
NVIDIA Tesla V100              900 GB/s                 60 GB/s             32 GB (1/32)

1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289

http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

SLIDE 10

PERFORMANCE VS. CAPACITY

                               Peak memory bandwidth1   Random 8B access1   Memory capacity
High-end CPU (6-channel DDR4)  120 GB/s                 6 GB/s              1 TB+
NVIDIA Tesla V100              900 GB/s                 60 GB/s             32 GB
NVIDIA DGX-2 (16x V100)        16 x 900 GB/s            16 x 60 GB/s        512 GB (1/2)

1 c.f.“How to Get the Most out of GPU Accelerated Database Operators”, GTC Silicon Valley 2018, Session ID S8289

http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

SLIDE 11

IS A SINGLE V100 FAST/LARGE ENOUGH?

TPC-H query 4 @SF1000 = 1000GB data warehouse

GPU execution breakdown: join 99%, group-by 1%
GPU execution breakdown, compressed data: join 99%, group-by 1%
(runtimes for the two runs: 3.8 s and 7.0 s)

Hash table sizes:

Query   SF1K      SF3K      SF10K
Q4      1.5 GB    4.5 GB    15 GB
Q18     21 GB     63 GB     210 GB
Q21     10.5 GB   31.5 GB   105 GB

For further speedup, or beyond SF1000, we need to distribute the hash table across multiple GPUs.

SLIDE 12

DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE

NVIDIA DGX-2

  • NVIDIA Tesla V100 32GB
  • Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
  • Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
  • Eight EDR Infiniband/100 GigE: 1600 Gb/sec total bi-directional bandwidth
  • Two Intel Xeon Platinum CPUs
  • 1.5 TB system memory
  • Two high-speed Ethernet: 10/25/40/100 GigE
  • 30 TB NVME SSDs internal storage

SLIDE 13

POTENTIAL DGX-2 IMPLEMENTATION

Use the 2.4TB/s bisection BW to exchange HT chunks

GPU 0 through GPU 15, all interconnected through the NVSwitch fabric

SLIDE 14

SCALING OF INNER JOIN

SLIDE 15

DISCLAIMER

For a production system some additional aspects need to be considered:

  • Data Skew
  • Cardinality estimation
  • Query optimizer

This investigation is ongoing

SLIDE 16

SCALING OF INNER JOIN

redundant build of replicated HT (step 0)

Build table chunks, probe table chunks, and the (replicated) full HT per GPU:

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, full HT
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, full HT
GPU 2: build rows B2…B3-1, probe rows P2…P3-1, full HT
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, full HT

SLIDE 17

SCALING OF INNER JOIN

redundant build of replicated HT (step 1..#GPU-1)

Build table chunks, probe table chunks, and the (replicated) full HT per GPU:

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, full HT
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, full HT
GPU 2: build rows B2…B3-1, probe rows P2…P3-1, full HT
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, full HT

SLIDE 18

SCALING OF INNER JOIN

redundant build of replicated HT (step #GPU)

Build table chunks, probe table chunks, and the (replicated) full HT per GPU:

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, full HT
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, full HT
GPU 2: build rows B2…B3-1, probe rows P2…P3-1, full HT
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, full HT

SLIDE 19

SCALING OF INNER JOIN

parallel probe of replicated HT

Each GPU probes its own probe table chunk against its full HT copy:

GPU 0: probe rows 0…P1-1, full HT
GPU 1: probe rows P1…P2-1, full HT
GPU 2: probe rows P2…P3-1, full HT
…
GPU #GPU: probe rows P#…P-1, full HT

SLIDE 20

SCALING OF INNER JOIN

Benchmark Problem

  • randomly generated 8-byte keys
  • build table size = probe table size = 335,544,320 rows (worst case for HT creation fitting in the memory of a single GPU: 2x 2.5GiB for the tables, 2x 10GiB for HT + staging buffers, for the strong scaling experiment)
  • HT occupancy = 50%
  • selectivity = 0 for analytical purposes (we will look at a real problem later)
  • build and probe tables are evenly partitioned across GPUs: GPU i holds build rows Bi…Bi+1-1 and probe rows Pi…Pi+1-1

SLIDE 21

SCALING OF INNER JOIN ON DGX-2

with redundant build of replicated HT

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

[Chart: total, build, and probe runtime [ms] plus parallel efficiency (overall, build, probe) vs. #GPUs from 1 to 16]

SLIDE 22

SCALING OF INNER JOIN

Basic Idea

Open addressing hash table with N buckets:
key -> hash_value = hf(key) -> bucket_idx = hash_value % N

Partition the N hash table buckets equally onto the GPUs: GPU 0 owns buckets 0…N1-1, GPU 1 owns N1…N2-1, …, GPU #GPU owns N#…N-1. The bucket_idx and target HT partition can be computed locally from the key.
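The bucket-ownership rule can be sketched as follows (the hash function and sizes are illustrative assumptions, not the talk's actual choices):

```python
# Map a key to its bucket and to the GPU that owns that bucket range.
NUM_BUCKETS = 1 << 20   # N: total hash table buckets (assumed)
NUM_GPUS = 16

def bucket_idx(key: int) -> int:
    # Stand-in for hf(key): a 64-bit multiplicative hash, then modulo N.
    return ((key * 0x9E3779B97F4A7C15) & ((1 << 64) - 1)) % NUM_BUCKETS

def owner_gpu(key: int) -> int:
    # Buckets are split into equal contiguous ranges, so the target HT
    # partition is computable locally from the key alone.
    buckets_per_gpu = NUM_BUCKETS // NUM_GPUS
    return bucket_idx(key) // buckets_per_gpu
```

Because every GPU evaluates the same hash, no coordination is needed to decide where a key belongs.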

SLIDE 23

SCALING OF INNER JOIN

parallel build of a replicated HT (step 0 of phase 1)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, temp HT
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, temp HT
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, temp HT

Each build key is routed by its hash: to GPU 0 if it hashes to a bucket in 0…N1-1, to GPU 1 if in N1…N2-1, …, to the last GPU if in N#…N-1.

SLIDE 24

SCALING OF INNER JOIN

parallel build of a replicated HT (step 1..#GPU-1 of phase 1)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, temp HT
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, temp HT
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, temp HT

Each build key is routed by its hash: to GPU 0 if it hashes to a bucket in 0…N1-1, to GPU 1 if in N1…N2-1, …, to the last GPU if in N#…N-1.

SLIDE 25

SCALING OF INNER JOIN

parallel build of a replicated HT (step #GPU of phase 1)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, temp HT
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, temp HT
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, temp HT

Each build key is routed by its hash: to GPU 0 if it hashes to a bucket in 0…N1-1, to GPU 1 if in N1…N2-1, …, to the last GPU if in N#…N-1.

SLIDE 26

SCALING OF INNER JOIN

parallel build of a replicated HT (phase 2 – merge step)

GPU 0 temp HT: buckets 0…N1-1
GPU 1 temp HT: buckets N1…N2-1
GPU 2 temp HT: buckets N2…N3-1
…
GPU # temp HT: buckets N#…N-1

SLIDE 27

SCALING OF INNER JOIN

parallel build of a replicated HT (phase 2 – merge step)

GPU 0 temp HT: buckets 0…N1-1; GPU 1 temp HT: buckets N1…N2-1; GPU 2 temp HT: buckets N2…N3-1; …; GPU # temp HT: buckets N#…N-1

In the merge step every GPU gathers all bucket ranges (0…N1-1, N1…N2-1, N2…N3-1, …, N#…N-1) into its resulting full HT (res HT).

SLIDE 28

SCALING OF INNER JOIN

parallel build of a replicated HT (phase 2 – merge step)

GPU 0 temp HT: buckets 0…N1-1; GPU 1 temp HT: buckets N1…N2-1; GPU 2 temp HT: buckets N2…N3-1; …; GPU # temp HT: buckets N#…N-1

In the merge step every GPU gathers all bucket ranges (0…N1-1, N1…N2-1, N2…N3-1, …, N#…N-1) into its resulting full HT (res HT).

SLIDE 29

SCALING OF INNER JOIN ON DGX-2

with parallel build of replicated HT

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

[Chart: total, build, and probe runtime [ms] plus parallel efficiency (overall, build, probe) vs. #GPUs from 1 to 16]

SLIDE 30

SCALING OF INNER JOIN ON DGX-2

with parallel build of replicated HT

With 16 GPUs most of the time is spent in HT merging.

SLIDE 31

SCALING OF INNER JOIN

parallel build of partitioned HT and parallel probe

Replicated (full HT on every GPU):

  • Limited capacity
  • Slower building: need to merge HT partitions
  • Faster probing: no inter-GPU traffic

Partitioned (one HT partition per GPU):

  • High capacity
  • Faster building: no need to merge partitions
  • Slower probing: need to access remote partitions
SLIDE 32

SCALING OF INNER JOIN

parallel build of a partitioned HT (step 0)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

Each build key is forwarded to the GPU that owns its bucket range: hash to bucket 0…N1-1 goes to GPU 0, N1…N2-1 to GPU 1, …, N#…N-1 to the last GPU.
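One build step of this scheme can be sketched as follows (assumed helper names; a Python stand-in for the per-GPU kernels and NVLink pushes):

```python
# Each GPU bins its local build chunk by destination partition, then each
# partition owner inserts the (key, payload) pairs addressed to it.
def partitioned_build(build_chunks, num_gpus, owner_of):
    # Phase A: every GPU sorts its chunk into per-destination outboxes.
    outboxes = [[[] for _ in range(num_gpus)] for _ in range(num_gpus)]
    for g, chunk in enumerate(build_chunks):
        for key, payload in chunk:
            outboxes[g][owner_of(key)].append((key, payload))
    # Phase B: each owner inserts everything addressed to it into its
    # local HT partition (here a plain dict per GPU).
    partitions = [dict() for _ in range(num_gpus)]
    for owner in range(num_gpus):
        for g in range(num_gpus):
            partitions[owner].update(outboxes[g][owner])
    return partitions
```

For example, with 2 GPUs and `owner_of = key % 2`, every even key ends up in partition 0 and every odd key in partition 1, regardless of which GPU it started on.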

SLIDE 33

SCALING OF INNER JOIN

parallel build of a partitioned HT (step 1..#GPU-1)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

Each build key is forwarded to the GPU that owns its bucket range: hash to bucket 0…N1-1 goes to GPU 0, N1…N2-1 to GPU 1, …, N#…N-1 to the last GPU.

SLIDE 34

SCALING OF INNER JOIN

parallel build of a partitioned HT (ring exchange) (step #GPU)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

Each build key is forwarded to the GPU that owns its bucket range: hash to bucket 0…N1-1 goes to GPU 0, N1…N2-1 to GPU 1, …, N#…N-1 to the last GPU.

SLIDE 35

SCALING OF INNER JOIN

parallel probe of a partitioned HT (ring exchange) (step 0)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

Each probe key is forwarded to the GPU that owns its bucket range: hash to bucket 0…N1-1 goes to GPU 0, N1…N2-1 to GPU 1, …, N#…N-1 to the last GPU.

SLIDE 36

SCALING OF INNER JOIN

parallel probe of a partitioned HT (ring exchange) (step 1..#GPU-1)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

Each probe key is forwarded to the GPU that owns its bucket range: hash to bucket 0…N1-1 goes to GPU 0, N1…N2-1 to GPU 1, …, N#…N-1 to the last GPU.

SLIDE 37

SCALING OF INNER JOIN

parallel probe of a partitioned HT (ring exchange) (step #GPU)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

Each probe key is forwarded to the GPU that owns its bucket range: hash to bucket 0…N1-1 goes to GPU 0, N1…N2-1 to GPU 1, …, N#…N-1 to the last GPU.
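The ring exchange above can be simulated in miniature (a toy Python model with illustrative names; the real chunks move over NVLink and the probing runs as CUDA kernels):

```python
# Every GPU probes the probe-key chunk it currently holds against its local
# HT partition, then forwards the chunk to its neighbour in the ring.
def ring_probe(ht_partitions, probe_chunks, owner_of):
    n = len(ht_partitions)
    chunks = list(probe_chunks)
    results = []
    for _ in range(n):                      # #GPU rounds
        for g, local in enumerate(ht_partitions):
            for key in chunks[g]:
                # A key can only match on the GPU that owns its bucket range.
                if owner_of(key) == g and key in local:
                    results.append((key, local[key]))
        # Rotate: each GPU passes its chunk to the next GPU in the ring.
        chunks = chunks[-1:] + chunks[:-1]
    return results
```

With two partitions, say `{2: 'a', 4: 'b'}` on GPU 0 for even keys and `{3: 'c'}` on GPU 1 for odd keys, and probe chunks `[[2, 3], [4, 5]]`, every probe key reaches its owning partition within two rounds.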

SLIDE 38

SCALING OF INNER JOIN ON DGX-2

parallel build of partitioned HT and parallel probe (ring exchange)

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

[Chart: inner join with parallel build of distributed HT (ring exchange): total, build, and probe runtime [ms] plus parallel efficiency (overall, build, probe) vs. #GPUs from 1 to 16]

SLIDE 39

SCALING OF INNER JOIN ON DGX-2

parallel build of partitioned HT – Memory Subsystem Metrics

[Chart: Unified Cache Hit Rate, L2 Cache Hit Rate, and random mem ops / coalesced mem ops per step vs. #GPUs from 2 to 16]

SLIDE 40

SCALING OF INNER JOIN

parallel probe of a partitioned HT (staged direct send) (round 0)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

In each round, every GPU sends the probe keys for one remote bucket range directly to the owning GPU (e.g. keys hashing to buckets N1…N2-1 go straight to GPU 1).

SLIDE 41

SCALING OF INNER JOIN

parallel probe of a partitioned HT (staged direct send) (round (k-1))

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU K: build rows BK…BK+1-1, probe rows PK…PK+1-1, HT buckets NK…NK+1-1
GPU 2K: build rows B2K…B2K+1-1, probe rows P2K…P2K+1-1, HT buckets N2K…N2K+1-1

In round k-1, probe keys hashing to buckets NK…NK+1-1 are sent directly to GPU K, and keys hashing to N2K…N2K+1-1 directly to GPU 2K.

SLIDE 42

SCALING OF INNER JOIN

parallel probe of a partitioned HT (staged direct send) (round #GPU)

GPU 0: build rows 0…B1-1, probe rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build rows B1…B2-1, probe rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build rows B#…B-1, probe rows P#…P-1, HT buckets N#…N-1

After round #GPU, every probe key has been processed on the GPU that owns its bucket range.

SLIDE 43

SCALING OF INNER JOIN ON DGX-2

parallel build of partitioned HT and parallel probe (staged direct send)

[Chart: total, build, and probe runtime [ms] plus parallel efficiency (overall, build, probe) vs. #GPUs from 1 to 16]

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

SLIDE 44

SCALING OF INNER JOIN ON DGX-2

replicated HT vs. partitioned HT (16 GPUs, total # rows = 671088640)

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

[Chart: runtime [ms] with replicated HT vs. partitioned HT, and relative speedup, as probe tbl size / build tbl size varies from 1 to 512]

SLIDE 45

REAL OLAP QUERIES

SLIDE 46

TPC-H BENCHMARK

SQL code for TPC-H Query 4 (a semi-join):

select
    o_orderpriority,
    count(o_orderkey) as order_count
from
    orders
where
    o_orderdate >= date '[DATE]'
    and o_orderdate < date '[DATE]' + interval '3' month
    and exists (
        select *
        from lineitem
        where l_orderkey = o_orderkey
        and l_commitdate < l_receiptdate
    )
group by
    o_orderpriority
order by
    o_orderpriority;

CPU execution breakdown: join 99%, group-by 1%

SLIDE 47

Q4: INPUT DATA

orders (1.5M rows per SF):

o_orderkey  o_orderdate  o_orderpriority
7           1996-01-10   2-HIGH
32          1995-07-16   2-HIGH
33          1993-10-27   3-MEDIUM
34          1998-07-21   3-MEDIUM

lineitem (6M rows per SF):

l_orderkey  l_commitdate  l_receiptdate
7
7
7
32          1995-10-07    1995-08-27   (>)
32          1995-08-20    1995-09-14   (<)
32          1995-10-01    1995-09-03   (>)
34
34

SLIDE 48

Q4 JOIN: BUILD

Example row (32, 1995-07-16, 2-HIGH), routed from GPU 0 to GPU 1.

orders, filtered by:
  • o_orderdate >= date '[DATE]' and
  • o_orderdate < date '[DATE]' + interval '3' month

Steps: filter (selectivity 3.8%), compute destination HT partition, push (o_orderkey, o_orderpriority) to the remote GPU, insert (o_orderkey, o_orderpriority) into the local HT partition.

SLIDE 49

Q4 JOIN: PROBE

Example row (32, 1995-08-20, 1995-09-14), routed from GPU 0 to GPU 1.

lineitem, filtered by l_commitdate < l_receiptdate

Steps: filter (selectivity 63%), compute destination HT partition, push l_orderkey to the remote GPU, probe against the local HT partition. On a match: remove the element from the HT (semi-join) and increment the o_orderpriority counter (group-by).
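The probe-side semi-join and group-by can be sketched as follows (assumed names; a single-partition Python stand-in for the per-GPU probe kernel):

```python
from collections import Counter

def q4_probe(ht, probe_keys):
    # ht: o_orderkey -> o_orderpriority (the filtered build side)
    # probe_keys: l_orderkey values that passed l_commitdate < l_receiptdate
    counts = Counter()
    for key in probe_keys:
        prio = ht.pop(key, None)  # remove on match: each order counted once (semi-join)
        if prio is not None:
            counts[prio] += 1     # group-by o_orderpriority
    return counts

# Order 32 has several qualifying lineitems but is counted only once:
print(q4_probe({7: '2-HIGH', 32: '2-HIGH'}, [32, 32, 33]))  # Counter({'2-HIGH': 1})
```

Removing the entry on the first match is what turns a plain join into the `exists`-style semi-join Q4 needs.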

SLIDE 50

TEST SETUP

Performance metrics: time, parallel efficiency, throughput (input data size / time)
Use 8B keys, 2B encoded dates, 1B encoded priority string

TPC-H Q4 SF1000:
  • All tables in CSV format: 1000GB
  • Input columns used in Q4: 89GB
  • GPU hash table (50% HT occupancy): 1.4GB

SLIDE 51

PERFORMANCE RESULTS ON DGX-2

Q4 SF1000, input distributed in GPU memory

[Chart: Q4 execution time (s) vs. # of GPUs from 1 to 16; 6M-row chunks]

SLIDE 52

PERFORMANCE RESULTS ON DGX-2

Q4 SF1000, input distributed in GPU memory

[Chart: Q4 parallel efficiency vs. # of GPUs from 1 to 16; 6M-row chunks]

SLIDE 53

DGX-2 PROFILE: INPUT IN GPU MEMORY

the main bottleneck is HT build (74% of the overall query time)

SLIDE 54

DGX-2 PROFILE: INPUT IN GPU MEMORY

CUDA API overhead (kernel launches, recording events)

SLIDE 55

OPTIMIZED CHUNK SIZE ON DGX-2

Q4 SF1000, input distributed in GPU memory

[Chart: Q4 execution time (s) vs. # of GPUs from 1 to 16; 6M-row chunks vs. 1 chunk per GPU]

SLIDE 56

OPTIMIZED CHUNK SIZE ON DGX-2

Q4 SF1000, input distributed in GPU memory

[Chart: Q4 parallel efficiency vs. # of GPUs from 1 to 16; 6M-row chunks vs. 1 chunk per GPU]

SLIDE 57

PERFORMANCE RESULTS ON DGX-2

Q4 SF1000, input in system memory

[Bar chart: throughput (GB/s) for single V100, replicated HT (redundant build, parallel probe), replicated HT (cooperative build, parallel probe), and partitioned HT (cooperative build, parallel probe); reference lines at PCIe3 x16 and 4x PCIe3 x16 bandwidth]

SLIDE 58

DGX-2 PROFILE: INPUT IN CPU MEMORY

the main bottleneck is HT probe (82% of the overall query time)

SLIDE 59

IS THIS THE BEST WE CAN DO?

lineitem columns from the input-data example: l_orderkey (8B), l_commitdate (2B), l_receiptdate (2B)

SLIDE 60

IS THIS THE BEST WE CAN DO?

The same lineitem columns; the filters can be executed on the CPU.

SLIDE 61

IS THIS THE BEST WE CAN DO?

The same lineitem columns; l_orderkey (8B) can be compressed to <8B per key.

SLIDE 62

IS THIS THE BEST WE CAN DO?

The same lineitem columns; the dates (2B each) can be compressed to <2B per date.

SLIDE 63

IS THIS THE BEST WE CAN DO?

The same lineitem columns; the dates (2B each) can be compressed to <2B per date.

SLIDE 64

RLE-DELTA-RLE COMPRESSION

Pipeline: Uncompressed -> RLE (runs, vals) -> Delta -> RLE (runs, vals) -> bit-packing -> Compressed

Example: uncompressed 1 1 2 2 3 3 -> RLE: runs = 2,2,2, vals = 1,2,3 -> Delta: runs -> 2,0,0, vals -> 1,1,1 -> a second RLE collapses the near-constant delta streams before bit-packing.

1 c.f.“Breaking the Speed of Interconnect with Compression for Database Applications”, GTC Silicon Valley 2018, Session ID S8417

http://on-demand-gtc.gputechconf.com/gtc-quicklink/7LVQs
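The RLE, Delta, RLE cascade (bit-packing omitted) can be sketched as follows (illustrative helper names, not the talk's actual kernels):

```python
def rle(xs):
    # Run-length encode: [1,1,2,2,3,3] -> runs [2,2,2], vals [1,2,3]
    runs, vals = [], []
    for x in xs:
        if vals and vals[-1] == x:
            runs[-1] += 1
        else:
            runs.append(1)
            vals.append(x)
    return runs, vals

def delta(xs):
    # Keep the first value, then store successive differences.
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

keys = [1, 1, 2, 2, 3, 3]                  # e.g. a sorted l_orderkey run
runs, vals = rle(keys)                     # ([2, 2, 2], [1, 2, 3])
d_runs, d_vals = delta(runs), delta(vals)  # ([2, 0, 0], [1, 1, 1])
# The delta streams are nearly constant, so a second RLE collapses them
# before bit-packing: rle(d_vals) == ([3], [1])
```

Sorted, dense key columns are exactly the case where this cascade shines, which is why it pays off for l_orderkey.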

SLIDE 65

APPLYING COMPRESSION TO TPC-H Q4

Use RLE + Delta + RLE + bit-packing
Compression rate for SF1K l_orderkey: 14x
Multiple streams per GPU
Pipeline decompress & probe kernels

l_orderkey decompression throughput (GB/s), reading from system memory:
Uncompressed (8B): 12, RLE+bp: 62, RLE+Delta+RLE+bp: 113

SLIDE 66

TPC-H SF1000 Q4 RESULTS

*CPU-only results from: http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=117111701

Query time (s), lower is better:

Best published CPU-only results* (2x Intel Xeon Platinum 8180): 3.2
DGX-2, GPU HT, CPU input w/o compression: 1.8
DGX-2, GPU HT, CPU input with compression: 1.0
DGX-2, GPU HT, GPU input: 0.06

SLIDE 67

TAKEAWAY

  • 1. Joins are the key bottleneck in OLAP (TPC-H Q4: join 99%, group-by 1%)
  • 2. Multi-GPU joins improve performance and enable larger workloads
  • 3. Speed-ups on real analytical queries

DGX-2 can run TPC-H Q4 SF1K in 1 second! (input data in system memory)
If the columns are preloaded to GPU memory, Q4 time goes down to just 60ms.

Query time (s):
Best published CPU-only results* (2x Intel Xeon Platinum 8180): 3.2
DGX-2, GPU HT, CPU input w/o compression: 1.8
DGX-2, GPU HT, CPU input with compression: 1.0
DGX-2, GPU HT, GPU input: 0.06

SLIDE 68