gpu acceleration for olap
play

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay - PowerPoint PPT Presentation

GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018 A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select


  1. GPU ACCELERATION FOR OLAP Tim Kaldewey, Jiri Kraus, Nikolay Sakharnykh 03/26/2018

  2. A TYPICAL ANALYTICS QUERY From a business question to SQL Business question (TPC-H query 4) SQL Determines how well the order priority select o_orderpriority, system is working and gives an assessment count(o_orderkey) as order_count, of customer satisfaction from orders Counts the number of orders ordered in a where given quarter of a given year in which at o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month and least one lineitem was received by the exists ( select * from lineitem customer later than its committed date. where l_orderkey = o_orderkey and The query lists the count of such orders for l_commitdate < l_receiptdate) each order priority sorted in ascending group by o_orderpriority, priority order order by o_orderpriority; Examples: DATE = 1/1/1993, 1/4/1993, … 2

  3. A TYPICAL ANALYTICS QUERY From SQL to Database Operators Database Operators SQL select o_orderpriority, aggregate count(o_orderkey) as order_count, from orders where o_orderdate >= date '[DATE]' and predicate (filter) o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem join where l_orderkey = o_orderkey and predicate (filter) l_commitdate < l_receiptdate) group by aggregate o_orderpriority, order by sort o_orderpriority; 3

  4. A TYPICAL ANALYTICS QUERY Joins are implicit in a business question SQL Database Operators Business question Counts the number of select orders ordered in a o_orderpriority, aggregate given quarter of a count(o_orderkey) as order_count, from given year in which at orders least one lineitem was where received by the predicate (filter) o_orderdate >= date '[DATE]' and customer later than its o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem committed date. The join where l_orderkey = o_orderkey and query lists the count of predicate (filter) l_commitdate < l_receiptdate) such orders for each aggregate group by order priority sorted in o_orderpriority, order by ascending priority sort o_orderpriority; order 4

  5. TPC-H SCHEMA part (p_) PARTKEY customer (c_) NAME order (o_) lineitem (l_) CUSTKEY MFGR ORDERKEY ORDERKEY NAME CATEGORY LINENUMBER CUSTKEY ADDRESS BRAND PARTKEY ORDERDATE CITY … SUPPKEY ORDPRIORITY … COMMITDATE ORDERSTATUS supplier (s_) … RECEIPTDATE SUPPKEY … NAME … ADDRESS nation (n_) CITY NATIONKEY NATIONKEY NAME … … 5

  6. WHERE DOES TIME GO? TPC-H query 4 @SF10 = 10GB data warehouse CPU execution breakdown select o_orderpriority, join group-by count(o_orderkey) as order_count, 1% from orders where o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate) group by o_orderpriority, order by o_orderpriority; 99% 18/22 TPC-H Queries involve Join & are the longest running ones 6

  7. RELATIONAL JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Foreign Key 1 after applying predicate “ l_commitdate < l_receiptdate ” 7 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ”

  8. HASH JOIN Join Results Lineitem 1 Order 2 l_orderkey o_orderkey o_orderpriority o_orderkey o_orderpriority 23 11 1 23 5 = 14 23 5 11 1 56 27 2 27 2 11 29 4 23 5 39 27 Payload Primary Key 23 Build hash table Foreign Key = Probe inputs 1 after applying predicate “ l_commitdate < l_receiptdate ” 8 2 after applying predicates “ o_orderdate >= date '[DATE]’ and o_orderdate < date '[DATE]' + interval '3' month ”

  9. HASH JOIN General approach – including payload(s) 1. Build a hash table Build table Hash table k 1 ,p 1 k 1 ,p 1 Scan k 2, p 2 k 2 ,p 2 1. Compute h(k) – k is a primary (unique) key 2. Insert k with payload p into hashed location 3. If occupied, insert into next free one 9

  10. HASH JOIN General approach – including payload(s) 1. Build a hash table 2. Probe the hash table Build table Hash table Probe table Hash table k 1 ,p 1 k 1 ,p 1 k 1 ,p 1 fk 1 Scan Scan k 2, p 2 fk 2 k 2 ,p 2 k 3 ,p 3 k 2 ,p 2 k 4 ,p 4 k 5 ,p 5 k 6 ,p 6 k 7 ,p 7 1. Compute h(k) – k is a primary (unique) key 1. Compute h(fk) – fk is foreign key 2. Insert k with payload p into hashed location 2. Compare fk and key k in the hash table 3. If occupied, insert into next free one 3. If there is a match store the result 10

  11. HASH JOIN General approach – including payload(s) 1. Build a hash table 2. Probe the hash table Build table Hash table Probe table Hash table k 1 ,p 1 k 1 ,p 1 k 1 ,p 1 fk 1 Scan Scan k 2, p 2 fk 2 k 2 ,p 2 k 3 ,p 3 k 2 ,p 2 k 4 ,p 4 k 5 ,p 5 k 6 ,p 6 k 7 ,p 7 1. Compute h(k) – k is a primary (unique) key 1. Compute h(fk) – fk is foreign key 2. Insert k with payload p into hashed location 2. Compare fk and key k in the hash table 3. If occupied, insert into next free one 3. If there is a match store the result Build and Probe both produce a random data access pattern! 11

  12. RANDOM ACCESS PERFORMANCE Intel Xeon Gold 6140 (Skylake) DDR4: 6 channels, 64-bit per channel, 2666MT/s, tFAW=30ns* Peak memory bandwidth 120GB/s Random 8B access 6GB/s NVIDIA Tesla V100 (Volta) Peak memory bandwidth 900GB/s Random 8B access 60GB/s 12 *DDR4 specs: https://www.micron.com/~/media/documents/products/data-sheet/dram/ddr4/4gb_ddr4_sdram.pdf

  13. HETEROGENEOUS SYSTEM Faster access 16GB 16GB HBM HBM … Multiple TBs DDR Larger capacity 13

  14. HETEROGENEOUS SYSTEM Data Placement & Movement Faster access Hash Table(s) Key Payload 16GB 16GB 23 5 HBM HBM 27 2 Build & Probe … Multiple TBs DDR DB Larger capacity 14

  15. INTERCONNECT SPEEDS PCIe3: 16GB/s • • NVLINK1: 20GB/s per link, up to 2 links between CPU/GPU = 40GB/s NVLINK2: 25GB/s per link, up to 3 links = 75GB/s • • Further increase throughput via compression: see S8417 tomorrow at 2pm* GPU joins (HT probing): PCIe/NVLINK1: GPU random access bandwidth >> CPU-GPU bandwidth • • NVLINK2 could be limited by GPU random access performance 15 *S8417 - Breaking the Speed of Interconnect with Compression for Database Applications – Tuesday, Mar 27, 2:00pm – Room 210F

  16. JOIN OPERATOR ACCELERATION 16

  17. IMPLEMENTING HASH JOIN concurrent_unordered_map Hash table in Unified Memory can be accessed on the host and template < typename Key , typename Element , device Key unused_key , Element unused_element , typename Hasher = default_hash < Key >, typename Equality = equal_to < Key >, typename Allocator = managed_allocator< thrust :: pair < Key , Element > > > class concurrent_unordered_map : public managed { Features: public : //omitting typedef* Construction on CPU explicit concurrent_unordered_map ( size_type n , • const Hasher & hf , const Equality & eql , • Works on CPU and GPU const allocator_type & a ); Concurrent insert’s • __host__ __device__ iterator begin (); • Concurrent find __host__ __device__ iterator end (); Concurrent insert and find* • __host__ __device__ iterator insert ( const value_type & x ); No concurrent CPU and GPU • __host__ __device__ const_iterator find ( const key_type & k ) const ; insert (currently) void prefetch ( const int dev_id ); }; *No visibility guarantees 17

  18. IMPLEMENTING HASH JOIN 1. Build a hash table with a concurrent_unordered_map template < typename map_type > __global__ void build_hash_tbl ( map_type * const map , const typename map_type :: key_type * const build_tbl , const typename map_type :: size_type build_tbl_size ) { const typename map_type :: mapped_type i = tIdx . x + bIdx . x * bDim . x ; if ( i < build_tbl_size ) { map -> insert ( thrust :: make_pair ( build_tbl [ i ], i ) ); } } 18

  19. IMPLEMENTING HASH JOIN Concurrent Inserts with atomicCAS iterator insert ( const value_type & x ) { size_type key_hash = hf ( x . first ); Key Key Value Value size_type ht_idx = key_hash % ht_size ; unused_key unused_key unused_element unused_element value_type * it = 0 ; while ( 0 == it ) { unused_key unused_key unused_element unused_element value_type * tmp_it = ht_values + ht_idx ; unused_key unused_key unused_element unused_element const key_type old_key = atomicCAS ( &( tmp_it -> first ), unused_key , x . first ); 7287984 7287984 9024 9024 if ( equal ( unused_key , old_key ) || unused_key x.first=75 x.second=875 unused_element equal ( x.first , old_key ) ) { ( ht_values + ht_idx )-> second = x . second ; 283408 283408 2309480 2309480 it = tmp_it ; } unused_key unused_key unused_element unused_element ht_idx = ( ht_idx + 1 )% ht_size ; unused_key unused_key unused_element unused_element } return iterator ( ht_values , ht_values + ht_size , it ); } 19

Recommend


More recommend