how to run sql queries on tbs of data using gpus
play

How to run SQL queries on TBs of data using GPUs Jake Wheat Lead - PowerPoint PPT Presentation

How to run SQL queries on TBs of data using GPUs Jake Wheat Lead Architect, SQream Technologies @sqreamtech SIL7138 1 How to run SQL queries on TBs of data using GPUs 1. A toy SQL query engine 2. Support wide range of SQL queries 3.


  1. How to run SQL queries on TBs of data using GPUs Jake Wheat Lead Architect, SQream Technologies @sqreamtech SIL7138 1

  2. How to run SQL queries on TBs of data using GPUs 1. A toy SQL query engine 2. Support wide range of SQL queries 3. Support larger data 4. Optimise @sqreamtech SIL7138 2

  3. Hypothetical set of physical operator primitives Exam ample SQ SQL Physical cal O Operato ator Implement ntati ation select select a+b, c * 5 from t thrust::transform (a.k.a project/extend/rename) stream aggregate select a, count(*), sum(b), thrust::reduce_by_key avg(b) from t group by a filter select a, b from t where a thrust::remove_if > 0.5 stream distinct select distinct a from t thrust::unique sort select a, b, c, d from t thrust::sort order by a,b union all select * from t union all - select * from u simple implementation: sort merge join (smj) select * from t thrust::upper_bounds, lower_bounds, inner join u using (a) unnest, gather @sqreamtech SIL7138 3

  4. A more complete engine • non indexed nested loop join (ninlj) • not matching • outer join • stream union distinct • distinct aggregates • grouping sets • window functions @sqreamtech SIL7138 4

  5. Running on bigger data @sqreamtech SIL7138 5

  6. Queries on data which doesn't fit in the GPU memory Chunk the data GPU 4 MB Result 1 TB Didn’t fit 1 TB 4 MB Result at a time @sqreamtech SIL7138 6

  7. Queries on data which doesn't fit in the GPU memory Use an external sorting algorithm Extend the idea to other operators too, for instance: NINLJ • SMJ • window functions • distinct aggregates (select a, count(distinct b), count(b) from t group by a) • outer joins • @sqreamtech SIL7138 7

  8. Queries on data which doesn't fit in the GPU memory Use external sorting algorithms and a variety of spools GPU Sort GPU / CPU spool + external sort 4 MB at a time 1 TB @sqreamtech SIL7138 8

  9. Optimising: some non-GPU specific ideas • Use good benchmarking and profiling tools • Take advantage of columnar - don't read columns which aren't needed • Compressed data – helps with both disk I/O and PCI bus I/O • Make sure we use good I/O patterns: big reads and writes, work with the Linux FS cache effectively • Use large host memory buffers • Use AST optimisations (e.g. push predicates closer to the tablescans) • Use a cost based optimizer @sqreamtech SIL7138 9

  10. Some GPU-specific ideas • Use CUDA specific profiling tools in addition to your regular C++ ones (or at least have a way to enable cudaDeviceSynchronize calls all over your code) • GPU task/queue • Combine tasks on the GPU • Use large chunks • Rechunking • Reduce or avoid host ↔ device transfers • Optimise GPU code @sqreamtech SIL7138 10

  11. GPU task/queue 1. Allocate all the GPU memory we will need 2. Upload the data to the GPU 3. Run some kernels on this data 4. Download the results to host 5. Release the allocated GPU memory @sqreamtech SIL7138 11

  12. GPU task/queue Benefits • Can help support concurrency in a single query (to hide PCI transfer latency) • Can help support concurrent queries on a single GPU with some basic fairness property (long running query won't starve out a short running query) • GPU memory usage is very predictable • No inter-task co-ordination needed • No chance of deadlock on GPU resources • Can be simply extended to distribute tasks over multiple GPUs @sqreamtech SIL7138 12

  13. Short GPU task concept, concurrency GPU 1 GPU worker Host Worker GPU task Host Worker GPU worker queue …. Host Worker GPU 2 Host Worker GPU worker Host Worker GPU worker …. @sqreamtech SIL7138 13

  14. Combine tasks on the GPU where possible select a, b+c as d from t where b > 5 order by a Logical TableScan Transform Remove If Sort by Sort Merge a,b,c d:=b+c b>5 a Direct TableScan Transform Remove If Sort by Sort To device To host To device To host To device To host a,b,c d:=b+c b>5 a Merge Combined TableScan Transform Remove If Sort by To device To host Sort Merge a,b,c d:=b+c b>5 a @sqreamtech SIL7138 14

  15. Use larger chunks of data 1. transfer 10,000 rows to device 1. transfer 10,000,000 rows to device 2. transform d := b + c 2. transform d := b + c 3. remove_if b > 5 3. remove_if b > 5 4. sort by a 4. sort by a 5. transfer results to host 5. transfer results to host @sqreamtech SIL7138 15

  16. Rechunking Table scans Optimised disk IO sizes often much bigger than the GPU can handle IO reads GPU process Chunk size Small slow good Big fast too big to fit split Big fast good @sqreamtech SIL7138 16

  17. Rechunking Output of remove_if, join, reduce/reduce_by_key , etc. If the output of these is very small amounts of rows, collecting these rows back into big chunks can improve the performance of later operations (e.g. sort). @sqreamtech SIL7138 17

  18. Rechunking input remove if sort High selectivity fast input remove if sort Low selectivity slow input remove if rechunk sort fast @sqreamtech SIL7138 18

  19. Rechunking Reducing PCI transfer amounts in NINLJ for each chunk in table A: for each chunk in table B: load A,B to the GPU, join them Table A size: 10,000,000,000 rows Table B size: 10,000,000 rows At a chunk size of 1 million rows you will upload 110 billion rows to the device. If you increase it to 20 million rows, you will upload only 15 billion rows to the device 110B / 15B = Potentially more than 7x faster @sqreamtech SIL7138 19

  20. Rechunking Reducing PCI transfer amounts in NINLJ 9 chunks Loop and load 3 chunks Loop and load 9 times 3 times @sqreamtech SIL7138 20

  21. Reduce / avoid HOST ↔ DEVICE transfers • NINLJ: keep the small table on the GPU for the whole join • Chunk skipping - data is inserted to the database ordered by a record timestamp. The storage layer preserves this order (or inserts the data in sorted order) Typical use-case is 24 months of data in the table, but queries only want to summarize a particular day, week or month. @sqreamtech SIL7138 21

  22. Optimising Kernels Three example bottlenecks from SQream: • reduce_by_key • multikey sort • internals of the join @sqreamtech SIL7138 22

  23. Strings are very difficult to deal with • compression • can be large size per record non big data specific GPU issues: • variable length data • collations • unicode, ICU doesn't run on GPUs Good solutions: we're still looking @sqreamtech SIL7138 23

  24. Find out more about SQream’s high performance GPU-driven database software  sqream.com @sqreamtech SIL7138 24

Recommend


More recommend