
Efficient Top-K Query Processing on Massively Parallel Hardware



  1. Efficient Top-K Query Processing on Massively Parallel Hardware. Anil Shanbhag, Holger Pirk, Sam Madden

  2. CPU: 16-24 cores, 60 GB/s memory bandwidth, 128-256 GB main memory. GPU: ~5000 cores, 250-900 GB/s memory bandwidth, 12-32 GB GPU memory. Link between main memory and GPU memory: 16-40 GB/s.

  3. Top-K query: SELECT id FROM tweets WHERE tweet_time ∈ [X,Y] ORDER BY retweet_count + 0.5*likes_count DESC LIMIT K. Typical K is 5-100.

  4. Top-K. Classic sequential algorithm: use a min-heap of size k to maintain the top-k items. (Figure: heap example with the values 7, 15, 20, 21, 32, 50, 40.)
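
The classic sequential approach on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `top_k_heap` is hypothetical, and the input reuses the numbers visible in the slide's figure.

```python
import heapq

def top_k_heap(items, k):
    """Return the k largest items, largest first, using a min-heap of size k."""
    if k <= 0:
        return []
    heap = []  # min-heap; heap[0] is the smallest of the current top-k
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the weakest member of the current top-k: swap it in.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k_heap([15, 20, 21, 32, 50, 40, 7], 3))  # [50, 40, 32]
```

Each element costs at most one O(log k) heap update, and the branch on `x > heap[0]` is data dependent.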

  5. Partition and Merge. On a multi-core CPU: partition the data across the cores, let each core find the top-k of its partition, then merge the results.
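
A rough sketch of partition-and-merge, under stated assumptions: the name `partitioned_top_k` and the `num_partitions` parameter are hypothetical, and `heapq.nlargest` stands in for the per-core heap. Python threads do not give real CPU parallelism here, so the sketch only shows the structure of the computation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partitioned_top_k(items, k, num_partitions=4):
    """Top-k by partitioning, per-partition top-k, and a final merge."""
    n = len(items)
    size = max(1, -(-n // num_partitions))  # ceiling division
    chunks = [items[i:i + size] for i in range(0, n, size)]
    # Each worker computes the top-k of its own partition independently.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial = list(pool.map(lambda chunk: heapq.nlargest(k, chunk), chunks))
    # Merge: the global top-k must be among the per-partition winners.
    return heapq.nlargest(k, (x for part in partial for x in part))

print(partitioned_top_k([15, 20, 21, 32, 50, 40, 7, 3], 3, num_partitions=2))
```

The merge step only sees at most `k * num_partitions` candidates, which is why it is cheap when k is small.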

  6. On GPU. Partition and merge does not work well with the GPU execution model. Problems (figure: warp of threads):
     • Significant thread divergence
     • Maintaining a heap of size k per thread limits performance

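  7. Intuition. Sequential to parallel: for sorting, heap sort corresponds to bitonic sort; for top-k, the priority queue (heap) corresponds to ???, which is where Bitonic Top-K fits. Other approaches listed: sort + top-k, heap per thread, radix-select, bucket-select.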

  8. Bitonic Top-K

  9. Bitonic Sequence. Two monotonic sequences: a sequence $S = \langle a_0, a_1, a_2, \dots, a_{n-1} \rangle$ such that
     • $a_0 \le a_1 \le \dots \le a_k$
     • $a_{k+1} \ge a_{k+2} \ge \dots \ge a_{n-1}$

  10. Bitonic Merge. For a bitonic sequence $\langle a_0, a_1, \dots, a_{n-1} \rangle$, define $S_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \dots, \min(a_{n/2-1}, a_{n-1}) \rangle$ and $S_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \dots, \max(a_{n/2-1}, a_{n-1}) \rangle$. $S_1$ and $S_2$ are both bitonic, and $S_1 \le S_2$: no element of $S_1$ is larger than any element of $S_2$. Applying the split recursively to $S_1$ and $S_2$ sorts the entire sequence in $\log(n)$ rounds.
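
The split into $S_1$ and $S_2$ and its recursive application can be sketched as below. The function names are hypothetical, and the sketch assumes the input is bitonic and its length is a power of two.

```python
def bitonic_split(seq):
    """Split a bitonic sequence into S1 (element-wise mins) and S2 (maxes)."""
    half = len(seq) // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    return s1, s2

def bitonic_merge(seq):
    """Sort a bitonic sequence (power-of-two length) in ascending order."""
    if len(seq) <= 1:
        return list(seq)
    s1, s2 = bitonic_split(seq)
    # Both halves are bitonic again, and nothing in s1 exceeds anything in s2.
    return bitonic_merge(s1) + bitonic_merge(s2)

s1, s2 = bitonic_split([1, 4, 6, 8, 7, 5, 3, 2])
print(s1, s2)                                   # [1, 4, 3, 2] [7, 5, 6, 8]
print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every min/max pair in one round is independent of the others, which is what makes the merge attractive for data-parallel hardware.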

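  11. Bitonic Sort. Complexity: $O(n (\log n)^2)$. (Figure: sorting network on 8 elements, indices 0-7, with phases 1, 2, 3; the steps within the phases use comparison distances 1, then 2, 1, then 4, 2, 1.)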
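
A compact sketch of the full network, written as the usual textbook iterative form rather than anything taken from the slides. The name `bitonic_sort` is hypothetical; the outer loop corresponds to the phases and the inner loop to the steps with distances such as 4, 2, 1.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list ascending with a bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2  # phase: length of the sorted runs this phase produces
    while size <= n:
        stride = size // 2  # step: distance between compared elements
        while stride >= 1:
            for i in range(n):
                j = i ^ stride
                if j > i:
                    ascending = (i & size) == 0
                    if ascending and a[i] > a[j]:
                        a[i], a[j] = a[j], a[i]
                    elif not ascending and a[i] < a[j]:
                        a[i], a[j] = a[j], a[i]
            stride //= 2
        size *= 2
    return a

print(bitonic_sort([7, 3, 6, 0, 5, 1, 4, 2]))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

There are log n phases with up to log n steps each, and every step performs n/2 comparisons, which gives the stated complexity.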

  12. Bitonic Top-K. Complexity: $O(n (\log k)^2)$. Starting from an unsorted sequence:
     • Phase 1, local sort: produce sorted sequences of length k.
     • Phase 2, merge: merge neighboring sorted sequences of length k to select the largest k elements (a bitonic sequence).
     • Phase 3, rebuild: sort the bitonic sequence of length k.
     Merge and rebuild alternate until the list size equals k; the result is the top-k. (Figure: finding the top-4 in 16 elements, indices 0-15, with P1 local sort followed by alternating P2 merge and P3 rebuild.)
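
The three phases can be sketched sequentially as follows. This is an illustration under assumptions, not the authors' GPU kernel: the names `merge_bitonic` and `bitonic_top_k` are hypothetical, `sorted` stands in for the local bitonic sort of phase 1, and both n and k are assumed to be powers of two.

```python
def merge_bitonic(seq, descending=False):
    """Sort a bitonic sequence of power-of-two length (the rebuild step)."""
    a = list(seq)
    stride = len(a) // 2
    while stride >= 1:
        for i in range(len(a)):
            j = i ^ stride
            if j > i:
                out_of_order = a[i] < a[j] if descending else a[i] > a[j]
                if out_of_order:
                    a[i], a[j] = a[j], a[i]
        stride //= 2
    return a

def bitonic_top_k(values, k):
    """Return the k largest values, largest first."""
    n = len(values)
    if n & (n - 1) or k <= 0 or k & (k - 1) or k > n:
        raise ValueError("n and k must be powers of two with k <= n")
    # Phase 1: sorted runs of length k, alternating ascending/descending
    # so that each neighboring pair forms a bitonic sequence.
    runs = [sorted(values[i:i + k], reverse=(r % 2 == 1))
            for r, i in enumerate(range(0, n, k))]
    while len(runs) > 1:
        # Phase 2: element-wise max of neighbors keeps the larger half.
        merged = [[max(x, y) for x, y in zip(runs[i], runs[i + 1])]
                  for i in range(0, len(runs), 2)]
        # Phase 3: each result is bitonic; re-sort it, alternating direction.
        runs = [merge_bitonic(m, descending=(r % 2 == 1))
                for r, m in enumerate(merged)]
    return runs[0][::-1]

data = [15, 3, 9, 1, 12, 7, 0, 14, 5, 11, 2, 8, 13, 6, 10, 4]
print(bitonic_top_k(data, 4))  # [15, 14, 13, 12]
```

Each merge halves the data, so the total work is dominated by the first local sort and the first few merge and rebuild rounds.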

  13. On the GPU. Simplest way to partition into kernels: each column has a kernel invocation; each thread does 1 comparison; n/2 comparisons needed => n/2 threads launched. (Chart: time to find top-32 in a sequence of size 2^29. Naive: 521 ms; Sort: 130 ms; Final: 14.5 ms; One Pass: 10 ms.)

  14. Optimizations

  15. Optimizations. (Figure: GPU memory hierarchy. On chip: each SM, from SM-1 to SM-N, has registers and L1/shared memory (SMEM), with shared memory bandwidth up to 3.5 TBps, plus an L2 cache. Off chip: global memory at 260 GBps.)

  16. Optimization 1: Using Shared Memory. For a thread block with T threads, load 2T elements into shared memory. (Figure legend: global memory access, shared memory access. Chart: time to find top-32 in a sequence of size 2^29.)

  17. Optimization 2: Combining Phases. Instead of loading 2T, let's load 8T elements and combine the 5 phases. Shared memory bandwidth bound. (Figure legend: global memory access, shared memory access.)

  18. Optimization 3: Combining Steps. Three steps at a time instead of one step at a time. (Figure: network on indices 0-15 with step distances 4, 2, 1, shown once per variant; legend: shared memory access.)

  19. Optimization 4: Padding. (Figure: addresses 0-15 laid out across memory banks 0-7, before and after padding. After padding, unused cells, marked X, are inserted into the layout; the legend distinguishes thread accesses from unused cells.)
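
To illustrate why padding helps, here is a small sketch of the generic shared-memory padding trick. It is an assumption that the slide uses this exact layout: the 8 banks match the figure (NVIDIA GPUs typically have 32), and the helper names and the one-cell-per-row padding rule are hypothetical.

```python
from collections import Counter

NUM_BANKS = 8  # as drawn on the slide; an assumption for this sketch

def bank(address):
    """Bank that serves a shared-memory address."""
    return address % NUM_BANKS

def padded(address):
    """Shift an address by one unused cell per NUM_BANKS elements."""
    return address + address // NUM_BANKS

def worst_conflict(addresses):
    """Largest number of simultaneous accesses hitting the same bank."""
    return max(Counter(bank(a) for a in addresses).values())

# Eight threads access addresses with a stride of 8 elements.
addresses = [t * 8 for t in range(8)]
print(worst_conflict(addresses))                       # 8
print(worst_conflict([padded(a) for a in addresses]))  # 1
```

Without padding all eight accesses land in bank 0 and are serialized; with padding they spread across all eight banks at the cost of a few unused cells.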

  20. Optimization 5: Chunk Permutation. Before: 17.8 ms; after: 16 ms. (Figure: network on indices 0-15 with steps 1, 2, 1, 4, 2, 1, showing chunks 0-7 before and after permutation, with padded cells marked.)

  21. Evaluation

  22. Setup. CPU: Intel i7, 16 cores, 60 GB/s memory bandwidth, 64 GB main memory. GPU: Titan X, 260 GB/s memory bandwidth, 12 GB GPU memory. Link between main memory and GPU memory: 16 GB/s.

  23. Varying K. For 2^29 (1/2 billion) floats from U(0,1).

  24. Varying Distributions

  25. Integration. Dataset: 250 million tweets, May 2017. Query: SELECT id FROM tweets WHERE tweet_time < X ORDER BY retweet_count DESC LIMIT 50. Result: 4.5x faster.

  26. Conclusion. Data analytics on GPUs is increasingly common, and Top-K on GPU is non-trivial. Bitonic Top-K: a novel Top-K algorithm for GPU:
     ◦ Distribution independent
     ◦ Best performing for K <= 256
     Integrated into a real database: >4x performance improvement.
