Efficient Top-K Query Processing on Massively Parallel Hardware. Anil Shanbhag, Holger Pirk, Sam Madden. 1
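CPU vs. GPU. CPU: 16-24 cores, 60 GB/s bandwidth to 128-256 GB of main memory. GPU: ~5000 cores, 250-900 GB/s bandwidth to 12-32 GB of GPU memory. Link between main memory and GPU memory: 16-40 GB/s. 2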
Top-K: SELECT id FROM tweets WHERE tweet_time ∈ [X,Y] ORDER BY retweet_count + 0.5*likes_count DESC LIMIT K. Typical K is 5-100. 3
Top-K: Classic sequential algorithm: use a min-heap of size k to maintain the top-k items. (Figure: example min-heap holding 7, 15, 20, 21, 32, 50, 40.) 4
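The heap-based approach on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `top_k_heap` is hypothetical, and the sample values are the ones from the slide's figure.

```python
import heapq

def top_k_heap(items, k):
    """Return the k largest items in descending order using a size-k min-heap."""
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current top-k candidates
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the weakest candidate, so swap it in
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k_heap([7, 15, 20, 21, 32, 50, 40], 3))  # [50, 40, 32]
```

Each item costs at most O(log k) work, so the whole scan is O(n log k), but every update depends on the heap state left by the previous one.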
Partition and Merge: On a multi-core CPU: partition the data across the cores, let each core compute a local top-k, then merge the results. 5
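A rough sketch of the partition-and-merge structure follows. It is an assumption-laden illustration: the name `partitioned_top_k` and the `num_parts` parameter are hypothetical, and Python threads only stand in for cores here (they show the structure, not a real speedup).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partitioned_top_k(data, k, num_parts=4):
    """Top-k via partitioning: a local top-k per partition, then one merge."""
    chunk = max(1, -(-len(data) // num_parts))  # ceiling division
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Each worker computes the top-k of its own partition independently.
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        local = list(pool.map(lambda part: heapq.nlargest(k, part), parts))
    # Merge step: the global top-k must be among the local top-k results.
    return heapq.nlargest(k, (x for part in local for x in part))

data = [(i * 37) % 101 for i in range(101)]  # a permutation of 0..100
print(partitioned_top_k(data, 5))  # [100, 99, 98, 97, 96]
```

The merge only has to look at `num_parts * k` candidates, which is why this works well when the number of partitions is small.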
On GPU: Does not work well on the GPU execution model. Problems (within a warp of threads): • Significant thread divergence • Maintaining a heap of size k per thread limits performance 6
Intuition: Sort: Heap Sort (sequential), Bitonic Sort (parallel). Top-K: Priority Queue (sequential), ??? (parallel); Bitonic Top-K fills this gap. Also listed: Sort + Top-K, Heap Per-Thread, Radix-Select, Bucket-Select. 7
Bitonic Top-K 8
Bitonic Sequence: Two monotonic sequences. A sequence $S = \langle a_0, a_1, a_2, \ldots, a_{n-1} \rangle$ such that • $a_0 \le a_1 \le \cdots \le a_k$ • $a_{k+1} \ge a_{k+2} \ge \cdots \ge a_{n-1}$ 9
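The definition above can be checked mechanically. The sketch below tests the rise-then-fall form given on the slide; the helper name `is_bitonic` is hypothetical, and it deliberately does not accept rotated (fall-then-rise) sequences.

```python
def is_bitonic(seq):
    """True if seq is non-decreasing up to some index and non-increasing after it."""
    n = len(seq)
    i = 0
    # Climb the non-decreasing prefix as far as it goes.
    while i + 1 < n and seq[i] <= seq[i + 1]:
        i += 1
    # Descend the non-increasing suffix.
    while i + 1 < n and seq[i] >= seq[i + 1]:
        i += 1
    # Bitonic exactly when the two walks consumed the whole sequence.
    return i >= n - 1

print(is_bitonic([1, 3, 5, 4, 2]))  # True
print(is_bitonic([1, 3, 2, 4]))     # False
```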
Bitonic Merge: Given a bitonic sequence, form $S_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \ldots, \min(a_{n/2-1}, a_{n-1}) \rangle$ and $S_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \ldots, \max(a_{n/2-1}, a_{n-1}) \rangle$. $S_1$ and $S_2$ are both bitonic, and $S_1 \le S_2$: every element in $S_1$ is no larger than any element of $S_2$. Apply recursively on $S_1$ and $S_2$ to sort the entire sequence in $\log(n)$ rounds. 10
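The split into the two halves and the recursive merge can be sketched as follows. This is a sequential illustration with hypothetical names (`bitonic_split`, `bitonic_merge`); it assumes the input is bitonic and its length is a power of two.

```python
def bitonic_split(seq):
    """Split a bitonic sequence into S1 (element-wise mins) and S2 (element-wise maxes)."""
    half = len(seq) // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    return s1, s2

def bitonic_merge(seq):
    """Sort a bitonic sequence (length a power of two) in ascending order."""
    if len(seq) <= 1:
        return list(seq)
    s1, s2 = bitonic_split(seq)
    # Everything in s1 is <= everything in s2, so the halves sort independently.
    return bitonic_merge(s1) + bitonic_merge(s2)

s1, s2 = bitonic_split([1, 4, 6, 8, 7, 5, 3, 2])
print(s1, s2)                                   # [1, 4, 3, 2] [7, 5, 6, 8]
print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

All comparisons within one round are independent of each other, which is what makes the merge attractive for parallel hardware.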
Bitonic Sort: Complexity: $O(n (\log n)^2)$. (Figure: sorting network on 8 elements, indices 0-7. Phase 1 has step 1; phase 2 has steps 2, 1; phase 3 has steps 4, 2, 1.) 11
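The phase and step structure in the figure can be reproduced with a short sequential simulation. This is a textbook-style sketch, not the authors' kernel; the names `network_steps` and `bitonic_sort` are hypothetical, and the input length is assumed to be a power of two.

```python
def network_steps(n):
    """List (phase, comparison distance) for each step of the network on n = 2^m inputs."""
    steps, phase, size = [], 1, 2
    while size <= n:
        dist = size // 2
        while dist >= 1:
            steps.append((phase, dist))
            dist //= 2
        phase += 1
        size *= 2
    return steps

def bitonic_sort(values):
    """Sort ascending by simulating the bitonic sorting network step by step."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    size = 2
    while size <= n:                 # one phase per block size 2, 4, ..., n
        dist = size // 2
        while dist >= 1:             # one step per comparison distance
            for i in range(n):       # the n/2 compare-exchanges of a step are independent
                j = i ^ dist
                if j > i:
                    ascending = (i & size) == 0
                    if (ascending and a[i] > a[j]) or (not ascending and a[i] < a[j]):
                        a[i], a[j] = a[j], a[i]
            dist //= 2
        size *= 2
    return a

print(network_steps(8))  # [(1, 1), (2, 2), (2, 1), (3, 4), (3, 2), (3, 1)]
print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

With log n phases of up to log n steps each, and n/2 compare-exchanges per step, the total work matches the stated $O(n (\log n)^2)$ bound.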
Bitonic Top-K: Complexity: $O(n (\log k)^2)$. Example: finding the top-4 in 16 elements. Phase 1: Local Sort: turn the unsorted sequence into sorted sequences of length k. Phase 2: Merge: merge neighboring sorted sequences of length k to select the largest k elements (a bitonic sequence). Phase 3: Rebuild: sort the bitonic sequence of length k. Repeat Merge and Rebuild until the list size equals k; the result is the top-k. (Figure: network on 16 elements, indices 0-15, showing P1: Local Sort, then P2: Merge and P3: Rebuild alternating.) 12
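The three phases can be simulated sequentially to see why the procedure returns the correct top-k. The sketch below is an illustration under stated assumptions, not the authors' GPU implementation: the names are hypothetical, k and the input length are assumed to be powers of two, and Phase 1 uses Python's built-in `sorted` as a stand-in for a local bitonic sort.

```python
def sort_bitonic(seq):
    """Rebuild helper: sort a bitonic sequence ascending via recursive min/max splits."""
    if len(seq) <= 1:
        return list(seq)
    half = len(seq) // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    return sort_bitonic(lo) + sort_bitonic(hi)

def bitonic_top_k(data, k):
    """Return the k largest values in descending order."""
    n = len(data)
    assert k > 0 and k & (k - 1) == 0, "k must be a power of two"
    assert n >= k and n & (n - 1) == 0, "len(data) must be a power of two >= k"
    # Phase 1 (local sort): runs of length k, alternating ascending/descending
    # so that each neighboring pair forms a bitonic sequence of length 2k.
    runs = [sorted(data[i:i + k], reverse=(i // k) % 2 == 1) for i in range(0, n, k)]
    while len(runs) > 1:
        # Phase 2 (merge): element-wise max keeps the k largest of each pair,
        # and the kept values form a bitonic sequence.
        merged = [[max(x, y) for x, y in zip(runs[i], runs[i + 1])]
                  for i in range(0, len(runs), 2)]
        # Phase 3 (rebuild): re-sort each bitonic sequence, alternating direction again.
        runs = [sort_bitonic(m)[::-1] if idx % 2 else sort_bitonic(m)
                for idx, m in enumerate(merged)]
    return sorted(runs[0], reverse=True)

data = [5, 12, 0, 9, 14, 3, 7, 1, 11, 6, 15, 2, 8, 13, 4, 10]
print(bitonic_top_k(data, 4))  # [15, 14, 13, 12]
```

Each merge halves the data, and each rebuild costs only log k steps on sequences of length k, which is where the $(\log k)^2$ factor comes from instead of $(\log n)^2$.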
On the GPU: Simplest way to partition into kernels: each column has a kernel invocation. Each thread does 1 comparison; n/2 comparisons are needed, so n/2 threads are launched. (Chart: time to find the top-32 in a sequence of size $2^{29}$: Naive 521 ms, Sort 130 ms, Final 14.5 ms, One Pass 10 ms.) 13
Optimizations 14
Optimizations: GPU memory hierarchy. On chip: each SM (SM-1, SM-2, ..., SM-N) has its own registers and L1/SMEM; shared memory delivers up to 3.5 TBps; an L2 cache sits below the SMs. Off chip: global memory at 260 GBps. 15
Optimization 1: Using Shared Memory: For a thread block with T threads, load 2T elements into shared memory. (Figure legend: global memory access, shared memory access. Chart: time to find the top-32 in a sequence of size $2^{29}$; values not recoverable.) 16
Optimization 2: Combining Phases: Instead of loading 2T elements, let's load 8T elements and combine the 5 phases. The result is shared memory bandwidth bound. (Figure legend: global memory access, shared memory access. Chart values not recoverable.) 17
Optimization 3: Combining Steps: Three steps at a time instead of one step at a time. (Figure: two 16-element networks, indices 0-15, each with steps 4, 2, 1, marking shared memory accesses. Chart values not recoverable.) 18
Optimization 4: Padding: (Figure: addresses 0-15 laid out across memory banks 0-7, before and after padding; after padding, some cells, marked X, are left unused. Legend: thread access, unused cell. Chart values not recoverable.) 19
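The effect of padding on bank conflicts can be illustrated with a small calculation. Everything here is an assumption for illustration: the slide does not give the padding formula, so the sketch uses one common scheme (one unused cell after every `NUM_BANKS` elements), 8 banks as drawn in the figure, and a strided access pattern chosen to show the worst case.

```python
from collections import Counter

NUM_BANKS = 8  # as drawn in the figure; NVIDIA GPUs typically have 32 banks

def bank(addr):
    """Bank holding a shared memory cell, assuming consecutive cells hit consecutive banks."""
    return addr % NUM_BANKS

def padded(addr):
    """Assumed padding scheme: skip one cell after every NUM_BANKS elements."""
    return addr + addr // NUM_BANKS

def max_conflict(addresses):
    """Largest number of simultaneous accesses that land on the same bank."""
    return max(Counter(bank(a) for a in addresses).values())

# Eight threads each access an element 8 apart (a stride equal to the bank count).
before = [t * 8 for t in range(8)]
after = [padded(a) for a in before]

print(max_conflict(before))  # 8: every access hits bank 0
print(max_conflict(after))   # 1: accesses spread over all 8 banks
```

The cost is one wasted cell per `NUM_BANKS` elements, in exchange for accesses that would otherwise be serialized.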
Optimization 5: Chunk Permutation: Before: 17.8 ms. After: 16 ms. (Figure: 16-element network, indices 0-15, with steps 1; 2, 1; 4, 2, 1, showing how chunks of elements and padded cells are rearranged.) 20
Evaluation 21
Setup: CPU: Intel i7, 16 cores, 60 GB/s bandwidth to 64 GB of main memory. GPU: Titan X, 260 GB/s bandwidth to 12 GB of GPU memory. Link between main memory and GPU memory: 16 GB/s. 22
Varying K: For 2^29 (about 1/2 billion) floats drawn from U(0,1). 23
Varying Distributions 24
Integration: Dataset: 250 million tweets from May 2017. Query: SELECT id FROM tweets WHERE tweet_time < X ORDER BY retweet_count DESC LIMIT 50. Result: 4.5x faster. 25
Conclusion: Data analytics on GPUs is increasingly common, and top-k on the GPU is non-trivial. Bitonic Top-K: a novel top-k algorithm for the GPU ◦ Distribution independent ◦ Best performing for K <= 256. Integrated into a real database: >4x performance improvement. 26