Efficient Top-K Query Processing on Massively Parallel Hardware

Efficient Top-K Query Processing on Massively Parallel Hardware ANIL SHANBHAG, HOLGER PIRK, SAM MADDEN 1

CPU: 16-24 cores, 60 GB/s to main memory (128-256 GB). GPU: ~5000 cores, 250-900 GB/s to GPU memory (12-32 GB). Main memory to GPU memory: 16-40 GB/s. 2

Top-K. Query: SELECT id FROM tweets WHERE tweet_time ∈ [X,Y] ORDER BY retweet_count + 0.5*likes_count DESC LIMIT K. Typical K is 5-100. 3

Top-K. Classic sequential algorithm: use a min-heap of size k to maintain the top-k items. [Figure: heap example with the values 7, 15, 20, 21, 32, 50, 40] 4
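The heap-based approach on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `top_k_heap` and the sample values are assumptions made for the example.

```python
import heapq

def top_k_heap(items, k):
    """Return the k largest items, largest first, using a size-k min-heap."""
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest of the current top-k
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the weakest member of the top-k: replace it
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k_heap([15, 20, 7, 21, 32, 50, 40], 3))  # [50, 40, 32]
```

Each item costs at most one O(log k) heap update, so the whole scan is O(n log k).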

Partition and Merge. On multi-core CPU: partition the data across cores, then merge the per-core results. 5
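A rough sketch of partition-and-merge, assuming each worker computes a local top-k and a final pass merges them. The name `partition_and_merge_top_k` and the `num_workers` parameter are hypothetical. Python threads are used only to show the structure; they do not give real multi-core speedup for this CPU-bound work.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partition_and_merge_top_k(items, k, num_workers=4):
    """Split items into chunks, take a local top-k per chunk, merge the results."""
    chunk = max(1, -(-len(items) // num_workers))  # ceiling division
    parts = [items[i:i + chunk] for i in range(0, len(items), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker keeps only its own k largest items.
        local = list(pool.map(lambda p: heapq.nlargest(k, p), parts))
    # Merge step: the global top-k must be among the local top-k lists.
    return heapq.nlargest(k, (x for part in local for x in part))

print(partition_and_merge_top_k(list(range(100)), 5))  # [99, 98, 97, 96, 95]
```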

On GPU. Does not work well on the GPU execution model. Problems (warp of threads): • Significant thread divergence • Maintaining a heap of size k per thread limits performance 6

Intuition. Sort: Heap Sort (sequential), Bitonic Sort (parallel). Top-K: Priority Queue (sequential), ??? (parallel) -> Bitonic Top-K. Other parallel approaches: Sort + Top-K, Heap Per-Thread, Radix-Select, Bucket-Select. 7

Bitonic Top-K 8

Bitonic Sequence. Two monotonic sequences: a sequence $S = \langle a_0, a_1, a_2, \ldots, a_{n-1} \rangle$ such that • $a_0 \le a_1 \le \ldots \le a_k$ • $a_{k+1} \ge a_{k+2} \ge \ldots \ge a_{n-1}$ 9
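As a small illustration of this definition, the sketch below checks whether a list rises and then falls. The name `is_bitonic` is an assumption for the example. It tests only the rise-then-fall form stated on the slide, not the cyclic rotations that broader definitions also accept.

```python
def is_bitonic(seq):
    """True if seq is non-decreasing up to some point and non-increasing after it."""
    i, n = 0, len(seq)
    while i + 1 < n and seq[i] <= seq[i + 1]:  # climb the ascending part
        i += 1
    while i + 1 < n and seq[i] >= seq[i + 1]:  # walk down the descending part
        i += 1
    return i >= n - 1  # bitonic only if both walks together reach the end

print(is_bitonic([1, 3, 5, 4, 2]))  # True
print(is_bitonic([1, 3, 2, 4]))     # False
```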

Bitonic Merge. $S_1 = \langle \min(a_0, a_{n/2}), \min(a_1, a_{n/2+1}), \ldots, \min(a_{n/2-1}, a_{n-1}) \rangle$ and $S_2 = \langle \max(a_0, a_{n/2}), \max(a_1, a_{n/2+1}), \ldots, \max(a_{n/2-1}, a_{n-1}) \rangle$. $S_1$ and $S_2$ are both bitonic. $S_1 < S_2$: every element in $S_1$ is smaller than any element of $S_2$. Apply recursively on $S_1$ and $S_2$ => sort the entire sequence in $\log(n)$ rounds. [Figure: output elements labeled as coming from S1 or from S2] 10
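The split into S1 and S2 and the recursive application can be written out directly. This is a sequential sketch under the assumption that the input is bitonic and its length is a power of two; the function names are illustrative.

```python
def bitonic_split(seq):
    """One round: pair a_i with a_(i + n/2), send minima to S1 and maxima to S2."""
    half = len(seq) // 2
    s1 = [min(seq[i], seq[i + half]) for i in range(half)]
    s2 = [max(seq[i], seq[i + half]) for i in range(half)]
    return s1, s2

def bitonic_merge(seq):
    """Sort a bitonic sequence (power-of-two length) into ascending order."""
    if len(seq) <= 1:
        return list(seq)
    s1, s2 = bitonic_split(seq)
    # Every element of S1 is <= every element of S2, so the halves
    # can be sorted independently and concatenated.
    return bitonic_merge(s1) + bitonic_merge(s2)

print(bitonic_split([1, 4, 6, 8, 7, 5, 3, 2]))  # ([1, 4, 3, 2], [7, 5, 6, 8])
print(bitonic_merge([1, 4, 6, 8, 7, 5, 3, 2]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The recursion depth is log2(n), which matches the log(n) rounds mentioned on the slide.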

Bitonic Sort. Complexity: $O(n \log^2 n)$. [Figure: sorting network on 8 elements (0-7) in 3 phases; comparison distance per step: 1 (phase 1); 2, 1 (phase 2); 4, 2, 1 (phase 3)] 11
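The phase and step structure of the network can be sketched as a pair of nested loops. This is a plain sequential rendering of the standard bitonic sorting network, assuming a power-of-two input length; it is not the GPU implementation from the talk.

```python
def bitonic_sort(data):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(data)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    size = 2
    while size <= n:              # phase: builds sorted runs of length `size`
        dist = size // 2
        while dist >= 1:          # step: compare elements `dist` apart
            for i in range(n):
                j = i ^ dist      # partner index differs from i in one bit
                if j > i:
                    ascending = (i & size) == 0  # run direction alternates
                    if (a[i] > a[j]) == ascending:
                        a[i], a[j] = a[j], a[i]
            dist //= 2
        size *= 2
    return a

print(bitonic_sort([5, 3, 8, 1, 7, 2, 6, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

For n = 8 the step distances are 1, then 2 and 1, then 4, 2 and 1. With log n phases of at most log n steps and n/2 comparisons per step, the total is O(n log² n).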

Bitonic Top-K. Complexity: $O(n \log^2 k)$. Phase 1 (Local Sort): unsorted sequence -> sorted sequences of length k. Phase 2 (Merge): merge neighboring sorted sequences of length k to select the largest k elements (bitonic sequence). Phase 3 (Rebuild): sort the bitonic sequence of length k. When list size = k: the result is the top-k. [Figure: finding top-4 in 16 elements (0-15), with columns labeled P1: Local Sort, P2: Merge, P3: Rebuild] 12
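The three phases can be put together in a short sequential sketch. Several things here are assumptions made for illustration: n and k are powers of two, the local sort uses Python's built-in `sorted` as a stand-in for the first phases of a bitonic sorting network, and neighboring runs are sorted in alternating directions so that each pair forms a bitonic sequence. It is not the authors' GPU kernel.

```python
def sort_bitonic(seq, ascending):
    """Rebuild: sort a bitonic sequence (power-of-two length) in log(k) steps."""
    a = list(seq)
    dist = len(a) // 2
    while dist >= 1:
        for i in range(len(a)):
            j = i ^ dist
            if j > i and (a[i] > a[j]) == ascending:
                a[i], a[j] = a[j], a[i]
        dist //= 2
    return a

def bitonic_top_k(data, k):
    """Return the k largest elements of data, largest first."""
    n = len(data)
    if k < 1 or k > n or (k & (k - 1)) or (n & (n - 1)):
        raise ValueError("n and k must be powers of two with 1 <= k <= n")
    # Phase 1: local sort into runs of length k, alternating direction.
    runs = [sorted(data[i:i + k], reverse=(r % 2 == 1))
            for r, i in enumerate(range(0, n, k))]
    while len(runs) > 1:
        # Phase 2: merge. An ascending run followed by a descending run is
        # bitonic, so the elementwise max keeps the k largest of the 2k
        # elements, and that result is itself bitonic.
        merged = [[max(x, y) for x, y in zip(runs[p], runs[p + 1])]
                  for p in range(0, len(runs), 2)]
        # Phase 3: rebuild sorted runs, again alternating direction.
        runs = [sort_bitonic(m, ascending=(r % 2 == 0))
                for r, m in enumerate(merged)]
    return runs[0][::-1]  # the last run is ascending; report largest first

data = [3, 14, 7, 1, 12, 9, 0, 5, 15, 2, 11, 6, 8, 13, 4, 10]
print(bitonic_top_k(data, 4))  # [15, 14, 13, 12]
```

Each merge halves the data and each rebuild needs only log k steps, which is where the log k factors in the stated complexity come from.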

On the GPU. Simplest way to partition into kernels: each column has a kernel invocation; each thread does 1 comparison; n/2 comparisons needed => n/2 threads launched. [Chart: time to find top-32 in a sequence of size 2^29. Bars: Sort, Naive, Final (14.5 ms), One Pass (10 ms); the values 521 ms and 130 ms belong to the Sort and Naive bars.] 13
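The "one kernel per column, one comparison per thread" partitioning can be simulated on the CPU. In this sketch, `launch_step` is a hypothetical stand-in for a kernel launch and the inner loop stands in for n/2 GPU threads. It is shown for a full bitonic sort to keep it short, and the thread-to-pair index mapping is one common choice rather than something taken from the slides.

```python
def compare_exchange_thread(a, t, dist, size):
    """Work of one simulated thread: a single compare-exchange."""
    i = 2 * dist * (t // dist) + (t % dist)  # lower index of this thread's pair
    j = i + dist                             # partner element, `dist` apart
    ascending = (i & size) == 0              # direction of the enclosing run
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]

def launch_step(a, dist, size):
    """Simulated kernel launch: n/2 independent threads, one comparison each."""
    for t in range(len(a) // 2):
        compare_exchange_thread(a, t, dist, size)

def bitonic_sort_by_kernels(data):
    """Run one simulated kernel per network column; return (sorted, launches)."""
    a = list(data)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    launches = 0
    size = 2
    while size <= n:
        dist = size // 2
        while dist >= 1:
            launch_step(a, dist, size)
            launches += 1
            dist //= 2
        size *= 2
    return a, launches

print(bitonic_sort_by_kernels([5, 3, 8, 1, 7, 2, 6, 4]))
# ([1, 2, 3, 4, 5, 6, 7, 8], 6)
```

On a real GPU every such launch reads and writes the data in global memory, so reducing that traffic is a natural target for optimization.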

Optimizations 14

Optimizations. [Diagram: GPU memory hierarchy. On chip: each SM (SM-1, SM-2, ..., SM-N) has registers and L1/shared memory (SMEM), with shared memory at up to 3.5 TB/s, plus a common L2 cache. Off chip: global memory at 260 GB/s.] 15

Optimization 1: Using Shared Memory. For a thread block with T threads, load 2T elements into shared memory. [Diagram legend: global memory access, shared memory access. Chart: time to find top-32 in a sequence of size 2^29.] 16

Optimization 2: Combining Phases. Instead of loading 2T, let's load 8T elements and combine the 5 phases. Shared memory bandwidth bound. [Diagram legend: global memory access, shared memory access] 17

Optimization 3: Combining Steps. [Figure: two networks on 16 elements (0-15) with step distances 4, 2, 1, one labeled "One step at a time" and the other "Three steps at a time"; legend: shared memory access] 18

Optimization 4: Padding. [Figure: addresses 0-15 laid out across memory banks 0-7, before padding and after padding; legend: thread access, unused cell (X)] 19
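The effect of padding on bank conflicts can be illustrated with a small model. Everything below is an assumption made for the example: a word at address a lives in bank a mod 8 (8 banks as drawn on the slide; NVIDIA GPUs commonly have 32), the access pattern is a hypothetical strided read, and the padding scheme inserts one unused cell after every 8 elements. The exact layout used in the talk is not recoverable from the slide text.

```python
from collections import Counter

NUM_BANKS = 8  # assumption: matches the 8 banks drawn on the slide

def padded(i, num_banks=NUM_BANKS):
    """Physical index after inserting one unused cell per num_banks elements."""
    return i + i // num_banks

def worst_conflict(addresses, num_banks=NUM_BANKS):
    """Largest number of simultaneous accesses that land in the same bank."""
    return max(Counter(a % num_banks for a in addresses).values())

# Hypothetical pattern: 8 threads each read an element 8 apart.
logical = [t * NUM_BANKS for t in range(NUM_BANKS)]

print(worst_conflict(logical))                       # 8: all hit bank 0
print(worst_conflict([padded(i) for i in logical]))  # 1: spread over 8 banks
```

Conflicting accesses to one bank are serialized, so in this model the padded layout turns an 8-way conflict into conflict-free access at the cost of a few unused cells.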

Optimization 5: Chunk Permutation. [Figure: network steps with distances 1; 2, 1; 4, 2, 1 over elements 0-15 arranged across banks 0-7; legend: padded cell. Chart: Before 17.8 ms, After 16 ms.] 20

Evaluation 21

Setup. CPU: Intel i7, 16 cores, 60 GB/s to main memory (64 GB). GPU: Titan X, 260 GB/s to GPU memory (12 GB). Main memory to GPU memory: 16 GB/s. 22

Varying K For 2^29 (1/2 billion) floats from U(0,1) 23

Varying Distributions 24

Integration. Dataset: 250 million tweets, May 2017. Query: SELECT id FROM tweets WHERE tweet_time < X ORDER BY retweet_count DESC LIMIT 50. Result: 4.5x faster. 25

Conclusion • Data analytics on GPUs is increasingly common, and Top-K on GPU is non-trivial • Bitonic Top-K: novel Top-K algorithm for GPU ◦ Distribution independent ◦ Best performing for K <= 256 • Integrated into a real database: >4x performance improvement 26
