Toward GPUs Being Mainstream in Analytic Processing: An Initial Argument Using Simple Scan-Aggregate Queries
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood <powerjg@cs.wisc.edu>
DaMoN 2015 · June 1, 2015 · University of Wisconsin
Summary
▪ GPUs are energy efficient
▪ Discrete GPUs unpopular for DBMS
▪ New integrated GPUs solve the problems
▪ Scan-aggregate GPU implementation
  ▪ Wide bit-parallel scan
  ▪ Fine-grained aggregate GPU offload
▪ Up to 70% energy savings over a multicore CPU
  ▪ Even more in the future
Analytic Data Is Growing
▪ Data is growing rapidly
▪ Analytic DBs increasingly important
Source: IDC's Digital Universe Study, 2012
Want: high performance. Need: low energy.
GPUs to the Rescue?
▪ GPUs are becoming more general
  ▪ Easier to program
▪ Integrated GPUs are everywhere
▪ GPUs show great promise [Govindaraju '04, He '14, He '14, Kaldewey '12, Satish '10, and many others]
  ▪ Higher performance than CPUs
  ▪ Better energy efficiency
▪ Analytic DBs look like GPU workloads
GPU Microarchitecture
[Block diagram: a Graphics Processing Unit contains multiple compute units (CUs) sharing an L2 cache; each Compute Unit has an instruction fetch/scheduler, an array of SIMD lanes (SPs), a register file, a scratchpad, and an L1 cache.]
Discrete GPUs
[Diagram: a CPU chip (cores with their own memory bus) connected over the PCIe bus to a discrete GPU with its own memory bus; callouts ➊–➍ mark the data copy over PCIe, the GPU's working memory, the user → kernel call, and the "and repeat" step discussed on the next slide.]
Discrete GPUs
▪ Copy data over PCIe ➊
  ▪ Low bandwidth
  ▪ High latency
▪ Small working memory ➋
▪ High-latency user → kernel calls ➌
▪ Repeated many times ➍
98% of time spent not computing
Integrated GPUs
[Diagram: a single heterogeneous chip with CPU cores and GPU CUs sharing one memory bus.]
Heterogeneous System Architecture (HSA)
▪ API for tightly integrated accelerators
▪ Industry support
  ▪ Initial hardware support today
  ▪ HSA Foundation (AMD, ARM, Qualcomm, others)
▪ No need for data copies ➊➋
  ▪ Cache coherence and shared address space ➍
▪ No OS kernel interaction ➌
  ▪ User-mode queues
Outline
▪ Background
▪ Algorithms
  ▪ Scan
  ▪ Aggregate
▪ Results
Analytic DBs
▪ Resident in main memory
▪ Column-based layout
▪ WideTable & BitWeaving [Li and Patel '13 & '14]
  ▪ Convert queries to mostly scans by pre-joining tables
  ▪ Fast scan by using sub-word parallelism
▪ Similar to industry proposals [SAP Hana, Oracle Exalytics, IBM DB2 BLU]
▪ Scan-aggregate queries
Running Example
Color dictionary: Red = 0, Blue = 1, Green = 2, Yellow = 3
Shirt table (Color, Amount):
Green (2), 1
Green (2), 3
Blue (1), 1
Green (2), 5
Yellow (3), 7
Red (0), 2
Yellow (3), 1
Blue (1), 4
Yellow (3), 2
Running Example
Query: count the number of green shirts in the inventory
➊ Scan the Color column for Green (code 2)
➋ Aggregate the Amount column where there is a match
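For reference, the running query boils down to the following row-at-a-time loop (a minimal sketch; the values come from the table above, and the identifiers are illustrative, not from the paper):

```cuda
// Naive scalar version of the running example:
// sum the Amount column for every row whose Color code is Green (2).
#include <cstdint>
#include <cstdio>

int main() {
    const uint8_t GREEN = 2;                          // dictionary: Green -> 2
    const uint8_t color[]  = {2, 2, 1, 2, 3, 0, 3, 1, 3};
    const int32_t amount[] = {1, 3, 1, 5, 7, 2, 1, 4, 2};

    int64_t total = 0;
    for (int i = 0; i < 9; i++)
        if (color[i] == GREEN)                        // ➊ scan
            total += amount[i];                       // ➋ aggregate
    printf("green shirts in inventory: %lld\n", (long long)total);  // 1 + 3 + 5 = 9
    return 0;
}
```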
Traditional Scan Algorithm
[Diagram: each code in the Color column (2, 2, 1, 2, 3, 0, 3, 1, 3) is compared in turn against the predicate code 2 (Green), one code per comparison, producing the result bitvector 11010000 0...]
Vertical Layout
[Diagram: the Color codes c0–c9 (2, 2, 1, 2, 3, 0, 3, 1, 3, 0) are transposed so that word w0 holds the most-significant bit of codes c0–c7, word w1 holds the least-significant bit of c0–c7, and words w2/w3 hold the same bits for c8–c9. One processor word now spans one bit position of many codes.]
CPU BitWeaving Scan
[Diagram: the predicate code 2 (10) is broadcast bit-by-bit across full words (11111111..., 00000000...) and combined with the column words (11011011..., 00101011..., 10000000...) to yield the result bitvector 11010000 0... using a handful of word-wide operations.]
CPU width: 64 bits, up to 256-bit SIMD
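In code, the bit-parallel equality comparison reduces to one XNOR per bit position, AND-accumulated into the result word. A minimal CPU sketch, assuming the column has already been transposed into the vertical layout (names and layout details are illustrative, not the authors' implementation):

```cuda
// Minimal sketch of a BitWeaving/V-style equality scan on the CPU.
// Assumption: words[0] holds the most-significant bit of 64 codes,
// words[1] the next bit, and so on; bit j of each word belongs to code j.
#include <cstdint>

#define CODE_BITS 2   // 2-bit color codes in the running example

// Returns a 64-bit result bitvector: bit j is set iff code j == search.
uint64_t bitweaving_eq(const uint64_t words[CODE_BITS], uint8_t search) {
    uint64_t match = ~0ULL;                       // start with all codes matching
    for (int b = 0; b < CODE_BITS; ++b) {
        // Broadcast bit b of the predicate code to all 64 lanes
        // (for Green = 2 = 0b10: 111...1, then 000...0, as on the slide).
        uint64_t probe = ((search >> (CODE_BITS - 1 - b)) & 1) ? ~0ULL : 0ULL;
        // A lane survives only if its stored bit agrees with the probe bit.
        match &= ~(words[b] ^ probe);
    }
    return match;
}
```

With 64-bit words this compares 64 codes per loop iteration; 256-bit SIMD lifts that to 256.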
GPU BitWeaving Scan
[Diagram: the same bit-parallel comparison, but the predicate bits (11111111...) are broadcast across the GPU's much wider effective word, so far more codes are compared per step; result bitvector 11010000 0...]
GPU width: 16,384-bit SIMD
GPU Scan Algorithm
▪ GPU uses very wide "words"
  ▪ CPU: 64 bits, or 256 bits with SIMD
  ▪ GPU: 16,384 bits (256 lanes × 64 bits)
▪ Memory and caches optimized for bandwidth
▪ HSA programming model
  ▪ No data copies
  ▪ Low CPU-GPU interaction overhead
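The GPU version keeps the same inner loop but runs it across thousands of lanes at once. The paper's implementation targets HSA on an integrated AMD APU; the CUDA sketch below is only meant to show the kernel structure, and every identifier in it is hypothetical:

```cuda
// Illustrative GPU version of the wide bit-parallel scan. Each thread runs
// the BitWeaving-style comparison on its own group of 64 vertically packed
// codes, so a 256-lane dispatch behaves like one 16,384-bit word.
#include <cstdint>
#include <cstddef>

#define CODE_BITS 2

__global__ void gpu_bitweaving_eq(const uint64_t* __restrict__ words,  // CODE_BITS words per group
                                  uint64_t* __restrict__ result,       // one result word per group
                                  size_t num_groups,
                                  uint8_t search) {
    size_t g = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (g >= num_groups) return;

    uint64_t match = ~0ULL;
    for (int b = 0; b < CODE_BITS; ++b) {
        uint64_t probe = ((search >> (CODE_BITS - 1 - b)) & 1) ? ~0ULL : 0ULL;
        match &= ~(words[g * CODE_BITS + b] ^ probe);
    }
    result[g] = match;   // bit j of result[g]: code j of group g matched
}
```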
CPU Aggregate Algorithm
[Diagram: the CPU walks the result bitvector (11010000 0...) alongside the Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2) and accumulates the values at matching positions: 1 + 3, then 1 + 3 + 5 + ...]
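A minimal sketch of this step, assuming bit i of the bitvector marks whether row i matched (illustrative, not the paper's code):

```cuda
// CPU aggregate: walk the scan's result bitvector and sum the Amount
// column wherever the predicate matched.
#include <cstdint>
#include <cstddef>

int64_t cpu_aggregate_sum(const uint64_t* bitvector,   // bit i set => row i matched
                          const int32_t* amount,
                          size_t num_rows) {
    int64_t sum = 0;
    for (size_t i = 0; i < num_rows; ++i)
        if ((bitvector[i / 64] >> (i % 64)) & 1)
            sum += amount[i];
    return sum;            // for the running example: 1 + 3 + 5 = 9
}
```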
GPU Aggregate Algorithm
[Diagram, phase 1 (on CPU): the result bitvector 11010000 0... is converted into a list of column offsets 0, 1, 3, ...]
GPU Aggregate Algorithm
[Diagram, phase 2 (on GPU): the offsets 0, 1, 3, ... gather the matching values from the Amount column (1, 3, 1, 5, 7, 2, 1, 4, 2), which are reduced to the result 1 + 3 + 5 + ...]
Aggregate Algorithm
▪ Two phases
  ▪ Convert from BitVector to offsets (on CPU)
  ▪ Materialize data and compute (offload to GPU)
▪ Two group-by algorithms (see paper)
▪ HSA programming model
  ▪ Fine-grained sharing
  ▪ Can offload a subset of the computation
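A sketch of the two phases under the same assumptions (bit i of the bitvector marks row i). The paper's HSA implementation shares pointers directly between the CPU and the integrated GPU; in this illustrative CUDA version the arrays would need to live in unified (managed) memory to get a similar effect, and all names are hypothetical:

```cuda
#include <cstdint>
#include <cstddef>

// Phase 1 (on CPU): convert the result bitvector into a dense offset list.
size_t bitvector_to_offsets(const uint64_t* bitvector, size_t num_rows,
                            uint32_t* offsets) {
    size_t n = 0;
    for (size_t i = 0; i < num_rows; ++i)
        if ((bitvector[i / 64] >> (i % 64)) & 1)
            offsets[n++] = (uint32_t)i;       // e.g., 0, 1, 3, ... for the example
    return n;                                 // number of matching rows
}

// Phase 2 (offloaded to GPU): materialize the matching Amount values and
// reduce them. One atomic add per match keeps the sketch short; a real
// kernel would reduce within each block first.
__global__ void gather_sum(const uint32_t* offsets, const int32_t* amount,
                           size_t n, unsigned long long* sum) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, (unsigned long long)amount[offsets[i]]);
}
```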
Outline
▪ Background
▪ Algorithms
▪ Results
Experimental Methods
▪ AMD A10-7850
  ▪ 4-core CPU
  ▪ 8-compute-unit GPU
  ▪ 16 GB capacity, 21 GB/s DDR3 memory
▪ Separate discrete GPU
▪ Watts-Up meter for full-system power
▪ TPC-H at scale factor 10
Scan Performance & Energy
[Charts: scan performance and energy; omitted]
Takeaway: the integrated GPU is the most efficient for scans
TPC-H Queries
[Charts: Query 12 performance and Query 12 energy; omitted]
▪ The integrated GPU is faster for both the aggregate and the scan computation
▪ More energy: the decrease in latency does not offset the power increase
▪ Less energy: a decrease in latency AND a decrease in power
Future Die-Stacked GPUs
▪ 3D die stacking
▪ Same physical & logical integration
▪ Increased compute
▪ Increased bandwidth
[Diagram: DRAM stacked on the GPU, stacked on the CPU, on the board]
Power et al., "Implications of 3D GPUs on the Scan Primitive," SIGMOD Record 44(1), March 2015.
Conclusions
                   Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance        High ☺          Moderate          High ☺
Memory Bandwidth   High ☺          Low ☹             High ☺
Overhead           High ☹          Low ☺             Low ☺
Memory Capacity    Low ☹           High ☺            Moderate
?
HSA vs. CUDA/OpenCL
▪ HSA defines a heterogeneous architecture
  ▪ Cache coherence
  ▪ Shared virtual addresses
  ▪ Architected queuing
  ▪ Intermediate language
▪ CUDA/OpenCL are a level above HSA
  ▪ Come with baggage
  ▪ Not as flexible
  ▪ May not be able to take advantage of all features
Scan Performance & Energy [chart omitted]
Group-by Algorithms [chart omitted]
All TPC-H Results [chart omitted]
Average TPC-H Results [charts: average performance and average energy; omitted]
What's Next?
▪ Developing a cost model for the GPU
  ▪ Using the GPU is just another algorithm to choose
  ▪ Evaluate exactly when the GPU is more efficient
▪ Future "database machines"
  ▪ GPUs are a good tradeoff between specialization and commodity
Conclusions
▪ Integrated GPUs viable for DBMS?
  ▪ Solve the problems with discrete GPUs
  ▪ (Somewhat) better performance and energy
▪ Looking toward the future...
  ▪ CPUs cannot keep up with memory bandwidth
  ▪ GPUs well suited to these workloads