Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing
Edward Ching, Norbert Egi, Masood Mortazavi, Vincent Cheung, Guangyu Shi
BigSys’14, September 25th, 2014
IT Research Department
Overview
• Wide variety of compute resources available (multi-core CPUs, GPUs, FPGAs, DSPs, etc.)
• Discrete GPUs (d-GPUs) are probably the best known
• Integrated GPUs (i-GPUs) have become available for general-purpose parallel computation
• Architectural and performance comparison
Introduction
• Discrete GPUs (d-GPUs) have long been used for application acceleration
• CPU+GPU co-processing for data analytics is being widely adopted: it requires less computation than HPC, but on much more data
• PCIe has become the performance “bottleneck”
• Recent CPUs with integrated GPUs (i-GPUs) look like a viable alternative
• Our focus is on modern i-GPUs for parallel data processing
• Goal: help system designers select the right architectural option
Architecture: d-GPU
• Large local memory and cache hierarchy
• High-throughput GDDR5 access
• Connection over PCIe
  – Low throughput
  – High latency
Architecture: i-GPU
Haswell Microarchitecture
• Connection over internal bus
  – High throughput
  – Low latency
• Shared LLC
• True zero-copy
Hardware parameters
                          Nvidia GTX780   Intel HD4600
Cores                     12              20
Threads / Core            6               7
Data lanes / Thread       32              8 / 16 / 32
Max. Physical Occupancy   2304            4480
Clock (GHz)               1.0             1.25
Power consumption (W)     250             <30
GFLOPS                    3977            432
Memory / Cache            3 GB GDDR5      8 MB L3 cache
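The "Max. Physical Occupancy" row appears to be the product of cores, threads per core, and the widest data-lane setting; the slide does not state this, so treat it as an assumed reading. The minimal sketch below checks that reading against the table and also derives GFLOPS per watt. The `Gpu` class and its field names are hypothetical, and the 30 W figure for the HD4600 is the table's upper bound, so its GFLOPS/W value is a lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """One column of the hardware-parameters table (field names are illustrative)."""
    name: str
    cores: int
    threads_per_core: int
    lanes_per_thread: int  # widest data-lane setting listed in the table
    gflops: float
    watts: float           # for the HD4600 this is the "<30" upper bound

    def max_occupancy(self) -> int:
        # Assumed derivation: cores x threads/core x data lanes/thread.
        return self.cores * self.threads_per_core * self.lanes_per_thread

    def gflops_per_watt(self) -> float:
        return self.gflops / self.watts


GTX780 = Gpu("Nvidia GTX780", 12, 6, 32, 3977.0, 250.0)
HD4600 = Gpu("Intel HD4600", 20, 7, 32, 432.0, 30.0)

if __name__ == "__main__":
    for gpu in (GTX780, HD4600):
        print(gpu.name, gpu.max_occupancy(), round(gpu.gflops_per_watt(), 1))
```

Under these assumptions the two devices land close together in peak GFLOPS per watt, even though the d-GPU has roughly nine times the peak GFLOPS.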
Data Transfer Mechanisms
d-GPU: DMA
• HW supported
• No CPU or GPU intervention
• Directly to GDDR5
• Goes over the relatively slow PCIe
• Can add to the total execution time
d-GPU: Memory Mapping
• DDR3 can be referenced directly via the GPU MMU
• Programming the MMU is faster than the DMA transfer
• Only data that is needed is moved
• GPU is stalled during data transfer
• Goes over the relatively slow PCIe
i-GPU: Zero-Copy
• Shared cache and main memory
• GTT mapping (similar to MMU)
• Data goes over the fast internal bus of the CPU
• Data can even be retrieved from the shared LLC
• Works “vice versa” (CPU ↔ GPU)
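The trade-off between the copy-based d-GPU path and the zero-copy i-GPU path can be pictured with a very small cost model. This is only a sketch under simplifying assumptions: the copy fully precedes the kernel (no overlap), and on the zero-copy path all memory access cost is folded into the kernel time. The function names and every number in the usage lines are hypothetical placeholders, not measurements from the talk.

```python
def copy_then_compute_ms(n_bytes: int, link_gb_per_s: float, kernel_ms: float) -> float:
    """DMA-style path: the whole buffer crosses the link, then the kernel runs."""
    transfer_ms = n_bytes / (link_gb_per_s * 1e9) * 1e3
    return transfer_ms + kernel_ms


def zero_copy_ms(kernel_ms: float) -> float:
    """Zero-copy path: no staging copy, so only the kernel time is counted."""
    return kernel_ms


def break_even_slowdown(n_bytes: int, link_gb_per_s: float, dgpu_kernel_ms: float) -> float:
    """How many times slower the zero-copy kernel may be before the copy path wins."""
    return copy_then_compute_ms(n_bytes, link_gb_per_s, dgpu_kernel_ms) / dgpu_kernel_ms


if __name__ == "__main__":
    # Placeholder inputs: 1 GB of data, an 8 GB/s link, a 20 ms d-GPU kernel.
    print(copy_then_compute_ms(10**9, 8.0, 20.0))
    print(break_even_slowdown(10**9, 8.0, 20.0))
```

The model only captures the shape of the argument: the more data a query touches per unit of computation, the more a slower kernel can be tolerated when the transfer disappears.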
Performance Analysis
• (1) Compilation Environment
• (2) Raw data transfer
• (3) Micro-benchmarks
• (4) Database queries
Compilation Environment
[Figure: OpenCL software stack on Haswell, top to bottom]
• OpenCL Query Co-Processing Functions (OpenCL kernel source, OpenCL API calls)
• Linux OpenCL API for Haswell (OpenCL kernel source; GPU (EU, GTT, etc.) configuration)
• Linux OpenCL Compiler for Haswell, Memory Mgmt, Execution Control (GPU binary in the Haswell GPU instruction set)
• Linux OpenCL driver for Haswell (device access)
• Haswell Processor Graphics Hardware (Gen 7 - HD4600)
Raw Data Transfer
[Figure]
Micro-benchmarks
• Simple, optimal memory access patterns: Map, Reduce, Scan
• Randomized memory access patterns: Gather, Scatter
• Combination of several sorting operations: Split, Bitonic Sort and Radix Sort
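For readers unfamiliar with these primitives, the sketch below gives sequential reference semantics for Scan, Gather, Scatter and Split, and shows how a radix sort can be composed from repeated Split passes. These are not the benchmarked OpenCL kernels, which the slides do not show. The function names are illustrative, the sort handles only non-negative integers below `2**bits`, and building radix sort from Split is a common construction assumed here rather than confirmed by the talk.

```python
from typing import List, Sequence


def exclusive_scan(xs: Sequence[int]) -> List[int]:
    """Prefix sum where out[i] is the sum of xs[0..i-1]."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out


def gather(src: Sequence[int], idx: Sequence[int]) -> List[int]:
    """out[i] = src[idx[i]]: reads are randomized, writes are sequential."""
    return [src[i] for i in idx]


def scatter(src: Sequence[int], idx: Sequence[int], size: int) -> List[int]:
    """out[idx[i]] = src[i]: reads are sequential, writes are randomized."""
    out = [0] * size
    for value, i in zip(src, idx):
        out[i] = value
    return out


def split(xs: Sequence[int], flags: Sequence[int]) -> List[int]:
    """Stable partition: flag-0 elements first, then flag-1 (scan + scatter)."""
    zeros_before = exclusive_scan([1 - f for f in flags])
    ones_before = exclusive_scan(flags)
    n_zeros = len(flags) - sum(flags)
    dest = [zeros_before[i] if f == 0 else n_zeros + ones_before[i]
            for i, f in enumerate(flags)]
    return scatter(xs, dest, len(xs))


def radix_sort(xs: Sequence[int], bits: int = 8) -> List[int]:
    """LSD radix sort: one stable split per bit, least significant bit first."""
    out = list(xs)
    for b in range(bits):
        out = split(out, [(x >> b) & 1 for x in out])
    return out
```

On a GPU each of these loops becomes a data-parallel kernel; the point of the sketch is only to show how the sorting benchmark builds on the simpler scan and scatter patterns.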
Micro-benchmarks
[Figures]
Database query

TPC-H Q1

    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem
    where
        l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus;

TPC-H Q9

    select
        nation,
        o_year,
        sum(amount) as sum_profit
    from (
        select
            n_name as nation,
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
            part, supplier, lineitem, partsupp, orders, nation
        where
            s_suppkey = l_suppkey
            and ps_suppkey = l_suppkey
            and ps_partkey = l_partkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and s_nationkey = n_nationkey
            and p_name like '%[COLOR]%'
    ) as profit
    group by
        nation,
        o_year
    order by
        nation,
        o_year desc;
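To make the shape of Q1 concrete, the sketch below evaluates it in a single pass over in-memory rows: filter on ship date, accumulate per (l_returnflag, l_linestatus) group, then finalize the averages and sort. It is a sequential illustration, not the co-processing implementation from the talk. Rows are assumed to be dictionaries keyed by the TPC-H column names, and `delta_days=90` is only a placeholder for the `[DELTA]` substitution parameter.

```python
from collections import defaultdict
from datetime import date, timedelta


def tpch_q1(lineitem, delta_days=90):
    """Single-pass filter + group-by aggregation mirroring TPC-H Q1."""
    cutoff = date(1998, 12, 1) - timedelta(days=delta_days)
    sums = defaultdict(lambda: {"qty": 0.0, "base": 0.0, "disc_price": 0.0,
                                "charge": 0.0, "disc": 0.0, "count": 0})
    for row in lineitem:
        if row["l_shipdate"] > cutoff:        # where l_shipdate <= cutoff
            continue
        g = sums[(row["l_returnflag"], row["l_linestatus"])]
        disc_price = row["l_extendedprice"] * (1 - row["l_discount"])
        g["qty"] += row["l_quantity"]
        g["base"] += row["l_extendedprice"]
        g["disc_price"] += disc_price
        g["charge"] += disc_price * (1 + row["l_tax"])
        g["disc"] += row["l_discount"]
        g["count"] += 1

    result = []
    for flag, status in sorted(sums):         # order by returnflag, linestatus
        g = sums[(flag, status)]
        n = g["count"]
        result.append({
            "l_returnflag": flag, "l_linestatus": status,
            "sum_qty": g["qty"], "sum_base_price": g["base"],
            "sum_disc_price": g["disc_price"], "sum_charge": g["charge"],
            "avg_qty": g["qty"] / n, "avg_price": g["base"] / n,
            "avg_disc": g["disc"] / n, "count_order": n,
        })
    return result
```

Q1 touches one table and produces only a handful of groups, so its cost is dominated by streaming the lineitem columns; Q9 adds a six-way join and is correspondingly heavier on randomized access.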
Database query (TPC-H Q1)
[Figure: TPC-H Q1 UDF Benchmark Test: iGPU vs dGPUs vs CPU. Two panels: Time (ms) and Throughput (GB/s/W).]
Database query (TPC-H Q9)
[Figure]
Conclusion
• Examined query and primitive-operation processing
• Used micro-benchmarks and more realistic data-analytics queries
• Found that i-GPU compute resources are weaker than those of d-GPUs
• But i-GPUs excel significantly in the speed of data access
• They behave as “free” resources
• They consume far less power
Q & A
R&D Openings in Munich
• Huawei’s European Research Center (ERC)
• 10+ openings
• Database System Architects and Software Engineers
• recruitment.erc@huawei.com
• http://career.huawei.com/career/en/i18n/toJobDetail.do?callMethod=toJobDetail&jobID=43263