Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing
Edward Ching, Norbert Egi, Masood Mortazavi, Vincent Cheung, Guangyu Shi
BigSys’14, September 25th, 2014
IT Research Department

Overview
• Wide variety of compute resources available (multi-core CPUs, GPUs, FPGAs, DSPs, etc.)
• Discrete GPUs (d-GPUs) might be the most well-known
• Integrated GPUs (i-GPUs) have become available for general-purpose parallel computation
• Architectural and performance comparison

Introduction
• Discrete GPUs (d-GPUs) have long been used for application acceleration
• CPU+GPU co-processing for data analytics is being widely adopted: it requires less computation than HPC, but on much more data
• PCIe became the performance “bottleneck”
• Recent CPUs with integrated GPUs (i-GPUs) look like a viable alternative
• Our focus is on modern i-GPUs for parallel data processing
• Goal: help system designers select the right architectural option

Architecture: d-GPU
• Large local memory and cache hierarchy
• High-throughput GDDR5 access
• Connection over PCIe
  – Low throughput
  – High latency

Architecture: i-GPU (Haswell Microarchitecture)
• Connection over internal bus
  – High throughput
  – Low latency
• Shared LLC
• True zero-copy

Hardware parameters

                          Nvidia GTX780   Intel HD4600
Cores                     12              20
Threads / Core            6               7
Data lanes / Thread       32              8 / 16 / 32
Max. Physical Occupancy   2304            4480
Clock (GHz)               1.0             1.25
Power consumption (W)     250             <30
GFLOPS                    3977            432
Memory / Cache            3 GB GDDR5      8 MB L3 cache
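
The "Max. Physical Occupancy" row appears to be the product of the three rows above it (cores × threads per core × widest data lane count). The sketch below checks that reading against the slide's numbers and also derives GFLOPS per watt. The class and method names are illustrative, and the HD4600 power figure uses the slide's "<30 W" as an upper bound, so its GFLOPS/W value is a lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hardware parameters as listed on the slide (names here are illustrative)."""
    name: str
    cores: int
    threads_per_core: int
    max_lanes_per_thread: int   # widest SIMD width listed
    gflops: float
    power_w: float              # for the HD4600 this is an upper bound ("<30 W")

    def max_physical_occupancy(self) -> int:
        # Assumed relation: cores x threads per core x data lanes per thread.
        return self.cores * self.threads_per_core * self.max_lanes_per_thread

    def gflops_per_watt(self) -> float:
        return self.gflops / self.power_w


GTX780 = Gpu("Nvidia GTX780", 12, 6, 32, 3977.0, 250.0)
HD4600 = Gpu("Intel HD4600", 20, 7, 32, 432.0, 30.0)

if __name__ == "__main__":
    for gpu in (GTX780, HD4600):
        print(gpu.name, gpu.max_physical_occupancy(),
              round(gpu.gflops_per_watt(), 1))
```

Under this reading the two devices land at a similar GFLOPS-per-watt level despite the large gap in raw GFLOPS.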

Data Transfer Mechanisms

d-GPU: DMA
• HW supported
• No CPU or GPU intervention
• Directly to GDDR5
• Goes over the relatively slow PCIe
• Can add to the total execution time

d-GPU: Memory Mapping
• DDR3 can be referenced directly via the GPU MMU
• Programming the MMU is faster than the DMA transfer
• Only data that is needed is moved
• GPU is stalled during data transfer
• Goes over the relatively slow PCIe

i-GPU: Zero-Copy
• Shared cache and main memory
• GTT mapping (similar to MMU)
• Data goes over the fast internal bus of the CPU
• Data can even be retrieved from the shared LLC
• Works “vice versa” (CPU ↔ GPU)
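
The point that a DMA copy "can add to the total execution time" can be made concrete with a simple additive cost model. This is a minimal sketch, not the authors' model: the function names are made up, and the bandwidth and kernel times in the usage example are hypothetical placeholders, not measurements from the talk.

```python
def dma_total_ms(n_bytes, link_gb_per_s, kernel_ms):
    """d-GPU with DMA: copy the input over the link first, then run the kernel."""
    transfer_ms = n_bytes / (link_gb_per_s * 1e9) * 1e3
    return transfer_ms + kernel_ms


def zero_copy_total_ms(kernel_ms, map_overhead_ms=0.0):
    """i-GPU with zero-copy: no bulk copy, only an (assumed small) mapping cost."""
    return map_overhead_ms + kernel_ms


if __name__ == "__main__":
    # Hypothetical inputs: 1 GB of data, a 10 GB/s link, a fast d-GPU kernel
    # (5 ms) and a much slower i-GPU kernel (40 ms).
    print(dma_total_ms(1e9, 10.0, 5.0))     # copy time dominates the d-GPU total
    print(zero_copy_total_ms(40.0))         # slower kernel, but no copy to pay for
```

With inputs like these, the device with the slower kernel finishes first because it skips the transfer, which is the trade-off the slide describes.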

Performance Analysis
• (1) Compilation Environment
• (2) Raw data transfer
• (3) Micro-benchmarks
• (4) Database queries

Compilation Environment (software stack diagram)
Layers, top to bottom:
• OpenCL query co-processing functions
• Linux OpenCL API for Haswell
• Linux OpenCL compiler for Haswell
• Linux OpenCL driver for Haswell
• Haswell Processor Graphics hardware (Gen 7 - HD4600)
Labels on the connections between layers: OpenCL kernel source, OpenCL API; GPU (EU, GTT, etc.) configuration; memory management; GPU binary; execution control; Haswell GPU instruction-set binary; device access

Raw Data Transfer (figure)

Micro-benchmarks
• Simple optimal memory access patterns: Map, Reduce, Scan
• Randomized memory access patterns: Gather, Scatter
• Combination of several sorting operations: Split, Bitonic and Radix Sort
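
For readers unfamiliar with these primitives, the sketch below gives sequential reference semantics in Python. It is not the OpenCL code used in the talk; the function names are illustrative, and building radix sort from repeated splits is one common construction, assumed here for illustration. Bitonic sort is omitted for brevity.

```python
from functools import reduce


def map_prim(f, xs):
    """Map: apply f to every element independently."""
    return [f(x) for x in xs]


def reduce_prim(op, xs, init):
    """Reduce: fold all elements into one value with a binary operator."""
    return reduce(op, xs, init)


def scan_exclusive(xs):
    """Exclusive prefix sum: out[i] is the sum of xs[0..i-1]."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out


def gather(src, idx):
    """Gather: out[i] = src[idx[i]] (random reads)."""
    return [src[i] for i in idx]


def scatter(src, idx, size):
    """Scatter: out[idx[i]] = src[i] (random writes)."""
    out = [None] * size
    for value, i in zip(src, idx):
        out[i] = value
    return out


def split(xs, flags):
    """Stable partition: flag-0 elements first, then flag-1 (scan + scatter)."""
    zeros = [1 - f for f in flags]
    zero_pos = scan_exclusive(zeros)
    one_pos = scan_exclusive(flags)
    n_zero = sum(zeros)
    idx = [zero_pos[i] if flags[i] == 0 else n_zero + one_pos[i]
           for i in range(len(xs))]
    return scatter(xs, idx, len(xs))


def radix_sort(xs, bits=8):
    """LSD radix sort on non-negative ints below 2**bits, one split per bit."""
    for b in range(bits):
        xs = split(xs, [(x >> b) & 1 for x in xs])
    return xs
```

Map, reduce, and scan read memory in order, while gather and scatter touch it at arbitrary indices, which is why the slide groups them by access pattern.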

Micro-benchmarks (result figures, two slides)

Database query

TPC-H Q1

    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem
    where
        l_shipdate <= date '1998-12-01' - interval '[DELTA]' day (3)
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus;

TPC-H Q9

    select
        nation,
        o_year,
        sum(amount) as sum_profit
    from (
        select
            n_name as nation,
            extract(year from o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
            part, supplier, lineitem, partsupp, orders, nation
        where
            s_suppkey = l_suppkey
            and ps_suppkey = l_suppkey
            and ps_partkey = l_partkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and s_nationkey = n_nationkey
            and p_name like '%[COLOR]%'
        ) as profit
    group by
        nation,
        o_year
    order by
        nation,
        o_year desc;
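
Q1 is a single-table filter followed by a grouped aggregation, which is why it maps naturally onto map and reduce style primitives. The sketch below evaluates Q1 sequentially over in-memory rows to show the data flow. It is not the OpenCL UDF from the talk; the function name and the dict-per-row layout are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def tpch_q1(lineitems, delta_days):
    """Evaluate TPC-H Q1 over an iterable of lineitem rows (dicts)."""
    cutoff = date(1998, 12, 1) - timedelta(days=delta_days)

    # Per-group accumulators:
    # [sum_qty, sum_base_price, sum_disc_price, sum_charge, sum_discount, count]
    groups = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0, 0.0, 0])

    for row in lineitems:
        if row["l_shipdate"] > cutoff:          # where l_shipdate <= cutoff
            continue
        acc = groups[(row["l_returnflag"], row["l_linestatus"])]
        disc_price = row["l_extendedprice"] * (1 - row["l_discount"])
        acc[0] += row["l_quantity"]
        acc[1] += row["l_extendedprice"]
        acc[2] += disc_price
        acc[3] += disc_price * (1 + row["l_tax"])
        acc[4] += row["l_discount"]
        acc[5] += 1

    result = []
    for key in sorted(groups):                  # order by returnflag, linestatus
        qty, base, disc, charge, discount, n = groups[key]
        result.append({
            "l_returnflag": key[0], "l_linestatus": key[1],
            "sum_qty": qty, "sum_base_price": base,
            "sum_disc_price": disc, "sum_charge": charge,
            "avg_qty": qty / n, "avg_price": base / n,
            "avg_disc": discount / n, "count_order": n,
        })
    return result
```

On a GPU, the per-row arithmetic would typically run as a map and the per-group sums as a keyed reduction; the loop above merges both steps for readability.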

Database query (TPC-H Q1)
TPC-H Q1 UDF Benchmark Test: i-GPU vs. d-GPUs vs. CPU
(Two charts: Time (ms); Throughput (GB/s/W))
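
The throughput axis is labelled in GB/s/W, which presumably means data volume divided by run time and by power draw. A minimal sketch under that assumption follows; the function name is illustrative and the inputs in the usage line are hypothetical, not values read from the charts.

```python
def throughput_gb_per_s_per_w(n_bytes, time_ms, power_w):
    """Energy-normalised throughput: gigabytes per second per watt."""
    gigabytes = n_bytes / 1e9
    seconds = time_ms / 1e3
    return gigabytes / seconds / power_w


if __name__ == "__main__":
    # Hypothetical run: 1 GB processed in 100 ms on a 25 W device.
    print(throughput_gb_per_s_per_w(1e9, 100.0, 25.0))
```

A metric of this form rewards low power draw as much as short run time, so a slower device can still score well.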

Database query (TPC-H Q9) (figure)

Conclusion
• Examined query and primitive operation processing
• Used micro-benchmarks and more realistic data-analytics queries
• Found that i-GPU compute resources are weaker
• But i-GPUs excel in the speed of data access
• They behave as “free” resources
• They consume far less power

Q & A

R&D Openings in Munich
• Huawei’s European Research Center (ERC)
• 10+ openings
• Database System Architects and Software Engineers
• recruitment.erc@huawei.com
• http://career.huawei.com/career/en/i18n/toJobDetail.do?callMethod=toJobDetail&jobID=43263
