Memory Management Strategies in CPU/GPU Database Systems: A Survey
Iya Arefyeva, David Broneske, Gabriel Campero Durand, Marcus Pinnecke, Gunter Saake
Workgroup Databases and Software Engineering
Presenter: Marten Wallewein-Eising
Motivation
GPGPU is an essential technology nowadays:
● 3 of the top 5 systems on the TOP500 list (June 2018) are powered by GPUs
● 56% of the flops on the list come from GPU acceleration
● e.g. the Summit supercomputer at Oak Ridge
Arefyeva et al., Memory Management Strategies in CPU/GPU Database Systems: A Survey
Motivation
A modern server can have terabytes of RAM (e.g. up to 64 TB on the IBM Power System E980).
Top workstation GPUs, on the other hand, offer only 16 GB of memory:
● Nvidia Tesla V100
● AMD Radeon Vega Frontier Edition
(both released in June 2017)
Motivation
● GPU memory size is not enough to store all the data
● There is no shared memory between a CPU and a GPU
● Data has to be transferred over the PCI-E bus
● PCI-E bandwidth is increasing over the years, but it remains far below the memory bandwidth of the GPU itself (900 GB/s on the Nvidia Tesla V100)
● These limitations have to be considered until they are eliminated
Motivation
An ideal GPU memory management model should:
1. allow for GPU memory oversubscription
2. minimize the idle time of the GPU by overlapping transfers and computations
3. avoid unnecessary transfers
4. keep the data coherent
Strategies [1]
● Programmer-managed GPU memory:
  ○ divide-and-conquer approach
  ○ Unified Memory
● Pinned host memory:
  ○ mapped memory
  ○ Unified Virtual Addressing
Divide-and-conquer
Used by He et al. [2] and Wang et al. [3].
The data is split into chunks small enough to fit in GPU memory; the output is merged on the CPU.
● Serial processing: each chunk is transferred, processed, and copied back before the next chunk starts
● Asynchronous processing: transfers overlap with computation; highest speedup when transfer time is ≤ processing time
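The asynchronous variant can be sketched in CUDA as follows. This is a minimal illustration, not code from the survey: the kernel, names, and chunking scheme are assumptions, error handling is omitted, and the host buffers are assumed to be pinned (otherwise `cudaMemcpyAsync` falls back to synchronous behavior).

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the actual query operator.
__global__ void process(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Double-buffered divide-and-conquer: while chunk i is processed on one
// stream, chunk i+1 is transferred on the other.
void run_chunked(const float *host_in, float *host_out,
                 size_t total, size_t chunk) {
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_in[b],  chunk * sizeof(float));
        cudaMalloc(&d_out[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (size_t off = 0, c = 0; off < total; off += chunk, ++c) {
        int b = c % 2;                       // alternate buffers/streams
        size_t n = (total - off < chunk) ? total - off : chunk;
        cudaMemcpyAsync(d_in[b], host_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(host_out + off, d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();                 // wait for all chunks
    for (int b = 0; b < 2; ++b) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
}
```

The merge step on the CPU (e.g. concatenating or aggregating the per-chunk results) would follow the final synchronization.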
Mapped memory
Used in Bakkum et al. [4] and Yuan et al. [5].
Direct access to data in host memory with implicit data transfers:
1. A block of pinned (page-locked) host memory is allocated
2. Data is copied to this memory block
3. Accessed data is transferred to the GPU memory during processing
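The three steps above map onto a short CUDA sequence. This is a hedged sketch (the `process` kernel and sizes are illustrative, error handling omitted); on pre-UVA setups the device must additionally support host-memory mapping.

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *in, size_t n);  // assumed kernel

void mapped_memory_example(const float *src, size_t n) {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping (before first CUDA call)

    // 1. allocate pinned + mapped host memory
    float *h_data;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);

    // 2. copy the data into the pinned block on the CPU
    for (size_t i = 0; i < n; ++i) h_data[i] = src[i];

    // 3. the kernel reads host memory through a device-visible alias;
    //    each access triggers an implicit transfer over PCI-E
    float *d_alias;
    cudaHostGetDevicePointer(&d_alias, h_data, 0);
    process<<<(unsigned)((n + 255) / 256), 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
}
```

Because every access crosses the bus, this pays off mainly when the kernel touches each element at most once.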
Unified Virtual Addressing*
Used in Appuswamy et al. [6].
● Makes usage of mapped memory more convenient
● GPU and CPU share one address space: identical pointers for pinned host memory
● Data can be transferred directly between two GPUs
* supported starting from CUDA 4.0; requires a Fermi-class GPU with compute capability ≥ 2.0
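The convenience gains show up directly in the API. A minimal sketch (buffer names and sizes are illustrative, error handling omitted, and the peer copy assumes a machine with two GPUs):

```cuda
#include <cuda_runtime.h>

void uva_example() {
    const size_t N = 1 << 20;
    float *h_pinned, *d0, *d1;

    cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault);
    cudaSetDevice(0); cudaMalloc(&d0, N * sizeof(float));
    cudaSetDevice(1); cudaMalloc(&d1, N * sizeof(float));

    // Under UVA every pointer lives in one address space, so the runtime
    // infers the copy direction from the pointer values:
    cudaMemcpy(d0, h_pinned, N * sizeof(float), cudaMemcpyDefault);

    // Direct GPU-to-GPU transfer between devices 0 and 1:
    cudaMemcpyPeer(d1, 1, d0, 0, N * sizeof(float));

    cudaFree(d1); cudaSetDevice(0); cudaFree(d0); cudaFreeHost(h_pinned);
}
```

Under UVA a pinned host pointer is also valid on the device as-is, removing the explicit `cudaHostGetDevicePointer` step needed for plain mapped memory.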
Unified Memory*
Automatically manages device memory allocations and data transfers:
● a single pointer for both GPU and CPU memories
● migrates the data (accessed pages) between GPU and CPU
● makes programming easier, but does not always lead to performance improvements [7][8]
* supported starting from CUDA 6.0; requires compute capability ≥ 3.0
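A minimal Unified Memory sketch, showing the single-pointer model described above (the kernel and sizes are illustrative, error handling omitted):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n);  // assumed kernel

void unified_memory_example(size_t n) {
    float *data;
    cudaMallocManaged(&data, n * sizeof(float)); // one pointer for CPU and GPU

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;       // touched on the CPU

    process<<<(unsigned)((n + 255) / 256), 256>>>(data, n); // pages migrate to the GPU
    cudaDeviceSynchronize();  // required before the CPU may touch the data again

    float first = data[0];    // accessed pages migrate back to the CPU
    (void)first;
    cudaFree(data);
}
```

The migrations are transparent, which is exactly why performance can suffer: pages that ping-pong between devices are transferred repeatedly without the programmer noticing [7][8].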
Properties
(UVA = Unified Virtual Addressing, UM = Unified Memory)
Data location:
● GPU memory: divide-and-conquer, UM
● main memory: mapped memory, UVA, UM
Memory oversubscription:
● yes: divide-and-conquer, mapped memory, UVA
● no: UM
Comparison: Performance
[Diagram comparing divide-and-conquer, mapped memory, UVA, and UM along three performance properties: overlapped transfers and executions, no unnecessary data transfers, and fast memory accesses.]
Comparison: Convenience
[Diagram comparing divide-and-conquer, mapped memory, UVA, and UM along three convenience properties: no explicit allocations and transfers, unified address space, and no coherence problems.]
When to use?
Divide-and-conquer:
+ long processing time
+ data is big
− accessing a small portion of data (unnecessary transfers)
Mapped memory:
+ GPU operations are read-only
+ not all elements are accessed
− GPU changes data (requires synchronization)
− repeated accesses by GPU
UVA:
+ data is big
+ not all elements are accessed (for data in the main memory)
− repeated accesses by GPU (for data in the main memory)
UM:
+ repeated accesses by one device
+ data changed by both devices
− the same data often accessed by both devices
Thank you! Questions?
You are also welcome to send questions/suggestions to iya.arefyeva@ovgu.de
References
1. Kim, Y., Lee, J., Jo, J.E., Kim, J.: GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. In: HPCA, IEEE (2014) 546-557
2. He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational query coprocessing on graphics processors. TODS 34(4) (2009) 21
3. Wang, K., Zhang, K., Yuan, Y., Ma, S., Lee, R., Ding, X., Zhang, X.: Concurrent analytical query processing with GPUs. PVLDB 7(11) (2014) 1011-1022
4. Bakkum, P., Chakradhar, S.: Efficient data management for GPU databases. Technical report, High Performance Computing on Graphics Processing Units (2012)
5. Yuan, Y., Lee, R., Zhang, X.: The Yin and Yang of processing data warehousing queries on GPU devices. PVLDB 6(10) (2013) 817-828
6. Appuswamy, R., Karpathiotakis, M., Porobic, D., Ailamaki, A.: The case for heterogeneous HTAP. In: CIDR (2017)
7. Negrut, D., Serban, R., Li, A., Seidl, A.: Unified Memory in CUDA 6: A brief overview and related data access. Technical Report TR-2014-09, University of Wisconsin-Madison (2014)
8. Landaverde, R., Zhang, T., Coskun, A.K., Herbordt, M.: An investigation of unified memory access performance in CUDA. In: HPEC, IEEE (2014) 1-6