Memory Management Strategies in CPU/GPU Database Systems: A Survey
Iya Arefyeva, David Broneske, Gabriel Campero Durand, Marcus Pinnecke, Gunter Saake
Workgroup Databases and Software Engineering
Presenter: Marten Wallewein-Eising
Motivation
GPGPU is an essential technology nowadays:
● 3 of the top 5 systems on the TOP500 list (June 2018) are powered by GPUs
● 56% of the flops on the list come from GPU acceleration
● e.g. the Summit supercomputer at Oak Ridge
Arefyeva et al., Memory Management Strategies in CPU/GPU Database Systems: A Survey
Motivation
A modern server can have terabytes of RAM (e.g. up to 64 TB on the IBM Power System E980).
Top workstation GPUs, on the other hand, offer only 16 GB of memory:
● Nvidia Tesla V100
● AMD Radeon Vega Frontier Edition
(both released in June 2017)
Motivation
● GPU memory size is not enough to store all the data
● There is no shared memory between a CPU and a GPU
● Data has to be transferred over the PCI-E bus
● PCI-E bandwidth is increasing over the years, but it remains far below the memory bandwidth of the GPU itself (900 GB/s on the Nvidia Tesla V100)
● These limitations have to be considered until they are eliminated
Motivation
An ideal GPU memory management model should:
1. allow for GPU memory oversubscription
2. minimize the idle time of the GPU by overlapping transfers and computations
3. avoid unnecessary transfers
4. keep the data coherent
Strategies [1]
● Programmer-managed GPU memory:
  ○ divide-and-conquer approach
  ○ Unified Memory
● Pinned host memory:
  ○ mapped memory
  ○ Unified Virtual Addressing
Divide-and-conquer
Used by He et al. [2] and Wang et al. [3].
The data is split into chunks small enough to fit in GPU memory; the output is merged on the CPU.
● Serial processing: each chunk is transferred, processed, and copied back before the next chunk starts
● Asynchronous processing: transfers overlap with computation; highest speedup when transfer time is ≤ processing time
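The asynchronous variant can be sketched in CUDA as follows. This is a minimal illustration, not code from the survey: the kernel, names, and chunking scheme are assumptions, error handling is omitted, and the host buffers are assumed to be pinned (otherwise `cudaMemcpyAsync` falls back to synchronous behavior).

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the actual query operator.
__global__ void process(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Double-buffered divide-and-conquer: while chunk i is processed on one
// stream, chunk i+1 is transferred on the other.
void run_chunked(const float *host_in, float *host_out,
                 size_t total, size_t chunk) {
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_in[b],  chunk * sizeof(float));
        cudaMalloc(&d_out[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (size_t off = 0, c = 0; off < total; off += chunk, ++c) {
        int b = c % 2;                       // alternate buffers/streams
        size_t n = (total - off < chunk) ? total - off : chunk;
        cudaMemcpyAsync(d_in[b], host_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(d_in[b], d_out[b], n);
        cudaMemcpyAsync(host_out + off, d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();                 // wait for all chunks
    for (int b = 0; b < 2; ++b) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
}
```

The merge step on the CPU (e.g. concatenating or aggregating the per-chunk results) would follow the final synchronization.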
Mapped memory
Used in Bakkum et al. [4] and Yuan et al. [5].
Direct access to data in host memory with implicit data transfers:
1. A block of pinned (page-locked) host memory is allocated
2. Data is copied to this memory block
3. Accessed data is transferred to the GPU memory during processing
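The three steps above map onto a short CUDA sequence. This is a hedged sketch (the `process` kernel and sizes are illustrative, error handling omitted); on pre-UVA setups the device must additionally support host-memory mapping.

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *in, size_t n);  // assumed kernel

void mapped_memory_example(const float *src, size_t n) {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping (before first CUDA call)

    // 1. allocate pinned + mapped host memory
    float *h_data;
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);

    // 2. copy the data into the pinned block on the CPU
    for (size_t i = 0; i < n; ++i) h_data[i] = src[i];

    // 3. the kernel reads host memory through a device-visible alias;
    //    each access triggers an implicit transfer over PCI-E
    float *d_alias;
    cudaHostGetDevicePointer(&d_alias, h_data, 0);
    process<<<(unsigned)((n + 255) / 256), 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_data);
}
```

Because every access crosses the bus, this pays off mainly when the kernel touches each element at most once.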
Unified Virtual Addressing*
Used in Appuswamy et al. [6].
● Makes usage of mapped memory more convenient
● GPU and CPU share one address space: identical pointers for pinned host memory
● Data can be transferred directly between two GPUs
* supported starting from CUDA 4.0; requires a Fermi-class GPU with compute capability ≥ 2.0
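The convenience gains show up directly in the API. A minimal sketch (buffer names and sizes are illustrative, error handling omitted, and the peer copy assumes a machine with two GPUs):

```cuda
#include <cuda_runtime.h>

void uva_example() {
    const size_t N = 1 << 20;
    float *h_pinned, *d0, *d1;

    cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault);
    cudaSetDevice(0); cudaMalloc(&d0, N * sizeof(float));
    cudaSetDevice(1); cudaMalloc(&d1, N * sizeof(float));

    // Under UVA every pointer lives in one address space, so the runtime
    // infers the copy direction from the pointer values:
    cudaMemcpy(d0, h_pinned, N * sizeof(float), cudaMemcpyDefault);

    // Direct GPU-to-GPU transfer between devices 0 and 1:
    cudaMemcpyPeer(d1, 1, d0, 0, N * sizeof(float));

    cudaFree(d1); cudaSetDevice(0); cudaFree(d0); cudaFreeHost(h_pinned);
}
```

Under UVA a pinned host pointer is also valid on the device as-is, removing the explicit `cudaHostGetDevicePointer` step needed for plain mapped memory.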
Unified Memory*
Automatically manages device memory allocations and data transfers:
● a single pointer for both GPU and CPU memories
● migrates the data (accessed pages) between GPU and CPU
● makes programming easier, but does not always lead to performance improvements [7][8]
* supported starting from CUDA 6.0; requires compute capability ≥ 3.0
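A minimal Unified Memory sketch, showing the single-pointer model described above (the kernel and sizes are illustrative, error handling omitted):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, size_t n);  // assumed kernel

void unified_memory_example(size_t n) {
    float *data;
    cudaMallocManaged(&data, n * sizeof(float)); // one pointer for CPU and GPU

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;       // touched on the CPU

    process<<<(unsigned)((n + 255) / 256), 256>>>(data, n); // pages migrate to the GPU
    cudaDeviceSynchronize();  // required before the CPU may touch the data again

    float first = data[0];    // accessed pages migrate back to the CPU
    (void)first;
    cudaFree(data);
}
```

The migrations are transparent, which is exactly why performance can suffer: pages that ping-pong between devices are transferred repeatedly without the programmer noticing [7][8].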
Properties
(UVA = Unified Virtual Addressing, UM = Unified Memory)
Data location:
● GPU memory: divide-and-conquer, UM
● main memory: mapped memory, UVA, UM
Memory oversubscription:
● yes: divide-and-conquer, mapped memory, UVA
● no: UM
Comparison: Performance
[Diagram comparing divide-and-conquer, mapped memory, UVA, and UM along three performance properties: overlapped transfers and executions, no unnecessary data transfers, and fast memory accesses.]
Comparison: Convenience
[Diagram comparing divide-and-conquer, mapped memory, UVA, and UM along three convenience properties: no explicit allocations and transfers, unified address space, and no coherence problems.]
When to use?
Divide-and-conquer:
+ long processing time
+ data is big
− accessing a small portion of data (unnecessary transfers)
Mapped memory:
+ GPU operations are read-only
+ not all elements are accessed
− GPU changes data (requires synchronization)
− repeated accesses by GPU
UVA:
+ data is big
+ not all elements are accessed (for data in the main memory)
− repeated accesses by GPU (for data in the main memory)
UM:
+ repeated accesses by one device
+ data changed by both devices
− the same data often accessed by both devices
Thank you! Questions?
You are also welcome to send questions/suggestions to iya.arefyeva@ovgu.de
References
1. Kim, Y., Lee, J., Jo, J.E., Kim, J.: GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management. In: HPCA, IEEE (2014) 546-557
2. He, B., Lu, M., Yang, K., Fang, R., Govindaraju, N.K., Luo, Q., Sander, P.V.: Relational query coprocessing on graphics processors. TODS 34(4) (2009) 21
3. Wang, K., Zhang, K., Yuan, Y., Ma, S., Lee, R., Ding, X., Zhang, X.: Concurrent analytical query processing with GPUs. PVLDB 7(11) (2014) 1011-1022
4. Bakkum, P., Chakradhar, S.: Efficient data management for GPU databases. Technical report, High Performance Computing on Graphics Processing Units (2012)
5. Yuan, Y., Lee, R., Zhang, X.: The Yin and Yang of processing data warehousing queries on GPU devices. PVLDB 6(10) (2013) 817-828
6. Appuswamy, R., Karpathiotakis, M., Porobic, D., Ailamaki, A.: The case for heterogeneous HTAP. In: CIDR (2017)
7. Negrut, D., Serban, R., Li, A., Seidl, A.: Unified Memory in CUDA 6: A brief overview and related data access. Technical Report TR-2014-09, University of Wisconsin-Madison (2014)
8. Landaverde, R., Zhang, T., Coskun, A.K., Herbordt, M.: An investigation of unified memory access performance in CUDA. In: HPEC, IEEE (2014) 1-6