RAPIDS CUDA DataFrame Internals for C++ Developers (S91043)
Jake Hemstad, Developer Technology Engineer, NVIDIA
GTC 2019 | 03/20/19
What is RAPIDS cuDF?
Open-Source CUDA DataFrame

● GPU-accelerated DataFrames
● Data science operations: filter, sort, join, groupby, ...
● High-level, Python productivity (Pandas-like)
● Bare-metal, CUDA/C++ performance

[Benchmark chart: cuDF file read and data preparation vs. Apache Spark. Time in seconds; shorter is better. 200 GB CSV dataset; data preparation includes joins and variable transformations. GPU: 5x DGX-1 on an InfiniBand network. CPU nodes: 61 GiB of memory, 8 vCPUs, 64-bit platform, Apache Spark.]

github.com/rapidsai/cudf
rapids.ai
Who This Talk is For

You want to learn about:
● libcudf: cuDF's underlying C++14 library
● How to use libcudf in your applications
● CUDA-accelerated data science algorithms
● How to contribute to libcudf

[Stack diagram: cuDF (Pandas-like) → Cython → libcudf → Thrust / CUB / Jitify / RAPIDS Memory Manager (RMM) → CUDA]
What is a DataFrame?
CUDA DataFrame

● Think spreadsheet: equal-length columns of different types
● How to store it in memory? cuDF uses Apache Arrow [1]
● Contiguous, column-major data representation

Mortgage ID | Pay Date   | Amount($)
101         | 12/18/2018 | 1029.30
102         | 12/21/2018 | 1429.31
103         | 12/14/2018 | 1289.27
101         | 01/15/2018 | 1104.59
102         | 01/17/2018 | 1457.15
103         | NULL       | NULL

[1] https://arrow.apache.org/docs/memory_layout.html
Apache Arrow Memory Format
Enabling Interoperability

[Diagram: cuDF, cuML, cuGraph, and cuDNN sharing the Apache Arrow memory format]
Column Representation

● libcudf is column-centric: all operations are defined in terms of operations on columns
● Type-erased data (void*) allows interoperability with other languages and type systems
● Arrow enables efficient, shallow-copy data sharing across frameworks/languages

struct column {
  void* data;      // contiguous buffer
  int size;        // number of elements
  DType type;      // type indicator
  uint32_t* mask;  // null bitmask
};

enum DType {
  INT,    // int value
  FLOAT,  // float value
  DATE,   // int64_t ms since epoch
  ...
};
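A hypothetical helper (make_int_column is not part of libcudf) illustrating what "shallow copy" means here: wrapping an existing device buffer copies only pointers, so another Arrow-aware framework can use the same memory without any transfer.

// Hypothetical helper, not libcudf API: view an existing device
// buffer as an INT column. No data is copied, only pointers, so
// the buffer can be shared zero-copy with other Arrow consumers.
column make_int_column(void* device_buffer, int num_elements,
                       uint32_t* null_mask) {
  return column{device_buffer, num_elements, INT, null_mask};
}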
Null Bitmask
How To Represent Missing Data

● Any element can be NULL -> undefined
● Different from NaN -> defined, but invalid
● NULL values are represented in a bitmask: 1 bit per element
  ● 0 == NULL
  ● 1 == not NULL
● Least-significant bit ordering

values  = [0, 1, null, NaN, null, 3]
bitmask = [0 0 1 0 1 0 1 1] = 0x2B
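A minimal sketch (not libcudf's actual API) of reading such a bitmask, assuming bits are packed least-significant-bit first into 32-bit mask words as above:

#include <cstdint>

// Returns true if element `i` is valid (not NULL).
// Bit i of the mask is 1 for valid, 0 for NULL.
bool is_valid(uint32_t const* mask, int i) {
  return mask[i / 32] & (1u << (i % 32));
}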
Column Example
Apache Arrow Memory Layout

(The mortgage table from above as columns; note LSB order in the bitmasks.)

Mortgage ID
  data    = [101, 102, 103, 101, 102, 103]
  size    = 6
  type    = INT
  bitmask = [0x3F] = [0 0 1 1 1 1 1 1]

Pay Date
  data    = [1545091200000, 1545350400000, 1544745600000,
             1514764800000, 1516147200000, *garbage*]
  size    = 6
  type    = DATE
  bitmask = [0x1F] = [0 0 0 1 1 1 1 1]

Amount
  data    = [1029.30, 1429.31, 1289.27, 1104.59, 1457.15, *garbage*]
  size    = 6
  type    = FLOAT
  bitmask = [0x1F] = [0 0 0 1 1 1 1 1]
libcudf Operations

● All functions act on one or more columns
● Operations include: sort, join, groupby, filtering, transpose, etc.
● Input columns are generally immutable

void some_function(cudf::column const* input,
                   cudf::column* output,
                   args...) {
  // Do something with input
  // Produce output
}
Example Operation: Sort

in->data is type-erased:
1. Deduce T from the enum in->type
2. Call a function template with T
3. Cast in->data to T*
4. Perform thrust::sort with typed_data

void sort(cudf::column* in) {
  switch (in->type) {
    case INT:   typed_sort<int32_t>(in); break;
    case FLOAT: typed_sort<float>(in);   break;
    case DATE:  typed_sort<int64_t>(in); break;
    ...
  }
}

template <typename T>
void typed_sort(cudf::column* in) {
  T* typed_data{ static_cast<T*>(in->data) };
  thrust::sort(thrust::device, typed_data, typed_data + in->size);
}

This is a common pattern in libcudf. Problem: duplicated switches are difficult to maintain and error-prone.
Type Dispatching
libcudf's Solution

Centralize and abstract the switch: type_dispatcher
1. Maps the type enum to T
2. Invokes the functor F<T>()

template <typename Functor>
auto type_dispatcher(DType type, Functor F) {
  switch (type) {
    case INT:   return F<int32_t>();
    case FLOAT: return F<float>();
    case DATE:  return F<int64_t>();
    ...
  }
}

Note: the syntax F<T>() is abbreviated for clarity; the correct syntax is F::template operator()<T>(). libcudf's type_dispatcher also supports functors with arguments.
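Spelled out with full syntax and argument forwarding, a minimal working dispatcher might look like the following sketch (simplified; libcudf's actual type_dispatcher covers many more types and handles unknown enums):

#include <cstdint>
#include <utility>  // std::forward

template <typename Functor, typename... Ts>
decltype(auto) type_dispatcher(DType type, Functor f, Ts&&... args) {
  switch (type) {
    case INT:
      return f.template operator()<int32_t>(std::forward<Ts>(args)...);
    case FLOAT:
      return f.template operator()<float>(std::forward<Ts>(args)...);
    case DATE:
      return f.template operator()<int64_t>(std::forward<Ts>(args)...);
    // ... remaining DType cases
  }
}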
Type Dispatching
Sort Revisited

● Define a functor with an operator() template
● type_dispatcher maps the type enum to T and invokes F<T>()
● sort_functor casts with T
● Perform thrust::sort on typed_data

// sort.cu
#include <type_dispatcher.cuh>

struct sort_functor {
  cudf::column _col;
  sort_functor(cudf::column col) : _col{col} {}

  template <typename T>
  void operator()() {
    T* typed_data = static_cast<T*>(_col.data);
    thrust::sort(typed_data, typed_data + _col.size);
  }
};

void sort(cudf::column* col) {
  type_dispatcher(col->type, sort_functor{*col});
}
Type Dispatching
Combinatorial Type Explosion

● Binary operations between two columns are common (e.g., sum, minus, div, etc.): out = lhs op rhs
● The out, lhs, and rhs types are all independent
● 11+ types, 14+ ops

void binary_op(cudf::column* out,
               cudf::column* lhs,
               cudf::column* rhs,
               Op op) {
  // out, lhs, rhs types are all independent
  // Need to instantiate code for all combinations
  // Repeat for every `op`
}

Problem: 11³ x 14 ≈ 18,600 instantiations
● 1+ hour to compile just the binary operations
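To see where the cubic term comes from, here is a schematic sketch (a hypothetical kernel, not libcudf's actual implementation): with static dispatch, one kernel must be instantiated per (OutT, LhsT, RhsT) combination, and again for every operator.

// Schematic sketch: each distinct (OutT, LhsT, RhsT, Op) combination
// forces a separate instantiation, hence (#types)^3 x (#ops)
// kernels overall.
template <typename OutT, typename LhsT, typename RhsT, typename Op>
__global__ void binary_op_kernel(OutT* out, LhsT const* lhs,
                                 RhsT const* rhs, int size, Op op) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) {
    out[i] = op(static_cast<OutT>(lhs[i]), static_cast<OutT>(rhs[i]));
  }
}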
Solution: JIT Compilation with Jitify
Simplify CUDA Run-time Compilation

● Compiles a specialized kernel string at run time
● The compiled kernel is cached for reuse
● libcudf uses Jitify for binary operations
  ● ~300 ms overhead to compile a new kernel
  ● ~150 ms to reuse a kernel with new types
  ● Trivial overhead to reuse from the cache

https://github.com/NVIDIA/jitify

const char* program_source =
    "my_program\n"
    "template<int N, typename T>\n"
    "__global__\n"
    "void my_kernel(T* data) {\n"
    "    T data0 = data[0];\n"
    "    for( int i=0; i<N-1; ++i ) {\n"
    "        data[0] *= data0;\n"
    "    }\n"
    "}\n";

static jitify::JitCache kernel_cache;
jitify::Program program = kernel_cache.program(program_source);
dim3 grid(1);
dim3 block(1);
using jitify::reflection::type_of;
program.kernel("my_kernel")
    .instantiate(3, type_of(*data))  // instantiates the template
    .configure(grid, block)
    .launch(data);
Recap
libcudf so far...

● Apache Arrow memory layout
● Column-centric operations
● Type-erased data
● type_dispatcher to reconstruct the type
● Runtime compilation with Jitify

Many operations require temporary memory allocations, and most cuDF ops are not performed in place: many column allocations/deallocations.

struct sort_functor {
  cudf::column _col;
  sort_functor(cudf::column col) : _col{col} {}

  template <typename T>
  void operator()() {
    T* typed_data = static_cast<T*>(_col.data);
    // Allocates temporary memory!
    thrust::sort(thrust::device, typed_data, typed_data + _col.size);
  }
};

void sort(cudf::column* col) {
  type_dispatcher(col->type, sort_functor{*col});
}
Memory Management
Memory Management Overhead
Example: cuDF Mortgage Workflow

Data cleanup and feature engineering:
1. Read CSV files into DataFrames
2. Joins, groupbys, unary/binary ops
3. Create a DMatrix for XGBoost

cuDF ops are not in-place => frequent malloc/free. 88% of cuDF time is spent in CUDA memory management!
CUDA Memory Allocation
cudaMalloc / cudaFree: Why are they expensive?

cudaMalloc(&buffer, size_in_bytes);
cudaFree(buffer);

● Synchronous: blocks the device
● cudaFree scrubs memory for security
● Peer access (GPU-to-GPU direct memory access): cudaMalloc creates peer mappings, which scales O(#GPUs²)
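A rough micro-benchmark sketch (numbers will vary by GPU and driver) to observe this cost directly: each cudaMalloc/cudaFree pair synchronizes the device, which is what makes per-operation allocation so expensive in a DataFrame workload.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
  constexpr int N = 1000;
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < N; ++i) {
    void* p = nullptr;
    cudaMalloc(&p, 1 << 20);  // 1 MiB
    cudaFree(p);              // synchronizes and scrubs
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::printf("avg alloc+free: %.1f us\n",
      std::chrono::duration<double, std::micro>(end - start).count() / N);
  return 0;
}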
RMM Memory Pool Allocation
https://github.com/rapidsai/rmm

● Uses one large cudaMalloc allocation as a memory pool
● Custom memory management within the pool
● Streams enable asynchronous malloc/free
● RMM currently uses CNMem as its sub-allocator: https://github.com/NVIDIA/cnmem
● RMM is standalone and free to use in your own projects!

[Diagram: GPU memory holding a cudaMalloc'd memory pool, with previously allocated blocks, bufferA, and bufferB sub-allocated from it]
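A minimal illustration of the pool idea (not RMM's or CNMem's actual implementation): grab one big cudaMalloc up front, then hand out aligned sub-allocations by bumping an offset. Frees are omitted here; a real sub-allocator tracks and coalesces free blocks.

#include <cuda_runtime.h>
#include <cstddef>

class BumpPool {
  char* base_ = nullptr;
  std::size_t capacity_;
  std::size_t offset_ = 0;

 public:
  explicit BumpPool(std::size_t bytes) : capacity_(bytes) {
    cudaMalloc(&base_, bytes);  // the only real device allocation
  }
  ~BumpPool() { cudaFree(base_); }

  void* allocate(std::size_t bytes) {
    // Round up to 256-byte alignment, as CUDA allocations are.
    std::size_t aligned = (bytes + 255) & ~std::size_t{255};
    if (offset_ + aligned > capacity_) return nullptr;  // pool exhausted
    void* p = base_ + offset_;
    offset_ += aligned;
    return p;  // no cudaMalloc, no device-wide synchronization
  }
};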
RAPIDS Memory Manager (RMM)
Drop-in Allocation Replacement

// C/C++: asynchronous, stream-ordered allocation
RMM_ALLOC(&buffer, size_in_bytes, stream_id);
RMM_FREE(buffer, stream_id);

// Thrust
rmm::device_vector<int> dvec(size);
thrust::sort(rmm::exec_policy(stream)->on(stream), ...);

# Python
dev_ones = rmm.device_array(np.ones(count))
dev_twos = rmm.device_array_like(dev_ones)
# also rmm.to_device(), rmm.auto_device(), etc.
RMM Raw Performance

[Chart: RMM allocation is 1000x faster than cudaMalloc/cudaFree in a microbenchmark]
RMM: 10x Performance on RAPIDS
Mortgage Workflow on 16x V100 GPUs of DGX-2

Allocator                     | Time in malloc/free | Total ETL time | % time
cudaMalloc/cudaFree (no pool) | 486 s               | 550 s          | 88.3%
rmmAlloc/rmmFree (pool)       | 0.088 s             | 55 s (10x)     | 0.16%

● cudaMalloc/cudaFree overhead gets worse with more GPUs
● RMM is valuable even on single-GPU runs, where the allocation fraction is "only" 14-15%
● The RMM benefit is a combination of low-overhead sub-allocation and reduced synchronization
Deep Dive