UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING


  1. UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM

  2. RAPIDS: CUDA-accelerated Data Science Libraries. Stack diagram: Python and DL frameworks and Dask / Spark on top of the RAPIDS libraries (cuGraph, cuDF, cuML), with cuDNN, CUDA, and Apache Arrow on GPU memory underneath.

  3. MORTGAGE PIPELINE: ETL (https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb). Pipeline: CSV → read CSV → filter DF → join → groupby → Arrow.

  4. MORTGAGE PIPELINE: PREP + ML (https://github.com/rapidsai/notebooks/blob/master/mortgage/E2E.ipynb). Pipeline: Arrow tables → concat DF → convert to DMatrix → XGBoost.

  5. GTC EU KEYNOTE RESULTS ON DGX-1. Chart: mortgage workflow time breakdown on DGX-1 (seconds), split into ETL, PREP, and ML.

  6. MAXIMUM MEMORY USAGE ON DGX-1. Chart: per-GPU memory usage (GB) for GPU IDs 1-8, against the Tesla V100 limit of 32 GB.

  7. ETL INPUT (https://rapidsai.github.io/demos/datasets/mortgage-data). Diagram of the CSV input files: the original input set (112 quarters, ~2-3 GB files) and a pre-split version (240 files, ~1 GB each).

  8. CAN WE AVOID INPUT SPLITTING? Charts: GPU memory usage (GB) during ETL; with the pre-split input (112 parts) usage stays under the 32 GB Tesla V100 limit, while the original dataset crashes with OOM.

  9. ML INPUT: some number of quarters is used for ML training. Pipeline: Arrow tables → concat DF → convert to DMatrix → XGBoost.

  10. CAN WE TRAIN ON MORE DATA? Charts: GPU memory usage (GB) during PREP; with 20 of the 112 parts usage stays under the 32 GB Tesla V100 limit, while with 28 of the 112 parts the run crashes with OOM.

  11. HOW MEMORY IS MANAGED IN RAPIDS

  12. RAPIDS MEMORY MANAGER (https://github.com/rapidsai/rmm)
      RAPIDS Memory Manager (RMM) is:
      • A replacement allocator for CUDA device memory
      • A pool allocator that makes CUDA device memory allocation faster and asynchronous
      • A central place for all device memory allocations in cuDF and other RAPIDS libraries

  13. WHY DO WE NEED MEMORY POOLS
      cudaMalloc/cudaFree are synchronous:
      • cudaMalloc(&buffer, size_in_bytes); and cudaFree(buffer); block the device
      cudaMalloc/cudaFree are expensive:
      • cudaFree must zero memory for security
      • cudaMalloc creates peer mappings for all GPUs
      Using the cnmem memory pool improves RAPIDS ETL time by 10x.
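
      The effect of pooling can be illustrated with a toy sub-allocator (a minimal sketch only, not RMM or cnmem code; the struct and sizes below are made up for illustration): one large cudaMalloc is paid once up front, and later allocations are served by bumping an offset, so the hot path never hits the synchronous cudaMalloc/cudaFree.

        // Toy bump-pointer pool: one up-front cudaMalloc, cheap sub-allocations afterwards.
        #include <cuda_runtime.h>
        #include <cstddef>
        #include <cstdio>

        struct ToyPool {
            char*  base     = nullptr;
            size_t capacity = 0;
            size_t offset   = 0;

            explicit ToyPool(size_t bytes) : capacity(bytes) {
                // The one expensive, device-synchronizing call.
                cudaMalloc(reinterpret_cast<void**>(&base), bytes);
            }
            ~ToyPool() { cudaFree(base); }          // the one expensive free, at teardown

            void* alloc(size_t bytes) {             // no device synchronization on the hot path
                size_t aligned = (bytes + 255) & ~size_t(255);
                if (offset + aligned > capacity) return nullptr;
                void* p = base + offset;
                offset += aligned;
                return p;
            }
        };

        int main() {
            ToyPool pool(size_t(1) << 28);          // 256 MB pool
            void* a = pool.alloc(size_t(1) << 20);  // served from the pool, not cudaMalloc
            void* b = pool.alloc(size_t(1) << 22);
            printf("a=%p b=%p\n", a, b);
            return 0;
        }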

  14. RAPIDS MEMORY MANAGER (RMM): Fast, Asynchronous Device Memory Management
      C/C++:
        RMM_ALLOC(&buffer, size_in_bytes, stream_id);
        RMM_FREE(buffer, stream_id);
      Python (drop-in replacement for the Numba API):
        dev_ones = rmm.device_array(np.ones(count))
        dev_twos = rmm.device_array_like(dev_ones)
        # also rmm.to_device(), rmm.auto_device(), etc.
      Thrust (device vector and execution policies):
        #include <rmm_thrust_allocator.h>
        rmm::device_vector<int> dvec(size);
        thrust::sort(rmm::exec_policy(stream)->on(stream), ...);

  15. MANAGING MEMORY IN THE E2E PIPELINE. Diagram: at this point all ETL processing is done and the results are stored in Arrow memory; some memory-management steps in the pipeline are a performance optimization, others are required to avoid OOM.

  16. KEY MEMORY MANAGEMENT QUESTIONS
      • Can we make memory management easier?
      • Can we avoid artificial pre-processing of input data?
      • Can we train on larger datasets?

  17. SOLUTION: UNIFIED MEMORY. Diagram: successive cudaMallocManaged allocations take GPU memory from empty to partially occupied to fully occupied; with oversubscription, pages are evicted from GPU memory to CPU memory.
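
      A minimal standalone sketch of oversubscription (not from the talk; the sizes and the touch kernel are illustrative): a managed allocation larger than physical GPU memory succeeds, pages migrate to the GPU on first touch, and the driver evicts older pages to CPU memory as the GPU fills up.

        #include <cuda_runtime.h>
        #include <cstdio>

        __global__ void touch(char* p, size_t n) {
            // Grid-stride loop: first touch migrates each page to the GPU;
            // once GPU memory is full, older pages are evicted to CPU memory.
            for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
                 i < n; i += (size_t)gridDim.x * blockDim.x)
                p[i] = 1;
        }

        int main() {
            size_t free_b = 0, total_b = 0;
            cudaMemGetInfo(&free_b, &total_b);
            size_t n = total_b + total_b / 2;       // deliberately ~1.5x physical GPU memory
            char* p = nullptr;
            // A plain cudaMalloc of this size would fail outright.
            if (cudaMallocManaged(reinterpret_cast<void**>(&p), n) != cudaSuccess) return 1;
            touch<<<1024, 256>>>(p, n);
            cudaDeviceSynchronize();
            printf("touched %zu bytes through unified memory\n", n);
            cudaFree(p);
            return 0;
        }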

  18. HOW TO USE UNIFIED MEMORY IN CUDF
      Python:
        from librmm_cffi import librmm_config as rmm_cfg
        rmm_cfg.use_pool_allocator = True   # default is False
        rmm_cfg.use_managed_memory = True   # default is False

  19. IMPLEMENTATION DETAILS
      Regular RMM allocation:
        if (rmm::Manager::usePoolAllocator()) {
            RMM_CHECK(rmm::Manager::getInstance().registerStream(stream));
            RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
        } else if (rmm::Manager::useManagedMemory())
            RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
        else
            RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));
      Pool allocator (CNMEM):
        if (mFlags & CNMEM_FLAGS_MANAGED) {
            CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
            CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
            CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
        } else {
            CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
            CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
        }

  20. 1. UNSPLIT DATASET “JUST WORKS”. Charts: GPU memory usage (GB) during ETL on the original (unsplit) dataset; with cudaMalloc the run crashes with OOM at the 32 GB Tesla V100 limit, while with cudaMallocManaged both memory used and pool size grow past 32 GB and the run completes.

  21. 2. SPEED-UP ON CONVERSION. Chart: DGX-1 time (s) for ETL, PREP, and ML with 20 quarters, cudaMalloc vs. cudaMallocManaged; PREP drops from 46 s to 36 s, a 25% speed-up on PREP!

  22. 3. LARGER ML TRAINING SET. Chart: DGX-1 time (s) for ETL, PREP, and ML with 20 and 28 quarters; with cudaMalloc the 28-quarter run fails with OOM, while with cudaMallocManaged it completes.

  23. UNIFIED MEMORY GOTCHAS
      1. UVM doesn't work with CUDA IPC – be careful when sharing data between processes.
         Workaround: a separate (small) cudaMalloc pool for communication buffers (see the sketch below).
         In the future this will work transparently with Linux HMM.
      2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly.
         Capture Nsight or nvprof profiles to check eviction traffic.
         In the future RMM may show warnings about this.
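
      A hedged sketch of the CUDA IPC workaround in gotcha 1 (illustrative only, not the RAPIDS implementation; buffer sizes are made up): bulk data stays in managed memory, while buffers that must be shared with another process come from a small plain cudaMalloc allocation, which can be exported with cudaIpcGetMemHandle.

        #include <cuda_runtime.h>
        #include <cstdio>

        int main() {
            // Bulk working set: unified memory, may oversubscribe the GPU.
            char* bulk = nullptr;
            cudaMallocManaged(reinterpret_cast<void**>(&bulk), size_t(1) << 30);

            // Small communication buffer: plain device memory, so it can be shared via IPC.
            char* comm = nullptr;
            cudaMalloc(reinterpret_cast<void**>(&comm), size_t(1) << 20);

            // Exporting an IPC handle works for the cudaMalloc buffer;
            // it would fail for the managed allocation above.
            cudaIpcMemHandle_t handle;
            cudaError_t err = cudaIpcGetMemHandle(&handle, comm);
            printf("IPC handle for device buffer: %s\n", cudaGetErrorString(err));

            cudaFree(comm);
            cudaFree(bulk);
            return 0;
        }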

  24. RECAP
      Just to run the full pipeline on the GPU you need to:
      • carefully partition input data
      • adjust memory pool options throughout the pipeline
      • limit the training size to fit in memory
      Unified Memory:
      • makes life easier for data scientists – less tweaking!
      • improves performance – sometimes it's faster to allocate less often & oversubscribe
      • enables easy experiments with larger datasets

  25. MEMORY MANAGEMENT IN THE FUTURE. Diagram: OmniSci, BlazingDB and other databases, XGBoost, cuDNN, cuDF, cuML, and the next big thing.
      Contribute to RAPIDS: https://github.com/rapidsai/cudf
      Contribute to RMM: https://github.com/rapidsai/rmm

  26. UNIFIED MEMORY FOR DEEP LEARNING

  27. FROM ANALYTICS TO DEEP LEARNING. Diagram: Data Preparation, Machine Learning, and Deep Learning.

  28. PYTORCH INTEGRATION
      PyTorch uses a caching allocator to manage GPU memory:
      • Small allocations are distributed from a fixed-size buffer (for example, 1 MB)
      • Large allocations are dedicated cudaMalloc's
      Trivial change (see the sketch below):
      • Replace cudaMalloc with cudaMallocManaged
      • Immediately call cudaMemPrefetchAsync to allocate pages on the GPU; otherwise cuDNN may select sub-optimal kernels
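
      A minimal sketch of the "trivial change" described above (not PyTorch's actual allocator code; the function name is made up): the allocation path switches from cudaMalloc to cudaMallocManaged and immediately prefetches the pages to the current GPU, so downstream libraries such as cuDNN see device-resident memory.

        #include <cuda_runtime.h>

        // Hypothetical allocation hook illustrating the change inside a caching allocator.
        cudaError_t managed_device_alloc(void** ptr, size_t size, cudaStream_t stream) {
            int device = 0;
            cudaGetDevice(&device);

            cudaError_t err = cudaMallocManaged(ptr, size);   // was: cudaMalloc(ptr, size)
            if (err != cudaSuccess) return err;

            // Populate the pages on the GPU right away; left unpopulated or CPU-resident,
            // cuDNN may select sub-optimal kernels.
            return cudaMemPrefetchAsync(*ptr, size, device, stream);
        }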

  29. PYTORCH ALLOCATOR VS RMM
      PyTorch Caching Allocator:
      • Memory pool to avoid synchronization on malloc/free
      • Directly uses CUDA APIs for memory allocations
      • Pool size not fixed
      • Specific to the PyTorch C++ library
      RMM:
      • Memory pool to avoid synchronization on malloc/free
      • Uses cnmem for memory allocation and management
      • Reserves half the available GPU memory for the pool
      • Re-usable across projects, with interfaces for various languages

  30. WORKLOADS: Image Models. ResNet-1001, DenseNet-264, and VNet; block diagram: BN-ReLU-Conv 1x1, BN-ReLU-Conv 1x1, BN-ReLU-Conv 3x3.

  31. WORKLOADS: Language Models. Word language modelling; network: Embedding → LSTM → FC → Softmax → Loss.
      Dictionary size = 33278, embedding size = 256, LSTM units = 256, back-propagation through time = 1408 and 2800.

  32. WORKLOADS: Baseline Training Performance on V100-32GB (optimal batch size selected for high throughput)
                          FP16                       FP32
      Model               Batch Size  Samples/sec    Batch Size  Samples/sec
      ResNet-1001         98          98.7           48          44.3
      DenseNet-264        218         255.8          109         143.1
      VNet                30          3.56           15          3.4
      Lang_Model-1408     32          94.9           40          77.9
      Lang_Model-2800     16          46.5           18          35.7
      All results in this presentation use PyTorch 1.0rc1, the R418 driver, and Tesla V100-32GB.

  33. GPU OVERSUBSCRIPTION. Charts: training throughput (samples/sec) vs. batch size for ResNet-1001 and DenseNet-264 in FP16 and FP32, sweeping batch sizes up to 3x the optimal batch size.

  34. GPU OVERSUBSCRIPTION. Diagram: fill phase (pages being placed into GPU memory; CPU memory shown alongside).

  35. GPU OVERSUBSCRIPTION. Diagram: evict phase (pages being evicted from GPU memory back to CPU memory).
