Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation – PowerPoint presentation


  1. Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation – Chris Davis, Sophie Voisin, Devin White, Andrew Hardin – Scalable and High Performance Geocomputation Team, Geographic Information Science and Technology Group, Oak Ridge National Laboratory – GTC 2017, May 2017 – ORNL is managed by UT-Battelle for the US Department of Energy

  2. Outline
  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

  3. The Story
  • We are:
    – Developing an HPC suite of applications
    – Spread across multiple R&D teams
    – In an Agile development process
    – Delivering to a production environment
    – Needing to support multiple systems / multiple capabilities
    – Collecting performance metrics for system optimization

  4. Why We Use NVIDIA-Docker
  [Diagram comparing NVIDIA-Docker, plain Docker, and virtual machines on resource optimization, operating system support, GPU access, and flexibility]

  5. Hardware – Quadro: Compute + Display
     Card          M4000     P6000
     Capability    5.2       6.1
     Blocks / SM   32        32
     SMs           13        30
     Cores         1664      3840
     Memory        8 GB      24 GB

  6. Hardware – Tesla: Compute Only
     Card          K40       K80
     Capability    3.5       3.7
     Blocks / SM   16        16
     SMs           15        13
     Cores         2880      2496
     Memory        12 GB     12 GB
     (K80 values are per GPU; the card is a dual-GPU board.)

  7. Hardware – High End
     Dell C4130
     GPUs          4 x K80
     RAM           256 GB
     CPU cores     48
     SSD storage   400 GB

  8. Constructing Containers
  • Build container:
    – Based off the NVIDIA images at gitlab.com (https://gitlab.com/nvidia/cuda/tree/centos7)
    – CentOS 7
    – CUDA 8.0 / 7.5
    – cuDNN 5.1
    – GCC 4.9.2
    – Cores: 24
    – Mount local folder with the code
  • Compile against the chosen compute capability
  • Copy the product inside the container
  • "docker commit" the container updates to a new image
  • "docker save" the image to Isilon
  [Diagram: Git repo and local drive feed build containers running under NVIDIA-Docker (CPUs/GPUs) on the HPC server; images are saved to Isilon and build statistics go to the PostgreSQL Compile Stats / Profile Stats databases]
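
A minimal sketch of how such a build step could be scripted. The base image tag is meant to match the gitlab.com/nvidia/cuda CentOS 7 line referenced above, but the container name, paths, build command, and image tag are illustrative assumptions, not the team's actual tooling.

    # Sketch: build the application inside an NVIDIA CUDA base container,
    # then snapshot and archive the resulting image. Names, paths, and the
    # build command are illustrative assumptions.
    import subprocess

    BASE_IMAGE = "nvidia/cuda:8.0-cudnn5-devel-centos7"  # from gitlab.com/nvidia/cuda
    CODE_DIR = "/home/user/app"                          # local folder with the code
    ISILON_DIR = "/mnt/isilon/images"                    # shared image storage
    CAPABILITY = "sm_61"                                 # chosen compute capability

    def run(cmd):
        print(" ".join(cmd))
        subprocess.check_call(cmd)

    name = "build_" + CAPABILITY
    image = "hpc_app:" + CAPABILITY

    # 1. Start a build container with the source tree mounted inside.
    run(["nvidia-docker", "run", "-d", "--name", name,
         "-v", CODE_DIR + ":/src", BASE_IMAGE, "sleep", "infinity"])

    # 2. Compile against the chosen compute capability and copy the product
    #    inside the container (one illustrative build-and-install command).
    run(["docker", "exec", name, "bash", "-c",
         "cd /src && make GENCODE=" + CAPABILITY + " install"])

    # 3. "docker commit" the container updates to a new image.
    run(["docker", "commit", name, image])

    # 4. "docker save" the image to Isilon so other servers can load it.
    run(["docker", "save", "-o",
         ISILON_DIR + "/hpc_app_" + CAPABILITY + ".tar", image])

    # 5. Clean up the build container.
    run(["docker", "rm", "-f", name])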

  9. Running Containers
  • For each compute capability:
    – "docker load" the image from Isilon storage
    – Run the container and the profile script
    – Send nvprof results to the Profile Stats database
    – Remove the container / image
  [Diagram: containers running under NVIDIA-Docker (CPUs/GPUs) on the HPC server; images loaded from Isilon, data read from the local drive, statistics sent to the PostgreSQL Compile Stats / Profile Stats databases]
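
A matching sketch of the per-capability run loop above; the image names and paths line up with the build sketch, and the application and data paths inside the container are placeholders.

    # Sketch: for each compute-capability image, load it from Isilon, run the
    # container with the profile script under nvprof, then remove the image.
    # Image tags, mount points, and the profiled command are assumptions.
    import subprocess

    ISILON_DIR = "/mnt/isilon/images"
    CAPABILITIES = ["sm_30", "sm_35", "sm_37", "sm_50", "sm_52", "sm_60", "sm_61"]

    for cap in CAPABILITIES:
        tarball = "{}/hpc_app_{}.tar".format(ISILON_DIR, cap)
        image = "hpc_app:" + cap

        # "docker load" the image saved by the build server.
        subprocess.check_call(["docker", "load", "-i", tarball])

        # Run the container and the profile script; nvprof writes a CSV log
        # that a later step parses into the Profile Stats database.
        subprocess.check_call([
            "nvidia-docker", "run", "--rm",
            "-v", "/mnt/isilon/data:/data",
            image,
            "nvprof", "--csv", "--log-file", "/data/profiles/{}.csv".format(cap),
            "/opt/hpc_app/run_pipeline", "/data/input"])

        # The container is removed by --rm; remove the image to free local disk.
        subprocess.check_call(["docker", "rmi", image])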

  10. Hooking It All Together
  • One server generates the containers
  • All servers pull containers from Isilon
  • Data to be processed is pulled from Isilon
  • Container build stats are stored in the Compile DB
  • Container execution stats are stored in the Profile DB
  [Diagram: one build server and several run servers, each with a local drive and containers under NVIDIA-Docker (CPUs/GPUs), sharing Isilon storage and the PostgreSQL Compile Stats / Profile Stats databases]

  11. Profiling Combinations
  • nvprof
    – Output parsed
    – Sent to Profile DB
  • Containers for:
    – CUDA version (7.5 and 8.0)
    – Each compute capability (3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1)
    – All capabilities
    – CPU only
  • Data sets: 4 (D1 – D4)
  • Total of 104 profiles
  [Diagram: matrix of CUDA 7.5 / CUDA 8.0 builds per compute capability (CPU only, 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1, all capabilities) crossed with the four data sets, run on the M4000, P6000, K40, and K80]
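
A sketch of the parsing step, reading an nvprof CSV log and inserting one row per kernel or API call into the profile database. The column names assume nvprof's default summary CSV layout, and the table name and connection details are assumptions (the actual schema is outlined on the next slide).

    # Sketch: push nvprof --csv summary rows into the Profile Stats database.
    # Table name, connection parameters, and column mapping are assumptions.
    import csv
    import psycopg2

    def load_profile(csv_path, hostname, capability, cuda_version, dataset):
        conn = psycopg2.connect(dbname="profile_stats", host="dbserver")
        cur = conn.cursor()
        with open(csv_path) as f:
            # nvprof prefixes the CSV with "==<pid>== ..." banner lines.
            lines = [line for line in f if not line.startswith("==")]
        for rec in csv.DictReader(lines):
            if not rec.get("Name"):
                continue  # skips the units row that follows the header
            cur.execute(
                """INSERT INTO nvprof_stats
                   (hostname, compute_capability, cuda_version, dataset,
                    name, time_percent, num_calls, avg_time, min_time, max_time)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                (hostname, capability, cuda_version, dataset,
                 rec["Name"], rec["Time(%)"], rec["Calls"],
                 rec["Avg"], rec["Min"], rec["Max"]))
        conn.commit()
        conn.close()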

  12. Databases
  • Postgres databases
    – Shared fields: Hostname, Compute Capability, CUDA Version, Num CPU Threads, Dataset, Timestamp
    – Compile DB: Compile Time, GPU
    – Run Time DB: Execution Time, Step Name, Step Time
    – NVPROF DB: Kernel / Device API Call, Time Percent, Num Calls, Ave Time, Min Time, Max Time
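
One way the NVPROF table could be laid out as Postgres DDL, matching the insert sketch above; the field names follow the slide, but the exact types and table name are assumptions.

    # Sketch: possible schema for the NVPROF database (shared fields plus the
    # nvprof-specific columns). Names and types are assumptions.
    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS nvprof_stats (
        hostname            text,
        compute_capability  text,        -- e.g. '6.1', '30-61', or 'cpu'
        cuda_version        text,        -- '7.5' or '8.0'
        num_cpu_threads     integer,
        dataset             text,
        run_timestamp       timestamptz,
        name                text,        -- kernel or device API call
        time_percent        real,
        num_calls           integer,
        avg_time            real,
        min_time            real,
        max_time            real
    );
    """

    with psycopg2.connect(dbname="profile_stats", host="dbserver") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)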

  13. Outline
  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

  14. Example HPC Application
  • Geospatial metadata generator
    – Leverages open-source 3rd-party libraries (OpenCV, Caffe, GDAL, …)
    – GPU-enabled computer vision algorithms (SURF, ORB, NCC, NMI, …)
    – Automated matching against control data
    – Calculates geospatial metadata for input imagery
  • Imagery sources: satellites, manned aircraft, unmanned aerial systems

  15. Example HPC Application - GTC16
  • Two-step image re-alignment application using NMI (Normalized Mutual Information): NMI = (H_S + H_C) / H_J
  • Pipeline: Preprocessing → Source Selection → Global Localization → Registration → Resection, from the input image (plus a control source) to the output image metadata, with stages split across CPU and GPU
  • Core libraries: NITRO, GDAL, Proj.4, libpq (Postgres), OpenCV, CUDA, OpenMP
  [Diagram: pipeline stages, with source, control, and joint histograms feeding the NMI computation]

  16. Example HPC Application - GTC16
  • Global Localization (in-house implementation)
  • Objective
    – Re-align the source image with the control image
  • Method
    – Roughly match source and control images
    – Coarse resolution
    – Mask for non-valid data
    – Exhaustive search
  [Example in the pipeline diagram: control 382x100 pixels and tactical 258x67 pixels, giving a (382-258+1) x (100-67+1) = 125 x 34 = 4250 solution space]

  17. Example HPC Application - GTC16
  • Global Localization [illustration slide]

  18. Example HPC Application - GTC16
  • Similarity Metric – Normalized Mutual Information
    NMI = (H_S + H_C) / H_J
    H = - Σ_i p_i log2(p_i), where H is the entropy and p_i the probability density function, with i in [0..255] for S and C and [0..65535] for the joint histogram J
  • Histogram with masked area
    – Missing data
    – Artifact
    – Homogeneous area
  [Diagram: source image and mask (N_S x M_S pixels), control image and mask (N_C x M_C pixels), solution space of n x m NMI coefficients]
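
A minimal NumPy sketch of the metric as defined above, computing the entropies from masked histograms. The bin counts follow the slide (256 bins for the marginals, 256 x 256 = 65536 for the joint histogram); the function itself is illustrative, not the production code.

    # Sketch: Normalized Mutual Information between a source and a control
    # window of equal size, honoring validity masks.
    import numpy as np

    def entropy(hist):
        """H = -sum(p_i * log2(p_i)) over non-empty bins."""
        p = hist[hist > 0].astype(np.float64)
        p /= p.sum()
        return -np.sum(p * np.log2(p))

    def nmi(source, control, source_mask, control_mask):
        """NMI = (H_S + H_C) / H_J for 8-bit imagery."""
        valid = (source_mask > 0) & (control_mask > 0)  # drop missing data / artifacts
        s = source[valid].astype(np.int64)
        c = control[valid].astype(np.int64)

        h_s = entropy(np.bincount(s, minlength=256))
        h_c = entropy(np.bincount(c, minlength=256))
        # Joint histogram: no compromise on the number of bins (65536).
        h_j = entropy(np.bincount(s * 256 + c, minlength=65536))
        return (h_s + h_c) / h_j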

  19. Example HPC Application - GTC16
  Summary
  • Global Localization as coarse re-alignment
    – Problematic: joint histogram computation for each solution
      • No compromise on the number of bins: 65536
      • Exhaustive search
    – Solution: leverage the K80 specifications
      • 12 GB of memory
      • 1 thread per solution (1 solution / thread)
      • Less than 25 seconds for 61K solutions on a 131K-pixel image
  Kernel specifications (100% occupancy)
     Threads / block       128
     Stack frame           264192 bytes / thread (33.81 MB total memory / block, 541.06 MB / SM, 7.03 GB / GPU, 61.06% of memory)
     Spill stores / loads  0 / 0
     Registers             27
     Shared memory         0 / block, 0 / SM (0.00%)
     Constant memory       448 bytes cmem[0], 20 bytes cmem[2]
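
A CPU-side reference sketch of the exhaustive search described above; on the GPU every (dy, dx) offset in this double loop becomes one thread, each keeping its own full 65536-bin joint histogram. It reuses the nmi() helper from the sketch after slide 18, and the loop bounds (source fully inside the control) are a simplifying assumption.

    # Sketch: exhaustive coarse re-alignment. Each offset of the source over
    # the control is one candidate solution; the GPU kernel maps one thread
    # to each solution, this reference loop only shows the logic.
    # nmi() is the helper defined in the earlier sketch (after slide 18).
    import numpy as np

    def global_localization(source, source_mask, control, control_mask):
        ns, ms = source.shape
        nc, mc = control.shape
        n, m = nc - ns + 1, mc - ms + 1      # solution space: n x m NMI coefficients
        scores = np.zeros((n, m))

        for dy in range(n):                  # on the GPU: one thread per (dy, dx)
            for dx in range(m):
                c_win = control[dy:dy + ns, dx:dx + ms]
                c_msk = control_mask[dy:dy + ns, dx:dx + ms]
                scores[dy, dx] = nmi(source, c_win, source_mask, c_msk)

        best = np.unravel_index(np.argmax(scores), scores.shape)
        return best, scores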

  20. Example HPC Application - GTC16
  • Registration
  [Pipeline diagram repeated with the Registration stage highlighted; control 382x100, tactical 258x67]

  21. Example HPC Application - GTC16
  • Registration
  • Objective
    – Refine the localization
  • Method
    – Use higher resolution (~400 times more pixels: tactical & control at 4571x1555 versus the coarse 258x67 / 382x100)
    – Keypoint matching
  [Pipeline diagram repeated]

  22. Example HPC Application - GTC16
  • Registration workflow
  [Workflow diagram: source image → detect → keypoint list → describe → descriptors (11x11 intensity values); control image → search windows (73x73 pixels) at the keypoint locations; descriptors matched against the search windows with the NMI metric → tiepoint list]
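
A sketch of that workflow in Python with OpenCV. The deck lists both SURF and ORB among its GPU-enabled algorithms; ORB is used here only because it ships with stock OpenCV builds, and the matching loop reuses the nmi() helper from the sketch after slide 18. Window and descriptor sizes follow the slides; carrying keypoint coordinates straight over to the control image assumes the coarse alignment from Global Localization has already been applied.

    # Sketch: detect keypoints, describe each with an 11x11 intensity patch,
    # and match it against every position of a 73x73 control search window
    # using NMI (63 x 63 = 3969 candidate positions per keypoint).
    # nmi() is the helper defined in the earlier sketch (after slide 18).
    import cv2
    import numpy as np

    DESC = 11   # 11 x 11 intensity descriptor
    WIN = 73    # 73 x 73 control search window

    def patch(img, x, y, size):
        h = size // 2
        return img[y - h:y + h + 1, x - h:x + h + 1]

    def register(source, control):
        detector = cv2.ORB_create(nfeatures=65536)    # deck: up to 65536 keypoints
        keypoints = detector.detect(source, None)
        ones = np.ones((DESC, DESC), dtype=np.uint8)  # descriptors are fully valid

        tiepoints = []
        for kp in keypoints:
            x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
            desc = patch(source, x, y, DESC)
            win = patch(control, x, y, WIN)   # assumes coarse alignment already applied
            if desc.shape != (DESC, DESC) or win.shape != (WIN, WIN):
                continue                       # too close to the image border
            best, best_score = None, -np.inf
            for dy in range(WIN - DESC + 1):   # 63 x 63 solution space
                for dx in range(WIN - DESC + 1):
                    cand = win[dy:dy + DESC, dx:dx + DESC]
                    score = nmi(desc, cand, ones, ones)
                    if score > best_score:
                        best_score = score
                        best = (x + dx - WIN // 2 + DESC // 2,
                                y + dy - WIN // 2 + DESC // 2)
            tiepoints.append(((x, y), best, best_score))
        return tiepoints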

  23. Application
  • Similarity Metric – Normalized Mutual Information
    NMI = (H_S + H_C) / H_J, with H = - Σ_i p_i log2(p_i) (same definitions as on slide 18)
  • Small "images" but numerous
    – Numerous keypoints: up to 65536 with the GPU SURF detector
    – Image / descriptor size: 11 x 11 intensity values to describe
    – Search area: 73 x 73 control sub-image
    – Solution space: 63 x 63 = 3969 NMI coefficients per keypoint (65536 keypoints x 3969 solutions ≈ 260M NMI coefficients)
  [Diagram: 11x11 descriptors, 73x73 search windows, 63x63 solution spaces]

  24. Example HPC Application - GTC16
  Summary
  • Registration refines the re-alignment
    – Problematic: joint histogram computation for each solution
      • No compromise on the number of bins: 65536
      • Exhaustive search
    – Solution: leverage the K80 specifications
      • 12 GB of memory
      • 1 block per solution space (one keypoint's 63 x 63 search)
      • Leverage the number of values in a descriptor: 121 (maximum) << 65536
      • Less than 100 seconds for 65K keypoints (260M NMI coefficients)
      • About 10K keypoints in less than 20 seconds
  Kernel: find the best match for all keypoints
  • 1 block per keypoint, optimized for the 63 x 63 search windows
  • 64 threads / block (1 idle); each thread computes a "row" of solutions
  • Sparse joint histogram: 65536 bins but only 121 values
    – Leverage the 11 x 11 descriptor size
    – Create 2 lists of length 121: intensity values (indices) for the source descriptor and for the corresponding control subset
    – Update the joint histogram count from the lists
    – Loop over the lists to retrieve the aggregate count; set the aggregate count to 0 after its first retrieval
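
A scalar sketch of the sparse joint-histogram trick described above: instead of touching 65536 bins, keep two 121-entry lists of intensity values and recover each joint bin's aggregate count by scanning the lists, zeroing an entry once it has been aggregated. On the GPU one block handles one keypoint and each of its 63 active threads computes one row of the 63 x 63 solution space; this Python version shows only the per-solution entropy logic.

    # Sketch: joint entropy of an 11x11 source descriptor against an 11x11
    # control patch without building a 65536-bin histogram. Only the 121
    # (source, control) value pairs are stored; duplicates are counted by a
    # scan that zeroes an entry's count after its first retrieval.
    import math

    def sparse_joint_entropy(src_vals, ctl_vals):
        """src_vals, ctl_vals: flat sequences of the 121 patch intensities."""
        n = len(src_vals)               # 121 for an 11x11 descriptor
        counts = [1] * n                # each pair starts with a count of one
        h_j = 0.0
        for i in range(n):
            if counts[i] == 0:
                continue                # already folded into an earlier pair
            agg = counts[i]
            for j in range(i + 1, n):
                if counts[j] and src_vals[j] == src_vals[i] and ctl_vals[j] == ctl_vals[i]:
                    agg += counts[j]
                    counts[j] = 0       # "set aggregate count to 0 after first retrieval"
            p = agg / float(n)
            h_j -= p * math.log2(p)
        return h_j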

  25. Outline
  • Background
  • Example HPC Application
  • Study Results
  • Lessons Learned / Future Work

  26. Compile Time Results
  [Charts: compile time in seconds and size of binary files in MB for each compute-capability specification (OFF, 30, 35, 37, 50, 52, 60, 61, 30-52, 30-61), compared between CUDA 7.5 and CUDA 8.0]
