Image Compositing on GPU-Accelerated Supercomputers
Pascal Grosset & Charles (Chuck) Hansen
Tuesday 5 April 2016, GTC 2016
Outline
- Direct Volume Rendering
- Distributed Volume Rendering
- Rendering Pipeline
  - Setup
  - Rendering
  - Compositing
- Test Setup
- Results & Discussion
- Conclusion & Future Work
Direct Volume Rendering
An image is produced directly from a block of scalar values (the volume), with no intermediate geometry.
Distributed Volume Rendering
Sort-last parallel rendering:
1. Partition the data among the nodes (loading)
2. Form an image from each node's data (rendering)
3. Assemble the partial images into the final image (compositing)
Distributed Volume Rendering on GPU
Rendering: OpenGL, the most common way to render
Compositing: transfer the images to the CPU and composite there?
Inter-node GPU Communication
Without GPUDirect RDMA, data travels GPU memory → CUDA driver buffer → network driver buffer → network → network driver buffer → CUDA driver buffer → GPU memory: 5 operations!
With GPUDirect RDMA, the NIC reads and writes GPU memory directly: 1 operation.
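In practice GPUDirect RDMA is reached through a CUDA-aware MPI, which accepts device pointers directly. A minimal sketch, assuming a CUDA-aware MPI build (the buffer size and ranks are illustrative):

```cuda
// Sketch: with a CUDA-aware MPI, a device pointer is passed straight to
// MPI; GPUDirect RDMA then moves the data NIC-to-GPU with no host staging.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024 * 1024;
    float* d_buf;                          // lives in GPU memory only
    cudaMalloc(&d_buf, n * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```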
Distributed Volume Rendering on GPU
Rendering: OpenGL shaders
Compositing: use the GPU, with CUDA for both computation and communication; routing the images back through OpenGL would imply 5 copies when compositing!
CUDA-OpenGL interop links OpenGL with CUDA, and CUDA and OpenGL can run together on Tesla-class GPUs.
Pipeline
Setup → Volume Rendering → Compositing, with CUDA-OpenGL interop between rendering and compositing.
Volume rendering: OpenGL with shaders, rendered offscreen.
Problem: GPUDirect RDMA does NOT work with texture memory!
Solution: render offscreen to GL_TEXTURE_BUFFER, which is backed by buffer memory rather than texture memory.
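A minimal sketch of the interop handoff under that scheme: the buffer object backing the GL_TEXTURE_BUFFER is registered with CUDA once, then mapped each frame to expose a plain device pointer. GLEW for GL loading and the names buf, tex, and imageBytes are assumptions for illustration:

```cuda
#include <GL/glew.h>
#include <cuda_gl_interop.h>

// Create a buffer object that backs a GL_TEXTURE_BUFFER, register it with
// CUDA once, then map it each frame to get a device pointer that the
// compositing kernels (and GPUDirect RDMA) can use.
GLuint buf, tex;
cudaGraphicsResource* cudaRes = nullptr;

void setupInterop(size_t imageBytes) {
    glGenBuffers(1, &buf);
    glBindBuffer(GL_TEXTURE_BUFFER, buf);
    glBufferData(GL_TEXTURE_BUFFER, imageBytes, nullptr, GL_DYNAMIC_COPY);

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_BUFFER, tex);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, buf); // shader output lands in buf

    cudaGraphicsGLRegisterBuffer(&cudaRes, buf, cudaGraphicsRegisterFlagsNone);
}

float4* mapForCompositing() {
    cudaGraphicsMapResources(1, &cudaRes, 0);
    void* dPtr = nullptr;
    size_t bytes = 0;
    cudaGraphicsResourceGetMappedPointer(&dPtr, &bytes, cudaRes);
    return static_cast<float4*>(dPtr);    // pass to CUDA kernels / MPI
}

void unmapAfterCompositing() {
    cudaGraphicsUnmapResources(1, &cudaRes, 0); // hand buffer back to OpenGL
}
```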
Pipeline
Compositing: CUDA kernels for blending, with GPUDirect RDMA for communication.
Constraint: computation >> communication, so we want an algorithm that minimizes communication.
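The blending itself is the front-to-back "over" operator. A minimal sketch of such a kernel, assuming premultiplied-alpha RGBA float images (the names are illustrative, not the talk's code):

```cuda
// Front-to-back "over" blend of two premultiplied-alpha RGBA images.
// front is updated in place: front = front + (1 - front.a) * back.
__global__ void blendOver(float4* front, const float4* back, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;
    float4 f = front[i];
    float4 b = back[i];
    float t = 1.0f - f.w;   // remaining transparency of the front image
    f.x += t * b.x;
    f.y += t * b.y;
    f.z += t * b.z;
    f.w += t * b.w;
    front[i] = f;
}

// launch: blendOver<<<(numPixels + 255) / 256, 256>>>(d_front, d_back, numPixels);
```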
TOD-Tree
Task-Overlapped Direct Send Tree (TOD-Tree):
1. Direct send
2. K-ary tree compositing
3. Gather
Aim: minimize communication, and overlap communication with computation.
TOD-Tree: Direct Send (stage 1)
Each node:
- Determines the nodes in its locality of size r
- Creates and advertises a receiving buffer
- Does a parallel direct send (sketched below)
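A minimal sketch of one such exchange with non-blocking MPI, assuming a CUDA-aware MPI so the device pointers from the interop mapping go straight on the wire. The region layout, tag, and names (d_myImage, peers, regionPixels) are illustrative:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Each rank advertises receive buffers first, then sends its rendered
// regions to their owners; blending overlaps with the remaining traffic.
void directSend(float4* d_myImage, float4* d_recvBufs, int regionPixels,
                const int* peers, int numPeers, MPI_Comm comm) {
    MPI_Request* reqs = new MPI_Request[2 * numPeers];

    // 1. Post receives up front so incoming sends never block.
    for (int p = 0; p < numPeers; ++p)
        MPI_Irecv(d_recvBufs + p * regionPixels, 4 * regionPixels, MPI_FLOAT,
                  peers[p], 0, comm, &reqs[p]);

    // 2. Direct send: ship each region of my image to the rank that owns it.
    for (int p = 0; p < numPeers; ++p)
        MPI_Isend(d_myImage + p * regionPixels, 4 * regionPixels, MPI_FLOAT,
                  peers[p], 0, comm, &reqs[numPeers + p]);

    // 3. Overlap: blend whichever region arrives first while others are in flight.
    for (int done = 0; done < numPeers; ++done) {
        int p;
        MPI_Waitany(numPeers, reqs, &p, MPI_STATUS_IGNORE);
        // blendOver<<<...>>>(d_myRegion, d_recvBufs + p * regionPixels, ...);
    }
    MPI_Waitall(numPeers, reqs + numPeers, MPI_STATUSES_IGNORE);
    delete[] reqs;
}
```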
TOD-Tree: K-ary Tree (stage 2)
Each node determines whether it is sending or receiving.
Sending node: sends its image to the receiving node.
Receiving node: creates and advertises a buffer, then blends the incoming images.
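One plausible way to assign those roles round by round. This scheme (receivers at multiples of k^(round+1), senders spaced k^round above them) is a hypothetical reconstruction for illustration, not the paper's exact indexing:

```cuda
#include <vector>

struct Role {
    bool receiver;             // does this rank blend in this round?
    int  target;               // if sender: rank to send to
    std::vector<int> sources;  // if receiver: ranks to receive from
};

// Hypothetical role assignment for round `round` of a k-ary compositing
// tree: receivers sit at multiples of k^(round+1); each receives from up
// to k-1 senders spaced step = k^round apart. Other ranks are idle.
Role kAryRole(int rank, int numRanks, int k, int round) {
    long step = 1;
    for (int i = 0; i < round; ++i) step *= k;  // k^round
    long span = step * k;                       // k^(round+1)

    Role role{};
    if (rank % span == 0) {                     // receiver this round
        role.receiver = true;
        for (int j = 1; j < k; ++j) {
            long src = rank + j * step;
            if (src < numRanks) role.sources.push_back((int)src);
        }
    } else if (rank % step == 0) {              // sender this round
        role.receiver = false;
        role.target = (int)(rank - rank % span); // owning receiver
    }
    return role;
}
```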
TOD-Tree: Gather (stage 3)
Display node: receives the remaining images from the other nodes.
Other nodes: nodes that still have images send their data to the display node.
TOD-Tree vs Radix-k and Binary Swap
CPU comparison against IceT's Binary Swap and Radix-k.
[Chart: compositing time for Binary Swap, Radix-k, and TOD-Tree]
Pipeline
Setup:
- Activate X Server; create an OpenGL context using GLX
- Driver 358 requires no X Server for an OpenGL context
Volume Rendering:
- Set up an OpenGL buffer object
- Write offscreen using shaders
- OpenGL-CUDA interop
Compositing:
- CUDA kernel: blending
- GPUDirect RDMA: communication
- TOD-Tree: logic
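For the X-free route the newer driver enables, the context can come from EGL instead of GLX. This is a hedged sketch of that alternative, not the talk's GLX path, and it assumes the driver exposes EGL surfaceless contexts:

```cuda
// Hypothetical headless OpenGL context via EGL (no X server), as enabled
// by NVIDIA driver 358+. Error handling abbreviated for brevity.
#include <EGL/egl.h>

bool createHeadlessContext() {
    EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
    if (dpy == EGL_NO_DISPLAY || !eglInitialize(dpy, nullptr, nullptr))
        return false;

    eglBindAPI(EGL_OPENGL_API);                  // desktop OpenGL, not GLES

    const EGLint attribs[] = {
        EGL_SURFACE_TYPE,    EGL_PBUFFER_BIT,
        EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT,
        EGL_NONE
    };
    EGLConfig cfg;
    EGLint n = 0;
    if (!eglChooseConfig(dpy, attribs, &cfg, 1, &n) || n == 0)
        return false;

    EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, nullptr);
    // Surfaceless binding: all rendering goes to an FBO, never a window.
    return ctx != EGL_NO_CONTEXT &&
           eglMakeCurrent(dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, ctx);
}
```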
Setup for Testing
Test data: cube dataset, one cube per node
Test platform: Piz Daint at the Swiss National Supercomputing Centre (CSCS)
- Cray XC30 with 5,272 Tesla K20X GPUs
- 7th in the Top500 list of supercomputers
Algorithm: TOD-Tree
Results: TOD-Tree, Edison vs Piz Daint
[Charts: compositing time in ms on Edison and Piz Daint]
Conclusion
Image compositing on GPUs is now feasible!
Rendering: OpenGL shaders, offscreen to GL_TEXTURE_BUFFER, with CUDA-OpenGL interop
Compositing:
- Blending: CUDA kernels
- Communication: GPUDirect RDMA
- Logic: TOD-Tree
Scales very well as the image size increases.
Future Work
- Test in-situ rendering
- Scale to a larger number of nodes
- Vulkan in place of OpenGL for volume rendering
More details
Paper:
- A. V. Pascal Grosset, Manasa Prasad, Cameron Christensen, Aaron Knoll, Charles Hansen, "TOD-Tree: Task-Overlapped Direct send Tree Image Compositing for Hybrid MPI Parallelism and GPUs", IEEE Transactions on Visualization & Computer Graphics, PrePrints, doi:10.1109/TVCG.2016.2542069
Thank you! Any questions?
Special thanks to Tom Fogal, Peter Messmer and Jean Favre.
Email: pgrosset@sci.utah.edu