Portable Designs for Performance Using the Hybrid Task Graph Scheduler
Tim Blattner
NIST | ITL | SSD | ISG
Disclaimer
No approval or endorsement of any commercial product by NIST is intended or implied. Certain commercial software, products, and systems are identified in this report to facilitate better understanding. Such identification does not imply recommendations or endorsement by NIST, nor does it imply that the software and products identified are necessarily the best available for the purpose.
Acknowledgements
} University of Maryland, College Park
} Shuvra Bhattacharyya, Jiahao Wu
} Green Bank Observatory, WV
} Richard Prestage
} NIST
} Walid Keyrouz, Derek Juba, Alexandre Bardakoff, Peter Bajcsy, Mike Majurski, Adele Peskin, Zachary Levine, Adam Pintar, Mary Brady
Outline
} Introduction
} Experiments with HTGS
} Current HTGS applications
} Lessons Learned and Future
} Closure
Performance of Scalable Systems --- Research Goals
} Software approaches for parallelism
} Scale with hardware parallelism
} Multicore, GPU, and cluster computing
} Scalable programmer
} Modest programming effort
} Achieve 80-90% attainable performance
} Built on abstractions and software infrastructure
¨ Accessible performance model
[Figure: data pipeline from instruments/sensors through storing/streaming to compute, handling GBs/TBs/PBs of data]
[Photo: IBM "Minsky" Power8+ with NVLink and 4 Tesla P100 GPUs, GTC DC 2016]
[Figure: HTGS execution profile of a multi-GPU block-panel LU decomposition (50000 x 50000, block size 500): 87.67 s compute time, 885.3 GFLOPS, with per-task compute/wait times and queue sizes]
Challenging Hardware Landscape
} Modern computers have
} Multi-core CPUs
} 10+ cores per CPU
} Many-core accelerators
} GPUs
} How to take advantage of these machines?
} Particularly with multi-GPU configurations
} Need a programming model at a higher level of abstraction
} Focus on parallelism
} Data motion
} Memory usage
[Figure: abstract execution models]
Current Practice—Scalability Perspective
} Retrofitting approach
} Fine vs coarse-grained parallelism
} Parallel directives (see the sketch below)
} OpenMP, OpenACC
} Parallel libraries
} OpenCV, OpenBLAS, …
} Task libraries
} StarPU, Legion, …
} Performance Portability Programming
} Kokkos
[Figure: traditional offload approach (OpenMP) vs pipelined workflow approach]
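To make the retrofitting style concrete, the sketch below annotates an existing loop with an OpenMP parallel directive. It is a generic illustration (the function and data names are ours, not from the talk): a single directive parallelizes one kernel in place, but by itself it does not pipeline I/O, data motion, and compute the way the workflow approaches do.

// Minimal retrofitting example: an existing serial loop is parallelized
// across CPU cores by adding one OpenMP directive.
// Build with: g++ -fopenmp example.cpp
#include <vector>

void scaleInPlace(std::vector<double> &data, double factor) {
  // Each iteration is independent, so OpenMP may split the index range
  // among the available threads.
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(data.size()); ++i) {
    data[i] *= factor;
  }
}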
Expanding on our Lessons Learned
} Image Stitching (2013)
} Hybrid Pipeline Workflows
} Multi-GPU
} Multiple producers – multiple consumers
} Significant effort to implement
} Generalize and extend the image stitching workflow for other applications
} The Hybrid Task Graph Scheduler
} The scalable programmer
} Experimentation for performance
[Figure: CPU Image Stitching Hybrid Pipeline Workflow -- read and FFT/Disp. thread pools connected by queues (Q01-Q23) and a bookkeeper (BK1)]
[Figure: Multi-GPU Image Stitching Hybrid Pipeline Workflow -- per-GPU pipelines (read, copier, FFT, bookkeeper, Disp) feeding a shared CCF stage]
Experimentation for Performance
} Is the essence of portable designs for performance
} Ability to programmatically adapt
} To the hardware landscape
} Modify algorithms at a high level of abstraction as new techniques are discovered
} Easily identify bottlenecks
} Modify traversal strategies
} Decomposition strategies
} Must maintain high-level abstractions from analysis to execution
} Improved profiling and debugging -> experimentation for performance
Hybrid Task Graph Scheduler
} Maintains explicit dataflow
} Separation of concerns
} Coarse-grain parallelism
} Hide latency of data motion
} Memory management
} Focus on representation
} Persists through analysis and implementation
} Experimentation for performance
} Debug, profile, visualize performance using the dataflow representation
} Efforts spilling over
} Computational Tomography
} Fast Image – high performance image processing (prior to MITS)
} Radio Astronomy Radio Frequency Interference Mitigation
HTGS Model | Methodology
} Blends dataflow and task graph
} Nodes – Tasks
} Edges – Dataflow between tasks (an illustrative edge data type is sketched below)
} C++ API
} Header only
Methodology:
1. Start with parallel algorithm
2. Represent it as a dataflow graph
3. Map it onto an HTGS task graph
4. Implement graph using API & annotate for memory
5. Refine & optimize
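As a small illustration of steps 2 and 3, the payload that travels along a graph edge is wrapped in a data class; in HTGS these classes derive from htgs::IData. Everything below (class name, members, header path) is an assumption for illustration, not code from the talk.

// Illustrative edge payload for an HTGS graph (a sketch, assuming the
// htgs::IData base class and the library's header layout).
#include <htgs/api/IData.hpp>
#include <cstddef>

class SpectrumBlock : public htgs::IData {
 public:
  SpectrumBlock(float *samples, size_t count)
      : samples(samples), count(count) {}
  float *getSamples() const { return samples; }
  size_t getCount() const { return count; }
 private:
  float *samples;   // raw samples owned elsewhere (e.g., by a memory edge)
  size_t count;     // number of samples in this block
};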
HTGS API
} Task interface (skeleton sketched below)
} Initialize
} Execute
} Can-Terminate
} Shutdown
} Each task binds to one or more CPU thread(s)
} Edges between tasks are thread-safe data queues
} Apply binding to accelerator
} GPU (NVIDIA/AMD), FPGA, …
} Specialty tasks
} Bookkeeper task
} Manages complex data dependencies
} Maintains state of computation
} CUDA Task
} Binds task to NVIDIA CUDA GPU
} Execution Pipeline Task
} Creates copies of a task graph
¨ Each copy bound to a specified GPU
} Memory Manager
} Attaches memory edge to a task
¨ getMemory("nameOfEdge")
¨ Binds memory allocation to address space
¨ CPU, GPU, etc.
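Below is a skeleton of a task implementing the interface listed above, reusing the illustrative SpectrumBlock type from the earlier sketch. The method names and signatures (the ITask template, executeTask, addResult, copy, getNumThreads) are assumptions based on the HTGS documentation rather than code shown in the talk.

// Sketch of an HTGS task with a pool of CPU threads (assumed API).
#include <htgs/api/ITask.hpp>
#include <memory>

class ThresholdTask : public htgs::ITask<SpectrumBlock, SpectrumBlock> {
 public:
  explicit ThresholdTask(size_t numThreads)
      : htgs::ITask<SpectrumBlock, SpectrumBlock>(numThreads) {}

  // Initialize: called once per thread before data flows (optional override).
  void initialize() override {}

  // Execute: consume one block from the input queue and produce a result.
  void executeTask(std::shared_ptr<SpectrumBlock> block) override {
    float *s = block->getSamples();
    for (size_t i = 0; i < block->getCount(); ++i)
      if (s[i] > 1.0f) s[i] = 1.0f;     // placeholder computation
    this->addResult(block);             // send downstream along the output edge
  }

  // Shutdown: called once per thread when the graph drains (optional override).
  void shutdown() override {}

  // Each thread in the pool runs its own copy of the task.
  ThresholdTask *copy() override {
    return new ThresholdTask(this->getNumThreads());
  }
};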
Sample Code to Build Graph (RFI Mitigation)

// Tasks for the RFI-mitigation pipeline: read, MAD, write
ReadStreamTask *readTask = new ReadStreamTask(inputFileName);
MADTask *madTask = new MADTask(numMADThreads, …);
WriteResultTask *writeResultTask = new WriteResultTask(…);

// build HTGS graph
auto graph = new htgs::TaskGraphConf<htgs::VoidData, htgs::VoidData>();
graph->addEdge(readTask, madTask);
graph->addEdge(madTask, writeResultTask);

// Memory edge: the read task draws "DataBlock" buffers from a pool of
// numDataBlocks, which bounds the graph's memory footprint
graph->addMemoryManagerEdge("DataBlock", readTask,
                            new DataBlockAllocator(size),
                            numDataBlocks, htgs::MMType::Static);

// Visualize the graph before execution
graph->writeDotToFile("MADGraph-Pre-Exec.dot");

// Launch the graph
htgs::TaskGraphRuntime *runtime = new htgs::TaskGraphRuntime(graph);
// Launch runtime and produce/consume data to/from graph
. . .
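The launch and teardown elided above might look like the following. The executeRuntime/waitForRuntime calls are assumptions based on the HTGS runtime API, and the post-execution dot file name is ours, not from the talk.

// Spawn all task threads; in this sketch we assume the ReadStreamTask
// begins streaming blocks through the graph without externally produced input.
runtime->executeRuntime();

// Block until every task has processed its input and shut down.
runtime->waitForRuntime();

// Write the post-execution graph for profiling and visualization (with
// profiling enabled, HTGS annotates tasks with compute/wait times and
// queue sizes, as in the earlier LU decomposition figure).
graph->writeDotToFile("MADGraph-Post-Exec.dot");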
Pre-Execution Graph
} Memory manager "DataBlock" (an illustrative allocator is sketched below)
} Ensures system stays within memory limits
} Separate threads for read/write
} Asynchronous I/O
} MADTask pool of 40 threads
} Dual 10-core CPU w/ hyperthreading
} Parallel processing
} ~90x speedup over sequential
[Figure: pre-execution graph -- MM(static) "DataBlock" x1 (char) feeding ReadStreamTask x1 -> MADTask x40 -> WriteResultTask x1]
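One plausible shape for the DataBlockAllocator passed to addMemoryManagerEdge two slides back, assuming the htgs::IMemoryAllocator interface (memAlloc/memFree): the memory manager hands out at most numDataBlocks of these buffers and recycles them, which is how the graph stays within its memory limit. The interface details here are assumptions, not code from the talk.

// Sketch of a static allocator for the "DataBlock" memory edge
// (assumed htgs::IMemoryAllocator interface; element type char as in the
// pre-execution graph above).
#include <htgs/api/IMemoryAllocator.hpp>
#include <cstddef>

class DataBlockAllocator : public htgs::IMemoryAllocator<char> {
 public:
  explicit DataBlockAllocator(size_t size)
      : htgs::IMemoryAllocator<char>(size) {}

  // Called by the memory manager to create one pooled buffer.
  char *memAlloc(size_t size) override { return new char[size]; }
  char *memAlloc() override { return new char[this->size()]; }

  // Called when the graph shuts down and the pool is torn down.
  void memFree(char *&memory) override { delete[] memory; }
};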
Experiments with HTGS
Image Stitching for Microscopy | Matrix Multiplication | LU Decomposition
Microscopy Image Stitching
} Grid of overlapping images
} Compute pair-wise relative displacement (see the sketch below)
} 17k x 22k pixels per image
} Studying cell growth over time
} >300 time series images
} ImageJ software took ~6 hours to stitch
[Figure: stem cell data with red outline for each tile]
[Figure: Multi-GPU HTGS stitching implementation -- Read -> FFT -> BK -> PCIAM -> CCF with MemManagerFFT]
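The displacement computation at the heart of the FFT/PCIAM stages is phase correlation; the sketch below shows the idea for a single image pair using FFTW on the CPU. It is an illustration of the technique only: the names are ours, the real pipeline runs the FFTs and correlations on the GPU across many tile pairs, and a cross-correlation factor (CCF) step that resolves the peak's interpretation is omitted here.

// Phase correlation for one image pair: the peak of the inverse FFT of the
// normalized cross-power spectrum gives a candidate (row, col) displacement.
// Link with -lfftw3. Illustrative only; not the talk's implementation.
#include <fftw3.h>
#include <cmath>
#include <utility>
#include <vector>

std::pair<int, int> phaseCorrelate(const std::vector<double> &a,
                                   const std::vector<double> &b,
                                   int height, int width) {
  const int nc = width / 2 + 1;            // width of the r2c output
  std::vector<double> ta(a), tb(b);        // FFTW wants non-const input
  fftw_complex *Fa = fftw_alloc_complex(static_cast<size_t>(height) * nc);
  fftw_complex *Fb = fftw_alloc_complex(static_cast<size_t>(height) * nc);
  std::vector<double> corr(static_cast<size_t>(height) * width);

  fftw_plan pa = fftw_plan_dft_r2c_2d(height, width, ta.data(), Fa, FFTW_ESTIMATE);
  fftw_plan pb = fftw_plan_dft_r2c_2d(height, width, tb.data(), Fb, FFTW_ESTIMATE);
  fftw_execute(pa);
  fftw_execute(pb);

  // Normalized cross-power spectrum: Fa * conj(Fb) / |Fa * conj(Fb)|
  for (int i = 0; i < height * nc; ++i) {
    double re = Fa[i][0] * Fb[i][0] + Fa[i][1] * Fb[i][1];
    double im = Fa[i][1] * Fb[i][0] - Fa[i][0] * Fb[i][1];
    double mag = std::sqrt(re * re + im * im) + 1e-12;
    Fa[i][0] = re / mag;
    Fa[i][1] = im / mag;
  }

  fftw_plan pinv = fftw_plan_dft_c2r_2d(height, width, Fa, corr.data(), FFTW_ESTIMATE);
  fftw_execute(pinv);

  // The index of the maximum correlation value encodes the shift
  // (modulo the image size; disambiguation is left to a CCF-style check).
  size_t best = 0;
  for (size_t i = 1; i < corr.size(); ++i)
    if (corr[i] > corr[best]) best = i;

  fftw_destroy_plan(pa);
  fftw_destroy_plan(pb);
  fftw_destroy_plan(pinv);
  fftw_free(Fa);
  fftw_free(Fb);
  return {static_cast<int>(best) / width, static_cast<int>(best) % width};
}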
Stitching Results (2013-05-20)
Reference code: >3 hours
Speedup = Sequential time / Implementation time; Effective Speedup = Reference time / Implementation time (see the check below)
Hardware: dual quad-core Xeon, 32 GB DDR3, 2 Tesla C2070s

Implementation                  | Time        | Speedup | Effective Speedup | Threads
Sequential (CPU)                | 10 min 37 s |         | 21x               | 1
Pipelined Multi-Threaded (CPU)  | 1 min 20 s  | 7.7x    | 162.4x            | 19
Simple GPU (CPU-GPU)            | 9 min 17 s  | 1.08x   | 22.7x             | 1
Pipelined-GPU, 1 GPU (CPU-GPU)  | 43.6 s      | 14.6x   | 305.5x            | 11
Pipelined-GPU, 2 GPUs (CPU-GPU) | 24.5 s      | 26x     | 512.3x            | 15
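As a quick consistency check of the two speedup definitions (our arithmetic, not numbers from the talk): the 2-GPU pipelined run takes 24.5 s at a 26x speedup, and 26 × 24.5 s = 637 s = 10 min 37 s, matching the sequential row; its 512.3x effective speedup implies a reference time of roughly 512.3 × 24.5 s ≈ 12,550 s ≈ 3.5 hours, consistent with the ">3 hours" reference code.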
Stitching Results
Hardware: dual 10-core Xeon, 128 GB DDR3, 3 Tesla K40s

Implementation                  | Time    | Speedup | Effective Speedup | Threads
Sequential (CPU)                | 4.1 min |         | 52.48x            | 1
Pipelined Multi-Threaded (CPU)  | 13 s    | 18.9x   | 993x              | 40
Simple GPU (CPU-GPU)            | 2.1 min | 1.95x   | 102.46x           | 1
Pipelined-GPU, 1 GPU (CPU-GPU)  | 17.3 s  | 14.2x   | 746.28x           | 40
Pipelined-GPU, 2 GPUs (CPU-GPU) | 9.7 s   | 25.36x  | 1331x             | 40
Pipelined-GPU, 3 GPUs (CPU-GPU) | 8.3 s   | 29.6x   | 1555.5x           | 40
Motivation – Hybrid Pipeline Workflows
} Performance gains from HTGS
[Figure: profiler timelines comparing the simple GPU implementation with the HTGS implementation]
Multi-GPU Stitching