XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
Cheng Li*1, Abdul Dakkak*1, Jinjun Xiong2, Wei Wei3, Lingjie Xu3, Wen-mei Hwu1
1University of Illinois Urbana-Champaign, 2IBM Research, 3Alibaba Group
{cli99, dakkak, w-hwu}@illinois.edu, jinjun@us.ibm.com, {w.wei, lingjie.xu}@alibaba-inc.com
Video: https://youtu.be/v95JfmM66eE
Background
§ Machine Learning (ML) models are used in many application domains
§ Understanding ML inference performance is an increasingly pressing but challenging task
  – Without such understanding: slow adoption of DL innovations
ML Model
§ A graph where each vertex is a layer (or operator) and an edge represents data transfer
§ Example: ResNet50, built from modules (Modules 1-8) of Convolution, BatchNorm, Relu, Pooling, Padding, Fully Connected, and Softmax layers
(Figure: the ResNet50 layer graph, with per-layer tensor shapes such as 64 × 56 × 56, 256 × 56 × 56, 512 × 28 × 28, 1024 × 14 × 14, and 2048 × 7 × 7.)
ML Inference Pipeline
§ Pre-processing: input image to input tensor
  – Image decoding
  – Resizing
  – Normalization
  – Type conversion
§ Prediction: input tensor to output tensor (label probabilities), using the framework API
§ Post-processing: unpacking the output into (label, probability) pairs and sorting, e.g. Top1 = (dog, 0.99)
A minimal sketch of the pipeline follows.
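To make the three stages concrete, here is a minimal sketch in Python; the `model.predict` call and the `labels` list are hypothetical stand-ins for any framework API and label set, and the ImageNet-style normalization constants are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def pre_process(path):
    # Image decoding and resizing
    img = Image.open(path).convert("RGB").resize((224, 224))
    # Type conversion and normalization (ImageNet-style mean/std, assumed)
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    # NCHW input tensor with a batch dimension
    return x.transpose(2, 0, 1)[None, ...].astype(np.float32)

def post_process(probs, labels, k=1):
    # Unpack into (label, probability) pairs and sort descending
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in order]

# `model.predict` stands in for the framework's inference API:
# probs = model.predict(pre_process("dog.jpg"))[0]
# print(post_process(probs, labels))  # e.g. [("dog", 0.99)]
```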
XSP Motivation
§ A holistic view of the model execution is needed
§ Existing profiling tools are disjoint
  – Profiling at different granularities means switching between tools
  – No correlation between the profiles
(Figure: levels of the HW/SW stack. Model level: Pre-process, Predict, Post-process. Framework level: layers such as Data, Conv, Bias, Relu, Concat, FC. System level: Malloc/Free, cuDNN Transpose, ConvKernel, cudaMalloc/cudaFree, with metrics such as SP flop count and DRAM read/write.)
XSP Motivation
§ Inference is impacted by the interplay between levels of the HW/SW stack
§ Any of them can be a bottleneck
(Figure: the same HW/SW stack, showing the model, framework, and system levels.)
Current DL Profiling on GPUs
§ 1. Model level: using code insertion around Pre-Process, Inference, and Post-Process
§ 2. Layer level: using the framework profiler (Data, Conv, BN, Relu, …, SoftMax)
§ 3. GPU kernel level: using nvprof or Nsight, which report kernel names (e.g. ShuffleTensor, OffsetComp, VoltaCUDNN_128x64), grid dimensions, and GPU metrics such as SP Flop Count = 62 GFlop, DRAM Read Bytes = 12.1 MB, DRAM Write Bytes = 296 MB, Achieved Occupancy = 13.2%
§ One has to manually perform the difficult task of correlating these disjoint profiles
(Figure: model-, layer-, and GPU kernel-level profiles of MLPerf ResNet50 v1.5 with batch size 256 on a Volta GPU.)
An Approach - Modifying Frameworks
§ NGC frameworks (TensorFlow, PyTorch, etc.) are instrumented with NVTX markers (a sketch of such annotation follows)
  – Gives a GPU profile with layer annotations, but lacks framework profiling
  – May inhibit frameworks from performing some optimizations
  – Does not work for DL models that use customized frameworks
§ TensorFlow profiler: framework profile with some GPU profiling
  – Does not work for other frameworks
§ Vendor lock-in & limited applicability
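As an illustration of NVTX-style annotation (user-level annotation, not the NGC frameworks' built-in instrumentation), here is a minimal sketch using PyTorch's NVTX bindings; the choice of ResNet50 and the input shape are illustrative assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet50().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # NVTX ranges show up as named annotations on the GPU timeline
    # in nvprof/Nsight, letting kernels be grouped under "inference"
    torch.cuda.nvtx.range_push("inference")
    y = model(x)
    torch.cuda.nvtx.range_pop()
```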
XSP: Across-stack Profiling
§ Incorporates profile data from different sources to obtain a holistic and hierarchical view of DL workloads
  – Innovatively leverages distributed tracing
§ Accurately captures the profiles at each HW/SW stack level despite the profiling overhead
  – Leveled experimentation methodology
§ Coupled with an automated analysis pipeline
§ Reveals insights that would otherwise be difficult to discern
Distributed Tracing
§ Designed to monitor distributed applications (e.g. microservices)
§ Key concepts (a minimal sketch follows)
  – Span: a named, timed operation representing a piece of the workflow
    • Start & end timestamps
    • Tags & logs: key-value pairs of user-defined annotations or logging messages for spans
    • SpanContext: a state that refers to a distinct span
  – Trace: a tree of spans
  – Tracer: an object that creates and publishes spans
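A minimal sketch of these concepts using the OpenTelemetry Python SDK, one possible tracing library (XSP's actual tracer implementation may differ); the span names and attribute values are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# A Tracer creates and publishes spans; here they are printed to the console
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("xsp-example")

# A Trace is a tree of spans; nesting establishes the parent/child
# relationship through the span context
with tracer.start_as_current_span("predict") as parent:
    parent.set_attribute("batch_size", 256)    # tag
    with tracer.start_as_current_span("conv1") as child:
        child.add_event("kernel_launched")     # log
```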
An Example Application
§ An application with services (A, B, C, D, E, F) that have causal relationships
§ Tracing workflow: tracers on each host (Host 0, Host 1) create spans and publish them, together with their span contexts, to a tracing server
§ The tracing server assembles the published spans into the application timeline
(Figure: the spans of services A through F laid out on a shared timeline.)
Leveraging Distributed Tracing in XSP
§ Observe the similarity between profiling and distributed tracing
§ Turn profilers into tracers
  – Convert profiled events into spans (a sketch follows)
§ Multiple tracers can exist within a stack level
§ Tracers can be enabled/disabled
(Figure: XSP design - each HW/SW stack level, from Level 0 (user code) through Level N, has one or more tracers (Tracer 0 … Tracer M) that convert that level's events (E0, E1, E2, …) into spans.)
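A minimal sketch of the event-to-span conversion; the `ProfiledEvent` and `Span` records are hypothetical stand-ins for whatever each profiler actually emits.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProfiledEvent:
    name: str      # e.g. "conv1" or a GPU kernel name
    start: float   # timestamp from the system clock
    end: float
    level: int     # HW/SW stack level (0 = user code)

@dataclass
class Span:
    name: str
    start: float
    end: float
    level: int
    parent: Optional["Span"] = None

def to_span(event: ProfiledEvent) -> Span:
    # Each profiled event becomes a timed span; parent links are
    # resolved afterwards by interval inclusion (next slide)
    return Span(event.name, event.start, event.end, event.level)
```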
Constructing Parent/Child Relationships
§ Tracers use the system clock
§ Spans are time intervals and are assigned levels
§ During the profile analysis, check interval inclusion (a sketch follows)
  – If interval s1 contains interval s2 and s1 is one level higher than s2, then s1 is a parent of s2
(Figure: interval inclusion - span E0_0 contains spans E1_0 … E1_x, each of which contains E2 spans.)
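A minimal sketch of the interval-inclusion check, reusing the hypothetical `Span` record from the previous sketch; breaking ties by picking the tightest (latest-starting) container is one plausible disambiguation, not necessarily XSP's.

```python
def contains(s1, s2):
    # s1 contains s2 if s2's time interval lies within s1's
    return s1.start <= s2.start and s2.end <= s1.end

def link_parents(spans):
    # A span's parent is a containing span one level higher;
    # if several qualify, pick the latest-starting one
    for s2 in spans:
        candidates = [s1 for s1 in spans
                      if s1.level == s2.level - 1 and contains(s1, s2)]
        if candidates:
            s2.parent = max(candidates, key=lambda s: s.start)
    return spans
```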
Capturing Asynchronous Events
§ E.g. asynchronous GPU kernel launches, where the launch call and the kernel execution occur at different times
§ Capture both the kernel launch and execution spans (a sketch follows)
  – Use the kernel launch span to figure out the parent span
  – Use the kernel execution span to get performance information or figure out its children spans
(Figure: a Conv layer span contains the cudaLaunchKernel span, while the kernel execution span occurs later on the GPU timeline.)
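CUDA profiling interfaces typically report a correlation ID that links a launch record to its kernel execution; a minimal sketch of using it, with hypothetical span records carrying a `correlation_id` field.

```python
def attach_kernels(launch_spans, exec_spans, spans):
    # launch_spans/exec_spans: hypothetical records sharing a
    # `correlation_id`, as CUDA profiling interfaces typically provide
    by_id = {l.correlation_id: l for l in launch_spans}
    for k in exec_spans:
        launch = by_id[k.correlation_id]
        # Parent the kernel under the layer that issued the launch,
        # even though the kernel ran later on the GPU timeline
        k.parent = launch.parent
        spans.append(k)
    return spans
```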
Capturing Parallel Events
§ E.g. two conv layers overlap in time, and each invokes GPU kernels
§ Serialize the conv layers to get their correlations to GPU kernels (a sketch follows)
§ Or apply more complex post-processing
(Figure: a model span containing overlapping conv1 and conv2 spans, each with its own kernel spans.)
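One way to serialize layer execution, sketched here for TensorFlow, is to restrict inter-op parallelism to a single thread; this illustrates the general idea and is not necessarily XSP's mechanism.

```python
import tensorflow as tf

# With a single inter-op thread, independent layers (e.g. two parallel
# conv branches) run one at a time, so each layer's span cleanly
# contains the GPU kernels it launches
tf.config.threading.set_inter_op_parallelism_threads(1)
```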