XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
Cheng Li*1, Abdul Dakkak*1, Jinjun Xiong2, Wei Wei3, Lingjie Xu3, Wen-mei Hwu1
1University of Illinois Urbana-Champaign, 2IBM Research, 3Alibaba Group
{cli99, dakkak, w-hwu}@illinois.edu, jinjun@us.ibm.com, {w.wei, lingjie.xu}@alibaba-inc.com
Video: https://youtu.be/v95JfmM66eE
Background
§ Machine Learning (ML) models are used in many application domains
§ Understanding ML inference performance is an increasingly pressing but challenging task
  – Without such understanding: slow adoption of DL innovations
ML Model
§ A graph where each vertex is a layer (or operator) and an edge represents data transfer
§ Example: ResNet50, built from modules (Modules 1-8) of Convolution, BatchNorm, Relu, Pooling, Padding, Fully Connected, and Softmax layers
(Figure: the ResNet50 layer graph, with per-layer tensor shapes such as 64 × 56 × 56, 256 × 56 × 56, 512 × 28 × 28, 1024 × 14 × 14, and 2048 × 7 × 7.)
ML Inference Pipeline
§ Pre-processing: input image to input tensor
  – Image decoding
  – Resizing
  – Normalization
  – Type conversion
§ Prediction: input tensor to output tensor (label probabilities), using the framework API
§ Post-processing: unpacking the output into (label, probability) pairs and sorting, e.g. Top1 = (dog, 0.99)
A minimal sketch of the pipeline follows.
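To make the three stages concrete, here is a minimal sketch in Python; the `model.predict` call and the `labels` list are hypothetical stand-ins for any framework API and label set, and the ImageNet-style normalization constants are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def pre_process(path):
    # Image decoding and resizing
    img = Image.open(path).convert("RGB").resize((224, 224))
    # Type conversion and normalization (ImageNet-style mean/std, assumed)
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
    # NCHW input tensor with a batch dimension
    return x.transpose(2, 0, 1)[None, ...].astype(np.float32)

def post_process(probs, labels, k=1):
    # Unpack into (label, probability) pairs and sort descending
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in order]

# `model.predict` stands in for the framework's inference API:
# probs = model.predict(pre_process("dog.jpg"))[0]
# print(post_process(probs, labels))  # e.g. [("dog", 0.99)]
```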
XSP Motivation
§ A holistic view of the model execution is needed
§ Existing profiling tools are disjoint
  – Profiling at different granularities means switching between tools
  – No correlation between the profiles
(Figure: levels of the HW/SW stack. Model level: Pre-process, Predict, Post-process. Framework level: layers such as Data, Conv, Bias, Relu, Concat, FC. System level: Malloc/Free, cuDNN Transpose, ConvKernel, cudaMalloc/cudaFree, with metrics such as SP flop count and DRAM read/write.)
XSP Motivation
§ Inference is impacted by the interplay between levels of the HW/SW stack
§ Any of them can be a bottleneck
(Figure: the same HW/SW stack, showing the model, framework, and system levels.)
Current DL Profiling on GPUs
§ 1. Model level: using code insertion around Pre-Process, Inference, and Post-Process
§ 2. Layer level: using the framework profiler (Data, Conv, BN, Relu, …, SoftMax)
§ 3. GPU kernel level: using nvprof or Nsight, which report kernel names (e.g. ShuffleTensor, OffsetComp, VoltaCUDNN_128x64), grid dimensions, and GPU metrics such as SP Flop Count = 62 GFlop, DRAM Read Bytes = 12.1 MB, DRAM Write Bytes = 296 MB, Achieved Occupancy = 13.2%
§ One has to manually perform the difficult task of correlating these disjoint profiles
(Figure: model-, layer-, and GPU kernel-level profiles of MLPerf ResNet50 v1.5 with batch size 256 on a Volta GPU.)
An Approach - Modifying Frameworks
§ NGC frameworks (TensorFlow, PyTorch, etc.) are instrumented with NVTX markers (a sketch of such annotation follows)
  – Gives a GPU profile with layer annotations, but lacks framework profiling
  – May inhibit frameworks from performing some optimizations
  – Does not work for DL models that use customized frameworks
§ TensorFlow profiler: framework profile with some GPU profiling
  – Does not work for other frameworks
§ Vendor lock-in & limited applicability
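As an illustration of NVTX-style annotation (user-level annotation, not the NGC frameworks' built-in instrumentation), here is a minimal sketch using PyTorch's NVTX bindings; the choice of ResNet50 and the input shape are illustrative assumptions.

```python
import torch
import torchvision.models as models

model = models.resnet50().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # NVTX ranges show up as named annotations on the GPU timeline
    # in nvprof/Nsight, letting kernels be grouped under "inference"
    torch.cuda.nvtx.range_push("inference")
    y = model(x)
    torch.cuda.nvtx.range_pop()
```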
XSP: Across-stack Profiling
§ Incorporates profile data from different sources to obtain a holistic and hierarchical view of DL workloads
  – Innovatively leverages distributed tracing
§ Accurately captures the profiles at each HW/SW stack level despite the profiling overhead
  – Leveled experimentation methodology
§ Coupled with an automated analysis pipeline
§ Reveals insights that would otherwise be difficult to discern
Distributed Tracing
§ Designed to monitor distributed applications (e.g. microservices)
§ Key concepts (a minimal sketch follows)
  – Span: a named, timed operation representing a piece of the workflow
    • Start & end timestamps
    • Tags & logs: key-value pairs of user-defined annotations or logging messages for spans
    • SpanContext: a state that refers to a distinct span
  – Trace: a tree of spans
  – Tracer: an object that creates and publishes spans
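A minimal sketch of these concepts using the OpenTelemetry Python SDK, one possible tracing library (XSP's actual tracer implementation may differ); the span names and attribute values are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# A Tracer creates and publishes spans; here they are printed to the console
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("xsp-example")

# A Trace is a tree of spans; nesting establishes the parent/child
# relationship through the span context
with tracer.start_as_current_span("predict") as parent:
    parent.set_attribute("batch_size", 256)    # tag
    with tracer.start_as_current_span("conv1") as child:
        child.add_event("kernel_launched")     # log
```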
An Example Application
§ An application with services (A, B, C, D, E, F) that have causal relationships
§ Tracing workflow: tracers on each host (Host 0, Host 1) create spans and publish them, together with their span contexts, to a tracing server
§ The tracing server assembles the published spans into the application timeline
(Figure: the spans of services A through F laid out on a shared timeline.)
Leveraging Distributed Tracing in XSP
§ Observe the similarity between profiling and distributed tracing
§ Turn profilers into tracers
  – Convert profiled events into spans (a sketch follows)
§ Multiple tracers can exist within a stack level
§ Tracers can be enabled/disabled
(Figure: XSP design - each HW/SW stack level, from Level 0 (user code) through Level N, has one or more tracers (Tracer 0 … Tracer M) that convert that level's events (E0, E1, E2, …) into spans.)
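A minimal sketch of the event-to-span conversion; the `ProfiledEvent` and `Span` records are hypothetical stand-ins for whatever each profiler actually emits.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProfiledEvent:
    name: str      # e.g. "conv1" or a GPU kernel name
    start: float   # timestamp from the system clock
    end: float
    level: int     # HW/SW stack level (0 = user code)

@dataclass
class Span:
    name: str
    start: float
    end: float
    level: int
    parent: Optional["Span"] = None

def to_span(event: ProfiledEvent) -> Span:
    # Each profiled event becomes a timed span; parent links are
    # resolved afterwards by interval inclusion (next slide)
    return Span(event.name, event.start, event.end, event.level)
```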
Constructing Parent/Child Relationships
§ Tracers use the system clock
§ Spans are time intervals and are assigned levels
§ During the profile analysis, check interval inclusion (a sketch follows)
  – If interval s1 contains interval s2 and s1 is one level higher than s2, then s1 is a parent of s2
(Figure: interval inclusion - span E0_0 contains spans E1_0 … E1_x, each of which contains E2 spans.)
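A minimal sketch of the interval-inclusion check, reusing the hypothetical `Span` record from the previous sketch; breaking ties by picking the tightest (latest-starting) container is one plausible disambiguation, not necessarily XSP's.

```python
def contains(s1, s2):
    # s1 contains s2 if s2's time interval lies within s1's
    return s1.start <= s2.start and s2.end <= s1.end

def link_parents(spans):
    # A span's parent is a containing span one level higher;
    # if several qualify, pick the latest-starting one
    for s2 in spans:
        candidates = [s1 for s1 in spans
                      if s1.level == s2.level - 1 and contains(s1, s2)]
        if candidates:
            s2.parent = max(candidates, key=lambda s: s.start)
    return spans
```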
Capturing Asynchronous Events
§ E.g. asynchronous GPU kernel launches, where the launch call and the kernel execution occur at different times
§ Capture both the kernel launch and execution spans (a sketch follows)
  – Use the kernel launch span to figure out the parent span
  – Use the kernel execution span to get performance information or figure out its children spans
(Figure: a Conv layer span contains the cudaLaunchKernel span, while the kernel execution span occurs later on the GPU timeline.)
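CUDA profiling interfaces typically report a correlation ID that links a launch record to its kernel execution; a minimal sketch of using it, with hypothetical span records carrying a `correlation_id` field.

```python
def attach_kernels(launch_spans, exec_spans, spans):
    # launch_spans/exec_spans: hypothetical records sharing a
    # `correlation_id`, as CUDA profiling interfaces typically provide
    by_id = {l.correlation_id: l for l in launch_spans}
    for k in exec_spans:
        launch = by_id[k.correlation_id]
        # Parent the kernel under the layer that issued the launch,
        # even though the kernel ran later on the GPU timeline
        k.parent = launch.parent
        spans.append(k)
    return spans
```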
Capturing Parallel Events
§ E.g. two conv layers overlap in time, and each invokes GPU kernels
§ Serialize the conv layers to get their correlations to GPU kernels (a sketch follows)
§ Or apply more complex post-processing
(Figure: a model span containing overlapping conv1 and conv2 spans, each with its own kernel spans.)
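One way to serialize layer execution, sketched here for TensorFlow, is to restrict inter-op parallelism to a single thread; this illustrates the general idea and is not necessarily XSP's mechanism.

```python
import tensorflow as tf

# With a single inter-op thread, independent layers (e.g. two parallel
# conv branches) run one at a time, so each layer's span cleanly
# contains the GPU kernels it launches
tf.config.threading.set_inter_op_parallelism_threads(1)
```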