TASO: Optimizing Deep Learning with Automatic Generation of Graph Substitutions. Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. Stanford University. SOSP’19, 12/14/19 1
Current Rule-based DNN Optimizations [Figure: a rule-based optimizer rewrites a computation graph (conv3x3 and conv1x1 operators followed by relu and add) into an optimized graph, for example by fusing conv + relu.] 2
Current Rule-based DNN Optimizations Rules in the rule-based optimizer: fuse conv + relu; fuse conv + batch normalization; fuse multi. convs; … TensorFlow currently includes ~200 rules (~53,000 LOC). 3
Limitations of Rule-based Optimizations Robustness: experts’ heuristics do not apply to all DNNs/hardware. User reports: "When I turned on XLA (TensorFlow’s graph optimizer), the training speed is about 20% slower." "With XLA, my program is almost 2x slower than without XLA." 4
Limitations of Rule-based Optimizations Robustness: experts’ heuristics do not apply to all DNNs/hardware. Scalability: new operators and graph structures require more rules; TensorFlow currently uses ~4K LOC to optimize convolution. 5
Limitations of Rule-based Optimizations Robustness: experts’ heuristics do not apply to all DNNs/hardware. Scalability: new operators and graph structures require more rules. Performance: miss subtle optimizations for specific DNNs/hardware. 6
Motivating Example [Figure: a chain of graph substitutions on a block of Conv3x3 + Relu, Conv1x1 + Relu, Add, and Relu operators; the steps are labeled enlarge convs, fuse convs, fuse conv & relu, and fuse conv & add.] The final graph is 30% faster on V100 but 10% slower on K80. 7
DNN Graph Optimizations Dimensions: DNN operators, graph architectures, hardware backends. How should we address the complexity of designing DNN graph optimizations? 8
TASO: Tensor Algebra SuperOptimizer • Key idea: replace manually-designed graph optimizations with automated generation and verification of graph substitutions for deep learning • Less engineering effort: 53,000 LOC for manual graph optimizations in TensorFlow → 1,400 LOC in TASO • Better performance: outperform existing optimizers by up to 2.8x 9
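Graph Substitution [Figure: two Conv3x3 operators that share input X, with weights W1 and W2 and outputs Y1 and Y2, are replaced by Concat(W1, W2), a single Conv3x3 on X, and a Split that yields Y1 and Y2.] 10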
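The substitution on this slide can be checked numerically on a small case. The following is a minimal sketch, not TASO code: it assumes a plain 1-D multi-channel convolution in place of Conv3x3, concatenation and splitting along the output-channel axis, and integer tensors so that equality is exact. The helper names `conv1d`, `concat`, `split`, and `rand_tensor` are hypothetical.

```python
import random

def conv1d(x, w):
    """Valid 1-D convolution. x: [C_in][L], w: [C_out][C_in][K] -> [C_out][L-K+1]."""
    c_in, length = len(x), len(x[0])
    k = len(w[0][0])
    return [
        [
            sum(w_o[c][j] * x[c][t + j] for c in range(c_in) for j in range(k))
            for t in range(length - k + 1)
        ]
        for w_o in w
    ]

def concat(w1, w2):
    """Concatenate two weight tensors along the output-channel axis."""
    return w1 + w2

def split(y, n_first):
    """Split an output tensor along the channel axis after n_first channels."""
    return y[:n_first], y[n_first:]

rng = random.Random(0)

def rand_tensor(*shape):
    """Nested lists of small random integers with the given shape."""
    if len(shape) == 1:
        return [rng.randint(-5, 5) for _ in range(shape[0])]
    return [rand_tensor(*shape[1:]) for _ in range(shape[0])]

x = rand_tensor(3, 8)        # 3 input channels, length 8
w1 = rand_tensor(4, 3, 3)    # 4 output channels, kernel size 3
w2 = rand_tensor(2, 3, 3)    # 2 output channels, kernel size 3

separate = (conv1d(x, w1), conv1d(x, w2))                   # left-hand graph
fused = split(conv1d(x, concat(w1, w2)), len(w1))           # right-hand graph
print(separate == fused)
```

A single passing input does not prove the substitution; it only illustrates why the fused graph computes the same outputs with one convolution instead of two.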
TASO Workflow [Diagram: graph subst. generator → candidate substitutions → graph subst. verifier (using operator specifications) → verified substitutions → graph optimizer.] 11
TASO Workflow [Diagram: the search-based graph optimizer takes an input comp. graph and the verified substitutions and produces an optimized comp. graph.] 12
Key Challenges 1. How to generate potential substitutions? Graph fingerprints 2. How to verify their correctness? Operator specifications + theorem prover 13
[Outline: Subst. Generator, Graph Optimizer, Subst. Verifier] Graph Substitution Generator Enumerate all possible graphs up to a fixed size using available operators (the operators supported by the hardware backend). 14
Graph Substitution Generator 66M graphs with up to 4 operators. Directly evaluating all pairs requires a quadratic number of tests. 15
Graph Substitution Generator Compute output fingerprints with random input tensors. [Figure: every enumerated graph is run on random inputs I1 … IK and produces outputs O1 … OK.] 16
Graph Substitution Generator Pairs of graphs with identical fingerprint are candidate substitutions. [Figure: the same graphs and outputs O1 … OK on inputs I1 … IK, with matching fingerprints paired.] 17
Graph Substitution Generator TASO generates ~29,000 substitutions by enumerating graphs w/ up to 4 operators. 743 substitutions remain after applying pruning techniques to eliminate redundancy. 18
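The generator's enumerate-then-fingerprint idea can be sketched on a toy operator set. This is an illustration under loud assumptions, not TASO's generator: scalars stand in for tensors, `add` and `mul` stand in for DNN operators, graphs are expression trees written as nested tuples, and the fingerprint is simply the tuple of outputs on a few random inputs. Grouping by fingerprint avoids testing every pair of graphs directly.

```python
import itertools
import random
from collections import defaultdict

# Toy operator set (assumption): two binary operators over scalars.
OPS = {"add": lambda p, q: p + q, "mul": lambda p, q: p * q}
INPUTS = ("a", "b", "c")

def enumerate_graphs(max_ops):
    """All expression trees with at most max_ops operators, as nested tuples."""
    by_size = {0: list(INPUTS)}
    for n in range(1, max_ops + 1):
        by_size[n] = [
            (op, left, right)
            for op in OPS
            for n_left in range(n)
            for left in by_size[n_left]
            for right in by_size[n - 1 - n_left]
        ]
    return [g for n in range(max_ops + 1) for g in by_size[n]]

def evaluate(graph, env):
    """Evaluate an expression tree given values for the input names."""
    if isinstance(graph, str):
        return env[graph]
    op, left, right = graph
    return OPS[op](evaluate(left, env), evaluate(right, env))

def fingerprint(graph, envs):
    """Outputs of the graph on every random input assignment."""
    return tuple(evaluate(graph, env) for env in envs)

rng = random.Random(0)
envs = [{name: rng.randint(1, 10**6) for name in INPUTS} for _ in range(4)]

# Bucket graphs by fingerprint; only graphs in the same bucket are compared.
buckets = defaultdict(list)
for g in enumerate_graphs(max_ops=2):
    buckets[fingerprint(g, envs)].append(g)

candidates = [
    pair
    for group in buckets.values()
    for pair in itertools.combinations(group, 2)
]
print(len(enumerate_graphs(2)), "graphs,", len(candidates), "candidate pairs")
```

Matching fingerprints only make a pair a candidate; equivalence on random inputs is evidence, not proof, which is why the workflow passes candidates to a verifier.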
Graph Substitution Verifier [Diagram: candidate substitutions → graph subst. verifier → verified substitutions, checked against operator specifications.] Operator Specifications: P1. conv is distributive over concatenation: ∀x, w1, w2. Conv(x, Concat(w1, w2)) = Concat(Conv(x, w1), Conv(x, w2)) P2. conv is bilinear … Pn. 19
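Verification Workflow [Figure: the substitution from two Conv operators on a shared input X (weights W1, W2; outputs Y1, Y2) to Concat, one Conv, and Split.] Substitution to verify: (Conv(x, w1), Conv(x, w2)) versus Split(Conv(x, Concat(w1, w2))). Query given to the theorem prover: ∃x, w1, w2. (Conv(x, w1), Conv(x, w2)) ≠ Split(Conv(x, Concat(w1, w2))). Operator Specifications: P1. ∀x, w1, w2. Conv(x, Concat(w1, w2)) = Concat(Conv(x, w1), Conv(x, w2)) P2. … Result: UNSAT, meaning no counterexample exists, so the substitution is verified. 20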
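To make the role of the operator specifications concrete, here is a small sketch that checks this one substitution by term rewriting. It is a swapped-in technique for illustration, not the theorem prover TASO uses: terms are nested tuples, P1 is applied as a left-to-right rewrite rule, and the second rule, Split(Concat(a, b)) = (a, b), is an assumed property that does not appear on the slide. The names `rewrite` and `Pair` are hypothetical.

```python
def rewrite(term):
    """Normalize a term bottom-up with two rewrite rules."""
    if isinstance(term, str):          # a variable such as "x" or "w1"
        return term
    head, *args = term
    args = [rewrite(a) for a in args]
    # P1 (from the slide): Conv(x, Concat(w1, w2)) -> Concat(Conv(x, w1), Conv(x, w2))
    if head == "Conv" and isinstance(args[1], tuple) and args[1][0] == "Concat":
        x, (_, w1, w2) = args
        return rewrite(("Concat", ("Conv", x, w1), ("Conv", x, w2)))
    # Assumed property (not on the slide): Split(Concat(a, b)) -> Pair(a, b)
    if head == "Split" and isinstance(args[0], tuple) and args[0][0] == "Concat":
        _, a, b = args[0]
        return ("Pair", a, b)
    return (head, *args)

# Source graph: two separate convolutions on the same input.
source = ("Pair", ("Conv", "x", "w1"), ("Conv", "x", "w2"))
# Target graph: concatenate the weights, convolve once, split the output.
target = ("Split", ("Conv", "x", ("Concat", "w1", "w2")))

verified = rewrite(source) == rewrite(target)
print(verified)
```

Rewriting to a normal form is weaker than a general prover: it can confirm equalities that follow from the rules in one direction, but a mismatch does not show that two graphs differ.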
Verification Effort TASO generates all 743 substitutions in 5 minutes, and verifies them against 43 operator properties in 10 minutes Supporting a new operator requires a few hours of human effort to discover its properties Operator specifications in TASO ≈ 1,400 LOC Manual graph optimizations in TensorFlow ≈ 53,000 LOC 21
Search-Based Graph Optimizer [1] • Goal: apply verified substitutions to obtain an optimized graph • Cost model [2] • Based on the sum of individual operators’ costs • Measure the cost of each operator on hardware • Cost-based backtracking search • Backtrack to escape locally optimal solutions • Optimizing a DNN model takes less than 10 minutes [1] Z. Jia et al. Optimizing DNN Computation with Relaxed Graph Substitutions. In SysML’19. [2] Z. Jia et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. ICML’18. 22
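The bullets above can be sketched as a best-first search with a relaxation factor. This is a toy under stated assumptions, not TASO's optimizer: a graph is a sorted tuple of operator names, the per-operator costs in `COST` are made-up numbers rather than measurements, `enlarge` and `fuse` are two hypothetical substitutions echoing the motivating example, and `alpha` is an assumed parameter that lets the search keep graphs costing up to `alpha` times the best cost found so far.

```python
import heapq

# Hypothetical per-operator costs; a real cost model would measure these on hardware.
COST = {"conv1x1": 2.0, "conv3x3": 3.0, "fused_conv3x3": 4.0}

def cost(graph):
    """Cost model: sum of the individual operators' costs."""
    return sum(COST[op] for op in graph)

def enlarge(graph):
    """conv1x1 -> conv3x3; this raises the cost when applied alone."""
    return [
        tuple(sorted(graph[:i] + ("conv3x3",) + graph[i + 1:]))
        for i, op in enumerate(graph)
        if op == "conv1x1"
    ]

def fuse(graph):
    """Two conv3x3 operators -> one fused convolution."""
    ops = list(graph)
    if ops.count("conv3x3") >= 2:
        ops.remove("conv3x3")
        ops.remove("conv3x3")
        return [tuple(sorted(ops + ["fused_conv3x3"]))]
    return []

def optimize(graph, substitutions, alpha=1.0):
    """Best-first search; alpha > 1 admits temporarily worse graphs."""
    best = graph
    queue = [(cost(graph), graph)]
    seen = {graph}
    while queue:
        c, g = heapq.heappop(queue)
        if c < cost(best):
            best = g
        for subst in substitutions:
            for new in subst(g):
                if new not in seen and cost(new) <= alpha * cost(best):
                    seen.add(new)
                    heapq.heappush(queue, (cost(new), new))
    return best

start = ("conv1x1", "conv3x3")
greedy = optimize(start, [enlarge, fuse], alpha=1.0)
relaxed = optimize(start, [enlarge, fuse], alpha=1.3)
print(greedy, cost(greedy))
print(relaxed, cost(relaxed))
```

With `alpha=1.0` the search never accepts the cost-increasing enlarge step and stays at the starting graph; with `alpha=1.3` it passes through the more expensive intermediate graph and reaches the cheaper fused one, which is the behavior the backtracking bullet describes.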
End-to-end Inference Performance (V100 GPU w/ cuDNN) [Bar chart: runtime (ms, axis 0 to 15) of TensorFlow, TensorRT, MetaFlow, and TASO w/ cuDNN on ResNet-50, NasNet-A, ResNeXt-50, NasRNN, and BERT-Large; speedup labels on the chart: 1.3x, 2.8x, 1.4x, 1.0x, 1.4x.] Competitive on standard models; larger speedups on emerging models. 23
End-to-end Inference Performance (V100 GPU w/ TVM) [Bar chart: runtime (ms, axis 0 to 15) of TVM and TASO w/ TVM on ResNet-50, NasNet-A, ResNeXt-50, NasRNN, and BERT-Large; speedup labels on the chart: 1.3x, 1.8x, 1.3x, 1.1x, 1.0x.] Similar speedups on the TVM backend. 24
Heatmap of Used Substitutions [Heatmap: how many times a subst. is used to optimize a DNN; some of the substitutions are marked as not covered in TensorFlow.] Different DNN models require different substitutions. 25
Conclusion TASO is the first DNN optimizer that automatically generates substitutions • Less engineering effort • Better performance • Formal verification https://github.com/jiazhihao/taso • Supports DNN models in ONNX, TensorFlow, and PyTorch 26
Scalability Analysis [Line chart: relative speedup (axis 1 to 3) versus maximum graph substitution size (axis 0 to 4) for NasNet-A, ResNeXt-50, and BERT.] 27
Case Study: NASNet [Figure: a NASNet cell over Input 1 and Input 2 built from Add, Conv 1x1, Avg 3x3, and DWC 3x3/5x5 operators and joined by a concat, shown with substitutions over Conv 1x1, DWC, Avg, Add, and Concat operators with inputs X, X1, X2 and weights W1 to W4.] Legend: Add: element-wise addition; Conv: standard conv; DWC: depth-wise conv. 28
Future Work: Query Optimizations • A database query is expressed as a tree of relational operators • Query optimizations are tree transformations 29
Contribution • Replacing current manually-designed graph optimizations with automatic generation of graph substitutions for deep learning • Less engineering effort: 53,000 LOC for graph optimizations in TensorFlow → 1,400 LOC • Better performance: outperform existing optimizers by up to 2.8x • Correctness: formal verification of graph substitutions 30
Limitations of Rule-based Optimizations Robustness: experts’ heuristics do not apply to all DNNs/hardware. Scalability: new operators and graph structures require more rules. Performance: miss subtle optimizations for specific DNNs/hardware. [Figure: two example substitutions over conv, DWC, concat, add, and pad operators, captioned "Only apply to specific hardware" and "Only apply to specialized graph structures".] 31
TASO: Tensor Algebra SuperOptimizer Key idea: automatically generate graph substitutions and verify them. [Diagram: graph subst. generator → candidate substitutions → graph subst. verifier (using operator specifications) → verified substitutions.] 32
TASO: Tensor Algebra SuperOptimizer [Diagram of TASO: graph subst. generator → candidate substitutions → graph subst. verifier (using operator specifications) → verified substitutions → search-based graph optimizer, which turns an input comp. graph into an optimized comp. graph.] 33