1 DiffTaichi: Differentiable Programming for Physical Simulation. Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, Fredo Durand (ICLR 2020). End-to-end optimization of neural network controllers with gradient descent. Yuanming Hu, MIT CSAIL
2 Agenda ✦ Introduction to the Taichi project (10 min) ✦ How DiffTaichi differentiable programming works (ICLR 2020, 20 min) ✦ Getting started with Taichi and DiffTaichi (5 min) ✦ Q&A (10 min)
3 Two Missions of the Taichi Project ✦ Explore novel language abstractions and compilation approaches for visual computing ✦ Practically simplify the process of computer graphics development/deployment
The Life of a Taichi Kernel [compiler pipeline figure; stages include: Python AST transform and Taichi AST generation, kernel registration (@ti.kernel), template instantiation and caching, AST lowering, type checking, reverse-mode autodiff, (sparse) access lowering, the hierarchical SSA Taichi IR, simplifications, loop vectorization, bound inference, scratch pad insertion, compile-time computation (static if, loop unroll, const fold), the LLVM backend (x86_64 / NVPTX GPU), and kernel launch; compile-time data structure info feeds the Python frontend and C++ backend stages]
5 Moving Least Squares Material Point Method Hu, Fang, Ge, Qu, Zhu, Pradhana, Jiang (SIGGRAPH 2018)
8 Sparse Topology Optimization. Liu, Hu, Zhu, Matusik, Sifakis (SIGGRAPH Asia 2018) [figure panels: top view, side view, back view]
9 Sparse Topology Optimization. Liu, Hu, Zhu, Matusik, Sifakis (SIGGRAPH Asia 2018). #voxels = 1,040,875,347; grid resolution = 3000 × 2400 × 1600
10 Want High-Resolution?
12 Want Performance?
[figure: the productivity vs. performance trade-off; high-level programming favors productivity, low-level programming favors performance]
How to get here? Abstractions that exploit domain-specific knowledge! [figure: the same productivity vs. performance plot, aiming for both high productivity and high performance]
15 3 million particles simulated with MLS-MPM; rendered with path tracing. Both programs were written in Taichi.
16 Spatial Sparsity: regions of interest occupy only a small fraction of the bounding volume. [figure: a bounding volume containing a much smaller region of interest]
[figure: a hierarchical sparse grid storing particles, with block sizes 1×1×1, 4×4×4, and 16×16×16]
18 Data structure overhead: hash table lookups (tens of clock cycles), indirection (cache/TLB misses), node allocation (locks, atomics, barriers), branching (misprediction / warp divergence), … In reality, essential computation can take as little as 1% of the run time, with the remaining 99% spent on data structure overhead. Low-level engineering reduces data structure overhead, but harms productivity and couples algorithms with data structures, making it difficult to explore different data structure designs and find the optimal one.
19 Our Solution: The Taichi Programming Language. 1) Decouple computational kernels from (sparse) data structures; 2) an imperative computation language; 3) a hierarchical data structure description language; 4) an intermediate representation (IR) with data structure access optimizations; 5) auto parallelization, memory management, … An optimizing compiler and runtime system turn these into high-performance CPU/GPU kernels. Ours vs. state-of-the-art: 10x shorter code, 4.55x faster overall (MLS-MPM: 13x shorter, 1.2x faster; FEM kernel: 13x shorter, 14.5x faster; MGPCG: 7x shorter, 1.9x faster; Sparse CNN: 9x shorter, 13x faster). Running example: optimizing a 2D Laplace operator on a 1024² sparse grid with 8² blocks.
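As a concrete illustration of point 3), here is a minimal sketch of the data structure description language, assuming a recent Taichi release; the field name x, the 1024² resolution, and the 128/8 block split are illustrative choices, not taken from the slides:

import taichi as ti

ti.init(arch=ti.cpu)

# A 1024x1024 scalar field stored sparsely:
# the top level is a 128x128 pointer array whose cells (blocks)
# are allocated on demand; each block holds a dense 8x8 tile.
x = ti.field(dtype=ti.f32)
block = ti.root.pointer(ti.ij, 128)   # sparse level: only touched blocks exist
block.dense(ti.ij, 8).place(x)        # dense 8x8 tiles inside each block

Because kernels only ever see x[i, j], this layout can later be swapped (e.g., for a fully dense grid or a deeper hierarchy) without touching the computation, which is the decoupling referred to in point 1).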
20 Defining Computation: a finite difference stencil written as a Taichi kernel (see the sketch below).
• Program on sparse data structures as if they were dense;
• Parallel for-loops (Single-Program-Multiple-Data, like CUDA/ispc);
• Loop over only the active elements of the sparse data structure;
• Complex control flow (e.g., if, while) is supported.
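Continuing the sketch above: a minimal stencil kernel, assuming the sparse field x declared earlier plus an output field y with the same layout (the 5-point Laplace stencil and the interior guard are illustrative, not the exact kernel on the slide):

n = 1024  # 128 blocks of 8 cells per dimension, matching the layout above

y = ti.field(dtype=ti.f32)
block.dense(ti.ij, 8).place(y)  # y shares the sparse block structure of x

@ti.kernel
def laplace():
    # Struct-for: a parallel loop over only the *active* cells of x,
    # written as if x were a dense n-by-n array.
    for i, j in x:
        if 0 < i < n - 1 and 0 < j < n - 1:
            y[i, j] = (4.0 * x[i, j] - x[i - 1, j] - x[i + 1, j]
                       - x[i, j - 1] - x[i, j + 1])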
22 Results: 10.0x shorter code, 4.55x higher performance overall. Ours vs. state-of-the-art high-performance CPU/GPU kernels:
• MLS-MPM: 13x shorter code, 1.2x faster
• FEM kernel: 13x shorter code, 14.5x faster
• MGPCG: 7x shorter code, 1.9x faster
• Sparse CNN: 9x shorter code, 13x faster
The Life of a Taichi Kernel [the compiler pipeline figure, repeated; the following slides zoom in on the Taichi intermediate representation]
24 Taichi's Intermediate Representation (IR): CHI (气), Hierarchical Instructions. 「阴阳，气之大者也。」("Yin and yang are the greatest of qi." Zhuangzi, Ze Yang chapter, c. 300 B.C.)
25 Optimization-Oriented Intermediate Representation Design
✦ Hierarchical IR
๏ Keeps loop information
๏ Static scoping
๏ Strictly (strongly) & statically typed
✦ Static Single Assignment (SSA)
✦ Progressive lowering; ~70 instructions in total.
26 Why can’t traditional compilers do the optimizations? 1) Index analysis 2) Instruction granularity 3) Data access semantics
27 The Granularity Spectrum [figure: a spectrum from coarser to finer granularity; an end-to-end access x[i, j] in the Taichi IR (CHI) decomposes into level-wise accesses access1(i, j) and access2(i, j), and further into LLVM IR and machine code]
28 [figure: the same granularity spectrum, annotated; finer representations (LLVM IR, machine code) raise analysis difficulty, while the coarser Taichi IR (CHI) exposes optimization opportunities that would otherwise stay hidden]
29 [figure: productivity vs. performance; Taichi reaches both (10.0x shorter code, 4.55x higher performance) through 1) data structure abstraction, 2) abstraction-specific compiler optimization, and 3) algorithm / data structure decoupling, whereas a data structure library plus a general-purpose compiler offers either a high-level or a low-level interface]
30 DiffTaichi: Differentiable Programming on Taichi (for physical simulation and many other applications). Hu, Anderson, Li, Sun, Carr, Ragan-Kelley, Durand (ICLR 2020). End-to-end optimization of neural network controllers with gradient descent.
Exposure: A White-Box Photo Post-Processing Framework (TOG 2018). Yuanming Hu¹·², Hao He¹·², Chenxi Xu¹·³, Baoyuan Wang¹, Stephen Lin¹ (¹Microsoft Research, ²MIT CSAIL, ³Peking University)
32 Exposure: learn image operations, instead of pixels. Modelling: a differentiable photo post-processing model (resolution independent, content preserving, human-understandable). Optimization: deep reinforcement learning + generative adversarial networks (training without paired data).
ChainQueen: Differentiable MLS-MPM. Hu, Liu, Spielberg, Tenenbaum, Freeman, Wu, Rus, Matusik (ICRA 2019). Hand-written CUDA, 132x faster than TensorFlow. [figure: optimization results at iteration 0 and iteration 58]
The Life of a Taichi Kernel [the compiler pipeline figure, repeated; the following slides zoom in on reverse-mode autodiff]
35 Differentiable Programming vs. Deep Learning: What are they? Given a scalar loss L(x), compute the gradient ∂L/∂x, and optimize/learn via gradient descent!
36 Differentiable Programming vs. Deep Learning: What are the differences?
✦ Deep learning operations:
๏ convolution, batch normalization, pooling, …
✦ Differentiable programming further enables
๏ stencils, gathering/scattering, fine-grained branching and loops, …
๏ more expressiveness & higher performance for irregular operations
✦ Granularity
๏ Why not TensorFlow/PyTorch?
‣ A physical simulator written in TF is 132x slower than CUDA [Hu et al. 2019, ChainQueen]
✦ Reverse-mode automatic differentiation is the key component of differentiable programming
37 The DiffTaichi Programming Language & Compiler: Automatic Differentiation for Physical Simulation. Key language designs: imperative, parallel, megakernels, differentiable. 4.2x shorter code compared to hand-engineered CUDA; 188x faster than TensorFlow. Please check out our paper for more details.
38 [figure: a neural network controller (two fully connected layers with tanh, weights/biases 1 and 2) maps the current state, goal, and phase to a control output; the controller and a differentiable simulation step alternate over 2048 time steps, from the initial state (state 0) to state 2047, ending in a loss function] Our language allows programmers to easily build differentiable physical modules that work inside deep neural networks. The whole program is end-to-end differentiable.
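A minimal sketch of this end-to-end structure, with everything invented for illustration: a toy one-dimensional "simulation", a single scalar controller parameter w, 16 time steps, and the ti.ad.Tape spelling used by recent Taichi releases (older releases expose the same tape as ti.Tape). The real demos use MLS-MPM or mass-spring simulators and a two-layer controller:

import taichi as ti

ti.init(arch=ti.cpu)

steps = 16

# State trajectory, controller parameter, and loss.
# needs_grad=True asks DiffTaichi to allocate the adjoint (gradient) fields.
x = ti.field(dtype=ti.f32, shape=steps, needs_grad=True)
w = ti.field(dtype=ti.f32, shape=(), needs_grad=True)
loss = ti.field(dtype=ti.f32, shape=(), needs_grad=True)

@ti.kernel
def step(t: ti.i32):
    # Toy "controller + simulation" time step:
    # the next state is the current state plus a control tanh(w * x[t]).
    x[t + 1] = x[t] + ti.tanh(w[None] * x[t])

@ti.kernel
def compute_loss():
    # Drive the final state toward the target value 1.0.
    loss[None] = (x[steps - 1] - 1.0) ** 2

x[0] = 0.5
w[None] = 0.3
learning_rate = 0.1

for it in range(50):
    # The tape records the kernel launches below and, on exit,
    # replays their gradient kernels in reverse order (reverse-mode AD).
    with ti.ad.Tape(loss=loss):
        for t in range(steps - 1):
            step(t)
        compute_loss()
    # Gradient descent on the controller parameter.
    w[None] -= learning_rate * w.grad[None]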
41 Reverse-Mode Auto Differentiation ✦ Example: see the sketch below.
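A minimal sketch of reverse-mode AD in Taichi; the function loss = Σᵢ sin(aᵢ) · aᵢ² and the field names are illustrative, not the worked example from the original slide. Calling the automatically generated compute_loss.grad() accumulates ∂loss/∂aᵢ into a.grad:

import taichi as ti

ti.init(arch=ti.cpu)

n = 8
a = ti.field(dtype=ti.f32, shape=n, needs_grad=True)
loss = ti.field(dtype=ti.f32, shape=(), needs_grad=True)

@ti.kernel
def compute_loss():
    for i in range(n):
        # loss = sum_i sin(a_i) * a_i^2  (accumulated atomically)
        loss[None] += ti.sin(a[i]) * a[i] ** 2

for i in range(n):
    a[i] = 0.1 * (i + 1)

# Forward pass, seed d(loss)/d(loss) = 1, then run the automatically
# generated reverse-mode kernel, which accumulates into a.grad.
compute_loss()
loss.grad[None] = 1.0
compute_loss.grad()

for i in range(n):
    # Analytic check: d/da [sin(a) * a^2] = cos(a) * a^2 + 2 * a * sin(a)
    print(a[i], a.grad[i])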
42 Two-Scale AutoDiff: within a kernel, gradients are produced by source code transformation; across kernels, a lightweight tape records kernel launches and replays their gradient versions in reverse order.
43 Related Work (DiffSim=DiffTaichi)