Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization. Paras Jain. Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez. checkmateai.github.io
BigGAN (2018): image generation (Brock et al. 2019). VideoBERT (2019): video generation (Sun et al. 2019). GPT-2 (2019): text generation (Radford et al. 2019).
Emerging trend: rapid growth in model size. [Figure adapted from NVIDIA: parameter counts (10^6) of recent models, rising from near zero toward ~1,600M.]
State-of-the-art models have hit a memory capacity wall. Limited GPU memory is slowing progress in new deep learning models! [Figure: per-GPU RAM usage for AlexNet (2012), VGG19 (2014), Inception v3 (2015), ResNet-152 (2015), DenseNet-201 (2016), ResNeXt-101 (2016), FCN8s (2017), Transformer (2017), RoBERTa (2018), and BigGAN (2018), approaching the ~15GB GPU memory limit.] Works citing memory as a limiting factor: Chen et al. 2016, Gomez et al. 2017, Pohlen et al. 2017, Liu et al. 2019, Dai et al. 2019, Child et al. 2019.
Problem: how do we efficiently train large models beyond memory limits? [Previous figure repeated: per-GPU RAM usage approaching the GPU memory limit.]
Compute is outstripping DRAM capacity growth. [Figure: TOPS per GiB of DRAM capacity over time.]
Backprop is optimized for compute efficiency, not RAM usage. [Figure: RAM vs. compute, marking compute-optimized backprop.]
Ideal: a scalable algorithm for backprop that adapts to RAM constraints. [Figure: RAM vs. compute, marking compute-optimized and RAM-optimized backprop.]
This work: an optimal space-time tradeoff for backpropagation. Checkmate explores the optimal trade-off, enabling 5x larger inputs at 2x compute cost. [Figure: RAM vs. compute frontier between compute-optimized and RAM-optimized backprop.]
RAM-hungry backprop policy: keep all layers in RAM. [Figure: RAM vs. compute, compute-optimized corner.]
RAM-hungry backpropagation policy: keep all layers in RAM. [Figure, animated over four builds: the forward pass computes A → B → C → D → E → Loss and retains every activation; the backward pass then computes ∇E, ∇D, ∇C, ∇B, ∇A, freeing each activation only after its gradient is formed. Peak RAM occurs early in the backward pass, with all activations still resident.]
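To make the accounting concrete, here is a minimal sketch (not Checkmate's implementation) of peak-RAM bookkeeping for the keep-all policy on a linear chain; the layer names and uniform 100MB activation sizes are hypothetical:

    # Hypothetical activation sizes (MB) for a 5-layer chain A..E.
    sizes_mb = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100}

    def keep_all_peak_ram(sizes):
        order = list(sizes)
        resident_mb = 0.0
        peak_mb = 0.0
        # Forward pass: every activation stays resident for the backward pass.
        for layer in order:
            resident_mb += sizes[layer]
            peak_mb = max(peak_mb, resident_mb)
        # Backward pass: each gradient needs its layer's activation, which is
        # freed only after the gradient is produced.
        for layer in reversed(order):
            peak_mb = max(peak_mb, resident_mb + sizes[layer])  # + gradient buffer
            resident_mb -= sizes[layer]
        return peak_mb

    print(keep_all_peak_ram(sizes_mb))  # 600.0 MB: peak grows linearly with depth

Under this policy, peak RAM grows linearly with network depth, which is exactly what runs into the memory wall.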
RAM-optimized backpropagation policy: recompute all layers as needed. [Figure: RAM vs. compute, RAM-optimized corner.]
RAM-optimized backpropagation policy: recompute all layers. How can we use less memory? Free activations early and recompute them on demand. [Figure, animated over four builds: the forward pass computes A → B → C → D → E → Loss but frees each activation immediately; before forming each gradient ∇E … ∇A, the backward pass recomputes the needed forward prefix (e.g., A → B → C before ∇C). Peak RAM stays far below the no-recomputation peak.]
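A companion sketch for the same chain as above shows the other extreme of the trade-off (again hypothetical, with deliberately simplified accounting that keeps only one activation plus one gradient buffer live at a time):

    def recompute_all_peak_ram(sizes):
        order = list(sizes)
        peak_mb = 0.0
        forward_ops = 0
        # Walk the backward pass; re-run the forward prefix for each gradient.
        for idx in range(len(order) - 1, -1, -1):
            layer = order[idx]
            forward_ops += idx + 1  # layers re-executed to reach `layer`
            peak_mb = max(peak_mb, 2 * sizes[layer])  # activation + gradient
        return peak_mb, forward_ops

    print(recompute_all_peak_ram(sizes_mb))  # (200.0, 15)

Peak RAM is now constant in depth, but forward work is quadratic (15 layer evaluations instead of 5 for this chain): these two policies are the endpoints Checkmate interpolates between.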
How to choose which layers to recompute? [Figure, animated over three builds: the space of possible schedules; e.g., one schedule's forward pass A B C C C C C D E recomputes layer C once for each backward-pass step that consumes it.]
Challenges of heuristics: 1. Variable runtime per layer (up to 10^6x slower). 2. Variable RAM usage per layer (up to 10^3x more RAM). 3. Real DNNs are non-linear (general DAGs, not chains).
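An illustrative (entirely hypothetical) per-layer profile shows why uniform rules break down:

    # Hypothetical per-layer profile: (runtime_ms, activation_mb). Real DNNs
    # span orders of magnitude on both axes, so a rule that treats layers
    # uniformly ("checkpoint every k-th layer") can end up recomputing the
    # most expensive layers while storing the cheapest ones.
    profile = {
        "conv1": (45.0, 800.0),  # costly to recompute AND big to store
        "relu1": (0.05, 800.0),  # nearly free to recompute
        "fc":    (12.0, 0.5),    # nearly free to store
    }

A good plan recomputes relu1 and stores fc unconditionally; only conv1 is a genuine budget-dependent decision, and no position-based heuristic can express that.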
Prior work is suboptimal in the general setting! Greedy heuristics [Chen 2016; XLA authors 2017, 2020]. Divide-and-conquer heuristics [Griewank 2000; Kowarz 2006; Siskind 2018; Kumar 2019]. Optimal only for specific architectures [Gruslys 2016; Feng 2018; Beaumont 2019]. Challenges: 1. variable runtime per layer; 2. variable RAM usage per layer; 3. real DNNs are non-linear.
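For reference, a sketch of the sqrt(n) idea behind the greedy heuristic of Chen et al. (2016) on a linear chain (a paraphrase, not their code): checkpoint every ~sqrt(n)-th activation and recompute the segments in between, giving O(sqrt(n)) memory for roughly one extra forward pass, while ignoring per-layer costs, per-layer sizes, and non-linear DAG structure:

    import math

    def sqrt_n_checkpoints(n_layers):
        # Checkpoint indices for a linear chain of n_layers activations.
        stride = max(1, round(math.sqrt(n_layers)))
        return sorted(set(range(0, n_layers, stride)))

    print(sqrt_n_checkpoints(25))  # [0, 5, 10, 15, 20]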
Can we optimally trade off RAM for compute? Let's be: 1. hardware-aware; 2. RAM-aware; 3. flexible over DAGs. [Figure: RAM vs. compute frontier between "checkpoint every node" and "recompute all layers".]
A system for optimal tensor rematerialization. Solve for 10s-1hr, then train for ~1 month. GPU, CPU, and TPU support; hardware- and RAM-aware. Components: accurate cost model; flexible search space; graph rewrites; an optimal solver (integer linear program) and a near-optimal solver (two-phase rounding).
A system for optimal tensor rematerialization. [Example graph: A → B → C → ∇C → ∇B → ∇A.]
A system for optimal tensor rematerialization. Unroll execution into stages: each stage t = 1 … 6 may evaluate any subset of the graph A → B → C → ∇C → ∇B → ∇A, and S_{t,i} ∈ {0, 1} records whether layer i is kept in memory during stage t.
A system for optimal tensor rematerialization. Two binary decision matrices over (stage, layer): S_{t,i} ∈ {0, 1} answers "what is in memory at stage t?" and R_{t,i} ∈ {0, 1} answers "what is (re)computed at stage t?"
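A minimal sketch of these decision variables using the open-source PuLP MIP library (the library choice, stage/layer counts, and names are assumptions for illustration, not the talk's code):

    import pulp

    n_layers = 6   # A, B, C, ∇C, ∇B, ∇A
    n_stages = 6   # one stage per scheduled output, as in the unrolled grid

    # S[t][i] = 1 iff layer i is kept in memory during stage t.
    # R[t][i] = 1 iff layer i is (re)computed during stage t.
    S = pulp.LpVariable.dicts("S", (range(n_stages), range(n_layers)), cat="Binary")
    R = pulp.LpVariable.dicts("R", (range(n_stages), range(n_layers)), cat="Binary")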
Example of an optimal S matrix (SegNet). [Figure: the solver's stage-by-layer checkpointing pattern for SegNet.]
A system for optimal tensor rematerialization (overview repeated): accurate cost model, flexible search space, and graph rewrites feeding the optimal solver (integer linear program) and the near-optimal solver (two-phase rounding).
A system for optimal tensor rematerialization. Decision variables: S_{t,i} ∈ {0, 1} (layer i stored during stage t) and R_{t,i} ∈ {0, 1} (layer i (re)computed during stage t). Minimize total forward + backward cost, using the R matrix to form a linear objective:

\[ \min \sum_{t=1}^{T} \sum_{i=1}^{n} D_i \, R_{t,i} \]

where D_i is the measured runtime of layer i.
A system for optimal tensor rematerialization. Decision variables as above; correctness constraints:

\( R_{t,j} \le R_{t,i} + S_{t,i} \) for every dependency edge (i, j): "a layer's dependencies must be computed (or resident in RAM) before evaluation."

\( S_{t,i} \le S_{t-1,i} + R_{t-1,i} \): "a layer must be computed before it can be stored in RAM."
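Continuing the PuLP sketch (the chain-shaped dependency list, the cost vector, and the convention that stage t must produce output t are illustrative assumptions; the full Checkmate ILP also bounds peak RAM per stage, which is omitted here):

    # Hypothetical per-layer compute costs D_i and a linear dependency chain
    # A -> B -> C -> ∇C -> ∇B -> ∇A (edge (i, j): layer j consumes layer i).
    D = [1.0, 1.0, 1.0, 1.5, 1.5, 1.5]
    deps = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

    prob = pulp.LpProblem("rematerialization_sketch", pulp.LpMinimize)

    # Objective: total (re)computation cost across all stages.
    prob += pulp.lpSum(D[i] * R[t][i]
                       for t in range(n_stages) for i in range(n_layers))

    for t in range(n_stages):
        for (i, j) in deps:
            # Dependencies must be resident or recomputed in the same stage.
            prob += R[t][j] <= R[t][i] + S[t][i]
        for i in range(n_layers):
            if t == 0:
                prob += S[0][i] == 0          # nothing resident initially
            else:
                # Store only what was computed or stored in the prior stage.
                prob += S[t][i] <= R[t - 1][i] + S[t - 1][i]
        prob += R[t][t] == 1                  # stage t must emit output t

    prob.solve()  # PuLP's bundled CBC solver handles this small ILP
    print(pulp.LpStatus[prob.status], pulp.value(prob.objective))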