Memory-Optimal Direct Convolutions for Maximizing Classification Accuracy in Embedded Devices
Albert Gural¹, Boris Murmann¹
¹Stanford University
The 36th International Conference on Machine Learning
Long Beach, California
June 11, 2019
Introduction
• Embedded devices are increasingly targets of machine learning for IoT
  • Microsoft EdgeML
    • Bonsai [1]: decision tree achieves 94.38% on MNIST-2 in 2KB
    • ProtoNN [2]: nearest neighbors achieves 93.25% on MNIST-2 in 2KB
    • FastGRNN [3]: RNN achieves 98.20% on MNIST in 6KB
  • Google TensorFlow Lite for MCUs [4]
• Hard memory constraints make deep learning difficult
  • “Bonsai is not compared to deep convolutional neural networks as they have not yet been demonstrated to fit on such tiny IoT devices” [1]
• But CNNs typically have SOTA performance for image classification tasks
• Can we do better with CNNs?
  • Goal: MNIST classifier in 2KB
Introduction
• Deep CNN implementation research typically focused on speed
  • FFT, Winograd, gemm
• Minimal research prioritizing memory reduction
  • Memory-Efficient Convolution [5] improves memory use of gemm methods, but still has overhead
  • Zero-Memory Overhead [6] performs direct convolutions for zero overhead beyond input/output activation storage
Introduction
• Deep CNN implementation research typically focused on throughput
  • FFT, Winograd, gemm
• Minimal research prioritizing memory reduction
  • Memory-Efficient Convolution [5] improves memory use of gemm methods, but still has overhead
  • Zero-Memory Overhead [6] performs direct convolutions for zero overhead beyond input/output activation storage
  • Negative-Memory Overhead: can do even better by replacing input activations while computing output activations
[Figure: network diagram. Size labels: 28 × 28 × 1, 176, 10. Layers: AvgPool 2x2, Conv 3x3, Conv 3x3, Conv 3x3, MaxPool 2x2, Flatten, Dense]
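Replace Method
[Figure: activation memory with numbered pixels showing processing order. Labels: features, height, kernel width. Legend: input pixel, output pixel, stale pixel. Two cases shown: $f_{\text{out}} \le f_{\text{in}}$ and $f_{\text{out}} > f_{\text{in}}$]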
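The idea on this slide (write each output pixel into an input pixel that has gone stale) can be sketched for the case $f_{\text{out}} \le f_{\text{in}}$ as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `conv_valid_replace`, the `[row][col][feature]` buffer layout, stride 1 with no padding, the one-pixel temporary buffer, and the choice to leave each output pixel in the slot of the top-left input pixel of its window are all assumptions made for the example. The case $f_{\text{out}} > f_{\text{in}}$ is not covered.

```python
def conv_valid_replace(buf, H, W, f_in, f_out, K, weights):
    """In-place K x K convolution (stride 1, no padding) over a flat buffer.

    buf      flat list of length H * W * f_in, layout [row][col][feature]
    weights  nested list indexed as weights[o][dy][dx][i]

    Assumption for this sketch: output pixel (r, c) is written into the
    first f_out slots of input pixel (r, c). In raster order that input
    pixel is stale once output (r, c) has been computed, because every
    later window starts to its right or on a lower row.
    """
    assert f_out <= f_in, "this sketch only covers f_out <= f_in"
    out_h, out_w = H - K + 1, W - K + 1
    for r in range(out_h):
        for c in range(out_w):
            # Temporary storage for ONE output pixel: every output feature
            # reads input pixel (r, c), so it cannot be overwritten early.
            pixel = [0] * f_out
            for o in range(f_out):
                acc = 0
                for dy in range(K):
                    for dx in range(K):
                        base = ((r + dy) * W + (c + dx)) * f_in
                        for i in range(f_in):
                            acc += weights[o][dy][dx][i] * buf[base + i]
                pixel[o] = acc
            base = (r * W + c) * f_in
            buf[base:base + f_out] = pixel  # replace the stale input pixel
    return out_h, out_w


if __name__ == "__main__":
    # Tiny demo: 4 x 4 single-feature image, 3 x 3 kernel of ones.
    image = list(range(16))
    ones = [[[[1] for _ in range(3)] for _ in range(3)]]
    print(conv_valid_replace(image, 4, 4, 1, 1, 3, ones), image)
```

Under these assumptions the only memory beyond the input buffer is the `f_out`-element temporary, which is what makes the overhead relative to separate input and output buffers negative.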
Herringbone Method
[Figure: herringbone tiles annotated "25 cost; 20 free", "30 cost; 32 free", "55 cost; 60 free"]
Order of Convolutions (herringbone tile):
 0  1  2  3  4  5  6  7
 8 15 16 17 18 19 20 21
 9 22 28 29 30 31 32 33
10 23 34 39 40 41 42 43
11 24 35 44 48 49 50 51
12 25 36 45 52 55 56 57
13 26 37 46 53 58 60 61
14 27 38 47 54 59 62 63
Herringbone Method
• In the paper, we demonstrate optimality for lossless, per-layer, direct convolutions
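The numbering in the "Order of Convolutions" grid on the Herringbone slide can be reproduced by alternating between the rest of the current row and the rest of the current column. The sketch below only regenerates that visiting order; the function name `herringbone_order` is hypothetical, and the cost/free accounting from the slide is not modelled.

```python
def herringbone_order(n_rows, n_cols):
    """Return a grid whose entry [r][c] is the step at which output
    pixel (r, c) is computed in a herringbone sweep."""
    order = [[None] * n_cols for _ in range(n_rows)]
    step = 0
    r0 = c0 = 0  # top-left corner of the region not yet visited
    while r0 < n_rows and c0 < n_cols:
        # Remaining part of the top row, left to right.
        for c in range(c0, n_cols):
            order[r0][c] = step
            step += 1
        r0 += 1
        # Remaining part of the left column, top to bottom.
        for r in range(r0, n_rows):
            order[r][c0] = step
            step += 1
        c0 += 1
    return order


if __name__ == "__main__":
    for row in herringbone_order(8, 8):
        print(" ".join(f"{v:2d}" for v in row))
```

For an 8 × 8 grid this prints the same table as the slide, with the second row reading 8 15 16 17 18 19 20 21.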
Transpose Implementation
• Transpose method: process a row, transpose, process a row, transpose, …
• Successor: $j = (i \bmod H) \cdot W + \lfloor i / H \rfloor$
• Example (3 × 4 matrix):
  mem layout A (matrix view):
   0  1  2  3
   4  5  6  7
   8  9 10 11
  mem layout A (linear): 0 1 2 3 4 5 6 7 8 9 10 11
  mem layout B (matrix view):
   0  4  8
   1  5  9
   2  6 10
   3  7 11
  mem layout B (linear): 0 4 8 1 5 9 2 6 10 3 7 11
• For each start:
  • Check if start > any other element in its cycle
  • If not, rotate elements in the cycle
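The cycle-following procedure on this slide can be sketched as follows. The sketch assumes the successor formula reads j = (i mod H) · W + ⌊i / H⌋, where H and W are the row and column counts of layout A, and that the successor of position i is the position whose value moves into i. The function names are hypothetical and this is not taken from the authors' code.

```python
def successor(i, H, W):
    """Index in layout A whose value belongs at index i in layout B."""
    return (i % H) * W + i // H


def transpose_in_place(a, H, W):
    """Transpose a row-major H x W matrix stored in flat list `a`,
    using one temporary element instead of a second buffer."""
    for start in range(H * W):
        # Walk the cycle until we return to start or fall below it.
        j = successor(start, H, W)
        while j > start:
            j = successor(j, H, W)
        if j < start:
            # start is greater than another element of its cycle, so the
            # cycle was already rotated from that smaller start.
            continue
        # start is the smallest index in its cycle: rotate the cycle.
        tmp = a[start]
        i = start
        j = successor(i, H, W)
        while j != start:
            a[i] = a[j]
            i = j
            j = successor(i, H, W)
        a[i] = tmp
    return a


if __name__ == "__main__":
    print(transpose_in_place(list(range(12)), 3, 4))
```

Run on the slide's 3 × 4 example, this turns layout A (0 1 2 … 11) into layout B (0 4 8 1 5 9 2 6 10 3 7 11). The design trades time for memory: each start re-walks part of its cycle to decide whether to rotate, so no visited-flags array is needed.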
Convolution Strategy Comparison
[Figure: comparison of convolution strategies across layers Conv 1, Conv 2, Conv 3, FC]
Applicability
Case Study
[Figure: Arduino block diagram with Program and SRAM (2048B). SRAM holds the NN workspace (1960B) and the Stack. Serial comm. labels: Serialized CNN + Input Images, Output Classes]
Case Study
SRAM (2048B) memory map:
• NN workspace (1960B)
  • NN serialization (1525B): Network Topology, Weights and Biases
  • NN activations (435B)
• Stack (88B)
[Figure: hex dump of the SRAM contents, shaded by region]
[Figure: Arduino block diagram with Program and SRAM. Serial comm. labels: Serialized CNN + Input Images, Output Classes]
[Figure: network diagram. Size labels: 28 × 28 × 1, 176, 10. Layers: AvgPool 2x2, Conv 3x3, Conv 3x3, Conv 3x3, MaxPool 2x2, Flatten, Dense]
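The region sizes labelled on the Case Study memory map (2048B SRAM, 1960B NN workspace, 1525B serialization, 435B activations, 88B stack) can be checked with a few lines of arithmetic. The dictionary below is just a bookkeeping sketch; the variable names are illustrative.

```python
SRAM_BYTES = 2048

# Region sizes as labelled on the slide, in bytes.
regions = {
    "NN serialization": 1525,  # network topology + weights and biases
    "NN activations": 435,
    "Stack": 88,
}

# The NN workspace is the serialization plus the activations.
workspace = regions["NN serialization"] + regions["NN activations"]
unallocated = SRAM_BYTES - sum(regions.values())

print(f"NN workspace: {workspace} B, unallocated: {unallocated} B")
```

The workspace comes to 1960 B and nothing is left unallocated, which matches the labels on the slide.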
Results
• Fits in 2KB SRAM
  • Network Topology
  • Weights and Biases
  • Intermediate Activations
• Achieves 99.15% Test Accuracy on MNIST
[Figure: comparison to MNIST-2 and MNIST-10 results from [1,2,3]]
Summary
• Applicability
  • Replace strategy applies to any CNN
  • Herringbone/Transpose strategies apply to many 2D classification CNNs
• Use Scenario
  • Tiny MCUs with negligible caching
  • Maximize accuracy given memory constraint
  • Maximize free memory given fixed NN
• Applications
  • Microrobotic vision
  • Touchpad input classification
  • Spectrogram classification of 1D signals
    • Voice, gesture recognition
    • Activity tracking
    • Biometric security
    • Other sensors
[Figure: spectrogram of “yes” keyword from [7]]
References
1. Kumar, Ashish, Saurabh Goyal, and Manik Varma. "Resource-efficient machine learning in 2 KB RAM for the internet of things." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.
2. Gupta, Chirag, et al. "Protonn: Compressed and accurate knn for resource-scarce devices." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.
3. Kusupati, Aditya, et al. "Fastgrnn: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network." Advances in Neural Information Processing Systems. 2018.
4. TensorFlow Lite for Microcontrollers. URL: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/micro
5. Cho, Minsik, and Daniel Brand. "MEC: memory-efficient convolution for deep neural network." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.
6. Zhang, Jiyuan, Franz Franchetti, and Tze Meng Low. "High performance zero-memory overhead direct convolutions." arXiv preprint arXiv:1809.10170 (2018).
7. Warden, Pete. "Speech commands: A dataset for limited-vocabulary speech recognition." arXiv preprint arXiv:1804.03209 (2018).
Code: https://github.com/agural/memory-optimal-direct-convolutions
Poster: Pacific Ballroom #89