DeepCache: Principled Cache for Mobile Deep Vision
Mengwei Xu¹, Mengze Zhu¹, Yunxin Liu², Felix Xiaozhu Lin³, Xuanzhe Liu¹
¹Peking University, ²Microsoft Research, ³Purdue University
Background: Mobile Vision • Your mobile device sees what you see, and does what you cannot do • Core: computer vision algorithms • Example applications: Augmented Reality, Games, Face Recognition & Detection, Beauty
Background: CNN-based Vision • Convolutional Neural Networks (CNNs) are the state-of-the-art vision algorithms • A CNN model is a graph of computation nodes (convolution, pooling, activation, etc.) • Convolution operation: input feature map * kernel = output feature map • CNNs are accurate, but also resource-hungry.
Background: Optimizing CNN Workloads • Algorithm-level compression: quantization, pruning, factorization, distilling • Hardware-level acceleration: CPU, GPU, DSP, AI-specific chips • Our approach: leveraging the temporal locality of mobile video streams • Consecutive frames are similar but not identical • Sources of change: object movement/appearance, camera movement, light variation, etc.
Caching Mobile Vision – a naïve approach • Just cache/reuse the final results, keyed on the input image • The i-th input frame goes through the inference engine and its output is cached (e.g., Class: elephant, Pos: (-1.5, 7.9)) • When the (i+1)-th frame arrives, check whether it is similar to the cached frame • YES: reuse the cached result directly (Class: elephant, Pos: (-1.5, 7.9)) • NO: run the inference engine again (Class: elephant, Pos: (-1.1, 9.3))
Caching Mobile Vision – a naïve approach • Just cache/reuse the final results, keyed on the input image • Why is this not enough? • Coarse-grained: the whole image is the comparison unit • Cannot handle position-sensitive tasks: two images can be similar overall (similar background, similar animals), but the elephant's position is different!
Caching Mobile Vision – DeepCache • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity • KEY IDEA: reuse the CNN computations of regions in the current frame that are similar to regions in the previous frame
Caching Mobile Vision – DeepCache • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity • The i-th frame runs through the inference engine (Class: elephant, Pos: (-1.5, 7.9)) and its computations are cached • For the (i+1)-th frame, layer-level matching identifies reusable regions whose cached computations are reused; only the rest is recomputed, yielding a revised output (Class: elephant, Pos: (-1.1, 9.3))
Challenges of DeepCache • Scene variation – the overall background may shift between frames (an offset between the previous and current frame) • Caused by a moving camera: autonomous driving, drones, etc.
Challenges of DeepCache • Cache erosion – reusability tends to diminish at deeper layers • Example: under a 3x3 convolution, a 5x5 reusable region on the input feature map shrinks to a 3x3 reusable region on the output feature map; the eroded border requires re-computation
Challenges of DeepCache • Cache erosion – reusability tends to diminish at deeper layers • 1. Mitigation: merge smaller regions into larger ones – prefer (b) the "proper" match with a high matching score that forms a large contiguous region over (a) the "best" per-block match with the highest matching score • 2. Good news: early layers contribute most of the computation cost and also suffer less cache erosion.
DeepCache Design: Overview • Design principles: no cloud offloading, no effort from developers, no modification to models • Two modules: an image matcher and a cache-aware CNN inference engine • Architecture (figure): the camera delivers raw images to the vision application; after pre-processing (resizing, etc.), the image matcher compares the current frame against the previous frame held in cache storage to find reusable regions, and the cache-aware CNN inference engine (conv, pool, fc layers) caches and reuses their computations. DeepCache lives inside the deep learning engine, above the operating system and processors (CPU, GPU, etc.).
DeepCache Design: Image Matching • Principles: high similarity, low overhead, and merging into large regions • Input: two raw images • Output: a set of matched rectangles • (x1, y1, w, h) in the current frame -> (x2, y2, w, h) in the previous frame
DeepCache Design: Image Matching • Step 1: divide the current frame into an NxN grid • N is a configurable parameter (default: 10 x 10)
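Step 1 can be sketched as follows; this is a minimal illustration, not DeepCache's code, and it simply truncates any remainder pixels when the frame size is not divisible by N (how the paper handles remainders is not shown on the slide):

```python
import numpy as np

def divide_into_grid(frame, n=10):
    """Split an HxW frame into an n x n grid of blocks (Step 1).

    Returns a list of (x, y, block) tuples; each block is (H//n) x (W//n).
    Remainder rows/columns are truncated in this simplified sketch.
    """
    h, w = frame.shape[:2]
    bh, bw = h // n, w // n
    blocks = []
    for i in range(n):
        for j in range(n):
            y, x = i * bh, j * bw
            blocks.append((x, y, frame[y:y + bh, x:x + bw]))
    return blocks
```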
DeepCache Design: Image Matching • Step 2: for each grid block (x1, y1) in the current frame, find the best-matching block (x2, y2) in the previous frame • Motion estimation via diamond search
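A simplified diamond search over a sum-of-absolute-differences (SAD) cost can look like the sketch below. This is a generic textbook version of the algorithm, not DeepCache's implementation; the large/small diamond patterns and the iteration cap are standard but the exact parameters here are assumptions:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally-sized blocks."""
    return np.abs(a.astype(np.int64) - b.astype(np.int64)).sum()

def diamond_search(prev, block, x0, y0, max_iter=16):
    """Find the best-matching position of `block` in `prev`, starting the
    search at (x0, y0): move the large diamond pattern (LDSP) until the
    center wins, then refine with the small diamond pattern (SDSP)."""
    bh, bw = block.shape[:2]
    h, w = prev.shape[:2]
    ldsp = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
            (1, 1), (1, -1), (-1, 1), (-1, -1)]
    sdsp = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

    def cost(x, y):
        if x < 0 or y < 0 or x + bw > w or y + bh > h:
            return np.inf  # candidate falls outside the previous frame
        return sad(prev[y:y + bh, x:x + bw], block)

    cx, cy = x0, y0
    for _ in range(max_iter):
        dx, dy = min(ldsp, key=lambda d: cost(cx + d[0], cy + d[1]))
        if (dx, dy) == (0, 0):
            break  # center of the large diamond is the best: refine
        cx, cy = cx + dx, cy + dy
    dx, dy = min(sdsp, key=lambda d: cost(cx + d[0], cy + d[1]))
    return cx + dx, cy + dy
```

On a smooth cost surface the greedy diamond walk converges to the exact match; on noisy content it may stop in a local minimum, which is the usual speed/accuracy trade-off of diamond search versus exhaustive block matching.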
DeepCache Design: Image Matching • Step 3: average the per-block movements into a global offset (Mx, My) • Filter out the outlier movements first
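Step 3 might be sketched as below. The slide does not specify the outlier rule, so the one-standard-deviation cutoff here is an assumption for illustration:

```python
import numpy as np

def average_offset(offsets, z=1.0):
    """Average per-block motion vectors into a global offset (Mx, My),
    discarding vectors more than `z` standard deviations from the mean
    on either axis (the exact filtering rule is an assumption)."""
    off = np.asarray(offsets, dtype=float)
    mean, std = off.mean(axis=0), off.std(axis=0)
    keep = (np.abs(off - mean) <= z * std + 1e-9).all(axis=1)
    return tuple(off[keep].mean(axis=0))
```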
DeepCache Design: Image Matching • Step 4: compute the similarity between block (x1, y1) in the current frame and the block (x1+Mx, y1+My) under the average movement in the previous frame • Metric: Peak Signal-to-Noise Ratio (PSNR); e.g., blocks with PSNR 24 or 21 are both deemed reusable
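PSNR has a standard definition; a minimal version for 8-bit blocks is below. The reuse threshold is a tunable parameter (the slide marks 21 and 24 dB as reusable, so it sits somewhere below 21 in these examples):

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equally-sized blocks.
    Higher means more similar; identical blocks give infinity."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)
```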
DeepCache Design: Image Matching • Step 5: merge adjacent reusable blocks into larger rectangles where possible, so that a large region (x1, y1) in the current frame maps to (x1+Mx, y1+My) in the previous frame
DeepCache Design: Image Matching • Optimization 1: skip block matching in Step 2 (k-skip) • On a 4x4 grid, 2-skip computes 8/16 blocks and 3-skip computes 6/16 blocks • Optimization 2: in Step 4, reuse the matching scores already computed in Step 2 • Not always applicable: it depends on the average movement
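One plausible reading of k-skip (an assumption, though it reproduces the slide's 8/16 and 6/16 counts) is to run motion search only on every k-th block in scan order and let skipped blocks inherit the global average offset:

```python
def blocks_to_compute(n, k):
    """k-skip: select every k-th block of an n x n grid in scan order for
    motion search; the remaining blocks reuse the average offset.
    This interpretation is an assumption; it matches the slide's counts
    (4x4 grid: k=2 -> 8/16 blocks, k=3 -> 6/16 blocks)."""
    return [(i, j) for i in range(n) for j in range(n) if (i * n + j) % k == 0]
```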
DeepCache Design: Cache-aware CNN Inference • Propagation: the reusable regions produced by image matching do not stay fixed during execution; they must be propagated through the CNN layers • Example: a reusable region (x, y, w, h) = (120, 120, 100, 40) shrinks to (63, 63, 45, 15) after a convolution (kernel 11x11, stride 2, padding 5), stays (63, 63, 45, 15) through ReLU, and shrinks to (32, 32, 21, 7) after pooling (kernel 3x3, stride 2, padding 1) • Because of cache erosion! But what affects cache erosion?
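The propagation rule can be sketched by keeping only the output positions whose receptive field lies entirely inside the reusable region. This sketch reproduces the slide's convolution example exactly; DeepCache's actual boundary handling may differ by a pixel (e.g., the pooling step here yields a width of 22 where the slide shows 21):

```python
import math

def propagate_region(region, kernel, stride, padding):
    """Propagate a reusable region (x, y, w, h) through a conv/pool layer.
    Only output positions whose kernel window lies fully inside the input
    region stay reusable; the shrinkage is the cache erosion. Elementwise
    layers (ReLU, etc.) leave the region unchanged."""
    x, y, w, h = region

    def axis(start, length):
        # first output index whose window starts inside the region
        first = math.ceil((start + padding) / stride)
        # last output index whose window still ends inside the region
        last = (start + length - kernel + padding) // stride
        return first, max(0, last - first + 1)

    ox, ow = axis(x, w)
    oy, oh = axis(y, h)
    return (ox, oy, ow, oh)
```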
DeepCache Design: Cache-aware CNN Inference • Propagation: the reusable regions produced by image matching do not stay fixed during execution; depending on the layer's parameters, a region may see no erosion, partial erosion, or full erosion
DeepCache Design: Cache-aware CNN Inference • Why propagation? Why not match the input of each layer? • Low return: feature maps are high-dimensional and difficult to interpret • High cost: matching feature maps requires heavy computation (40x the cost of propagation for ResNet) • DeepCache: match input images once, then use propagation for later layers • MIL: matching inter-layer • (Figure: normalized latency of DeepCache vs. MIL-50%/75%/100% on AlexNet, GoogLeNet, ResNet-50, YOLO, and Dave-orig; DeepCache stays at 1.00 while the MIL variants range from roughly 1.8x up to 49x)
DeepCache Design: Cache-aware CNN Inference • Cache/Reuse: reuse the computation results at the output of convolutions • Convolution is usually the dominant layer type (> 80% of overall computation); e.g., in AlexNet the conv1–conv5 layers dwarf the relu/lrn/pool/fc layers in per-layer latency • Mind the data locality during reuse! • Depends on the convolution implementation: im2col + GEMM, unrolled, etc.
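The reuse idea can be illustrated with a direct single-channel convolution (stride 1, no padding): copy cached outputs inside the reusable region and recompute only the rest. This is a toy sketch, not ncnn's code; real engines use im2col + GEMM, which is why DeepCache must gather reusable rows carefully to preserve data locality:

```python
import numpy as np

def cached_conv2d(inp, kernel, prev_out, region, offset):
    """Cache-aware direct convolution sketch. Output positions inside
    `region` (x, y, w, h) are copied from the previous frame's output
    `prev_out`, shifted by `offset` (dx, dy); all others are recomputed."""
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=float)
    x, y, w, h = region
    dx, dy = offset
    for i in range(oh):
        for j in range(ow):
            if x <= j < x + w and y <= i < y + h:
                out[i, j] = prev_out[i + dy, j + dx]  # cache hit: reuse
            else:
                out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out
```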
DeepCache Implementation • Image matching is implemented with RenderScript • A programming framework on Android for intensive computations • GPU support, generic, highly data-parallel • Cache-awareness is built on top of ncnn • A popular deep-learning inference framework for mobile devices • Fast, lightweight, no third-party dependencies
Evaluation – Setup • Popular CNN models and datasets • Platform: Nexus 6, Android 6.0 • Alternatives: • ncnn without cache • DeepMon: the coarse-grained cache used in [1] [1] Huynh Nguyen Loc, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '17).
Evaluation – Execution Speedup • DeepCache saves 15%–28% of model execution time (2x the savings of DeepMon) • The speedup depends on the model architecture: deeper layers yield fewer savings • (Figures: normalized processing latency of no-cache vs. DeepMon vs. DeepCache on the REC_1, REC_2, REC_3, DET, and DRV workloads, with DeepCache between 0.72 and 0.93; per-layer conv latency with and without DeepCache from Conv_1 to Conv_50, showing the savings shrink at deeper layers)