DeepCache: Principled Cache for Mobile Deep Vision
Mengwei Xu¹, Mengze Zhu¹, Yunxin Liu², Felix Xiaozhu Lin³, Xuanzhe Liu¹
¹Peking University, ²Microsoft Research, ³Purdue University
Background: Mobile Vision • Your mobile device sees what you see, and does what you cannot do • Core: computer vision algorithms • Example applications: Augmented Reality, Games, Face Recognition & Detection, Beauty
Background: CNN-based Vision • Convolutional Neural Networks (CNNs) are the state-of-the-art vision algorithms • A CNN model is a graph of computation nodes (convolution, pooling, activation, etc.) • Convolution operation: input feature map * kernel = output feature map • CNNs are accurate, but also resource-hungry.
Background: Optimizing CNN Workloads • Algorithm-level compression: quantization, pruning, factorization, distilling • Hardware-level acceleration: CPU, GPU, DSP, AI-specific chips • Our approach: leveraging the temporal locality of mobile video streams • Consecutive frames are similar but not identical • Sources of change: object movement/appearance, camera movement, light variation, etc.
Caching Mobile Vision – a naïve approach • Just cache/reuse the final results, keyed on the input image • The i-th input frame goes through the inference engine and its output is cached (e.g., Class: elephant, Pos: (-1.5, 7.9)) • When the (i+1)-th frame arrives, check whether it is similar to the cached frame • YES: reuse the cached result directly (Class: elephant, Pos: (-1.5, 7.9)) • NO: run the inference engine again (Class: elephant, Pos: (-1.1, 9.3))
Caching Mobile Vision – a naïve approach • Just cache/reuse the final results, keyed on the input image • Why is this not enough? • Coarse-grained: the whole image is the comparison unit • Cannot handle position-sensitive tasks: two images can be similar overall (similar background, similar animals), but the elephant's position is different!
Caching Mobile Vision – DeepCache • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity • KEY IDEA: reuse the CNN computations of regions in the current frame that are similar to regions in the previous frame
Caching Mobile Vision – DeepCache • Treat the image as a collection of blocks, and cache/reuse them at a fine granularity • The i-th frame runs through the inference engine (Class: elephant, Pos: (-1.5, 7.9)) and its computations are cached • For the (i+1)-th frame, layer-level matching identifies reusable regions whose cached computations are reused; only the rest is recomputed, yielding a revised output (Class: elephant, Pos: (-1.1, 9.3))
Challenges of DeepCache • Scene variation – the overall background may shift between frames (an offset between the previous and current frame) • Caused by a moving camera: autonomous driving, drones, etc.
Challenges of DeepCache • Cache erosion – reusability tends to diminish at deeper layers • Example: under a 3x3 convolution, a 5x5 reusable region on the input feature map shrinks to a 3x3 reusable region on the output feature map; the eroded border requires re-computation
Challenges of DeepCache • Cache erosion – reusability tends to diminish at deeper layers • 1. Mitigation: merge smaller regions into larger ones – prefer (b) the "proper" match with a high matching score that forms a large contiguous region over (a) the "best" per-block match with the highest matching score • 2. Good news: early layers contribute most of the computation cost and also suffer less cache erosion.
DeepCache Design: Overview • Design principles: no cloud offloading, no effort from developers, no modification to models • Two modules: an image matcher and a cache-aware CNN inference engine • Architecture (figure): the camera delivers raw images to the vision application; after pre-processing (resizing, etc.), the image matcher compares the current frame against the previous frame held in cache storage to find reusable regions, and the cache-aware CNN inference engine (conv, pool, fc layers) caches and reuses their computations. DeepCache lives inside the deep learning engine, above the operating system and processors (CPU, GPU, etc.).
DeepCache Design: Image Matching • Principles: high similarity, low overhead, and merging into large regions • Input: two raw images • Output: a set of matched rectangles • (x1, y1, w, h) in the current frame -> (x2, y2, w, h) in the previous frame
DeepCache Design: Image Matching • Step 1: divide the current frame into an NxN grid • N is a configurable parameter (default: 10 x 10)
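Step 1 can be sketched as follows; this is a minimal illustration, not DeepCache's code, and it simply truncates any remainder pixels when the frame size is not divisible by N (how the paper handles remainders is not shown on the slide):

```python
import numpy as np

def divide_into_grid(frame, n=10):
    """Split an HxW frame into an n x n grid of blocks (Step 1).

    Returns a list of (x, y, block) tuples; each block is (H//n) x (W//n).
    Remainder rows/columns are truncated in this simplified sketch.
    """
    h, w = frame.shape[:2]
    bh, bw = h // n, w // n
    blocks = []
    for i in range(n):
        for j in range(n):
            y, x = i * bh, j * bw
            blocks.append((x, y, frame[y:y + bh, x:x + bw]))
    return blocks
```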
DeepCache Design: Image Matching • Step 2: for each grid block (x1, y1) in the current frame, find the best-matching block (x2, y2) in the previous frame • Motion estimation via diamond search
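A simplified diamond search over a sum-of-absolute-differences (SAD) cost can look like the sketch below. This is a generic textbook version of the algorithm, not DeepCache's implementation; the large/small diamond patterns and the iteration cap are standard but the exact parameters here are assumptions:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally-sized blocks."""
    return np.abs(a.astype(np.int64) - b.astype(np.int64)).sum()

def diamond_search(prev, block, x0, y0, max_iter=16):
    """Find the best-matching position of `block` in `prev`, starting the
    search at (x0, y0): move the large diamond pattern (LDSP) until the
    center wins, then refine with the small diamond pattern (SDSP)."""
    bh, bw = block.shape[:2]
    h, w = prev.shape[:2]
    ldsp = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
            (1, 1), (1, -1), (-1, 1), (-1, -1)]
    sdsp = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

    def cost(x, y):
        if x < 0 or y < 0 or x + bw > w or y + bh > h:
            return np.inf  # candidate falls outside the previous frame
        return sad(prev[y:y + bh, x:x + bw], block)

    cx, cy = x0, y0
    for _ in range(max_iter):
        dx, dy = min(ldsp, key=lambda d: cost(cx + d[0], cy + d[1]))
        if (dx, dy) == (0, 0):
            break  # center of the large diamond is the best: refine
        cx, cy = cx + dx, cy + dy
    dx, dy = min(sdsp, key=lambda d: cost(cx + d[0], cy + d[1]))
    return cx + dx, cy + dy
```

On a smooth cost surface the greedy diamond walk converges to the exact match; on noisy content it may stop in a local minimum, which is the usual speed/accuracy trade-off of diamond search versus exhaustive block matching.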
DeepCache Design: Image Matching • Step 3: average the per-block movements into a global offset (Mx, My) • Filter out the outlier movements first
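Step 3 might be sketched as below. The slide does not specify the outlier rule, so the one-standard-deviation cutoff here is an assumption for illustration:

```python
import numpy as np

def average_offset(offsets, z=1.0):
    """Average per-block motion vectors into a global offset (Mx, My),
    discarding vectors more than `z` standard deviations from the mean
    on either axis (the exact filtering rule is an assumption)."""
    off = np.asarray(offsets, dtype=float)
    mean, std = off.mean(axis=0), off.std(axis=0)
    keep = (np.abs(off - mean) <= z * std + 1e-9).all(axis=1)
    return tuple(off[keep].mean(axis=0))
```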
DeepCache Design: Image Matching • Step 4: compute the similarity between block (x1, y1) in the current frame and the block (x1+Mx, y1+My) under the average movement in the previous frame • Metric: Peak Signal-to-Noise Ratio (PSNR); e.g., blocks with PSNR 24 or 21 are both deemed reusable
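PSNR has a standard definition; a minimal version for 8-bit blocks is below. The reuse threshold is a tunable parameter (the slide marks 21 and 24 dB as reusable, so it sits somewhere below 21 in these examples):

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two equally-sized blocks.
    Higher means more similar; identical blocks give infinity."""
    mse = np.mean((a.astype(float) - b.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)
```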
DeepCache Design: Image Matching • Step 5: merge adjacent reusable blocks into larger rectangles where possible, so that a large region (x1, y1) in the current frame maps to (x1+Mx, y1+My) in the previous frame
DeepCache Design: Image Matching • Optimization 1: skip block matching in Step 2 (k-skip) • On a 4x4 grid, 2-skip computes 8/16 blocks and 3-skip computes 6/16 blocks • Optimization 2: in Step 4, reuse the matching scores already computed in Step 2 • Not always applicable: it depends on the average movement
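One plausible reading of k-skip (an assumption, though it reproduces the slide's 8/16 and 6/16 counts) is to run motion search only on every k-th block in scan order and let skipped blocks inherit the global average offset:

```python
def blocks_to_compute(n, k):
    """k-skip: select every k-th block of an n x n grid in scan order for
    motion search; the remaining blocks reuse the average offset.
    This interpretation is an assumption; it matches the slide's counts
    (4x4 grid: k=2 -> 8/16 blocks, k=3 -> 6/16 blocks)."""
    return [(i, j) for i in range(n) for j in range(n) if (i * n + j) % k == 0]
```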
DeepCache Design: Cache-aware CNN Inference • Propagation: the reusable regions produced by image matching do not stay fixed during execution; they must be propagated through the CNN layers • Example: a reusable region (x, y, w, h) = (120, 120, 100, 40) shrinks to (63, 63, 45, 15) after a convolution (kernel 11x11, stride 2, padding 5), stays (63, 63, 45, 15) through ReLU, and shrinks to (32, 32, 21, 7) after pooling (kernel 3x3, stride 2, padding 1) • Because of cache erosion! But what affects cache erosion?
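The propagation rule can be sketched by keeping only the output positions whose receptive field lies entirely inside the reusable region. This sketch reproduces the slide's convolution example exactly; DeepCache's actual boundary handling may differ by a pixel (e.g., the pooling step here yields a width of 22 where the slide shows 21):

```python
import math

def propagate_region(region, kernel, stride, padding):
    """Propagate a reusable region (x, y, w, h) through a conv/pool layer.
    Only output positions whose kernel window lies fully inside the input
    region stay reusable; the shrinkage is the cache erosion. Elementwise
    layers (ReLU, etc.) leave the region unchanged."""
    x, y, w, h = region

    def axis(start, length):
        # first output index whose window starts inside the region
        first = math.ceil((start + padding) / stride)
        # last output index whose window still ends inside the region
        last = (start + length - kernel + padding) // stride
        return first, max(0, last - first + 1)

    ox, ow = axis(x, w)
    oy, oh = axis(y, h)
    return (ox, oy, ow, oh)
```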
DeepCache Design: Cache-aware CNN Inference • Propagation: the reusable regions produced by image matching do not stay fixed during execution; depending on the layer's parameters, a region may see no erosion, partial erosion, or full erosion
DeepCache Design: Cache-aware CNN Inference • Why propagation? Why not match the input of each layer? • Low return: feature maps are high-dimensional and difficult to interpret • High cost: matching feature maps requires heavy computation (40x the cost of propagation for ResNet) • DeepCache: match input images once, then use propagation for later layers • MIL: matching inter-layer • (Figure: normalized latency of DeepCache vs. MIL-50%/75%/100% on AlexNet, GoogLeNet, ResNet-50, YOLO, and Dave-orig; DeepCache stays at 1.00 while the MIL variants range from roughly 1.8x up to 49x)
DeepCache Design: Cache-aware CNN Inference • Cache/Reuse: reuse the computation results at the output of convolutions • Convolution is usually the dominant layer type (> 80% of overall computation); e.g., in AlexNet the conv1–conv5 layers dwarf the relu/lrn/pool/fc layers in per-layer latency • Mind the data locality during reuse! • Depends on the convolution implementation: im2col + GEMM, unrolled, etc.
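The reuse idea can be illustrated with a direct single-channel convolution (stride 1, no padding): copy cached outputs inside the reusable region and recompute only the rest. This is a toy sketch, not ncnn's code; real engines use im2col + GEMM, which is why DeepCache must gather reusable rows carefully to preserve data locality:

```python
import numpy as np

def cached_conv2d(inp, kernel, prev_out, region, offset):
    """Cache-aware direct convolution sketch. Output positions inside
    `region` (x, y, w, h) are copied from the previous frame's output
    `prev_out`, shifted by `offset` (dx, dy); all others are recomputed."""
    kh, kw = kernel.shape
    oh, ow = inp.shape[0] - kh + 1, inp.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=float)
    x, y, w, h = region
    dx, dy = offset
    for i in range(oh):
        for j in range(ow):
            if x <= j < x + w and y <= i < y + h:
                out[i, j] = prev_out[i + dy, j + dx]  # cache hit: reuse
            else:
                out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out
```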
DeepCache Implementation • Image matching is implemented with RenderScript • A programming framework on Android for intensive computations • GPU support, generic, highly data-parallel • Cache-awareness is built on top of ncnn • A popular deep-learning inference framework for mobile devices • Fast, lightweight, no third-party dependencies
Evaluation – Setup • Popular CNN models and datasets • Platform: Nexus 6, Android 6.0 • Alternatives: • ncnn without cache • DeepMon: the coarse-grained cache used in [1] [1] Huynh Nguyen Loc, Youngki Lee, and Rajesh Krishna Balan. 2017. DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys '17).
Evaluation – Execution Speedup • DeepCache saves 15%–28% of model execution time (2x the savings of DeepMon) • The speedup depends on the model architecture: deeper layers yield fewer savings • (Figures: normalized processing latency of no-cache vs. DeepMon vs. DeepCache on the REC_1, REC_2, REC_3, DET, and DRV workloads, with DeepCache between 0.72 and 0.93; per-layer conv latency with and without DeepCache from Conv_1 to Conv_50, showing the savings shrink at deeper layers)