Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN
ICPP 2020, 17 August 2020
Sai Qian Zhang, Jieyu Lin, Qi Zhang
Executing DNN Inference Tasks for End Users
[Figure: Option 1 (edge only): user data (image, audio, video) processed on edge devices with limited computing capability. Option 2 (cloud only): user data sent to a cloud data center, incurring a large communication overhead.]
● Using edge devices to handle end-user data leads to long processing times, while using cloud servers to process end-user data incurs a large communication delay.
Motivation
● Edge devices
○ Resource-limited
○ Pervasive
● Adaptive Distributed Convolutional Neural Network (ADCNN)
○ We propose a framework for agile execution of Convolutional Neural Network (CNN) inference tasks on edge clusters
● Challenges
○ Reduce inference latency while preserving accuracy
○ Handle device heterogeneity and performance fluctuation
○ Remain applicable to different CNN models
Agenda ● Background ● CNN partitioning strategies ● ADCNN framework ● Modification on CNN architecture ● Evaluation ● Conclusion
CNN Background -- Convolutional Layer
[Figure: K weight filters of size 3x3x3 are convolved with 224x224x3 input feature maps (ifmaps) to produce K output feature maps (ofmaps).]
● The weight filters slide across the ifmaps; at each position, the dot product between the filter entries and the corresponding ifmap window is computed.
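A minimal NumPy sketch of this sliding-window computation (the `conv2d` helper and the small shapes below are illustrative, not from the paper):

```python
import numpy as np

def conv2d(ifmaps, filters):
    """Naive 'valid' convolution: each of the K filters slides across the
    input feature maps; the dot product at each position forms one ofmap."""
    C, H, W = ifmaps.shape           # input channels, height, width
    K, _, R, S = filters.shape       # filter count, channels, filter height/width
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(ifmaps[:, i:i+R, j:j+S] * filters[k])
    return out

ifmaps = np.random.rand(3, 8, 8)      # 3 ifmaps of size 8x8
filters = np.random.rand(4, 3, 3, 3)  # 4 filters of size 3x3x3
ofmaps = conv2d(ifmaps, filters)
print(ofmaps.shape)  # (4, 6, 6)
```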
Background -- CNN Workload Characteristics
[Figure: per-layer processing time for VGG16]
● The earlier convolutional layers take much longer to process than the later layers.
CNN Partitioning Strategies: Channelwise Partitioning
[Figure: the C input channels and the K filters are split in half (C/2, K/2) between device 1 and device 2.]
● In channelwise partitioning, the nodes must exchange their partially accumulated ofmaps to produce the final ofmaps, which may lead to significant communication overhead.
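The exchange-and-sum requirement can be seen in a small NumPy sketch; the two-device split and shapes below are illustrative:

```python
import numpy as np

def conv2d(ifmaps, filters):
    """Naive 'valid' convolution over all input channels."""
    C, H, W = ifmaps.shape
    K, _, R, S = filters.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(ifmaps[:, i:i+R, j:j+S] * filters[k])
    return out

C = 4
ifmaps = np.random.rand(C, 8, 8)
filters = np.random.rand(2, C, 3, 3)

# Device 1 convolves the first C/2 input channels, device 2 the rest;
# each produces only a *partial* ofmap, so the partial sums must be
# exchanged and added to recover the true ofmaps.
partial1 = conv2d(ifmaps[:C//2], filters[:, :C//2])
partial2 = conv2d(ifmaps[C//2:], filters[:, C//2:])
combined = partial1 + partial2

assert np.allclose(combined, conv2d(ifmaps, filters))
```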
CNN Partitioning Strategies: Spatial Partitioning
[Figure: the ifmap is split into four tiles A, B, C, D; neighboring tiles exchange data halos.]
● In spatial partitioning, each tile must receive a data halo from its neighboring tiles in order to compute the correct result.
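A small sketch of the halo requirement, assuming a single-channel 'valid' convolution and a two-tile row split (the names and sizes are mine, not the paper's):

```python
import numpy as np

def conv2d(x, f):
    """Single-channel 'valid' convolution."""
    R = f.shape[0]
    H, W = x.shape
    out = np.zeros((H - R + 1, W - R + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+R, j:j+R] * f)
    return out

x = np.random.rand(8, 8)
f = np.random.rand(3, 3)
full = conv2d(x, f)  # reference: whole ifmap, 6x6 output

# Split rows 0..3 / 4..7 between two tiles. For a 3x3 filter each tile
# needs one halo row from its neighbor to compute its border outputs.
top = conv2d(x[0:5], f)  # own rows 0..3 + halo row 4 from the bottom tile
bot = conv2d(x[3:8], f)  # own rows 4..7 + halo row 3 from the top tile
assert np.allclose(np.vstack([top, bot]), full)
```

Without the halo rows, each tile could only produce the interior of its output, so the border results would be missing or wrong.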
Fully Decomposable Spatial Partition (FDSP)
[Figure: normal spatial partition vs. FDSP, where the border pixels of each tile are padded with zeros.]
● The cross-tile information transfer can be eliminated by padding the edge pixels of each tile with zeros.
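A minimal sketch of FDSP's zero-padding idea, assuming a single-channel 'same' convolution over four 4x4 tiles (an illustration, not the paper's implementation):

```python
import numpy as np

def conv2d_same(x, f):
    """'Same' convolution: zero-pad so the output matches the input size."""
    R = f.shape[0]
    p = R // 2
    xp = np.pad(x, p)
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i+R, j:j+R] * f)
    return out

x = np.random.rand(8, 8)
f = np.random.rand(3, 3)

# FDSP: each 4x4 tile is zero-padded and convolved independently --
# no halo exchange at all. Only the cross-tile border pixels deviate
# from the exact result, which progressive retraining compensates for.
tiles = [x[:4, :4], x[:4, 4:], x[4:, :4], x[4:, 4:]]
outs = [conv2d_same(t, f) for t in tiles]
approx = np.vstack([np.hstack(outs[:2]), np.hstack(outs[2:])])

exact = conv2d_same(x, f)
# The interior of each tile matches the exact convolution.
assert np.allclose(approx[1:3, 1:3], exact[1:3, 1:3])
```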
ADCNN Framework
[Figure: Step 1, the original CNN model is progressively retrained into the output CNN model; Step 2, input tiles are distributed across Conv nodes in the edge device cluster, and a Central node aggregates the results (e.g., classifying the input as "Dog").]
ADCNN Framework
[Figure: input tiles are processed by the Conv nodes in the edge device cluster; their intermediate results are gathered by the Central node.]
● The Conv nodes must transmit their intermediate results to the Central node, which may still cause significant communication overhead.
Modification on CNN Topology
[Figure: the Conv-node output is compressed in four steps: (1) apply clipped ReLU, (2) quantize, (3) unroll the neurons, (4) run-length encode, e.g. [1,0,2,0,1,0,0,0,0,0,0,0,2,0,0,0] becomes [1,1,2,1,1,7,2,3].]
● We modify the CNN model to reduce this communication overhead.
● We apply progressive retraining after adding these modifications to the CNN architecture.
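A sketch of this compression pipeline. The (nonzero value, following-zero-count) encoding below is my reading of the slide's example, and the clip threshold and quantization levels are assumptions:

```python
import numpy as np

def clipped_relu_quantize(acts, clip=3.0, levels=4):
    """Clipped ReLU followed by uniform quantization to small integers."""
    a = np.clip(acts, 0.0, clip)                       # clipped ReLU
    return np.round(a / clip * (levels - 1)).astype(int)

def run_length_encode(flat):
    """Encode the unrolled neurons as (value, count-of-following-zeros)
    pairs, matching the slide's example."""
    out, i = [], 0
    while i < len(flat):
        v = flat[i]
        i += 1
        zeros = 0
        while i < len(flat) and flat[i] == 0:
            zeros += 1
            i += 1
        out += [v, zeros]
    return out

flat = [1, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
print(run_length_encode(flat))  # [1, 1, 2, 1, 1, 7, 2, 3]
```

The clipped ReLU makes many activations exactly zero and bounds the rest, so the quantized output is sparse and compresses well under RLE.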
ADCNN Architecture
[Figure: the ADCNN pipeline: (1) the input is partitioned into tiles, each tagged with an image id (i_id), tile id (t_id), and node id (n_id); (2) the Central node collects statistics such as the number of received results; (3) Conv nodes return intermediate results, e.g. d:[-0.9,...,1.1], i_id:6, t_id:1, n_id:1; (4) the Central node performs the remaining layer computation and outputs the prediction, e.g. "Dog".]
● ADCNN takes advantage of the fine-grained, fully independent tiles generated by FDSP and adapts the tile assignment to dynamic conditions, achieving fine-grained load balancing across heterogeneous edge nodes.
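One way the collected statistics could drive tile assignment is a greedy least-finish-time scheduler. This is a hypothetical sketch of such load balancing, not the paper's actual policy:

```python
def assign_tiles(num_tiles, throughputs):
    """Hypothetical greedy scheduler: give each tile to the node whose
    estimated finish time (tiles assigned / observed throughput) would
    be lowest after taking it."""
    assignment = [[] for _ in throughputs]
    for t in range(num_tiles):
        n = min(range(len(throughputs)),
                key=lambda i: (len(assignment[i]) + 1) / throughputs[i])
        assignment[n].append(t)
    return assignment

# A node with twice the throughput receives roughly twice the tiles.
tiles = assign_tiles(12, [1.0, 1.0, 2.0])
print([len(a) for a in tiles])  # [3, 3, 6]
```

As node throughputs fluctuate, re-running the assignment shifts tiles away from slowed nodes, which is the fine-grained balancing behavior described above.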
Accuracy Evaluation
[Figure: accuracy results for VGG16 and a Fully Convolutional Network.]
● We evaluate different CNN models from different applications.
● Accuracy degradation is around 1% for an 8-by-8 FDSP of the input sample.
Inference Latency Comparison
● We implement the ADCNN system with nine identical Raspberry Pi devices emulating the edge devices: eight serve as Conv nodes, and the remaining one serves as the Central node.
● Baselines:
○ Single-device scheme
○ Remote-cloud scheme
● ADCNN decreases the average processing latency by 6.68x and 4.42x over these two baselines, respectively.
ADCNN Performance in a Dynamic Environment
[Figures: variation in inference latency; changes in tile assignment.]
● We adjust the CPU processing speed on four of the Conv nodes (nodes 5, 6, 7, 8) midway through processing 50 input images, and measure the impact on tile assignment and overall inference latency.
● ADCNN handles dynamic changes in node performance effectively.
Conclusion
● We introduce ADCNN, a distributed inference framework that jointly optimizes the CNN architecture and the computing system for better performance in dynamic network environments.
● ADCNN applies FDSP to partition the compute-intensive convolutional layers into many small, independent computational tasks that can be executed in parallel on separate edge devices.
● The ADCNN system takes advantage of the fine-grained, fully independent tiles generated by FDSP and adapts to dynamic conditions, achieving fine-grained load balancing across heterogeneous edge nodes.
● Compared to existing distributed CNN inference approaches, ADCNN provides up to 2.8x lower latency while achieving competitive inference accuracy, and it quickly adapts to variations in edge device performance.