Stitch: Fusible Heterogeneous Accelerators Enmeshed with Many-Core Architecture for Wearables
Cheng Tan, Manupa Karunaratne, Tulika Mitra, Li-Shiuan Peh
Emerging Wearables
• Software programmable to support diverse applications
Ø Here Maps on Samsung Gear S2
Ø Pokemon Go on Apple Watch
Ø Bus stop detection app (user defined) on LG Watch Urban
Ø Health care apps on smart watches
Ø Navigation on smart glass
• Performance requirement (10000 MIPS)
• Power constraint (500 mW)
pages 1-3
Wearable SoC Architecture Trend
[Figure: core count, DMIPS/watt, power (mW), and DMIPS of smartwatch SoCs with their trend lines, on a log scale from 1 to 100000, Jan-2013 to Sep-2017; the 10000 MIPS and 500 mW targets are marked. Devices plotted: Sony Smartwatch1 and Smartwatch2, Qualcomm toq, Samsung Gear S, Gear S2, and Gear S3, Huawei Watch 2, Asus Zenwatch 3, LG G Watch R, Motorola Moto 360 1st and 2nd. Cores: ARM Cortex-M3, Cortex-M4, Cortex-A7, and Cortex-A8.]
pages 4-5
Motivating Case Study
• Finger gesture recognition application
• State-of-the-art smartwatch
Ø Odroid board emulating the state-of-the-art smartwatch
Ø Time per gesture: 13 ms > 10 ms
Ø Cannot meet the target throughput
4-core ARM Cortex-A7
  Meeting throughput      no
  Time per gesture (ms)   13
  Power (mW)              469
  Frequency (MHz)         1200
  Technology              28 nm
pages 6-7
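The gap on this slide can be made concrete with a little arithmetic. The sketch below uses only the numbers shown (13 ms per gesture, a 10 ms target, 469 mW). The variable names are illustrative, and the energy figure assumes the 469 mW is drawn for the full 13 ms, which the slide does not state.

```python
# Numbers from the slide; all names here are illustrative.
TIME_PER_GESTURE_MS = 13   # measured on the 4-core Cortex-A7 board
TARGET_MS = 10             # target time per gesture
POWER_MW = 469             # reported power

achieved_rate = 1000 / TIME_PER_GESTURE_MS    # gestures per second
required_rate = 1000 / TARGET_MS              # gestures per second
speedup_needed = TIME_PER_GESTURE_MS / TARGET_MS

# mW * ms = microjoules; assumes constant power over the whole gesture.
energy_uj = POWER_MW * TIME_PER_GESTURE_MS

print(f"achieved {achieved_rate:.1f}/s, required {required_rate:.1f}/s")
print(f"speedup needed: {speedup_needed:.2f}x")
print(f"energy per gesture: {energy_uj} uJ")
```

Under these assumptions the board falls short of the target rate even though it stays under the 500 mW constraint.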
Wearable Application Characteristics
Finger gesture application (6-stage pipeline, 16 kernels)
[Figure: kernel pipeline fed by Acc/Gyro (X, Y, Z) data; kernels are Window, FFT1 to FFT6, Filter, IFFT1 to IFFT6, Update moving feature, and Classify]
• Abundant parallelism -> many-core architecture
[Figure: 4x4 mesh of tiles, each with a router (R), attached to a memory controller]
page 8
Wearable Application Characteristics
Finger gesture application (6-stage pipeline, 16 kernels)
• Power budget -> simple in-order core
Ø Each tile: 8.75 mW
[Figure: 4x4 mesh of routers (R) with a memory controller; tile label: In-order CPU]
page 9
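As a rough check of what the per-tile figure implies, the sketch below multiplies it out. It assumes the 16 routers in the mesh figure correspond to 16 tiles, that the 500 mW constraint from the earlier slide applies to the whole chip, and it ignores the memory controller and any other component; none of these assumptions is stated on the slide.

```python
# Per-tile power is from the slide; tile count and budget scope are assumptions.
TILE_POWER_MW = 8.75
NUM_TILES = 16            # assumed from the 4x4 mesh in the figure
POWER_BUDGET_MW = 500     # power constraint quoted earlier in the deck

total_tile_power_mw = TILE_POWER_MW * NUM_TILES
headroom_mw = POWER_BUDGET_MW - total_tile_power_mw

print(f"tiles: {total_tile_power_mw} mW, headroom: {headroom_mw} mW")
```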
Wearable Application Characteristics
Finger gesture application (6-stage pipeline, 16 kernels)
• Improve performance/power -> accelerators (e.g., ASIC, FPGA, CGRA, and Reconfigurable Functional Unit)
Ø Each tile: 8.75 mW
[Figure: 4x4 mesh of routers (R) with a memory controller; tile label: Accelerators]
page 10
Wearable Application Characteristics
Finger gesture application (6-stage pipeline, 16 kernels)
• Different kernels -> heterogeneous accelerators
[Figure: mesh tiles labelled with kernel-specific accelerators (FFT acc, IFFT acc, Filter acc, Classify acc, Update acc), then relabelled as three accelerator types, Acc 1, Acc 2, and Acc 3; tile labels: Xbar switch, patch, heterogeneous accelerator, in-order CPU]
pages 11-12
Wearable Application Characteristics
Finger gesture application (6-stage pipeline, 16 kernels)
• Imbalanced workload -> fusible accelerators
Ø Compiler decides the fusion of accelerators offline
Ø Actual fusion happens at runtime
[Figure: Stitch compiler tool chain targeting the 4x4 mesh of Acc 1, Acc 2, and Acc 3 tiles; tile labels: Xbar switch, patch, heterogeneous accelerator, in-order CPU]
page 13
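To illustrate why an imbalanced pipeline motivates fusion, the sketch below finds the slowest stage of a pipeline, since that stage sets the rate of the whole pipeline and is the natural candidate for extra accelerator resources. The stage names are one reading of the six stages in the figure, and the per-stage times are invented placeholders, not measurements from the talk; the function name is hypothetical as well.

```python
# Hypothetical per-stage times in ms. These are placeholders chosen only to
# show an imbalance; they are NOT numbers from the talk.
stage_time_ms = {
    "Window": 1.0,
    "FFT": 2.0,
    "Filter": 1.0,
    "IFFT": 4.0,
    "Update": 1.0,
    "Classify": 1.0,
}

def bottleneck(stage_times):
    """Return (stage, time) for the slowest stage, which sets the pipeline period."""
    name = max(stage_times, key=stage_times.get)
    return name, stage_times[name]

slowest_stage, period_ms = bottleneck(stage_time_ms)
print(f"bottleneck: {slowest_stage}, one result every {period_ms} ms")
```

In a steady-state pipeline, speeding up any stage other than the bottleneck leaves the period unchanged, which is why resources are worth concentrating on that one kernel.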
Stitch Architecture - Overview
• Many-core architecture with simple in-order CPUs and accelerators
• Heterogeneous customizable accelerators – polymorphic patches
• Patches can fuse together to alleviate bottleneck kernels
• The fusion of patches is directed offline by our compiler tool chain
[Figure: Stitch compiler tool chain and the 4x4 mesh of Acc 1, Acc 2, and Acc 3 tiles with memory controller; tile detail shows NIC, Patch (AT-AS), L1-D, L1-I, router, and in-order CPU]
page 14
Patch Architecture
• Heterogeneous customizable accelerators – polymorphic patches
• Patch architecture motivated by representative wearable kernels
[Figure: dataflow graphs of the AES, DTW, and FFT kernels, built from operations such as +, -, x, >>, ^, |, &, ld, and st]
page 15
Patch Architecture
• Heterogeneous customizable accelerators – polymorphic patches
• Patch architecture motivated by representative wearable kernels
[Figure: simple computation fragment -> multiple rounds of Longest Common Substring (LCS) identification -> ‘hot’ patterns]
‘Hot’ patterns
  {AT}: arithmetic + memory access   95.7%
  {MA}: multiply + arithmetic        47.8%
  {AA}: arithmetic + arithmetic      34.8%
  {AS}: arithmetic + shift           21.7%
  {SA}: shift + arithmetic           21.7%
page 16
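The slide names Longest Common Substring identification as the way hot patterns are found, without giving the procedure. As a minimal sketch of a single such round, the code below finds the longest contiguous run of operations shared by two operation sequences. The function name, the one-letter encoding (A arithmetic, T memory access, M multiply, S shift, following the pattern names on the slide), and both example sequences are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of items that appears in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)   # prev[j]: match length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return list(a[best_end - best_len:best_end])

# Hypothetical operation sequences for two kernels (not taken from the talk).
kernel_x = ["A", "T", "M", "A", "S"]
kernel_y = ["S", "A", "T", "M", "A"]

shared = longest_common_substring(kernel_x, kernel_y)
print("".join(shared))
```

Repeating this over many kernel pairs and tallying the shared runs is one plausible way to rank patterns by how often they occur, though the actual multi-round procedure may differ.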
Patch Architecture
• Heterogeneous customizable accelerators – polymorphic patches
• Patch architecture motivated by representative wearable kernels
Ø 8 x Acc1 -> {AT-MA}
Ø 4 x Acc2 -> {AT-AS}
Ø 4 x Acc3 -> {AT-SA}
[Figure: 4x4 mesh with memory controller; tile rows as extracted: Acc 1, Acc 2, Acc 1, Acc 2 / Acc 3, Acc 1, Acc 3, Acc 1 / Acc 2, Acc 2, Acc 1, Acc 1 / Acc 3, Acc 1, Acc 3, Acc 1; tile detail shows NIC, Patch (AT-AS), L1-D, L1-I, router, and in-order CPU]
page 17
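The allocation can be cross-checked against the mesh figure. The sketch below encodes the 4x4 grid as it reads row by row in the extracted figure labels (the row order is an assumption about the layout) and counts the patch types; the names `MESH` and `PATCH_TYPE` are illustrative.

```python
from collections import Counter

# Accelerator labels per tile, row by row as they appear in the figure text.
MESH = [
    ["Acc1", "Acc2", "Acc1", "Acc2"],
    ["Acc3", "Acc1", "Acc3", "Acc1"],
    ["Acc2", "Acc2", "Acc1", "Acc1"],
    ["Acc3", "Acc1", "Acc3", "Acc1"],
]

# Accelerator-to-patch mapping as listed on the slide.
PATCH_TYPE = {"Acc1": "AT-MA", "Acc2": "AT-AS", "Acc3": "AT-SA"}

counts = Counter(PATCH_TYPE[acc] for row in MESH for acc in row)
for patch, n in counts.items():
    print(f"{n} x {{{patch}}}")
```

The counts agree with the 8/4/4 split listed on the slide.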
Patch Architecture
• AT-MA
Ø ALU, SPM access; Multiplier, ALU
• AT-SA
Ø ALU, SPM access; Shifter, ALU
• AT-AS
Ø ALU, SPM access; ALU, Shifter
[Figure: datapaths of (a) patch {AT-MA}, (b) patch {AT-SA}, and (c) patch {AT-AS}; each shows control signals, its functional units, an LMAU with local memory, and two outputs (Output 1, Output 2)]
page 18
Patch Architecture
• AT-MA
Ø ALU, SPM access; Multiplier, ALU
• AT-SA
Ø ALU, SPM access; Shifter, ALU
• AT-AS
Ø ALU, SPM access; ALU, Shifter
• T indicates the memory access operation
• A scratchpad memory (SPM) is attached beside the CPU
• Both the CPU and the accelerator can access the SPM
[Figure: patch types AT-MA, AT-SA, and AT-AS placed on the 4x4 mesh as Acc 1, Acc 2, and Acc 3; tile detail shows SPM, NIC, Patch (AT-AS), L1-D, L1-I, router, and in-order CPU]
page 19
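The patch names follow directly from the unit chains listed above: one letter per unit, two units per half. The sketch below spells that naming scheme out; the `UNIT_LETTER` table and `patch_name` helper are hypothetical names introduced for illustration, and the code describes only the naming, not how the hardware is built.

```python
# One letter per functional unit, following the slide's pattern names.
UNIT_LETTER = {"ALU": "A", "SPM access": "T", "Multiplier": "M", "Shifter": "S"}

def patch_name(units):
    """Build a name such as 'AT-MA' from a four-unit chain (two units per half)."""
    letters = "".join(UNIT_LETTER[u] for u in units)
    return letters[:2] + "-" + letters[2:]

# Unit chains as listed on the slide.
PATCHES = [
    ["ALU", "SPM access", "Multiplier", "ALU"],
    ["ALU", "SPM access", "Shifter", "ALU"],
    ["ALU", "SPM access", "ALU", "Shifter"],
]

names = [patch_name(units) for units in PATCHES]
print(names)
```

All three patches share the same first half (ALU followed by SPM access) and differ only in the second half.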