A 19.4 nJ/Decision 364K Decisions/s In-Memory Random Forest Classifier in 6T SRAM Array
Mingu Kang, Sujan Gonugondla, Naresh Shanbhag
University of Illinois at Urbana-Champaign
Machine Learning under Resource Constraints
- Embedded statistical inference: IoT, sensor-rich platforms
- Decision making under resource constraints: limited form factor, battery-powered, real-time
The Random Forest (RF) Algorithm
- Random forest [1]: an ensemble of many (a few hundred) decision trees
- High accuracy
- Simple computation (comparisons only)
- Suitable for multi-class classification
- Inherent error resiliency (from the ensemble nature)
[Figure: RF algorithm]
[1] L. Breiman, Machine Learning, 2001
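The "comparisons only" point can be illustrated with a minimal software sketch of RF inference. The tuple-based tree encoding, the `<` comparison direction, and the function names below are assumptions made for illustration; they are not taken from the chip or from [1].

```python
from collections import Counter

# Hypothetical encoding: an internal node is (feature_index, threshold, left, right);
# a leaf is an int class label.
def classify_tree(node, x):
    """Walk one tree using only threshold comparisons until a leaf is reached."""
    while not isinstance(node, int):
        feature_index, threshold, left, right = node
        node = left if x[feature_index] < threshold else right
    return node

def classify_forest(trees, x):
    """Majority vote over the per-tree decisions."""
    votes = Counter(classify_tree(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three small illustrative trees over a 2-feature input.
tree_a = (0, 128, 0, (1, 64, 1, 2))
tree_b = (1, 64, 0, 2)
tree_c = (0, 128, 0, 2)
forest = [tree_a, tree_b, tree_c]

print(classify_forest(forest, [200, 100]))
print(classify_forest(forest, [10, 10]))
```

Each tree contributes one vote, so a single wrong comparison in one tree can be outvoted by the others, which is the ensemble effect the slide refers to.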
Implementation Challenges
- Non-uniform tree structure
  - Variations in depth, number of nodes, and symmetry
- Frequent memory access [symbols lost in extraction]
  - Memory dominates system efficiency
- Irregular data access pattern [expression lost in extraction]
- Prior art: software and FPGA implementations only; no ASIC
  - Fails to exploit the inherent error resiliency of the RF algorithm
Proposed Solution: Deep In-Memory Architecture (DIMA) with DSS
- DIMA [2-4]: embedded analog processing
  - Storage density and normal read/write functions are preserved
  - FR: functional read
  - BLP: bitline processor (subtraction, comparison)
  - CBLP: cross BLP (aggregation)
  - RDL: ADC and residual digital logic
- Deterministic sub-sampling (DSS)
  - Regularizes the memory access pattern
[2] M. Kang, et al., ICASSP 2014
[3] M. Kang, et al., arXiv 2016
[4] M. Kang, et al., US Patent no. 9,697,877
RF Chip Architecture
- SRAM bitcell array
  - Stores up to 42 groups
  - Each group has 4 sub-groups (1 sub-group = 1 tree)
- Input buffer
  - Stores 4:1 sub-sampled pixels in 4 sections for DSS
- Cross bar (CB)
  - 31 CB units per sub-group, enabled in parallel
- Comparator (COMP)
  - 128 analog comparators (ΔV_BL vs. ΔV_BLB)
[Figure: proposed architecture. IREG: pixel index register; RSREG: RSS register]
Functional READ (FR)
[Figure: functional read (FR) compared with conventional read, showing the bitline voltage drops ΔV_BL and ΔV_BLB; the proportionality expressions did not survive extraction]
- B: bit precision; L: column mux ratio
- Fetches the stored data and computes a linear combination of it in the analog domain
- (L·B) times more data accessed per read and precharge
- Savings in energy and delay at the cost of reduced SNR
In-Memory Bitline Processing
- Subtraction: the two operands are stored in the same column [expressions lost in extraction]
- Comparison: ΔV_BL > ΔV_BLB or ΔV_BL < ΔV_BLB resolves to a 1-bit output (1 or 0)
[Figure: a column of the SRAM array]
[Figure: measured subtraction in 65 nm CMOS. V_BL (V) versus T_MSB - X_MSB over -15 to 15. Annotation: variation due to the possible combinations of (T_MSB, X_MSB) at a given T_MSB - X_MSB value]
Deterministic Sub-sampling (DSS)
- Random sub-sampling (RSS)
  - Requires a complex cross bar (e.g., 256:1 for a 256-pixel input image)
- Deterministic sub-sampling (DSS) before RSS
  - Sub-samples the input image to generate four sub-images
  - Reduces cross bar complexity (e.g., 256:1 → 64:1)
  - More than 3× energy and 4× layout area savings
  - 4:1 chosen based on the accuracy vs. sub-sampling ratio trade-off
[Figure: proposed RF algorithm]
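A rough sketch of DSS followed by RSS, assuming a 16 × 16 input stored row-major. The slides do not state the sub-sampling pattern, so the 2 × 2 phase split used here is an assumption, as are the function names and the choice of 31 selected pixels per tree.

```python
import random

def deterministic_subsample(image, width):
    """Split a flat row-major image into four sub-images by 2x2 phase (assumed pattern)."""
    height = len(image) // width
    subs = [[], [], [], []]
    for r in range(height):
        for c in range(width):
            # Phase 0..3 is set by the parity of the row and column.
            subs[2 * (r % 2) + (c % 2)].append(image[r * width + c])
    return subs

def random_subsample(sub_image, indices):
    """Select pixels from one sub-image using pre-drawn indices (the RSS step)."""
    return [sub_image[i] for i in indices]

image = list(range(256))                 # stand-in for a 16x16, 256-pixel input
subs = deterministic_subsample(image, 16)

rng = random.Random(0)                   # fixed seed so the indices are reproducible
indices = [rng.randrange(64) for _ in range(31)]  # 31 is an illustrative count
selected = random_subsample(subs[0], indices)

print(len(subs), [len(s) for s in subs], len(selected))
```

After the split, every random index addresses one of 64 pixels instead of one of 256, which is the 256:1 → 64:1 selection width the slide describes.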
Application & Measured Results
- Training (off-chip)
  - 200 images per class used for training
  - Bit precision: 8; tree depth: 6; 64 trees
- Testing
  - 200 testing images chosen at random from the test data set
- Dataset: KUL Belgium traffic sign dataset

Platform (65 nm CMOS)                 Conv. Arch.   Proposed Arch.
# of trees                            64            64
Max tree depth                        6             6
Classification rate (decisions/ms)    167/bank      364/bank
Energy per decision (nJ)              60.4          19.4
Energy-delay product (fJ·s)           361.6         53.2
Accuracy (%)                          93.5          94

- EDP reduction by 6.8×
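The energy-delay product figures can be cross-checked from the other measured values, assuming that delay per decision is the reciprocal of the classification rate (an assumption; the slides do not define the delay term). The function name is hypothetical.

```python
def edp_fj_s(energy_nj, rate_per_ms):
    """Energy-delay product in fJ*s, taking delay = 1 / classification rate."""
    delay_s = 1e-3 / rate_per_ms          # seconds per decision
    energy_j = energy_nj * 1e-9           # joules per decision
    return energy_j * delay_s * 1e15      # J*s converted to fJ*s

conv = edp_fj_s(60.4, 167)                # conventional architecture
prop = edp_fj_s(19.4, 364)                # proposed architecture

print(f"conv EDP = {conv:.1f} fJ*s, proposed EDP = {prop:.1f} fJ*s")
print(f"speed-up = {364 / 167:.1f}x, energy saving = {60.4 / 19.4:.1f}x, "
      f"EDP reduction = {conv / prop:.1f}x")
```

Under that assumption the computed values land within rounding of the reported 361.6 and 53.2 fJ·s, and the ratio reproduces the 6.8× EDP reduction.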
Measured Energy vs. Accuracy Trade-off
- Accuracy vs. # of trees vs. ΔV_BL: error resiliency allows a lower BL swing
- Accuracy vs. energy: higher energy efficiency with respect to BL swing (ΔV_BL) **
** ΔV_BL for conv. is 10× "ΔV_BL per LSB"
[Figure: measured accuracy against BL swing, energy, and # of trees]
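Why more trees tolerate noisier comparisons can be sketched with a simple voting model. This is an idealized illustration rather than the measured behavior: it assumes a two-class problem, independent per-tree errors with the same probability, and ties counted as errors. The function name is hypothetical.

```python
from math import comb

def majority_error(n_trees, p_err):
    """Probability that at least half of n_trees independent votes are wrong."""
    k_min = (n_trees + 1) // 2            # smallest wrong-vote count that loses or ties
    return sum(
        comb(n_trees, k) * p_err**k * (1 - p_err) ** (n_trees - k)
        for k in range(k_min, n_trees + 1)
    )

# Per-tree error rate of 0.3 is an arbitrary illustrative value.
for n in (1, 8, 64):
    print(n, majority_error(n, 0.3))
```

In this model the ensemble error falls quickly as trees are added, which is consistent with the slide's point that a larger forest leaves room to lower the BL swing.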
Chip Summary & Comparison
[Figure: chip micrograph]

Chip summary
Technology            65 nm CMOS
Die size              1.2 × 1.2 mm
SRAM capacity         16 KB (512 × 256 bit-cells)
Bit-cell size         2.11 × 0.92 µm²
CTRL CLK freq.        1 GHz
Supply voltage (V)    CORE 1.0, CTRL 0.75

Comparison with state-of-the-art
                           [5]                      [6]                   Ours (M = 64)
Process                    130 nm CMOS              14 nm tri-gate        65 nm CMOS
Algorithm                  Support vector machine   K-nearest neighbor    Random forest
Dataset                    Traffic sign video       Not reported          KUL traffic signs
Input size (8b)            320 × 240                128                   16 × 16
Accuracy                   90%                      Not reported          94%
Throughput (decision/s)    33 [40K]*                21.5M [498.8K]*       364.4K
Energy (nJ/decision)       1.5M [1250]*             3.4 [145.3]*          19.4 (w/ CTRL)
EDP (fJs/decision)         45G [31250]*             0.2 [292.3]*          52.4

[5] J. Park, JSSC 2012
[6] H. Kaul, ISSCC 2016
* scaled to 65 nm CMOS
Conclusions
- First ASIC implementation of the RF algorithm
  - Low-SNR processing via DIMA and DSS
- Energy and speed benefits
  - 2.2× smaller delay and 3.1× smaller energy → 6.8× smaller EDP compared to a digital ASIC
- Higher potential in large-scale applications
  - # of trees reaches a few hundred in real-life applications → higher error resiliency → more room to scale ΔV_BL for energy efficiency
- Future work
  - On-chip training to compensate for process variations
  - Different algorithms (e.g., boosted ensemble classifier)
Acknowledgment
This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by SRC and DARPA.