
A 19.4 nJ/Decision, 364 K Decisions/s In-Memory Random Forest Classifier in a 6T SRAM Array
Mingu Kang, Sujan Gonugondla, Naresh Shanbhag
University of Illinois at Urbana-Champaign


  1. A 19.4 nJ/Decision, 364 K Decisions/s In-Memory Random Forest Classifier in a 6T SRAM Array. Mingu Kang, Sujan Gonugondla, Naresh Shanbhag, University of Illinois at Urbana-Champaign

  2. Machine Learning under Resource Constraints
   Embedded statistical inference: IoT, sensor-rich platforms
   Decision making under resource constraints
   Limited form factor, battery-powered, real-time operation

  3. The Random Forest (RF) Algorithm
   Random Forest [1]: an ensemble of many (a few hundred) decision trees
   High accuracy with simple computation (only comparisons)
   Suitable for multi-class classification
   Inherent error resiliency (from its ensemble nature)
  [1] L. Breiman, "Random Forests," Machine Learning, 2001
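The two properties the slide highlights, comparison-only computation and majority voting, can be seen in a toy sketch (an illustration of the algorithm, not the chip's implementation; the tree encoding is assumed):

```python
from collections import Counter

def tree_decide(x, nodes):
    """Walk one binary decision tree.
    nodes: dict node_id -> (feature_index, threshold, left_id, right_id);
    negative child ids encode leaf class labels as -(label + 1)."""
    node_id = 0
    while node_id >= 0:
        idx, tau, left, right = nodes[node_id]
        node_id = left if x[idx] <= tau else right  # only a comparison
    return -node_id - 1

def forest_decide(x, forest):
    """Majority vote over the per-tree decisions."""
    votes = Counter(tree_decide(x, t) for t in forest)
    return votes.most_common(1)[0][0]

# Two hypothetical depth-1 trees over a 2-feature input.
t0 = {0: (0, 0.5, -1, -2)}   # class 0 if x[0] <= 0.5 else class 1
t1 = {0: (1, 0.5, -1, -2)}   # class 0 if x[1] <= 0.5 else class 1
print(forest_decide([0.2, 0.9], [t0, t0, t1]))  # 2-of-3 vote -> 0
```

A wrong vote from any single tree is outvoted by the rest, which is the error resiliency the ensemble provides.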

  4. Implementation Challenges
   Non-uniform tree structure: variations in depth, number of nodes, and symmetry
   Frequent memory access (thresholds τ(m,d) and feature indices idx(m,d)): memory dominates system efficiency
   Irregular data access pattern: the pixel x[idx(m,d)] fetched at each node depends on the path taken through tree m
   Prior art: software and FPGA implementations only, no ASIC; fails to exploit the RF algorithm's inherent error resiliency
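The irregular, input-dependent access pattern can be made concrete with a small sketch (notation follows the slide's τ(m,d)/idx(m,d); the flat heap-style tree layout is an assumption):

```python
def trace_accesses(x, tau, idx, depth):
    """Return the (node, pixel_index) fetches made by one tree stored as
    flat arrays over a complete binary tree of the given depth.
    Each step must fetch the threshold tau[node] and the pixel
    x[idx[node]]; which node (and hence which pixel) comes next is only
    known after comparing them, so the pattern is data-dependent."""
    node, accesses = 0, []
    for _ in range(depth):
        accesses.append((node, idx[node]))
        go_right = x[idx[node]] > tau[node]
        node = 2 * node + (2 if go_right else 1)  # heap-style child index
    return accesses

# Depth-2 complete tree: nodes 0..2 are internal.
tau = [10, 3, 7]           # per-node thresholds
idx = [4, 0, 2]            # per-node pixel indices
x   = [5, 0, 9, 0, 12]     # input pixels
print(trace_accesses(x, tau, idx, 2))  # -> [(0, 4), (2, 2)]
```

A different input would visit different nodes and touch different pixels, which is exactly why a fixed, regular memory layout is hard to exploit.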

  5. Proposed Solution: Deep In-Memory Architecture (DIMA) with DSS
   DIMA [2-4]: embedded analog processing; storage density and normal read/write function preserved
   FR: functional read
   BLP: bitline processor (subtraction, comparison)
   CBLP: cross BLP (aggregation)
   RDL: ADC and residual digital logic
   Deterministic sub-sampling (DSS): regularizes the memory access pattern
  [2] M. Kang et al., ICASSP 2014; [3] M. Kang et al., arXiv 2016; [4] M. Kang et al., US Patent No. 9,697,877

  6. RF Chip Architecture
   SRAM bitcell array: stores up to 42 groups; each group holds 4 sub-groups (1 sub-group = 1 tree)
   Input buffer: stores 4:1 sub-sampled pixels in 4 sections for DSS
   Cross bar (CB): 31 CB units per sub-group, enabled in parallel
   Comparator (COMP): 128 analog comparators (ΔV_BL vs. ΔV_BLB)
  Proposed architecture. IREG: pixel index register; RSREG: RSS register

  7. Functional Read (FR)
   Conventional read fetches one bit per bitline per precharge; FR fetches and computes a linear combination of the stored data in the analog domain
   Binary-weighted wordline access makes bit b of a stored B-bit word contribute with weight 0.5^b, so the bitline swing ΔV_BL ∝ Σ_b 0.5^b · d_b is proportional to the stored word's value
   (L·B) times more data access per read and precharge cycle (B: bit precision, L: column mux ratio)
   Savings in energy and delay at the cost of reduced SNR
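A behavioral model of the functional read, assuming the binary-weighted swing relation on the slide (the MSB-referenced swing `dv_msb` is an illustrative parameter, not a measured value):

```python
def functional_read(bits, dv_msb=0.1):
    """Model of FR: bits[0] is the MSB of a B-bit stored word; returns
    the bitline swing Delta V_BL = dv_msb * sum_b 0.5**b * bits[b]
    (volts), i.e. a swing proportional to the word's binary value."""
    return dv_msb * sum(0.5 ** b * d for b, d in enumerate(bits))

# An 8-bit word is fetched in ONE precharge cycle instead of B = 8
# conventional reads (times L for the column mux), the L*B saving.
word = [1, 0, 1, 1, 0, 0, 0, 0]   # 0b10110000
print(round(functional_read(word), 4))  # -> 0.1375
```

The cost the slide notes is visible here too: the information about all B bits is squeezed into one analog voltage, so the effective SNR per bit drops.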

  8. In-Memory Bitline Processing
   Subtraction: store the threshold T and the complement of the input X in the same column; a single functional read over both then yields a bitline swing ΔV_BL ∝ (T − X) plus a constant offset
   Comparison: an analog comparator resolves ΔV_BL > ΔV_REF → 1, else 0
  [Figure: measured subtraction in 65 nm CMOS; V_BL vs. (T_MSB − X_MSB) over a column of the SRAM array, with spread at each value due to the possible (T_MSB, X_MSB) combinations]
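A behavioral sketch of the subtract-and-compare step (an assumed model: the offset handling via the reference voltage and the per-LSB swing `DV_LSB` are illustrative, not the chip's measured values):

```python
B = 8
DV_LSB = 0.4e-3   # hypothetical bitline swing per LSB (volts)

def bl_swing(t, x):
    """Store T and the one's complement of X in the same column; one
    functional read accumulates both, giving a swing proportional to
    T + (2**B - 1 - X), i.e. T - X plus a constant offset."""
    x_bar = (2 ** B - 1) - x
    return DV_LSB * (t + x_bar)

def compare(t, x):
    """Analog comparator: the reference absorbs the constant offset,
    so the 1/0 output is the sign of T - X (the tree-node decision)."""
    v_ref = DV_LSB * (2 ** B - 1)   # swing when T == X
    return 1 if bl_swing(t, x) > v_ref else 0

print(compare(120, 100))  # T > X -> 1
print(compare(90, 100))   # T < X -> 0
```

One precharge of one column thus replaces a digital fetch-subtract-compare sequence for a whole tree node.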

  9. Deterministic Sub-Sampling (DSS)
   Random sub-sampling (RSS) alone requires a complex cross bar (e.g., 256:1 for a 256-pixel input X)
   DSS before RSS: sub-samples X to generate four sub-images X_1, X_2, X_3, X_4
   Reduces cross-bar complexity (e.g., 256:1 → 64:1)
   More than 3× energy and 4× layout-area savings
   4:1 ratio chosen from the accuracy vs. sub-sampling-ratio trade-off
  Proposed RF algorithm
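A sketch of the 4:1 split on a 16x16 image (the interleaved 2x2-phase scheme is an assumption; the slide only specifies four sub-images at a 4:1 ratio):

```python
def dss(image, n=16):
    """Deterministically split a flat n*n image into 4 sub-images of
    n*n/4 pixels each, by the pixel's phase within its 2x2 block.
    Any single pixel now lives in a known 64-entry sub-image, so the
    crossbar routing a pixel to a tree shrinks from 256:1 to 64:1."""
    subs = [[], [], [], []]
    for r in range(n):
        for c in range(n):
            phase = (r % 2) * 2 + (c % 2)
            subs[phase].append(image[r * n + c])
    return subs

image = list(range(256))
subs = dss(image)
print([len(s) for s in subs])  # -> [64, 64, 64, 64]
```

Random sub-sampling (RSS) then picks pixels within each 64-entry sub-image, preserving the randomness the RF algorithm needs while keeping the routing regular.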

  10. Application & Measured Results
   Training (off-chip): 200 images per class; bit precision 8, tree depth 6, 64 trees
   Testing: 200 randomly chosen images from the test set (KUL Belgium traffic sign dataset)

  Platform (65 nm CMOS) | # of trees | Tree depth | Max classification rate (decisions/ms) | Energy per decision (nJ) | Energy-delay product (fJ·s) | Accuracy (%)
  Conv. arch.           | 64 | 6 | 167/bank | 60.4 | 361.6 | 93.5
  Proposed arch.        | 64 | 6 | 364/bank | 19.4 | 53.2  | 94

   EDP reduced by 6.8×
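The 6.8x EDP claim follows from the table's own numbers, taking the decision latency as the inverse of the max classification rate:

```python
def edp_fj_s(energy_nj, rate_per_ms):
    """Energy-delay product in fJ*s from energy per decision (nJ) and
    classification rate (decisions/ms)."""
    delay_s = 1e-3 / rate_per_ms              # seconds per decision
    return energy_nj * 1e-9 * delay_s / 1e-15  # J*s -> fJ*s

conv = edp_fj_s(60.4, 167)   # conventional digital architecture
ours = edp_fj_s(19.4, 364)   # proposed in-memory architecture
print(round(conv, 1), round(ours, 1), round(conv / ours, 1))
# -> 361.7 53.3 6.8
```

The small differences from the table's 361.6 and 53.2 fJ·s come from rounding in the quoted rates and energies.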

  11. Measured Energy vs. Accuracy Trade-Off
   Lowering the BL swing ΔV_BL reduces energy but also accuracy
   More trees → more error resiliency → allows a lower BL swing → higher energy efficiency
  [Figure: accuracy vs. # of trees vs. ΔV_BL, and accuracy vs. energy with respect to BL swing; * ΔV_BL for the conventional case is 10× the ΔV_BL per LSB]
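The shape of this trade-off can be reproduced with a toy Monte Carlo model (illustrative numbers only, not the measured chip data): treat a lower BL swing as a lower per-tree decision accuracy, and let majority voting recover it.

```python
import random

def forest_accuracy(n_trees, p_tree_correct, trials=20000, seed=0):
    """Estimate ensemble accuracy when each tree is independently
    correct with probability p_tree_correct; the forest is correct
    when a strict majority of trees vote correctly."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = sum(rng.random() < p_tree_correct
                      for _ in range(n_trees))
        wins += correct * 2 > n_trees
    return wins / trials

# Even trees that are right only 60% of the time (as if read at a very
# low BL swing) give a strong ensemble once there are enough of them:
for n in (1, 16, 64):
    print(n, round(forest_accuracy(n, 0.6), 3))
```

This is the mechanism behind the slide's point: the hundreds of trees typical in practice leave headroom to scale ΔV_BL down and bank the energy savings.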

  12. Chip Summary & Comparison
  Chip summary:
   Technology: 65 nm CMOS
   Die size: 1.2 × 1.2 mm
   SRAM capacity: 16 KB (512 × 256 bitcells)
   Bit-cell size: 2.11 × 0.92 µm²
   CLK frequency (CTRL): 1 GHz
   Supply voltage: 1.0 V (core), 0.75 V (CTRL)

  Comparison with the state of the art:
  Prior art | Process | Algorithm | Dataset | Input size (8b) | Throughput (decisions/s) | Energy (nJ/decision) | EDP (fJ·s/decision) | Accuracy
  [5] | 130 nm CMOS | Support vector machine | Traffic sign video | 320 × 240 | 33 [40K]* | 1.5M [1250]* | 45G [31250]* | 90%
  [6] | 14 nm tri-gate | K-nearest neighbor | Not reported | 128 | 21.5M [498.8K]* | 3.4 [145.3]* | 0.2 [292.3]* | Not reported
  Ours (M = 64) | 65 nm CMOS | Random forest | KUL traffic signs | 16 × 16 | 364.4K | 19.4 | 52.4 (w/ CTRL) | 94%

  [5] J. Park et al., JSSC 2012; [6] H. Kaul et al., ISSCC 2016; * scaled to 65 nm CMOS

  13. Conclusions
   First ASIC implementation of the RF algorithm: low-SNR processing via DIMA and DSS
   Energy & speed benefits: 2.2× smaller delay and 3.1× smaller energy → 6.8× smaller EDP compared to a digital ASIC
   Higher potential in large-scale applications: real-life deployments use up to a few hundred trees → higher error resiliency → more room to scale ΔV_BL for energy efficiency
   Future work: on-chip training to compensate for process variations; different algorithms (e.g., boosted ensemble classifiers)

  14. Acknowledgment
  This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by SRC and DARPA.
