Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das
M-Bits Research Group
Can we transform a CPU into a neural accelerator?
• CPU with its cache ($) vs. GPU
• CPU + Neural Cache: more parallelism (++), less data movement (--)
Transforming caches into massively parallel vector ALUs
• 18-core Xeon processor: 45 MB LLC in 18 LLC slices
• 2.5 MB LLC slice: 20 ways (Way 1 to Way 20) plus TMU and CBOX; 32 kB data banks built from 8 kB arrays
• 8 kB SRAM array: 256 bitlines (BL/BLB 0 to 255), wordlines (WL) driven by row decoders; operand vectors (Array A, Array B) are stored as bit-slices (Bit-Slice 0 to 3), and logic at the bottom of the bitlines computes A + B
• Bitline ALU: sense amplifiers (SA) referenced to Vref sense BL and BLB to give A&B and ~A & ~B, from which A^B is formed; S = A^B^C; Cout is held in a carry latch (D/Q, enabled by C_EN) and returns as Cin
• Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs
• Passive last-level cache transformed into ∼1 million bit-serial active ALUs
• Supported operations: add, multiply, divide, with configurable precision
• Bit-serial operation @ 2.5 GHz
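The totals on this slide can be checked with a few multiplications. The sketch below uses only the numbers quoted above; the figure of 16 arrays per way is not stated on the slide and is an assumption derived from 5760 arrays / 360 ways.

```python
# Cache hierarchy figures quoted on the slide.
LLC_SLICES = 18            # 18 LLC slices
WAYS_PER_SLICE = 20        # Way 1 .. Way 20
ARRAY_BYTES = 8 * 1024     # 8 kB SRAM array
BITLINES_PER_ARRAY = 256   # bitlines 0 .. 255, one bit-serial ALU each

# Assumption (not on the slide): derived from 5760 arrays / 360 ways.
ARRAYS_PER_WAY = 16

ways = LLC_SLICES * WAYS_PER_SLICE
arrays = ways * ARRAYS_PER_WAY
alus = arrays * BITLINES_PER_ARRAY
capacity_mb = arrays * ARRAY_BYTES / (1024 * 1024)

print(ways, arrays, alus, capacity_mb)  # 360 5760 1474560 45.0
```

The array count times the array size reproduces the 45 MB LLC capacity, so the per-way figure is at least consistent with the rest of the slide.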
Why bit-serial? Bit-parallel arithmetic
• Words (Word 0 to Word 3) of Array A and Array B are laid out along rows; two wordlines (WL1, WL2) are activated and the logic below the bitlines (BL/BLB 0 to 255) computes A + B
• Each bitline produces one sum bit (S); carries (C) must propagate across bitlines
• ! High complexity
• ! Loss of throughput and efficiency
Why bit-serial? Bit-serial arithmetic
• Transposed data: each word (Word 0 to Word 3) occupies one bitline; row i holds Bit-Slice i of every word
• Each bitline has its own sum (S) and carry (C) logic; carries start at 0
• Cycle 1 to Cycle 4: WL1 and WL2 select Bit-Slice 0, 1, 2, then 3 of Array A and Array B; each cycle yields one sum bit per word and a carry kept for the next cycle
• ✓ Low area complexity
• ✓ High throughput
• ✓ Configurable & high precision
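The transposed layout can be illustrated in software. This is a minimal sketch, not the hardware transpose mechanism; the function names `to_bit_slices` and `from_bit_slices` and the sample words are hypothetical and chosen only for illustration.

```python
def to_bit_slices(words, nbits):
    """Transpose words into bit-slices: row i holds bit i of every word."""
    return [[(w >> i) & 1 for w in words] for i in range(nbits)]


def from_bit_slices(slices):
    """Inverse transpose: rebuild each word from its column of bits."""
    ncols = len(slices[0])
    return [
        sum(slices[i][col] << i for i in range(len(slices)))
        for col in range(ncols)
    ]


# Four 4-bit words, one per bitline (illustrative values).
words = [5, 3, 12, 9]
slices = to_bit_slices(words, 4)
for i, row in enumerate(slices):
    print(f"Bit-Slice {i}: {row}")

restored = from_bit_slices(slices)
```

With this layout, a single wordline activation reads the same bit position of every word at once, which is what lets each bitline process its own word independently.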
Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results
In-SRAM Arithmetic
(Figure repeated from "Transforming caches into massively parallel vector ALUs": 8 kB SRAM array with bit-sliced operands A and B, bitline ALU with S = A^B^C and carry latch; 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.)
Logical Operations In-SRAM
• Bitlines BL0/BLB0 to BLn/BLBn; wordlines driven by row decoders
• Changes: an additional row decoder (Row Decoder-O), so that two wordlines can be activated together; reconfigurable sense amplifiers (differential sense amplifiers that can also operate as single-ended sense amplifiers against Vref)
• With rows A and B both activated, single-ended sensing of BL gives A AND B and single-ended sensing of BLB gives A NOR B
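A behavioural model may help make the two sensed results concrete. The sketch below is an assumption-laden software stand-in, not a circuit simulation: it assumes the usual bitline-computing model in which a precharged BL stays high only if every activated cell stores 1, and BLB stays high only if every activated cell stores 0. The helper names are hypothetical. XOR is derived as the NOR of the two sensed outputs, matching the A&B, ~A & ~B and A^B signals shown on the bitline ALU.

```python
def sense_bitlines(row_a, row_b):
    """Activate two wordlines together and sense BL and BLB per column.

    Assumed model: BL reads high only if both cells store 1 (AND);
    BLB reads high only if both cells store 0 (NOR).
    """
    and_out = [a & b for a, b in zip(row_a, row_b)]
    nor_out = [(1 - a) & (1 - b) for a, b in zip(row_a, row_b)]
    return and_out, nor_out


def xor_from(and_out, nor_out):
    """A^B is true exactly when neither A&B nor ~A&~B is true."""
    return [1 - (x | y) for x, y in zip(and_out, nor_out)]


# Illustrative rows covering all four input combinations.
row_a = [0, 1, 0, 1]
row_b = [0, 0, 1, 1]
and_out, nor_out = sense_bitlines(row_a, row_b)
xor_out = xor_from(and_out, nor_out)
print(and_out, nor_out, xor_out)
```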
Addition In-SRAM
• 256 bitlines per array; each bitline (BL/BLB pair) has sense amplifiers (SA, referenced to Vref) producing A&B and ~A & ~B, from which A^B is formed
• S = A^B^C; Cout is stored in the carry latch (C, enabled by C_EN) and returns as Cin in the next cycle
• Example layout: operand rows A0, A1 and B0, B1; result rows P0, P1, P2; Carry and Sum per bitline
• Cycle 1: read A0 and B0; write Sum to P0; latch Carry
• Cycle 2: read A1 and B1 together with the latched Carry; write Sum to P1; latch Carry
• Cycle 3: write the remaining Carry to P2
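The cycle-by-cycle addition can be sketched as a loop over bit-slices with one carry latch per bitline. This is an illustrative model under stated assumptions: the function names are hypothetical, the operand values are chosen for the example rather than read from the slide, and the carry expression A&B | (A^B)&C is the standard full-adder carry, which the slide does not spell out.

```python
def bit_serial_add(a_slices, b_slices):
    """Add two vectors stored as bit-slices (least significant slice first)."""
    ncols = len(a_slices[0])
    carry = [0] * ncols                 # one carry latch per bitline
    p_slices = []
    for a_row, b_row in zip(a_slices, b_slices):   # one cycle per bit-slice
        sum_row = []
        for col in range(ncols):        # bitlines work in parallel in hardware
            a, b, c = a_row[col], b_row[col], carry[col]
            and_ab = a & b                      # sensed on BL
            nor_ab = (1 - a) & (1 - b)          # sensed on BLB
            xor_ab = 1 - (and_ab | nor_ab)      # A^B
            sum_row.append(xor_ab ^ c)          # S = A^B^C
            carry[col] = and_ab | (xor_ab & c)  # Cout, latched for next cycle
        p_slices.append(sum_row)
    p_slices.append(carry[:])           # final cycle: carry becomes the top bit
    return p_slices


def decode(slices):
    """Rebuild the integer held on each bitline."""
    return [
        sum(slices[i][col] << i for i in range(len(slices)))
        for col in range(len(slices[0]))
    ]


# Two bitlines holding 2-bit operands: A = [3, 1], B = [2, 3].
a_slices = [[1, 1], [1, 0]]   # rows A0, A1
b_slices = [[0, 1], [1, 1]]   # rows B0, B1
p_slices = bit_serial_add(a_slices, b_slices)
print(p_slices, decode(p_slices))
```

Two-bit operands take three cycles here (two sum cycles plus the carry write-back), matching the three cycles and three result rows P0 to P2 on the slides.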
Multiplication In-SRAM
• Operand rows A0, A1 and B0, B1; product rows P0 to P3; Carry, Sum, and Tag latches per bitline
• Partial products of A1 A0 × B1 B0: A1B0 A0B0 and A1B1 A0B1, accumulated into the product rows
• Cycle 1: multiplier bit B0 is loaded into the Tag latch
• Cycle 2: P0 <- A0 B0 (copy of A0 predicated on Tag)
• Cycle 3: P1 <- A1 B0
• Cycle 4: Tag latch values change, consistent with loading B1 for the A1B1 A0B1 partial product
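The slides show only the first four cycles. A complete tag-predicated shift-and-add multiply might look like the sketch below. This is an assumption about how the remaining cycles proceed, not the authors' exact cycle schedule: for each multiplier bit, the bit is loaded into the tag latch and the multiplicand is added into the shifted product rows only on bitlines whose tag is 1. All names and operand values are hypothetical.

```python
def bit_serial_multiply(a_slices, b_slices):
    """Multiply vectors stored as bit-slices using tag-predicated addition."""
    n = len(a_slices)
    ncols = len(a_slices[0])
    p = [[0] * ncols for _ in range(2 * n)]    # product rows P0 .. P(2n-1)
    for j, b_row in enumerate(b_slices):
        tag = b_row[:]                          # load multiplier bit j into tag
        carry = [0] * ncols                     # clear carry latches
        for i, a_row in enumerate(a_slices):
            for col in range(ncols):
                if not tag[col]:
                    continue                    # predicated off: row unchanged
                x, y, c = p[i + j][col], a_row[col], carry[col]
                p[i + j][col] = x ^ y ^ c
                carry[col] = (x & y) | ((x ^ y) & c)
        for col in range(ncols):
            if tag[col]:
                p[j + n][col] = carry[col]      # write back the final carry
    return p


def decode(slices):
    """Rebuild the integer held on each bitline."""
    return [
        sum(slices[i][col] << i for i in range(len(slices)))
        for col in range(len(slices[0]))
    ]


# Two bitlines holding 2-bit operands: A = [3, 1], B = [2, 3].
a_slices = [[1, 1], [1, 0]]
b_slices = [[0, 1], [1, 1]]
product = decode(bit_serial_multiply(a_slices, b_slices))
print(product)
```

Predicating on the tag lets every bitline run the same sequence of row activations while still multiplying by a different value, since bitlines whose multiplier bit is 0 simply skip the write.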