Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das
M-Bits Research Group

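Can we transform a CPU into a neural accelerator?
[Figure: a CPU with its cache ($) shown next to a GPU.]

Can we transform a CPU into a neural accelerator?
[Figure: GPU, and a CPU whose cache becomes the Neural Cache.]
Neural Cache: more parallelism (++), less data movement (--)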

Transforming caches into massively parallel vector ALUs
[Figure, built up over six slides: the cache hierarchy from processor down to a single bitline.]
• 18-core Xeon processor: 45 MB LLC, 18 LLC slices
• 2.5 MB LLC slice: TMU, CBOX, Way 1 to Way 20, 32 kB data banks; 360 ways in total
• 8 kB SRAM array: bitlines BL/BLB 255 to 0, wordlines (WL) driven by row decoders; 5760 arrays in total
• Array A and Array B are each stored as Bit-Slice 0 to Bit-Slice 3; the logic below the array produces A + B
• Bitline ALU: sense amplifiers (SA) compare BL and BLB against Vref, with outputs labeled A&B, ~A & ~B and A^B; S = A^B^C, with the carry C held in a latch (D, Q, EN, C_EN, Cin, Cout); 1,474,560 ALUs in total
• Passive last-level cache transformed into ~1 million bit-serial active ALUs
✓ Add ✓ Multiply ✓ Divide
• Configurable precision
• Bit-serial operation @ 2.5 GHz
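
The totals quoted on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, not taken from the slides: the 20 ways per slice is read off the "Way 1 ... Way 20" labels, and the number of arrays per way is derived by assuming that a slice's 2.5 MB is split evenly across its ways and then across 8 kB arrays.

```python
# Figures quoted on the slide.
LLC_SLICES = 18
BITLINES_PER_ARRAY = 256              # bitlines are numbered 255 down to 0
ARRAY_BYTES = 8 * 1024                # 8 kB SRAM array
SLICE_BYTES = int(2.5 * 1024 * 1024)  # 2.5 MB LLC slice

# Assumption: 20 ways per slice, read from the "Way 1 ... Way 20" labels.
WAYS_PER_SLICE = 20

ways = LLC_SLICES * WAYS_PER_SLICE
# Assumption: capacity is divided evenly among ways, then among 8 kB arrays.
arrays_per_way = SLICE_BYTES // WAYS_PER_SLICE // ARRAY_BYTES
arrays = ways * arrays_per_way
alus = arrays * BITLINES_PER_ARRAY    # one bit-serial ALU per bitline

print(ways, arrays, alus)
```

Under these assumptions the three printed values match the 360 ways, 5760 arrays and 1,474,560 ALUs shown on the slide.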

Why bit-serial? Bit-parallel arithmetic (A + B)
[Figure, built up over six slides: an SRAM array with row decoders, bitlines BL/BLB 255 to 0 and logic below the array. Word 0 to Word 3 of Array A and of Array B are stored along wordlines. WL1 and WL2 are activated together, and the sum bits S appear one bitline at a time as the carry C moves from one bitline to the next.]
• Carry propagation across bitlines
! High complexity
! Loss of throughput and efficiency

Why bit-serial? Bit-serial arithmetic (A + B)
[Figure, built up over six slides: the same array with transposed data. Word 0 to Word 3 each occupy one bitline, and Bit-Slice 0 to Bit-Slice 3 of Array A and Array B occupy successive wordlines. In Cycle 1 to Cycle 4, WL1 and WL2 select one bit-slice of A and of B, moving from Bit-Slice 0 up to Bit-Slice 3. Every bitline produces a sum bit S each cycle and keeps its own carry C, which starts at 0.]
✓ Low area complexity
✓ High throughput
✓ Configurable & high precision
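
The transposed layout and the per-cycle addition can be sketched in Python. This is an illustrative model only, not the hardware: each bit-slice is packed into one Python integer so that bitwise operators act on all lanes (bitlines) at once, and the function names and operand values are hypothetical.

```python
def transpose(words, nbits):
    """Return bit-slices: slices[i] packs bit i of every word, one lane per word."""
    slices = []
    for i in range(nbits):
        s = 0
        for lane, w in enumerate(words):
            s |= ((w >> i) & 1) << lane
        slices.append(s)
    return slices


def untranspose(slices, lanes):
    """Rebuild one integer per lane from a list of bit-slices."""
    return [
        sum(((s >> lane) & 1) << i for i, s in enumerate(slices))
        for lane in range(lanes)
    ]


def bit_serial_add(a_slices, b_slices):
    """Add all lanes in parallel, one bit-slice per cycle, least significant first."""
    carry = 0                      # one carry bit per lane, initially 0
    out = []
    for a, b in zip(a_slices, b_slices):
        out.append(a ^ b ^ carry)  # S = A ^ B ^ C on every lane
        carry = (a & b) | (carry & (a ^ b))
    out.append(carry)              # the final carry becomes the top bit-slice
    return out


a_words = [3, 1, 2, 0]             # illustrative 2-bit operands, one per lane
b_words = [2, 3, 1, 3]
sums = untranspose(
    bit_serial_add(transpose(a_words, 2), transpose(b_words, 2)),
    lanes=len(a_words),
)
print(sums)
```

The loop length depends only on the operand width, not on the number of lanes, which is the property the slide summarises as high throughput with configurable precision.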

Outline
• Motivation
• Bit-Serial Arithmetic
• Transpose
• Mapping of Convolution to Array
• Methodology
• Results

In-SRAM Arithmetic
[Figure: the cache hierarchy and bitline ALU diagram: 18-core Xeon processor, 45 MB LLC, 18 LLC slices of 2.5 MB, 360 ways, 5760 SRAM arrays of 8 kB, and 1,474,560 bitline ALUs computing S = A^B^C with a carry latch.]

Logical Operations In-SRAM
Changes:
• Additional row decoder (Row Decoder-O) driving the wordlines
• Reconfigurable sense amplifiers: the differential sense amplifiers can operate as single-ended sense amplifiers (SA) that compare BL and BLB against Vref
[Figure: an array with bitline pairs BL0/BLB0 to BLn/BLBn, a row decoder on each side, and sense amplifiers below.]

Logical Operations In-SRAM: A AND B
[Figure: the wordlines of rows A and B are activated together; the single-ended sense amplifiers on the BL side output A AND B.]

Logical Operations In-SRAM: A NOR B and A AND B
[Figure: with the same two wordlines active, the sense amplifiers on the BLB side output A NOR B while those on the BL side output A AND B.]
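
The sensing behaviour on these slides can be written as a small truth-level model. This is a sketch under stated assumptions, not a circuit description: BL is modeled as staying high only when both activated cells hold 1, BLB as staying high only when both hold 0, and XOR is formed as the NOR of those two results, which is one way to obtain the A^B output labeled on the bitline ALU. The function names are hypothetical.

```python
def sense_bitlines(row_a, row_b):
    """Model activating two wordlines at once across a set of columns.

    Returns (bl, blb): bl[i] is A AND B, blb[i] is A NOR B for column i.
    """
    bl = [a & b for a, b in zip(row_a, row_b)]               # both cells store 1
    blb = [(1 - a) & (1 - b) for a, b in zip(row_a, row_b)]  # both cells store 0
    return bl, blb


def xor_from_sense(bl, blb):
    """A XOR B is true when the cells are neither both 1 nor both 0."""
    return [1 - (x | y) for x, y in zip(bl, blb)]


# All four input combinations, one per column (illustrative values).
row_a = [0, 0, 1, 1]
row_b = [0, 1, 0, 1]
and_bits, nor_bits = sense_bitlines(row_a, row_b)
xor_bits = xor_from_sense(and_bits, nor_bits)
print(and_bits, nor_bits, xor_bits)
```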

Addition In-SRAM
[Figure: an array with 256 bitlines. Rows A0 and A1 hold operand A, rows B0 and B1 hold operand B, and rows P0, P1 and P2 hold the result, initially 0. Below each bitline pair, the sense amplifiers (SA, referenced to Vref) give A&B and ~A & ~B, from which A^B and S = A^B^C are formed; the carry is held in a latch (D, Q, EN, C_EN, Cin, Cout). Carry and Sum start at 0.]

Addition [Cycle 1]
[Figure: rows A0 and B0 are read together; the sum is written to row P0 and the carry latch is updated.]

Addition [Cycle 2]
[Figure: rows A1 and B1 are read together with the latched carry; the sum is written to row P1 and the carry latch is updated.]

Addition [Cycle 3]
[Figure: the remaining carry is written to row P2.]
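
The three cycles above can be traced with a row-addressed model of the array. This is a minimal sketch: the row names follow the slide, but the operand values are illustrative rather than read from the figure, and the sum and carry are built only from the AND and NOR values that the sense amplifiers provide.

```python
def add_in_array(rows, n):
    """Add n-bit operands held in rows A0..A{n-1} and B0..B{n-1}.

    Writes result rows P0..Pn in n + 1 cycles; each list index is one bitline.
    """
    lanes = len(rows["A0"])
    carry = [0] * lanes                  # per-bitline carry latch, cleared first
    for i in range(n):                   # cycles 1..n: one bit position per cycle
        a, b = rows[f"A{i}"], rows[f"B{i}"]
        and_ = [x & y for x, y in zip(a, b)]               # sensed on BL
        nor_ = [(1 - x) & (1 - y) for x, y in zip(a, b)]   # sensed on BLB
        xor_ = [1 - (p | q) for p, q in zip(and_, nor_)]
        rows[f"P{i}"] = [x ^ c for x, c in zip(xor_, carry)]   # S = A ^ B ^ C
        carry = [p | (x & c) for p, x, c in zip(and_, xor_, carry)]
    rows[f"P{n}"] = carry[:]             # cycle n + 1: store the final carry
    return rows


# Two bitlines; the first holds A = 3, B = 2 and the second A = 1, B = 3.
rows = {"A0": [1, 1], "A1": [1, 0], "B0": [0, 1], "B1": [1, 1]}
add_in_array(rows, n=2)
print(rows["P0"], rows["P1"], rows["P2"])
```

Reading each bitline from P2 down to P0 gives 101 (5) and 100 (4), so a 2-bit add takes three cycles in this model, matching the three cycle slides.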

Multiplication In-SRAM
[Figure: rows A0 and A1 hold the multiplicand, rows B0 and B1 the multiplier, and rows P0 to P3 the product, initially 0. Each bitline has Carry, Sum and Tag latches, all initially 0.]

Multiplication [Cycle 1]
[Figure: the long-multiplication layout for A1 A0 x B1 B0: partial products A1B0 A0B0 and A1B1 A0B1, summed into P2 P1 P0. The Sum and Tag latches change in this cycle.]

Multiplication [Cycle 2]
[Figure: P0 <- A0 B0.]

Multiplication [Cycle 3]
[Figure: P0 <- A0 B0 is complete; P1 <- A1 B0.]

Multiplication [Cycle 4]
[Figure: P0 and P1 hold the first partial product; the Tag latch is updated for the next partial product.]
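
A functional model of this shift-and-add scheme is sketched below. It is an assumption-laden illustration rather than the slide's exact cycle sequence: the Tag latch is modeled as holding one multiplier bit per bitline and enabling write-back only where it is 1, and cycle counts are not modeled. The helper names and operand values are hypothetical.

```python
def to_rows(values, nbits):
    """rows[i][lane] is bit i of values[lane] (transposed layout)."""
    return [[(v >> i) & 1 for v in values] for i in range(nbits)]


def from_rows(rows):
    """Rebuild one integer per lane from transposed rows."""
    lanes = len(rows[0])
    return [sum(rows[i][l] << i for i in range(len(rows))) for l in range(lanes)]


def multiply_in_array(a_rows, b_rows):
    """Multiply n-bit operands lane by lane into 2n product rows."""
    n, lanes = len(a_rows), len(a_rows[0])
    p = [[0] * lanes for _ in range(2 * n)]
    for j in range(n):
        tag = b_rows[j][:]               # load multiplier bit j into the Tag latch
        carry = [0] * lanes
        for i in range(n):               # add A, shifted by j, into P where tag is 1
            for l in range(lanes):
                if tag[l]:
                    x, y = p[i + j][l], a_rows[i][l]
                    p[i + j][l] = x ^ y ^ carry[l]
                    carry[l] = (x & y) | (carry[l] & (x ^ y))
        for l in range(lanes):
            if tag[l]:
                p[j + n][l] = carry[l]   # row j + n is still 0 before this write
    return p


a = [3, 2, 1]                            # illustrative 2-bit operands, one per lane
b = [3, 1, 2]
products = from_rows(multiply_in_array(to_rows(a, 2), to_rows(b, 2)))
print(products)
```

For the first partial product (j = 0) the predicated add reduces to copying A0 and A1 into P0 and P1 where B0 is 1, which corresponds to the P0 <- A0 B0 and P1 <- A1 B0 annotations on the slides.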
