
IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY - PowerPoint PPT Presentation



  1. IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY AAKERIB@GSITECHNOLOGY.COM

  2. AGENDA • The AI computational challenge • Introduction to associative computing • Examples • An NLP use case • What’s next?

  3. THE CHALLENGE IN AI COMPUTING
     AI requirement, with a use-case example for each:
     • 32-bit FP: neural network learning
     • Multi-precision: neural network inference, data mining, etc.
     • Scaling: data center
     • Sort-search: Top-K, recommendation, speech, image/video classification
     • Heavy computation: non-linearity, Softmax, exponent, normalize
     • Bandwidth: required for speed and power

  4. CURRENT SOLUTION
     A question goes to a general-purpose CPU (tens of cores) or GPU (thousands of cores) over a very wide bus to DRAM, which returns the answer.
     • Bottleneck when register-file data needs to be replaced on a regular basis: limits performance and increases power consumption
     • Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax

  5. GPU VS CPU VS FPGA

  6. GPU VS CPU VS FPGA VS APU

  7. GSI’S SOLUTION: APU, THE ASSOCIATIVE PROCESSING UNIT
     A question goes to a simple CPU over a narrow bus into associative memory holding millions of simple processors, which returns the answer.
     • Computes in place, directly in the memory array, removing the I/O bottleneck
     • Significantly increases performance
     • Reduces power

  8. IN-MEMORY COMPUTING CONCEPT

  9. THE COMPUTING MODEL FOR THE PAST 80 YEARS
     An address decoder selects a row of the memory array, sense amps / IO drivers read the word out to the ALU, and the result is written back: one read-modify-write at a time.

  10. THE CHANGE: IN-MEMORY COMPUTING
     A simple controller issues only reads and writes; reading multiple rows onto a bit line produces a NOR of their contents, and writing the result back completes the operation.
     • Patented in-memory logic using only Read/Write operations
     • Any logic/arithmetic function can be generated internally
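As a sketch of why a read-generated NOR is enough, here is a minimal Python model (function names are illustrative, not GSI's API) composing standard gates, and then a full adder, from NOR alone:

```python
# Minimal sketch: any Boolean function can be composed from NOR alone,
# which is the only primitive the in-memory read operation provides.
def nor(a: int, b: int) -> int:
    return 1 - (a | b)

def not_(a):    return nor(a, a)
def or_(a, b):  return not_(nor(a, b))
def and_(a, b): return nor(not_(a), not_(b))
def xor(a, b):  return or_(and_(a, not_(b)), and_(not_(a), b))

# A full adder built only from NOR-derived gates: bit-serial arithmetic
# in the memory array reduces to sequences of such reads and writes.
def full_add(a, b, cin):
    s1 = xor(a, b)
    return xor(s1, cin), or_(and_(a, b), and_(s1, cin))  # (sum, carry)
```

In hardware each gate is a read of the operand rows followed by a write of the result row, applied to every bit line at once.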

  11. CAM / ASSOCIATIVE SEARCH
     • Duplicate the stored values with their inverse data
     • Duplicate the key with its inverse, and move the original key next to the inverse data
     • A 1 in the combined key goes to the read enable (RE) of the corresponding record rows
     • Example: searching for key 0110 reads out 1 = match only on the bit lines whose value equals the key

  12. TCAM SEARCH WITH STANDARD MEMORY CELLS
     A ternary pattern held in standard cells: besides 0s and 1s, don’t-care positions match either key bit.

  13. TCAM SEARCH WITH STANDARD MEMORY CELLS
     • Duplicate the data with its inverse, inverting only the positions that are not don’t-care; insert zero instead of each don’t-care in both copies
     • Duplicate the key with its inverse, and move the original key next to the inverse data
     • A 1 in the combined key goes to the read enable (RE); searching 0110 reads out 1 = match on every line whose pattern covers the key
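A small Python sketch of the trick described above (names are illustrative): each record stores its bits and their inverse, a don't-care position stores 0 in both copies so it can never veto a match, and the key is paired with its own inverse:

```python
# Sketch of TCAM matching over standard cells: a stored pattern keeps
# its bits and their inverse; don't-care positions store 0 in BOTH
# copies, so neither copy can collide with either key bit.
DC = None  # don't-care marker in a stored pattern

def encode(pattern):
    """Return (bits, inv_bits) with don't-cares as 0 in both copies."""
    bits     = [0 if b is DC else b     for b in pattern]
    inv_bits = [0 if b is DC else 1 - b for b in pattern]
    return bits, inv_bits

def tcam_match(pattern, key):
    bits, inv = encode(pattern)
    # A mismatch fires where key=1 meets a stored-inverse 1, or where
    # key=0 (i.e. inverse key = 1) meets a stored bit 1.
    return all(not (k & i) and not ((1 - k) & b)
               for k, b, i in zip(key, bits, inv))

records = [[0, DC, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
hits = [tcam_match(r, [0, 1, 1, 0]) for r in records]  # hardware checks all rows at once
```

The first two records match the key 0110 (the don't-care covers the second bit); the third does not.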

  14. COMPUTING IN THE BIT LINES
     C = f(A, B) for vectors A and B held in the array: each bit line becomes both a processor and storage, so millions of bit lines mean millions of processors.

  15. NEIGHBORHOOD COMPUTING
     C = f(A, SL(B, 1)): a parallel shift of all bit lines across sections takes 1 cycle, enabling neighborhood operations such as convolutions.
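In plain Python the two primitives above can be sketched like this (lists stand in for the bit lines; on the APU both lines of work are array-wide, single-cycle operations):

```python
# Sketch: every bit line holds one vector element; an operation is
# applied to all elements at once. A one-cycle shift aligns each
# element with its neighbor, giving convolution-style operations.
def elementwise(f, A, B):
    return [f(a, b) for a, b in zip(A, B)]  # all "bit lines" in parallel

def shift_left(B, n, fill=0):
    return B[n:] + [fill] * n               # SL(B, n) in the slide's notation

A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
C = elementwise(lambda a, b: a + b, A, shift_left(B, 1))  # C = f(A, SL(B,1))
```

Here `C` pairs each element of A with the next element of B, the building block of a sliding-window filter.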

  16. SEARCH & COUNT
     Example: searching for 20 in the array [5, -17, 20, 3, 20, 54, 8, 20] flags three matching bit lines, so Count = 3.
     • Search (binary or ternary) all bit lines in 1 cycle
     • 128M bit lines => 128 Peta searches/sec
     • Key applications of search and count for predictive analytics:
       • Recommender systems
       • K-nearest neighbors (using cosine similarity search)
       • Random forest
       • Image histogram
       • Regular expressions
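A minimal sketch of search-and-count, with Python lists modeling the bit lines (on the APU the compare is one cycle across the whole array and the count is a population count of the match flags):

```python
# Sketch: associative search compares the key against every record in
# one step; count then reduces the per-record match flags.
def search_count(records, key):
    matches = [int(r == key) for r in records]  # one parallel compare
    return matches, sum(matches)                # population count

records = [5, -17, 20, 3, 20, 54, 8, 20]
flags, count = search_count(records, 20)  # count == 3, as on the slide
```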

  17. DATABASE SEARCH AND UPDATE
     • Content-based search: a record can be placed anywhere
     • Update, modify, insert, and delete are immediate
     • Exact match: CAM/TCAM
     • Similarity match
     • In-place aggregate

  18. TRADITIONAL STORAGE CAN DO MUCH MORE
     The same standard memory cell can serve as: 1, 2, or 3 bits of storage; a 2-input or 3-input NOR; a 2-input NOR with 1 output; one TCAM cell; a 4-state CAM; and more.

  19. CPU/GPGPU VS APU

  20. ARCHITECTURE

  21. SECTION COMPUTING TO IMPROVE PERFORMANCE
     The array is split into MLB sections of 24 rows each (MLB section 0, MLB section 1, ...), joined by connecting muxes and driven by the memory control and an instruction buffer.

  22. COMMUNICATION BETWEEN SECTIONS
     • Shifts between sections enable neighborhood operations (filters, CNNs, etc.)
     • Store, compute, search, and move data anywhere

  23. APU CHIP LAYOUT
     2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 Peta OPS peak performance.

  24. EVALUATION BOARD PERFORMANCE
     • Precision: unlimited, from 1 bit to 160 bits or more
     • 6.4 TOPS (FP); 8 Peta OPS for one-bit computing or 16-bit exact search
     • Similarity search, Top-K, min, max, Softmax: O(1) complexity in µs for any size of K, compared to ms with current solutions
     • In-memory IO: 2 Petabit/sec, > 100X GPGPU/CPU/FPGA
     • Sparse matrix multiplication: > 100X GPGPU/CPU/FPGA

  25. APU SERVER
     • 64 APU chips, 256-512 GByte DDR
     • From 100 TFLOPS up to 128 Peta OPS, with peak performance of 128 TOPS/W
     • O(1) Top-K, min, max
     • 32 Petabit/sec internal IO
     • < 1K Watts
     • > 1000X GPGPUs on average
     • Linearly scalable
     • Currently a 28nm process, scalable to 7nm or less
     • Well suited to advanced memory technologies such as non-volatile ReRAM and more

  26. EXAMPLE APPLICATIONS

  27. K-NEAREST NEIGHBORS (K-NN)
     Simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, for X and Y), K = 4; group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.

  28. K-NN USE CASE IN AN APU
     The features and label of each of items 1..N are stored next to the computing area, one item per bit-line group. The flow for a query Q:
     1. Distribute the query data to all items: 2 ns
     2. Compute cosine distances for all N in parallel: ≤ 10 µs, assuming D = 50 features
     3. K mins at O(1) complexity: ≤ 3 µs
     4. In-place ranking and majority calculation
     With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K.
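The flow above can be sketched in plain Python (data, labels, and the query are made up for illustration; on the APU the distance step and the Top-K step run in parallel across all items rather than in loops):

```python
# Sketch of the slide's k-NN flow: broadcast the query, compute cosine
# similarity to every stored item, take the top K, vote on their labels.
from collections import Counter
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_classify(items, labels, query, k):
    sims = [cosine(x, query) for x in items]   # all N at once in hardware
    top = sorted(range(len(items)), key=lambda i: -sims[i])[:k]  # Top-K
    return Counter(labels[i] for i in top).most_common(1)[0][0]  # majority

items  = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9), (0.8, 0.2)]
labels = ["green", "green", "red", "red", "green"]
label = knn_classify(items, labels, (1, 0.05), k=4)  # "green" by 3-to-1 vote
```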

  29. LARGE DATABASE EXAMPLE USING APU SERVERS • Number of items: billions • Features per item: tens to hundreds • Latency: ≤ 1 msec • Throughput: Scales to 1M similarity searches/sec • k-NN: Top 100,000 nearest neighbors

  30. EXAMPLE: K-NN FOR RECOGNITION
     Image -> convolution-layer feature extractor (neural network) -> K-NN classifier (associative memory). Text -> BOW / word embedding -> the same K-NN classifier.

  31. K-MINS: O(1) ALGORITHM

     KMINS(int K, vector C) {
         M := 1;  V := 0;
         FOR b = msb to lsb:
             D   := not(C[b]);
             N   := M & D;
             cnt := COUNT(N | V);
             IF cnt > K:
                 M := N;
             ELIF cnt < K:
                 V := N | V;
             ELSE:            // cnt == K
                 V := N | V;
                 EXIT;
             ENDIF
         ENDFOR
     }
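A direct Python transcription of the K-MINS pseudocode above, with one flag per stored value standing in for one bit line:

```python
# M marks still-active candidates, V marks confirmed minima; both are
# bit-vectors with one flag per stored value (one per bit line).
def k_mins(K, values, nbits):
    n = len(values)
    M = [1] * n                                 # M := 1
    V = [0] * n                                 # V := 0
    for b in range(nbits - 1, -1, -1):          # FOR b = msb to lsb
        C = [(v >> b) & 1 for v in values]      # bit slice C[b]
        D = [1 - c for c in C]                  # D := not(C[b])
        N = [m & d for m, d in zip(M, D)]       # N := M & D
        cnt = sum(x | y for x, y in zip(N, V))  # cnt := COUNT(N | V)
        if cnt > K:
            M = N                               # too many: narrow candidates
        elif cnt < K:
            V = [x | y for x, y in zip(N, V)]   # accept all of N, continue
        else:                                   # cnt == K: done
            V = [x | y for x, y in zip(N, V)]
            break
    return V                                    # flags of the K smallest

flags = k_mins(3, [6, 2, 9, 4, 1, 7, 5], nbits=4)  # flags the values 2, 4, 1
```

With distinct values the loop exits with exactly K flags set; ties can end the loop with fewer, in which case V | M holds the tie-inclusive answer. The loop runs once per bit, not once per element, which is why the complexity is independent of N and K.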

  32. K-MINS: THE ALGORITHM
     [Worked trace, step C[0] (MSB): the columns C[0], V, N|V, N, M, and D are shown for each of the 16 six-bit values alongside the pseudocode; after this step, cnt = 11.]

  33. K-MINS: THE ALGORITHM
     [Worked trace, step C[1]: the same columns for the next bit; after this step, cnt = 8.]

  34. K-MINS: THE ALGORITHM
     [Final state: the V column flags the selected minima among the 16 values; this V is the final output, obtained at O(1) complexity.]
