
IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY - PowerPoint PPT Presentation



  1. IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY AAKERIB@GSITECHNOLOGY.COM

  2. AGENDA • The AI computational challenge • Introduction to associative computing • Examples • An NLP use case • What’s next?

  3. THE CHALLENGE IN AI COMPUTING
     AI requirement, with a use-case example for each:
     • 32-bit FP: neural network learning
     • Multi-precision: neural network inference, data mining, etc.
     • Scaling: data center
     • Sort-search: Top-K, recommendation, speech, image/video classification
     • Heavy computation: non-linearity, Softmax, exponent, normalize
     • Bandwidth: required for speed and power

  4. CURRENT SOLUTION
     A question goes to a general-purpose CPU (tens of cores) or GPU (thousands of cores) over a very wide bus to DRAM, which returns the answer.
     • Bottleneck when register-file data needs to be replaced on a regular basis: limits performance and increases power consumption
     • Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax

  5. GPU VS CPU VS FPGA

  6. GPU VS CPU VS FPGA VS APU

  7. GSI’S SOLUTION: APU, THE ASSOCIATIVE PROCESSING UNIT
     A question goes to a simple CPU over a narrow bus into associative memory holding millions of simple processors, which returns the answer.
     • Computes in place, directly in the memory array, removing the I/O bottleneck
     • Significantly increases performance
     • Reduces power

  8. IN-MEMORY COMPUTING CONCEPT

  9. THE COMPUTING MODEL FOR THE PAST 80 YEARS
     An address decoder selects a row of the memory array, sense amps / IO drivers read the word out to the ALU, and the result is written back: one read-modify-write at a time.

  10. THE CHANGE: IN-MEMORY COMPUTING
     A simple controller issues only reads and writes; reading multiple rows onto a bit line produces a NOR of their contents, and writing the result back completes the operation.
     • Patented in-memory logic using only Read/Write operations
     • Any logic/arithmetic function can be generated internally
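As a sketch of why a read-generated NOR is enough, here is a minimal Python model (function names are illustrative, not GSI's API) composing standard gates, and then a full adder, from NOR alone:

```python
# Minimal sketch: any Boolean function can be composed from NOR alone,
# which is the only primitive the in-memory read operation provides.
def nor(a: int, b: int) -> int:
    return 1 - (a | b)

def not_(a):    return nor(a, a)
def or_(a, b):  return not_(nor(a, b))
def and_(a, b): return nor(not_(a), not_(b))
def xor(a, b):  return or_(and_(a, not_(b)), and_(not_(a), b))

# A full adder built only from NOR-derived gates: bit-serial arithmetic
# in the memory array reduces to sequences of such reads and writes.
def full_add(a, b, cin):
    s1 = xor(a, b)
    return xor(s1, cin), or_(and_(a, b), and_(s1, cin))  # (sum, carry)
```

In hardware each gate is a read of the operand rows followed by a write of the result row, applied to every bit line at once.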

  11. CAM / ASSOCIATIVE SEARCH
     • Duplicate the stored values with their inverse data
     • Duplicate the key with its inverse, and move the original key next to the inverse data
     • A 1 in the combined key goes to the read enable (RE) of the corresponding record rows
     • Example: searching for key 0110 reads out 1 = match only on the bit lines whose value equals the key

  12. TCAM SEARCH WITH STANDARD MEMORY CELLS
     A ternary pattern held in standard cells: besides 0s and 1s, don’t-care positions match either key bit.

  13. TCAM SEARCH WITH STANDARD MEMORY CELLS
     • Duplicate the data with its inverse, inverting only the positions that are not don’t-care; insert zero instead of each don’t-care in both copies
     • Duplicate the key with its inverse, and move the original key next to the inverse data
     • A 1 in the combined key goes to the read enable (RE); searching 0110 reads out 1 = match on every line whose pattern covers the key
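A small Python sketch of the trick described above (names are illustrative): each record stores its bits and their inverse, a don't-care position stores 0 in both copies so it can never veto a match, and the key is paired with its own inverse:

```python
# Sketch of TCAM matching over standard cells: a stored pattern keeps
# its bits and their inverse; don't-care positions store 0 in BOTH
# copies, so neither copy can collide with either key bit.
DC = None  # don't-care marker in a stored pattern

def encode(pattern):
    """Return (bits, inv_bits) with don't-cares as 0 in both copies."""
    bits     = [0 if b is DC else b     for b in pattern]
    inv_bits = [0 if b is DC else 1 - b for b in pattern]
    return bits, inv_bits

def tcam_match(pattern, key):
    bits, inv = encode(pattern)
    # A mismatch fires where key=1 meets a stored-inverse 1, or where
    # key=0 (i.e. inverse key = 1) meets a stored bit 1.
    return all(not (k & i) and not ((1 - k) & b)
               for k, b, i in zip(key, bits, inv))

records = [[0, DC, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
hits = [tcam_match(r, [0, 1, 1, 0]) for r in records]  # hardware checks all rows at once
```

The first two records match the key 0110 (the don't-care covers the second bit); the third does not.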

  14. COMPUTING IN THE BIT LINES
     C = f(A, B) for vectors A and B held in the array: each bit line becomes both a processor and storage, so millions of bit lines mean millions of processors.

  15. NEIGHBORHOOD COMPUTING
     C = f(A, SL(B, 1)): a parallel shift of all bit lines across sections takes 1 cycle, enabling neighborhood operations such as convolutions.
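In plain Python the two primitives above can be sketched like this (lists stand in for the bit lines; on the APU both lines of work are array-wide, single-cycle operations):

```python
# Sketch: every bit line holds one vector element; an operation is
# applied to all elements at once. A one-cycle shift aligns each
# element with its neighbor, giving convolution-style operations.
def elementwise(f, A, B):
    return [f(a, b) for a, b in zip(A, B)]  # all "bit lines" in parallel

def shift_left(B, n, fill=0):
    return B[n:] + [fill] * n               # SL(B, n) in the slide's notation

A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
C = elementwise(lambda a, b: a + b, A, shift_left(B, 1))  # C = f(A, SL(B,1))
```

Here `C` pairs each element of A with the next element of B, the building block of a sliding-window filter.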

  16. SEARCH & COUNT
     Example: searching for 20 in the array [5, -17, 20, 3, 20, 54, 8, 20] flags three matching bit lines, so Count = 3.
     • Search (binary or ternary) all bit lines in 1 cycle
     • 128M bit lines => 128 Peta searches/sec
     • Key applications of search and count for predictive analytics:
       • Recommender systems
       • K-nearest neighbors (using cosine similarity search)
       • Random forest
       • Image histogram
       • Regular expressions
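A minimal sketch of search-and-count, with Python lists modeling the bit lines (on the APU the compare is one cycle across the whole array and the count is a population count of the match flags):

```python
# Sketch: associative search compares the key against every record in
# one step; count then reduces the per-record match flags.
def search_count(records, key):
    matches = [int(r == key) for r in records]  # one parallel compare
    return matches, sum(matches)                # population count

records = [5, -17, 20, 3, 20, 54, 8, 20]
flags, count = search_count(records, 20)  # count == 3, as on the slide
```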

  17. DATABASE SEARCH AND UPDATE
     • Content-based search: a record can be placed anywhere
     • Update, modify, insert, and delete are immediate
     • Exact match: CAM/TCAM
     • Similarity match
     • In-place aggregate

  18. TRADITIONAL STORAGE CAN DO MUCH MORE
     The same standard memory cell can serve as: 1, 2, or 3 bits of storage; a 2-input or 3-input NOR; a 2-input NOR with 1 output; one TCAM cell; a 4-state CAM; and more.

  19. CPU/GPGPU VS APU

  20. ARCHITECTURE

  21. SECTION COMPUTING TO IMPROVE PERFORMANCE
     The array is split into MLB sections of 24 rows each (MLB section 0, MLB section 1, ...), joined by connecting muxes and driven by the memory control and an instruction buffer.

  22. COMMUNICATION BETWEEN SECTIONS
     • Shifts between sections enable neighborhood operations (filters, CNNs, etc.)
     • Store, compute, search, and move data anywhere

  23. APU CHIP LAYOUT
     2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 Peta OPS peak performance.

  24. EVALUATION BOARD PERFORMANCE
     • Precision: unlimited, from 1 bit to 160 bits or more
     • 6.4 TOPS (FP); 8 Peta OPS for one-bit computing or 16-bit exact search
     • Similarity search, Top-K, min, max, Softmax: O(1) complexity in µs for any size of K, compared to ms with current solutions
     • In-memory IO: 2 Petabit/sec, > 100X GPGPU/CPU/FPGA
     • Sparse matrix multiplication: > 100X GPGPU/CPU/FPGA

  25. APU SERVER
     • 64 APU chips, 256-512 GByte DDR
     • From 100 TFLOPS up to 128 Peta OPS, with peak performance of 128 TOPS/W
     • O(1) Top-K, min, max
     • 32 Petabit/sec internal IO
     • < 1K Watts
     • > 1000X GPGPUs on average
     • Linearly scalable
     • Currently a 28nm process, scalable to 7nm or less
     • Well suited to advanced memory technologies such as non-volatile ReRAM and more

  26. EXAMPLE APPLICATIONS

  27. K-NEAREST NEIGHBORS (K-NN)
     Simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, for X and Y), K = 4; group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.

  28. K-NN USE CASE IN AN APU
     The features and label of each of items 1..N are stored next to the computing area, one item per bit-line group. The flow for a query Q:
     1. Distribute the query data to all items: 2 ns
     2. Compute cosine distances for all N in parallel: ≤ 10 µs, assuming D = 50 features
     3. K mins at O(1) complexity: ≤ 3 µs
     4. In-place ranking and majority calculation
     With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K.
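The flow above can be sketched in plain Python (data, labels, and the query are made up for illustration; on the APU the distance step and the Top-K step run in parallel across all items rather than in loops):

```python
# Sketch of the slide's k-NN flow: broadcast the query, compute cosine
# similarity to every stored item, take the top K, vote on their labels.
from collections import Counter
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_classify(items, labels, query, k):
    sims = [cosine(x, query) for x in items]   # all N at once in hardware
    top = sorted(range(len(items)), key=lambda i: -sims[i])[:k]  # Top-K
    return Counter(labels[i] for i in top).most_common(1)[0][0]  # majority

items  = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9), (0.8, 0.2)]
labels = ["green", "green", "red", "red", "green"]
label = knn_classify(items, labels, (1, 0.05), k=4)  # "green" by 3-to-1 vote
```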

  29. LARGE DATABASE EXAMPLE USING APU SERVERS • Number of items: billions • Features per item: tens to hundreds • Latency: ≤ 1 msec • Throughput: Scales to 1M similarity searches/sec • k-NN: Top 100,000 nearest neighbors

  30. EXAMPLE: K-NN FOR RECOGNITION
     Image -> convolution-layer feature extractor (neural network) -> K-NN classifier (associative memory). Text -> BOW / word embedding -> the same K-NN classifier.

  31. K-MINS: O(1) ALGORITHM

     KMINS(int K, vector C) {
         M := 1;  V := 0;
         FOR b = msb to lsb:
             D   := not(C[b]);
             N   := M & D;
             cnt := COUNT(N | V);
             IF cnt > K:
                 M := N;
             ELIF cnt < K:
                 V := N | V;
             ELSE:            // cnt == K
                 V := N | V;
                 EXIT;
             ENDIF
         ENDFOR
     }
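A direct Python transcription of the K-MINS pseudocode above, with one flag per stored value standing in for one bit line:

```python
# M marks still-active candidates, V marks confirmed minima; both are
# bit-vectors with one flag per stored value (one per bit line).
def k_mins(K, values, nbits):
    n = len(values)
    M = [1] * n                                 # M := 1
    V = [0] * n                                 # V := 0
    for b in range(nbits - 1, -1, -1):          # FOR b = msb to lsb
        C = [(v >> b) & 1 for v in values]      # bit slice C[b]
        D = [1 - c for c in C]                  # D := not(C[b])
        N = [m & d for m, d in zip(M, D)]       # N := M & D
        cnt = sum(x | y for x, y in zip(N, V))  # cnt := COUNT(N | V)
        if cnt > K:
            M = N                               # too many: narrow candidates
        elif cnt < K:
            V = [x | y for x, y in zip(N, V)]   # accept all of N, continue
        else:                                   # cnt == K: done
            V = [x | y for x, y in zip(N, V)]
            break
    return V                                    # flags of the K smallest

flags = k_mins(3, [6, 2, 9, 4, 1, 7, 5], nbits=4)  # flags the values 2, 4, 1
```

With distinct values the loop exits with exactly K flags set; ties can end the loop with fewer, in which case V | M holds the tie-inclusive answer. The loop runs once per bit, not once per element, which is why the complexity is independent of N and K.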

  32. K-MINS: THE ALGORITHM
     [Worked trace, step C[0] (MSB): the columns C[0], V, N|V, N, M, and D are shown for each of the 16 six-bit values alongside the pseudocode; after this step, cnt = 11.]

  33. K-MINS: THE ALGORITHM
     [Worked trace, step C[1]: the same columns for the next bit; after this step, cnt = 8.]

  34. K-MINS: THE ALGORITHM
     [Final state: the V column flags the selected minima among the 16 values; this V is the final output, obtained at O(1) complexity.]
