The Architecture of the DIVA Processing-In-Memory Chip
Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, Gokhan Daglikoca
USC Information Sciences Institute
ICS’02, June 24, 2002
Outline
• Overview of Project Goals and System Architecture
• PIM Chip Architecture
• Applications and Simulation Results
• Prototype Chip Implementation
• Conclusion
Increasing Bandwidth
• The Problem
  [Figure: a host processor (P) connected to memory (M)]
• Solutions
  – Processor-memory pairs with wide datapaths
  – Multiple nodes per chip
  – Memory-to-memory interconnect
System Architecture
[Figure labels: Host Processor (PowerPC), Processor Memory Bus, PIM-to-PIM Interconnect, array of PIMs]
• PIMs as “smart-memory” co-processors
DIVA Key Ideas
• First smart-memory PIM device that is
  – Capable of executing independent threads of control
  – Designed to support in-memory virtual addressing
• Target Applications
  – Image processing and multimedia (streaming)
  – Irregular memory accesses (sparse-matrix and pointer-based)
• Evolutionary application development
  – PIMs also support standard memory accesses
  – System supports familiar parallel programming paradigms
PIM Chip Organization
[Figure labels: Node (Processing Logic, Memory, PBUF, Memory Port), PIM Routing Component, PIM Memory Bus, Parcel Interconnect, Host Interface, To Neighboring PIM, To Host System Memory Bus]
Host Memory Interface
• Even with an extra arbitration cycle, DIVA PIMs satisfy SDRAM timing by:
  – Using high-bandwidth embedded memory macros
  – Running arbitration logic at 4X clock speed
  – Exploiting long latency allowed by SDRAM standard
[Timing diagram signals: CLK, RAS, CAS, ADDR (ROW, COL), DATA (D0, D1, D2, D3)]
Node Architecture
[Figure labels: Host Memory Port; Parcel Buffer (“PBUF”) with CTL, DATA, and HEADER; Node Data Bus; Scalar Datapath and Scalar Registers (32b); WideWord Datapath and WideWord Registers (256b); Memory (256b); ICache; Instructions; 5-Stage DLX-like Instruction Pipeline; Memory Control & Arbiter; Host Memory Requests; ICache Mem Requests; Node Memory Requests]
WideWord Unit
• 256-bit datapath using a 32x256 register file
• WideWord operand treated as a packed array of 8-, 16-, or 32-bit objects
  – Object size specified by instruction
• Features
  – Transfers to/from scalar register file
  – Data rearrangement instructions
  – Selective execution
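
The packed-array view of a WideWord operand can be illustrated with a small software model. This is only a sketch, not the DIVA hardware or ISA: the function names `unpack`, `pack`, and `wide_add` are hypothetical, the most-significant-first field ordering is an assumption, and wraparound (rather than saturating) arithmetic is assumed because the slide does not say which the chip uses.

```python
WIDTH_BITS = 256  # WideWord width stated on the slide


def unpack(word, obj_bits):
    """Split a 256-bit integer into subfields, most-significant field first (assumed order)."""
    count = WIDTH_BITS // obj_bits
    field_mask = (1 << obj_bits) - 1
    return [(word >> (obj_bits * (count - 1 - i))) & field_mask for i in range(count)]


def pack(fields, obj_bits):
    """Inverse of unpack: concatenate subfields into one integer."""
    field_mask = (1 << obj_bits) - 1
    word = 0
    for f in fields:
        word = (word << obj_bits) | (f & field_mask)
    return word


def wide_add(a, b, obj_bits):
    """Subfield-wise add; carries never cross a subfield boundary (wraparound assumed)."""
    sums = [x + y for x, y in zip(unpack(a, obj_bits), unpack(b, obj_bits))]
    return pack(sums, obj_bits)


# The same two bit patterns give different results depending on the object size,
# which mirrors "object size specified by instruction".
a = pack([0xFF] * 32, 8)
b = pack([0x01] * 32, 8)
as_bytes = wide_add(a, b, 8)       # every 8-bit field wraps to 0x00
as_halfwords = wide_add(a, b, 16)  # every 16-bit field becomes 0x0100
```
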
WideWord Permutation Capability
• Permutation instructions rearrange subfields
  – Rearrangement pattern specified by a permutation vector
• Two “flavors” of permutation instructions:
  – General-purpose
    – Construction of permutation vector in a WideWord register
  – Hardwired
    – Permutation vector found in a lookup table of common patterns, e.g., shifts, rotates, shuffles, reductions, etc.
    – Scalar register value serves as index into the lookup table
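
The two flavors can be contrasted in a short sketch. Everything here is illustrative: `permute`, `permute_hardwired`, and the contents and ordering of `HARDWIRED_TABLE` are hypothetical, and the convention that result position `i` takes source subfield `perm[i]` is one plausible reading of "permutation vector", not something the slide specifies.

```python
def permute(src, perm):
    """General-purpose flavor: the caller supplies the permutation vector.

    Assumed convention: result[i] = src[perm[i]].
    """
    return [src[p] for p in perm]


def rotate_left(n, k):
    """Permutation vector that rotates n subfields left by k positions."""
    return [(i + k) % n for i in range(n)]


# Hypothetical lookup table for the hardwired flavor (8 subfields).
# The real table contents and index assignments are not given on the slide.
HARDWIRED_TABLE = [
    rotate_left(8, 1),        # index 0: rotate left by one subfield
    rotate_left(8, 4),        # index 1: swap upper and lower halves
    list(reversed(range(8))), # index 2: reverse subfield order
]


def permute_hardwired(src, index):
    """Hardwired flavor: a scalar value selects a prebuilt permutation vector."""
    return permute(src, HARDWIRED_TABLE[index])


fields = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
custom = permute(fields, [1, 0, 3, 2, 5, 4, 7, 6])  # pairwise swap, built by the caller
swapped = permute_hardwired(fields, 1)              # common pattern, looked up by index
```

The trade-off the slide implies is that the hardwired form avoids spending instructions and a WideWord register on building the vector, at the cost of being limited to the patterns in the table.
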
WideWord Selective Execution
• Only certain subfields of a result are committed during writeback
• Subfields participating are determined by a combination of:
  – Condition codes
    – EQ, GT, LT, OV
  – User-settable mask register
    – Useful for a priori subfield specification
  – Bits in instruction
    – Specify whether and what type of selective execution is to be used for that instruction
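
A minimal model of selective writeback follows. It is a sketch under several assumptions: the `mode` argument stands in for the instruction bits, only one enable source is used at a time (the slide says "a combination" without giving the rule), the OV code is omitted, and the names `compare` and `writeback` are hypothetical.

```python
def compare(a, b):
    """Per-subfield condition codes for a versus b (EQ, GT, LT only; OV not modeled)."""
    return [{"EQ": x == y, "GT": x > y, "LT": x < y} for x, y in zip(a, b)]


def writeback(dest, result, mode=None, mask=None, cc=None):
    """Commit result subfields into dest according to the selected mode.

    mode=None    : ordinary instruction, every subfield is committed
    mode="MASK"  : commit where the user-set mask register has a 1
    mode="EQ"/"GT"/"LT": commit where that condition code is set
    """
    if mode is None:
        return list(result)
    if mode == "MASK":
        enable = mask
    else:
        enable = [codes[mode] for codes in cc]
    return [r if e else d for d, r, e in zip(dest, result, enable)]


# Subfield-wise maximum without branching: compare b against a,
# then overwrite a only in the positions where b was greater.
a = [3, 9, 4, 1]
b = [5, 2, 4, 8]
maximum = writeback(a, b, mode="GT", cc=compare(b, a))

# A priori selection with the mask register: update only the outer subfields.
outer = writeback([0, 0, 0, 0], [1, 2, 3, 4], mode="MASK", mask=[1, 0, 0, 1])
```
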
Example Using Permutations and Selective Execution
Initial state:
  r1  = upper_enable
  r2  = lower_enable
  r3  = swap_upper_lower
  wr1 = a0 a1 a2 a3 a4 a5 a6 a7
  wr2 = b0 b1 b2 b3 b4 b5 b6 b7
  wr3 = c0 c1 c2 c3 c4 c5 c6 c7
Step 1:
  mtspr mask, r2
  wprmi_se wr3, wr1, r3    // permute using wr3
                           // selective execution
  wr3 = c0 c1 c2 c3 a0 a1 a2 a3
Step 2:
  mtspr mask, r1
  wprmi_se wr3, wr2, r3
  wr3 = b4 b5 b6 b7 a0 a1 a2 a3
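
The two-step example can be replayed in software to check the register contents shown on the slide. This is a behavioral sketch only: the Python function `wprmi_se` merely borrows the mnemonic, the list encodings of `upper_enable`, `lower_enable`, and `swap_upper_lower` are assumptions, and "upper" is taken to mean the four leftmost subfields as drawn.

```python
N = 8  # subfields per WideWord in the slide's drawing

# Assumed encodings of the scalar register values r1, r2, r3.
upper_enable = [1, 1, 1, 1, 0, 0, 0, 0]                 # r1: commit leftmost four subfields
lower_enable = [0, 0, 0, 0, 1, 1, 1, 1]                 # r2: commit rightmost four subfields
swap_upper_lower = [(i + N // 2) % N for i in range(N)]  # r3: exchange the two halves


def wprmi_se(dest, src, perm, mask):
    """Permute src, then commit only the subfields enabled by the mask register."""
    permuted = [src[p] for p in perm]
    return [p if m else d for d, p, m in zip(dest, permuted, mask)]


wr1 = [f"a{i}" for i in range(N)]
wr2 = [f"b{i}" for i in range(N)]
wr3 = [f"c{i}" for i in range(N)]

# Step 1: mask = r2 (lower_enable); wprmi_se wr3, wr1, r3
step1 = wprmi_se(wr3, wr1, swap_upper_lower, lower_enable)

# Step 2: mask = r1 (upper_enable); wprmi_se wr3, wr2, r3
step2 = wprmi_se(step1, wr2, swap_upper_lower, upper_enable)

print(" ".join(step1))  # c0 c1 c2 c3 a0 a1 a2 a3
print(" ".join(step2))  # b4 b5 b6 b7 a0 a1 a2 a3
```

Two instructions per source register merge the upper half of `wr2` with the lower half of `wr1`, with no intermediate register needed to hold the permuted values.
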
Applications
| Program | Description | Source | Data Set Size | WideWord Usage |
|---|---|---|---|---|
| Template Matching (TM) | image correlation | Sandia | 4-Kbyte image, 32 1-Kbyte templates | parallelism, selective, reuse in registers, page mode |
| Cornerturn (CT) | matrix transpose | Atlantic Aerospace | 32-Mbyte matrix | parallelism, permutation |
| CG | sparse conjugate gradient | NAS | 2M double-precision floating-point elements | parallelism, page mode |
| Transitive Closure (TC) | Floyd’s all-pairs shortest paths | Atlantic Aerospace | 256 Kbytes | parallelism, selective, reuse in registers |
| Neighborhood (NH) | image processing stencil | Atlantic Aerospace | 500,000 bytes | |
| Natural Join (NJ) | relational database join | Alphatech | 72 Kbytes | |
| Pointer (P) | random walk | Atlantic Aerospace | 4 Mbytes | |
| OO7 | object-oriented database query | University of Wisconsin | 888 Kbytes | |
Simulation Environment
• System simulator based on RSIM
  – Detailed memory subsystem simulation
  – PIM processor with WideWord, full ISA
  – Communication: 1-D ring based on PiRCs
• Conservative assumption
  – PIM speed 1/2 of host speed
1-PIM Speedups
[Figure: speedups over host-only execution for TM, CT, CG, TC, NJ, NH, P, and OO7; vertical axis 0 to 15]
Host Execution Time
[Figure: busy and memory stall times for host-only execution for NJ, NH, CT, CG, TC, P, OO7, and TM; vertical axis 0 to 120; legend: busy, memory]
Memory Hierarchy Times
[Figure: host-only (-H) and 1-PIM (-P) memory stall times for CT, CG, TC, NJ, NH, OO7, TM, and P; vertical axis 0 to 120; legend: L1, L2, memory]
WideWord Performance Gains
[Figure: speedup of 1 PIM with WideWord over 1 PIM scalar for TM, CT, CG, and TC; vertical axis 0 to 20]
1st DIVA Prototype PIM Chip
• Fabrication technology
  – TSMC 0.18 µm
• Size
  – 9.8 mm x 9.8 mm, 55 million transistors (2 million logic)
• Package
  – 35 mm, 352 BGA
  – 241 signal I/O, 111 Vdd or Gnd
• Lab results
  – Running Cornerturn application at 160 MHz while dissipating 0.8 W
  – Using WideWord permutations and selective execution
[Die photo labels: SRAM, Node Processing Logic, Pbuf, SDRAM Interface, PiRC]
Conclusions and Future Work
• DIVA accelerates multimedia (streaming) and irregular computations (sparse, pointer-based)
  – Average speedup of 3.3 using just 1 PIM
• First prototype PIM chip is demonstrating encouraging results
• Ongoing Work
  – Demonstration system incorporating PIMs
  – Future PIMs with WideWord floating-point capability and address translation