
USC Information Sciences Institute



  1. The Architecture of the DIVA Processing-In-Memory Chip
     Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, Gokhan Daglikoca
     USC Information Sciences Institute
     ICS’02, June 24, 2002

  2. Outline
     • Overview of Project Goals and System Architecture
     • PIM Chip Architecture
     • Applications and Simulation Results
     • Prototype Chip Implementation
     • Conclusion

  3. Increasing Bandwidth
     • The Problem
       [Diagram: host processor connected to memory]
     • Solutions
       – Processor-memory pairs with wide datapaths
       – Multiple nodes per chip
       – Memory-to-memory interconnect
       [Diagram: processor-memory (P-M) pairs]

  4. System Architecture
     [Diagram: host processor (PowerPC) attached through the processor memory bus to an array of PIMs, which are linked by a PIM-to-PIM interconnect]
     • PIMs as “smart-memory” co-processors

  5. DIVA Key Ideas
     • First smart-memory PIM device that is
       – Capable of executing independent threads of control
       – Designed to support in-memory virtual addressing
     • Target Applications
       – Image processing and multimedia (streaming)
       – Irregular memory accesses (sparse-matrix and pointer-based)
     • Evolutionary application development
       – PIMs also support standard memory accesses
       – System supports familiar parallel programming paradigms

  6. PIM Chip Organization
     [Block diagram. Labels: processing logic; memory; node; PBUF; memory port; PIM routing component; parcel interconnect; PIM memory bus; host interface; links to neighboring PIMs; link to host system memory bus]

  7. Host Memory Interface
     • Even with an extra arbitration cycle, DIVA PIMs satisfy SDRAM timing by:
       – Using high-bandwidth embedded memory macros
       – Running arbitration logic at 4X clock speed
       – Exploiting long latency allowed by SDRAM standard
     [Timing diagram: CLK, RAS, CAS, ADDR (ROW, COL), and DATA (D0 to D3)]

  8. Node Architecture
     [Block diagram. Labels: host memory port; parcel buffer (“PBUF”) with CTL, DATA, and HEADER; node data bus; scalar datapath (32b) and scalar registers; WideWord datapath (256b) and WideWord registers; 5-stage DLX-like pipeline; instruction cache (ICache); memory; memory control and arbiter; host, ICache, and node memory requests]

  9. WideWord Unit
     • 256-bit datapath using a 32x256 register file
     • WideWord operand treated as a packed array of 8-, 16-, or 32-bit objects
       – Object size specified by instruction
     • Features
       – Transfers to/from scalar register file
       – Data rearrangement instructions
       – Selective execution
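The packed-array view of a WideWord operand can be sketched in a few lines of Python. This is only an illustrative software model, not the DIVA hardware or ISA: the function names (`split_fields`, `join_fields`, `packed_add`), the most-significant-field-first ordering, and the wraparound behavior on overflow are all assumptions made for the sketch.

```python
WIDTH_BITS = 256  # WideWord width stated on the slide


def split_fields(word: int, field_bits: int) -> list[int]:
    """Split a 256-bit integer into fields, most significant field first (assumed order)."""
    count = WIDTH_BITS // field_bits
    mask = (1 << field_bits) - 1
    return [(word >> (field_bits * (count - 1 - i))) & mask for i in range(count)]


def join_fields(fields: list[int], field_bits: int) -> int:
    """Pack fields back into one integer, truncating each field to field_bits."""
    mask = (1 << field_bits) - 1
    word = 0
    for f in fields:
        word = (word << field_bits) | (f & mask)
    return word


def packed_add(a: int, b: int, field_bits: int) -> int:
    """Add two WideWords field by field; the field size is chosen per operation.

    Overflow wraps within each field. That is an assumption of this sketch;
    the slides do not say whether the hardware wraps or saturates.
    """
    if field_bits not in (8, 16, 32):
        raise ValueError("field size must be 8, 16, or 32 bits")
    fa = split_fields(a, field_bits)
    fb = split_fields(b, field_bits)
    return join_fields([x + y for x, y in zip(fa, fb)], field_bits)
```

The same two 256-bit operands give different results depending on the field size passed to `packed_add`, which is the point of having the instruction specify the object size: a carry out of one field never spills into its neighbor.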

  10. WideWord Permutation Capability
      • Permutation instructions rearrange subfields
        – Rearrangement pattern specified by a permutation vector
      • Two “flavors” of permutation instructions:
        – General-purpose
          – Construction of permutation vector in a WideWord register
        – Hardwired
          – Permutation vector found in a lookup table of common patterns, e.g., shifts, rotates, shuffles, reductions, etc.
          – Scalar register value serves as index into the lookup table
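The two permutation flavors can be modeled as follows. This is a hedged sketch rather than the chip's actual encoding: the convention `result[i] = src[perm[i]]`, the eight-subfield width, and the contents and indices of the `HARDWIRED_PATTERNS` table are hypothetical choices made for illustration.

```python
def permute(src: list, perm: list[int]) -> list:
    """General-purpose flavor: the caller supplies the permutation vector.

    Assumed convention: result[i] = src[perm[i]].
    """
    if len(perm) != len(src):
        raise ValueError("permutation vector must have one entry per subfield")
    return [src[i] for i in perm]


# Hypothetical lookup table of common patterns for 8 subfields.
# The real table contents and index values are not given in the slides.
HARDWIRED_PATTERNS = {
    0: [4, 5, 6, 7, 0, 1, 2, 3],  # swap upper and lower halves
    1: [1, 2, 3, 4, 5, 6, 7, 0],  # rotate left by one subfield
    2: [7, 0, 1, 2, 3, 4, 5, 6],  # rotate right by one subfield
}


def permute_hardwired(src: list, index: int) -> list:
    """Hardwired flavor: a scalar index selects a pattern from the table."""
    return permute(src, HARDWIRED_PATTERNS[index])
```

The trade-off the slide hints at is visible here: the general-purpose form costs the work of building `perm` in a register, while the hardwired form needs only a small scalar index but is limited to the patterns in the table.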

  11. WideWord Selective Execution
      • Only certain subfields of a result are committed during writeback
      • Subfields participating are determined by a combination of:
        – Condition codes
          – EQ, GT, LT, OV
        – User-settable mask register
          – Useful for a priori subfield specification
        – Bits in instruction
          – Specify whether and what type of selective execution is to be used for that instruction
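Selective writeback can be sketched as a per-subfield commit gated by enable bits. The model below is an assumption-laden illustration: the function names are hypothetical, the optional `mask` and `cond` arguments stand in for the instruction bits that choose the type of selective execution, and combining mask and condition with a logical AND is a guess, since the slides only say "a combination".

```python
def compare_gt(a: list[int], b: list[int]) -> list[bool]:
    """Per-subfield GT condition codes for a compared against b."""
    return [x > y for x, y in zip(a, b)]


def selective_writeback(dest, result, mask=None, cond=None):
    """Commit result[i] into dest[i] only where every supplied enable is true.

    mask models the user-settable mask register; cond models per-subfield
    condition codes. Passing None for either means "not used by this instruction".
    """
    out = list(dest)
    for i in range(len(dest)):
        enabled = (mask is None or mask[i]) and (cond is None or cond[i])
        if enabled:
            out[i] = result[i]
    return out


# Usage: a branch-free packed maximum. Each subfield of a is committed
# over b only where a > b.
a = [5, 1, 7, 3]
b = [2, 4, 6, 8]
packed_max = selective_writeback(b, a, cond=compare_gt(a, b))
```

Gating the commit per subfield lets data-dependent choices such as this maximum run without branches, which matters when one instruction covers many subfields at once.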

  12. Example Using Permutations and Selective Execution
      Initial state:
        r1 = upper_enable
        r2 = lower_enable
        r3 = swap_upper_lower
        wr1 = a0 a1 a2 a3 a4 a5 a6 a7
        wr2 = b0 b1 b2 b3 b4 b5 b6 b7
        wr3 = c0 c1 c2 c3 c4 c5 c6 c7
      Step 1:
        mtspr mask, r2
        wprmi_se wr3, wr1, r3    // permute using wr3
                                 // selective execution
        wr3 = c0 c1 c2 c3 a0 a1 a2 a3
      Step 2:
        mtspr mask, r1
        wprmi_se wr3, wr2, r3
        wr3 = b4 b5 b6 b7 a0 a1 a2 a3
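The two-instruction sequence on this slide can be replayed with a small Python model to check the register contents it shows. The Python function `wprmi_se` borrows the mnemonic but is only a stand-in: it assumes eight subfields, that "upper" means the four subfields drawn on the left, and that the permutation convention is `result[i] = src[perm[i]]`. None of that is taken from the DIVA ISA definition.

```python
# Assumed encodings of the three scalar values named on the slide.
SWAP_UPPER_LOWER = [4, 5, 6, 7, 0, 1, 2, 3]
UPPER_ENABLE = [True] * 4 + [False] * 4   # subfields 0-3, drawn on the left
LOWER_ENABLE = [False] * 4 + [True] * 4   # subfields 4-7, drawn on the right


def wprmi_se(dest, src, perm, mask):
    """Model of a permute with selective execution.

    Permute src, then commit only the mask-enabled subfields into dest.
    """
    permuted = [src[i] for i in perm]
    return [p if m else d for d, p, m in zip(dest, permuted, mask)]


wr1 = [f"a{i}" for i in range(8)]
wr2 = [f"b{i}" for i in range(8)]
wr3 = [f"c{i}" for i in range(8)]

# mtspr mask, r2 ; wprmi_se wr3, wr1, r3
wr3 = wprmi_se(wr3, wr1, SWAP_UPPER_LOWER, LOWER_ENABLE)
after_step_1 = list(wr3)

# mtspr mask, r1 ; wprmi_se wr3, wr2, r3
wr3 = wprmi_se(wr3, wr2, SWAP_UPPER_LOWER, UPPER_ENABLE)
after_step_2 = list(wr3)
```

Under these assumptions `after_step_1` is `c0 c1 c2 c3 a0 a1 a2 a3` and `after_step_2` is `b4 b5 b6 b7 a0 a1 a2 a3`, matching the slide: the two instructions merge the upper half of `wr2` and the lower half of `wr1` into `wr3`.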

  13. Applications

      | Program | Description | Source | Data Set Size | WideWord Usage |
      |---|---|---|---|---|
      | Template Matching (TM) | image correlation | Sandia | 4-Kbyte image, 32 1-Kbyte templates | parallelism, selective, reuse in registers, page mode |
      | Cornerturn (CT) | matrix transpose | Atlantic Aerospace | 32-Mbyte matrix | parallelism, permutation |
      | CG | sparse conjugate gradient | NAS | 2M double-precision floating-point elements | parallelism, page mode |
      | Transitive Closure (TC) | Floyd’s all-paths shortest paths | Atlantic Aerospace | 256 Kbytes | parallelism, selective, reuse in registers |
      | Natural Join (NJ) | relational database join | Atlantic Aerospace | 500,000 bytes | |
      | Neighborhood (NH) | image processing stencil | Alphatech | 72 Kbytes | |
      | Pointer (P) | random walk | Atlantic Aerospace | 4 Mbytes | |
      | OO7 | object-oriented database query | University of Wisconsin | 888 Kbytes | |

  14. Simulation Environment
      • System simulator based on RSIM
        – Detailed memory subsystem simulation
        – PIM processor with WideWord, full ISA
        – Communication: 1-D ring based on PiRCs
      • Conservative assumption
        – PIM speed 1/2 of host speed

  15. 1-PIM Speedups
      Speedups over host-only execution
      [Bar chart: speedup (axis 0 to 15) for TM, CT, CG, TC, NJ, NH, P, and OO7]

  16. Host Execution Time
      Busy and memory stall times for host-only execution
      [Stacked bar chart: busy and memory time (axis 0 to 120) for NJ, NH, CT, CG, TC, P, OO7, and TM]

  17. Memory Hierarchy Times
      Host-only and 1-PIM memory stall times
      [Stacked bar chart: L1, L2, and memory stall time (axis 0 to 120) for host-only (-H) and 1-PIM (-P) runs of CT, CG, TC, NJ, NH, OO7, TM, and P]

  18. WideWord Performance Gains
      Speedup of 1 PIM with WideWord over 1 PIM Scalar
      [Bar chart: speedup (axis 0 to 20) for TM, CT, CG, and TC]

  19. 1st DIVA Prototype PIM Chip
      • Fabrication technology
        – TSMC 0.18 µm
      • Size
        – 9.8mm x 9.8mm, 55 million transistors (2 million logic)
      • Package
        – 35mm, 352 BGA
        – 241 signal I/O, 111 Vdd or Gnd
      • Lab results
        – Running Cornerturn application at 160MHz while dissipating 0.8W
        – Using WideWord permutations and selective execution
      [Die photo. Labels: SRAM; Node Processing Logic; Pbuf; SDRAM Interface; PiRC]

  20. Conclusions and Future Work
      • DIVA accelerates multimedia (streaming) and irregular computations (sparse, pointer-based)
        – Average speedup of 3.3 using just 1 PIM
      • First prototype PIM chip is demonstrating encouraging results
      • Ongoing Work
        – Demonstration system incorporating PIMs
        – Future PIMs with WideWord floating-point capability and address translation
