USC Information Sciences Institute

The Architecture of the DIVA Processing-In-Memory Chip
Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, Gokhan Daglikoca
USC Information Sciences Institute
ICS'02, June 24, 2002

Outline
• Overview of Project Goals and System Architecture
• PIM Chip Architecture
• Applications and Simulation Results
• Prototype Chip Implementation
• Conclusion

Increasing Bandwidth
The Problem
[Figure: a host processor connected to memory]
Solutions
• Processor-memory pairs with wide datapaths
• Multiple nodes per chip
• Memory-to-memory interconnect
[Figure: multiple processor (P) and memory (M) pairs]

System Architecture
[Figure: a host processor (PowerPC) attached through the processor memory bus to an array of PIMs, which are linked by a PIM-to-PIM interconnect]
• PIMs as "smart-memory" co-processors

DIVA Key Ideas
• First smart-memory PIM device that is
  – Capable of executing independent threads of control
  – Designed to support in-memory virtual addressing
• Target Applications
  – Image processing and multimedia (streaming)
  – Irregular memory accesses (sparse-matrix and pointer-based)
• Evolutionary application development
  – PIMs also support standard memory accesses
  – System supports familiar parallel programming paradigms

PIM Chip Organization
[Figure: block diagram of one PIM chip. Labeled parts: nodes (each with processing logic and memory), PBUFs, memory ports, PIM routing component, parcel interconnect (to neighboring PIMs), PIM memory bus, and host interface (to the host system memory bus)]

Host Memory Interface
• Even with an extra arbitration cycle, DIVA PIMs satisfy SDRAM timing by:
  – Using high-bandwidth embedded memory macros
  – Running arbitration logic at 4X clock speed
  – Exploiting long latency allowed by SDRAM standard
[Figure: SDRAM timing diagram with signals CLK, RAS, CAS, ADDR (ROW, COL), and DATA (D0, D1, D2, D3)]

Node Architecture
[Figure: block diagram of one node. Labeled parts: host memory port; parcel buffer ("PBUF") with CTL, DATA, and HEADER fields; node data bus; scalar datapath and registers (32b); WideWord datapath and registers (256b); 5-stage DLX-like instruction pipeline; ICache; memory (256b); memory control and arbiter, which receives host memory requests, ICache memory requests, and node memory requests]

WideWord Unit
• 256-bit datapath using a 32x256 register file
• WideWord operand treated as a packed array of 8-, 16-, or 32-bit objects
  – Object size specified by instruction
• Features
  – Transfers to/from scalar register file
  – Data rearrangement instructions
  – Selective execution
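The packed-array view of a WideWord operand can be illustrated with a short Python sketch. This is not the DIVA ISA: the helper names `unpack`, `pack`, and `wide_add` are hypothetical, and both the most-significant-first subfield order and the wraparound (non-saturating) addition are assumptions made only for illustration.

```python
WIDE_BITS = 256  # width of one WideWord register, per the slide


def unpack(word, width):
    """Split a 256-bit integer into 256 // width unsigned subfields.

    Assumption: subfield 0 is the most significant one.
    """
    if width not in (8, 16, 32):
        raise ValueError("object size must be 8, 16, or 32 bits")
    count = WIDE_BITS // width
    field_mask = (1 << width) - 1
    return [(word >> (width * (count - 1 - i))) & field_mask for i in range(count)]


def pack(fields, width):
    """Inverse of unpack: concatenate subfields, truncating each to width bits."""
    field_mask = (1 << width) - 1
    word = 0
    for value in fields:
        word = (word << width) | (value & field_mask)
    return word


def wide_add(a, b, width):
    """Add two WideWords subfield by subfield (assumed to wrap on overflow)."""
    sums = [x + y for x, y in zip(unpack(a, width), unpack(b, width))]
    return pack(sums, width)


# The same 256 bits give different results depending on the object size.
a = pack([0xFF] * 32, 8)
b = pack([0x01] * 32, 8)
as_bytes = wide_add(a, b, 8)       # every 8-bit subfield wraps to 0x00
as_halfwords = wide_add(a, b, 16)  # 0xFFFF + 0x0101 wraps to 0x0100
```

The point of the sketch is that the object size is a property of the operation, not of the register contents, which matches "object size specified by instruction" above.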

WideWord Permutation Capability
• Permutation instructions rearrange subfields
  – Rearrangement pattern specified by a permutation vector
• Two "flavors" of permutation instructions:
  – General-purpose
    – Construction of permutation vector in a WideWord register
  – Hardwired
    – Permutation vector found in a lookup table of common patterns, e.g., shifts, rotates, shuffles, reductions, etc.
    – Scalar register value serves as index into the lookup table
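The two flavors differ only in where the permutation vector comes from, which a small Python sketch can make concrete. The function names, the table contents, and the convention `result[i] = src[vector[i]]` are assumptions for illustration; the slides do not specify the vector encoding or the actual table entries.

```python
N = 8  # eight 32-bit subfields in a 256-bit WideWord


def permute(src, vector):
    """General-purpose flavor: the caller supplies the permutation vector.

    Assumed convention: result[i] = src[vector[i]].
    """
    return [src[index] for index in vector]


# Hypothetical lookup table of common patterns for the hardwired flavor.
PERM_TABLE = [
    [(i + 1) % N for i in range(N)],       # 0: rotate left by one subfield
    [(i + N // 2) % N for i in range(N)],  # 1: swap upper and lower halves
    [N - 1 - i for i in range(N)],         # 2: reverse
]


def permute_hardwired(src, table_index):
    """Hardwired flavor: a scalar value selects a vector from the table."""
    return permute(src, PERM_TABLE[table_index])


word = ["a0", "a1", "a2", "a3", "a4", "a5", "a6", "a7"]
rotated = permute_hardwired(word, 0)
swapped = permute_hardwired(word, 1)
custom = permute(word, [0, 0, 2, 2, 4, 4, 6, 6])  # duplicate even subfields
```

The general-purpose form costs a WideWord register to hold the vector, while the hardwired form needs only a scalar index, which is the trade-off the slide describes.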

WideWord Selective Execution
• Only certain subfields of a result are committed during writeback
• Subfields participating are determined by a combination of:
  – Condition codes
    – EQ, GT, LT, OV
  – User-settable mask register
    – Useful for a priori subfield specification
  – Bits in instruction
    – Specify whether and what type of selective execution is to be used for that instruction
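A minimal Python sketch of selective writeback follows. The slides do not say exactly how condition codes, the mask register, and the instruction bits are combined, so the `mode` argument here (standing in for the instruction bits) and the function names are assumptions, not the DIVA encoding.

```python
def compare_gt(a, b):
    """Per-subfield GT condition codes for two packed operands."""
    return [x > y for x, y in zip(a, b)]


def selective_writeback(dest, result, mode, cond=None, mask=None):
    """Commit only the enabled subfields of result into dest.

    mode is a hypothetical stand-in for the instruction bits:
      "none" - commit every subfield
      "cond" - commit where the condition code is set
      "mask" - commit where the user-settable mask is set
    """
    if mode == "none":
        enable = [True] * len(dest)
    elif mode == "cond":
        enable = cond
    elif mode == "mask":
        enable = mask
    else:
        raise ValueError("unknown selective-execution mode: " + mode)
    return [new if on else old for old, new, on in zip(dest, result, enable)]


# Element-wise maximum without branches: start from b, then overwrite
# only the subfields where a > b.
a = [3, 9, 1, 7, 5, 5, 8, 0]
b = [4, 2, 6, 7, 1, 9, 8, 3]
maxed = selective_writeback(b, a, "cond", cond=compare_gt(a, b))

# A priori selection with a mask: update only the first two subfields.
patched = selective_writeback(b, a, "mask", mask=[True, True] + [False] * 6)
```

The element-wise maximum shows why this matters for data-parallel code: a per-element conditional becomes one compare and one selectively committed move.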

Example Using Permutations and Selective Execution
Initial state:
    wr1 = a0 a1 a2 a3 a4 a5 a6 a7        r1 = upper_enable
    wr2 = b0 b1 b2 b3 b4 b5 b6 b7        r2 = lower_enable
    wr3 = c0 c1 c2 c3 c4 c5 c6 c7        r3 = swap_upper_lower
Step 1:
    mtspr    mask, r2
    wprmi_se wr3, wr1, r3    // permute wr1 into wr3 with selective execution
    wr3 = c0 c1 c2 c3 a0 a1 a2 a3
Step 2:
    mtspr    mask, r1
    wprmi_se wr3, wr2, r3
    wr3 = b4 b5 b6 b7 a0 a1 a2 a3
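The two steps of this example can be replayed in Python to check the register contents shown on the slide. This is a sketch under assumptions: the function `wprmi_se` only mimics the visible behavior of the instruction, "upper" is taken to mean the four leftmost subfields as drawn, and the permutation vector follows `result[i] = src[vector[i]]`.

```python
def wprmi_se(dest, src, vector, mask):
    """Permute src by vector, then commit only the subfields enabled in mask."""
    permuted = [src[index] for index in vector]
    return [new if on else old for old, new, on in zip(dest, permuted, mask)]


# Scalar register contents, modeled as Python lists (hypothetical encodings).
UPPER_ENABLE = [True] * 4 + [False] * 4    # r1
LOWER_ENABLE = [False] * 4 + [True] * 4    # r2
SWAP_UPPER_LOWER = [4, 5, 6, 7, 0, 1, 2, 3]  # r3

wr1 = ["a%d" % i for i in range(8)]
wr2 = ["b%d" % i for i in range(8)]
wr3 = ["c%d" % i for i in range(8)]

# Step 1: mtspr mask, r2 ; wprmi_se wr3, wr1, r3
step1 = wprmi_se(wr3, wr1, SWAP_UPPER_LOWER, LOWER_ENABLE)

# Step 2: mtspr mask, r1 ; wprmi_se wr3, wr2, r3
step2 = wprmi_se(step1, wr2, SWAP_UPPER_LOWER, UPPER_ENABLE)

print(" ".join(step1))  # c0 c1 c2 c3 a0 a1 a2 a3
print(" ".join(step2))  # b4 b5 b6 b7 a0 a1 a2 a3
```

Two instructions thus merge the lower half of wr1 and the upper half of wr2 into one register, with each half also moved to the opposite side.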

Applications

Program | Description | Source | Data Set Size | WideWord Usage
Template Matching (TM) | image correlation | Sandia | 4-Kbyte image, 32 1-Kbyte templates | parallelism, selective, reuse in registers, page mode
Cornerturn (CT) | matrix transpose | Atlantic Aerospace | 32-Mbyte matrix | parallelism, permutation
CG | sparse conjugate gradient | NAS | 2M double-precision floating-point elements | parallelism, page mode
Transitive Closure (TC) | Floyd's all-pairs shortest paths | Atlantic Aerospace | 256 Kbytes | parallelism, selective, reuse in registers
Neighborhood (NH) | image processing stencil | Atlantic Aerospace | 500,000 bytes | -
Natural Join (NJ) | relational database join | Alphatech | 72 Kbytes | -
Pointer (P) | random walk | Atlantic Aerospace | 4 Mbytes | -
OO7 | object-oriented database query | University of Wisconsin | 888 Kbytes | -

Simulation Environment
• System simulator based on RSIM
  – Detailed memory subsystem simulation
  – PIM processor with WideWord, full ISA
  – Communication: 1-D ring based on PiRCs
• Conservative assumption:
  – PIM speed 1/2 of host speed

1-PIM Speedups
Speedups over host-only execution
[Figure: bar chart, y-axis from 0 to 15, one bar each for TM, CT, CG, TC, NJ, NH, P, OO7]

Host Execution Time
Busy and memory stall times for host-only execution
[Figure: stacked bar chart, y-axis from 0 to 120, bars for NJ, NH, CT, CG, TC, P, OO7, TM; series: busy, memory]

Memory Hierarchy Times
Host-only and 1-PIM memory stall times
[Figure: stacked bar chart, y-axis from 0 to 120, host-only (-H) and 1-PIM (-P) bars for CT, CG, TC, NJ, NH, OO7, TM, P; series: L1, L2, memory]

WideWord Performance Gains
Speedup of 1 PIM with WideWord over 1 PIM scalar
[Figure: bar chart, y-axis from 0 to 20, one bar each for TM, CT, CG, TC]

1st DIVA Prototype PIM Chip
• Fabrication technology
  – TSMC 0.18 µm
• Size
  – 9.8mm x 9.8mm, 55 million transistors (2 million logic)
• Package
  – 35mm, 352 BGA
  – 241 signal I/O, 111 Vdd or Gnd
• Lab results
  – Running Cornerturn application at 160MHz while dissipating 0.8W
  – Using WideWord permutations and selective execution
[Figure: chip photo with labeled regions: SRAM, SRAM, node processing logic, Pbuf, SDRAM interface, PiRC]

Conclusions and Future Work
• DIVA accelerates multimedia (streaming) and irregular computations (sparse, pointer-based)
  – Average speedup of 3.3 using just 1 PIM
• First prototype PIM chip is demonstrating encouraging results
• Ongoing Work
  – Demonstration system incorporating PIMs
  – Future PIMs with WideWord floating-point capability and address translation
