Comparing Memory Systems for Chip Multiprocessors
Jacob Leverich, Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis
Computer Systems Laboratory, Stanford University
ISCA 2007

Cores are the New GHz
[Figure: chip multiprocessor with processor cores (P) and memories (M)]
• 90s: ↑ GHz & ↑ ILP
  - Problems: power, complexity, ILP limits
• 00s: ↑ cores
  - Multicore, manycore, …

What is the New Memory System?
[Figure: chip multiprocessor with processor cores (P) and memories (M)]
• Cache-based Memory: Tags, Data Array, Cache Controller
• Streaming Memory: Local Storage, DMA Engine

The Role of Local Memory
[Figure: Cache-based Memory (Tags, Data Array, Cache Controller) next to Streaming Memory (Local Storage, DMA Engine)]
• Exploit spatial & temporal locality
• Reduce average memory access time
  - Enable data re-use
  - Amortize latency over several accesses
• Minimize off-chip bandwidth
  - Keep useful data local
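As a rough illustration of "reduce average memory access time", the standard textbook relation is shown below. The symbols (hit time, miss rate m, miss penalty) and the numbers in the worked line are illustrative assumptions, not values from the slides.

```latex
\[
\text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{penalty}}
\]
\[
t_{\text{hit}} = 1,\quad m = 0.02,\quad t_{\text{penalty}} = 100
\;\Rightarrow\;
\text{AMAT} = 1 + 0.02 \cdot 100 = 3 \text{ cycles}
\]
```

Re-using data lowers m, and fetching several useful words per off-chip access spreads one penalty over several accesses.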

Who Manages Local Memory?
                     Cache-based        Streaming
Locality
  Data Fetch         Reactive           Proactive
  Placement          Limited mapping    Arbitrary
  Replacement        Fixed-policy       Arbitrary
  Granularity        Cache block        Arbitrary
Communication
  Coherence          Hardware           Software
Cache-based: Hardware-managed
Streaming: Software-managed

Potential Advantages of Streaming Memory
• Better latency hiding
  - Overlap DMA transfers with computation
  - Double buffering is macroscopic prefetching
• Lower off-chip bandwidth requirements
  - Avoid conflict misses
  - Avoid superfluous refills for output data
  - Avoid write-back of dead data
  - Avoid fetching whole lines for sparse accesses
• Better energy and area efficiency
  - No tag & associativity overhead
  - Fewer off-chip accesses
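The double-buffering idea above can be sketched in C roughly as follows. This is a minimal sketch, not the authors' DMA library: `dma_get`, `dma_wait`, `stream_sum` and `CHUNK` are hypothetical names, and the DMA engine is modeled with `memcpy` so the code runs on any host (on real hardware the transfer would proceed asynchronously while the core computes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* elements per transfer; illustrative size */

/* Hypothetical DMA interface, modeled synchronously with memcpy.
 * A real dma_get would return at once and dma_wait would block
 * until the transfer identified by `tag` has completed. */
static void dma_get(float *local, const float *remote, size_t n, int tag) {
    (void)tag;
    memcpy(local, remote, n * sizeof *local);
}
static void dma_wait(int tag) { (void)tag; }

/* Sums n floats held in off-chip memory; n must be a multiple of CHUNK. */
float stream_sum(const float *remote, size_t n) {
    float buf[2][CHUNK];                 /* two local-store buffers */
    size_t chunks = n / CHUNK;
    float sum = 0.0f;

    assert(n % CHUNK == 0);
    if (chunks == 0)
        return sum;

    dma_get(buf[0], remote, CHUNK, 0);   /* prime the pipeline */
    for (size_t c = 0; c < chunks; c++) {
        int cur = (int)(c & 1);
        if (c + 1 < chunks)              /* start fetching the next chunk */
            dma_get(buf[cur ^ 1], remote + (c + 1) * CHUNK, CHUNK, cur ^ 1);
        dma_wait(cur);                   /* current chunk must have arrived */
        for (size_t i = 0; i < CHUNK; i++)
            sum += buf[cur][i];          /* compute while the next chunk is in flight */
    }
    return sum;
}
```

While one buffer is being consumed, the other is being filled, which is why the slides describe double buffering as prefetching at a macroscopic granularity.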

How Much Advantage over Caching?
• How do they differ in Performance?
• How do they differ in Scaling?
• How do they differ in Energy Efficiency?
• How do they differ in Programmability?

Our Contribution: A Head to Head Comparison
Cache-based Memory vs. Streaming Memory
• Unified set of constraints
  - Same processor core
  - Same capacity of local storage per core
  - Same on-chip interconnect
  - Same off-chip memory channel
• Justification
  - VLSI constraints (e.g., local storage capacity)
  - No fundamental differences (e.g., core type)

Our Conclusions
• Caching performs & scales as well as Streaming
  - Well-known cache enhancements eliminate differences
• Stream Programming benefits Caching Memory
  - Enhances locality patterns
  - Improves bandwidth and efficiency of caches
• Stream Programming easier with Caches
  - Makes memory system amenable to irregular & unpredictable workloads
• Streaming Memory likely to be replaced or at least augmented by Caching Memory

Simulation Parameters
• 1 – 16 cores: Tensilica LX, 3-way VLIW, 2 FPUs
• Clock frequency: 800 MHz – 3.2 GHz
• On-chip data memory
  - Cache-based: 32kB cache, 32B block, 2-way, MESI
  - Streaming: 24kB scratch pad, DMA engine, 8kB cache, 32B block, 2-way
  - Both: 512kB L2 cache, 32B block, 16-way
• System
  - Hierarchical on-chip interconnect
  - Simple main memory model (3.2 GB/s – 12.8 GB/s)

Benchmark Applications
• No “SPEC Streaming”
  - Few available apps with streaming & caching versions
• Selected 10 “streaming” applications
  - Some used to motivate or evaluate Streaming Memory
• Co-developed apps for both systems
  - Caching: C, threads
  - Streaming: C, threads, DMA library
• Optimized both versions as best we could

Benchmark Applications
• Video processing
  - Stereo Depth Extraction
  - H.264 Encoding
  - MPEG-2 Encoding
• Image processing
  - JPEG Encode/Decode
  - KD-tree Raytracer
  - 179.art
• Scientific and data-intensive
  - 2D Finite Element Method
  - 1D Finite Impulse Response
  - Merge Sort
  - Bitonic Sort
[Slide callouts: Irregular, Unpredictable]

Parallelism Independent of Memory System
[Figure: two bar charts, FEM @ 3.2 GHz and MPEG-2 Encoder @ 3.2 GHz. Y-axis: Normalized Time; x-axis: 1, 2, 4, 8, 16 CPUs; series: Cache, Streaming; annotations: 12.4x, 13.8x]
• 6/10 apps little affected by local memory choice

Local Memory Not Critical For Compute-Intensive Applications
[Figure: 16 cores @ 3.2 GHz. Normalized Time broken down into Useful, Data, Sync; Cache vs. Stream for MPEG-2 and FEM]
• Intuition
  - Apps limited by compute
  - Good data reuse, even with large datasets
  - Low misses/instruction
• Note
  - “Sync” includes Barriers and DMA wait

Double-Buffering Hides Latency For Streaming Memory Systems
[Figure: 16 cores @ 3.2 GHz, 12.8 GB/s. Normalized Time broken down into Useful, Data, Sync; Cache vs. Stream for FIR]
• Intuition
  - Non-local accesses entirely overlapped with computation
  - DMAs perform efficient SW prefetching
• Note
  - The case for memory-intensive apps not bound by memory BW
  - 179.art, Merge Sort

Prefetching Hides Latency For Cache-Based Memory Systems
[Figure: 16 cores @ 3.2 GHz, 12.8 GB/s. Normalized Time broken down into Useful, Data, Sync; Cache, Prefetch, Stream for FIR]
• Intuition
  - HW stream prefetcher overlaps misses with computation as well
  - Predictable & regular access patterns

Streaming Memory Often Incurs Less Off-Chip Traffic
[Figure: Normalized Off-chip Traffic split into Read and Write; Cache vs. Streaming for FIR, Merge Sort, MPEG-2]
• The case for apps with large output streams
  - Avoids superfluous refills for output streams
• Not the case for write-allocate, fetch-on-miss caches

SW-Guided Cache Policies Improve Bandwidth Efficiency
[Figure: Normalized Off-chip Traffic split into Read and Write; Cache, PFS, Stream for FIR, Merge Sort, MPEG-2]
• Our system: “Prepare For Store” cache hint
  - Allocates cache line but avoids refill of old data
• Xbox360: write-buffer for non-allocating writes
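The effect of a superfluous refill can be made concrete with a back-of-the-envelope traffic model. This is only a sketch under simplifying assumptions (a kernel that reads its input once and writes its output once, no reuse, sizes that are multiples of the 32B block); `offchip_traffic` and `struct traffic` are hypothetical names and the model is not the simulator used in the study.

```c
#include <assert.h>
#include <stddef.h>

#define LINE 32  /* cache block size in bytes, as in the simulated system */

struct traffic {
    size_t read_bytes;
    size_t write_bytes;
};

/* Off-chip traffic for a kernel that reads in_bytes once and writes
 * out_bytes once.
 * refill_on_store = 1: write-allocate, fetch-on-miss cache, where each
 *                      output line is fetched before being overwritten.
 * refill_on_store = 0: a no-refill policy, such as a Prepare For Store
 *                      hint or a DMA put from a local store. */
struct traffic offchip_traffic(size_t in_bytes, size_t out_bytes,
                               int refill_on_store) {
    struct traffic t;
    assert(in_bytes % LINE == 0 && out_bytes % LINE == 0);
    t.read_bytes  = in_bytes + (refill_on_store ? out_bytes : 0);
    t.write_bytes = out_bytes;
    return t;
}
```

For equal input and output sizes this model gives 3 units of traffic with refills against 2 without, i.e. the refill accounts for a third of the total; real applications will differ depending on reuse and access pattern.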

Energy Efficiency Does Not Depend on Local Memory
[Figure: 16 cores @ 800 MHz. Normalized Energy broken down into Core, I-cache, D-cache, L-store, L2-cache, DRAM; Cache vs. Stream for MPEG-2, FEM, FIR]
• Intuition
  - Energy dominated by DRAM accesses and processor core
  - Local store ~2x energy-efficiency of cache, but small portion of total energy
• Note
  - The case for compute-intensive applications

Optimized Bandwidth Yields Optimized Energy Efficiency
[Figure: two charts for FIR comparing Cache, PFS, Stream. Left: Normalized Off-chip Traffic split into Read and Write. Right: Normalized Energy broken down into Core, I-cache, D-cache, L-store, L2-cache, DRAM]
• Superfluous off-chip accesses are expensive!
• Streaming & SW-guided caching reduce them

Stream Programming for Caches: MPEG-2 Example
[Figure: P → Predicted Video Frame → T]
• MPEG-2 example
  - P() generates a video frame later consumed by T()
  - Whole frame is too large to fit in local memory
  - No temporal locality
• Opportunity
  - Computations on frame blocks are independent

Stream Programming for Caches: MPEG-2 Example
[Figure: P → Predicted block → T, within the Predicted Video Frame]
• Introducing temporal locality
  - Loop fusion for P() and T() at block level
  - Intermediate data are dead once T() done

Stream Programming for Caches: MPEG-2 Example
[Figure: P → Predicted block → T]
• Exploiting producer-consumer locality
  - Re-use the predicted block buffer
  - Dynamic working set reduced
  - Fits in local memory; no off-chip traffic
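The three MPEG-2 slides above describe a block-level loop fusion, which might look like the following in C. This is a minimal sketch: `P` and `T` here are stand-ins with made-up arithmetic, and `encode_unfused`, `encode_fused` and `BLOCK` are hypothetical names rather than code from the actual encoder.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 64  /* elements per block; illustrative size */

/* Stand-ins for the slides' P() and T(); the arithmetic is hypothetical. */
static void P(const int *src, int *pred, size_t n) {
    for (size_t i = 0; i < n; i++)
        pred[i] = src[i] + 1;
}
static void T(const int *pred, int *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = pred[i] * 2;
}

/* Unfused: the whole predicted frame is written out before T() reads it,
 * so a frame larger than local memory travels off-chip and back. */
void encode_unfused(const int *src, int *frame_tmp, int *dst, size_t n) {
    P(src, frame_tmp, n);
    T(frame_tmp, dst, n);
}

/* Fused at block level: one block-sized buffer is re-used for every block,
 * so the intermediate data can stay in local memory. */
void encode_fused(const int *src, int *dst, size_t n) {
    int pred[BLOCK];
    assert(n % BLOCK == 0);
    for (size_t b = 0; b < n; b += BLOCK) {
        P(src + b, pred, BLOCK);
        T(pred, dst + b, BLOCK);  /* pred is dead after this call */
    }
}
```

Both versions compute the same result; the fused one shrinks the intermediate working set from a whole frame to a single block, and it does so without any streaming-specific hardware.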

Stream Programming for Caches: MPEG-2 Example
[Figure: 2 cores. Normalized Off-chip Traffic split into Read and Write; Unoptimized vs. Stream Optimized for MPEG-2]
• Stream programming beneficial for any Memory System
  - Exposes locality that improves bandwidth and energy efficiency of local memory
• Stream programming toolchains helpful
