Direct Addressed Caches for Reduced Power Consumption - PowerPoint PPT Presentation


  1. Direct Addressed Caches for Reduced Power Consumption. Emmett Witchel, Sam Larsen, C. Scott Ananian, Krste Asanović. MIT Lab for Computer Science.

  2. The Domain
     - We are attempting to reduce power consumed by the caches and memory system.
       - Not discs or screens.
       - 16% of processor + cache energy for StrongARM is dissipated in the data cache.
     - We focus on the data cache. The instruction cache is amenable to hardware-only techniques.
     - We are interested in power optimizations that are not just existing speed optimizations.
     - Exploit compile-time knowledge to avoid runtime work.
       - Partially evaluate a program for certain hardware resources.
     - We show how software can eliminate cache tag checks, which saves energy.

  3. The First Problem — Cache Tags
     - Direct mapped: each memory location has a unique home. High miss rates, which means high energy usage. Individual accesses are low power.
     - Set-associative: each memory location has a small number (e.g., 4) of homes. Moderate miss rates. Individual accesses are high power because of multiple tag and data reads.
     - CAM-tag: each memory location can be anywhere in a sub-bank. Lowest miss rates. Individual accesses are moderate power; most of the energy is in the tag check.
     - Both set-associative and CAM-tag caches spend the majority of their energy in the tag check.
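     To see why the tag check dominates, here is a back-of-the-envelope model in C. This is an editor's sketch, not from the talk: the constants E_TAG_SRAM, E_TAG_CAM, and E_DATA_READ are invented placeholders, and only the structure of each sum reflects the slide's argument.

         #include <stdio.h>

         /* Hypothetical per-access energy components (arbitrary units).
          * These numbers are illustrative placeholders, not measurements. */
         #define E_TAG_SRAM  1.0   /* one SRAM tag read + compare */
         #define E_TAG_CAM   4.0   /* CAM search across a sub-bank */
         #define E_DATA_READ 2.0   /* one data-RAM read */

         int main(void) {
             int ways = 4;  /* set associativity */

             /* Direct mapped: one tag read, one data read. */
             double direct = E_TAG_SRAM + E_DATA_READ;

             /* Set-associative: all ways read tag and data in parallel. */
             double assoc = ways * (E_TAG_SRAM + E_DATA_READ);

             /* CAM-tag: one CAM search, then a single read of the hit line. */
             double cam = E_TAG_CAM + E_DATA_READ;

             printf("direct mapped: %.1f (tag share %.0f%%)\n",
                    direct, 100 * E_TAG_SRAM / direct);
             printf("4-way assoc:   %.1f (tag share %.0f%%)\n",
                    assoc, 100 * ways * E_TAG_SRAM / assoc);
             printf("CAM-tag:       %.1f (tag share %.0f%%)\n",
                    cam, 100 * E_TAG_CAM / cam);
             return 0;
         }

     With any constants of this general shape, the CAM search accounts for the majority of a CAM-tag access's energy, which is exactly the term the DA registers let software skip.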

  4. The Solution — Pass Software Information to Hardware
     - The compiler often knows when the program is accessing the same piece of memory. Don't check the cache tags for the second access.
     - HW challenge — make this path low power.
     - SW challenge — find the opportunities for use.
       - Two compiler algorithms for two languages (C and Java).
     - Interface challenge — minimize ISA changes, don't disrupt HW, don't expose too much HW detail.
       - New flavors of memory ops are a common ISA change.
     - Security challenge — protect process data from other processes.
       - Snoop on evicts; detect invalid state early in the pipeline.

  5. Direct Addressed CAM-Tag Cache
     [Diagram: virtually indexed and tagged CAM-tag cache datapath for "lwlda da2, r1, r2, offset". After instruction fetch, register file read, and offset calculation (16-bit sign-extended offset), the 32-bit address splits into a 5-bit offset, 3-bit bank, and 18-bit tag; the tag is searched in one sub-bank's CAM, producing the hit signal, reading the data, and latching the matching line into a DA register.]

  6. Direct Addressing
     Software directly indexes into the data RAM: no tag checks.
     [Diagram: the same datapath for "lwda da2, r1, r2, offset". The line index held in the DA register drives the data RAM directly, bypassing the CAM tag search.]
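     The mechanism is easiest to see in software. Below is a minimal C simulation of one CAM-tag sub-bank with DA registers. It is an editor's sketch, not the authors' hardware: the instruction names lwlda/lwda and the eviction snooping follow the slides, but every structure, size, and signature here is invented for illustration, and miss handling is elided.

         #include <stdint.h>
         #include <stdio.h>

         #define LINES      8   /* lines per sub-bank */
         #define LINE_WORDS 8   /* 32-byte lines, 4-byte words */
         #define NUM_DAR    4   /* direct-address registers */

         typedef struct {
             uint32_t tag[LINES];
             int      valid[LINES];
             uint32_t data[LINES][LINE_WORDS];
         } subbank;

         typedef struct { int line; int valid; } da_reg;

         static subbank bank;
         static da_reg  dar[NUM_DAR];
         static long    tag_searches;  /* count of energy-hungry CAM searches */

         /* lwlda: an ordinary load (full CAM tag search) that also records
          * the matching line in DA register d for later tag-free accesses. */
         uint32_t lwlda(int d, uint32_t addr) {
             uint32_t tag = addr / (LINE_WORDS * 4);
             tag_searches++;  /* the CAM search happens here */
             for (int i = 0; i < LINES; i++)
                 if (bank.valid[i] && bank.tag[i] == tag) {
                     dar[d].line = i;
                     dar[d].valid = 1;
                     return bank.data[i][(addr / 4) % LINE_WORDS];
                 }
             return 0;  /* miss handling (refill) elided in this sketch */
         }

         /* lwda: direct-addressed load; index the data RAM with the line
          * number saved in DA register d, performing no tag search. */
         uint32_t lwda(int d, uint32_t addr) {
             if (!dar[d].valid)           /* e.g., the line was evicted */
                 return lwlda(d, addr);   /* fall back to a tag search */
             return bank.data[dar[d].line][(addr / 4) % LINE_WORDS];
         }

         /* Evictions snoop the DA registers so stale indices are caught
          * (the security point on slide 4). */
         void evict(int line) {
             bank.valid[line] = 0;
             for (int d = 0; d < NUM_DAR; d++)
                 if (dar[d].valid && dar[d].line == line)
                     dar[d].valid = 0;
         }

         int main(void) {
             /* Install one line holding addresses 0x100..0x11f. */
             bank.valid[3] = 1;
             bank.tag[3] = 0x100 / 32;
             for (int w = 0; w < LINE_WORDS; w++) bank.data[3][w] = w;

             uint32_t a = lwlda(2, 0x100);  /* tag search; latches DAR 2 */
             uint32_t b = lwda(2, 0x104);   /* same line: no tag search */
             uint32_t c = lwda(2, 0x108);   /* same line: no tag search */
             printf("%u %u %u, tag searches = %ld\n", a, b, c, tag_searches);

             evict(3);  /* snooping invalidates DAR 2 */
             printf("after evict, DAR 2 valid = %d\n", dar[2].valid);
             return 0;
         }

     Three loads to the same line cost one tag search instead of three; that ratio is the energy saving the rest of the talk quantifies.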

  7. Spill Code Using Direct Address Registers

     Old code:
       subu $sp, 64
       sw   $ra, 60($sp)
       sw   $fp, 56($sp)
       sw   $s0, 52($sp)

     Transformed code:
       subu  $sp, 64
       swlda $ra, 60($sp), $da0
       swda  $fp, 56($sp), $da0
       swda  $s0, 52($sp), $da0

     - One tag check per line used for spilling.
     - It is a simple transformation.
       - Similar to load/store multiple on StrongARM, but ld/st multiple is a limited model: it can't handle read-modify-write.
       - Hardware-only schemes capture many references, but add latency.

  8. Compiler Algorithm (C)

     Code from gsm in Mediabench:
       int P[8];
       temp = P[1];        /* block A */
       if (temp < 0)       /* block B */
         temp = -temp;     /* block C */
       if (P[0] < temp) {  /* block D */

     - Find the dominance relationship.
       - E.g., the read of P[1] in A dominates the read of P[0] in D.
     - Determine the distance.
       - P[0] is offset -4 from P[1].
       - If dist == 0, done.
     - Determine the alignment.
       - Stack & static data are aligned by our backend.
       - Loop unrolling to increase alignment.
     - Eliminate the tag check in the read of P[0].
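     The slide compresses the legality test into three bullets; distilled into C, it looks roughly like the sketch below. This is an editor's paraphrase, not the actual SUIF pass, and the function name same_line is invented.

         #include <stdio.h>

         /* off_a and off_b are byte offsets of two accesses from the same
          * base object, where access A dominates access B; align is the
          * provable alignment of the base, line the cache line size in
          * bytes.  Offsets are assumed nonnegative.  Returns nonzero if
          * B may reuse A's tag check through a DA register. */
         int same_line(long off_a, long off_b, long align, long line) {
             if (off_a == off_b)     /* distance 0: trivially the same line */
                 return 1;
             if (align % line != 0)  /* base not provably line-aligned */
                 return 0;
             return off_a / line == off_b / line;
         }

         int main(void) {
             /* gsm example: P[1] at offset 4 dominates P[0] at offset 0;
              * with a line-aligned array and 32-byte lines, the test
              * passes and the read of P[0] becomes a tag-free lwda. */
             printf("%d\n", same_line(4, 0, 32, 32));  /* prints 1 */
             return 0;
         }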

  9. C Compiler Infrastructure
     - We use SUIF, with a C backend.
     - Loop unrolling to increase aligned references.
     - Distance information from memory object offsets.
     - Use simple, local information for aliases.
     - Profile information to set the pre-loop break condition.

     Original:
       for(i=0; i<N; i++) {
         A[i] = 0;
       }

     Transformed:
       for(i=0; i<N; i++) {
         if(&A[i] % line_size == 0) break;
         A[i] = 0;
       }
       for(; i<N; i += 4) {
         A[i + 0] = 0;
         A[i + 1] = 0;
         A[i + 2] = 0;
         A[i + 3] = 0;
       }
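     As written on the slide, &A[i] % line_size is shorthand: taking a pointer modulo an integer is not legal C. A literal, compilable rendering of the transformation (an editor's version, assuming 32-byte lines and adding a remainder loop the slide omits) would cast through uintptr_t:

         #include <stdint.h>

         #define LINE_SIZE 32  /* assumed cache line size in bytes */

         void zero(int *A, int N) {
             int i;
             /* Pre-loop: run until A[i] is line-aligned.  The slide notes
              * the break condition can be tuned with profile information. */
             for (i = 0; i < N; i++) {
                 if ((uintptr_t)&A[i] % LINE_SIZE == 0)
                     break;
                 A[i] = 0;
             }
             /* Unrolled loop: each group of stores stays within one line,
              * so later stores can reuse the first store's tag check. */
             for (; i + 4 <= N; i += 4) {
                 A[i + 0] = 0;
                 A[i + 1] = 0;
                 A[i + 2] = 0;
                 A[i + 3] = 0;
             }
             for (; i < N; i++)  /* remainder */
                 A[i] = 0;
         }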

  10. Results — C Implementation
      - Mediabench.
      - Data cache energy reduction of 8.7-40%.
      - Function entry/exit code not included — expect greater savings.

  11. Java Compiler Infrastructure
      - FLEX is a bytecode-to-native compiler developed at MIT.
      - We wrote a MIPS back end.
      - Modified GNU as to accept the new memory operations.
      - Modified the ISA simulator to track DAR state.
      - Loops are unrolled.
      - Object type is tracked for additional opportunity.
        - Allows low-level optimization of accesses to, e.g., the hash code.

  12. Results — Java Implementation
      - One big advantage — function entry/exit code was transformed.
        - Calling convention modified.
      - Data cache power savings of 26-31%.
      - No profile feedback.
      [Chart: tag checks eliminated (loads and stores, 0-70%) on the SPEC JVM '98 benchmarks Jess, Jack, Zip, and DB.]

  13. Results — Comparison with L0 Cache
      - DARs usually tie the L0 cache or exceed it.
      - When the L0 cache exceeds DARs, DARs help the L0 cache.
      [Chart: tag checks eliminated (0-90%) for 8 DAR, L0, and 8 DAR + L0 on the Mediabench benchmarks g721_de, untoast, toast, and unepic.]
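      For context, an L0 cache here is a small buffer in front of the L1 data cache that skips the full tag check when an access hits a recently fetched line. A one-entry C sketch (an editor's illustration of the general idea, not Burd's or Kin's specific design):

          #include <stdint.h>
          #include <stdio.h>

          #define LINE_BYTES 32

          /* One-entry L0 line buffer.  A hit means the access is served
           * without a full L1 tag check; DA registers serve a similar role
           * but are filled under compiler control rather than implicitly
           * by whatever line was touched last. */
          typedef struct { uint32_t line_addr; int valid; } l0_buf;

          int l0_hit(l0_buf *b, uint32_t addr) {
              uint32_t line = addr / LINE_BYTES;
              if (b->valid && b->line_addr == line)
                  return 1;        /* tag check avoided */
              b->line_addr = line; /* refill from L1 (cost elided) */
              b->valid = 1;
              return 0;            /* full tag check performed */
          }

          int main(void) {
              l0_buf b = {0, 0};
              int r1 = l0_hit(&b, 0x100);  /* miss: fills the buffer */
              int r2 = l0_hit(&b, 0x104);  /* hit: same 32-byte line */
              int r3 = l0_hit(&b, 0x200);  /* miss: different line */
              printf("%d %d %d\n", r1, r2, r3);  /* prints 0 1 0 */
              return 0;
          }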

  14. Related Work
      - Fisher & Ellis used loop unrolling to reduce memory bank conflicts.
        - Barua expanded the work with Modulo Unrolling.
      - Burd and Kin have proposed hardware L0 caches.
      - Andras' FlexCache does software way-prediction into a software-controlled array of tag registers.

  15. Acknowledgements
      - Mark Hampton — GNU assembler, simulator.
      - Ronny Krashinsky — energy modeling.
      - Sam Larsen — SUIF compiler.
      - C. Scott Ananian — Java compiler (FLEX).
      - DARPA, NSF, Infineon.
