Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies
Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, Felix Winterstein§, and Joel Emer†*
†Massachusetts Institute of Technology, ‡Intel Corporation, §European Space Agency, *NVIDIA Research
FPL 2015, September 3rd
Abstraction
• Abstraction hides implementation details and provides good programmability.
[Diagram: a processor stack (C/Python application → operating system → instruction set architecture → CPU, memory, I/O hardware) next to an FPGA stack, where the user program sits directly on LUTs, SRAM, PCIe, and DRAM.]
• Processor: hardware is optimized for a set of applications and fixed at design time.
• FPGA: implementation details are handled by programmers, but hardware can be optimized for the target application.
Abstraction
• Abstraction hides implementation details and provides good programmability.
[Diagram: the FPGA stack now includes an abstraction layer (memory, communication) between the user program and the hardware.]
• Processor: hardware is optimized for a set of applications and fixed at design time.
• FPGA: with the abstraction layer, platform hardware can be optimized for the target application.
Application-Optimized Memory Subsystems
• Goal: build the "best" memory subsystem for a given application
– What is the "best"?
• The memory subsystem that minimizes execution time
– How?
• A clean memory abstraction
• A rich set of memory building blocks
• Intelligent algorithms to analyze programs and automatically compose memory hierarchies
Observation
• Many FPGA programs do not consume all the available block RAMs (BRAMs)
– Design difficulty
– The same program ported from smaller FPGAs to larger ones
• Goal: utilize spare BRAMs to improve program performance
LEAP Memory Abstraction
[Diagram: a user engine connected to a LEAP memory block through the interface below.]
• Simple memory interface
• Arbitrary data size
• Private address space
• "Unlimited" storage
• Automatic caching

LEAP memory interface (in Bluespec SystemVerilog; the slide's void/value methods are written here as the Action/ActionValue methods BSV requires):

    interface MEM_IFC#(type t_ADDR, type t_DATA);
        method Action readReq(t_ADDR addr);
        method Action write(t_ADDR addr, t_DATA din);
        method ActionValue#(t_DATA) readResp();
    endinterface
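A minimal usage sketch follows. Everything except MEM_IFC itself (the module name, widths, and address literal) is illustrative, not from the talk:

    // Hypothetical client of the LEAP memory interface above; the module
    // name, widths, and address literal are illustrative.
    module mkExampleClient#(MEM_IFC#(Bit#(20), Bit#(64)) mem) (Empty);
        rule issueRead;
            mem.readReq(20'h0000A);        // request the word at address 0xA
        endrule
        rule consumeRead;
            let d <- mem.readResp();       // response arrives later; caching and
            $display("read data = %h", d); // backing storage stay hidden
        endrule
    endmodule

The point of the abstraction is visible here: the client sees only read/write methods, while cache hits, misses, and spills to larger storage happen underneath the interface.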
LEAP Scratchpad
[Diagram: an application with multiple clients; each scratchpad interface is backed by an L1 cache in on-chip SRAM and an L2 cache in on-board DRAM, with host processor memory behind them.]
M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.
LEAP Memory is Customizable
• Highly parametric (a hedged configuration sketch follows this list)
– Cache capacity
– Cache associativity
– Cache word size
– Number of cache ports
• Enable specific features/optimizations only when necessary
– Private/coherent caches for private/shared memory
– Prefetching
– Cache hierarchy topology
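To make the parametricity concrete, here is a hedged sketch of what such a configuration could look like in Bluespec; the record and its field names are illustrative, not LEAP's actual API:

    // Illustrative configuration record; not LEAP's real constructor API.
    typedef struct {
        Integer nEntries;       // cache capacity in lines
        Integer associativity;  // ways per set
        Integer wordBits;       // cache word size in bits
        Integer nPorts;         // number of client ports
        Bool    prefetch;       // enable only when the access pattern benefits
    } CacheConfig;

    // Example: an 8192-entry, 2-way cache with 64-bit words and one port.
    CacheConfig cfg = CacheConfig { nEntries: 8192, associativity: 2,
                                    wordBits: 64, nPorts: 1, prefetch: False };

Because these are elaboration-time parameters, each instance of the memory hierarchy can be shaped per application without changing the client-facing interface.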
Utilizing Spare Block RAMs
• Many FPGA programs do not consume all the BRAMs
• Goal: utilize all spare BRAMs in the LEAP memory hierarchy
• Problem: need to build very large caches
Cache Scalability Issue
• Simply scaling up BRAM-based structures may have a negative impact on operating frequency
– BRAMs are distributed across the chip, increasing wire delay
Cache Scalability Issue
• Solution: trade latency for frequency
– Multi-banked BRAM structure
– Pipelining relieves timing pressure
Cache Scalability Issue
• Solution: trade latency for frequency (bank interleaving sketched below)
[Diagram: the multi-banked, pipelined cache structure.]
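To illustrate the banking idea concretely (the bank count and address width here are assumptions, not the talk's parameters), a few low-order address bits can select the bank, so consecutive lines interleave across banks and each bank stays small and physically local:

    // Illustrative bank interleaving; NUM_BANKS and ADDR_WIDTH are assumed.
    typedef 8  NUM_BANKS;
    typedef 32 ADDR_WIDTH;

    // Low-order bits pick the bank; consecutive addresses spread across banks.
    function Bit#(TLog#(NUM_BANKS)) bankOf(Bit#(ADDR_WIDTH) addr);
        return truncate(addr);
    endfunction

    // The remaining high-order bits index within the selected bank.
    function Bit#(TSub#(ADDR_WIDTH, TLog#(NUM_BANKS))) localAddr(Bit#(ADDR_WIDTH) addr);
        return truncateLSB(addr);
    endfunction

Each bank can then be registered and pipelined independently, which is the latency-for-frequency trade the slide describes.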
Banked Cache Overhead
• Simple kernel (hit rate = 100%)
[Plots: banking overhead for latency-oriented vs. throughput-oriented applications.]
Banked Cache Overhead
• Simple kernel (hit rate = 69%)
Results: Scaling Private Caches
• Case study: Merger (an HLS kernel)
– Merger has 4 partitions; each connects to a LEAP scratchpad and forms a sorted linked list from a stream of random values.
Private or Shared Cache?
• We can now build large caches
• Where should we allocate spare BRAMs?
– Option 1: large private caches
– Option 2: a large shared cache at the next level
• Many applications have multiple memory clients
– Different working-set sizes and runtime memory footprints
Adding a Shared Cache
[Diagram: on the FPGA, the scratchpad controller is backed by a shared on-chip cache that consumes all extra BRAMs; misses go to the central cache in host DRAM and then to host memory.]
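A hedged structural sketch of this topology; the constructors mkCentralCacheClient and mkSharedCache and all widths are hypothetical names standing in for the actual LEAP modules:

    // Illustrative composition only; the constructors and widths below are
    // hypothetical, not LEAP's actual modules.
    module mkScavengedHierarchy (MEM_IFC#(Bit#(26), Bit#(64)));
        // Backing level: the central cache in host DRAM.
        MEM_IFC#(Bit#(26), Bit#(64)) central <- mkCentralCacheClient();
        // Shared on-chip cache sized to soak up the spare BRAMs; the entry
        // count would come from the BRAM usage estimation step.
        MEM_IFC#(Bit#(26), Bit#(64)) shared <- mkSharedCache(16384, central);
        return shared;
    endmodule

Because every level exposes the same memory interface, the shared cache can be inserted between the scratchpad controller and the central cache without changing any client.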
Automated Optimization
[Tool flow:]
• User kernel generation (Bluespec, Verilog, HLS kernel), with user frequency and memory demands (e.g., cache capacity)
• LEAP platform construction
• BRAM usage estimation (using a pre-built database)
• Shared cache construction
• FPGA tool chain
Results: Shared Cache
• Case study: Filter (an HLS kernel)
– Filtering algorithm for K-means clustering
– 8 partitions: each uses 3 LEAP scratchpads
[Plot: shared-cache configurations compared — 16384 sets / 2-way, 8192 sets / 4-way, 8192 sets / 2-way, 4096 sets / 1-way.]
Conclusion
• It is possible to exploit unused resources to construct memory systems that accelerate the user program.
• We propose microarchitectural changes that allow large on-chip caches to run at high frequency.
• We take steps toward automating the construction of memory hierarchies based on program resource utilization and frequency requirements.
• Future work:
– Program analysis
– Energy study
Thank You