Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling - PowerPoint PPT Presentation

SLIDE 1
Cornell University

Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling

Tao Chen and G. Edward Suh
Computer Systems Laboratory, Cornell University

SLIDE 2

Tao Chen Cornell University 2

Accelerator-Rich Computing Systems

  • Computing systems are becoming accelerator-rich
    • General-purpose cores + a large number of accelerators
  • Challenge: design and verification complexity
    • Non-recurring engineering (NRE) cost per accelerator
  • Manual efforts are a major source of cost
    • Creating computation pipelines
    • Managing data supply from memory

High-Level Synthesis (HLS)

This work: An automated framework for generating accelerators with efficient data supply

SLIDES 3-4

Inefficiencies in Accelerator Data Supply

Scratchpad-based accelerators

  • On-chip scratchpad memory (SPM)
  • Manually designed logic to move data between SPM and main memory
  • Pros: good performance
  • Cons: high design effort; accelerator-specific, not reusable

Cache-based accelerators

  • Pros: low design effort; the cache can be reused
  • Cons: uncertain memory latency impacts performance

[Diagrams: a scratchpad-based accelerator (compute logic with SPM and preload logic on the memory bus) vs. a cache-based accelerator (compute logic with a cache on the memory bus)]

SLIDES 5-7

Optimize Data Supply for Cache-Based Accelerators

Approach: an automated framework for generating accelerators with efficient data supply

Techniques

  • Prefetching
    • Tagging memory accesses
  • Access/Execute Decoupling
    • Program slicing + architecture template

[Diagram: accelerator source enters the automated framework, which emits an accelerator with efficient data supply: access logic and execute logic backed by a cache and a HW prefetcher on the memory bus]

SLIDES 8-9

Impact of Uncertain Memory Latency

  • Example: sparse matrix-vector multiplication (spmv)
  • Pipeline generated with High-Level Synthesis (HLS)

    // inner loop of sparse matrix-vector multiplication
    for (j = begin; j < end; j++) {
    #pragma HLS pipeline
        Si = val[j] * vec[cols[j]];
        sum = sum + Si;
    }

  • Access patterns: val[j] and cols[j] have regular strides; vec[cols[j]] is irregular
  • A cache miss stalls the entire accelerator pipeline
  • Reduce cache misses for regular accesses: prefetch data into the cache
  • Tolerate cache misses for irregular accesses: Access/Execute Decoupling

[Diagram: pipeline timing of overlapped LD/MUL/ADD iterations; a single load miss stalls every stage behind it]
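For concreteness, the slide's inner loop can be written out as a complete CRS (compressed row storage) sparse matrix-vector multiply in plain C. The array names follow the slide; `row_ptr` is our assumed name for the row-offset array (the slide only shows `begin`/`end` for one row):

```c
/* Sparse matrix-vector multiply in CRS format.
 * val[]     : nonzero values
 * cols[]    : column index of each nonzero
 * row_ptr[] : offset of each row's first nonzero; row i spans
 *             [row_ptr[i], row_ptr[i+1]) */
void spmv_crs(int nrows, const int *row_ptr, const int *cols,
              const double *val, const double *vec, double *out)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        /* the slide's inner loop: val[j] and cols[j] are regular
         * stride-1 streams; vec[cols[j]] is data-dependent */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * vec[cols[j]];
        out[i] = sum;
    }
}
```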

SLIDES 10-12

Hardware Prefetching

  • Predict future memory accesses
    • PC is often used as a hint: stream localization, spatial correlation prediction
  • Problem: accelerators lack a PC
  • Solution: generate PC-like tags for accelerator memory accesses

    for (j = begin; j < end; j++) {
        Si = val[j] * vec[cols[j]];
        sum = sum + Si;
    }

[Diagram: the global address stream 2380 8010 541C 2384 8328 5420 2388 8454 5424 238C 81B8 5428 is localized into per-tag streams: tag x = 2380 2384 2388 238C (regular stride), tag y = 541C 5420 5424 5428 (regular stride), tag z = 8010 8328 8454 81B8 (irregular, no prediction). The tags x, y, z are assigned to the three load operations in the loop's CDFG (basic blocks BB1-BB3 with LD, ×, + nodes)]
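As a software sketch of why the tags help: localizing the global address stream by tag lets a simple per-tag stride detector find the regular streams. The table size, the two-strides-in-a-row confidence rule, and the field names below are illustrative choices, not details from the paper:

```c
#include <stdint.h>

#define NSTREAMS 16

/* One entry per localized stream, indexed by the PC-like tag. */
typedef struct {
    uint32_t last_addr;
    int32_t  last_stride;
    int      valid;
} stream_entry_t;

static stream_entry_t table[NSTREAMS];

/* Observe one access from the tagged stream; return a prefetch
 * address once the same stride is seen twice in a row, or 0 if
 * there is no confident prediction (irregular stream). */
uint32_t observe_access(uint32_t tag, uint32_t addr)
{
    stream_entry_t *e = &table[tag % NSTREAMS];
    uint32_t pf = 0;
    if (e->valid) {
        int32_t stride = (int32_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->last_stride)
            pf = addr + (uint32_t)stride;   /* confident: prefetch next */
        e->last_stride = stride;
    }
    e->last_addr = addr;
    e->valid = 1;
    return pf;
}
```

Replaying the slide's streams, tag x (2380, 2384, 2388) yields a confident prefetch of 238C after the third access, while tag z's irregular addresses never build confidence.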

SLIDE 13

Decoupled Access/Execute (DAE)

  • Limitations of hardware prefetching
    • Not accurate for complex patterns; needs warm-up time
    • Fundamental reason: lack of semantic information
  • Decoupled Access/Execute
    • Allows memory accesses to run ahead to preload data

[Timeline diagram: with DAE, memory accesses run ahead of value computation by roughly the memory latency, matching an SPM with manual preload but without the manual preload logic]

SLIDE 14

Traditional DAE is not Effective for Accelerators

  • Traditional DAE: access part forwards data to execute part
  • Problem: access pipeline stalls on misses
  • Throughput is limited by access pipeline
  • Goal: allow access pipeline to continue to flow under misses

[Diagram: in traditional DAE, the access pipeline's back-to-back loads stall on a miss, so the MUL/ADD execute pipeline starves; overall throughput is set by the access pipeline]

SLIDE 15

DAE Accelerator with Decoupled Loads

  • Anatomy of a load
  • Solution: Delegate request/response handling

[Diagram: a load decomposes into address generation (AGen), request (Req), and response (Resp); with decoupled loads, Req/Resp handling is delegated so the access pipeline only performs AGen and does not wait for responses]

SLIDES 16-17

Memory Unit

  • Proxy for handling memory requests and responses
  • Supports response reordering and store-to-load forwarding

[Diagram: incoming load addresses are dependence-checked against the store address queue; independent loads issue memreq and their memresp data enters the load queue, while dependent loads take their data from the store data queue via the forwarded-data queue; load data returns to the execute unit in order]
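The dependence check can be sketched as a scan of the pending-store queues; the structure layout and names here are our simplification of the slide's diagram:

```c
#include <stdint.h>

#define SQ_SIZE 8

/* Pending stores: parallel store-address and store-data queues,
 * oldest entry at index 0. */
typedef struct {
    uint32_t addr[SQ_SIZE];
    uint32_t data[SQ_SIZE];
    int      count;
} store_queue_t;

/* Dependence check for one load. Returns 1 and writes *out if the
 * load hits a pending store (store-to-load forwarding); returns 0
 * if it is independent and must issue a memory request instead. */
int load_check(const store_queue_t *sq, uint32_t load_addr, uint32_t *out)
{
    /* scan youngest-to-oldest so the latest matching store wins */
    for (int i = sq->count - 1; i >= 0; i--) {
        if (sq->addr[i] == load_addr) {
            *out = sq->data[i];   /* data taken from store data queue */
            return 1;
        }
    }
    return 0;                     /* no dependence: go to memory */
}
```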

SLIDE 18

Automated DAE Accelerator Generation

  • Program slicing for generating access/execute slices
  • Architectural template with configurable parameters

[Flow diagram: accel.c is sliced into access.c and execute.c; HLS turns each into access.v and execute.v, which plug into an architectural template (written in PyMTL) to produce the access/execute decoupled accelerator]

HW generation parameters:
  • Queue sizes
  • Port width
  • MemUnit configuration
  • etc.
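To make the slicing concrete, here is roughly what access and execute slices of the spmv inner loop look like as a single-threaded C model (our illustration, not the tool's actual output): the access slice performs every load and pushes operands into a queue, and the execute slice pops and computes, so only the access slice touches memory.

```c
/* Software model of access/execute slicing for the spmv inner loop.
 * In hardware the two slices run concurrently, connected by queues;
 * here they run back-to-back for illustration. */
#define QCAP 1024
static double op_q[QCAP];
static int q_head, q_tail;

static void   q_push(double v) { op_q[q_tail++ % QCAP] = v; }
static double q_pop(void)      { return op_q[q_head++ % QCAP]; }

/* Access slice: all memory operations, no value arithmetic. */
void spmv_access(int begin, int end, const double *val,
                 const int *cols, const double *vec)
{
    for (int j = begin; j < end; j++) {
        q_push(val[j]);          /* regular stride load */
        q_push(vec[cols[j]]);    /* irregular load */
    }
}

/* Execute slice: arithmetic only, operands arrive via the queue. */
double spmv_execute(int begin, int end)
{
    double sum = 0.0;
    for (int j = begin; j < end; j++) {
        double a = q_pop();
        double b = q_pop();
        sum += a * b;
    }
    return sum;
}
```

Because the execute slice has no memory accesses, a miss in the access slice only delays queue refill; queued operands let the execute pipeline keep flowing.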

SLIDE 19

Evaluation Methodology

  • Vertically integrated modeling methodology
  • System components: cycle-level (gem5)
  • Accelerators: register-transfer-level (Vivado HLS, PyMTL)
  • Area, power and energy: gate-level (commercial ASIC flow)
  • Benchmark accelerators from MachSuite

Name       Description
bbgemm     Blocked matrix multiplication
bfsbulk    Breadth-first search
gemm       Dense matrix multiplication
mdknn      Molecular dynamics (k-nearest neighbor)
nw         Needleman-Wunsch algorithm
spmvcrs    Sparse matrix-vector multiplication
stencil2d  2D stencil computation
viterbi    Viterbi algorithm

SLIDE 20

Performance Comparison

  • 2.28x speedup on average
  • Prefetching and DAE work in synergy

SLIDE 21

Energy Comparison

  • 15% energy reduction on average because of reduced stalls
  • MemUnits/queues only consume a small amount of energy

SLIDE 22

More Details in the Paper

  • Deadlock Avoidance
  • Customization of Memory Units
  • Baseline Validation
  • Power and Area Comparison
  • Energy, Power and Area Breakdown
  • Sensitivity Study on Varying Queue Sizes
  • Design Space Exploration: Queue Size Customization

SLIDE 23

Summary

Cache-based accelerators

  • Avoid the high design cost of manual data movement logic
  • Problem: Inefficient in handling uncertain memory latency

Approach: Automated program analysis and architectural template to generate accelerators with efficient data supply

  • Tagging memory requests to enable prefetching
  • Decoupling to enable memory accesses to run ahead

Results: high-performance cache-based accelerators with minimal manual effort