Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling
Tao Chen and G. Edward Suh
Computer Systems Laboratory, Cornell University
Tao Chen Cornell University 2
Accelerator-Rich Computing Systems
- Computing systems are becoming accelerator-rich
  - General-purpose cores + a large number of accelerators
- Challenge: Design and verification complexity
  - Non-recurring engineering (NRE) cost per accelerator
- Manual efforts are a major source of cost
  - Create computation pipelines: High-Level Synthesis (HLS)
  - Manage data supply from memory: this work
- This work: An automated framework for generating accelerators with efficient data supply
Inefficiencies in Accelerator Data Supply
Scratchpad-based accelerators
- On-chip scratchpad memory (SPM)
- Manually designed logic to move data between SPM and main memory
- Pros: Good performance
- Cons: High design effort, accelerator-specific, not reusable
[Figure: scratchpad-based accelerator. Labels: Accelerator, Compute Logic, SPM, Preload Logic, Memory Bus]
Cache-based accelerators
- Pros: Low design effort, cache can be reused
- Cons: Uncertain memory latency impacts performance
[Figure: cache-based accelerator. Labels: Accelerator, Compute Logic, Cache, Memory Bus]
Optimize Data Supply for Cache-Based Accelerators
Approach: automated framework for generating accelerators with efficient data supply
[Figure: Accelerator Source, through the Automated Framework, becomes an Accelerator w/ Efficient Data Supply. Labels of the generated accelerator: Access Logic, Execute Logic, HW Prefetcher, Cache, Memory Bus]
Techniques
- Prefetching
  - Tagging memory accesses
- Access/Execute Decoupling
  - Program slicing + architecture template
Impact of Uncertain Memory Latency
- Example: Sparse Matrix Vector Multiplication (spmv)
- Pipeline generated with High-Level Synthesis (HLS)
    // inner loop of sparse matrix vector multiplication
    for (j = begin; j < end; j++) {
    #pragma HLS pipeline
      Si = val[j] * vec[cols[j]];
      sum = sum + Si;
    }
- Access patterns: val[j] and cols[j] have regular strides; vec[cols[j]] is irregular
[Figure: HLS-generated pipeline timeline of LD, MUL, and ADD operations over time, with one load marked as a miss]
A cache miss stalls the entire accelerator pipeline
- Reduce cache misses for regular accesses
  - Prefetch data into the cache
- Tolerate cache misses for irregular accesses
  - Access/Execute Decoupling
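For context, the inner loop above can be placed inside a complete kernel. The following is a minimal sketch that assumes a compressed-row-storage (CRS) layout; the function name `spmv_crs`, the `row_ptr` array, and the argument order are illustrative assumptions not taken from the slides, and the HLS pragma is omitted so the code builds with an ordinary C compiler.

```c
#include <stddef.h>

/* Sketch of a CRS sparse matrix-vector product, out = A * vec.
 * spmv_crs, row_ptr, and the argument layout are assumptions made for
 * illustration; only the inner loop body comes from the slide. */
void spmv_crs(size_t n_rows, const int *row_ptr, const int *cols,
              const double *val, const double *vec, double *out)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        int begin = row_ptr[i];      /* first nonzero of row i */
        int end = row_ptr[i + 1];    /* one past the last nonzero of row i */
        for (int j = begin; j < end; j++) {
            /* val[j] and cols[j] advance with a regular stride;
             * vec[cols[j]] is a data-dependent, irregular access. */
            double Si = val[j] * vec[cols[j]];
            sum = sum + Si;
        }
        out[i] = sum;
    }
}
```

The two sequential loads are the natural targets for prefetching, while the indirect load is the one whose latency has to be tolerated rather than predicted.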
Hardware Prefetching
- Predict future memory accesses
- PC is often used as a hint
  - Stream localization
  - Spatial correlation prediction
- Problem: accelerators lack a PC
- Solution: generate PC-like tags for accelerator memory accesses
    for (j = begin; j < end; j++) {
      Si = val[j] * vec[cols[j]];
      sum = sum + Si;
    }
Global Addr Stream: 2380 8010 541C 2384 8328 5420 2388 8454 5424 238C 81B8 5428
Local Addr Streams:
- PC x: 2380 2384 2388 238C (regular strides)
- PC y: 541C 5420 5424 5428 (regular strides)
- PC z: 8010 8328 8454 81B8 (irregular, no prediction)
[Figure: CDFG with basic blocks BB1, BB2, and BB3; the three LD nodes that feed the multiply and add are tagged x, y, and z]
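The effect of tagging can be sketched with a small software model of a per-tag stride detector. This is only an illustration: the table layout, the confidence rule, and the names `stride_entry` and `observe` are assumptions, since the slides do not specify the prefetcher's internal design.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TAGS 3  /* hypothetical: one tag per static load in the loop */

/* One entry of an assumed stride-prefetcher table, indexed by tag. */
typedef struct {
    uint32_t last_addr;   /* most recent address seen for this tag */
    int32_t  stride;      /* most recent address delta */
    int      confidence;  /* times in a row the same delta repeated */
    bool     valid;
} stride_entry;

/* Observe one tagged access. Returns true and writes *prefetch_addr once
 * the same nonzero stride has been seen on two consecutive accesses. */
bool observe(stride_entry *table, int tag, uint32_t addr,
             uint32_t *prefetch_addr)
{
    stride_entry *e = &table[tag];
    if (!e->valid) {
        e->last_addr = addr;
        e->stride = 0;
        e->confidence = 0;
        e->valid = true;
        return false;
    }
    int32_t stride = (int32_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        e->confidence++;
    } else {
        e->stride = stride;   /* retrain on a new delta */
        e->confidence = 0;
    }
    e->last_addr = addr;
    if (e->confidence >= 1) {
        *prefetch_addr = addr + (uint32_t)e->stride;
        return true;
    }
    return false;
}
```

Feeding this model the twelve addresses of the global stream with one tag per load, the two regular streams produce predictions from their third access onward, while the irregular stream produces none. Without tags, the interleaved global stream would present a different delta on almost every access.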
Decoupled Access/Execute (DAE)
- Limitations of Hardware Prefetching
  - Not accurate for complex patterns / Needs warm-up time
  - Fundamental reason: lack of semantic information
- Decoupled Access/Execute
  - Allow memory accesses to run ahead to preload data
[Figure: timelines of memory access and value computation against memory latency, for Cache w/ DAE and for SPM w/ Manual Preload. Block diagram labels: Accelerator, Access Logic, Execute Logic, HW Prefetcher, Cache, Memory Bus]
Traditional DAE is not Effective for Accelerators
- Traditional DAE: access part forwards data to execute part
- Problem: access pipeline stalls on misses
  - Throughput is limited by access pipeline
- Goal: allow access pipeline to continue to flow under misses
[Figure: pipeline timelines labeled Original and Decoupled (Access, Execute), showing LD, MUL, and ADD operations and a miss]
DAE Accelerator with Decoupled Loads
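- Anatomy of a load: address generation (AGen), request (Req), response (Resp)
- Solution: Delegate request/response handling
[Figure: a load (LD) split into AGen, Req, and Resp; after decoupling, only AGen remains in the pipeline and Req/Resp are handled separately]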
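Why delegating request/response handling helps can be illustrated with a toy cycle-count model. This is a sketch under simplifying assumptions that are not from the slides: one load is issued per cycle, the memory system can service any number of outstanding misses, queues are unbounded, and responses are consumed in order at one per cycle. The function names and the latency values in the usage note are hypothetical.

```c
#include <stddef.h>

/* Cycles when the pipeline waits for each load's response before it
 * moves on: a hit (latency 1) costs one cycle, a miss costs its full
 * latency. */
unsigned long blocking_cycles(const unsigned *lat, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += lat[i];
    return total;
}

/* Cycles when address generation keeps issuing one request per cycle and
 * a separate unit collects the responses, handing them on in order. */
unsigned long decoupled_cycles(const unsigned *lat, size_t n)
{
    unsigned long done = 0;  /* cycle at which the previous load was consumed */
    for (size_t i = 0; i < n; i++) {
        unsigned long ready = (unsigned long)i + lat[i]; /* issued at cycle i */
        done = (ready > done + 1) ? ready : done + 1;
    }
    return done;
}
```

For a latency sequence such as {1, 20, 1, 20, 1, 20}, the blocking model needs 63 cycles and the decoupled model 25, because the three misses overlap instead of being paid one after another. With a single isolated miss the two models give the same count, which matches the intuition that the gain comes from overlapping multiple outstanding misses.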
Memory Unit
- Proxy for handling memory requests and responses
- Supports response reordering and store-to-load forwarding
[Figure: Mem Unit block diagram with highlighted paths for a load (LD) and a store (ST). Labels: Load Addr, Store Addr, Store Data, Load Queue, Store Addr Queue, Store Data Queue, Fwd Data Queue, Dep Check, memreq, memresp, Load Data, Fwd Data; ports to and from AccU and ExeU]
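The dependence check behind store-to-load forwarding can be sketched in software as follows. This is a simplified illustration, not the design from the slides: it merges the store address and store data queues into one structure, ignores response reordering, and the names `store_queue`, `stq_push`, `stq_forward`, and `STQ_DEPTH` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STQ_DEPTH 8  /* hypothetical queue depth */

/* Pending stores that have not yet been written to memory, oldest first. */
typedef struct {
    uint32_t addr[STQ_DEPTH];
    int32_t  data[STQ_DEPTH];
    size_t   count;
} store_queue;

/* Record a pending store; returns false if the queue is full. */
bool stq_push(store_queue *q, uint32_t addr, int32_t data)
{
    if (q->count == STQ_DEPTH)
        return false;
    q->addr[q->count] = addr;
    q->data[q->count] = data;
    q->count++;
    return true;
}

/* Dependence check for a load: scan pending stores youngest-first and
 * forward the data of the first address match. Returns false when no
 * pending store matches, meaning the load must be sent to memory. */
bool stq_forward(const store_queue *q, uint32_t addr, int32_t *data)
{
    for (size_t i = q->count; i > 0; i--) {
        if (q->addr[i - 1] == addr) {
            *data = q->data[i - 1];
            return true;
        }
    }
    return false;
}
```

Scanning youngest-first matters when two pending stores target the same address: the load has to observe the later one.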
Automated DAE Accelerator Generation
- Program slicing for generating access/execute slices
- Architectural template with configurable parameters
[Figure: generation flow. accel.c is sliced into access.c and execute.c; HLS turns these into access.v and execute.v; HW generation combines them with the Architectural Template (written in PyMTL) and its parameters to produce the Access/Execute Decoupled Accel]
Parameters
- Queue sizes
- Port width
- MemUnit config
- etc.
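To make the slicing step concrete, here is a hand-written sketch of what access and execute slices of the spmv inner loop might look like. It is not output of the framework: the `fifo` type, `QCAP`, and the function names are assumptions, and the software model runs the two slices one after the other, whereas in hardware they would run concurrently and communicate through bounded queues.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 64  /* hypothetical queue capacity */

/* Minimal ring buffer standing in for the hardware queue between slices. */
typedef struct {
    double buf[QCAP];
    size_t head;  /* total number of pops so far */
    size_t tail;  /* total number of pushes so far */
} fifo;

static void fifo_push(fifo *q, double v)
{
    q->buf[q->tail % QCAP] = v;
    q->tail++;
}

static double fifo_pop(fifo *q)
{
    double v = q->buf[q->head % QCAP];
    q->head++;
    return v;
}

/* Access slice: keeps only address computation and loads. */
void access_slice(const double *val, const double *vec, const int *cols,
                  int begin, int end, fifo *q)
{
    /* The sequential model needs room for every operand of the row. */
    assert((size_t)(end - begin) * 2 <= QCAP - (q->tail - q->head));
    for (int j = begin; j < end; j++) {
        fifo_push(q, val[j]);
        fifo_push(q, vec[cols[j]]);
    }
}

/* Execute slice: keeps only arithmetic; operands arrive via the queue. */
double execute_slice(int begin, int end, fifo *q)
{
    double sum = 0.0;
    for (int j = begin; j < end; j++) {
        double a = fifo_pop(q);
        double b = fifo_pop(q);
        sum = sum + a * b;
    }
    return sum;
}
```

Because the execute slice contains no loads, a cache miss delays only the access slice, which can run ahead by as many iterations as the queue can hold.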
Evaluation Methodology
- Vertically integrated modeling methodology
  - System components: cycle-level (gem5)
  - Accelerators: register-transfer-level (Vivado HLS, PyMTL)
  - Area, power and energy: gate-level (commercial ASIC flow)
- Benchmark accelerators from MachSuite
Name       Description
bbgemm     Blocked matrix multiplication
bfsbulk    Breadth-First Search
gemm       Dense matrix multiplication
mdknn      Molecular dynamics (K-Nearest Neighbor)
nw         Needleman-Wunsch algorithm
spmvcrs    Sparse matrix vector multiplication
stencil2d  2D stencil computation
viterbi    Viterbi algorithm
Performance Comparison
- 2.28x speedup on average
- Prefetching and DAE work in synergy
Energy Comparison
- 15% energy reduction on average because of reduced stalls
- MemUnits/queues only consume a small amount of energy
More Details in the Paper
- Deadlock Avoidance
- Customization of Memory Units
- Baseline Validation
- Power and Area Comparison
- Energy, Power and Area Breakdown
- Sensitivity Study on Varying Queue Sizes
- Design Space Exploration: Queue Size Customization
Summary
Cache-based accelerators
- Avoid the high design cost of manual data movement logic
- Problem: Inefficient in handling uncertain memory latency
Approach: Automated program analysis and architectural template to generate accelerators with efficient data supply
- Tagging memory requests to enable prefetching
- Decoupling to enable memory accesses to run ahead