Efficient Data Supply for Hardware Accelerators with Prefetching and Access/Execute Decoupling
Tao Chen and G. Edward Suh
Computer Systems Laboratory, Cornell University

Accelerator-Rich Computing Systems
• Computing systems are becoming accelerator-rich
• General-purpose cores + a large number of accelerators
• Challenge: design and verification complexity
• Non-recurring engineering (NRE) cost per accelerator
• Manual efforts are a major source of cost
  • Create computation pipelines: High-Level Synthesis (HLS)
  • Manage data supply from memory: this work
This work: an automated framework for generating accelerators with efficient data supply

Inefficiencies in Accelerator Data Supply
Scratchpad-based accelerators
• On-chip scratchpad memory (SPM)
• Manually designed logic to move data between SPM and main memory
• Pros: Good performance
• Cons: High design effort, accelerator-specific, not reusable
Cache-based accelerators
• Pros: Low design effort, cache can be reused
• Cons: Uncertain memory latency impacts performance
[Figure: two accelerator block diagrams on a memory bus. Scratchpad-based: compute logic, SPM, preload logic. Cache-based: compute logic, cache.]

Optimize Data Supply for Cache-Based Accelerators
Approach: automated framework for generating accelerators with efficient data supply
Techniques
• Prefetching
  • Tagging memory accesses
• Access/Execute Decoupling
  • Program slicing + architecture template
[Figure: accelerator source, automated framework, accelerator with efficient data supply. The generated accelerator shows access logic, execute logic, a cache and a hardware prefetcher on the memory bus.]

Impact of Uncertain Memory Latency
• Example: Sparse Matrix Vector Multiplication (spmv)
• Pipeline generated with High-Level Synthesis (HLS)

    // inner loop of sparse matrix vector multiplication
    for (j = begin; j < end; j++) {
    #pragma HLS pipeline
      Si = val[j] * vec[cols[j]];
      sum = sum + Si;
    }

• The loop has three loads: two with regular strides and one irregular
• A cache miss stalls the entire accelerator pipeline
• Reduce cache misses for regular accesses
  • Prefetch data into the cache
• Tolerate cache misses for irregular accesses
  • Access/Execute Decoupling
[Figure: HLS pipeline schedule over time (LD, LD, LD, MUL, ADD per iteration), shown without misses and with one load miss marked.]
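To make the three access streams concrete, the inner loop can be placed in a complete kernel. The following is a minimal sketch, assuming the matrix is stored in compressed row storage (CRS); `val`, `cols`, `vec`, `begin`, `end`, `Si` and `sum` come from the slide, while `spmv_crs`, `row_ptr`, `out` and `n_rows` are names chosen here for illustration. The HLS pragma is kept as a comment so the code builds cleanly with an ordinary C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse matrix-vector multiply, out = A * vec, with A in CRS form.
 * row_ptr[i] .. row_ptr[i+1] delimit the nonzeros of row i
 * (row_ptr, out and n_rows are assumed names, not from the slides). */
void spmv_crs(const double *val, const int *cols, const int *row_ptr,
              const double *vec, double *out, int n_rows)
{
    assert(val != NULL && cols != NULL && row_ptr != NULL);
    assert(vec != NULL && out != NULL);

    for (int i = 0; i < n_rows; i++) {
        int begin = row_ptr[i];
        int end = row_ptr[i + 1];
        double sum = 0.0;

        /* Inner loop from the slide. In an HLS flow this is where
         * "#pragma HLS pipeline" would be placed. */
        for (int j = begin; j < end; j++) {
            /* val[j] and cols[j] advance by one element per iteration
             * (regular stride); vec[cols[j]] depends on loaded data
             * (irregular). */
            double Si = val[j] * vec[cols[j]];
            sum = sum + Si;
        }
        out[i] = sum;
    }
}
```

For the 3x3 matrix with rows (1 0 2), (0 3 0), (4 0 5) and the vector (1, 2, 3), the arrays are `val = {1, 2, 3, 4, 5}`, `cols = {0, 2, 1, 0, 2}` and `row_ptr = {0, 2, 3, 5}`.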

Hardware Prefetching
• Predict future memory accesses
• PC is often used as a hint
• Stream localization
• Spatial correlation prediction
• Problem: accelerators lack a PC
• Solution: generate PC-like tags for accelerator memory accesses

    for (j = begin; j < end; j++) {
      Si = val[j] * vec[cols[j]];
      sum = sum + Si;
    }

Global address stream: 2380, 541C, 8010, 2384, 5420, 8328, 2388, 5424, 8454, 238C, 5428, 81B8, ...
Local address streams (one per PC: x, y, z):
    2380, 2384, 2388, 238C    regular strides
    541C, 5420, 5424, 5428    regular strides
    8010, 8328, 8454, 81B8    irregular, no prediction
[Figure: control-data flow graph (CDFG) of the loop with basic blocks BB1, BB2 and BB3; the three loads (LD) carry the tags x, y and z and feed a multiply and an add.]
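Stream localization with tags can be illustrated with a small software model. The sketch below is a generic tag-indexed stride predictor, not the prefetcher evaluated in the talk: the table size `NUM_TAGS`, the threshold `CONF_THRESHOLD`, and all type and function names are assumptions made for this example. Each tag gets its own entry, so the interleaved global stream is split into per-load streams before strides are compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TAGS 4        /* hypothetical number of tagged loads */
#define CONF_THRESHOLD 2  /* hypothetical: stride repeats needed to predict */

typedef struct {
    bool     valid;       /* entry has seen at least one address */
    uint32_t last_addr;   /* most recent address for this tag */
    int32_t  stride;      /* most recent address delta */
    int      confidence;  /* how often the delta has repeated */
} StrideEntry;

typedef struct {
    StrideEntry entry[NUM_TAGS];
} StridePrefetcher;

void prefetcher_init(StridePrefetcher *p)
{
    for (int t = 0; t < NUM_TAGS; t++) {
        p->entry[t].valid = false;
        p->entry[t].last_addr = 0;
        p->entry[t].stride = 0;
        p->entry[t].confidence = 0;
    }
}

/* Record one access made by the load with the given tag. Returns true
 * and writes *prefetch_addr once the same nonzero stride has repeated
 * CONF_THRESHOLD times for that tag. */
bool prefetcher_observe(StridePrefetcher *p, unsigned tag, uint32_t addr,
                        uint32_t *prefetch_addr)
{
    assert(tag < NUM_TAGS);
    StrideEntry *e = &p->entry[tag];

    if (!e->valid) {
        e->valid = true;
        e->last_addr = addr;
        return false;
    }

    int32_t stride = (int32_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        if (e->confidence < CONF_THRESHOLD) {
            e->confidence++;
        }
    } else {
        /* New or changed stride: restart the warm-up. */
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;

    if (e->confidence >= CONF_THRESHOLD) {
        *prefetch_addr = addr + (uint32_t)e->stride;
        return true;
    }
    return false;
}
```

Replaying the twelve addresses of the global stream with tags assigned round-robin (0, 1, 2), the two regular streams each produce a prediction on their fourth access, while the irregular stream never does. The warm-up before the first prediction is the cost that the later slides attribute to hardware prefetching.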

Decoupled Access/Execute (DAE)
• Limitations of hardware prefetching
  • Not accurate for complex patterns / needs warm-up time
  • Fundamental reason: lack of semantic information
• Decoupled Access/Execute
  • Allow memory accesses to run ahead to preload data
[Figure: two timelines, "Cache w/ DAE" (access, memory latency, value computation) and "SPM w/ manual preload" (manual preload, memory latency, compute); block diagram of an accelerator with access logic, execute logic, cache and hardware prefetcher on the memory bus.]

Traditional DAE is not Effective for Accelerators
• Traditional DAE: access part forwards data to execute part
• Problem: access pipeline stalls on misses
  • Throughput is limited by access pipeline
• Goal: allow access pipeline to continue to flow under misses
[Figure: pipeline schedules (LD, MUL, ADD over time) for the original accelerator and for the decoupled access and execute pipelines, with a load miss marked.]

DAE Accelerator with Decoupled Loads
• Anatomy of a load (LD): address generation (AGen), request (Req), response (Resp)
• Solution: delegate request/response handling
[Figure: a load decomposed into AGen, Req and Resp, with Req/Resp separated from AGen.]

Memory Unit
• Proxy for handling memory requests and responses
• Supports response reordering and store-to-load forwarding
[Figure: memory unit block diagram. Labels: load address, store address, load data to AccU, load data to ExeU, store data from ExeU, load queue, load data queue, store address queue, store data queue, dependence check, forwarding, memreq, memresp.]
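Store-to-load forwarding can be sketched in a few lines of software. This is a simplified model under stated assumptions, not the memory unit from the talk: it merges the store address and store data queues into one queue, ignores response reordering and timing, and uses made-up names and sizes (`MemUnit`, `STQ_DEPTH`, `MEM_WORDS`, word-indexed addresses). A load first checks the pending stores, youngest first, and only reads memory when no address matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STQ_DEPTH 4   /* hypothetical store queue depth */
#define MEM_WORDS 16  /* hypothetical backing memory size, in words */

typedef struct {
    uint32_t addr[STQ_DEPTH];  /* pending store addresses, oldest first */
    int32_t  data[STQ_DEPTH];  /* pending store data, same order */
    int      count;            /* number of pending stores */
    int32_t  mem[MEM_WORDS];   /* stand-in for cache and main memory */
} MemUnit;

void memunit_init(MemUnit *m)
{
    m->count = 0;
    for (int i = 0; i < MEM_WORDS; i++) {
        m->mem[i] = 0;
    }
}

/* Queue a store. Returns false when the queue is full, which a
 * hardware unit would signal as back-pressure. */
bool memunit_store(MemUnit *m, uint32_t addr, int32_t data)
{
    assert(addr < MEM_WORDS);
    if (m->count == STQ_DEPTH) {
        return false;
    }
    m->addr[m->count] = addr;
    m->data[m->count] = data;
    m->count++;
    return true;
}

/* Write the oldest pending store to memory. Returns false if none. */
bool memunit_drain_store(MemUnit *m)
{
    if (m->count == 0) {
        return false;
    }
    m->mem[m->addr[0]] = m->data[0];
    for (int i = 1; i < m->count; i++) {
        m->addr[i - 1] = m->addr[i];
        m->data[i - 1] = m->data[i];
    }
    m->count--;
    return true;
}

/* Dependence check: scan pending stores from youngest to oldest and
 * forward matching data; otherwise read memory. */
int32_t memunit_load(const MemUnit *m, uint32_t addr, bool *forwarded)
{
    assert(addr < MEM_WORDS);
    for (int i = m->count - 1; i >= 0; i--) {
        if (m->addr[i] == addr) {
            *forwarded = true;
            return m->data[i];
        }
    }
    *forwarded = false;
    return m->mem[addr];
}
```

Scanning from the youngest entry matters when two pending stores target the same address: the load must see the most recent value, which is also what memory holds once both stores have drained.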

Automated DAE Accelerator Generation
• Program slicing for generating access/execute slices
• Architectural template with configurable parameters
Flow: accel.c is sliced into access.c and execute.c; HLS turns these into access.v and execute.v; hardware generation combines them with the architectural template (written in PyMTL) to produce the access/execute decoupled accelerator.
Parameters
• Queue sizes
• Port width
• MemUnit config
• etc.
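As an illustration of what slicing produces, the spmv inner loop can be split by hand into an access slice and an execute slice that communicate through a queue. This is a software sketch under assumptions, not output of the framework: the queue type `Fifo`, its capacity `QCAP`, and the function names are invented here, and the access slice reads memory directly instead of sending addresses to a memory unit as the decoupled-load design does.

```c
#include <assert.h>
#include <stdbool.h>

#define QCAP 8  /* hypothetical queue capacity in values (kept even) */

typedef struct {
    double buf[QCAP];
    int head;   /* index of the oldest value */
    int count;  /* number of queued values */
} Fifo;

static bool fifo_push(Fifo *q, double v)
{
    if (q->count == QCAP) {
        return false;
    }
    q->buf[(q->head + q->count) % QCAP] = v;
    q->count++;
    return true;
}

static bool fifo_pop(Fifo *q, double *v)
{
    if (q->count == 0) {
        return false;
    }
    *v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return true;
}

/* Access slice: keeps only address generation and loads. It runs
 * ahead until the queue has no room for another operand pair and
 * returns the next loop index to process. */
int access_slice(const double *val, const int *cols, const double *vec,
                 int j, int end, Fifo *q)
{
    while (j < end && q->count + 2 <= QCAP) {
        fifo_push(q, val[j]);
        fifo_push(q, vec[cols[j]]);
        j++;
    }
    return j;
}

/* Execute slice: keeps only the arithmetic. Operands arrive through
 * the queue, so this slice issues no memory accesses of its own. */
double execute_slice(Fifo *q, double sum)
{
    double a = 0.0;
    double b = 0.0;
    while (q->count >= 2) {
        fifo_pop(q, &a);
        fifo_pop(q, &b);
        sum = sum + a * b;
    }
    return sum;
}

/* Sequential driver standing in for two concurrent hardware units. */
double spmv_row_decoupled(const double *val, const int *cols,
                          const double *vec, int begin, int end)
{
    assert(begin <= end);
    Fifo q = { .head = 0, .count = 0 };
    double sum = 0.0;
    int j = begin;
    while (j < end || q.count > 0) {
        j = access_slice(val, cols, vec, j, end, &q);
        sum = execute_slice(&q, sum);
    }
    return sum;
}
```

The queue capacity bounds how far the access slice can run ahead of the execute slice, which is why queue sizes appear among the template parameters.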

Evaluation Methodology
• Vertically integrated modeling methodology
  • System components: cycle-level (gem5)
  • Accelerators: register-transfer-level (Vivado HLS, PyMTL)
  • Area, power and energy: gate-level (commercial ASIC flow)
• Benchmark accelerators from MachSuite

    Name        Description
    bbgemm      Blocked matrix multiplication
    bfsbulk     Breadth-First Search
    gemm        Dense matrix multiplication
    mdknn       Molecular dynamics (K-Nearest Neighbor)
    nw          Needleman-Wunsch algorithm
    spmvcrs     Sparse matrix vector multiplication
    stencil2d   2D stencil computation
    viterbi     Viterbi algorithm

Performance Comparison
• 2.28x speedup on average
• Prefetching and DAE work in synergy

Energy Comparison
• 15% energy reduction on average because of reduced stalls
• MemUnits/queues only consume a small amount of energy

More Details in the Paper
• Deadlock Avoidance
• Customization of Memory Units
• Baseline Validation
• Power and Area Comparison
• Energy, Power and Area Breakdown
• Sensitivity Study on Varying Queue Sizes
• Design Space Exploration: Queue Size Customization
