  1. Optimizing Under Abstraction: Using Prefetching to Improve FPGA Performance. Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, and Joel Emer†‡ (†Massachusetts Institute of Technology, ‡Intel Corporation). September 3rd, FPL 2013

  2–7. Motivation • Moore's Law – increasing FPGA size and capability • Use cases for FPGAs (built up across these slides): [Figure: FPGA A, with SRAM and Ethernet, runs User Program A for circuit verification; FPGA B, adding DRAM, runs User Program B for algorithm acceleration; FPGA C, with more SRAM, LUTs, and PCIe, runs a ported User Program B']

  8–12. Abstraction • Goal: making FPGAs easier to use [Figure: on processors, a C++/Python/Perl application sits on a software library and operating system, so the same code runs on Processor A or Processor B despite different CPUs, memories, and devices; analogously, a user program written against an interface abstraction runs on FPGA A (SRAM, Ethernet) or FPGA B (DRAM, PCIe), though some resources may go unused] • Optimization under abstraction – automatically accelerate FPGA applications – provide FREE performance gain

  13. Memory Abstraction [Figure: clients reading and writing FPGA block RAMs over addr/din/dout/wen signals; a request for address A1 returns data D1 on the following cycle]
      interface MEMORY_INTERFACE
        input:
          readReq (addr);
          write (addr, din);
        output:
          // dout is available at the next cycle of readReq
          readResp () if (readReq fired previous cycle);
      endinterface

  14. Memory Abstraction [Figure: the same clients and block RAMs, now with an added valid signal; a request for address A1 returns data D1 whenever valid is asserted, decoupling the response latency from the request]
      interface MEMORY_INTERFACE
        input:
          readReq (addr);
          write (addr, din);
        output:
          // dout is available when the response is ready
          readResp () if (valid == True);
      endinterface
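The decoupled, valid-based interface above can be mimicked in software. Below is a minimal Python sketch (class and method names are illustrative, not LEAP's API) of a memory whose read responses only become valid some cycles after the request:

```python
from collections import deque

class VariableLatencyMemory:
    """Toy model of the valid-signal memory interface: a read response
    becomes available only after some (here, fixed) latency."""
    def __init__(self, latency=3):
        self.mem = {}
        self.latency = latency
        self.pending = deque()   # entries: [cycles_remaining, data]

    def write(self, addr, din):
        self.mem[addr] = din

    def read_req(self, addr):
        # Capture the data now; model the delay before it is visible.
        self.pending.append([self.latency, self.mem.get(addr)])

    def tick(self):
        # Advance one clock cycle.
        for entry in self.pending:
            entry[0] -= 1

    @property
    def valid(self):
        return bool(self.pending) and self.pending[0][0] <= 0

    def read_resp(self):
        assert self.valid, "response not ready yet"
        return self.pending.popleft()[1]

mem = VariableLatencyMemory(latency=2)
mem.write(0xA1, "D1")
mem.read_req(0xA1)
assert not mem.valid          # response not ready on the issuing cycle
mem.tick(); mem.tick()
assert mem.valid              # ready after the modeled latency
assert mem.read_resp() == "D1"
```

Because clients wait on valid rather than counting cycles, the memory system behind the interface is free to reorder, cache, or prefetch.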

  15–16. LEAP Scratchpads [Figure: each client accesses memory through a scratchpad interface and a private cache; a scratchpad connector links these to a central cache, the platform scratchpad controller, and host memory, mirroring a processor's cache hierarchy] M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.

  17. Scratchpad Optimization Automatically accelerate memory-using FPGA programs • Reduce scratchpad latency • Leverage unused resources • Learn from optimization techniques in processors – Larger caches, greater associativity – Better cache policies – Cache prefetching


  19. Talk Outline • Motivation • Introduction to LEAP Scratchpads • Prefetching in FPGAs vs. in processors • Scratchpad Prefetcher Microarchitecture • Evaluation and Prefetch Optimization • Conclusion

  20. Prefetching Techniques – Comparison of prefetching techniques and platforms [Table: columns are Static Prefetching (Processor, FPGA) and Dynamic Prefetching (Processor, FPGA); the "How?" row attributes static prefetching to the user or compiler and dynamic prefetching to the hardware manufacturer; the remaining rows mark which combinations offer no code change, high prefetch accuracy, no instruction overhead, and runtime information]


  22. Dynamic Prefetching in Processors – Classic processor dynamic prefetching policies • When to prefetch – on a cache miss – on a cache miss or a prefetch hit (also called tagged prefetch) • What to prefetch – always prefetch the next memory block – learn stride-access patterns
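The tagged-prefetch trigger, prefetching on a miss or on the first demand hit to a prefetched line, can be sketched as a small simulation. This is an illustrative next-block model, not code from the talk:

```python
def tagged_prefetch_sim(accesses, next_block=lambda a: a + 1):
    """Tagged prefetch: prefetched lines carry a tag; a demand miss or
    the first demand hit on a tagged line triggers the next prefetch,
    so a sequential stream keeps running ahead of the accesses."""
    cache = {}            # addr -> True if brought in by prefetch (tagged)
    issued = []           # prefetch addresses issued, in order
    for a in accesses:
        if a not in cache:                    # demand miss
            cache[a] = False
            p = next_block(a)
            if p not in cache:
                cache[p] = True               # bring in next block, tagged
                issued.append(p)
        elif cache[a]:                        # first hit on a prefetched line
            cache[a] = False                  # clear the tag
            p = next_block(a)
            if p not in cache:
                cache[p] = True
                issued.append(p)
    return issued

# A sequential scan: one miss, then every access hits a prefetched line.
assert tagged_prefetch_sim([0, 1, 2, 3]) == [1, 2, 3, 4]
```

Compared with prefetch-on-miss-only, the tag keeps the stream alive even when prefetches are accurate enough that misses stop occurring.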

  23. Dynamic Prefetching in Processors – Stride prefetching • L1 cache: PC-based stride prefetching • L2 cache: address-based stride prefetching (fully associative cache of learners) [Table of learners, with columns Tag | Previous Address | Stride | State: learner 1 = 0xa001 | 0x1008 | 4 | Steady; learner 2 = 0xa002 | 0x2000 | 0 | Initial; learner 3 = (empty)]

  24–25. Dynamic Prefetching on FPGAs • Easier: – cleaner streaming memory accesses – no need for the PC as a filter to separate streams – fixed (usually plenty of) resources • Harder: – back-to-back memory accesses – inefficient to implement a CAM → The scratchpad prefetcher uses address-based stride prefetching with a larger set of direct-mapped learners
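As a rough illustration of this design point, here is a Python sketch of address-based stride learning with direct-mapped learners. The learner count, region size, and state-machine details are assumptions for illustration, not the paper's exact parameters:

```python
class StrideLearner:
    """One learner entry: tracks the last address seen, the current
    stride guess, and a small confidence state machine."""
    def __init__(self):
        self.tag = None
        self.prev_addr = None
        self.stride = 0
        self.state = "INIT"   # INIT -> TRANSIENT -> STEADY

class DirectMappedLearnerTable:
    """Address-based stride learning with direct-mapped learners:
    an address selects exactly one learner slot by indexing, which is
    cheap on an FPGA, unlike a fully associative CAM lookup."""
    def __init__(self, n_learners=64, region_bits=12):
        self.learners = [StrideLearner() for _ in range(n_learners)]
        self.region_bits = region_bits  # addresses in one region share a learner

    def observe(self, addr):
        """Feed an accessed address; return the stride once the learner
        reaches STEADY, else None."""
        tag = addr >> self.region_bits
        l = self.learners[tag % len(self.learners)]
        if l.tag != tag:                      # slot conflict: evict, restart
            l.tag, l.prev_addr, l.stride, l.state = tag, addr, 0, "INIT"
            return None
        new_stride = addr - l.prev_addr
        if l.state != "INIT" and new_stride == l.stride:
            l.state = "STEADY"                # same stride twice: confident
        else:
            l.state = "TRANSIENT"
        l.stride, l.prev_addr = new_stride, addr
        return l.stride if l.state == "STEADY" else None

table = DirectMappedLearnerTable()
table.observe(0x1000)                 # first touch: nothing learned
table.observe(0x1004)                 # stride candidate = 4
assert table.observe(0x1008) == 4     # stride confirmed: steady
```

Direct mapping trades occasional conflict evictions for a table that scales to many more learners in the same FPGA resources.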

  26. Talk Outline • Motivation • Introduction to LEAP Scratchpads • Prefetching in FPGAs vs. in processors • Scratchpad Prefetcher Design • Evaluation and Prefetch Optimization • Conclusion

  27–28. Scratchpad Prefetcher [Figure: the scratchpad hierarchy from slide 15, with a prefetcher added alongside each client's private cache, between the scratchpad interface and the scratchpad connector]

  29. Scratchpad Prefetching Policy • When to prefetch – on a cache-line miss or a prefetch hit – the prefetcher learns the stride pattern • What to prefetch – prefetch address P = L + S * d, where L is the cache-line address, S is the learned stride, and d is the look-ahead distance

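In code, the prefetch-address computation is a direct transcription of the formula P = L + S * d; units (bytes versus cache lines) are left abstract here:

```python
def prefetch_addr(line_addr, stride, distance):
    """Prefetch address P = L + S * d: from the missing/hit line L,
    step the learned stride S forward d times."""
    return line_addr + stride * distance

# e.g. line 0x1000, learned stride 4, look-ahead distance 2
assert prefetch_addr(0x1000, 4, 2) == 0x1008
```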

  31–35. Scratchpad Prefetching Policy • Look-ahead distance d: – too small a distance limits the prefetch benefit – too large a distance causes cache pollution – the suitable distance differs across programs and platforms → dynamically adjust the look-ahead distance [Figure: issued prefetch requests flow toward memory; along the way some are dropped, either by hit (the requested line is already present or in flight) or by busy (the memory path is saturated)]
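One plausible adaptation rule driven by these drop signals can be sketched as follows; the thresholds and the exact policy in the talk may differ, so treat this purely as an illustration of the feedback loop:

```python
class LookaheadController:
    """Illustrative dynamic look-ahead adjustment (not LEAP's exact rule):
    a prefetch that arrives too late (a demand request catches an
    in-flight prefetch) suggests the distance is too small; prefetches
    dropped because the memory path is busy suggest it is too large."""
    def __init__(self, d_min=1, d_max=16):
        self.d_min, self.d_max = d_min, d_max
        self.distance = d_min

    def on_late_prefetch(self):
        # Running behind the demand stream: look further ahead.
        self.distance = min(self.distance * 2, self.d_max)

    def on_dropped_by_busy(self):
        # Saturating the memory path: back off.
        self.distance = max(self.distance // 2, self.d_min)

ctrl = LookaheadController()
ctrl.on_late_prefetch()        # d: 1 -> 2
ctrl.on_late_prefetch()        # d: 2 -> 4
ctrl.on_dropped_by_busy()      # d: 4 -> 2
assert ctrl.distance == 2
```

The point of the feedback is that no single distance suits every program and platform, so the prefetcher converges on one at runtime.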
