1. LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning
   Hsin-Jung Yang†, Kermin E. Fleming‡, Michael Adler‡, Felix Winterstein§, and Joel Emer†
   †Massachusetts Institute of Technology, ‡Intel Corporation, §Imperial College London
   FPGA 2016, February 22nd

2. Motivation
   • Moore's Law continues: more transistors and more memory controllers on modern FPGAs
     – Examples: Xilinx VC709 (two 4 GB DDR3 memories), Nallatech 510T (eight 4 GB DDR4 memories + 2 GB HMC), Xeon+FPGA (three memory channels)
   • It is difficult to fully utilize DRAM bandwidth
     – Co-optimizing application cores and memory systems
     – Porting an existing design to a new platform (smaller FPGA → larger FPGA, single FPGA → multiple FPGAs)
   • Goal: automatically optimize the memory system to efficiently utilize the increased DRAM bandwidth

4. Utilizing Multiple DRAMs
   • How to connect computational engines to DRAMs in order to maximize program performance?
     – Network topology: latency, bandwidth
     – On-chip caching
     – Area constraints

6. Utilizing Multiple DRAMs
   • How to connect computational engines to DRAMs in order to maximize program performance?
     – High design complexity: network, caching, …
   • Applications have different memory behavior: some clients need far more bandwidth than others
   • We need a memory compiler!

11. Automatic Construction of Program-Optimized Memories
   • A clearly defined, generic memory abstraction
     – Separates the user program from the memory system implementation
   • Program introspection
     – To understand the program's memory behavior
   • A resource-aware, feedback-driven memory compiler
     – Uses the introspection results as feedback to automatically construct the "best" memory system for the target program and platform

12. Abstraction
   • Abstraction hides implementation details and provides good programmability
   [Figure: the processor stack (C/Python application → operating system → instruction set architecture → CPU, memory, and I/O hardware) next to the FPGA stack (user program → memory and communication abstraction → hardware)]
   • Compilers and system developers work below the abstraction, so the hardware can be optimized for the target application and platform

14. LEAP Memory Abstraction
   • A LEAP memory block gives each user engine:
     – A simple memory interface (the same as a block RAM's)
     – Arbitrary data size
     – A private address space
     – "Unlimited" storage
     – Automatic caching
   • Interface (Bluespec SystemVerilog; a behavioral sketch of its split-phase semantics follows below):

         interface MEM_IFC#(type t_ADDR, type t_DATA);
             method Action readReq (t_ADDR addr);
             method Action write (t_ADDR addr, t_DATA din);
             method ActionValue#(t_DATA) readResp ();
         endinterface

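The interface above is split-phase: a client issues readReq now and collects readResp later, and writes are fire-and-forget into a private address space. The following is a minimal behavioral sketch of those semantics in Python, purely for illustration; the class, method names, and backing dictionary are mine and are not part of LEAP.

    from collections import deque

    class LeapMemoryModel:
        """Behavioral sketch of the MEM_IFC semantics above: in-order
        split-phase reads plus fire-and-forget writes over a private,
        'unlimited' address space. Real LEAP memories are Bluespec
        modules backed by the platform's cache hierarchy."""

        def __init__(self, default=0):
            self._store = {}          # private address space (sparse)
            self._pending = deque()   # outstanding read requests, in order
            self._default = default   # value returned for untouched addresses

        def read_req(self, addr):     # ~ method readReq(addr)
            self._pending.append(addr)

        def write(self, addr, din):   # ~ method write(addr, din)
            self._store[addr] = din

        def read_resp(self):          # ~ method readResp(): in request order
            return self._store.get(self._pending.popleft(), self._default)

    # Requests and responses are decoupled, as they would be in hardware:
    mem = LeapMemoryModel()
    mem.write(0x10, 42)
    mem.read_req(0x10)
    mem.read_req(0x20)
    print(mem.read_resp())            # 42
    print(mem.read_resp())            # 0 (address never written)
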
16. LEAP Private Memory
   [Figure: on the FPGA, user-program clients reach platform storage through the LEAP memory interface: private caches in on-chip SRAM backed by on-board DRAM. The processor-side analogy is an application served by an L1 cache, an L2 cache, and main memory. A toy model of this two-level lookup follows below.]
   M. Adler et al., "LEAP Scratchpads," in FPGA, 2011.

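To make the analogy concrete, here is a toy model of such a two-level lookup: a small private (L1-like) cache in front of a larger central (L2-like) cache in front of DRAM. The capacities and the eviction policy are arbitrary placeholders, not LEAP's actual cache parameters.

    class TwoLevelMemoryModel:
        """Toy model of the hierarchy on the slide: each read checks a small
        private cache (on-chip SRAM, L1-like), then a larger central cache
        (L2-like), then falls through to on-board DRAM."""

        def __init__(self, l1_lines=64, l2_lines=1024):
            self.l1, self.l2, self.dram = {}, {}, {}
            self.l1_lines, self.l2_lines = l1_lines, l2_lines
            self.stats = {"l1_hits": 0, "l2_hits": 0, "dram_reads": 0}

        def _fill(self, cache, limit, addr, data):
            if len(cache) >= limit:          # crude eviction: drop some line
                cache.pop(next(iter(cache)))
            cache[addr] = data

        def read(self, addr):
            if addr in self.l1:
                self.stats["l1_hits"] += 1
                return self.l1[addr]
            if addr in self.l2:
                self.stats["l2_hits"] += 1
                data = self.l2[addr]
            else:
                self.stats["dram_reads"] += 1
                data = self.dram.get(addr, 0)
                self._fill(self.l2, self.l2_lines, addr, data)
            self._fill(self.l1, self.l1_lines, addr, data)
            return data

    mem = TwoLevelMemoryModel()
    mem.dram.update({i: i * i for i in range(4096)})   # pretend DRAM contents
    for addr in [0, 1, 0, 2, 1]:
        mem.read(addr)
    print(mem.stats)   # {'l1_hits': 2, 'l2_hits': 0, 'dram_reads': 3}
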
19. LEAP Memory with Multiple DRAMs
   • Naïve solution: a unified memory with multiple DRAM banks, shared by all clients over a single interface
     – Pros: simplicity, more capacity, higher bandwidth
     – Difficulty: performance is limited, because requests are serialized and large rings add long latency
   • Can we do better?

24. LEAP Memory with Multiple DRAMs
   • Better: distributed central caches and memory controllers
   • Open question: how should memory clients be assigned to the distributed controllers?

26. Private Cache Network Partitioning
   • Program introspection: to understand the program's memory behavior
     – Statistics counters (e.g., cache misses, outstanding requests, queueing delays) are dumped to a statistics file
     – Example per-client traffic from such a file: Client A: 100, Client B: 10, Client C: 50, Client D: 20
   • A sketch of aggregating such counters into per-client traffic weights follows below.

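The memory compiler only sees such counters, not the program's source. As a sketch of how a statistics dump could be reduced to one traffic weight per client, assuming a hypothetical CSV format (LEAP's real statistics files define their own format and counter names):

    import csv
    from collections import defaultdict

    def load_client_traffic(stats_path):
        """Aggregate one traffic weight per client from a statistics dump.
        Assumes a hypothetical CSV with columns: client, counter, value."""
        traffic = defaultdict(float)
        with open(stats_path) as f:
            for row in csv.DictReader(f):
                # Cache misses are what actually reach the DRAM network,
                # so use them as each client's traffic weight.
                if row["counter"] == "cache_misses":
                    traffic[row["client"]] += float(row["value"])
        return dict(traffic)

    # Expected result for the slide's example:
    # {"A": 100.0, "B": 10.0, "C": 50.0, "D": 20.0}
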
27. Private Cache Network Partitioning
   • Case 1: memory clients with homogeneous behavior
     – Any even split of clients across the memory controllers balances the load (see the round-robin sketch below)

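For the homogeneous case, any balanced assignment works, so a simple round-robin split is enough; this sketch is illustrative and not LMC's actual code path.

    def partition_round_robin(clients, n_controllers):
        """Evenly spread homogeneous clients across controllers."""
        return {i: clients[i::n_controllers] for i in range(n_controllers)}

    print(partition_round_robin(["A", "B", "C", "D"], 2))
    # {0: ['A', 'C'], 1: ['B', 'D']}
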
30. Private Cache Network Partitioning
   • Case 2: memory clients with heterogeneous behavior
     – Example per-client traffic: 100, 10, 50, 20; the heavily loaded clients need more bandwidth than an even split of clients would give them

33. Private Cache Network Partitioning
   • Case 2 (continued): load-balanced partitioning, a classical minimum makespan scheduling problem
     – m controllers, n clients, client j with traffic t_j
     – x_{i,j} = 1 if client j is mapped to controller i, 0 otherwise
   • ILP formulation (an off-the-shelf-solver sketch follows below):

         minimize  T
         s.t.  ∑_{j=1}^{n} x_{i,j} · t_j ≤ T,    i = 1, …, m
               ∑_{i=1}^{m} x_{i,j} = 1,          j = 1, …, n
               x_{i,j} ∈ {0, 1},                 i = 1, …, m,  j = 1, …, n

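The ILP has one binary variable per client/controller pair and is small enough for an off-the-shelf solver. Below is a sketch using PuLP (an assumed external dependency); the encoding is mine and is not necessarily how LMC drives its solver.

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

    def partition_ilp(traffic, n_controllers):
        """Minimum-makespan mapping of memory clients to controllers.
        traffic: dict client -> traffic weight (e.g. cache-miss count)."""
        clients, ctrls = list(traffic), range(n_controllers)
        prob = LpProblem("load_balanced_partitioning", LpMinimize)

        # x[i, j] = 1 iff client j is mapped to controller i
        x = {(i, j): LpVariable(f"x_{i}_{j}", cat=LpBinary)
             for i in ctrls for j in clients}
        T = LpVariable("T", lowBound=0)        # makespan: the largest load
        prob += T                              # objective: minimize T

        for i in ctrls:                        # every controller's load <= T
            prob += lpSum(x[i, j] * traffic[j] for j in clients) <= T
        for j in clients:                      # each client mapped exactly once
            prob += lpSum(x[i, j] for i in ctrls) == 1

        prob.solve()
        mapping = {j: next(i for i in ctrls if value(x[i, j]) > 0.5)
                   for j in clients}
        return value(T), mapping

    # Slide's example on two controllers: one optimum keeps the heavy client
    # alone (load 100) and groups the other three (load 80).
    print(partition_ilp({"A": 100, "B": 10, "C": 50, "D": 20}, 2))
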
34. Private Cache Network Partitioning
   • Minimum makespan scheduling is NP-hard, so solving the ILP exactly can be expensive
   • Approximation: the longest-processing-time (LPT) algorithm (see the sketch below)

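The LPT approximation mentioned above sorts clients by traffic and greedily places each on the least-loaded controller. A self-contained sketch, assuming only the per-client traffic weights gathered above:

    import heapq

    def partition_lpt(traffic, n_controllers):
        """Longest-processing-time greedy for the same problem: visit clients
        in decreasing traffic order and put each on the least-loaded
        controller. Classic bound: within 4/3 - 1/(3m) of the optimum."""
        # Min-heap of (current load, controller id, assigned clients)
        heap = [(0.0, i, []) for i in range(n_controllers)]
        heapq.heapify(heap)
        for client, t in sorted(traffic.items(), key=lambda kv: kv[1], reverse=True):
            load, i, members = heapq.heappop(heap)
            members.append(client)
            heapq.heappush(heap, (load + t, i, members))
        return {i: (load, members) for load, i, members in heap}

    # Slide's example: controller 0 gets {A} (load 100), controller 1 gets
    # {C, D, B} (load 80); here LPT happens to match the ILP optimum.
    print(partition_lpt({"A": 100, "B": 10, "C": 50, "D": 20}, 2))
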
35. Private Cache Network Partitioning
   • Case 3: fractional load-balancing, where a client's traffic may be split across controllers
   • Relaxing the ILP to an LP (a solver sketch follows below):

         minimize  T
         s.t.  ∑_{j=1}^{n} x_{i,j} · t_j ≤ T,    i = 1, …, m
               ∑_{i=1}^{m} x_{i,j} = 1,          j = 1, …, n
               0 ≤ x_{i,j} ≤ 1

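Once x_{i,j} may be fractional, the relaxation is an ordinary linear program. Below is a sketch using scipy.optimize.linprog (an assumed dependency; LMC's own LP encoding and solver may differ), which returns the fraction of each client's traffic routed to each controller.

    import numpy as np
    from scipy.optimize import linprog

    def partition_lp(traffic, n_controllers):
        """LP relaxation for fractional load-balancing: x[i, j] in [0, 1] is
        the fraction of client j's traffic routed to controller i."""
        clients = list(traffic)
        t = np.array([traffic[j] for j in clients], dtype=float)
        m, n = n_controllers, len(clients)
        nvars = m * n + 1                     # x flattened as i*n + j, plus T

        c = np.zeros(nvars)
        c[-1] = 1.0                           # objective: minimize T

        # Load constraints: sum_j t_j * x[i, j] - T <= 0 for every controller i
        A_ub = np.zeros((m, nvars))
        for i in range(m):
            A_ub[i, i * n:(i + 1) * n] = t
            A_ub[i, -1] = -1.0
        b_ub = np.zeros(m)

        # Assignment constraints: sum_i x[i, j] = 1 for every client j
        A_eq = np.zeros((n, nvars))
        for j in range(n):
            for i in range(m):
                A_eq[j, i * n + j] = 1.0
        b_eq = np.ones(n)

        bounds = [(0, 1)] * (m * n) + [(0, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        x = res.x[:-1].reshape(m, n)
        return res.x[-1], {clients[j]: x[:, j].tolist() for j in range(n)}

    # With the slide's traffic on two controllers the fractional optimum is
    # (100 + 10 + 50 + 20) / 2 = 90, below the integral optimum of 100.
    T, split = partition_lp({"A": 100, "B": 10, "C": 50, "D": 20}, 2)
    print(T, split)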