COSMOS: Coordination of High-Level Synthesis and Memory Optimization

ACM/IEEE CODES+ISSS 2017, Seoul, South Korea COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni Columbia University, New York,


  1. COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators
     Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
     Columbia University, New York, USA
     ACM/IEEE CODES+ISSS 2017, Seoul, South Korea

  2. Hardware Accelerators: Motivations
     • Hardware accelerators are devices designed and optimized to realize very specific functionalities
     [Figure: generality versus efficiency; labels: General-Purpose Processor Cores, Hardware Accelerators, DianNao [T. Chen et al., ASPLOS’14]]

  3. Hardware Accelerators: Architecture
     [Figure: accelerator block diagram; labels: On-chip Interconnect, Accelerator, Component #1 … Component #K, Component Interface, Component Logic (Loop #1 … Loop #N), Component Datapath, Private Local Memory (PLM) made of banks]

  4. Hardware Accelerators: High-Level Synthesis (HLS)
     [Figure: SystemC Specification, High-Level Synthesis, RTL; the component (interface, logic with Loop #1 … Loop #N, datapath, PLM banks) is synthesized with knob conf. #1 and knob conf. #2; plot of Cost (Area) versus Performance (Latency)]

  5. Hardware Accelerators: High-Level Synthesis (HLS)
     [Figure: same flow as the previous slide (SystemC Specification, High-Level Synthesis, RTL); the plot of Cost (Area) versus Performance (Latency) highlights the Pareto-optimal implementations]

  6. Hardware Accelerators: High-Level Synthesis (HLS)
     [Figure: same flow as the previous slide (SystemC Specification, High-Level Synthesis, RTL); the plot of Cost (Area) versus Performance (Latency) highlights the Pareto-dominated implementations]
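
The distinction between Pareto-optimal and Pareto-dominated implementations can be made concrete with a small filter. This is a minimal sketch, not the method used by the authors: the names `DesignPoint`, `dominates` and `pareto_front` are hypothetical, and it assumes that both latency and area are "lower is better".

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// One synthesized implementation (hypothetical type for illustration).
// Both metrics are assumed to be "lower is better".
struct DesignPoint {
    double latency_ms;
    double area_mm2;
};

// p dominates q if p is no worse on both metrics and strictly better on one.
bool dominates(const DesignPoint& p, const DesignPoint& q) {
    return p.latency_ms <= q.latency_ms && p.area_mm2 <= q.area_mm2 &&
           (p.latency_ms < q.latency_ms || p.area_mm2 < q.area_mm2);
}

// Return the Pareto-optimal subset, sorted by increasing latency.
// Exact duplicates are collapsed into a single point.
std::vector<DesignPoint> pareto_front(std::vector<DesignPoint> points) {
    std::sort(points.begin(), points.end(),
              [](const DesignPoint& a, const DesignPoint& b) {
                  if (a.latency_ms != b.latency_ms) return a.latency_ms < b.latency_ms;
                  return a.area_mm2 < b.area_mm2;
              });
    std::vector<DesignPoint> front;
    double best_area = std::numeric_limits<double>::infinity();
    for (const DesignPoint& p : points) {
        // After sorting, every earlier point is at least as fast, so p is
        // Pareto-optimal only if it is strictly smaller than all of them.
        if (p.area_mm2 < best_area) {
            front.push_back(p);
            best_area = p.area_mm2;
        }
    }
    return front;
}
```

Sorting first keeps the filter at O(n log n); a pairwise check with `dominates` gives the same result in O(n²) and may be easier to read for small sets.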

  7. Hardware Accelerators: High-Level Synthesis (HLS)
     Which knobs can be used to obtain several RTL implementations?
     1. Loop unrolling
        for (k = 0; k < N; ++k)
          a[k] = b[k] + c[k];
     [Figure: datapath reading b[k] and c[k] and writing a[k]; plot of Cost (Area) versus Performance (Latency)]

  8. Hardware Accelerators: High-Level Synthesis (HLS)
     Which knobs can be used to obtain several RTL implementations?
     1. Loop unrolling (unrolling applied)
        for (k = 0; k < N; k += 2)
          a[k+0] = b[k+0] + c[k+0];
          a[k+1] = b[k+1] + c[k+1];
     [Figure: datapath reading b[k+0], c[k+0], b[k+1], c[k+1] and writing a[k+0], a[k+1]; plot of Cost (Area) versus Performance (Latency)]
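
The two loops on these slides compute the same result; the sketch below checks that in plain C++. The function names are hypothetical, and the epilogue for an odd element count is an addition that the slide code does not show (it implicitly assumes an even N). In an HLS flow, unrolling is normally requested through a tool directive rather than written by hand; the manual version is only meant to show what the tool is asked to build.

```cpp
#include <cstddef>
#include <vector>

// Rolled loop: one addition per iteration.
// Assumes b and c hold at least a.size() elements.
void add_rolled(const std::vector<int>& b, const std::vector<int>& c,
                std::vector<int>& a) {
    for (std::size_t k = 0; k < a.size(); ++k)
        a[k] = b[k] + c[k];
}

// Loop unrolled by 2: two independent additions per iteration.
// In hardware the two additions can run in parallel, at the cost of a
// second adder and of two reads per input array in the same cycle.
void add_unrolled2(const std::vector<int>& b, const std::vector<int>& c,
                   std::vector<int>& a) {
    const std::size_t n = a.size();
    std::size_t k = 0;
    for (; k + 1 < n; k += 2) {
        a[k + 0] = b[k + 0] + c[k + 0];
        a[k + 1] = b[k + 1] + c[k + 1];
    }
    // Epilogue for an odd element count (not part of the slide code).
    if (k < n)
        a[k] = b[k] + c[k];
}
```

The extra reads per cycle are why loop unrolling interacts with the second knob, the number of memory ports.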

  9. Hardware Accelerators: High-Level Synthesis (HLS)
     Which knobs can be used to obtain several RTL implementations?
     2. Memory Ports
     [Figure: Private Local Memory (PLM) with four banks accessed through port 1 and port 2; plot of Cost (Area) versus Performance (Latency)]

  10. Hardware Accelerators: High-Level Synthesis (HLS)
      Which knobs can be used to obtain several RTL implementations?
      2. Memory Ports (increase the number of ports)
      [Figure: Private Local Memory (PLM) with eight banks accessed through port 1 to port 4; plot of Cost (Area) versus Performance (Latency)]
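
One common way to serve several accesses per cycle from single-port banks is to interleave consecutive addresses across the banks. The sketch below illustrates that idea only; the slides do not say how the PLM is organized, so the cyclic mapping and the names `BankAddress`, `map_cyclic` and `conflict_free` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <set>

// Location of one array element inside a banked memory (hypothetical type).
struct BankAddress {
    std::size_t bank;    // which bank holds the element
    std::size_t offset;  // position inside that bank
};

// Cyclic (interleaved) mapping: consecutive indices go to consecutive banks.
// Assumes num_banks > 0.
BankAddress map_cyclic(std::size_t index, std::size_t num_banks) {
    return {index % num_banks, index / num_banks};
}

// True if `unroll` consecutive accesses starting at `base` all fall in
// different banks, i.e. they could be served in the same cycle by
// single-port banks.
bool conflict_free(std::size_t base, std::size_t unroll, std::size_t num_banks) {
    std::set<std::size_t> used;
    for (std::size_t i = 0; i < unroll; ++i) {
        if (!used.insert(map_cyclic(base + i, num_banks).bank).second)
            return false;  // two accesses landed in the same bank
    }
    return true;
}
```

Under this mapping, an unroll factor larger than the number of banks always produces a conflict, which is one way to see why the port count and the unroll factor have to be chosen together.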

  11. Motivational Examples
      • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:

  12. Motivational Examples
      • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:
        1. HLS tools do not always support the generation (and optimization) of the private local memories

  13. Motivational Examples: Need of multi-port memories (using standard memories)
      [Figure: Gradient; Area (mm²) versus Effective Latency (ms) for 1, 2, 4 and 8 ports; latency span: 1.4×, area span: 1.2×]

  14. Motivational Examples: Need of multi-port memories (using multi-port memories)
      [Figure: Gradient; Area (mm²) versus Effective Latency (ms) for 1, 2, 4 and 8 ports; latency span: 7.9×, area span: 3.7×]

  15. Motivational Examples
      • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:
        1. HLS tools do not always support the generation (and optimization) of the private local memories
        2. The algorithms adopted by HLS tools are based on heuristics that make it hard to set the knobs

  16. Motivational Examples: Unpredictability of HLS tools
      [Figure: Gradient; Area (mm²) versus Effective Latency (ms) for 1, 2, 4 and 8 ports, with a zoomed inset whose points are labeled by number of unrolls (2u to 10u, 14u)]

  17. Motivational Examples: Unpredictability of HLS tools
      [Figure: Gradient; Area (mm²) versus Effective Latency (ms) for 1, 2, 4 and 8 ports, with a zoomed inset whose points are labeled by number of unrolls (2u to 10u, 14u)]

  19. Motivational Examples
      • Performing an accurate and exhaustive design-space exploration for a hardware accelerator is complex:
        1. HLS tools do not always support the generation (and optimization) of the private local memories
        2. The algorithms adopted by HLS tools are based on heuristics that make it hard to set the knobs
        3. HLS tools do not handle the simultaneous optimization of multiple components

  20. Motivational Examples: Need of compositionality
      [Figure: three plots. Grayscale: Area (mm²) versus Effective Latency (ms). Gradient: Area (mm²) versus Effective Latency (ms). Composition: Area (mm²) versus Effective Throughput (1/ms), with points marked as Pareto or Dominated]
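
A small experiment shows why combining per-component Pareto points is not enough. The sketch below assumes, purely for illustration, that two components run one after the other, so their latencies add, their areas add, and throughput is the inverse of the total latency; the slides do not state the composition model, and the names `Impl`, `Composite`, `compose`, `compose_all` and `dominated` are hypothetical.

```cpp
#include <vector>

// One implementation of a single component (latency assumed positive).
struct Impl {
    double latency_ms;
    double area_mm2;
};

// One implementation of the composed system.
struct Composite {
    double throughput_per_ms;
    double area_mm2;
};

// Assumed composition model: components run back to back.
Composite compose(const Impl& x, const Impl& y) {
    return {1.0 / (x.latency_ms + y.latency_ms), x.area_mm2 + y.area_mm2};
}

// Every pairing of one implementation per component.
std::vector<Composite> compose_all(const std::vector<Impl>& xs,
                                   const std::vector<Impl>& ys) {
    std::vector<Composite> all;
    for (const Impl& x : xs)
        for (const Impl& y : ys)
            all.push_back(compose(x, y));
    return all;
}

// Higher throughput and lower area are better. c is dominated if some other
// point is no worse on both metrics and strictly better on at least one.
bool dominated(const Composite& c, const std::vector<Composite>& all) {
    for (const Composite& o : all) {
        const bool no_worse = o.throughput_per_ms >= c.throughput_per_ms &&
                              o.area_mm2 <= c.area_mm2;
        const bool better = o.throughput_per_ms > c.throughput_per_ms ||
                            o.area_mm2 < c.area_mm2;
        if (no_worse && better) return true;
    }
    return false;
}
```

With made-up Pareto-optimal points (1 ms, 10 mm²), (2 ms, 5 mm²) and (3 ms, 4 mm²) for both components, pairing the fastest point with the smallest one gives 4 ms and 14 mm², which is dominated by pairing the two middle points (4 ms and 10 mm²). Even when every component is Pareto-optimal on its own, the composition can be dominated.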

  21. Contributions
      • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators
        1. COSMOS is able to efficiently coordinate high-level synthesis and memory generator tools

  22. Contributions
      • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators
        1. COSMOS is able to efficiently coordinate high-level synthesis and memory generator tools
        2. COSMOS leverages a scalable compositional design-space exploration methodology

  23. Contributions
      • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators
      Step 1: Component Characterization
      [Figure: SystemC Specification of the Accelerator (Component #1 … Component #K); Step 1 yields, for components #1 to #K, an area versus latency plot with region 1 and region 2]

  24. Contributions
      • We propose COSMOS, an automatic methodology for the design-space exploration of complex accelerators
      Step 2: Design-Space Exploration
      [Figure: the area versus latency plots of components #1 to #K (region 1, region 2) feed Step 2, which yields the Design Space of the Accelerator as an area versus throughput plot]

  25. Component Characterization
      • Goal: for each component of the accelerator, identify the regions with the Pareto-optimal implementations
      [Figure: Area (mm²) versus Effective Latency (ms); points for 1 port, 2 ports and 4 ports; region 1 and region 2 marked]

  26. Component Characterization
      • Goal: for each component of the accelerator, identify the regions with the Pareto-optimal implementations
      [Figure: Area (mm²) versus Effective Latency (ms); points for 1 port, 2 ports and 4 ports; region 1 and region 2 marked, with an upper-left point and a lower-right point labeled]
