The programmer's view of a dynamically reconfigurable architecture
Luciano Lavagno
Politecnico di Torino
lavagno@polito.it
Joint work with:
Fabio Campi, Roberto Guerrieri, Andrea Lodi, Claudio Mucci, Mario Toma (Università di Bologna)
Francesco Gregoretti, Alberto La Rosa, Mihai Lazarescu, Claudio Passerone (Politecnico di Torino)
Outline
• Motivations
• The Target Reconfigurable Processor (XiRisc)
• Design Space Exploration
  – Design flow
  – Optimizations and limitations
• Turbo-decoder example
  – Memory optimizations
  – Dynamic instruction selection
  – Mapping
  – Experimental results
• Conclusions
Motivations
The reconfiguration landscape
[Figure: reconfiguration frequency versus reconfiguration granularity (Fine to Coarse). Labels: Reset, Embedded, Context, Loop buffers, Clock; FPGAs, Processors with dynamically reconfigurable HW, Sub-word ops., Processors. Source: Philips]
[Figure: GOPS on a log scale (1.E-05 to 1.E+05) versus time (Jan-90 to Jan-10). Labels: ASIC, MAC, altera, xilinx, Xilinx Virtex, DSP, MIPS, intel, MPU, trend]
Past work
• Reconfigurable array as co-processor:
  – GARP (Callahan), Nimble compiler (Li)
• Reconfigurable array as functional unit:
  – Prisc (Razdan), Chimaera (Hauck), Concise (Kastrup)
• Key issues:
  – path to memory and I/O limitations (co-processor better)
  – ease of integration into ISA and compiler (FU better)
  – row-based architecture for good arithmetic op mapping
  – efficient HW synthesis onto non-standard architecture
The XiRisc Architecture
• 2-Channel VLIW Elaboration
• Shared DSP-like function units
• Embedded pGA device
Dynamic Instruction Set Extension
Dynamic Instruction Set Extension
[Figure: instruction stream (..., pgaload, ..., pgaop $3,$4,$5, ..., Add $8, $3) shown alongside the Register File and the Configuration Memory]
Computing on the PiCoGA
[Figure: Data Flow Graph with operations Pga_op1 and Pga_op2 mapped onto the PiCoGA. Labels: Data in, Data out, Mapping, PiCoGA Control Unit]
Multi-context Array
[Figure: PiCoGA with a Configuration Cache holding Func. 1, Func. 2, Func. 3, Func. 4, ..., Func. n]
• Four configuration planes are available
• Plane switching takes one clock cycle
• While one plane is loading, others can work undisturbed
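The plane behaviour listed above can be illustrated with a small cycle-counting model. This is only a sketch: the four planes and the one-cycle switch come from the slide, while the load latency `LOAD_CYCLES`, the data structures and every function name (`pga_init`, `pga_load`, `pga_exec`) are hypothetical and do not describe the actual PiCoGA control unit.

```c
/* Toy cycle model of a multi-context array (illustrative assumptions only). */
#define NUM_PLANES    4    /* from the slide: four configuration planes */
#define SWITCH_CYCLES 1    /* from the slide: plane switching takes one cycle */
#define LOAD_CYCLES   100  /* ASSUMPTION: arbitrary load latency for illustration */

typedef struct {
    int  func_id;   /* function held by this plane, -1 if empty */
    long ready_at;  /* cycle at which a pending load completes */
} plane_t;

typedef struct {
    plane_t plane[NUM_PLANES];
    int     active; /* index of the plane currently selected */
    long    now;    /* current cycle count */
} pga_model_t;

void pga_init(pga_model_t *m) {
    for (int i = 0; i < NUM_PLANES; i++) {
        m->plane[i].func_id = -1;
        m->plane[i].ready_at = 0;
    }
    m->active = 0;
    m->now = 0;
}

/* Start loading func_id into a plane. The load runs in the background:
   m->now is not advanced, so other planes can keep executing. */
void pga_load(pga_model_t *m, int plane, int func_id) {
    m->plane[plane].func_id = func_id;
    m->plane[plane].ready_at = m->now + LOAD_CYCLES;
}

/* Execute func_id for exec_cycles. Returns the cycles spent
   (load stall + plane switch + execution), or -1 if not loaded. */
long pga_exec(pga_model_t *m, int func_id, long exec_cycles) {
    for (int i = 0; i < NUM_PLANES; i++) {
        if (m->plane[i].func_id != func_id)
            continue;
        long start = m->now;
        if (m->plane[i].ready_at > m->now)
            m->now = m->plane[i].ready_at;   /* stall until the load is done */
        if (i != m->active) {
            m->now += SWITCH_CYCLES;         /* one-cycle plane switch */
            m->active = i;
        }
        m->now += exec_cycles;
        return m->now - start;
    }
    return -1;
}
```

In this model, a load issued before a long-running operation on another plane costs nothing extra: by the time the new function is called, only the one-cycle switch remains visible.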
Design Space Exploration
• Software developer’s perspective:
  – Wants only the speed-up (cc -OR foo.c)
  – Does not want to see the architecture
• Reconfigurable processor compilers enable the transparent use of the reconfigurable instruction set via:
  – Pseudo-function calls (“intrinsics”)
  – Language extensions (pragmas)
• Design flow:
  – Identify compute intensive kernels
  – Group instructions into sets of user-defined pGA instructions
  – Use cost figures to compare costs and performance of different HW/SW partitions
  – Refine cost figures by manual or automatic synthesis
XiRisc Design Flow
[Figure: design flow diagram. Labels: C source, Front-end, HIR, LIR, Design Space Exploration, pGA insn. identification, Profiling, Simulation, Compiler, Scheduler, Assembler, Backend, obj, Griffy-C, bitstream]
Manual pGAop identification: example

Source with pragma annotation:

    int i;
    int bar (int a, int b) {
        int c;
    #pragma pgaop sa 0x12 5 1 2 c a b
        c = (a << 2) + b;
    #pragma end
        return c + a;
    }

    main() {
        i = bar(2,3);
        return ;
    }

Corresponding code with PGA and emulation variants:

    int bar (int a, int b) {
        int c;
    #if defined (PGA)
        asm ("pga5 0x12,%0,%1,%2":"=r"(c):"r"(a),"r"(b));
    #else
        asm ("topga %1, %2, $0"::"r"(a),"r"(b));
        asm ("jal _sa");
        asm ("fmpga %0, $0, $0": "=r"(c): );
    #endif
        return c + a;
    }
    ...
    #if ! defined (PGA)
    void _sa () {
        int c,a,b;
        asm("move %0,$2;move %1,$3": "=r"(a),"=r"(b):"r"(c): "$2","$3","$4");
        c = (a << 2) + b;
        /* delay by 5 cycles */
        asm("move $2,%0; li $4,5": : "r"(c) : "$2","$3","$4" );
    }
    #endif
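For checking the function on a host machine, the same example can be reduced to a plain-C reference path. This is a hedged sketch, not part of the XiRisc tool flow: `sa_ref` is a hypothetical helper name, the `PGA` macro and the `pga5` instruction string are taken from the slide, and the host path does not model the 5-cycle delay.

```c
/* Software reference for the "sa" (shift-and-add) operation of the example.
   sa_ref is a hypothetical name introduced for illustration. */
static int sa_ref(int a, int b) {
    return (a << 2) + b;
}

int bar(int a, int b) {
    int c;
#if defined(PGA)
    /* Target path, copied from the slide; assembles only for XiRisc. */
    asm ("pga5 0x12,%0,%1,%2" : "=r"(c) : "r"(a), "r"(b));
#else
    /* Host path: plain C, functional behaviour only (no latency model). */
    c = sa_ref(a, b);
#endif
    return c + a;
}
```

Built without `PGA` defined, `bar(2, 3)` evaluates `(2 << 2) + 3 + 2`, which gives a quick functional check before the instruction is mapped onto the array.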
Back-end
• DFG-based description
• Single Assignment
• Manual Dismantling
[Figure: back-end flow. Labels: High-Level C Compiler, Griffy-C, Griffy Compiler, Mapping, Place & Route, Configuration Bits, Emulation Function with Latency and Issue Delay]
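The latency and issue delay attached to each emulation function can feed a first-order cost estimate. The sketch below assumes the usual pipelining relation, in which n back-to-back issues of an operation finish after latency + (n - 1) * issue_delay cycles; the slides do not state this formula, and the function names `pga_cycles` and `pga_speedup` are hypothetical.

```c
/* First-order cycle estimate for n back-to-back issues of one pGA operation.
   ASSUMPTION: standard pipeline model, not taken from the XiRisc tools. */
long pga_cycles(long n, long latency, long issue_delay) {
    if (n <= 0)
        return 0;
    return latency + (n - 1) * issue_delay;
}

/* Estimated speed-up over a software version that needs
   sw_cycles_per_op cycles for each of the n operations. */
double pga_speedup(long n, long sw_cycles_per_op, long latency, long issue_delay) {
    long hw = pga_cycles(n, latency, issue_delay);
    if (hw == 0)
        return 0.0;
    return (double)(n * sw_cycles_per_op) / (double)hw;
}
```

With a latency of 5 and an issue delay of 1, a single call costs 5 cycles while 100 pipelined calls cost 104, so the benefit of a pGA instruction depends strongly on how often it can be issued back to back.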
Design Space Exploration
Optimizations for the Reconfigurable Array
• Increase Performance
  – Increase concurrency
  – Minimize memory accesses
  – Customize data-width
  – Optimize data structures
• Reduce Energy
  – Reduce instruction fetches
  – Reduce data fetches
[Figure: Contributions to Power Consumption (scale 0 to 40) by component: Instruction memory, Data Memory, Bus architecture, Register File, Alu, Shifter, Multiplier, Exception handling, Instruction decode, Pipeline control]
Design Space Exploration
Optimizations for the Reconfigurable Array
• Exploit concurrency
  – within the reconfigurable array
    • horizontally: operate on multiple data
    • vertically: pipelined implementation
  – with respect to the standard data-path
• Optimize data memory
  – internal storage reduces register spills
  – reordering and shifting are free
  – pack data into a single word (SIMD operation)
• Optimize instruction memory
  – reduced instruction fetches
Goals addressed:
• Increase Performance
  – Increase concurrency
  – Minimize memory accesses
  – Customize data-width
  – Optimize data structures
• Reduce Energy
  – Reduce instruction fetches
  – Reduce data fetches
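The "pack data into a single word" item can be illustrated in portable C with a SIMD-within-a-register addition. This is a software analogy only, assuming four 8-bit lanes in a 32-bit word; it does not describe how the PiCoGA implements the operation, and the names `pack4`, `lane` and `add4` are hypothetical.

```c
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word (lane 0 in the low byte). */
uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* Extract lane i (0..3) from a packed word. */
uint8_t lane(uint32_t w, int i) {
    return (uint8_t)(w >> (8 * i));
}

/* Lane-wise addition modulo 256: carries never cross a lane boundary. */
uint32_t add4(uint32_t x, uint32_t y) {
    /* Add the low 7 bits of every lane; any carry stops at bit 7 of its lane. */
    uint32_t low = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    /* Fold in the top bit of each lane with XOR, discarding the carry-out. */
    return low ^ ((x ^ y) & 0x80808080u);
}
```

One word-wide addition replaces four byte-wide ones, which is the same effect that packing achieves on the array: fewer data fetches and fewer instructions for the same amount of work.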