Institut für Technische Informatik
Chair for Embedded Systems - Prof. Dr. J. Henkel
Lecture, summer semester (SS) 2012
Reconfigurable and Adaptive Systems (RAS)
4. Fine-Grained Reconfigurable Processors
Lars Bauer, Jörg Henkel
L. Bauer, CES, KIT, 2012

RAS Topic Overview
1. Introduction
2. Overview
3. Special Instructions
4. Fine-Grained Reconfigurable Processors
   • PRISM
   • PRISM-II
   • Garp
   • MOLEN
   • PRISC
   • OneChip
   • OneChip98
   • XiRISC
   • XiSystem
   • New FPGA Architectures
5. Configuration Prefetching
6. Coarse-Grained Reconfigurable Processors
7. Adaptive Reconfigurable Processors
8. Fault-tolerance by Reconfiguration

4.1 PRISM: Processor Reconfiguration through Instruction Set Metamorphosis

PRISM Overview
• PRISM-I system: external stand-alone processing unit
  ◦ Two boards that are interconnected by a 16-bit bus
  ◦ Processor board: Motorola 68010 processor running at 10 MHz
  ◦ Accelerator board: four Xilinx 3090 FPGAs
• Hardly run-time reconfigurable, i.e. it takes a little less than one second to reconfigure the FPGAs
src: [WAL+93]

PRISM Tool Chain
• Observation: an adaptive micro-architecture cannot be designed by the high-level programmer (limited expertise)
• Solution: a high-level language compiler, the so-called configuration compiler
• "The configuration compiler […] is a special compiler that accepts a high-level language program as input, and produces both a hardware image and a software image" [WAL+93]
  ◦ Identifying hot spots (with manual interaction)
  ◦ HW/SW partitioning
  ◦ Generating SIs
src: [WAL+93]

PRISM Limitations
• Hardware Limitations:
  ◦ PRISM-I is the first implementation of the PRISM concept, i.e. it is a proof-of-concept
  ◦ Slow reconfiguration (a little less than one second) under software control
  ◦ The FPGAs provide only a low overall speed and capacity
  ◦ Slow communication: between 45 and 75 clock cycles (at 10 MHz) to move operands to an SI and to collect the results

PRISM Limitations (cont'd)
• Tool Chain Limitations:
  ◦ State and global variables are not supported
  ◦ At most 32 input bits and 32 output bits (each may be distributed among multiple variables)
  ◦ No support for variable loop counts (i.e. "for (i=0 to n)" is not supported if n is variable)
  ◦ Only single-cycle SI implementations
  ◦ Limited support for C data types (e.g. no 'float') and C constructs (e.g. no 'do-while' or 'switch-case')
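
To put the PRISM-I communication cost into perspective, the following sketch converts the 45 to 75 transfer cycles into time at 10 MHz and estimates the speedup an SI can reach once that transfer cost is included. The function names are hypothetical, and the cost model (one processor cycle for the SI itself, no overlap between transfer and execution) is an assumption made for illustration, not a model taken from [WAL+93].

```python
CLOCK_HZ = 10_000_000        # PRISM-I processor clock: 10 MHz (from the slides)
TRANSFER_CYCLES = (45, 75)   # best/worst operand and result transfer cost (from the slides)


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into microseconds at the given clock rate."""
    return cycles * 1e6 / clock_hz


def si_speedup(sw_cycles, transfer_cycles, si_cycles=1):
    """Speedup of an SI over a software routine that needs sw_cycles.

    Assumption: the SI is charged si_cycles processor cycles (the tool chain
    only generates single-cycle SIs) and the transfer does not overlap with it.
    """
    return sw_cycles / (transfer_cycles + si_cycles)


if __name__ == "__main__":
    for cycles in TRANSFER_CYCLES:
        print(f"{cycles} cycles take {cycles_to_us(cycles)} us at 10 MHz")
    # A software routine must cost more than transfer + SI cycles to benefit at all.
    print("speedup for a 760-cycle routine, worst-case transfer:",
          si_speedup(760, TRANSFER_CYCLES[1]))
```

Under these assumptions, a software routine shorter than roughly 46 to 76 cycles gains nothing from being moved into an SI, which is why the slow coupling is listed as a hardware limitation.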

4.2 PRISM-II
• Improved system: PRISM-II
• Supports larger parts of the C language specification
• Supports synthesis of sequential logic for execution of loops with variable loop counts (i.e. unknown at compile time)
src: [WAL+93]

PRISM-II Tool Chain
• The parsing and optimization stage builds on top of GCC
  ◦ GCC used a variation of a register transfer language at that time
• The synthesis is done using 'VHDL Designer' or 'X-BLOX'
src: [WAL+93]

PRISM-II Architecture
• AMD Am29050 at 33 MHz, 28 MIPS
• Coprocessor-like reconfigurable fabric
• 64-bit bus
  ◦ Using the Address Bus and the Data Bus at the same time
  ◦ Only 32-bit results are allowed
• Tighter coupling
  ◦ Only 30 ns data movement cost
src: [WAL+93]

PRISM-II Architecture (cont'd)
• 3 Xilinx 4010 FPGAs
  ◦ An SI may use all 3 FPGAs
• By utilizing data buffers, the FPGAs can work together or perform individual tasks
• Global bus provides control signals to be shared between FPGAs
  ◦ used for providing global clocks
  ◦ or transferring state information between the FPGAs
• Reported Speedup:
  ◦ 86x for simple bit reversal
  ◦ 10x for computing a Hamming code
src: [WAL+93]
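
The bit-reversal result is easier to appreciate next to a software version. The sketch below is a straightforward reference implementation, not the benchmark code used in [WAL+93] (the slides do not show it); the function name and the default width of 32 bits are assumptions. In software the routine needs one loop iteration per bit, whereas in an FPGA a bit reversal is only a permutation of wires.

```python
def reverse_bits(value, width=32):
    """Return value with the order of its lowest `width` bits reversed."""
    result = 0
    for _ in range(width):
        # Shift the result left and append the current least significant bit.
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


if __name__ == "__main__":
    print(hex(reverse_bits(0x00000001)))
    print(bin(reverse_bits(0b1011, width=4)))
```

Each iteration costs several instructions on a sequential processor, so the 32 iterations add up quickly, while the hardware version has no logic depth at all. This kind of bit-level manipulation is where fine-grained fabrics tend to show their largest gains.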

PRISM Summary
• Very early approach (1993) for a loosely coupled reconfigurable component
• PRISM-I: external processing unit
• PRISM-II: external coprocessor (to some degree)
• Very slow coupling
• Very slow reconfiguration time (range of seconds, not milliseconds)
• Relies on very old FPGAs (from today's perspective)
  ◦ Multiple FPGAs are combined to obtain a reasonable amount of reconfigurable fabric

4.3 Garp

Garp Overview
• Research effort on overcoming the limitations of reconfigurable HW
  ◦ Reconfiguration overhead
  ◦ Memory access from reconfigurable hardware
  ◦ Binary compatibility of executables across versions of the reconfigurable hardware
• Core processor and reconf. fabric on the same die
  ◦ Core processor: a single-issue MIPS-II
  ◦ Reconfigurable fabric acts as a coprocessor, but needs some modifications in the core processor
  ◦ However, no actual chip was produced
• Core processor and reconf. fabric share the same memory hierarchy
• SW-controlled run-time reconfiguration
• Reconfigurable fabric runs asynchronously to the core processor
  ◦ Reconfigurable fabric: estimated 133 MHz
src: [HW97]

Garp Reconfigurable Fabric
• Reconf. fabric is a 2D mesh composed of entities called blocks
  ◦ Number of columns is fixed to 24 (1 control block and 23 logic blocks)
  ◦ Some special-purpose blocks
  ◦ Number of rows is implementation specific and can grow in an upward-compatible fashion (expected to be at least 32)
src: [HW97]

Garp Reconfigurable Fabric (cont'd)
• Partially reconfiguring the reconfigurable fabric is supported
  ◦ Basic reconfigurable unit is a row of 24 blocks, a so-called reconfigurable ALU
  ◦ SI size is defined by #rows (→ 1D structure)
    - A row is used exclusively by at most one SI, i.e. it is not allowed that some logic blocks in a row are used for SI i and others in the same row are used for SI j
  ◦ Fabric is blocked during reconfiguration
  ◦ Supports run-time relocation (hardware translates from logical to physical row numbers)

Garp Reconfigurable Fabric (cont'd)
• Memory accesses can be initiated by the reconfigurable fabric, but only through the central 16 columns
• Extra blocks for overflow checking, rounding, control functions, wider data sizes etc.
src: [HW97]

Reconfigurable Blocks
• Each logic block takes as many as four 2-bit inputs and produces up to two 2-bit outputs
• Routing architecture:
  ◦ 2-bit buses in horizontal and vertical columns
  ◦ global & semi-global lines
src: [HW97]

Reconfigurable Blocks (cont'd)
• Each logic block can be configured to perform
  ◦ an arbitrary 4-input bitwise logical function,
  ◦ a variable shift of up to 15 bits,
  ◦ a 4-way select (multiplexer) function, or
  ◦ a 3-input add/subtract/comparison function
• Garp made a first step to integrate specialized hardware blocks into a partially reconfigurable processor (not only LUTs)
  ◦ Multi-bit adders, shifters etc. are built with 'more hardware' than was typical for FPGAs at that time
• Each logic block includes four bits of data state (i.e. registers), totaling 92 bits per row (23 logic blocks x 4 bits)
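
The first of the four block functions, an arbitrary 4-input bitwise logical function on 2-bit operands, can be modeled in a few lines. This is only a behavioral sketch: the 16-entry truth-table encoding, the input ordering, and all names are assumptions chosen for illustration and do not reflect Garp's actual configuration format.

```python
def bitwise_lut(truth_table, a, b, c, d, width=2):
    """Apply one 4-input boolean function independently to each bit position.

    truth_table is a 16-bit integer; bit i holds the output for the input
    combination whose bits (a, b, c, d) form the index i (a is the MSB).
    This encoding is a hypothetical choice for this sketch.
    """
    out = 0
    for bit in range(width):
        index = ((((a >> bit) & 1) << 3) | (((b >> bit) & 1) << 2)
                 | (((c >> bit) & 1) << 1) | ((d >> bit) & 1))
        out |= ((truth_table >> index) & 1) << bit
    return out


def make_table(func):
    """Build the 16-bit truth table for a boolean function of four bits."""
    table = 0
    for index in range(16):
        bits = [(index >> shift) & 1 for shift in (3, 2, 1, 0)]
        table |= (func(*bits) & 1) << index
    return table


if __name__ == "__main__":
    xor4 = make_table(lambda a, b, c, d: a ^ b ^ c ^ d)
    and4 = make_table(lambda a, b, c, d: a & b & c & d)
    print(hex(xor4), bitwise_lut(xor4, 0b01, 0b11, 0b10, 0b00))
    print(hex(and4), bitwise_lut(and4, 0b11, 0b11, 0b01, 0b11))
```

Because the same truth table is applied to both bit positions, one block processes two bits per operand, and a row of 23 logic blocks covers correspondingly wider operands.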

Reconfigurable Routing
• The routing architecture includes 2-bit horizontal and vertical lines of different lengths, segmented in a non-uniform way
  ◦ Short horizontal segments spanning 11 blocks are tailored to multi-bit shifts across a row
  ◦ Note: the figures show the routing for one row and one column of logic blocks, respectively

Data Access
• Data input/output
  ◦ Up to 128 bits per cycle to/from any 4 rows in the fabric
  ◦ Up to 64 bits per cycle from the MIPS core register file to any 2 rows
  ◦ Up to 32 bits per cycle from any row back to the MIPS core register file
• Dedicated queues
  ◦ Allowing read-ahead and write-behind
src: [CHW00], [HW97]

Reconfiguration Management
• Reconfiguration
  ◦ A block requires 64 configuration bits
  ◦ Configuring 32 rows: 8 [Bytes/block] x 24 [blocks/row] x 32 [rows] = 6144 Bytes
  ◦ Assuming 128-bit memory accesses, 384 sequential accesses are required
  ◦ Approx. 50 microseconds (depending on the bus)
• To accelerate context switching, the Garp array does not contain a large amount of embedded memory (if an SI needs some data twice, it typically has to load it twice)
• Supports virtual memory, supervisor mode, and protected execution of multiple processes
• Reported speedup (for hand-coded functions) compared to a 4-way superscalar UltraSparc 170:
  ◦ 43x for an image median filter
  ◦ 18.7x for DES encryption

Reconfiguration Management (cont'd)
• For fast reconfiguration, the reconfigurable fabric features a transparent distributed configuration cache
  ◦ Holds the equivalent of 128 total rows of configurations
  ◦ Distributed as 4 cached configuration rows for each physical row
  ◦ Stores recently used configurations
  ◦ Content can be pre-fetched
• Reconfiguration time from external memory is 12 external bus cycles per row plus some startup time
• Reconfiguration time from the integrated cache is 4 cycles (independent of the number of rows)
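
The reconfiguration figures above can be reproduced with a short calculation. The sketch below uses only the numbers from the slides (64 configuration bits per block, 24 blocks per row, 128-bit memory accesses, 12 external bus cycles per row, 4 cycles from the cache, 4 cached configurations per physical row). The function names are hypothetical, and the startup time is left as a parameter because the slides do not quantify it.

```python
BITS_PER_BLOCK = 64              # configuration bits per block (from the slides)
BLOCKS_PER_ROW = 24              # 1 control block + 23 logic blocks
BUS_CYCLES_PER_ROW = 12          # external reconfiguration cost per row
CACHE_RECONFIG_CYCLES = 4        # independent of the number of rows
CACHED_CONFIGS_PER_ROW = 4       # cache depth per physical row


def config_bytes(rows):
    """Size of the configuration data for the given number of rows."""
    return rows * BLOCKS_PER_ROW * BITS_PER_BLOCK // 8


def memory_accesses(rows, access_bits=128):
    """Sequential memory accesses needed to fetch the configuration."""
    total_bits = config_bytes(rows) * 8
    return -(-total_bits // access_bits)   # ceiling division


def external_reconfig_cycles(rows, startup_cycles=0):
    """External bus cycles; startup_cycles is an unspecified assumption."""
    return BUS_CYCLES_PER_ROW * rows + startup_cycles


def cache_capacity_rows(physical_rows):
    """Total configuration rows the distributed cache can hold."""
    return CACHED_CONFIGS_PER_ROW * physical_rows


if __name__ == "__main__":
    rows = 32
    print(config_bytes(rows), "bytes,", memory_accesses(rows), "accesses")
    print(external_reconfig_cycles(rows), "bus cycles from external memory")
    print(CACHE_RECONFIG_CYCLES, "cycles from the configuration cache")
    print(cache_capacity_rows(rows), "rows fit into the cache")
```

For 32 rows, the 12 bus cycles per row match the 384 accesses of 128 bits each, and a 32-row fabric with 4 cached configurations per row gives the 128 cached rows stated above.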