Institut für Technische Informatik, Chair for Embedded Systems - Prof. Dr. J. Henkel
Lecture, summer semester (SS) 2014
Reconfigurable and Adaptive Systems (RAS)
Lars Bauer, Jörg Henkel

5. Configuration Prefetching
RAS Topic Overview
1. Introduction
2. Overview
3. Special Instructions
4. Fine-Grained Reconfigurable Processors
5. Configuration Prefetching
   • Motivation and Definition
   • Static Prefetching
   • Clock Frequency Variation
   • Dynamic Prefetching
   • Area Models
6. Coarse-Grained Reconfigurable Processors
7. Adaptive Reconfigurable Processors
8. Fault-tolerance by Reconfiguration

L. Bauer, CES, KIT, 2014

5.1 Motivation and Definition
Recall: Performing Run-time Reconfigurations
• PRISC
  ◦ Reconfiguration is triggered implicitly by the execution of a special instruction (SI)
  ◦ Reconfiguration time: 100-600 cycles
  ◦ Fast reconfiguration comes at the cost of very limited SI complexity
• XiRISC
  ◦ pGA-load: load a configuration into the array
  ◦ pGA-free: remove a configuration
  ◦ 16 cycles to receive a complete configuration if it is available in the 2nd-level configuration cache
  ◦ Approx. '128 + startup' cycles (not explicitly stated by the authors) to receive it from external memory
    → 8 times slower because the 2nd-level cache bus (256 bit) is 8 times wider than the memory bus (32 bit)

Recall: Performing Run-time Reconfigurations (cont'd)
• Garp
  ◦ gaconf reg: load (or switch to) the configuration at the address given by reg
  ◦ gasave reg: save all array data state to memory at the address given by reg
  ◦ garestore reg: restore previously saved data state from memory at the address given by reg
  ◦ Approx. 50 μs to reconfigure 32 rows (12 bus cycles per row plus some startup time)
• MOLEN
  ◦ p-set address: reconfigure those parts that seldom change
  ◦ c-set address: reconfigure those parts not addressed by p-set
  ◦ set-prefetch address: prefetch the microcode that is responsible for a p-set or c-set operation
  ◦ Reconfiguration time between 2 and 12 ms
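The XiRISC figures recalled above can be cross-checked with a little arithmetic. The Python sketch below is only an illustration: it assumes that one bus word is transferred per cycle and ignores the startup overhead, and the function and constant names are made up for this example.

```python
def transfer_cycles(config_bits: int, bus_width_bits: int) -> int:
    """Cycles needed to move a configuration over a bus (startup ignored)."""
    return -(-config_bits // bus_width_bits)  # ceiling division


CACHE_BUS_BITS = 256    # 2nd-level configuration cache bus width (from the slide)
MEMORY_BUS_BITS = 32    # external memory bus width (from the slide)
CACHE_LOAD_CYCLES = 16  # cycles for a load from the cache (from the slide)

# Assumption: one bus word per cycle, so the configuration size is implied.
config_bits = CACHE_LOAD_CYCLES * CACHE_BUS_BITS
memory_cycles = transfer_cycles(config_bits, MEMORY_BUS_BITS)

print(config_bits, memory_cycles)  # 4096 128
```

Under these assumptions the implied configuration size is 4096 bits, and the external load takes 128 cycles plus startup, which matches the '128 + startup' estimate.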
Recall: Performing Run-time Reconfigurations (cont'd)
• A reconfiguration can take anywhere from a few cycles (if the configuration is available in a cache) through microseconds (for limited SI complexity) to milliseconds (powerful FPGA fabrics)
• If a configuration is not available when the SI shall execute, the system performance is significantly affected
  ◦ Either stall the execution until the reconfiguration completes
  ◦ Or use the core ISA to implement the SI functionality (trap handler or conditional branch)
• Solution: if some region of the reconfigurable fabric is currently free (i.e. not occupied by another configuration), configuration prefetching can be used to perform the reconfiguration before the SI is needed

Configuration Prefetching
• Definition: start loading the configuration data of a particular SI before that SI is actually used
• Goal: minimize the performance loss due to pending reconfigurations
  ◦ Typically: try to finish the reconfiguration before the SI is executed
  ◦ Note: sometimes it is better to avoid the reconfiguration for an SI and to execute it with the core ISA instead (to avoid thrashing)
  ◦ Configuration prefetching can be used to transfer configuration data from external memory to the configuration cache (preparing a reconfiguration) or to perform the reconfiguration right away
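The goal stated above can be made concrete with a minimal Python sketch. It assumes a simple cost model that is not part of the lecture: the remaining stall is the reconfiguration time minus the head start of the prefetch, and the core-ISA fallback is preferred when it is cheaper than stalling plus hardware execution. All function names and cycle counts are hypothetical.

```python
def residual_stall(reconfig_cycles: int, lead_cycles: int) -> int:
    """Cycles the SI still has to wait if prefetching began lead_cycles earlier."""
    return max(0, reconfig_cycles - lead_cycles)


def cheaper_option(stall: int, hw_cycles: int, sw_cycles: int) -> str:
    """Compare waiting for the hardware SI with the core-ISA fallback."""
    return "stall" if stall + hw_cycles <= sw_cycles else "core ISA"


# Hypothetical numbers: a 600-cycle reconfiguration, an SI that takes
# 10 cycles in hardware and 200 cycles when emulated with the core ISA.
print(residual_stall(600, 0))     # 600: no prefetching, full stall
print(residual_stall(600, 450))   # 150: prefetch hides most of the latency
print(residual_stall(600, 1000))  # 0: reconfiguration finished in time
print(cheaper_option(residual_stall(600, 450), 10, 200))  # stall
print(cheaper_option(residual_stall(600, 0), 10, 200))    # core ISA
```

The last two lines illustrate the note about avoiding a reconfiguration: without a head start, executing the SI with the core ISA is cheaper in this toy model.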
Configuration Prefetching (cont'd)
• Example: control-flow graph
  ◦ Each node is a basic block (BB)
  ◦ Color indicates the execution frequency
  ◦ Edges show the control flow
  ◦ Red edges are function calls (dashed lines) or returns from subroutines (solid lines)
[Figure: control-flow graph annotated with the point where the SI prefetch is issued ("prefetch SI"), the time available for reconfiguration, and the point where the SI executes ("Execute SI")]
Relevant Parameters for Prefetching
• Temporal distance between starting the prefetching operation and the SI execution
  ◦ Depends on control flow
  ◦ Starting too late → the SI is demanded before prefetching completes
  ◦ Starting too early → potential conflicts between currently demanded SIs and SIs that shall be prefetched
• Probability that the SI executions are reached
  ◦ Depends on control flow
  ◦ Typically: the earlier prefetching is started, the higher the uncertainty whether the SI execution is eventually reached
• Number of SI executions
  ◦ Depends on control flow
  ◦ If the SI is executed only seldom, it might be better to execute it using the core ISA rather than speculating on a prefetch operation

Aborting prefetching operations
• False prefetching: due to control-flow uncertainty, prefetching for an SI may have been triggered, and before it finishes it becomes clear that the SI is not going to execute at all
• 'Still pending' false prefetching:
  ◦ The prefetching was added to a queue and has not started yet
  ◦ Simply remove it from that queue
• 'Already running' false prefetching:
  ◦ The prefetching operation is currently running
  ◦ For line-based reconfigurable fabrics (e.g. Garp or XiRISC), finish prefetching the current line and abort afterwards (short delay)
  ◦ For FPGA-based reconfigurable fabrics, aborting may not be possible (unless prefetching to a cache)
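The two cases of false prefetching can be sketched as a small queue model in Python. This is an illustrative sketch, not the mechanism of any of the processors named above: the class, its method names, and the returned status strings are all hypothetical, and the fabric is reduced to a single flag saying whether a running load can be aborted at a line boundary.

```python
from collections import deque


class PrefetchQueue:
    """Toy model of pending and running prefetch operations (hypothetical API)."""

    def __init__(self, line_based: bool):
        self.line_based = line_based  # True: Garp/XiRISC-like, False: FPGA-like
        self.pending = deque()        # prefetches that have not started yet
        self.running = None           # SI whose configuration is being loaded

    def request(self, si: str) -> None:
        """Queue a prefetch unless it is already running or pending."""
        if si != self.running and si not in self.pending:
            self.pending.append(si)

    def start_next(self):
        """Start the oldest pending prefetch if the loader is idle."""
        if self.running is None and self.pending:
            self.running = self.pending.popleft()
        return self.running

    def cancel(self, si: str) -> str:
        """Handle a false prefetch for the given SI."""
        if si in self.pending:        # 'still pending': just drop it
            self.pending.remove(si)
            return "dequeued"
        if si == self.running:        # 'already running'
            if self.line_based:       # finish the current line, then stop
                self.running = None
                return "aborted after current line"
            return "cannot abort"     # assumption: no abort on this fabric
        return "not found"


q = PrefetchQueue(line_based=True)
q.request("SI3")
q.request("SI4")
q.start_next()            # SI3 starts loading, SI4 stays pending
print(q.cancel("SI4"))    # dequeued
print(q.cancel("SI3"))    # aborted after current line
```

With `line_based=False`, cancelling a running prefetch returns "cannot abort" and the load continues, which mirrors the FPGA case in which aborting may not be possible.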
Aborting prefetching operations (cont'd)
• In nodes I4 and I3, all 4 SIs may be prefetched
• When the control flow moves to I1, it is clear that SIs 3 and 4 are not demanded
  ◦ Their prefetching may be stopped (if possible)
[Figure: SI-centric control-flow graph with nodes I1 to I4 and R. Circle: set of instructions (potentially with embedded control flow, subroutine calls etc.); rectangle: usage of an SI. src: [LH02]]
[Figure: further steps of this example; only the labels '0' and '1' are recoverable]
5.2 Static Prefetching

Static Prefetching
• Idea: at compile time, analyze the control-flow graph and embed prefetching instructions into the code that statically decide which SIs shall be prefetched
• Required: the probability with which each branch is taken
  ◦ Obtained from profiling; shown as edge labels (in percent)
• Next step: at each node, establish a list of all reachable SIs, sorted by the probability of reaching them
src: [LH02]
Static Prefetching (cont'd)
• Probability of a node n to reach SI s:
  \[ P(n, s) := \sum_{p \,\in\, \text{paths from node } n \text{ to SI } s} \;\; \prod_{e \,\in\, \text{edges on path } p} \mathrm{prob}(e) \]
  where prob(e) denotes the branch probability annotated on edge e
• Example: 3 paths to reach SI 3 from I10
  ◦ 0.3 * 0.4 * 0.4 +
  ◦ 0.3 * 0.6 * 0.4 +
  ◦ 0.2 * 0.8 * 0.4
  ◦ Probability to reach SI 3 from node I10: 0.048 + 0.072 + 0.064 = 18.4%
src: [LH02]

Static Prefetching (cont'd)
• Algorithm moving backwards through the graph
  ◦ Initialize a queue with all 'SI-nodes' (squares, 100% reachability)
  ◦ Remove a node n from the queue and update the probability information of all its predecessors: they can also reach the SIs that can be reached from node n
  ◦ When all successors of a node have been processed, add it to the queue
  ◦ Iterate until the queue is empty
[Figure: control-flow graph in which each node is annotated with all SIs that can be reached, in order of decreasing probability (annotations P1,4,3,2; P3,4,1,2; P4,3,2; P1,3,2; P1,2; P4,3; P1; P2; P3; P4). src: [LH02]]
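The backward propagation described above can be sketched in Python as follows. This is a simplified illustration rather than the algorithm of [LH02]: it assumes an acyclic graph in which the SI nodes have no successors, and it seeds the queue with every node that has no successors. The graph itself is hypothetical (node names A, B, C, D and exit are invented); only its edge probabilities are chosen so that the three paths from I10 to SI 3 match the example.

```python
from collections import deque


def reach_probabilities(edges, si_nodes):
    """Propagate SI reach probabilities backwards through an acyclic CFG.

    edges:    {node: {successor: branch probability}}
    si_nodes: {node: SI name}; these nodes reach their SI with probability 1
    """
    nodes = set(edges) | {s for succ in edges.values() for s in succ}
    preds = {n: [] for n in nodes}
    for n, succ in edges.items():
        for s in succ:
            preds[s].append(n)
    unprocessed = {n: len(edges.get(n, {})) for n in nodes}  # successors left
    prob = {n: {} for n in nodes}
    for n, si in si_nodes.items():
        prob[n][si] = 1.0
    queue = deque(n for n in nodes if unprocessed[n] == 0)
    while queue:
        n = queue.popleft()
        for p in preds[n]:
            for si, q in prob[n].items():
                prob[p][si] = prob[p].get(si, 0.0) + edges[p][n] * q
            unprocessed[p] -= 1
            if unprocessed[p] == 0:  # all successors of p are processed
                queue.append(p)
    return prob


def prefetch_list(prob, node):
    """Reachable SIs of a node, most probable first."""
    return sorted(prob[node], key=lambda si: -prob[node][si])


# Hypothetical graph reproducing the three paths of the example.
edges = {
    "I10": {"A": 0.3, "B": 0.2, "n_si1": 0.5},
    "A": {"C": 0.4, "D": 0.6},
    "B": {"D": 0.8, "exit": 0.2},
    "C": {"n_si3": 0.4, "exit": 0.6},
    "D": {"n_si3": 0.4, "exit": 0.6},
}
si_nodes = {"n_si3": "SI3", "n_si1": "SI1"}

prob = reach_probabilities(edges, si_nodes)
print(round(prob["I10"]["SI3"], 3))  # 0.184
print(prefetch_list(prob, "I10"))    # ['SI1', 'SI3']
```

Summing over predecessors in this way yields the same value as enumerating all paths and multiplying their edge probabilities, but each edge is visited only once.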
Static Prefetching (cont'd)
• Depending on the capacity of the FPGA, limit the prefetches to e.g. the 2 most probable SIs (this affects I7, I8, I9, I10)
• Some prefetch instructions are redundant (e.g. due to previously executed prefetches: I1, I4)
  ◦ The prefetch at I2 must not be removed because the control flow may come from node I8 (no SI 2 at I8) or from I5 (SI 1 might have started first and needs to be aborted now)
  ◦ The prefetch at I6 may be removed even though I9 prefetches P3 first; here, it is not beneficial to abort P3 to start P4 from scratch
[Figure: control-flow graph with per-node prefetch lists (annotations P1,4,3,2; P3,4,1,2; P4,3,2; P1,3,2; P1,2; P4,3; P1; P2; P3; P4). src: [LH02]]

5.3 Clock Frequency Variation