Slide 1: Reconfigurable Systems
Mikael Olausson
Computer Engineering
Department of Electrical Engineering
Linköping University
26/10/2001 Reconfigurable Systems

Slide 2: Outline
Temporal Partitioning
Temporal Partitioning with Partial Reconfiguration
Embedded Reconfigurable Architectures
Conclusions

Slide 3: Temporal Partitioning
M. Kaul, R. Vemuri, "Temporal Partitioning Combined with Design Space Exploration for Latency Minimization of Run-Time Reconfigured Designs", Proc. DATE 1999
[Figure labels: Application, Temporal partitioning, Configuration]

Slide 4: Alternatives
Many different implementations
  - Area
  - Latency
Integrate partitioning with synthesis
Iterative process
  - Lowest latency that meets area

Slide 5: Partitioning levels
Behavior level
Register Transfer Level
Gate level

Slide 6: Design points
Different implementations of the same task
  - Time-Area tradeoff
  - Serial vs Parallel
Too many design points?
  - Candidate design points
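
The idea of reducing "too many design points" to a set of candidate design points can be illustrated with a small sketch. It assumes, which the slides do not state, that a candidate is an implementation that no other implementation beats in both area and latency. The function name and the (area, latency) tuples are hypothetical.

```python
def pareto_candidates(points):
    """Keep only design points that are not dominated in area and latency.

    points: list of (area, latency) tuples for one task (hypothetical data).
    Assumption: a candidate is any point for which no other point has
    both smaller-or-equal area and smaller-or-equal latency.
    """
    candidates = []
    best_latency = float("inf")
    # Sorting by area (then latency) lets one pass find the non-dominated set.
    for area, latency in sorted(points):
        if latency < best_latency:
            candidates.append((area, latency))
            best_latency = latency
    return candidates


# Serial implementations are small and slow, parallel ones large and fast.
points = [(100, 8.0), (150, 9.0), (250, 3.0), (300, 3.0), (400, 2.5)]
print(pareto_candidates(points))
```

Pruning of this kind keeps the time-area tradeoff intact while shrinking the number of points the partitioner has to explore.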

Slide 7: Many or Few Partitionings
Spatial Partitioning
  - Increase partitioning
  - Consumes more area
  - Parallel processing => Less latency

Slide 8: Many or Few Partitionings cont'd
Temporal Partitioning
  - Increase partitions
  - Increase the available area
Will latency decrease?
  - Heavily dependent on the reconfiguration times

Slide 9: Partitioning
1. Map tasks to partitions
2. Map each partition to several design points
3. Explore multiple implementations of the design point

Slide 10: Input to Algorithm
Behavior specification (Task graph)
  - Tasks
  - Communication between tasks
Target Architecture
  - Area
  - Memory size
  - Configuration times

Slide 11: The Algorithm
Find the constraints
  - Minimum number of partitions, lower bound N_min^l
  - Minimum number of partitions, upper bound N_min^u
  - Worst case latency D_max
  - Best case latency D_min

Slide 12: Algorithm Steps
1. Find one solution for the constraints
2. Tighten the latency constraints
3. Increase the partition size and start over
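
The lower and upper bounds on the minimum number of partitions listed above can be sketched as follows. This is a minimal illustration under an assumption the slides do not spell out: the lower bound uses the smallest-area design point of every task, the upper bound uses the largest-area one, and both totals are divided by the device area. The task names, areas, and the function name are hypothetical.

```python
import math


def partition_bounds(design_points, device_area):
    """Return (lower, upper) bounds on the minimum number of partitions.

    design_points: dict mapping a task name to a list of (area, latency)
    tuples, one per implementation of that task (hypothetical data).
    Assumption: lower bound = ceil(sum of minimum areas / device area),
    upper bound = ceil(sum of maximum areas / device area).
    """
    min_total = sum(min(a for a, _ in pts) for pts in design_points.values())
    max_total = sum(max(a for a, _ in pts) for pts in design_points.values())
    lower = math.ceil(min_total / device_area)
    upper = math.ceil(max_total / device_area)
    return lower, upper


tasks = {
    "t1": [(100, 8.0), (250, 3.0)],
    "t2": [(120, 6.0), (300, 2.5)],
    "t3": [(80, 5.0), (200, 2.0)],
}
print(partition_bounds(tasks, device_area=256))  # (2, 3)
```

With these numbers the search would consider between two and three partitions before the partition count is increased further.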

Slide 13: Search limits
New D_max = (D_max + D_min)/2
Stop when D_max - D_min < δ or when no new solutions are found
Start and stop parameters for the partitioning search: α and γ
When reconfiguration time is large, set α = γ = 0

Slide 14: Examples
4x4 DCT with low reconf. time, 10 ns
Annotation on the slide: time limit

          Bounds             Result
N    I    D_max    D_min     Da      T
9    1    25.710   1.065     9.650   37.40
     2     7.226   1.065     7.060   77.32
     3     4.145   1.065     Inf     300
     4     5.685   4.145     Inf     300
     5     6.455   5.685     Inf     300
     6     6.840   6.455     Inf     300
10   1     7.060   1.095     6.500   278.8
     2     4.077   1.095     Inf     300
     3     5.568   4.077     Inf     300
     4     6.314   5.568     Inf     300
     5     6.407   6.314     Inf     300
11   1     6.500   1.125     Inf     300
12   1     6.500   1.155     Inf     300

Slide 15: Examples
4x4 DCT with high reconf. time, 30 ms

          Bounds             Result
N    I    D_max    D_min     Da      T
8    1    25.440   795       Inf     300
9    1    25.440   795       9.630   77.60
     2     6.956   795       Inf     300
     3     9.226   6.956     9.100   78.95
     4     8.111   6.956     8.100   185.73
     5     7.533   6.956     7.380   281.93
     6     7.244   6.956     Inf     300

Slide 16: Partial Reconfiguration
S. Ganesan, R. Vemuri, "An Integrated Temporal Partitioning and Partial Reconfiguration Technique for Design Latency Improvement", Proc. DATE 2000.
One part executing, one part reconfiguring

Slide 17: Part. Recon cont'd
For maximum overlap:
  - Exe(TP_i) comparable to Rec(TP_{i+1})
  - Or Exe(TP_i) >= Rec(TP_{i+1})

Slide 18: Basic Concept
[Figure: diagram with blocks labeled TP1, TP2, TP3]
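
The search-limit rule, New D_max = (D_max + D_min)/2 with a stop at D_max - D_min < δ, behaves like a bisection on the latency bound. The sketch below is a simplified illustration, not the authors' formulation: `solve` is a hypothetical stand-in for the partitioning step that returns an achieved latency or `None`, the α and γ parameters and the time limit are left out, and the toy solver only knows the three achieved latencies (9.650, 7.060, 6.500) that appear in the low-reconfiguration-time example.

```python
def tighten_latency(d_min, d_max, solve, delta):
    """Bisect the latency bound until d_max - d_min < delta.

    solve(bound) is a hypothetical oracle: it returns the latency of some
    design that meets the bound, or None when no solution is found.
    Returns (best_latency, d_min, d_max) at termination.
    """
    best = solve(d_max)
    if best is None:
        return None, d_min, d_max
    while d_max - d_min >= delta:
        bound = (d_max + d_min) / 2  # New D_max = (D_max + D_min) / 2
        achieved = solve(bound)
        if achieved is None:
            d_min = bound  # nothing found: the bound was too tight
        else:
            d_max = bound  # a solution exists at or below this bound
            best = min(best, achieved)
    return best, d_min, d_max


# Toy solver (assumption): only these three latencies are achievable.
KNOWN_LATENCIES = [9.650, 7.060, 6.500]


def toy_solver(bound):
    feasible = [d for d in KNOWN_LATENCIES if d <= bound]
    return max(feasible) if feasible else None


best, lo, hi = tighten_latency(1.065, 25.710, toy_solver, delta=0.1)
print(best, round(lo, 3), round(hi, 3))
```

A real run would also stop on the second condition from the slide, when no new solutions are found, which this sketch does not model.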

Slide 19: Synthesis
Input behaviour specification in C or VHDL
Generate a Control Data Flow Graph (CDFG)
Partitioner + area estimator => Temporal segments
High-level synthesis => RTL

Slide 20: Input specification
[Figure labels: Input Set, Local Set, BLK_1, BLK_2, BLK_3, BLK_4, Function Graph, Output Set]
Behaviour Block Intermediate Format (BBIF)

Slide 21: Target Architecture
Xilinx 6200 FPGA
[Figure labels: Host-side CTRL, RC1, RC2]
Switch between execution and reconfiguration

Slide 22: Loop Partitioning
Entire loop in one partition
Why?
  - Easy partitioning
  - Execution time maximally overlapped
If the loop doesn't fit?
  - Report a failure
  - Use the whole device (the adopted one)

Slide 23: Conditional execution
We have to wait for the outcome of the executing conditional
Conditional in one, branches in the other
If this fails
  - Host polling

Slide 24: Block Processing
Gives high execution times
One configuration for many inputs
  - Filters
  - FFT
Works only with no dependencies between inputs
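
The effect of switching between execution and reconfiguration on two regions, and of block processing, can be illustrated with a simple timing model. This is a sketch under stated assumptions rather than the authors' model: partition i+1 is reconfigured in one region while partition i executes in the other, a partition starts only when its predecessor has finished and its own reconfiguration is complete, and all times and function names are hypothetical.

```python
def sequential_latency(exe, rec):
    """Full reconfiguration: every partition is configured, then executed."""
    return sum(exe) + sum(rec)


def overlapped_latency(exe, rec):
    """Partial reconfiguration with two regions (assumed model).

    The first reconfiguration cannot be hidden. After that, each step costs
    max(Exe(TP_i), Rec(TP_{i+1})), and the last partition only executes.
    """
    total = rec[0]
    for i in range(len(exe) - 1):
        total += max(exe[i], rec[i + 1])
    return total + exe[-1]


rec = [5.0, 5.0, 5.0]  # reconfiguration time per partition
exe = [3.0, 3.0, 3.0]  # execution time per partition, one input block

print(sequential_latency(exe, rec))  # 24.0
print(overlapped_latency(exe, rec))  # 18.0

# Block processing: one configuration serves k input blocks, so the
# execution time grows and can hide the reconfiguration time completely.
k = 4
exe_blocks = [k * t for t in exe]
print(overlapped_latency(exe_blocks, rec) / k)  # 10.25 per input block
```

In this model the reconfiguration is fully hidden once Exe(TP_i) >= Rec(TP_{i+1}), which is why longer execution times per configuration help.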

Slide 25: Result
[Table: results for the designs TLC, DCT, SEG, and 1d FFT with the methods PR and FR. Columns: Design, Method, #TP, #Inp blocks, Rec. time (us), Exe time (us), % rec., Throughput (ms), Speed up vs full. The rows were overlaid on the slide and their alignment cannot be recovered; values as extracted:]
PR (TLC, DCT, SEG): 1 3 3 165 1 1 54.03 165 86 154.7 0.9 385 0.21 0.550 0.087 98.9 25.7 30
PR (1d FFT): 48 180 51.47 1991 2.04 2.52
FR (TLC, DCT, SEG): 2 2 1 165 1 1 154.8 610 86 385 0.9 103.1 0.995 0.26 0.087 61.2 98.9 59.5
FR (1d FFT): 22 180 1995.7 2088 4.08 51.2
Speed up vs full: 1.52x 2.0x 1x 1.8x

Slide 26: Questions
? What is required for good performance?
? Would more partitions be better?
? Can parallel processing increase the performance?

Slide 27: Reconfigurable Architectures
Yanbing Li, et al., "Hardware-Software Codesign of Embedded Reconfigurable Architectures", Proc. DAC 2000.
Speed up execution with FPGA

Slide 28: Target Architecture
[Figure: Target Architecture with blocks CPU, Mem, FPGA]

Slide 29: Nimble
HW/SW partitioner
From system level, described in C
Loop and Basic block level
Two-dimensional partitioning
  - Spatial
  - Temporal

Slide 30: Nimble cont'd
Search for candidate loops for implementation in HW
1 SW loop vs. 1 or more HW loops
Search for Instruction Level Parallelism

Slide 31: Key issues by Partitioning
Dynamic reconfiguration costs
Compiler optimizations (SW)
HW design space
Profiling information for HW/SW tradeoffs

Slide 32: Retargetable System?
Yes!!
Platform described in ADL by
  - Type of processor
  - Characteristics of the FPGA
  - Memory

Slide 33: Why loops?
Significant portion of the execution time
Compact implementation of loops

Slide 34: Preprocessing
Profile target architecture
Extract loops
Synthesize HW of loops
Multiple HW structures
  - Loop unrolling
  - Procedure inlining
  - Branch trimming

Slide 35: Global Cost Function
Maximize overall performance
What to include?
  - SW execution times
  - HW execution times
  - Entry times for HW implementations
  - Exit times for HW implementations
  - Configuration times

Slide 36: Algorithm Flow
Loop Entry Profiling (LED)
Interesting Loop Detection (ILP)
Intra Loop Selection
Inter Loop Selection
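
The terms listed for the global cost function can be combined in a small sketch of a per-loop HW/SW decision. This is an illustrative assumption, not the cost function of the cited work: the gain of moving a loop to hardware is taken as the software time saved per loop entry, minus the hardware time and the entry and exit overheads, summed over all profiled entries, minus the configuration time for each reconfiguration. The `LoopProfile` fields, the numbers, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    """Profiling data for one candidate loop (all fields hypothetical)."""
    name: str
    entries: int        # how often the loop is entered
    sw_time: float      # SW execution time per entry
    hw_time: float      # HW execution time per entry
    entry_time: float   # cost of entering the HW implementation
    exit_time: float    # cost of leaving the HW implementation
    config_time: float  # time to configure the FPGA for this loop


def hw_gain(loop, reconfigurations=1):
    """Estimated time saved by running the loop in HW (assumed model)."""
    per_entry = loop.sw_time - (loop.hw_time + loop.entry_time + loop.exit_time)
    return loop.entries * per_entry - reconfigurations * loop.config_time


def select_hw_loops(loops):
    """Pick the loops whose estimated gain is positive."""
    return [loop.name for loop in loops if hw_gain(loop) > 0]


loops = [
    LoopProfile("filter", entries=100, sw_time=50.0, hw_time=5.0,
                entry_time=2.0, exit_time=2.0, config_time=1000.0),
    LoopProfile("setup", entries=3, sw_time=40.0, hw_time=10.0,
                entry_time=5.0, exit_time=5.0, config_time=500.0),
]
print(select_hw_loops(loops))  # ['filter']
```

A loop that is entered rarely, like "setup" here, cannot amortize its configuration time, which is why profiling information feeds the selection.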