
Optimised Synthesis of Asynchronous Elastic Dataflows by Leveraging Clocked EDA

Mahdi Jelodari Mamaghani, Jim Garside, Will Toms & Doug Edwards. Verona, Italy, 29th August 2014


  1. Optimised Synthesis of Asynchronous Elastic Dataflows by Leveraging Clocked EDA. Mahdi Jelodari Mamaghani, Jim Garside, Will Toms & Doug Edwards. Verona, Italy, 29th August 2014

  2. Motivation: Automatic GALSification of Control-driven Systems
     - Partitioning a control-driven system at the behavioural level is complicated
     - Detecting signal correspondence between the data path and the control path is error-prone
     - The presence of global control impedes the detection process [1,2]
     - Balsa, Petrify, VeriSyn & AVS: popular control-driven synthesis tools
     GCD, a data-dependent loop example: [GCD Data Path] [GCD Control Path]
     The Advanced Processor Technologies Research Group
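     To make the "data-dependent loop" point concrete, here is a minimal Python sketch of a GCD loop. It assumes the subtractive form of the algorithm; the slide does not show which variant the Balsa description uses, and the name `gcd_subtractive` is illustrative only. The loop condition and the branch (control path) both depend on the values being computed (data path), which is what makes separating the two difficult.

```python
from typing import Tuple


def gcd_subtractive(a: int, b: int) -> Tuple[int, int]:
    """Return (gcd, iterations) using repeated subtraction.

    Assumption: subtractive GCD; the original design may differ.
    """
    if a <= 0 or b <= 0:
        raise ValueError("both operands must be positive")
    iterations = 0
    while a != b:        # control: loop exit depends on the data values
        if a > b:        # control: branch choice depends on the data values
            a -= b       # data path: subtract the smaller from the larger
        else:
            b -= a
        iterations += 1
    return a, iterations


print(gcd_subtractive(67, 2))  # prints (1, 34)
```

     The iteration count is not known until the operands are, so the control behaviour cannot be scheduled statically.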

  3. Motivation: Automatic GALSification of Data-driven Systems
     - Concurrent dataflow specification of a system
     - Fine-grained dataflow synthesis
     - Automatically partitioning the system into multiple clocked islands

  4. Teak: Asynchronous Dataflow Backend for Balsa Language
     Released in 2010 as a dataflow syntax-directed synthesis backend for the Balsa language [ACSD’09].
     Some of the properties of Teak Dataflow Networks (TDNs):
     - Communication:
       - Point-to-point communication between computation blocks.
       - Slack elastic channels are capable of storing ‘any number’ of tokens.
     - Computational:
       - Macro-module style with separate Go and Done activation signals. These modules are chained in sequence or in parallel according to the source-level directives.
       - Dataflow which realises data-dependent computation.

  5. Teak: Behavioural Synthesis Flow
     [Figure: syntax-directed synthesis flow]

  6. Teak Model of Computation: Macro-modules
     - Single-input, single-output macro-modules connected by buffers
     - Control and data move along together through macro-modules
     Teak uses three hierarchical primitives to form a dataflow network:
     - Sequential
     - SteerMerge
     - Iterative

  7. Protocol: Conventional Synchronous vs. Elastic
     - In conventional synchronous systems, latency = 0: timing is aligned by inserting buffers in the post-synthesis stage [Synchronous]
     - In elastic systems, latency can vary: the system tolerates variations in latencies through handshaking [Asynchronous]
     - In synchronous elastic systems, latency is discretised by the system clock: a common timing discipline is introduced to the handshake [Synchronous Elastic]

  8. A Common Timing Discipline for Asynchronous Dataflow Networks of Teak
     The Synchronous Elastic* protocol is incorporated in the Teak flow as a common timing discipline:
     - Deterministic behaviour (bounded delays)
     - Simplified deadlock issue in the network
     - Smaller circuit area (~4 times)
     - Still preserves slack elasticity (any storage on links)
     - Improved power utilisation (clock gating + simple handshake)
     [Figure: SDF, Kahn Process Networks and CSP Networks placed between [Deterministic] and [non-Deterministic]]
     *Synchronous Elastic Flow (SELF) [3]
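     As a rough illustration of what a clocked handshake looks like, the sketch below models a SELF-style channel with a `valid` and a `stop` signal and a two-place elastic buffer stepped once per clock cycle. The valid/stop channel states (transfer, idle, retry) follow the SELF paper [3]; the class `ElasticBuffer`, the function `run` and the stall pattern are hypothetical names and choices made for this example, not part of Teak or eTeak.

```python
from collections import deque


def channel_state(valid: bool, stop: bool) -> str:
    """Classify one clock cycle on a valid/stop channel."""
    if not valid:
        return "idle"                      # no token offered
    return "retry" if stop else "transfer"  # offered and stalled / accepted


class ElasticBuffer:
    """Two-place FIFO stepped once per clock cycle (illustrative model)."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.tokens = deque()

    def step(self, in_valid, in_data, out_stop):
        """Advance one cycle; return (in_stop, out_valid, out_data)."""
        # Handshake outputs are derived from the state at the start of the cycle.
        in_stop = len(self.tokens) == self.capacity
        out_valid = len(self.tokens) > 0
        out_data = self.tokens[0] if out_valid else None
        if out_valid and not out_stop:     # transfer on the output channel
            self.tokens.popleft()
        if in_valid and not in_stop:       # transfer on the input channel
            self.tokens.append(in_data)
        return in_stop, out_valid, out_data


def run(values, stall_cycles):
    """Push values through the buffer while the consumer stalls on some cycles."""
    buf, pending, received, cycle = ElasticBuffer(), deque(values), [], 0
    while len(received) < len(values):
        in_valid = bool(pending)
        in_data = pending[0] if in_valid else None
        out_stop = cycle in stall_cycles
        in_stop, out_valid, out_data = buf.step(in_valid, in_data, out_stop)
        if channel_state(in_valid, in_stop) == "transfer":
            pending.popleft()
        if channel_state(out_valid, out_stop) == "transfer":
            received.append(out_data)
        cycle += 1
    return received, cycle


received, cycles = run([1, 2, 3, 4], stall_cycles={1, 2, 3})
print(received, cycles)
```

     Stalling the consumer changes how many cycles the run takes but not the order or number of tokens delivered, which is the sense in which latency is discretised by the clock while elasticity is kept.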

  9. Correctness in Asynchronous Dataflow Networks of Teak
     - Variables in dataflow networks: single write / multiple read
     - A variable provides a place for data tokens, so >2 latches are needed to ensure deadlock freedom

  10. Correctness in SELF Adapted Networks
      - Variables in eTeak: elastic controllers with a pair of latches operating at opposite clock phases
      - Operations: a write takes 1 cycle and a read takes 0 cycles
      - Each variable provides two places for data tokens
      - Loops with write/read operations do not need extra latches
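      The timing rule above (a write takes one cycle, a read takes zero) can be pictured with a small Python model. This is only an abstraction under stated assumptions: the class `ElasticVariable`, its two fields and the `tick` method are hypothetical, and the real eTeak controller logic is not shown in the slides.

```python
class ElasticVariable:
    """Abstract model of a variable built from two storage places.

    Assumption: one place holds the value visible to readers, the other
    captures a pending write that becomes visible one clock cycle later.
    """

    def __init__(self, initial=0):
        self._visible = initial   # place read by consumers (0-cycle read)
        self._pending = initial   # place capturing the incoming write
        self._has_pending = False

    def write(self, value):
        """Start a write; it completes at the next tick (1-cycle write)."""
        self._pending = value
        self._has_pending = True

    def read(self):
        """Return the currently visible value without consuming a cycle."""
        return self._visible

    def tick(self):
        """Advance one clock cycle, committing any pending write."""
        if self._has_pending:
            self._visible = self._pending
            self._has_pending = False


# A read/modify/write loop on a single variable, one iteration per cycle.
acc = ElasticVariable(0)
for i in range(1, 6):
    acc.write(acc.read() + i)   # read old value, schedule new value
    assert acc.read() == sum(range(1, i))  # still the old value this cycle
    acc.tick()
print(acc.read())  # prints 15
```

      Because the variable already holds both the old and the new value within a cycle, the loop in this model runs without any additional storage on the feedback path.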

  11. Synchronous Crystallisation and Re-synthesis
      - Synchronous Crystallisation: regional transformation of a dataflow into a synchronous control-driven circuit through re-synthesis
      - The candidates for Crystallisation are selected based on their physical characteristics (e.g. critical path)
      - Synthesis at system level enables us to rapidly explore different trade-offs between power, performance and area

  12. Crystallisation: Through RTL Transformation
      By extracting the occurrence graph and detecting concurrent dataflows within the Teak network.
      What we achieve by this transformation:
      - Locally synchronous: deterministic behaviour, reduced fine-grained communication overhead
      - Easier modelling and partitioning towards GALSification
      - Use the power of clocked EDA to re-synthesise
      - Pipelined structures: better throughput

  13. Elastic to RTL Transformation: The Algorithm, Case A
      When Root is a Fork and MM1 / MM2 are independent:
          always @ (posedge CLK) : FSM_A1
              Out_1 <= φ1 (A, B)
          always @ (posedge CLK) : FSM_A2
              Out_2 <= φ2 (A, B)
          assign Out = Join (Out_1, Out_2)
      [Figure: A, B → Root → MM1 (φ1), MM2 (φ2) → Sink → Out]
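      A behavioural reading of Case A can be sketched in Python. The functions `phi1` and `phi2` are placeholders chosen for the example (the slides leave φ1 and φ2 abstract), and modelling `Join` as pairing the two results is an assumption.

```python
def phi1(a, b):
    return a + b          # hypothetical stand-in for φ1


def phi2(a, b):
    return a ^ b          # hypothetical stand-in for φ2


class CaseA:
    """Two independent registers updated on the same clock edge."""

    def __init__(self):
        self.out_1 = None
        self.out_2 = None

    def posedge(self, a, b):
        # Both right-hand sides are evaluated before either register
        # changes, mirroring two non-blocking assignments.
        self.out_1, self.out_2 = phi1(a, b), phi2(a, b)

    @property
    def out(self):
        # Join is modelled as pairing the two results (assumption).
        return (self.out_1, self.out_2)


fsm = CaseA()
fsm.posedge(6, 3)
print(fsm.out)  # prints (9, 5)
```

      Since neither result depends on the other, both are available after a single clock edge.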

  14. Elastic to RTL Transformation: The Algorithm, Case B
      When Root is a Fork and MM1 / MM2 are dependent:
          always @ (posedge CLK) : FSM_B
              State1: Out_temp <= φ1 (A, B)
              State2: Out_2 <= φ2 (A, B, Out_temp)
          assign Out = Out_2
      [Figure: A, B → Root → MM1 (φ1) → MM2 (φ2) → Sink → Out]
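      Case B serialises the two macro-modules into one state machine. The Python sketch below is a behavioural model only: `phi1` and `phi2` are placeholder functions, and the return from State2 to State1 is an assumption, since the slide does not show the state transitions.

```python
def phi1(a, b):
    return a + b              # hypothetical stand-in for φ1


def phi2(a, b, temp):
    return (a - b) * temp     # hypothetical stand-in for φ2


class CaseB:
    """Single FSM: MM2 consumes the result of MM1, so it runs one state later."""

    def __init__(self):
        self.state = "State1"
        self.out_temp = None
        self.out_2 = None

    def posedge(self, a, b):
        if self.state == "State1":
            self.out_temp = phi1(a, b)
            self.state = "State2"
        else:  # State2
            self.out_2 = phi2(a, b, self.out_temp)
            self.state = "State1"   # assumed wrap-around

    @property
    def out(self):
        return self.out_2


fsm = CaseB()
fsm.posedge(6, 3)   # State1: out_temp = 9
fsm.posedge(6, 3)   # State2: out_2 = (6 - 3) * 9
print(fsm.out)      # prints 27
```

      Compared with Case A, the dependency costs one extra clock cycle before `Out` is valid.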

  15. Elastic to RTL Transformation: The Algorithm, Case C
      When Root is a Splitter/Steer:
          always @ (posedge CLK) : FSM_C
              State_Root: Case (A, B)
                  1: State1
                  2: State2
              State1: Out_1 <= φ1 (A, B)
              State2: Out_2 <= φ2 (A, B)
          assign Out = Merge (Out_1, Out_2)
      [Figure: A, B → Root → MM1 (φ1) or MM2 (φ2) → Sink → Out]
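      Case C can be read as a state machine whose root state steers to exactly one branch. In the Python sketch below the steering condition `select`, the functions `phi1` and `phi2`, the return to `State_Root` and the interpretation of `Merge` as "the branch that fired most recently" are all assumptions made for illustration.

```python
def select(a, b):
    return 1 if a >= b else 2   # hypothetical steering condition


def phi1(a, b):
    return a + b                # hypothetical stand-in for φ1


def phi2(a, b):
    return b - a                # hypothetical stand-in for φ2


class CaseC:
    """Single FSM: the root state picks one of two mutually exclusive branches."""

    def __init__(self):
        self.state = "State_Root"
        self.out_1 = None
        self.out_2 = None
        self.last = None        # which branch produced the latest result

    def posedge(self, a, b):
        if self.state == "State_Root":
            self.state = "State1" if select(a, b) == 1 else "State2"
        elif self.state == "State1":
            self.out_1, self.last = phi1(a, b), 1
            self.state = "State_Root"   # assumed return to the root
        else:  # State2
            self.out_2, self.last = phi2(a, b), 2
            self.state = "State_Root"

    @property
    def out(self):
        # Merge modelled as forwarding the most recently produced result.
        return {None: None, 1: self.out_1, 2: self.out_2}[self.last]


fsm = CaseC()
for a, b in [(6, 3), (6, 3)]:   # steer, then execute branch 1
    fsm.posedge(a, b)
print(fsm.out)                  # prints 9
```

      Only one branch register is written per traversal, so the merge never has to arbitrate between two simultaneous results in this model.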

  16. RTL Transformation for the Shifter Example
      In this example the Root is a Splitter (Case C) while the macro-modules are dependent (Case B); therefore the whole structure is transformed into a single FSM.

  17. eTeak Snapshot: Visual Crystallised Partitions

  18. Async. vs. Sync. Elastic: Area Cost
      - Case study: SSEM, a three-stage iterative processor implemented in Balsa
      - Deadlock-free design: Async. (65 buffers) vs. Sync. Elastic (6 buffers)
      - The slack elastic property is preserved
      [Figure: bar chart of area cost (axis 0 to 60000) for Synchronous Elastic and Asynchronous, broken down into F-J-M-S, Variables, Subtracter and Latch]

  19. Asynchronous vs. Synchronous Elastic SSEM
      - Application: GCD (67, 2): 250 instructions
      - Slack matching can potentially improve the performance by a factor of 3
      [Figure: bar chart of 1/Throughput and Area (axis 0 to 160) for five configurations: Synchronous Elastic*, Synchronous Elastic, Solid Synchronous, Asynchronous*, Asynchronous; clock frequencies shown: f = 1.250 GHz, f = 435 MHz, f = 1.1 GHz]
      Total Cell Area (k*mm^2): 68.41, 47.447, 56.183, 12.563, 7.723
      Exec. Time (10*ms): 40.61, 147.47, 46.5, 62.04, 16.438
      *Fully buffered to demonstrate the slack elastic property

  20. Summary & Future Work
      Summary:
      - A framework for exploring GALSification: an extension to the Teak EDA flow which provides a framework for exploring GALSification techniques and behavioural partitioning
      - A re-synthesis mechanism to exploit synchronous EDA: the synchronous elastic protocol is used to move from the asynchronous domain to the synchronous domain, where synchronous EDA tools can be leveraged to improve the circuits
      Future work:
      - Automatically partitioning the system into multiple clock domains: running the re-synthesised structures at different clock frequencies based on their behaviour

  21. References
      [1] Wei Song, Jim D. Garside, Doug Edwards: "Automatic data path extraction in large-scale register-transfer level designs". ISCAS 2014: 377-380
      [2] Wei Song, Jim D. Garside: "Automatic Controller Detection for Large Scale RTL Designs". DSD 2013: 844-851
      [3] Jordi Cortadella, Mike Kishinevsky, Bill Grundmann: "SELF: Specification and design of a synchronous elastic architecture for DSM systems". TAU 2006: Handouts of the International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems. 2006.
