  1. Concurrency-Enhancing Transformations for Asynchronous Behavioral Specifications: A Data-Driven Approach. John Hansen and Montek Singh, University of North Carolina, Chapel Hill, NC, USA.

  2. Introduction: Motivation
     Most high-level asynchronous tools are syntax-directed (Haste/Balsa). These tools are inadequate for designing high-speed circuits: a straightforward spec yields a slow circuit, and fast circuits require significant manual effort. We need better tool support!

     Straightforward spec (~10 lines):

       &MAIN : main proc (IN? chan <<byte, byte, byte, byte, byte, byte>> & OUT! chan byte).
       begin
         a, b, c, d, e, f, g, h, i, j, k : var byte
       |
         forever do
           IN?<<a, b, c, d, e, f>>;
           g := a * b;
           h := c * d;
           i := e * f;
           j := g + h;
           k := i * j;
           OUT!k
         od
       end

     Hand-pipelined spec of the same function (~100 lines): the computation is split across channels ContextCHAN1..ContextCHAN5 (carrying five-, four-, three-, two-, and one-byte tuples) and processes contextproc1..contextproc6, each reading a "context" tuple, doing one step, and forwarding a smaller context:

       &ContextCHAN1 : chan <<byte, byte, byte, byte, byte>>
       &ContextCHAN2 : chan <<byte, byte, byte, byte>>
       &ContextCHAN3 : chan <<byte, byte, byte>>
       &ContextCHAN4 : chan <<byte, byte>>
       &ContextCHAN5 : chan <<byte>>
     |
       ( contextproc1(IN, ContextCHAN1)          || contextproc2(ContextCHAN1, ContextCHAN2) ||
         contextproc3(ContextCHAN2, ContextCHAN3) || contextproc4(ContextCHAN3, ContextCHAN4) ||
         contextproc5(ContextCHAN4, ContextCHAN5) || contextproc6(ContextCHAN5, OUT) )

       &contextproc1 = proc (IN? chan <<byte, ...>> & OUT! chan <<byte, ...>>).
       begin
         context : var <<a: byte, b: byte, c: byte, d: byte, ...>>
       |
         forever do IN?context; OUT!<<c, d, ...>> od
       end
       ... (contextproc2 through contextproc6 follow the same pattern)

  3. Our Contribution: a "source-to-source compiler" that rewrites specs to enhance concurrency. It is fully automated and integrated into the Haste flow, with an arsenal of several powerful optimizations: parallelization, pipelining, arithmetic optimization, and communication optimization. Benefits: up to 59x speedup (throughput) of the implementation, and up to 290x speedup with arithmetic optimization; alternatively, it reduces design effort by up to 95% (lines of code). With our method, high performance comes with low design effort; without it, high performance requires significant effort!

  4. Our Contribution: our tool is integrated as a "preprocessor" to the Haste compiler, so it leverages Haste compilation and its backend. Flow: Behavioral Spec ➜ (Parallelize, Pipeline, Arithmetic Opt., Communication Opt.) ➜ Compiler ➜ Handshake Graph ➜ TechMap ➜ Netlist, where everything from the Compiler onward is the original Haste flow.

  5. Our Contribution: four concurrency-enhancing optimizations, illustrated on a small block that reads X?<<a,b,c,d>>, computes e:=a+b; f:=c+d; g:=f+1; h:=g*2; k:=e*f*g*h, and writes Y!k and Z!e:
     - Parallelization: remove unnecessary sequencing, e.g. e:=a+b; f:=c+d becomes e:=a+b || f:=c+d.
     - Pipelining: allow overlapped execution of successive input sets.
     - Arithmetic optimization: decompose/restructure long-latency operations, e.g. k:=e*f*g*h becomes k:=(e*f)*(g*h) (a sketch follows below).
     - Channel communication optimization: re-order channel actions (such as the Z!e write) for increased concurrency.
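     For the arithmetic optimization, here is a minimal Python sketch of the tree-height reduction named above, assuming a chain of a single associative operator; the function name rebalance is illustrative only, not the authors' pass:

       def rebalance(operands, op="*"):
           # Re-parenthesize an associative chain into a balanced tree,
           # e.g. e*f*g*h -> (e*f)*(g*h), halving the multiply chain's depth.
           if len(operands) == 1:
               return operands[0]
           mid = len(operands) // 2
           return ("(" + rebalance(operands[:mid], op) + " " + op + " "
                   + rebalance(operands[mid:], op) + ")")

       print(rebalance(["e", "f", "g", "h"]))   # -> ((e * f) * (g * h))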

  6. Our Contribution: benefits of automatic code rewriting.
     - Eases the burden on the designer: allows focus on functionality instead of performance; greater readability ➜ less chance of bugs.
     - A step towards design space exploration: selectively apply optimizations where needed, based on a cost function (speed/energy/area).
     - Backwards compatible with legacy code: simply recompile for a high-speed implementation.
     Designer's code (the short loop): forever do IN?<<a,b,c,d,e,f>>; g:=a*b; h:=c*d; i:=e*f; j:=g+h; k:=i*j; OUT!k od
     Transformed code (generated): channel declarations ContextCHAN1..ContextCHAN5, the parallel composition contextproc1(IN, ContextCHAN1) || ... || contextproc6(ContextCHAN5, OUT), and one process per stage, e.g.
       &contextproc1 = proc (IN? chan ... & OUT! chan ...).
       begin
         context : var <<...>>
       |
         forever do IN?context; OUT!<<c, d, e, f, a * b>> od
       end

  7. Solution Domain: Class of Specifications
     Input domain: requires "slack-elastic" specifications, i.e. the spec must be tolerant of additional slack on channels; formally, deadlock-free with a restriction on probes, etc. [Manohar/Martin98].
     Output: produces "data-driven" specifications that are pipelined (data drives computation, not control-dominated). The top-level system topology, including cycles, is preserved; each module is replaced with a parallelized and pipelined version.
     Correctness model (slack elasticity): the spec maintains the original token order per channel; there are no guarantees about the relative token order across channels.

  8. Solution Domain: Target Architectures. The approach can handle arbitrary topologies, and it breaks down each module into smaller parts. (Figure: an example system of interconnected modules A through E, with each module decomposed.)

  9. Talk Outline: Previous Work and Background; Basic Approach; Advanced Techniques; Results; Conclusion.

  10. Previous Work
      - "Spatial Computation" [Budiu 03]: converts ANSI C programs to dataflow hardware, but the spec language has inherent limitations: it cannot model channel communication and has no fork-join style of concurrency.
      - Data-driven compilation [Taylor 08, Plana 05]: a new data-driven specification language with "push" instead of "pull" components; however, the designer must still be skillful at writing highly concurrent specs. Our approach effectively automates this by code rewriting.

  11. Previous Work (continued)
      - Peephole optimization / resynthesis [Chelcea/Nowick 02, Plana 05]: improves concurrency at the circuit and handshake levels, but does not target higher-level (system-wide) concurrency.
      - CHP specifications [Teifel 04, Wong 01]: translate CHP specs into pipelined implementations.
      - Balsa/Haste ⇄ CDFG conversion [Nielsen 04, Jensen 07]: the main goal is to leverage synchronous tools for resource sharing; only some peephole optimizations.

  12. Background: Haste Language. Key language constructs:
      - channel reads / writes: IN?x / OUT!y
      - assignments: a := expr
      - sequential / parallel composition: A ; B / A || B
      - conditionals: if C then X else Y fi
      - loops: forever do, for, while
      Example:
        &fifo = proc(IN? chan byte & OUT! chan byte).
        begin
          & x : var byte ff
        |
          forever do
            IN?x;
            x := x + 1;
            OUT!x
          od
        end

  13. Background: Haste Compilation. A syntax-directed design flow for rapid development: Behavioral Spec ➜ Compiler ➜ Handshake Graph ➜ TechMap ➜ Netlist. Example behavioral spec:
        &fifo = proc(IN? chan byte & OUT! chan byte).
        begin
          & x : var byte ff
        |
          forever do
            IN?x;
            OUT!x
          od
        end

  14. Background: Haste Limitations. Straightforward coding ➜ long critical cycles ➜ poor performance:
        forever do
          IN?a;
          b := f1(a);
          c := f2(b);
          d := f3(c);
          OUT!f4(d)
        od

  15. Talk Outline: Introduction; Background; Basic Approach; Advanced Techniques; Results; Conclusion.

  16. Basic Approach: Overview. A four-step method:
      1. Input a behavioral specification.
      2. Perform parallelization on statements.
      3. Create a pipeline stage for each group of parallel statements.
      4. Produce new code incorporating these optimizations.
      For the running example, the input
        proc(IN? chan byte & OUT! chan byte).
        forever do
          IN?a;
          b := a*2;
          c := b+5;
          d := a+b;
          e := c+d;
          f := d*3;
          g := f+e;
          OUT!g
        od
      is rewritten into a chain of pipeline stages:
        forever do (IN?a;       OUT!<<a, a*2>>)   od
        forever do (IN?<<a,b>>; OUT!<<b+5, a+b>>) od
        forever do (IN?<<c,d>>; OUT!<<c+d, d*3>>) od
        forever do (IN?<<e,f>>; OUT!<<e+f>>)      od
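      A minimal Python sketch of step 4, the code emission, assuming the stage contents shown above; the function emit_stage and the elided channel types ("...") are illustrative only, not the authors' code generator:

        def emit_stage(name, reads, exprs):
            # Print one pseudo-Haste stage: read a tuple, write the computed tuple.
            rd = reads[0] if len(reads) == 1 else "<<" + ", ".join(reads) + ">>"
            wr = "<<" + ", ".join(exprs) + ">>"
            return (f"&{name} = proc(IN? chan ... & OUT! chan ...).\n"
                    f"  forever do IN?{rd}; OUT!{wr} od")

        stages = [(["a"],       ["a", "a*2"]),
                  (["a", "b"],  ["b+5", "a+b"]),
                  (["c", "d"],  ["c+d", "d*3"]),
                  (["e", "f"],  ["e+f"])]
        for i, (reads, exprs) in enumerate(stages, 1):
            print(emit_stage(f"stage{i}", reads, exprs))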

  17. Parallelizing Transformation: increases instruction-level concurrency; statements are re-ordered or parallelized.
      Original example:
        proc(IN? chan byte & OUT! chan byte).
        forever do
          IN?a;
          b := a*2;
          c := b+5;
          d := a+b;
          e := c+d;
          f := d*3;
          g := f+e;
          OUT!g
        od
      After parallelization (reduced latency!):
        proc(IN? chan byte & OUT! chan byte).
        forever do
          IN?a;
          b := a*2;
          (c := b+5 || d := a+b);
          (e := c+d || f := d*3);
          g := f+e;
          OUT!g
        od

  18. Parallelizing Transformation: Algorithm (a sketch in Python follows).
      1. Generate a dependence graph.
      2. Perform a topological sort, grouping parallelizable statements.
      3. Sequence the parallel groupings.
      Applied to the running example, b:=a*2; c:=b+5; d:=a+b; e:=c+d; f:=d*3; g:=f+e becomes b:=a*2; (c:=b+5 || d:=a+b); (e:=c+d || f:=d*3); g:=f+e.
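      A minimal Python sketch of this algorithm, assuming each statement is given as a (target, read-variables) pair; parallel_groups is an illustrative name, not the authors' implementation. It records flow, anti, and output dependences, then emits statements level by level, where each level is one parallel group:

        def parallel_groups(stmts):
            # stmts: list of (target_variable, set_of_read_variables), in program order
            n = len(stmts)
            preds = {i: set() for i in range(n)}
            for i, (ti, ri) in enumerate(stmts):
                for j in range(i):
                    tj, rj = stmts[j]
                    # flow (j writes what i reads), anti (j reads what i writes),
                    # or output (both write the same variable) dependence
                    if tj in ri or ti in rj or ti == tj:
                        preds[i].add(j)
            groups, scheduled = [], set()
            while len(scheduled) < n:
                ready = [i for i in range(n)
                         if i not in scheduled and preds[i] <= scheduled]
                assert ready, "cyclic dependences must be collapsed first (slide 19)"
                groups.append(ready)          # these statements may run in parallel
                scheduled.update(ready)
            return groups

        # Example from this slide: b:=a*2; c:=b+5; d:=a+b; e:=c+d; f:=d*3; g:=f+e
        stmts = [("b", {"a"}), ("c", {"b"}), ("d", {"a", "b"}),
                 ("e", {"c", "d"}), ("f", {"d"}), ("g", {"f", "e"})]
        for level, grp in enumerate(parallel_groups(stmts), 1):
            print(level, [stmts[i][0] for i in grp])
        # prints: 1 ['b']   2 ['c', 'd']   3 ['e', 'f']   4 ['g']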

  19. Parallelizing: What About Cycles? Cycles are collapsed into atomic nodes, and parallelization is performed recursively (see the sketch below).
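      The slide only states the idea; here is a minimal Python sketch of the cycle-collapsing step, assuming the networkx library for strongly connected components. The function collapse_cycles and the example graph are illustrative, not the authors' implementation:

        import networkx as nx   # assumed available; any SCC routine would do

        def collapse_cycles(edges):
            g = nx.DiGraph(edges)
            dag = nx.condensation(g)    # one node per strongly connected component
            members = {n: sorted(dag.nodes[n]["members"]) for n in dag.nodes}
            return dag, members

        # Example: B -> C -> D -> B is a cycle, so B, C, D collapse into one atomic node.
        dag, members = collapse_cycles([("A", "B"), ("B", "C"), ("C", "D"),
                                        ("D", "B"), ("D", "E")])
        print(members)                  # e.g. {0: ['A'], 1: ['B', 'C', 'D'], 2: ['E']}
        print([members[n] for n in nx.topological_sort(dag)])
        # The condensation is a DAG, so the level-by-level grouping from slide 18
        # applies to it; the same procedure then recurses inside each collapsed node.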

  20. Pipelining Transformation: allows execution to overlap; control is distributed; throughput is increased. Each statement group becomes its own stage process, reading its inputs from the previous stage and forwarding whatever later stages still need (a sketch of this bookkeeping follows). For the running example:
        Stage1 (IN? chan byte & OUT! chan byte).
          forever do IN?a; OUT!<<a, a*2>> od
        Stage2 (IN? chan byte & OUT! chan byte).
          forever do IN?<<a,b>>; OUT!<<a, b, b+5>> od
        ...
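      A minimal Python sketch of the forwarding computation behind this transformation, assuming one statement per stage as on this slide; the helper stage_interfaces and its data layout are assumptions for illustration, not the authors' tool. For each stage boundary it computes the tuple of variables that must be passed along, i.e. everything a later stage (or the final OUT!g) still reads:

        def stage_interfaces(groups, final_reads):
            # groups: one list of (target, read_vars) per stage, in program order
            interfaces = []
            live = set(final_reads)
            for grp in reversed(groups):
                written = {t for t, _ in grp}
                read = set().union(*(r for _, r in grp))
                interfaces.append(sorted(live))     # tuple this stage sends downstream
                live = (live - written) | read      # tuple this stage needs from upstream
            interfaces.append(sorted(live))         # what the first stage reads from IN
            return list(reversed(interfaces))

        # One statement per stage: b:=a*2; c:=b+5; d:=a+b; e:=c+d; f:=d*3; g:=f+e
        groups = [[("b", {"a"})], [("c", {"b"})], [("d", {"a", "b"})],
                  [("e", {"c", "d"})], [("f", {"d"})], [("g", {"f", "e"})]]
        print(stage_interfaces(groups, final_reads={"g"}))
        # -> [['a'], ['a', 'b'], ['a', 'b', 'c'], ['c', 'd'], ['d', 'e'], ['e', 'f'], ['g']]
        # Stage1 thus writes <<a, a*2>> and Stage2 writes <<a, b, b+5>>, as shown above.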
