PICO: ASIC Synthesis from C
1. PICO: ASIC Synthesis from C
  Rob Schreiber, Shail Aditya, Bob Rau, Vinod Kathail, Scott Mahlke, Darren Cronquist, Mukund Sivaraman
  HP Labs, Palo Alto
  MPsoc Workshop, July 2002

2. Outline
  • What Can PICO Do for an SOC Designer?
  • The PICO System Design Hierarchy
  • From Sequential to Parallel Loop Nest
  • Parallel Loop Nest to Processor Design

3. PICO Overview
  Program In, Chip Out: a C program goes in; PICO performs architecture synthesis and compilation and emits VHDL for the processors; CAD tools then do logic synthesis and physical design.
  Program In --> IP Out

4. Using PICO
  • User provides the application, test data, and design-space limits.
  • User indicates the hot loop nests.
  • PICO creates a Pareto set of ASIP designs.
  • Each design has a customized VLIW with zero or more loop nests realized in hardware.
  • User selects the appropriate design for the SOC based on the area, power, and performance tradeoff.

5. PICO's ASIP Architecture
  (Block diagram: a general-purpose processor with a cache to global memory, controlling a systolic array with its own local memory.)

6. Hierarchical Design Frameworks

7. An Automated Design Template
  (Diagram: a functional specification and parameter ranges feed the SpaceWalker, which drives a Constructor, an Evaluator, and a Pareto Filter.)

8. Good Systems from Good Subsystems
  (Diagram: the VLIW, Cache, and NPA Pareto sets feed a System Constructor, a System Evaluator, and a System Pareto Filter.)

9. Design Space Exploration
  2.5 million systems specified; 3,145 systems considered (each compiled to estimate cycle count and synthesized to estimate area); 77 Pareto-optimal systems on the runs-per-second vs. area chart.

10. PICO GUI

11. Limiting the Design Space

12. Exploration

13. Pareto Optimal Machines: VLIW-only

14. Pareto Optimal Machines: All Systems
  (Chart: hybrid machines vs. VLIW-only machines.)

15. Systolic Design: Exploration
  (Chart: design points at 2 processors with II=1; 1 processor with II=1; 1 processor with II=2; 1 processor with II=8.)

16. Synthesis of a Non-Programmable, Application-Specific Accelerator: From Sequential Loop Nest to Parallel Loop Nest

17. Input Language
  • A perfect loop nest --> a systolic array
  • A sequence of nests --> a pipeline of arrays
  • Constant loop bounds
  • Dependence analysis must be feasible: no aliasing through pointers
  • Language extensions: #pragma bitsize x 12, #internal coeff
  (A hypothetical input kernel in this style is sketched below.)
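
To make the input style concrete, here is a minimal, hypothetical kernel of the kind PICO accepts: a perfect loop nest with constant bounds, no pointer aliasing, and a bit-width annotation in the spirit of the slide. The names (N, TAPS, coeff, x, y) and the exact pragma placement are illustrative, not taken from the PICO documentation.

    #define N    1024
    #define TAPS 16

    #pragma bitsize coeff 12             /* coefficients fit in 12 bits (annotation as on the slide) */
    short coeff[TAPS];
    short x[N + TAPS];
    short y[N];

    void fir(void)
    {
        for (int i = 0; i < N; i++)          /* constant loop bounds           */
            for (int j = 0; j < TAPS; j++)   /* perfect nest: a single body    */
                y[i] += coeff[j] * x[i + j]; /* affine references, no aliasing */
    }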

18. From C to VHDL
  Sequential C loop nest
  --> sequential loop nest, tiled and register-promoted
  --> iteration-scheduled, parallel loop nest
  --> function units and software-pipelined loop nest
  --> registers, interconnect, FUs, memory
  --> Verilog/VHDL design

19. From C to VHDL
  C program
  --> compiler front end (SUIF + Omega): tiles, schedules, maps, and transforms loops; eliminates loads/stores
  --> compiler back end (Elcor): optimizes, analyzes bitwidth, allocates function units, software-pipelines
  --> HDL synthesis: allocates registers and interconnect, builds the VHDL description of the processor
  --> Verilog/VHDL

20. What does it take to make this efficient?

21. The Memory Wall
  (Diagram: CPU and memory.)

22. Cache and Local Memory
  (Diagram: the CPU reaches global memory through a cache; the DSP/NPA has its own local memory.)

23. Goal of Code Transformation
  for each TILE {
      for (t = 0; t < Tfinal; t++) {
          forall processors p {
              X[t][p] = . . . Y[t-1][p+1] . . .
          }
      }
  }

24. Tiling the Iteration Space
  (Figure: the data touched by a tile vs. the computation it performs.)
  Volume/Surface = O(radius), so Computation/Footprint = Ω(radius).
  Choose the tile so that Computation/Footprint matches the CPU-to-memory ratio. A small tiling sketch follows.
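
As a concrete, deliberately simple illustration of tiling, here is a hand-tiled 1-D stencil in C. The kernel, the names, and the tile size are illustrative assumptions; the point is only that the tile size bounds the footprint that must be resident in local memory at once.

    #define N    4096
    #define TILE 256     /* chosen so one tile's footprint fits in local memory */

    void smooth(const int X[N], int Y[N])
    {
        /* the outer loop walks over tiles; the inner loop works within one tile */
        for (int jj = 1; jj < N - 1; jj += TILE)
            for (int j = jj; j < jj + TILE && j < N - 1; j++)
                Y[j] = (X[j - 1] + X[j] + X[j + 1]) / 3;   /* X is reused within the tile */
    }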

25. Load/Store Elimination
  • For affine array references, intermediate results are kept in registers.
  • For affine, read-only array references, data is routed through registers; no value is loaded more than once.
  (The effect is sketched below.)
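
The sketch below shows, in ordinary C, the effect these two rules have on a FIR kernel like the one used later in the talk: the running sum stays in a register until the inner loop finishes, and each x value is loaded exactly once and then passed along a register chain. This illustrates the idea; it is not the tool's actual output.

    #define N    1024
    #define TAPS 4

    void fir(short y[N], const short w[TAPS], const short x[N + TAPS])
    {
        short xr[TAPS];                       /* register chain holding recent x values   */
        for (int j = 0; j < TAPS; j++)
            xr[j] = x[j];                     /* prologue: the only loads of x[0..TAPS-1] */

        for (int i = 0; i < N; i++) {
            short acc = 0;                    /* y[i] accumulates in a register */
            for (int j = 0; j < TAPS; j++)
                acc += w[j] * xr[j];
            y[i] = acc;                       /* one store per output */

            for (int j = 0; j < TAPS - 1; j++)
                xr[j] = xr[j + 1];            /* shift the chain                    */
            xr[TAPS - 1] = x[i + TAPS];       /* one new load per iteration of i    */
        }
    }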

26. Tile Shapes
  • Big tiles --> more local memory.
  • Small tiles --> less reuse of data, more global-memory bandwidth.
  • Optimal tile: the smallest tile that does not oversubscribe memory bandwidth (see the balance condition below).
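
One way to state the selection rule, in the terms the earlier slides already use (our phrasing, not a formula from the deck): pick the smallest tile whose compute time covers its memory traffic,

    compute_cycles(tile)  >=  footprint(tile) / memory_bandwidth_in_words_per_cycle

so the accelerator never stalls waiting on global memory.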

27. Estimating the Footprint
  Affine array reference: X[i+j][2*j-3*k].
  How many integer points are in the affine image of a rectangular iteration space?
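
For a small iteration space the footprint can simply be enumerated, which makes the question concrete; the program below does exactly that for the reference on the slide (the edge length N is an arbitrary choice). The point of the following slides is to estimate this count without enumeration.

    #include <stdio.h>
    #include <string.h>

    #define N 8      /* edge of the cubic iteration space, chosen for illustration */

    int main(void)
    {
        /* i+j ranges over [0, 2N-2]; 2j-3k over [-(3N-3), 2N-2] (shifted below) */
        static char seen[2 * N - 1][5 * N - 4];
        int count = 0;

        memset(seen, 0, sizeof seen);
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    int a = i + j;
                    int b = 2 * j - 3 * k + (3 * N - 3);   /* shift to a non-negative index */
                    if (!seen[a][b]) { seen[a][b] = 1; count++; }
                }
        printf("footprint of X[i+j][2*j-3*k] over a %dx%dx%d space: %d elements\n",
               N, N, N, count);
        return 0;
    }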

28. Example: the Affine Image of an Iteration Space

29. Corrected Estimates
  • Published bounds on the size of the image of a Z-polytope are wrong.
  • Our corrections:
    - footprint = size of the iteration space for 1-1 mappings;
    - the mapping is 1-1 if the iteration space contains no integer null vector;
    - corrected bounds come from counting the iterations that differ by a null vector;
    - within 20 percent in practice.

30. Reindexing to Reduce Local Memory
  (Figure: the accessed array elements before and after reindexing into a compact block of local memory.)

31. Finding the Parallel Iteration Schedule
  (Diagram: the annotated dataflow graph, the number of processors, and the initiation interval feed the iteration scheduler, which produces a linear timing function.)
  • Processors: a mesh of processors is given.
  • Initiation interval (II): every processor starts an iteration periodically, with period equal to II (hardware pipelining).
  • Mapping: clusters of iterations are mapped to each processor.
  • Schedule: one iteration per processor every II cycles.
  • Honor data-dependence constraints.
  • Find the schedule via an efficient direct-search method.

32. Hardware/Software Pipelining
  for (i = 0; i < 100; i++)
      a[i] += b[i] * c[i];
  Per iteration: load b, load c, multiply, add, store. Successive iterations (i = 0, 1, 2, ...) start II cycles apart and overlap in time.
  Lower bounds on II: RecMII (recurrences) and ResMII (resources).
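
For reference, the two bounds are the standard modulo-scheduling quantities (textbook definitions, not something specific to this deck):

    ResMII = max over resource classes r of   ceil( ops_using(r) / units(r) )
    RecMII = max over dependence cycles c of  ceil( latency(c) / distance(c) )
    II    >= max(ResMII, RecMII)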

33. The Mapping of Iterations to Processors
  for (i = 0; i < 8; i++)
      for (j = 0; j < 4; j++)
          y[i] += w[j] * x[i-j];
  Iteration space: (8, 4). Mapping: proc(i, j) = j / 2. Cluster shape: (2). Two processors, p = 0 and p = 1.

34. A Tight Schedule: (i, j) --> 2i + 3j
  for (i = 0; i < 8; i++)
      for (j = 0; j < 4; j++)
          y[i] += w[j] * x[i-j];
  (Figure: the iteration space with each iteration labeled by its start time 2i + 3j; each processor executes its two rows of j, e.g. start times 0, 2, 4, ... for j = 0 and 3, 5, 7, ... for j = 1.)
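
A quick way to see why such a linear schedule is legal: every dependence with distance vector d must be delayed by at least its operation latency, i.e. tau . d must be large enough, with tau = (2, 3) here. Using the usual dependences of this kernel (the latencies themselves are not given on the slide, so take this only as the form of the check):

    reduction on y[i]   (distance d = (0, 1)):  tau . d = 3
    reuse of w[j]       (distance d = (1, 0)):  tau . d = 2
    reuse of x[i-j]     (distance d = (1, 1)):  tau . d = 5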

35. Tight Schedules: Prior Work (Darte/Delosme, Chen/Megson)
  • Given: iteration space, projection direction, linear schedule.
  • Determine: the allowed cluster shapes.
  • Tail wags dog!

36. Constructing the Schedule
  (Flowchart: the array spec. and the loop nest go through dependence analysis; then compute a bounding region, generate (lots of) tight schedules, test for correctness, estimate hardware cost, and select a schedule.)

37. Processor Synthesis
  Inputs: the loop and the II.
  • Optimize the loop body
  • Analyze bitwidth of all values
  • Allocate the function units
  • Map operations to function units
  • Schedule operations
  • Allocate registers and memory
  • Interconnect communicating elements
  Parallel, custom, designed to spec: EFFICIENT!

38. Bitwidth Analysis: Basic Idea
  (Diagram: an operation with inputs a, b and output c.)
  • Input information limits the amount of information that can be produced.
  • Opcode semantics relate input and output information.
  • Information required by consumers limits the amount that must be produced.
  (A small example follows.)
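
A toy C fragment showing both directions of the idea; the widths and the mask are invented for illustration and are not from the slides.

    unsigned char a, b;            /* inputs known to fit in 8 bits               */

    unsigned int f(void)
    {
        unsigned int c = a * b;    /* forward: an 8-bit x 8-bit product needs at  */
                                   /* most 16 bits (opcode semantics)             */
        return c & 0x0FFF;         /* backward: the consumer keeps only 12 bits,  */
                                   /* so the multiplier can be narrowed to 12     */
    }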

39. Optimal FU Allocation
  (Table: operation counts by type (+, -); candidate function-unit types (+, -, +/-) with their costs and allocated counts.)
  MILP: minimize cost subject to sufficient capacity.
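
Written out, the integer program has the following shape (our generic rendering, assuming each function unit accepts one operation per cycle; the slide does not spell this out):

    variables:  n(f)    = number of function units of type f
                x(t, f) = operations of type t assigned to FU type f, per iteration

    minimize    sum over f of  cost(f) * n(f)
    subject to  sum over f of  x(t, f)  =  count(t)      for every op type t
                sum over t of  x(t, f)  <=  II * n(f)    for every FU type f
                x(t, f) = 0 unless FU type f implements op type t
                n(f), x(t, f) integer and >= 0

For the data on the slide this is a tiny problem; the slide's point is that it can be solved exactly as an MILP.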

40. Allocation and Op Scheduling
  Find: the cheapest processor that achieves the required II on the loop. Given: the inner loop and the required II.
  (Flowchart: count the operations in the loop; preallocate function units from the FU library; modulo-schedule the operations; if the achieved II is at most the required II, done; otherwise reallocate and repeat.)

41. Conclusions
  • Accurate static analysis of memory bandwidth: optimal tiling.
  • Linear iteration scheduling: a solved problem.
  • Efficient datapath synthesis: a hard problem, good heuristics.
  • Automatic NPA synthesis is practical.
  • Automatic synthesis of full embedded systems is feasible, too.
