PICO: ASIC Synthesis from C
Rob Schreiber, Shail Aditya, Bob Rau, Vinod Kathail, Scott Mahlke, Darren Cronquist, Mukund Sivaraman
HP Labs, Palo Alto
R. Schreiber – MPsoc Workshop, July 2002

Outline
• What Can PICO Do for an SOC Designer?
• The PICO System Design Hierarchy
• From Sequential to Parallel Loop Nest
• Parallel Loop Nest to Processor Design

PICO overview
PICO = Program In, Chip Out (and Code Out)
Figure: Program → PICO Architecture Synthesis → VHDL for Processors, Compiler; CAD Tools (Logic Synthesis, Physical Design) → Chip; Compiler → Code
Program In --> IP Out

Using PICO
• User provides application, test data, and design space limits
• User indicates hot loop nests
• PICO creates a Pareto set of ASIP designs
• Each design has a customized VLIW with zero or more loop nests realized in HW
• User selects the appropriate design for the SOC based on the area, power, and performance tradeoff

PICO’s ASIP Architecture
Figure blocks: Cache, G.P. Processor, control, Global Memory, Local Memory, Systolic Array

Hierarchical Design Frameworks

An Automated Design Template
Inputs: Function Specification, Parameter Ranges
Components: SpaceWalker, Constructor, Evaluator, Pareto Filter

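The Pareto Filter stage keeps only the designs that no other design beats on every metric. The following is a minimal sketch of that idea, not PICO's implementation. It assumes each evaluated design is an (area, cycle count) pair where lower is better on both; the design names and numbers are hypothetical.

```python
from typing import Dict, Tuple

Metrics = Tuple[float, float]  # (area, cycle count); lower is better (assumption)


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_filter(designs: Dict[str, Metrics]) -> Dict[str, Metrics]:
    """Keep the designs that are not dominated by any other design."""
    return {
        name: m
        for name, m in designs.items()
        if not any(dominates(other, m) for other in designs.values())
    }


# Hypothetical evaluated designs: name -> (area, cycles).
designs = {
    "small": (1.0, 900.0),
    "medium": (2.0, 400.0),
    "wasteful": (2.5, 450.0),  # larger and slower than "medium"
    "large": (4.0, 150.0),
}
pareto = pareto_filter(designs)
print(sorted(pareto))
```

Here "wasteful" is dropped because "medium" is both smaller and faster; the other three each represent a different area/performance tradeoff and survive.
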
Good Systems from Good Subsystems
Figure: VLIW Pareto, Cache Pareto, NPA Pareto → System Constructor → System Evaluator → System Pareto Filter

Design Space Exploration
Figure: 2.5 million systems specified; 3,145 systems considered; 77 Pareto systems
Evaluation: Compile → Estimate Cycle Count (runs per second); Synthesize → Estimate Area (area)

PICO GUI

Limiting the Design Space

Exploration

Pareto Optimal Machines: VLIW-only

Pareto Optimal Machines: All Systems
Plot legend: Hybrid Machines, VLIW Machines

Systolic Design: Exploration
Design points shown: 2 Processors, II=1; 1 Processor, II=1; 1 Processor, II=2; 1 Processor, II=8

Synthesis of a Non-Programmable, Application-Specific Accelerator: From Sequential Loop Nest to Parallel Loop Nest

Input Language
• A perfect loop nest → A systolic array
• A sequence of nests → A pipeline of arrays
• Constant loop bounds
• Dependence analysis must be feasible:
  • No aliasing through pointers
• Language extensions
  • #pragma bitsize x 12
  • #internal coeff

From C to VHDL
Sequential C loop nest
→ Sequential loop nest, tiled and register promoted
→ Iteration scheduled, parallel loop nest
→ Function units and software pipelined loop nest
→ Registers, interconnect, FUs, memory
→ Verilog/VHDL Design

From C to VHDL
C program
→ Compiler front end (SUIF+Omega): tiles, schedules, maps, and transforms loops; eliminates loads/stores
→ Compiler back end (Elcor): optimizes, analyzes bitwidth, allocates function units, software pipelining
→ HDL Synthesis: allocates registers and interconnect; builds VHDL description of processor
→ Verilog/VHDL

What does it take to make this efficient?

The Memory Wall
Figure: CPU, Memory

Cache and Local Memory
Figure: CPU, Cache, Memory; DSP/NPA, Local Memory

Goal of Code Transformation
for each TILE {
  for (t = 0; t < Tfinal; t++) {
    forall processors p {
      X[t][p] = . . . Y[t-1][p+1] . . .
    }
  }
}

Tiling the Iteration Space
Figure labels: data, computation
Volume/Surface = O(radius)
Computation/Footprint = Ω(radius)
Computation/Footprint = CPU/Memory

Load/Store Elimination
• For affine array references, intermediate results are kept in registers
• For affine, read-only array references, data is routed through registers; no value is loaded more than once

Tile Shapes
Big tiles → more local memory
Small tiles → less reuse of data, more global memory bandwidth
Optimal tile → smallest tile that does not oversubscribe memory bandwidth

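The tradeoff can be illustrated with a toy model. The sketch below is not PICO's tiling analysis. It assumes a matrix-multiply-like kernel, where an r x r x r tile performs r^3 multiply-adds while touching three r x r array tiles, so computation grows with the tile volume and footprint with its surface. The bandwidth budget of 0.25 words per operation is a hypothetical figure.

```python
def matmul_tile(r: int):
    """Computation and footprint of an r x r x r tile (illustrative model)."""
    computation = r ** 3       # multiply-adds performed inside the tile
    footprint = 3 * r ** 2     # words touched: one r x r tile each of A, B, C
    return computation, footprint


def smallest_feasible_tile(words_per_op_budget: float, max_r: int = 1024):
    """Smallest r whose global-memory traffic per operation fits the budget."""
    for r in range(1, max_r + 1):
        computation, footprint = matmul_tile(r)
        if footprint <= words_per_op_budget * computation:
            return r
    return None  # no tile up to max_r fits the budget


r = smallest_feasible_tile(words_per_op_budget=0.25)
print(r, matmul_tile(r))
```

In this model the traffic per operation is 3/r, so halving the available bandwidth doubles the tile side and quadruples the local memory needed, which is why the smallest tile that fits the bandwidth is the one to pick.
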
Estimating the Footprint
Affine array reference: X[i+j][2*j-3*k]
How many integer points are in an affine image of a rectangular iteration space?

Example: the Affine Image of an Iteration Space

Corrected Estimates
• Published bounds on the size of the image of a Z-polytope are wrong
• Our corrections:
  - footprint = iteration space for 1-1 mappings
  - 1-1 if no integer null vector in the iteration space
  - corrected bounds from finding number of iterations that differ by a null vector
  - within 20 percent in practice

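The null-vector idea can be checked on the reference X[i+j][2*j-3*k] from the footprint slide. The sketch below covers only the special case where the integer null space of the index map is one-dimensional, and it is one reading of the correction rather than the authors' published bound. The null vector (-3, 3, 2) is derived here from i+j = 0 and 2j-3k = 0; the extents used are hypothetical.

```python
from itertools import product


def footprint_brute_force(extents):
    """Count distinct elements X[i+j][2*j-3*k] touched over a rectangular space."""
    ni, nj, nk = extents
    return len({(i + j, 2 * j - 3 * k)
                for i, j, k in product(range(ni), range(nj), range(nk))})


def footprint_from_null_vector(extents, null_vector=(-3, 3, 2)):
    """Iteration count minus the iterations whose predecessor along the null
    vector is also in the space (each such iteration repeats an element)."""
    total = 1
    repeats = 1
    for n, v in zip(extents, null_vector):
        total *= n
        repeats *= max(0, n - abs(v))
    return total - repeats


for extents in [(3, 3, 2), (4, 4, 3), (8, 8, 8)]:
    print(extents, footprint_brute_force(extents),
          footprint_from_null_vector(extents))
```

For extents (3, 3, 2) no two iterations differ by the null vector, so the map is 1-1 and the footprint equals the 18 iterations. For (4, 4, 3) exactly one pair collides, giving 47 distinct elements out of 48 iterations.
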
Reindexing to Reduce Local Memory
(Figure: grids of marked array elements; the layout is not recoverable from the extracted text)

Finding the Parallel Iteration Schedule
Figure: Annotated Dataflow Graph, number of procs, initiation interval → Iteration Scheduler → Linear Timing Function
• Processors: a mesh of processors is given
• Initiation Interval (II): every processor starts an iteration periodically with period equal to II (hardware pipelining)
• Mapping: clusters of iterations are mapped to each processor
• Schedule: one iteration per processor every II cycles
• Honor data dependence constraints
• Find the schedule via an efficient direct search method

Hardware/Software Pipelining
for (i=0; i < 100; i++) a[i] += b[i]*c[i];
Figure (time axis; successive iterations start II cycles apart):
  i=0: ld b, ld c, mpy, add, str
  i=1: ld b, ld c, mpy, add, str
  i=2: ld b, ld c, mpy, ...
Lower Bounds on II (RecMII, ResMII)

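The two lower bounds can be computed with the standard modulo-scheduling definitions: ResMII is set by the busiest resource class, and RecMII by the slowest dependence cycle. The sketch below counts the operations shown in the slide's diagram (ld b, ld c, mpy, add, str). The machine configurations and the recurrence latency are hypothetical, and the helper names are illustrative.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource-constrained lower bound: the busiest resource class sets the pace."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)


def rec_mii(recurrences):
    """Recurrence-constrained lower bound over dependence cycles, each given as
    (total latency in cycles, distance in iterations). No cycles means II >= 1."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


# Operations per iteration, as drawn in the slide's diagram.
ops = {"mem": 3, "mpy": 1, "add": 1}   # ld b, ld c, str / mpy / add

one_port = {"mem": 1, "mpy": 1, "add": 1}   # hypothetical machine
two_ports = {"mem": 2, "mpy": 1, "add": 1}  # hypothetical machine

# The diagram shows no loop-carried dependence cycle, so the list is empty.
mii_one_port = max(res_mii(ops, one_port), rec_mii([]))
mii_two_ports = max(res_mii(ops, two_ports), rec_mii([]))
print(mii_one_port, mii_two_ports)
```

With a single memory port the three memory operations force II >= 3; a second port lowers the bound to 2. A loop with a carried dependence, for example an accumulation with a 2-cycle add and distance 1, would give `rec_mii([(2, 1)]) == 2` regardless of how many units are added.
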
The Mapping of Iterations to Processors
for (i = 0; i < 8; i++)
  for (j = 0; j < 4; j++) {
    y[i] += w[j] * x[i-j];
  }
Iteration Space: (8,4)
Mapping: proc(i,j) = j / 2
Cluster shape = (2)
Figure: iteration space with axes i and j, divided between processors p=0 and p=1

A Tight Schedule: (i,j) --> 2i+3j
for (i = 0; i < 8; i++)
  for (j = 0; j < 4; j++) {
    y[i] += w[j] * x[i-j];
  }
Figure: start time 2i+3j of each iteration (i across, j up), divided between processors p=0 and p=1
  j=3:   9 11 13 15 17 19 21 23
  j=2:   6  8 10 12 14 16 18 20
  j=1:   3  5  7  9 11 13 15 17
  j=0:   0  2  4  6  8 10 12 14

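The schedule on this slide can be checked mechanically. The sketch below regenerates the time table from 2i+3j, then verifies that with the mapping proc(i,j) = j / 2 no processor is asked to start two iterations in the same cycle. The dependence check is an assumption: it considers only the accumulation into y[i], so that iteration (i, j) must follow (i, j-1).

```python
from collections import defaultdict

N_I, N_J = 8, 4                  # iteration space (8, 4) from the slide


def proc(i, j):
    return j // 2                # mapping from the slide: clusters of 2 along j


def start_time(i, j):
    return 2 * i + 3 * j         # linear schedule from the slide


# Rows of the slide's figure, from j = 3 down to j = 0.
table = [[start_time(i, j) for i in range(N_I)] for j in reversed(range(N_J))]

# Tightness check: collect the iterations landing on each (processor, cycle).
slots = defaultdict(list)
for i in range(N_I):
    for j in range(N_J):
        slots[(proc(i, j), start_time(i, j))].append((i, j))
conflicts = {k: v for k, v in slots.items() if len(v) > 1}

# Dependence check (assumption: only the y[i] accumulation is considered).
ordered = all(start_time(i, j) > start_time(i, j - 1)
              for i in range(N_I) for j in range(1, N_J))

for row in table:
    print(row)
print("conflicts:", conflicts, "dependences honored:", ordered)
```

The two rows owned by each processor land on cycles of opposite parity, which is why a cluster of two iterations along j can share one processor at II = 1 without collisions.
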
Tight Schedules – Prior Work
Darte/Delosme, Chen/Megson.
• GIVEN: iteration space, projection direction, linear schedule
• DETERMINE: the allowed cluster shapes
• Tail Wags Dog!

Constructing the Schedule
Inputs: array spec., loop nest
Stages: Dependence Analysis; Bounding Region; Generate (lots of) Tight Schedules; Test for Correctness; Estimate Hardware Cost; Select Schedule

Processor Synthesis
Figure: loop, II → Processor Synthesis → Processor
• Optimize the loop body
• Analyze bitwidth of all values
• Allocate the function units
• Map operations to function units
• Schedule operations
• Allocate registers and memory
• Interconnect communicating elements
Parallel, custom, designed to spec: EFFICIENT!

Bitwidth analysis - basic idea
Figure: an operation with inputs a and b and output c
• Input information limits the amount of information that can be produced
• Opcode semantics relate input and output information
• Information required by consumers limits the amount that must be produced

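The three statements can be made concrete for simple arithmetic. The sketch below is a minimal illustration, not PICO's analysis. It assumes unsigned values and only two opcodes, and it relies on the fact that the low k bits of a sum or product depend only on the low bits of the operands. The function names and the 12-bit, 8-bit, and 16-bit widths are hypothetical.

```python
def forward_width(op, wa, wb):
    """Bits the operation can produce from unsigned inputs of wa and wb bits."""
    if op == "add":
        return max(wa, wb) + 1
    if op == "mul":
        return wa + wb
    raise ValueError(f"unsupported op: {op}")


def needed_width(op, wa, wb, consumer_bits):
    """Bits that must actually be produced: capped by what consumers read."""
    return min(forward_width(op, wa, wb), consumer_bits)


# Hypothetical example: c = a * b with a 12-bit a and an 8-bit b,
# where the only consumer stores c into a 16-bit location.
produced = forward_width("mul", 12, 8)
required = needed_width("mul", 12, 8, consumer_bits=16)
print(produced, required)
```

The inputs allow a 20-bit product, but the consumer reads only 16 bits, so a 16-bit multiplier output suffices in this example.
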
Optimal FU allocation
Table columns: FU type, FU count, FU cost; Operation type, Operation count
FU types: +, -, +/-; operation types: +, -
Values as extracted (row layout not recoverable): 2 1 10 3 + + 0 10 1 1 - - 1 1 13 +/-
MILP: minimize cost subject to sufficient capacity

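On a tiny instance the optimization can be mimicked by exhaustive search. The sketch below stands in for the MILP rather than reproducing it. It assumes each function unit offers II issue slots per iteration, that a combined +/- unit can execute either operation, and that unit costs are 10, 10, and 13. The operation counts and the II are hypothetical.

```python
from itertools import product
from math import ceil

COST = {"add": 10, "sub": 10, "addsub": 13}   # illustrative unit costs


def feasible(n_add, n_sub, n_addsub, adds, subs, ii):
    """Each unit offers ii issue slots per iteration; dedicated units are filled
    first and the overflow must fit on the combined +/- units."""
    overflow = max(0, adds - n_add * ii) + max(0, subs - n_sub * ii)
    return overflow <= n_addsub * ii


def cheapest_allocation(adds, subs, ii):
    """Exhaustive search standing in for the MILP on a tiny instance."""
    limit = ceil((adds + subs) / ii)          # never need more units than this
    best = None
    for counts in product(range(limit + 1), repeat=3):
        if feasible(*counts, adds, subs, ii):
            cost = sum(c * n for c, n in zip(COST.values(), counts))
            if best is None or (cost, counts) < best:
                best = (cost, counts)
    return best


# Hypothetical loop body: 3 additions and 1 subtraction, required II = 2.
best = cheapest_allocation(adds=3, subs=1, ii=2)
print(best)   # (cost, (adders, subtractors, combined +/- units))
```

For this instance one adder plus one combined unit is cheaper than any mix of dedicated units, because the combined unit absorbs both the third addition and the lone subtraction.
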
Allocation and Op Scheduling
Given: inner loop and II
Find: cheapest processor that achieves II on the loop
Figure: LOOP → Count operations → Preallocate (from f.u. library) → Modulo Schedule Operations → Achieved II <= Required II? Y: done; N: Reallocate and repeat

Conclusions
• Accurate static analysis of memory bandwidth – optimal tiling
• Linear iteration scheduling: solved problem
• Efficient datapath synthesis – a hard problem, good heuristics
• Automatic NPA synthesis is practical
• Automatic synthesis of full embedded systems is feasible, too