Regular Distributed Register Fabric and Synthesis for Multi-Cycle Communications
Jason Cong, Yiping Fan, Xun Yang and Zhiru Zhang
{cong, fanyp, yangxun, zhiruz}@cs.ucla.edu
Department of Computer Science
University of California, Los Angeles
Partially supported by NSF under award CCR-0096383, MARCO/DARPA GSRC, and Altera Corp. under the California MICRO program.
Outline
- Needs for Multi-Cycle On-Chip Communication
- Contributions
  - Regular Distributed Register (RDR) Architecture
  - MCAS: Architectural Synthesis for Multi-Cycle Communication
    • Scheduling-driven placement
    • Placement-driven rescheduling & rebinding
- Experimental Results
- Conclusions & Future Work
Needs for Multi-Cycle On-Chip Communication
- Interconnect delays dominate the timing in DSM technologies.
- Single-cycle full-chip synchronization is no longer possible.
- NTRS'97 0.07um technology
- 5 GHz across-chip clock
- 620 mm² (24.9mm x 24.9mm)
- IPEM BIWS estimations
- Buffer size: 100x
- Driver/receiver size: 100x
- From corner to corner: 7 clock cycles
[Figure: chip area divided into regions reachable in 1 clock through 7 clocks; axis ticks at 0, 7.52, 15.04, 22.56 and 24.9 (mm).]
Source: J. Cong, "Timing Closure Based on Physical Hierarchy," ISPD'02.
Multi-Cycle Interconnect Communication at Logic / Physical Level
- Simultaneous retiming + placement / floorplanning
  - [Cong et al, ICCAD'00] [Cong et al, DAC'03]
  - [Chong & Brayton, IWLS'01]
  - [Singh & Brown, FPGA'02]
- Limitation:
  - The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
    • In a loop, 4 logic cells, 2 registers
    • Cell delay = 1ns
    • Interconnect delay = 4ns
    • DR ratio = (D_logic + D_int) / #Registers = (4+16)/2 = 10ns
    • Clock cycle >= 10ns
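The DR-ratio bound can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: the function names `dr_ratio` and `min_clock_period` are hypothetical, and the loop is assumed to contain four interconnect segments of 4ns each, which is inferred from the total of 16ns in the slide's formula.

```python
def dr_ratio(cell_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay / register count."""
    if num_registers <= 0:
        raise ValueError("a loop must contain at least one register")
    return (sum(cell_delays_ns) + sum(wire_delays_ns)) / num_registers


def min_clock_period(loops):
    """Lower bound on the clock period: the largest DR ratio over all loops."""
    return max(dr_ratio(*loop) for loop in loops)


# Loop from the slide: 4 cells of 1ns, 2 registers.
# Assumption: 4 wire segments of 4ns each (gives D_int = 16ns).
slide_loop = ([1.0] * 4, [4.0] * 4, 2)
loop_ratio = dr_ratio(*slide_loop)

# A second, purely illustrative loop with a smaller ratio does not
# change the bound, because the bound is set by the worst loop.
illustrative_loop = ([1.0] * 2, [1.0] * 2, 2)
bound = min_clock_period([slide_loop, illustrative_loop])

print(f"DR ratio of slide loop: {loop_ratio} ns")
print(f"Clock period lower bound: {bound} ns")
```

Retiming can move the two registers around the loop but cannot change their number, which is why the ratio acts as a floor on the clock period.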
Our Contributions
- Regular Distributed Register (RDR) micro-architecture
  - Highly regular
  - Direct support of multi-cycle on-chip communication
- MCAS: Architectural Synthesis for Multi-cycle Communication
  - Integrated architectural synthesis (e.g. resource binding, scheduling) with physical planning
  - Targets the RDR architecture
Regular Distributed Register Architecture (1)
[Figure: array of islands of width W_i and height H_i joined by global interconnect. Each island is a cluster with an area constraint and contains a register file, an FSM and a Local Computational Cluster (LCC) with functional units such as MUL and ADD and MUXes.]
- Distribute registers to each "island"
- Choose the island size such that local computation and communication in each island can be done in a single cycle:
  $D_{intra\text{-}island} = D_{logic} + D_{opt\text{-}int} \le D_{logic} + 2\,D_{opt\text{-}int}(W_i + H_i) \le T$
Regular Distributed Register Architecture (2)
[Figure: array of islands of width W_i and height H_i joined by global interconnect. Each island contains an FSM, an LCC with functional units such as MUL and ADD and MUXes, and a register file split into banks labeled 1 cycle, 2 cycle, ..., k cycle.]
- Use register banks:
  - Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication in each island
- Highly regular
Example: Regular Distributed Register Architecture for 70nm Technology
- NTRS'97 70nm technology
- Chip dimension: 620 mm² (24.9mm x 24.9mm)
- 5 GHz across-chip clock
  • A signal can travel up to 7.52mm within 1 clock cycle under best interconnect optimization
  • 7 clock cycles are needed to cross the chip
- Each island base dimension
  • W_i = H_i = 2.08mm
  • ≈ 1/3 of the distance a wire can travel in 1 clock cycle
  • Logic volume: 6.76M min-size 2-NAND gates
- 12x12 array of islands
- Local registers are partitioned into 7 banks
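The array size and bank count on this slide follow from the chip and island dimensions. The sketch below reproduces that arithmetic under two assumptions that the slide does not state explicitly: crossing the chip is measured as the Manhattan distance between opposite corners, and one register bank is provided per possible communication latency. All constant names are hypothetical.

```python
import math

CHIP_SIDE_MM = 24.9         # chip is 24.9mm x 24.9mm
REACH_PER_CYCLE_MM = 7.52   # distance covered in one clock cycle
ISLAND_SIDE_MM = 2.08       # W_i = H_i

# Number of islands along one side of the chip.
islands_per_side = round(CHIP_SIDE_MM / ISLAND_SIDE_MM)

# Assumption: corner-to-corner wires are routed rectilinearly,
# so the distance is the Manhattan distance 2 * side.
corner_to_corner_mm = 2 * CHIP_SIDE_MM
cycles_to_cross = math.ceil(corner_to_corner_mm / REACH_PER_CYCLE_MM)

# Assumption: one bank for each latency from 1 to cycles_to_cross.
num_banks = cycles_to_cross

island_to_reach = ISLAND_SIDE_MM / REACH_PER_CYCLE_MM

print(f"array: {islands_per_side}x{islands_per_side} islands")
print(f"cycles to cross the chip: {cycles_to_cross}")
print(f"register banks per island: {num_banks}")
print(f"island side / one-cycle reach: {island_to_reach:.2f}")
```

Shrinking `ISLAND_SIDE_MM` in this sketch increases the island count without changing the number of cycles needed to cross the chip, since that number depends only on the chip size and the one-cycle reach.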
RDR Architecture vs. DRA
- Distributed Register File Architecture (DRA)
  - Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD'01]
- Similarities:
  - Distributes registers near the local computational units
  - Supports multi-cycle communication
  - Allows concurrent computation and communication
- Distinction:
  - The RDR architecture is highly regular
    • Facilitates interconnect delay estimation
    • Enables the systematic exploration of the cycle-time/latency tradeoff by varying the size of the basic island
Example: Impact of Interconnect on Scheduling
- Data flow graph (DFG) extracted from the discrete cosine transform (DCT)
- Nodes with the same color are assigned to the same functional unit.

resource    delay  num
multiplier  2 ns   2
alu         1 ns   2

[Figure: 12-node DFG (nodes 1, 5, 6, 9, 10 are "-"; node 2 is "+"; nodes 3, 4, 7, 8, 11, 12 are "*") next to a performance-driven placement on four islands, each with a register file and an LCC. Binding: Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 8, 11}, Mul2 = {3, 7, 12}. Short and long interconnects are marked, with delay labels of 1 ns and 2 ns.]
Single-cycle vs. Multi-cycle Interconnect Communication
[Figure: the DFG scheduled cycle by cycle under both schemes; a marker on the edges represents registers.]
- Single-cycle interconnect communication
  - Scheduled in 6 clock cycles
  - Clock period is 4ns
  - Total latency is 24ns
- Multi-cycle interconnect communication
  - Scheduled in 9 clock cycles
  - Clock period is 2ns
  - Total latency is 18ns
Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
[Figure: simultaneous placement and scheduling. Four islands, each with a register file, holding Mul2 = {3, 7, 12}, Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9} and Mul1 = {4, 8, 11}, next to the DFG scheduled over Cycle1 to Cycle8.]
- With placement integrated with scheduling, the critical path is reduced.
- The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns.
- The total latency is 16ns.
Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
[Figure: simultaneous placement, scheduling and binding. Four islands, each with a register file, holding Mul2 = {3, 7, 11}, Alu1 = {1, 5, 10}, Mul1 = {4, 8, 12} and Alu2 = {2, 6, 9}, next to the DFG scheduled over Cycle1 to Cycle7.]
- With placement integrated with scheduling and binding, the critical path is further reduced.
- The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns.
- The total latency is 14ns.
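The four schedules of the DCT example can be compared side by side, since total latency is simply the cycle count times the clock period. The sketch below uses only the cycle counts and clock periods quoted on the slides; the `Schedule` class and the labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    name: str
    cycles: int
    clock_period_ns: int

    @property
    def latency_ns(self) -> int:
        # Total latency = number of clock cycles * clock period.
        return self.cycles * self.clock_period_ns


# Cycle counts and clock periods as quoted on the slides.
SCHEDULES = [
    Schedule("single-cycle communication", 6, 4),
    Schedule("multi-cycle communication", 9, 2),
    Schedule("placement + scheduling", 8, 2),
    Schedule("placement + scheduling + binding", 7, 2),
]

baseline_ns = SCHEDULES[0].latency_ns
for s in SCHEDULES:
    saving_pct = 100 * (baseline_ns - s.latency_ns) / baseline_ns
    print(f"{s.name}: {s.cycles} cycles x {s.clock_period_ns} ns "
          f"= {s.latency_ns} ns ({saving_pct:.1f}% below single-cycle)")
```

The comparison makes the tradeoff visible: multi-cycle communication uses more cycles than the single-cycle schedule but halves the clock period, and each enhancement then removes one cycle at the same 2ns period.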