Day 2
VLSI Microprocessor Design Flow

Session A: Circuit design styles
Break
Session B: Design paths
Lunch
Session C: Verification
Break
Session D: Manufacture, fabrication testing, packaging

Today Organized Bottom-Up

Circuit design style
Full-custom design path
Standard cell design path
RTL design
Verification strategy
Packaging
Manufacture & testing

Important: real designs proceed at all levels simultaneously.

T0 Circuit Design Style

Datapaths and memories                     | Control logic
Full-custom layout                         | Standard cells
Regular structures                         | Irregular structures
Most of the die area                       | Most of the complexity
Few design bugs                            | Most of the design bugs
Mostly hand-specified procedural layout    | Placed and routed automatically
and routing (some hand layout and routing) |
Sometimes exotic circuit designs           | Conservative static CMOS circuits
(dynamic, self-timed)                      |

Typical design style for modern microprocessor

T0 Die Breakdown

[Figure: T0 die breakdown showing the Std. Cell and Full-Custom regions]

Global Design Style Decisions

Extremely important:
• Clock methodology and latch design
• Power, ground, and clock distribution

Must be settled early since these affect every circuit on the chip.

T0 Clock and Latch Style

Input clock signal at 2x on-chip frequency (e.g., 80MHz crystal for 40MHz Spert-II board), divided by 2 on-chip to guarantee 50% duty cycle.

Clock buffered up; last stage drives single clock grid across entire chip, <1ns skew across chip, <500ps rise/fall time.

Clock output pad to phase lock external circuitry to T0 clock.

TSPC dynamic latches (T0 has minimum operating frequency). Also, some special pseudo-static load-enabled latches.

Very similar to Alpha 21064 clocking strategy.

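The divide-by-2 point in the T0 Clock and Latch Style slide can be illustrated with a small behavioural sketch. It is not T0's circuit: it only assumes a divider that toggles on each rising edge of the 2x input, and the lopsided 30% input waveform and the function names are made up for illustration.

```python
def divide_by_two(clk_in):
    """Toggle the output on every rising edge of clk_in (a list of 0/1 samples)."""
    out = []
    q = 0
    prev = 0
    for sample in clk_in:
        if prev == 0 and sample == 1:  # rising edge of the 2x input clock
            q ^= 1
        out.append(q)
        prev = sample
    return out


def duty_cycle(samples):
    """Fraction of samples that are high."""
    return sum(samples) / len(samples)


# Hypothetical 2x input clock with a 30% duty cycle:
# each input period is 10 samples, 3 high then 7 low, repeated 8 times.
clk_2x = ([1] * 3 + [0] * 7) * 8
clk_1x = divide_by_two(clk_2x)

print("input duty cycle: ", duty_cycle(clk_2x))
print("output duty cycle:", duty_cycle(clk_1x))
```

Because the output changes only on rising input edges, its high and low times each last exactly one input period, so the divided clock has a 50% duty cycle whatever the shape of the input.
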
T0 Clock Distribution

[Figure: 2x clock input feeding the clock buffer, which drives the clock grid (in reality hundreds of wires) and the clock output]

T0 Latch Style

Standard-cell controller designed with edge-triggered flip-flops
• Only negative edge-triggered flip-flops
• Simpler for state machines
• Simplifies synthesis timing specification
• State stall handled with mux around flip-flop - no clock gating

Full-custom datapaths and memories used transparent latches
• p- and n-type latches transparent on clock low or high respectively
• Can steal time across clock cycle boundaries
• Can place latches in convenient place in signal flow to save area
• Simplifies double-cycling (used in vector register file, some buses)
• Special stallable n-latch (small area without clock gating)

Designed library of latches verified to operate across all process corners with clock skew/rise/fall spec, and when placed in series with other latches.

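The "mux around flip-flop" stall scheme from the T0 Latch Style slide can be sketched behaviourally as follows. This is a minimal model under stated assumptions, not the T0 cell: the class name, method name, and integer data values are hypothetical, and only the negative-edge capture and the recirculating mux are taken from the slide.

```python
class StallableFlipFlop:
    """Negative-edge-triggered flip-flop with a recirculating mux on its D input."""

    def __init__(self, reset_value=0):
        self.q = reset_value
        self._prev_clk = 0

    def tick(self, clk, d, stall):
        # Mux in front of the flop: recirculate the old value on a stall, else take d.
        next_d = self.q if stall else d
        # The clock is never gated: the flop captures on every falling edge.
        if self._prev_clk == 1 and clk == 0:
            self.q = next_d
        self._prev_clk = clk
        return self.q


ff = StallableFlipFlop()
trace = []
# (clk, d, stall) samples: load 5, stall while d changes to 7, then load 7.
for clk, d, stall in [(1, 5, False), (0, 5, False),
                      (1, 7, True), (0, 7, True),
                      (1, 7, False), (0, 7, False)]:
    trace.append(ff.tick(clk, d, stall))
print(trace)
```

The state is held across the stalled cycle even though the flop still sees a clock edge, which is the property that lets the clock stay ungated.
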
T0 Power/Ground Distribution

Half of all pins were power and ground (204/408)

Chip-on-board packaging gave low-inductance path to board (~1nH per wire)

Grid across whole chip in wide M1 and M2, strapped wherever possible. Required IR drop less than 5% of Vdd in middle of chip.

On-chip gate oxide decoupling capacitors placed everywhere possible, especially under power rails. Enough bypass capacitance for <5% power bounce, even if power/ground wires open circuit for one cycle.

T0 Power/Ground Distribution

[Figure: M1/M2 power grid, with labels: bypass cap. under power rails; every other pad is power or ground; additional bypass cap. in empty space]

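The two 5% budgets in the power/ground slide translate into simple back-of-the-envelope bounds. The sketch below assumes a lumped model (one average current, one capacitor, one resistance), a 5V supply, and a purely hypothetical 1A average current; only the 5% limits and the 40MHz clock come from the slides.

```python
def min_bypass_capacitance(avg_current_a, clock_hz, vdd_v, max_droop_fraction=0.05):
    """Capacitance that can supply one full cycle of charge within the droop budget."""
    charge_per_cycle = avg_current_a / clock_hz   # Q = I * T
    allowed_droop = max_droop_fraction * vdd_v    # dV budget
    return charge_per_cycle / allowed_droop       # C = Q / dV


def max_supply_resistance(avg_current_a, vdd_v, max_drop_fraction=0.05):
    """Largest lumped supply resistance that keeps the IR drop within budget."""
    return max_drop_fraction * vdd_v / avg_current_a


VDD_V = 5.0          # assumed supply voltage
CLOCK_HZ = 40e6      # on-chip clock quoted for the Spert-II board
AVG_CURRENT_A = 1.0  # hypothetical average current, for illustration only

cap_f = min_bypass_capacitance(AVG_CURRENT_A, CLOCK_HZ, VDD_V)
res_ohm = max_supply_resistance(AVG_CURRENT_A, VDD_V)
print(f"bypass capacitance >= {cap_f * 1e9:.0f} nF")
print(f"supply resistance  <= {res_ohm:.2f} ohm")
```

A real grid is distributed rather than lumped, so these figures only indicate the order of magnitude such a budget implies.
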
T0 Custom Memories

Instruction cache
• 1KB storage + tags + valid
• Classic 6T SRAM design
• One port: differential write (128b) or differential read (32b)
• 1 word line and 2 bit lines per bit cell
• Special wire to clear all valid bits in one cycle for cache flush
• Fast dynamic tag comparator built into tag sense amps - critical path

Scalar Register File
• 128B storage (32x4B registers)
• Three ports: one differential write plus two single-ended reads
• 3 word lines and 4 bit lines per bit cell

Vector Register File (trickiest piece of circuit design in T0)
• 2KB storage (16x32x4B registers)
• Eight ports: three differential writes on clock low, five single-ended reads on clock high
• Self-timed to generate all timing edges in one cycle
• 5 word lines and 6 bit lines per bit cell

T0 Datapath Design Style

Select datapath pitch, tradeoff between:
• wasted space for simple cells
• crunched inefficient design for complex cells

Vector unit has 72 λ bit pitch (late change from 80 λ to fit reticle). Scalar unit has 80 λ bit pitch.

Decide on metal layer assignments. Data busses in Metal 1, control/clock/Vdd/GND in Metal 2.

Roughly half of datapath bit pitch is used for busses passing by cell.

Design library of datapath cells (mostly latches and muxes). Special cells created where needed (maybe 5% are special).

Mostly static CMOS logic and static pass-transistor logic; some critical places use dynamic logic:
• Adder carry-chains
• Branch zero comparator
• Saturation overflow comparators

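The word-line and bit-line counts quoted for the custom memories follow a simple pattern, which the sketch below checks. The accounting rule is an assumption chosen to match the slide's numbers, not a statement of how the cells were drawn: each port needs one word line, a differential port needs two bit lines, a single-ended port needs one, and ports used on opposite clock phases share wires. The function and dictionary names are hypothetical.

```python
def lines_per_cell(phases):
    """Return (word_lines, bit_lines) for a bit cell.

    phases is a list of (differential_ports, single_ended_ports) tuples,
    one tuple per clock phase. Wires are shared between phases, so the
    cell needs the maximum over phases rather than the sum.
    """
    word_lines = 0
    bit_lines = 0
    for diff_ports, single_ended_ports in phases:
        word_lines = max(word_lines, diff_ports + single_ended_ports)
        bit_lines = max(bit_lines, 2 * diff_ports + single_ended_ports)
    return word_lines, bit_lines


memories = {
    # One differential port, used for either a write or a read.
    "instruction cache": [(1, 0)],
    # One differential write plus two single-ended reads in the same phase.
    "scalar register file": [(1, 2)],
    # Three differential writes on clock low, five single-ended reads on clock high.
    "vector register file": [(3, 0), (0, 5)],
}

for name, phases in memories.items():
    word_lines, bit_lines = lines_per_cell(phases)
    print(f"{name}: {word_lines} word lines, {bit_lines} bit lines")

# Storage sizes quoted on the slide, in bytes.
scalar_bytes = 32 * 4
vector_bytes = 16 * 32 * 4
print(scalar_bytes, vector_bytes)
```

Under this accounting the double-pumped vector register file provides eight ports with only five word lines and six bit lines per cell, which is where the area saving comes from.
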
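T0 Datapath Latch Designs

Latches mostly dynamic TSPC plus holders (a la 21064)

[Figure: transistor-level schematics of the p-latch and n-latch, with input D, clock PHI, internal node X, and output Q, annotated with transistor sizes]

Special Pseudo-Static n-Latch

[Figure: transistor-level schematic of the pseudo-static n-latch, with input D, clock PHI, load enables LEN and LENB, internal node X, and output Q, annotated with transistor sizes]

Restrictive enable control line timing caused problems later.
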
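T0 Datapath Mux Designs

Muxes: n-pass-transistor with level-restoring p-transistor

[Figure: 3-input mux schematic with data inputs A, B, C, select lines ASEL, BSEL, CSEL, and output OUT, annotated with transistor sizes]

Example Datapath Layout

[Figure: example datapath layout]
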
T0 Standard Cell Designs

Started with public domain library, but hand-inspected each cell and threw away/redesigned bad cells
• Some cells had too many series transistors or bad output driver

Changed every cell to have much wider power/ground rails
• To avoid IR drop in middle of long standard cell row

Added separate clock rail into every cell
• Fits into overall clock gridding scheme
• Ensures controlled skew on clock (don’t want clock auto-routed!)

Designed our own standard cell flip-flops and latches
• Connects to special clock rail - uses our clocking methodology
• Latches used to synchronize with datapath signals

Added greater variety of inverters and buffers
• Existing buffers not big enough to drive loads on our chip
• More flexibility for synthesis to trade area and delay

T0 Pads

Pad design is especially tricky

Many esoteric device structures used to provide protection against latch-up and ESD damage

Obtained HP’s design guidelines under NDA

Designed custom pads using most of HP’s recommendations for pad protection

Pad output drivers used n-type pullup to reduce power consumption - output only swings to ~4V not 5V

Separate power supply rings for output drivers and core logic

Summary

T0 circuit design mostly conservative, low risk

Robustness engineered into all cells and overall design

Only a few tricks where big wins possible:
• Fast dynamic datapath logic to shorten critical paths
• Double-pumped vector register file to save area
• Novel output drivers to reduce power

Day 2, Session B: Design Paths

Full-custom
Standard cell
Final global checks

Full-Custom Tools

Pre-existing tools used:
• Viewlogic schematic editor (commercial)
• Magic layout editor and extraction (university)
• HSpice circuit simulator (commercial)
• CAzM table-driven circuit simulator (university, now commercial)
• irsim switch-level simulator (university)
• gemini layout versus schematic compare (university)
• Dracula design rule checker (commercial)

In-house tools:
• flat, a SPICE netlist flattener/processor
• tilem, a procedural layout generator

Full-Custom Design Process

Initial specification with high-level schematic plus verbal communication (most full-custom work done before RTL finished)

Design loop:
• Viewlogic schematic design (functionality and transistor sizing)
• Timing simulations with HSpice
• Functionality simulations with irsim
• magic layout
• Extractions with magic (get real parasitics - feed back into schematic)

Iterate until design goals met. Clock cycle initially fixed at <50MHz to prevent over-optimization.

Example Viewlogic Schematic

[Figure: schematic of an I-Cache SRAM bit, with signals RSEL, IBIT, IBITB, BIT, and BITB, annotated with transistor sizes]

Example magic Layout

[Figure: layout of two halves of SRAM cache bits]

Standard Cell Design Path

Initial RTL (Register Transfer Level) in C++

Each RTL control block manually translated into BDS
• BDS, a limited, combinational-circuit-only hardware description language

bdsyn compiles BDS into blif (Berkeley Logic Interchange Format)

blif optimized and synthesized into gates using sis

Gate netlist input to TimberWolf place and route. Also, generate Viewlogic schematic from gate netlist.

RTL Model

RTL (Register Transfer Level) design in C++.

RTL model is “golden reference” for whole T0 design.

Models state in every latch on every clock phase.

Ran at 1,500 cycles/second on Sparcstation-20/61. 100-1000 times faster than Verilog or VHDL RTL model.

(More on RTL in next session)

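To give a feel for what the quoted RTL simulation rate means in practice, the arithmetic below estimates wall-clock time for a test run. The 10-million-cycle workload is a hypothetical figure chosen for illustration; only the 1,500 cycles/second rate and the 100-1000x slowdown range come from the slides.

```python
RTL_CYCLES_PER_SECOND = 1500   # C++ RTL model rate quoted on the slide
SLOWDOWN_RANGE = (100, 1000)   # quoted speed advantage over Verilog/VHDL RTL


def sim_hours(cycles, cycles_per_second):
    """Wall-clock hours needed to simulate the given number of cycles."""
    return cycles / cycles_per_second / 3600


workload_cycles = 10_000_000   # hypothetical test workload

cpp_hours = sim_hours(workload_cycles, RTL_CYCLES_PER_SECOND)
hdl_hours = [sim_hours(workload_cycles, RTL_CYCLES_PER_SECOND / s)
             for s in SLOWDOWN_RANGE]

print(f"C++ RTL model:  {cpp_hours:.1f} hours")
print(f"HDL RTL model:  {hdl_hours[0]:.0f} to {hdl_hours[1]:.0f} hours")
```

Under these assumptions a run that takes a couple of hours on the C++ model would take days to months on an HDL simulator running 100-1000 times slower.
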
BDS Blocks

C++ RTL control logic was manually split into about 20 blocks that the synthesis tool could handle (by trial and error).

Each control block manually translated into equivalent BDS.

Example BDS code (piece of JTAG block):

    routine run_tdo;
      state tdo<7:0>;
      if tapcin<3> then tdo = regioin
      else if iregin<3> then tdo = regioin
      else tdo = memioin;
      tdob = not tdo;
    endroutine;

Synthesis with sis

Each BDS block was translated into logic equations in blif

Also, had to create timing specs for each block.

Optimized and synthesized by sis (Berkeley synthesis package)

Two basic synthesis scripts created:
• target minimal area
• target minimal delay

Some critical blocks were tuned with own custom synthesis scripts.

Synthesis could sometimes take infinite time or infinite memory.
=> had to split blocks further or rewrite script.

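For readers unfamiliar with BDS, the run_tdo routine quoted in the BDS Blocks slide can be read as the following Python reference model. This is an interpretation, not the T0 source: it assumes that `name<3>` selects bit 3 of a signal, that `regioin` and `memioin` are 8 bits wide like `tdo`, and that `not` is a bitwise complement. The helper `bit` is hypothetical.

```python
def bit(value, index):
    """Return the given bit of an integer (assumed meaning of BDS name<index>)."""
    return (value >> index) & 1


def run_tdo(tapcin, iregin, regioin, memioin):
    """Combinational model of the run_tdo routine: returns (tdo, tdob)."""
    if bit(tapcin, 3):
        tdo = regioin
    elif bit(iregin, 3):
        tdo = regioin
    else:
        tdo = memioin
    tdo &= 0xFF          # tdo is declared as an 8-bit value, tdo<7:0>
    tdob = ~tdo & 0xFF   # bitwise complement, kept to 8 bits
    return tdo, tdob


# Bit 3 of tapcin set: the register path is selected.
print(run_tdo(0b1000, 0, 0xA5, 0x3C))
# Neither select bit set: the memory path is selected.
print(run_tdo(0, 0, 0xA5, 0x3C))
```

A purely combinational function like this maps directly onto what BDS can express, which is why each control block had to be reduced to this form before synthesis.
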