Flexible Hardware Design at Low Levels of Abstraction
Emil Axelsson
Hardware Description and Verification, May 2009
Why low-level?

Related question: Why is some software written in C? (But the difference between high- and low-level is much greater in hardware.)

Ideal: Software-like code → magic compiler → chip masks

    gadget a b = case a of
      2 -> thing (b+10)
      3 -> thing (b+20)
      _ -> fixNumber a
Why low-level?

Reality: "ASCII schematic" → chain of synthesis tools → chip masks

Iterate to improve timing, power, area, etc. This is very costly and time-consuming: each fabrication run costs ≈ $1,000,000.
Failing abstraction

A realistic flow cannot avoid low-level awareness.

Paradox: modern designs require a higher abstraction level ...but... modern chip technologies make abstraction harder.

Main problem: routing wires dominate signal delays and power consumption. Controlling the wires is key to performance!
Gate vs. wire delay under scaling

[Figure: relative delay vs. process technology node (nm) — wire delay grows relative to gate delay as feature sizes shrink.]
Physical design level

Certain high-performance components (e.g. arithmetic) need to be designed at an even lower level.

Physical level:
- A set of connected standard cells (implemented gates)
- Absolute or relative positions of cells (placement)
- Shape of connecting wires (routing)
Physical design level

Design by interfacing to physical CAD tools:
- Call automatic tools for certain tasks (mainly routing)
- Often done through scripting code
- Tedious; hard to explore the design space; limited design reuse

Aim of this work: raise the abstraction level of physical design!
Two ways to raise abstraction

Automatic synthesis:
+ Powerful abstraction
– May not be optimal for e.g. high-performance arithmetic
– Opaque (hard to control the result)
– Unstable (heuristics-based)

Language-based techniques (higher-order functions, recursion, etc.) ← our approach:
+ Transparent, stable
– Still quite low-level
– Somewhat limited to regular circuits
Lava

Gate-level hardware description in Haskell:
- Parameterized module generators: Haskell programs that generate circuits
- Generators can be smart, e.g. optimize for speed in a given environment (a sketch follows below)
- Basic placement expressed through combinators
- Used successfully to generate high-performance FPGA cores
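A minimal sketch of the module-generator idea, assuming a hypothetical mini-netlist type (this is not the actual Lava API): a generator is an ordinary Haskell function, so it can be parameterized on width and can choose between structures, trading gate count against logic depth.

    -- Hypothetical mini-netlist type, not the actual Lava API.
    data Signal = Input String
                | And2 Signal Signal
      deriving Show

    -- Two generators for a parameterized-width AND: a generator is just a
    -- Haskell function, so recursion and higher-order functions come for free.

    -- Ripple structure: linear depth, minimal wiring.
    andRipple :: [Signal] -> Signal
    andRipple = foldr1 And2

    -- Balanced tree: logarithmic depth; a "smart" generator could pick the
    -- structure based on its timing environment.
    andTree :: [Signal] -> Signal
    andTree [s] = s
    andTree ss  = And2 (andTree ls) (andTree rs)
      where (ls, rs) = splitAt (length ss `div` 2) ss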
Wired: Extension to Lava

- Finer control over geometry
- More accurate performance models
- Feedback from timing/power analysis enables self-optimizing generators
- Wire-awareness (unique to Wired):
  - Performance analysis based on wire-length estimates
  - Control routing through "guides" (experimental)
  - ...
Monads in Haskell

Haskell functions are pure; side effects can be "simulated" using monads. The do-notation is syntactic sugar that expands to a pure program with explicit state passing.

    import Control.Monad.State

    add a b = do
      as <- get
      put (a:as)
      return (a+b)

    prog = do
      a <- add 5 6
      b <- add a 7
      add b 8

    *Main> runState prog []
    (26, [18,11,5])    -- (result, side effect)

Monads can also be used to model e.g. IO, exceptions, non-determinism, etc.
Monad combinators

Haskell has a general and well-understood combinator library for monadic programs:

    *Main> runState (mapM (add 2) [11..13]) []
    ([13,14,15],[2,2,2])

    *Main> runState (mapM (add 2 >=> add 4) [11..13]) []
    ([17,18,19],[4,2,4,2,4,2])
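For reference, a self-contained version of the two sessions above, using the add function from the previous slide (mapM and >=> are the standard combinators from Control.Monad):

    import Control.Monad ((>=>))
    import Control.Monad.State

    -- 'add' records its first argument in the state and returns the sum.
    add :: Int -> Int -> State [Int] Int
    add a b = do
      as <- get
      put (a:as)
      return (a+b)

    main :: IO ()
    main = do
      print (runState (mapM (add 2) [11..13]) [])           -- ([13,14,15],[2,2,2])
      print (runState (mapM (add 2 >=> add 4) [11..13]) []) -- ([17,18,19],[4,2,4,2,4,2])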
Example: Parallel prefix

Given inputs x1, x2, ..., xn, compute

    y1 = x1
    y2 = x1 ∘ x2
    ...
    yn = x1 ∘ x2 ∘ ... ∘ xn

for ∘ an associative (but not necessarily commutative) operator.
Parallel prefix

A very central component in microprocessors; the most common use is computing the carries in fast adders.

Trying different operators:

    Addition:   prefix (+)  [1,2,3,4]         = [1, 1+2, 1+2+3, 1+2+3+4]
                                              = [1,3,6,10]
    Boolean OR: prefix (||) [F,F,F,T,F,T,T,F] = [F,F,F,T,T,T,T,T]
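The specification is directly executable in Haskell (it is exactly the Prelude's scanl1, evaluated serially; the design question is how to parallelize it):

    -- Executable specification: serial left-to-right evaluation.
    prefix :: (a -> a -> a) -> [a] -> [a]
    prefix = scanl1

    -- > prefix (+) [1,2,3,4]
    -- [1,3,6,10]
    -- > prefix (||) [False,False,False,True,False,True,True,False]
    -- [False,False,False,True,True,True,True,True]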
Parallel prefix

Implementation choices (relying on associativity), for prefix (∘) [x1,x2,x3,x4] = [y1,y2,y3,y4]:

    Serial:   y4 = ((x1 ∘ x2) ∘ x3) ∘ x4
    Parallel: y4 = (x1 ∘ x2) ∘ (x3 ∘ x4)
    Sharing:  y4 = y3 ∘ x4
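A quick sanity check (my example, not from the talk) that associativity is what licenses the regrouping, with (+) standing in for ∘:

    import Test.QuickCheck

    -- Serial and parallel groupings of y4 agree for an associative operator.
    prop_regroup :: Int -> Int -> Int -> Int -> Bool
    prop_regroup x1 x2 x3 x4 =
      ((x1 + x2) + x3) + x4 == (x1 + x2) + (x3 + x4)

    -- > quickCheck prop_regroup
    -- +++ OK, passed 100 tests.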
There are many of them...

[Figure: three parallel prefix networks — Sklansky, Brent-Kung, Ladner-Fischer.]
Parallel prefix: Sklansky

Simplest approach (divide-and-conquer). Purely structural (no geometry); this could have been (monadic) Lava:

    sklansky op [a] = return [a]
    sklansky op as  = do
        let k       = length as `div` 2
            (ls,rs) = splitAt k as
        ls'  <- sklansky op ls
        rs'  <- sklansky op rs
        rs'' <- sequence [op (last ls', r) | r <- rs']
        return (ls' ++ rs'')
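A quick check of the generator against the specification (my test, assuming the sklansky above is in scope): instantiate the monad to Identity and the operator to addition:

    import Data.Functor.Identity

    -- Run the structural generator in the Identity monad with (+) as the
    -- operator; the result matches the scanl1 specification.
    testSklansky :: [Int]
    testSklansky = runIdentity (sklansky (\(a,b) -> return (a+b)) [1,2,3,4])
    -- testSklansky == [1,3,6,10]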
Refinement: Add placement

    sklansky op [a] = space cellWidth [a]
    sklansky op as  = downwards 1 $ do
        let k       = length as `div` 2
            (ls,rs) = splitAt k as
        (ls',rs') <- rightwards 0 $
            liftM2 (,) (sklansky op ls) (sklansky op rs)
        rs'' <- rightwards 0 $
            sequence [op (last ls', r) | r <- rs']
        return (ls' ++ rs'')
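A toy model (my sketch, not Wired's actual combinators) of the idea behind downwards and rightwards: geometry is computed compositionally by combining the bounding boxes of sub-blocks along one axis, separated by a pitch:

    -- Toy bounding-box model; not Wired's actual implementation.
    type Box = (Double, Double)  -- (width, height)

    downwardsBox, rightwardsBox :: Double -> [Box] -> Box
    downwardsBox pitch bs =           -- stack top-to-bottom
      ( maximum (map fst bs)
      , sum (map snd bs) + pitch * fromIntegral (length bs - 1) )
    rightwardsBox pitch bs =          -- place left-to-right
      ( sum (map fst bs) + pitch * fromIntegral (length bs - 1)
      , maximum (map snd bs) )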
Sklansky with placement

Simple PostScript output allows interactive development of the placement.

[Figure: placed Sklansky network.]
Refinement: Add routing guides

    bus = rightwards 0 . mapM bus1
      where
        bus1 = space 2750 >=> guide 3 500 >=> space 1250

    sklanskyIO op = downwards 0 $
            inputList 16 "in"
        >>= bus
        >>= space 1000
        >>= sklansky op
        >>= space 1000
        >>= bus
        >>= output "out"

Note the reuse of standard (monadic) Haskell combinators; nothing here is Wired-specific.
Sklansky with guides

[Figure: placed Sklansky network with routing guides on the input and output buses.]
Refinement: More guides

    sklansky op [a] = space cellWidthD [a]
    sklansky op as  = downwards 1 $ do
        as' <- bus as                 -- route the inputs through a bus first
        let k       = length as' `div` 2
            (ls,rs) = splitAt k as'
        (ls',rs') <- rightwards 0 $
            liftM2 (,) (sklansky op ls) (sklansky op rs)
        rs'' <- rightwards 0 $
            sequence [op (last ls', r) | r <- rs']
        bus (ls' ++ rs'')
Sklansky with guides

[Figure: placed network with the additional guides from the refinement above.]
Experiment: Compaction

The buses were compacted separately. A simple change to the base case removes the reserved cell width:

    sklansky op [a] = space cellWidthD [a]

becomes

    sklansky op [a] = return [a]
Export to CAD tool (Cadence SoC Encounter)

- Exchanged using the DEF file format
- Auto-routed in Encounter
- Odd rows flipped to share power rails — a simple change in the recursive call:

    sklansky (flipY . op) ls
Fast, low-power prefix networks

Mary Sheeran has developed circuit generators in Lava that search for fast, low-power parallel prefix networks.

Initially, crude performance models were used — delay: logical depth; power: number of operators — with still good results. Now Wired is used to improve accuracy: static timing/power analysis using models from the cell library. (A sketch of the crude models follows below.)
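The crude models are easy to state as code (my sketch, reusing the structural sklansky from earlier): let every signal carry its logical depth, and count operator instances in the state:

    import Control.Monad.State

    -- Delay model: logical depth; power model: number of operators.
    costOp :: (Int, Int) -> State Int Int
    costOp (a, b) = do
      modify (+1)           -- one more operator instance (power)
      return (1 + max a b)  -- output depth (delay)

    -- > runState (sklansky costOp (replicate 8 0)) 0
    -- ([0,1,2,2,3,3,3,3],12)   -- output depths; 12 operators in total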
Minimal change to search algorithm

    prefix f p = memo pm
      where
        pm ([],w)  = perhaps id' ([],w)
        pm ([i],w) = perhaps id' ([i],w)
        pm (is,w) | 2^(maxd (is,w)) < length is = Fail
        pm (is,w) = (bestOn is f . dropFail)
            [ wrpC ds (prefix f p) (prefix p p) | ds <- igen ... ]
          where
            wrpC ds p1 p2 = wrp ds (perhaps id' c) (p1 c1) (p2 c2)
            ...

Plug in cost functions that analyze the placed network through Wired.
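The "minimal change" is possible because the search is parameterized on the cost function. A sketch of the idea (names hypothetical, not the actual search code):

    import Data.List (minimumBy)
    import Data.Ord  (comparing)

    -- The search keeps whichever candidate a pluggable cost function ranks
    -- best; swapping a logical-depth model for Wired's placed-network
    -- timing/power analysis changes only the function passed in.
    bestBy :: (network -> Double) -> [network] -> network
    bestBy cost = minimumBy (comparing cost)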