Implementing Logic in FPGA Embedded Memory Arrays: Heterogeneous Memory Architectures Steve Wilton University of British Columbia Vancouver, B.C., Canada stevew@ece.ubc.ca
As FPGAs Get Bigger... Embedded Memory is becoming critical Implementing Storage on-chip is important: • Integration • Relax I/O Constraints • Speed • Flexibility Today, most FPGAs have large embedded memory arrays
Problem : If a circuit doesn’t need all memory blocks, valuable chip area wasted Solution : Configure memory blocks as ROMs and use them to implement logic
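To illustrate what "use them to implement logic" means, here is a minimal sketch, assuming a memory block behaves as a plain ROM with n address lines and w data outputs. Such a ROM can realize any w functions of the same n inputs by storing their truth tables. The helper names `build_rom` and `read_rom` and the full-adder example are illustrative assumptions, not taken from the talk.

```python
from itertools import product

def build_rom(n_inputs, functions):
    """Return ROM contents: one word (tuple of output bits) per address.

    `functions` is a list of callables, each taking n_inputs bits and
    returning 0 or 1; each callable drives one data output of the ROM.
    """
    rom = []
    # product() enumerates inputs in address order, first input = MSB.
    for bits in product((0, 1), repeat=n_inputs):
        rom.append(tuple(f(*bits) & 1 for f in functions))
    return rom

def read_rom(rom, bits):
    """Look up a word; the input bits form the address (MSB first)."""
    address = 0
    for b in bits:
        address = (address << 1) | b
    return rom[address]

# Hypothetical example: a full adder (3 inputs, 2 outputs) in an 8x2 ROM.
full_adder = build_rom(3, [
    lambda a, b, c: a ^ b ^ c,                    # sum
    lambda a, b, c: (a & b) | (a & c) | (b & c),  # carry out
])
```

Reading `read_rom(full_adder, (1, 1, 0))` returns the (sum, carry) pair for 1 + 1 + 0.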
Implementing Logic in Memory: [Figure: example circuit with nodes labeled A to Q]
Implementing Logic in Memory: [Figure: the example circuit with some nodes packed into a memory array] Two published algorithms: SMAP, EMB_Pack
The ability of memory arrays to implement logic depends on the memory array architecture Previous Work: 2Kbit arrays with 8 outputs are good
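As a small worked example of how capacity and output width determine the number of inputs, the sketch below assumes an array organized as (number of words) x (word width) with a power-of-two word count. The function name `address_inputs` is a hypothetical helper introduced for illustration.

```python
def address_inputs(total_bits, width):
    """Number of address (input) lines for an array of `total_bits` bits
    organized as words of `width` bits each."""
    words, remainder = divmod(total_bits, width)
    if remainder or words < 1 or words & (words - 1):
        raise ValueError("word count must be a positive power of two")
    return words.bit_length() - 1

# A 2 Kbit array with 8 outputs holds 256 words, so it has 8 inputs.
inputs_2k_x8 = address_inputs(2048, 8)
```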
Heterogeneous Memory Architectures Altera Stratix: Three types of memories MegaRAM M4K Blocks M512 Blocks
This Talk: A given: For storage: Several types of memories on a single chip is a good idea In this paper: For logic: 1. Heterogeneous memory architectures: a good idea? 2. How much does it help? 3. What memory sizes are best?
Methodology: [Flow diagram: benchmark circuits, architecture, SMAP (pack as much logic as possible into memory arrays), area model; the outputs are the amount of logic packed and the area] Packing Ratio = (Amount of logic packed) / Area
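The packing ratio metric can be written as a one-line computation. This is only a sketch: the function name is hypothetical and the numbers in the example are placeholders chosen for easy arithmetic, not results from the talk.

```python
def packing_ratio(logic_blocks_packed, area_in_logic_blocks):
    """Logic packed into the memory arrays, divided by the area of those
    arrays (both expressed in equivalent logic blocks)."""
    if area_in_logic_blocks <= 0:
        raise ValueError("area must be positive")
    return logic_blocks_packed / area_in_logic_blocks

# Placeholder numbers only: 300 blocks packed into 120 blocks' worth of area.
example_ratio = packing_ratio(300, 120)
```

A ratio above 1 means the arrays implement more logic than the same area of logic blocks would.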
SMAP Algorithm: Overall approach: 1. Map to 4-LUTs using Flowmap 2. Pack as many 4-LUTs as possible into arrays [Figure: example circuit with some nodes packed into a memory array] Goal: Maximize number of LUTs that can be packed
SMAP Algorithm: Goal: Maximize number of LUTs that can be packed Four Steps: 1. Choose a “seed node” 2. Choose signals that will become array inputs 3. Choose signals that will become array outputs 4. Insert memory into circuit, and remove 4-LUTs no longer needed
Choosing Inputs of Memory Array: Find the maximum-volume d-feasible cut (Flowpack). Cut edges become memory array inputs. [Figure: example with an 8-input memory, showing the seed node and the cut]
Choosing Outputs of Memory Array: A bad way to choose the output signal: [Figure: example circuit before and after a poor choice of output] Since D and F fan out outside the fanin cone, we still need D and F (and their predecessors)
Suppose there are two memory outputs: [Figure: two ways of choosing the two outputs in the example circuit, labeled "Better Solution" and "Bad Solution"]
Choosing Outputs of Memory Array: Goal: We want to select the w nodes such that the largest number of nodes can be deleted Problem: For w > 1, it is computationally expensive to check all combinations of w potential outputs Heuristic: 1. For each potential output individually, find that node’s maximum fanout-free cone (MFFC) 2. Choose the w nodes with the largest MFFCs.
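The output-selection heuristic above can be sketched as follows. This is a minimal illustration, not the SMAP implementation. The circuit is assumed to be a dictionary mapping each LUT to its fanins, signals that never appear as keys are treated as primary inputs, and the names `mffc`, `choose_outputs`, and the toy circuit are all hypothetical.

```python
def build_fanouts(fanins):
    """Invert a {node: [fanin, ...]} map into {node: [fanout, ...]}."""
    fanouts = {node: [] for node in fanins}
    for node, preds in fanins.items():
        for pred in preds:
            fanouts.setdefault(pred, []).append(node)
    return fanouts

def mffc(node, fanins, fanouts, keep=frozenset()):
    """Maximum fanout-free cone of `node`: the LUTs whose every fanout
    lies inside the cone, so they can be deleted if `node` is produced
    elsewhere. Nodes in `keep` (e.g. primary outputs) are never absorbed."""
    cone = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for pred in fanins.get(current, ()):
            if pred in cone or pred not in fanins or pred in keep:
                continue  # already taken, a primary input, or must stay
            if all(succ in cone for succ in fanouts[pred]):
                cone.add(pred)
                stack.append(pred)
    return cone

def choose_outputs(candidates, w, fanins, keep=frozenset()):
    """Rank candidates by individual MFFC size and return the top w."""
    fanouts = build_fanouts(fanins)
    ranked = sorted(candidates,
                    key=lambda n: len(mffc(n, fanins, fanouts, keep)),
                    reverse=True)
    return ranked[:w]

# Toy circuit (not the one on the slides); x1..x4 are primary inputs.
toy = {
    "C": ["x1", "x2"],
    "D": ["x2", "x3"],
    "E": ["C", "D"],
    "F": ["D", "x4"],
    "G": ["E"],
}
chosen = choose_outputs(["E", "F", "G"], 2, toy)
```

In the toy circuit, D feeds both E and F, so it is excluded from the cone of G. Because each candidate is scored individually, the chosen cones may overlap (here the cone of E lies inside the cone of G), which is the price of avoiding a search over all combinations of w outputs.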
Choosing a Seed Node: It turns out that the choice of seed node is very important - Try all nodes as potential seeds, choose whichever gives the best results - There are ways to speed this up, especially if there are many arrays
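The exhaustive seed search can be sketched as a simple loop. The callable `pack_from_seed` is a hypothetical stand-in for steps 2 to 4 of the algorithm (choosing inputs, choosing outputs, and removing LUTs), and it is assumed to return the set of LUTs that could be removed for a given seed. None of the speed-ups mentioned on the slide are shown.

```python
def best_seed(nodes, pack_from_seed):
    """Try every node as the seed and keep the one that removes the most LUTs.

    `pack_from_seed(seed)` is an assumed callable returning the set of
    LUTs that can be deleted when the array is built around `seed`.
    """
    best_node, best_removed = None, set()
    for seed in nodes:
        removed = pack_from_seed(seed)
        if len(removed) > len(best_removed):
            best_node, best_removed = seed, removed
    return best_node, best_removed
```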
Results: Homogeneous Architectures [Plot: packed logic blocks (0 to 350) versus bits per array (128 to 8192)]
Results: Homogeneous Architectures [Plot: packed logic blocks and area in equivalent logic blocks (both 0 to 350) versus bits per array (128 to 8192)]
Results: Homogeneous Architectures [Plot: packing ratio (0.5 to 3.0) versus bits per array (128 to 8192)] Packing Ratio = (Logic Blocks Packed) / (Area in Equivalent Logic Blocks)
Modifying SMAP for Heterogeneous Archs: SMAP fills arrays sequentially We have looked at two strategies: 1. Fill all large arrays first 2. Fill all small arrays first Strategy 1 gives better results
Two Sizes: Four Arrays of Each [3-D plot: packing density (1.0 to 3.5) versus Array 1 size and Array 2 size (128 to 8192 bits each), with the homogeneous results marked] 23% improvement. Best: 2048 bits / 128 bits
Observations from our Results: Trend 1: A combination of 2048 / 128 bit arrays is always the best choice Trend 2: The more arrays, the higher the gain seen by using a heterogeneous architecture
One Type-1 Array and Two Type-2 Arrays: [3-D plot: packing density (1.0 to 4.0) versus Array 1 size (one of these) and Array 2 size (two of these), 128 to 8192 bits each]
Four Type-1 Arrays and Eight Type-2 Arrays: [3-D plot: packing density (1.0 to 3.0) versus Array 1 size (four of these) and Array 2 size (eight of these), 128 to 8192 bits each]
One Type-1 Array and Three Type-2 Arrays: [3-D plot: packing density (1.5 to 4.0) versus Array 1 size (one of these) and Array 2 size (three of these), 128 to 8192 bits each]
Three Type-1 Arrays and Nine Type-2 Arrays: [3-D plot: packing density (1.0 to 2.5) versus Array 1 size (three of these) and Array 2 size (nine of these), 128 to 8192 bits each]
Observations from our Results: Trend 1: A combination of 2048 / 128 bit arrays is always the best choice Trend 2: The more arrays, the higher the gain seen by using a heterogeneous architecture Trend 3: From above, we should have 2048 bit arrays and 128 bit arrays. As the number of arrays increases, more of the arrays should be small.
One Type-1 Array and Three Type-2 Arrays: [3-D plot: packing density (1.5 to 4.0) versus Array 1 size (one of these) and Array 2 size (three of these), 128 to 8192 bits each; annotated "Better", "One large array and 3 small arrays", and "Three large arrays and one small array"]
Three Type-1 Arrays and Nine Type-2 Arrays: [3-D plot: packing density (1.0 to 2.5) versus Array 1 size (three of these) and Array 2 size (nine of these), 128 to 8192 bits each; annotated "Better", "3 large arrays and 9 small arrays", and "Nine large arrays and 3 small arrays"]
Things we haven't taken into account: Speed: - Heterogeneous architectures are likely to give gains in speed (compared to homogeneous) since an array of "just the right size" can be used - Right now, SMAP doesn't optimize for speed, but for homogeneous architectures, there is little impact on speed Routing: - With heterogeneous architectures, there may be longer routes to get to the right memory - But not too bad, if only a few memory types
Summary Heterogeneous Memory Architectures are efficient when implementing logic - Compared to homogeneous architectures 23 % improvement is typical - The more arrays, the higher the gain - A combination of 2048 / 128 bit arrays is always the best choice - As the number of arrays increases, more of the arrays should be small.