Efficient Multi-Ported Memories for FPGAs
Eric LaForest, Greg Steffan
University of Toronto, Computer Engineering Research Group
February 22, 2010
Parallelism in FPGAs
Larger SoCs on FPGAs → Parallel Systems
Parallel systems on FPGAs will need:
− Queueing
− Data sharing
− Communication
− Synchronization
Boils down to:
− FIFOs
− Register files
We can do all these with multi-ported memories. 2
Multi-Ported Memory: existing workarounds are ad-hoc, “roll-your-own”, and have limited parallelism. 3
Conventional Approaches 4
2W/2R Multi-Ported Memory: doesn't exist on FPGAs. Altera used to have one (Mercury). 5
Stratix III Building Blocks
Adaptive Logic Modules (registers, LUTs, adders): flexible, but slow
Block RAMs: fast, but inflexible
− M9K (e.g., 32 x 256)
− M144K (e.g., 32 x 4096) 6
2W/2R Pure-ALM: scales very poorly with memory depth 7
1W/nR Replication: only one write port, multiple read ports 8
mW/nR Banking: multiple write ports, fragmented data 9
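The contrast between replication and banking can be illustrated with a small behavioral model. This is only a sketch in Python, not the hardware from the talk: the class names `ReplicatedMemory` and `BankedMemory` are hypothetical, and banking is assumed here to mean that each write port owns a private full-depth bank.

```python
class ReplicatedMemory:
    """1W/nR: one copy of the data per read port; the single write updates every copy."""

    def __init__(self, depth, read_ports):
        self.copies = [[0] * depth for _ in range(read_ports)]

    def write(self, addr, value):
        # The one write port broadcasts to all copies, so they never diverge.
        for copy in self.copies:
            copy[addr] = value

    def read(self, port, addr):
        return self.copies[port][addr]


class BankedMemory:
    """mW/nR by banking (assumed: one private bank per write port)."""

    def __init__(self, depth, write_ports):
        self.banks = [[0] * depth for _ in range(write_ports)]

    def write(self, port, addr, value):
        # Each write port only ever touches its own bank.
        self.banks[port][addr] = value

    def read(self, bank, addr):
        # The reader has to know which bank holds the value it wants.
        return self.banks[bank][addr]


rep = ReplicatedMemory(depth=4, read_ports=2)
rep.write(1, 42)
replicated_reads = [rep.read(0, 1), rep.read(1, 1)]

banked = BankedMemory(depth=4, write_ports=2)
banked.write(0, 1, 42)  # write port 0 stores 42 at address 1
banked.write(1, 1, 23)  # write port 1 stores 23 at the same address
banked_reads = [banked.read(0, 1), banked.read(1, 1)]
```

In this model both read ports of the replicated memory return the same value, while the banked memory returns a different value for the same address depending on the bank consulted, which is the fragmentation the slide refers to.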
mW/nR “Multipumping”: multiple read/write ports, divides clock speed, no fragmentation, read/write ordering matters 10
Block RAMs: Simple Dual Port (one write port, one read port) 11
Block RAMs: True Dual Port (two ports, each usable for read or write) 12
“Pure Multipumping”: read as banked memory (multiple reads) 13
“Pure Multipumping”: write as replicated memory (avoids fragmentation) 14
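The two costs of multipumping, a divided external clock and a fixed read/write order, can be sketched behaviorally. The Python below is an illustration under stated assumptions, not the authors' design: it assumes all reads in an external cycle are served before any write, and the function names, the 400 MHz internal clock, and the pump factor of 2 are hypothetical.

```python
def external_cycle(mem, reads, writes):
    """Serve one external cycle as a sequence of internal steps.

    Assumption: every read is performed before any write, so reads
    observe the values stored before this external cycle began.
    """
    results = [mem[addr] for addr in reads]
    for addr, value in writes:  # writes land afterwards, in port order
        mem[addr] = value
    return results


def external_fmax(internal_fmax_mhz, pump_factor):
    """External clock rate when each external cycle takes pump_factor internal cycles."""
    return internal_fmax_mhz / pump_factor


mem = [0] * 4
first = external_cycle(mem, reads=[1, 3], writes=[(1, 42), (3, 23)])
second = external_cycle(mem, reads=[1, 3], writes=[])
example_fmax = external_fmax(400.0, 2)  # hypothetical numbers
```

Here `first` still sees the old contents because the writes of the same external cycle come later, and `second` sees the new values. A design that reordered these steps would change what a same-cycle read returns, which is why the ordering needs care.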
Methodology
Generate design variations over space
− Vary # of ports, depth, type of memories: 1W/2R to 8W/16R; 2 to 256 elements deep; Pure-ALM, M9K, MLAB, Multipumped
− Wrap in testbench for timing and correctness
Target Quartus 9.0 to Stratix III
− No synthesis optimizations for speed or area
− Standard P&R effort (speed, avg. over 10 runs)
Measure area as Total Equivalent Area
− Expresses area in a single unit (ALMs) 15
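The design space described above could be enumerated along these lines. This is a sketch only: the slide gives the endpoints of each range, so the intermediate steps are assumptions (port counts doubling from 1W/2R to 8W/16R with twice as many read ports as write ports, and depths as powers of two), and the variable names are hypothetical.

```python
from itertools import product

# Assumed steps between the stated endpoints 1W/2R and 8W/16R.
port_configs = [(w, 2 * w) for w in (1, 2, 4, 8)]

# Assumed power-of-two depths between the stated endpoints 2 and 256.
depths = [2 ** k for k in range(1, 9)]

# Memory types as listed on the slide.
memory_types = ["Pure-ALM", "M9K", "MLAB", "Multipumped"]

design_points = [
    {"writes": w, "reads": r, "depth": d, "type": t}
    for (w, r), d, t in product(port_configs, depths, memory_types)
]
```

Each entry would then be wrapped in a testbench and compiled; under these assumptions the sweep has 4 x 8 x 4 = 128 points.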
Conventional Multi-Porting Performance 16
1W/2R Pure-ALM Area vs. Speed: too big and slow! [Plot of speed vs. area, with arrows marking the faster and smaller directions; reference point: NiosII/f, 290 MHz, 500 ALMs] 17
1W/2R Replicated vs. Pure-ALM 18
1W/2R “Pure Multipumping” 19
LVT-Based Multi-Ported Memories 20
LVT-Based Memory 21
LVT-Based Memory: begin with one block RAM 22
LVT-Based Memory: replicate for two read ports 23
LVT-Based Memory: bank for two write ports 24
LVT-Based Memory: select bank to read from 25
LVT-Based Memory: add bank lookup table 26
LVT-Based Memory 27
Live Value Table Operation 28
LVT Operation 2W/2R, 4-deep 29
LVT Operation [Diagram: write ports W0 and W1 supply write addresses, and read ports R0 and R1 supply read addresses, to a Live Value Table with entries 0 to 3] 30
LVT Operation: Write [Diagram: W0 writes 42 @ 1 and W1 writes 23 @ 3; LVT entry 1 becomes 0 and entry 3 becomes 1] Records which write port last updated a location 31
LVT Operation: Read [Diagram: reads of @ 3 and @ 1 look up LVT entries 3 and 1, which return 1 and 0] Steers read port to correct memory bank 32
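The write and read steps above can be traced in a small behavioral model of a 2W/2R, 4-deep memory. This is a sketch in Python rather than the hardware itself: the class name `LVTMemory` and its layout (one replica per write-port/read-port pair) are assumptions made for illustration.

```python
class LVTMemory:
    """Behavioral model of an LVT-based mW/nR memory."""

    def __init__(self, depth, write_ports, read_ports):
        # banks[w][r]: replica written by write port w and read by read port r.
        self.banks = [
            [[0] * depth for _ in range(read_ports)]
            for _ in range(write_ports)
        ]
        # Live Value Table: for each address, the write port that last wrote it.
        self.lvt = [0] * depth

    def write(self, write_port, addr, value):
        # Update every replica in this write port's bank, then record ownership.
        for replica in self.banks[write_port]:
            replica[addr] = value
        self.lvt[addr] = write_port

    def read(self, read_port, addr):
        # The LVT steers the read to the bank holding the most recent value.
        bank = self.lvt[addr]
        return self.banks[bank][read_port][addr]


mem = LVTMemory(depth=4, write_ports=2, read_ports=2)
mem.write(0, 1, 42)  # W0 writes 42 @ 1
mem.write(1, 3, 23)  # W1 writes 23 @ 3
lvt_after_writes = list(mem.lvt)
reads = [mem.read(0, 3), mem.read(1, 1)]
```

After the two writes the table holds 0 at entry 1 and 1 at entry 3, matching the slide, and the reads of addresses 3 and 1 return 23 and 42 from banks 1 and 0 respectively.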
LVT Implementation LVT remains practical because it is very narrow 33
LVT Operation Small Pure-ALM memory controlling larger block RAMs 34
Advantages of LVTs
LVTs add a layer of indirection
− Everything operates in parallel
− Makes banked memory behave as a consistent unit
LVTs are narrow
− Word width = log2(# of write ports), typically < 4 bits
− Pure-ALM, but practical size and speed 35
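A quick calculation shows how narrow the LVT is relative to the data it steers. The sketch below makes several assumptions that are not stated on the slide: the width is rounded up to a whole number of bits, the data word is 32 bits wide, and data storage is counted as one replica per write-port/read-port pair. The function names are hypothetical.

```python
def lvt_word_width(write_ports):
    """Bits needed to name one of write_ports banks: ceil(log2(write_ports))."""
    return (write_ports - 1).bit_length()


def storage_bits(depth, data_width, write_ports, read_ports):
    """Return (LVT bits, data bits), assuming one replica per write/read port pair."""
    lvt_bits = depth * lvt_word_width(write_ports)
    data_bits = depth * data_width * write_ports * read_ports
    return lvt_bits, data_bits


widths = {m: lvt_word_width(m) for m in (2, 4, 8)}
lvt_bits, data_bits = storage_bits(depth=256, data_width=32, write_ports=2, read_ports=4)
```

For a 256-deep 2W/4R memory under these assumptions the LVT stores 256 bits against 65,536 bits of block RAM data, though the LVT itself still needs all the write and read ports, which is why it is built from ALMs.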
LVT Performance 36
2W/4R Pure-ALM 37
2W/4R LVT-based vs. Pure-ALM: 84% smaller; 43% faster; 412 MHz to 375 MHz 38
2W/4R Multipumping Must be careful about read/write ordering! 39
Multipumping Performance 40
2W/4R Multipumping 41
2W/4R Multipumping Pure Multipumping (279 MHz) 42
4W/8R Multipumping Worsens as # of ports increases 43
2W/4R Multipumping: 54% slower on average; 28% smaller on average; 193 MHz to 174 MHz 44
Conclusions
LVT-based memories are faster and smaller than Pure-ALM memories.
LVT-based memories are faster than pure multipumping, but at a cost in area.
Pure multipumped memories are better for memories with few ports or low speed. 45
Future Work
Pure multipumping for LVT-based memories
− Build banks with 2W/4R pure multipumping blocks
− Possible further area improvement
Relaxing the read/write order for multipumping
− Allows multiplexing the write ports
− Leaves designer to watch for WAR violations 46
Thank You 47