
Beyond TCAMs: An SRAM-based Parallel Multi-Pipeline Architecture for Terabit IP Lookup

Weirong Jiang, Qingbo Wang and Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California
Los Angeles, CA 90089, USA
Email: {weirongj, qingbow, prasanna}@usc.edu

Abstract: Continuous growth in network link rates poses a strong demand on high speed IP lookup engines. While Ternary Content Addressable Memory (TCAM) based solutions serve most of today's high-end routers, they do not scale well to the next generation [1]. On the other hand, pipelined SRAM-based algorithmic solutions become attractive. Intuitively, multiple pipelines can be utilized in parallel to have a multiplicative effect on the throughput. However, several challenges must be addressed for such solutions to realize high throughput. First, the memory distribution across different stages of each pipeline, as well as across different pipelines, must be balanced. Second, the traffic on the various pipelines should be balanced. In this paper, we propose a parallel SRAM-based multi-pipeline architecture for terabit IP lookup. To balance the memory requirement over the stages, a two-level mapping scheme is presented. By trie partitioning and subtrie-to-pipeline mapping, we ensure that each pipeline contains an approximately equal number of trie nodes. Then, within each pipeline, a fine-grained node-to-stage mapping is used to achieve evenly distributed memory across the stages. To balance the traffic on different pipelines, both pipelined prefix caching and dynamic subtrie-to-pipeline remapping are employed. Simulation using real-life data shows that the proposed architecture with 8 pipelines can store a core routing table with over 200K unique routing prefixes using 3.5 MB of memory. It achieves a throughput of up to 3.2 billion packets per second, i.e., 1 Tbps for minimum size (40 bytes) packets.

I. INTRODUCTION

IP lookup with longest prefix matching is a core function of Internet routers. It has become a major bottleneck for backbone routers as the Internet continues to grow rapidly [2]. With the advances in optical networking technology, link rates in high speed IP routers are being pushed from OC-768 (40 Gbps) to even higher rates. Such high rates demand that IP lookup in routers be performed in hardware. For instance, a 40 Gbps link requires one lookup every 8 ns for minimum size (40 bytes) packets, since each such packet occupies the link for 320 bits / 40 Gbps = 8 ns. Such throughput is impossible using existing software-based solutions [3].

Most hardware-based solutions for high speed IP lookup fall into two main categories: TCAM (ternary content addressable memory)-based and DRAM/SRAM (dynamic/static random access memory)-based solutions. Although TCAM-based engines can retrieve IP lookup results in just one clock cycle, their throughput is limited by the relatively low speed of TCAMs. They are expensive and offer little flexibility for adapting to new addressing and routing protocols [4]. As shown in Table I, SRAM outperforms TCAM with respect to speed, density and power consumption. However, traditional SRAM-based solutions, most of which can be regarded as some form of tree traversal, need multiple clock cycles to complete a lookup. For example, the trie [3], a tree-like data structure representing a collection of prefixes, is widely used in SRAM-based solutions. Searching a trie for the longest matched prefix of an IP packet needs multiple memory accesses.

TABLE I
COMPARISON OF TCAM AND SRAM TECHNOLOGIES (18 Mbit CHIP)

                                            TCAM           SRAM
Maximum clock rate (MHz)                    266 [5]        400 [6], [7]
Power consumption (Watts)                   12 to 15 [8]   ≈ 0.1 [9]
Cell size (# of transistors per bit) [10]   16             6

Several researchers have explored pipelining to improve the throughput significantly. Taking trie-based solutions as an example, a simple pipelining approach is to map each trie level onto a pipeline stage with its own memory and processing logic. One IP lookup can then be performed every clock cycle. However, this approach results in an unbalanced trie node distribution over the pipeline stages. This has been identified as a dominant issue for pipelined architectures [11], [12]. In an unbalanced pipeline, the "fattest" stage, which stores the largest number of trie nodes, becomes a bottleneck. It adversely affects the overall performance of the pipeline for the following reasons. First, it needs more time to access the larger local memory. This leads to a reduction in the global clock rate. Second, a fat stage results in many updates, due to the proportional relationship between the number of updates and the number of trie nodes stored in that stage. Particularly during the update process caused by intensive route insertion, the fattest stage can also result in memory overflow. Furthermore, since it is unclear at hardware design time which stage will be the fattest, we need to allocate memory with the maximum size for each stage. This results in memory wastage.

To achieve a balanced memory distribution across stages, several novel pipeline architectures have been proposed [13], [14]. However, their non-linear pipeline structures result in throughput degradation, and most of them must disrupt ongoing operations during a route update. Our previous work
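
The simple level-per-stage pipelining discussed in the Introduction, and the imbalance it causes, can be illustrated with a minimal sketch. This is not the architecture proposed in the paper: the routing table `TABLE`, the `Node` class, and all function names below are hypothetical, chosen only for illustration. The sketch builds a binary (uni-bit) trie, performs a longest prefix match, and counts trie nodes per level, which under the one-level-per-stage mapping is the number of nodes each stage would have to store.

```python
from collections import Counter


class Node:
    """One node of a binary (uni-bit) trie. Hypothetical helper type."""
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # child for bit 0, child for bit 1
        self.next_hop = None          # set only if a prefix ends at this node


def insert(root, prefix_bits, next_hop):
    """Insert a prefix given as a string of '0'/'1' characters."""
    node = root
    for bit in prefix_bits:
        i = int(bit)
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.next_hop = next_hop


def longest_prefix_match(root, addr_bits):
    """Walk the trie bit by bit, remembering the last next hop seen."""
    node, best = root, root.next_hop
    for bit in addr_bits:
        node = node.children[int(bit)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best


def nodes_per_level(root):
    """Count trie nodes at each depth (one depth = one pipeline stage)."""
    counts = Counter()
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        counts[depth] += 1
        for child in node.children:
            if child is not None:
                stack.append((child, depth + 1))
    return [counts[d] for d in range(len(counts))]


def provisioned_vs_used(levels):
    """Nodes provisioned if every stage is sized for the fattest one, vs. nodes used."""
    return max(levels) * len(levels), sum(levels)


# Assumption: a toy routing table, not taken from the paper's data sets.
TABLE = {"0": "A", "000": "B", "010": "C", "0110": "D", "1": "E", "110": "F"}

root = Node()
for prefix, hop in TABLE.items():
    insert(root, prefix, hop)

levels = nodes_per_level(root)
print("nodes per stage:", levels)
print("provisioned vs used:", provisioned_vs_used(levels))
print("lookup 1101 ->", longest_prefix_match(root, "1101"))
```

Even in this toy table the stages hold different numbers of nodes, so sizing every stage for the fattest one provisions more node slots than the trie actually uses. Each loop iteration in `longest_prefix_match` corresponds to one memory access, which is why an unpipelined trie lookup takes multiple clock cycles.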
