A Memory-Balanced Linear Pipeline Architecture for Trie-based IP Lookup

Weirong Jiang and Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California
Los Angeles, CA 90089, USA
{weirongj, prasanna}@usc.edu

Abstract

Rapid growth in network link rates poses a strong demand for high speed IP lookup engines. Trie-based architectures are natural candidates for pipelined implementation to provide high throughput. However, simply mapping a trie level onto a pipeline stage results in an unbalanced memory distribution over the different stages. To address this problem, several novel pipelined architectures have been proposed, but their non-linear pipeline structures result in new performance issues such as throughput degradation and delay variation. In this paper, we propose a simple and effective linear pipeline architecture for trie-based IP lookup. Our architecture achieves evenly distributed memory while realizing a high throughput of one lookup per clock cycle. It offers more freedom in mapping trie nodes to pipeline stages by supporting nops (no-operations). We implement our design, as well as the state-of-the-art solutions, on a commodity FPGA and evaluate their performance. Post place and route results show that our design can achieve a throughput of 80 Gbps, up to twice the throughput of the reference solutions. It has constant delay, maintains input order, and supports incremental route updates without disrupting the ongoing IP lookup operations.

1. Introduction

With the continuing growth of Internet traffic, IP address lookup has become a significant bottleneck for core routers. Advances in optical networking technology have pushed link rates in high speed routers beyond 40 Gbps, and terabit links are expected in the near future. To keep up with the rapid increase in link rates, IP lookup in high speed routers must be performed in hardware. For example, OC-768 (40 Gbps) links require one lookup every 8 ns for a minimum size (40 bytes) packet. Software-based solutions cannot support such rates.
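As a quick sanity check on that 8 ns figure (our own arithmetic, not an additional claim of the paper), a minimum-size packet carries 320 bits:

```latex
% Per-lookup time budget at OC-768 line rate for minimum-size packets:
\[
  t_{\mathrm{lookup}}
  = \frac{40\ \text{bytes} \times 8\ \text{bits/byte}}{40\ \text{Gbps}}
  = \frac{320\ \text{bits}}{40 \times 10^{9}\ \text{bits/s}}
  = 8\ \text{ns}.
\]
```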
Current hardware-based solutions for high speed IP lookup can be divided into two main categories: TCAM-based and SRAM-based solutions. Although TCAM-based engines can retrieve an IP lookup result in a single clock cycle, their throughput is limited by the low speed of TCAM (currently the highest advertised TCAM speed is 133 MHz, while state-of-the-art SRAMs can easily achieve clock rates of over 400 MHz). SRAM outperforms TCAM with respect to speed, density, and power consumption, but traditional SRAM-based engines need multiple clock cycles to finish a lookup. As pointed out by a number of researchers, pipelining can significantly improve the throughput. For trie-based IP lookup, a simple approach is to map each trie level onto a private pipeline stage with its own memory and processing logic. With multiple stages in the pipeline, one IP packet can be looked up during each clock period. However, this approach results in an unbalanced trie node distribution over the pipeline stages, which has been identified as a dominant issue for pipelined architectures [1, 2, 15]. In an unbalanced pipeline, the stage storing a larger number of trie nodes needs more time to access its larger memory. It also incurs more frequent updates, since the update load is proportional to the number of trie nodes stored in the local memory. Under intensive route insertion, the largest stage can suffer memory overflow. Hence, such a heavily utilized stage can become a bottleneck and affect the overall performance of the pipeline.

To address these problems, some novel pipeline architectures have been proposed for implementation using ASIC technology. They achieve a relatively balanced memory distribution by using circular structures. However, their non-linear pipeline structures result in new performance issues, such as throughput degradation and delay variation. Moreover, their performance has been evaluated by estimation rather than on real hardware. For example, CACTI [3], a popular tool for estimating SRAM performance, has been used. However, such estimations do not account for many implementation issues, such as routing and logic delays, so the actual throughput when implemented on FPGAs may be lower.
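To make the pipelining idea concrete, here is a minimal C sketch of the simple scheme described above, in which each trie level occupies a private stage with its own memory. It is a software illustration under our own assumptions (toy sizes, the root stored at index 0 of stage 0, and invented names such as `node`, `flight`, and `tick`), not the paper's architecture.

```c
#include <stdint.h>

#define STAGES 8     /* toy depth: one pipeline stage per trie level */
#define MEM    256   /* nodes per stage-local memory */

/* A trie node as stored in a stage's local SRAM: the prefix it represents
 * (an index, -1 if none) and the indices of its two children in the NEXT
 * stage's memory (-1 if absent). */
typedef struct { int prefix_id; int child[2]; } node;

/* The state of one in-flight lookup as it moves through the pipeline. */
typedef struct { uint32_t addr; int idx; int best; int valid; } flight;

static node   mem[STAGES][MEM];  /* per-stage memories          */
static flight in_stage[STAGES];  /* lookup occupying each stage */

void init_pipeline(void) {
    for (int s = 0; s < STAGES; s++) {
        for (int i = 0; i < MEM; i++)
            mem[s][i] = (node){ -1, { -1, -1 } };   /* mark entries unused */
        in_stage[s] = (flight){ 0, -1, -1, 0 };     /* start with bubbles  */
    }
}

/* One clock tick: every stage serves a different lookup in parallel, so
 * once the pipeline is full, one lookup completes per cycle. Returns the
 * matched prefix id of the lookup leaving the pipeline (-1 if it matched
 * nothing, -2 if that slot was an empty bubble). */
int tick(uint32_t new_addr, int new_valid) {
    /* 1. Each stage reads its local memory and advances its lookup. */
    for (int s = 0; s < STAGES; s++) {
        flight *f = &in_stage[s];
        if (!f->valid || f->idx < 0) continue;         /* bubble or dead end */
        node n = mem[s][f->idx];
        if (n.prefix_id >= 0) f->best = n.prefix_id;   /* remember last match */
        f->idx = n.child[(f->addr >> (31 - s)) & 1];   /* bit s picks a child */
    }
    /* 2. Shift every lookup one stage forward and admit the new request. */
    int result = in_stage[STAGES - 1].valid ? in_stage[STAGES - 1].best : -2;
    for (int s = STAGES - 1; s > 0; s--) in_stage[s] = in_stage[s - 1];
    in_stage[0] = (flight){ new_addr, 0, -1, new_valid };  /* enter at root */
    return result;
}
```

After an initial fill of STAGES cycles, each call to tick() completes one lookup, which is the one-output-per-clock-cycle behavior that motivates pipelining. What the sketch does not fix is that mem[s] must be sized for the most populated level, which is exactly the imbalance problem described above.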
In this paper, we focus on trie-based IP lookup engines that utilize pipelining. A linear pipeline architecture is adopted due to its desirable properties, such as constant delay and a high throughput of one output per clock cycle. Using a fine-grained node-to-stage mapping, trie nodes are evenly distributed across most of the pipeline stages. For a realistic performance evaluation, we implement our design, as well as the state-of-the-art solutions, on a commodity FPGA. Post place and route results show that the proposed architecture can achieve a throughput of 80 Gbps for minimum size (40 bytes) packets on a single Xilinx Virtex II Pro FPGA [19]. Average memory usage per entry is 115.2 bits, excluding the next-hop information. In addition, our design supports fast incremental on-line updates without disruption to the ongoing IP lookup process.

The rest of the paper is organized as follows. In Section 2, we review the background and related work. In Section 3, we propose our optimized design, named the Optimized Linear Pipeline (OLP) architecture. In Section 4, we implement the OLP architecture as well as state-of-the-art pipelined architectures on FPGAs and compare their performance. Finally, in Section 5, we conclude the paper.

2. Background

IP lookup has been extensively studied [4, 13, 18]. From the perspective of data structures, these techniques can be classified into two main categories: trie-based [8, 11, 14, 16] and hash-based [5, 7] solutions. In this paper, we consider only trie-based IP lookup, which is naturally suitable for pipelining.

2.1 Trie-based IP Lookup

A trie is a tree-like data structure for longest prefix matching. Each prefix is represented by a node in the trie, and the value of the prefix corresponds to the path from the root of the tree to that node. The prefix bits are scanned left to right: if the scanned bit is 0, the node has a child to the left; a bit of 1 indicates a child to the right. The routing table in Figure 1(a) corresponds to the trie in Figure 1(b). For example, the prefix 010 corresponds to the path starting at the root and ending in node P3: first a left turn (0), then a right turn (1), and finally a turn to the left (0).

[Figure 1. (a) Prefix set; (b) Uni-bit trie; (c) Leaf-pushed trie.]

IP lookup is performed by traversing the trie according to the bits in the IP address. When a leaf is reached, the last prefix seen along the path to the leaf is the longest matching prefix for the IP address. The time to look up a uni-bit trie (which is traversed in a bit-by-bit fashion) is equal to the prefix length. The use of multiple bits in one scan increases the search speed; such a trie is called a multi-bit trie, and the number of bits scanned at a time is called the stride.

Normally each trie node contains two fields: the represented prefix and the pointer to the child nodes. By using an optimization called leaf-pushing [17], each node needs only one field: either the prefix index or the pointer to the child nodes. Some optimization schemes [4, 6] have also been proposed to build memory-efficient multi-bit tries. For simplicity, we consider only the leaf-pushed uni-bit trie in this paper, though our ideas can be applied to other, more advanced tries.
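As a concrete reference, the following C sketch implements the uni-bit trie traversal described above, using the two-field (non-leaf-pushed) node layout. It is a minimal software sketch, not the paper's hardware design; the node layout and all names (`trie_node`, `insert`, `lookup`) are our own, and the prefix set is a toy example in which only P3 = 010 comes from the text.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* A uni-bit trie node with the two-field layout described above:
 * the represented prefix (an index, -1 if none) and pointers to the
 * two children. */
typedef struct trie_node {
    int prefix_id;                /* index of stored prefix, -1 if none  */
    struct trie_node *child[2];   /* [0] = left (bit 0), [1] = right (1) */
} trie_node;

static trie_node *new_node(void) {
    trie_node *n = calloc(1, sizeof *n);
    if (!n) { perror("calloc"); exit(1); }
    n->prefix_id = -1;
    return n;
}

/* Insert a prefix given as the top `len` bits of `bits` (MSB first). */
static void insert(trie_node *root, uint32_t bits, int len, int prefix_id) {
    trie_node *n = root;
    for (int i = 0; i < len; i++) {
        int b = (bits >> (31 - i)) & 1;   /* scan prefix bits left to right */
        if (!n->child[b]) n->child[b] = new_node();
        n = n->child[b];
    }
    n->prefix_id = prefix_id;
}

/* Longest prefix match: walk the trie by address bits, remembering the
 * last prefix seen on the path, as described in Section 2.1. */
static int lookup(const trie_node *root, uint32_t addr) {
    const trie_node *n = root;
    int last_seen = -1, depth = 0;
    while (n) {
        if (n->prefix_id >= 0) last_seen = n->prefix_id;  /* remember match */
        if (depth == 32) break;                           /* all bits used  */
        n = n->child[(addr >> (31 - depth)) & 1];
        depth++;
    }
    return last_seen;   /* -1 means no matching prefix */
}

int main(void) {
    trie_node *root = new_node();
    insert(root, 0x00000000u, 1, 1);   /* P1 = 0*   (hypothetical)     */
    insert(root, 0x80000000u, 1, 2);   /* P2 = 1*   (hypothetical)     */
    insert(root, 0x40000000u, 3, 3);   /* P3 = 010* (from the example) */
    /* 0x47000000 starts with bits 0,1,0 -> longest match is P3. */
    printf("longest match: P%d\n", lookup(root, 0x47000000u));
    return 0;
}
```

In a leaf-pushed trie, by contrast, every prefix is pushed down to the leaves, so internal nodes hold only child pointers and leaves hold only prefix indices; lookup() would then simply return the value stored at the leaf it reaches instead of tracking last_seen.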
2.2 Pipelined Architectures

A straightforward way to pipeline a trie is to assign each trie level to a distinct stage, so that a lookup request can be issued every cycle, thus increasing the throughput. However, this simple pipeline scheme results in an unbalanced memory distribution, leading to low throughput and inefficient memory allocation [1, 15].

Basu et al. [2] and Kim et al. [8] both reduce the memory imbalance by using variable strides to minimize the largest trie level. However, even with their schemes, the memory sizes of different stages can still vary widely. As an improvement upon [8], Lu et al. [10] propose a tree-packing heuristic to further balance the memory, but it does not solve the fundamental problem of how to retrieve a node's descendants that are not allocated in the following stage.
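To see where the imbalance comes from, the following C sketch counts the nodes on each trie level; under the naive level-to-stage mapping, that count is proportional to the memory each stage must provide. The node shape mirrors the earlier sketch, and `count_levels`/`report_stage_sizes` are illustrative names of our own.

```c
#include <stdio.h>

#define MAX_LEVEL 33   /* IPv4 uni-bit trie: root at level 0, up to level 32 */

/* Minimal uni-bit trie node (same shape as in the earlier sketch). */
typedef struct tnode { int prefix_id; struct tnode *child[2]; } tnode;

/* Count nodes per trie level. Under the naive mapping, level d is stored
 * entirely in pipeline stage d, so count[d] determines stage d's memory. */
static void count_levels(const tnode *n, int depth, long count[MAX_LEVEL]) {
    if (!n || depth >= MAX_LEVEL) return;
    count[depth]++;
    count_levels(n->child[0], depth + 1, count);
    count_levels(n->child[1], depth + 1, count);
}

void report_stage_sizes(const tnode *root) {
    long count[MAX_LEVEL] = {0};
    count_levels(root, 0, count);
    for (int d = 0; d < MAX_LEVEL; d++)
        if (count[d] > 0)
            printf("level %2d -> stage %2d: %ld nodes\n", d, d, count[d]);
}
```

For real IPv4 routing tables the per-level counts are heavily skewed, since prefix lengths cluster roughly around /16 to /24; this skew is exactly the imbalance that the variable-stride and tree-packing schemes [2, 8, 10] try to reduce.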