POWER-EFFICIENT RANGE-MATCH-BASED PACKET CLASSIFICATION ON FPGA ∗

Yun R. Qu, Viktor K. Prasanna
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, CA 90089
{yunqu, prasanna}@usc.edu

∗ Supported by U.S. National Science Foundation under grant CCF-1320211. Equipment grant from Xilinx Inc. is gratefully acknowledged.

ABSTRACT

Packet classification is a kernel application performed at network routers. Many classification engines are optimized for prefix and exact match, while a range-to-prefix translation can lead to rule set expansion. Under a limited power budget, it is challenging to achieve high classification throughput. In this paper, we present a high-performance and power-efficient packet classification engine on FPGA. We construct a modular Processing Element (PE); each PE compares a stride of the input packet header against a stride of a range boundary. We concatenate multiple PEs into a systolic array. Efficient power optimization techniques including self-enabled power gating and entropy-based scheduling are explored on our architecture. Experimental results show that, for 4 K 15-field rule sets, our prototype on a state-of-the-art FPGA can achieve 250 Million Packets Per Second (MPPS) throughput. Using the proposed power optimization techniques, our classification engine consumes only 30% of the power of the non-optimized design without sacrificing the throughput.

1. INTRODUCTION

The development of the Internet demands routers to support a variety of network applications, such as firewall processing and Quality of Service (QoS) differentiation. This makes packet classification a kernel function for network management tasks; an incoming packet can be discarded, forwarded to specific ports, or broadcast based on many criteria.

Packet classification faces the following challenges: (1) the expanding depth and width of the classification rule sets, (2) the growing complexity of the rule sets, and (3) the increasing demand for high throughput and low power. For example, in the OpenFlow protocol [1], 15 fields of the packet header have to be examined; some fields require generic range match to be performed. Meanwhile, many emerging network applications require high throughput under a constrained power budget. These factors make packet classification a critical task in high-performance routers.

Many existing solutions for packet classification employ Ternary Content Addressable Memory (TCAM) [2]. TCAM is notorious for its high cost and power consumption. State-of-the-art VLSI chips can be built with massive amounts of on-chip computation and memory resources, as well as a large number of I/O pins for off-chip memory accesses; FPGAs [3], with their flexibility and reconfigurability, are especially suitable for accelerating network applications.

In this paper, we propose a high-performance and power-efficient packet classification engine on FPGA. The engine can perform prefix match, exact match, or range match on any field. Efficient power optimization techniques are employed on this engine. Specifically:

• We construct a modular PE to match a stride of the packet header against a stride of a range boundary. We concatenate multiple PEs into a systolic array to sustain high clock rates for large rule sets.
• We employ a self-enabled power gating technique on our architecture. The modular PEs are selectively enabled to save the memory access power.
• We propose an entropy-based scheduling for various fields. To improve the efficiency of our power gating technique, the fields corresponding to higher entropy values are matched in the first few pipeline stages.
• We prototype our designs on a state-of-the-art FPGA. Post place-and-route results demonstrate 250 MPPS throughput while using 1.655 W power (70% reduction compared to the non-optimized designs).

The rest of the paper is organized as follows: Section 2 introduces the packet classification problem. We present our hardware architecture and optimization techniques in Section 3 and Section 4, respectively. We evaluate the performance on FPGA in Section 5. Section 6 compares our work with the related works. Section 7 concludes the paper.

2. BACKGROUND

2.1. Packet Classification

Packet classification involves classifying packets based on multiple fields in the packet header [2, 4]. The individual predefined entries for classifying a packet are called rules, which are stored in a rule set. Each rule has a rule ID (RID), multiple fields and their associated values, a priority, and an action to be taken if matched. Different fields in a rule require various types of match criteria, such as prefix match, range match, and exact match. A packet is considered matching a rule only if it matches all the fields in that rule. A packet may match multiple rules, but usually only the rule with the highest priority is used to take action.

We denote the total number of rules as N. We index all the fields as m = 0, 1, ..., M − 1, where M is the total number of packet header fields. The classic packet classification [2] requires M = 5 fields to be examined, while the OpenFlow table lookup [1] checks in total M = 15 fields of the packet header. In Table 1, we show an OpenFlow 15-field rule set consisting of 4 rules (omitting the actions) as an example. Our methodology in this paper can be applied to packet classification involving more than 15 fields.

Table 1: An example of OpenFlow packet classification rule set [1], N = 4 rules, M = 15 fields (shown here with one row per field and one column per RID)

Field      No. of bits  R0                 R1                 R2                 R3
Ingr       32           5                  *                  *                  *
Metadata   64           1024               1024               2048               *
Eth src    48           00:13:A9:00:42:40  00:FF:FF:FF:FF:FF  *                  00:13:E6:24:5F:31
Eth dst    48           00:13:08:C6:54:06  00:13:08:C6:54:06  00:FF:FF:FF:FF:FF  11:7B:C5:98:F0:FF
Eth type   16           0x0800             0x0800             0x8100             *
VID        12           *                  100                4095               *
Vprty      3            5                  7                  7                  *
MPLS lbl   20           0                  163                *                  *
MPLS tfc   3            *                  0                  *                  *
SA         32           001*               00*                1*                 *
DA         32           *                  1011*              1011*              *
Prtl       8            TCP                UDP                *                  *
ToS        6            0                  *                  *                  *
SP         16           *                  *                  2-1024             *
DP         16           *                  *                  5-5                80

2.2. Range Match

We denote the field requiring prefix match as prefix match field, while we define the “projection” of the rule in this field as prefix match rule. For example, “001*” is a prefix match rule in the SA field of the rule set in Table 1. Similarly, exact match field, exact match rule, range match field, and range match rule can be defined.

Many existing packet classification engines are optimized for prefix match and exact match [5], especially when TCAM is used. For range match, they usually require range-to-prefix translations [4]. This leads to rule set expansion: for instance, a range¹ [1, 8) in a 3-bit field corresponds to a union of 3 prefixes: {001, 01*, 1*}. Note, however, that any given prefix or exact value can be represented by a single (generic) range. Hence, in this paper, we propose a PE that matches ranges; this means our overall architecture can perform prefix match, exact match, and range match on any field of a packet header without rule set expansion.

3. ARCHITECTURE

To handle all types of matches, prefixes or exact values are first translated into ranges; this step is trivial so we ignore it in this paper. Suppose we have W_m-bit ranges, where m = 0, 1, ..., M − 1. A naive approach to match ranges is to deploy a W_m-bit comparator for each range boundary. However, since (1) W_m can be relatively large (e.g. 64 bits), and (2) the critical path in the comparator is O(W_m), this naive approach often results in a low clock rate.

The key idea of our approach in this section is to split a range boundary in a long field into multiple shorter strides; this leads to shorter critical paths and a higher clock rate. Another major difference between this work and prior works [2, 4, 6] is that our architecture is self-aware and can be tuned for better power consumption (see Section 4).

3.1. Modular PE

We construct a modular Processing Element (PE) to compare an s-bit stride of the input packet header against n strides of n range boundaries independently. We show the modular PE in Figure 1, where clock and control signals are omitted for simplicity. We denote the number of range boundaries compared by a modular PE as c. The modular PE in Figure 1 compares an s-bit stride of the packet header

Fig. 1: A modular PE comparing an s-bit stride with c = 2 range boundaries in parallel. (Figure not reproduced. Its surviving labels show, for each range boundary i = 0, 1, a comparator fed from a data memory, with registered signals en_i, eql_i, and less_i passed from input to output, and the header stride passed from f_in to f_out.)

¹ Without loss of generality, we use half-closed, half-open ranges.
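The rule set expansion described in Section 2.2 can be reproduced with a short sketch. The function name `range_to_prefixes` and the greedy covering strategy are illustrative assumptions, not the translation used in [4]; the sketch repeatedly takes the largest aligned power-of-two block that fits in the remaining half-open range.

```python
from __future__ import annotations


def range_to_prefixes(lo: int, hi: int, width: int) -> list[str]:
    """Cover the half-open range [lo, hi) of a width-bit field with prefixes.

    Hypothetical helper for illustration only.
    """
    prefixes = []
    while lo < hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... shrunk until it fits inside what is left of the range.
        while size > hi - lo:
            size >>= 1
        free_bits = size.bit_length() - 1      # low-order bits left as wildcard
        fixed_bits = width - free_bits         # high-order bits spelled out
        head = format(lo >> free_bits, "b").zfill(fixed_bits) if fixed_bits else ""
        prefixes.append(head + ("*" if free_bits else ""))
        lo += size
    return prefixes


if __name__ == "__main__":
    # The 3-bit example from Section 2.2: [1, 8) needs three prefixes.
    print(range_to_prefixes(1, 8, 3))   # ['001', '01*', '1*']
```

A single range therefore turns into several TCAM-style entries, which is the expansion that a range-matching PE avoids.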
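The stride-splitting idea of Section 3 can also be sketched in software. This is one plausible reading, under stated assumptions, and not the authors' hardware description: the signal names `eql` and `less` are borrowed from the labels of Figure 1, while the helper names (`split_strides`, `pe_stage`, `less_than`, `in_range`), the most-significant-stride-first order, and the requirement that s divides the field width are all assumptions made for illustration.

```python
from __future__ import annotations


def split_strides(value: int, width: int, s: int) -> list[int]:
    """Split a width-bit value into s-bit strides, most significant first."""
    assert width % s == 0, "sketch assumes s divides the field width"
    mask = (1 << s) - 1
    return [(value >> shift) & mask for shift in range(width - s, -1, -s)]


def pe_stage(x: int, y: int, eql_in: bool, less_in: bool) -> tuple[bool, bool]:
    """One stage: compare header stride x with boundary stride y.

    eql: all strides seen so far are equal.
    less: the header is already known to be below the boundary.
    """
    less_out = less_in or (eql_in and x < y)
    eql_out = eql_in and x == y
    return eql_out, less_out


def less_than(value: int, boundary: int, width: int, s: int) -> bool:
    """Chain the stages; each one only needs an s-bit comparison."""
    eql, less = True, False
    strides = zip(split_strides(value, width, s), split_strides(boundary, width, s))
    for x, y in strides:
        eql, less = pe_stage(x, y, eql, less)
    return less


def in_range(value: int, lo: int, hi: int, width: int, s: int) -> bool:
    """Match the half-open range [lo, hi) using two boundary comparisons."""
    assert hi < (1 << width), "sketch assumes the upper boundary fits in the field"
    return (not less_than(value, lo, width, s)) and less_than(value, hi, width, s)


if __name__ == "__main__":
    # A 16-bit field with 4-bit strides; the range [2, 1025) is illustrative.
    print(in_range(80, 2, 1025, width=16, s=4))
    print(in_range(2048, 2, 1025, width=16, s=4))
```

Each stage performs only an s-bit comparison, which mirrors the argument that shorter strides give shorter critical paths than a single W_m-bit comparator. Once `eql` becomes false, the outcome for that boundary is already decided and later stages cannot change it.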
