EMOMA: Exact Match in One Memory Access

Salvatore Pontarelli, Pedro Reviriego, Michael Mitzenmacher

• S. Pontarelli is with Consorzio Nazionale Interuniversitario per le Telecomunicazioni (CNIT), Via del Politecnico 1, 00133 Rome, Italy. E-mail: salvatore.pontarelli@uniroma2.it
• P. Reviriego is with Universidad Antonio de Nebrija, C/ Pirineos, 55, E-28040 Madrid, Spain. E-mail: previrie@nebrija.es
• M. Mitzenmacher is with Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA. E-mail: michaelm@eecs.harvard.edu

Manuscript submitted 13 Sept. 2017 and in revised form 17 Feb. 2018.

Abstract—An important function in modern routers and switches is to perform a lookup for a key. Hash-based methods, and in particular cuckoo hash tables, are popular for such lookup operations, but for large structures stored in off-chip memory, such methods have the downside that they may require more than one off-chip memory access per lookup. Although the number of off-chip memory accesses can be reduced using on-chip approximate membership structures such as Bloom filters, some lookups may still require more than one off-chip memory access. This can be problematic for some hardware implementations, as having only a single off-chip memory access enables predictable processing of lookups and avoids the need to queue pending requests. We provide a data structure for hash-based lookups based on cuckoo hashing that uses only one off-chip memory access per lookup, by utilizing an on-chip pre-filter to determine which of multiple locations holds a key. We make particular use of the flexibility to move elements within a cuckoo hash table to ensure that the pre-filter always gives the correct response. While this requires a slightly more complex insertion procedure and some additional memory accesses during insertions, it is suitable for most packet processing applications, where key lookups are much more frequent than insertions. An important feature of our approach is its simplicity: it is based on simple logic that can be easily implemented in hardware, and hardware implementations would benefit most from the single off-chip memory access per lookup.

1 INTRODUCTION

Packet classification is a key function in modern routers and switches, used for example for routing, security, and quality of service [1]. In many of these applications, the packet is compared against a set of rules or routes. The comparison can be an exact match, as for example in Ethernet switching, or it can be a match with wildcards, as in longest prefix match (LPM) or in a firewall rule. The exact match can be implemented using a Content Addressable Memory (CAM) and the match with wildcards with a Ternary Content Addressable Memory (TCAM) [2], [3]. However, these memories are costly in terms of circuit area and power, and therefore alternative solutions based on hashing techniques using standard memories are widely used [4]. In particular, for exact match, cuckoo hashing provides an efficient solution with close to full memory utilization and a low and bounded number of memory accesses per match [5]. For other functions that use match with wildcards, schemes that rely on several exact matches have also been proposed. For example, for LPM a binary search on prefix lengths can be used, where an exact match is performed for each length [6]. More general schemes have been proposed to implement matches with wildcards that emulate TCAM functionality using hash-based techniques [7]. In addition to reducing circuit complexity and power consumption, the use of hash-based techniques provides additional flexibility that is beneficial to support programmability in software defined networks [8].

High speed routers and switches are expected to process packets with low and predictable latency and to perform updates in their tables without affecting traffic. To achieve those goals, they commonly use hardware in the form of Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) [8], [9]. The logic in those circuits has to be simple to be able to process packets at high speed. The time needed to process a packet also has to be small, with a predictable worst case. For example, for multiple-choice hashing schemes such as cuckoo hashing, multiple memory locations can be accessed in parallel so that the operation completes in one access cycle [8]. This reduces latency and can simplify the hardware implementation by minimizing queueing and conflicts.
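To make the lookup cost concrete, the following C sketch (illustrative only: the bucket size, table sizes, and hash functions h1/h2 are assumptions, not taken from the paper) shows a two-choice cuckoo lookup that probes both candidate buckets. When the buckets are held in on-chip memory, a hardware implementation can issue the two reads in parallel within one access cycle; when the table is stored off-chip, each probe instead costs a separate external memory access.

#include <stdint.h>
#include <stdbool.h>

#define NUM_BUCKETS        (1u << 12)   /* buckets per table (illustrative) */
#define ENTRIES_PER_BUCKET 4            /* cells per bucket (illustrative)  */

typedef struct {
    uint64_t key[ENTRIES_PER_BUCKET];
    uint32_t value[ENTRIES_PER_BUCKET];
    bool     used[ENTRIES_PER_BUCKET];
} bucket_t;

/* Two candidate tables, one per hash function. */
static bucket_t table[2][NUM_BUCKETS];

/* Simple multiplicative hashes, used only for illustration. */
static uint32_t h1(uint64_t k) { return (uint32_t)((k * 0x9E3779B97F4A7C15ull) >> 40) % NUM_BUCKETS; }
static uint32_t h2(uint64_t k) { return (uint32_t)((k * 0xC2B2AE3D27D4EB4Full) >> 40) % NUM_BUCKETS; }

/* Returns true and fills *value if the key is stored in either candidate bucket.
 * In an ASIC/FPGA the two bucket reads would be issued in parallel; here they
 * are shown sequentially for clarity. */
bool cuckoo_lookup(uint64_t key, uint32_t *value)
{
    for (int t = 0; t < 2; t++) {                       /* the two choices */
        uint32_t idx = (t == 0) ? h1(key) : h2(key);
        const bucket_t *b = &table[t][idx];
        for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
            if (b->used[i] && b->key[i] == key) {
                *value = b->value[i];
                return true;
            }
        }
    }
    return false;
}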
Both ASICs and FPGAs have internal memories that can be accessed with low latency but that have a limited size. They can also be connected to much larger external memories that have a much longer access time. Some tables used for packet processing are necessarily large and need to be stored in the external memory, limiting the speed of packet processing [10]. While parallelization may again seem like an approach to hold operations to one memory access cycle, for external memories parallelization can have a huge cost in terms of hardware design complexity. Parallel access to external memories would typically require different memory chips to perform parallel reads and different buses to exchange addresses and data between the network device and the external memories, so a significant number of I/O pins is needed to drive the address/data buses of multiple memory chips. Unfortunately, switch chips have a limited pin count, and it seems that this limitation will persist over the next decade [11]. While the memory I/O interface must work at high speed, parallelization is often unaffordable from the point of view of the hardware design. When a single external memory is used, the time needed to complete a lookup depends on the number of external memory accesses. This makes the hardware implementation more complex if lookups are not always completed in one memory access cycle, and hence finding methods where lookups complete with a single external memory access is of particular interest.
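As a rough illustration of the idea summarized in the abstract, the C sketch below models the lookup path with a small on-chip selector array standing in for the pre-filter: it indicates which of the two candidate buckets to read, so every lookup performs exactly one off-chip access. The layout, sizes, and hash functions are illustrative assumptions, and the insertion procedure that keeps the selector's answer correct for every stored key (by moving elements within the cuckoo hash table, as described in the abstract) is not shown.

#include <stdint.h>
#include <stdbool.h>

#define NUM_BUCKETS        (1u << 12)   /* buckets per off-chip table (illustrative) */
#define ENTRIES_PER_BUCKET 4
#define SELECTOR_CELLS     (1u << 16)   /* hypothetical on-chip budget */

typedef struct {
    uint64_t key[ENTRIES_PER_BUCKET];
    uint32_t value[ENTRIES_PER_BUCKET];
    bool     used[ENTRIES_PER_BUCKET];
} bucket_t;

static bucket_t offchip_table[2][NUM_BUCKETS];  /* kept in external memory */
static uint8_t  selector[SELECTOR_CELLS];       /* kept on-chip: 0 -> first table, 1 -> second */

static uint32_t h1(uint64_t k)   { return (uint32_t)((k * 0x9E3779B97F4A7C15ull) >> 40) % NUM_BUCKETS; }
static uint32_t h2(uint64_t k)   { return (uint32_t)((k * 0xC2B2AE3D27D4EB4Full) >> 40) % NUM_BUCKETS; }
static uint32_t hsel(uint64_t k) { return (uint32_t)((k * 0xFF51AFD7ED558CCDull) >> 40) % SELECTOR_CELLS; }

bool single_access_lookup(uint64_t key, uint32_t *value)
{
    /* On-chip step: decide which candidate bucket would hold the key. */
    int choice = selector[hsel(key)] ? 1 : 0;
    uint32_t idx = (choice == 0) ? h1(key) : h2(key);

    /* Off-chip step: the single external memory read of this lookup. */
    const bucket_t *b = &offchip_table[choice][idx];
    for (int i = 0; i < ENTRIES_PER_BUCKET; i++) {
        if (b->used[i] && b->key[i] == key) {
            *value = b->value[i];
            return true;
        }
    }
    return false;
}

Note that the selector only needs to be correct for keys that are actually stored: for a key not present in the table, whichever bucket is read simply produces a miss, so the single-access guarantee still holds.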