Make the Most out of Last Level Cache in Intel Processors. Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*. *KTH Royal Institute of Technology (EECS/COM), +Ericsson Research
Motivation
Motivation Some of these services demand bounded low latency and predictable service time.
Motivation A server receiving 64 B packets at 100 Gbps has only 5.12 ns to process a packet before the next packet arrives.
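As a sanity check, that budget comes directly from the line rate (counting only the 64 B frame itself; the Ethernet preamble and inter-frame gap would add roughly another 20 B per packet):

$$ t = \frac{64\,\text{B} \times 8\,\text{bit/B}}{100 \times 10^{9}\,\text{bit/s}} = 5.12\,\text{ns} $$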
Motivation [Figure: transistor counts (thousands) and single-thread performance (SpecINT × 10³) per year, 1980–2020]
Motivation It is essential to use our current hardware more efficiently. [Same figure.]
Memory Hierarchy [Diagram: registers, <4 CPU cycles; cache (L1, L2, LLC), 4-40 cycles; DRAM, >200 cycles (>60 ns); each level is slower than the one above.] For a CPU that is running at 3.2 GHz, every 4 cycles is around 1.25 ns.
Memory Hierarchy To keep up with the 100 Gbps time budget (5.12 ns), cache becomes valuable, as every access to DRAM is expensive. [Same diagram.]
Memory Hierarchy We focus on better management of the cache. [Same diagram.]
Better Cache Management Reduce tail latencies of NFV service chains running at 100 Gbps by up to 21.5%.
Last Level Cache (LLC) [Figure: Intel processor die]
Non-uniform Cache Architecture (NUCA) Since Sandy Bridge (~2011), the LLC is not unified any more! [Figure: Intel processor die]
Non-uniform Cache Architecture (NUCA) Intel's Complex Addressing determines the mapping between the memory address space and the LLC slices. Almost every cache line (64 B) maps to a different LLC slice. Known methods: performance counters (Clémentine Maurice et al. [RAID '15]*). * Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters.
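To make the counter-based method concrete, here is a minimal C sketch of the idea from Maurice et al., assuming a hypothetical helper read_cbo_llc_lookups() that returns the LLC_LOOKUP count of a given CBo (per-slice) uncore counter; the MSR/uncore-PMU plumbing behind that helper is not shown. The slice whose counter jumps while one cache line is hammered in a loop is the slice that line maps to.

#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush */

#define NUM_SLICES 8      /* e.g., an 8-core/8-slice Haswell-EP part */
#define ACCESSES   10000

/* Hypothetical helper: returns the LLC_LOOKUP event count of CBo `slice`.
 * In practice this means programming and reading the uncore PMU via MSRs,
 * which is omitted here. */
extern uint64_t read_cbo_llc_lookups(int slice);

/* Returns the LLC slice that the cache line at `addr` maps to. */
int find_slice(volatile uint8_t *addr)
{
    uint64_t before[NUM_SLICES], after[NUM_SLICES];

    for (int s = 0; s < NUM_SLICES; s++)
        before[s] = read_cbo_llc_lookups(s);

    /* Touch the same cache line many times; only its home slice
     * should see a matching burst of LLC lookups. */
    for (int i = 0; i < ACCESSES; i++) {
        _mm_clflush((const void *)addr);
        (void)*addr;
    }

    for (int s = 0; s < NUM_SLICES; s++)
        after[s] = read_cbo_llc_lookups(s);

    int best = 0;
    for (int s = 1; s < NUM_SLICES; s++)
        if (after[s] - before[s] > after[best] - before[best])
            best = s;
    return best;
}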
Measuring Access Time to LLC Slices Different access times to different LLC slices. Intel Xeon E5-2667 v3 (Haswell).
Measuring Access Time to LLC Slices Measuring read access time from Core 0 to all LLC slices.
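Below is a minimal sketch of the timing primitive such a measurement could use (rdtscp plus fences is the standard pattern for timing a single load); to attribute a number to a particular LLC slice, the measured line must already be resident in the LLC but evicted from the measuring core's private L1/L2, and the thread pinned to Core 0, neither of which is shown here.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence, _mm_mfence */

/* Times a single read of the cache line at `addr`, in TSC cycles. */
static inline uint64_t time_read(volatile uint8_t *addr)
{
    unsigned int aux;
    uint64_t start, end;

    _mm_mfence();                /* drain earlier memory operations */
    start = __rdtscp(&aux);
    (void)*addr;                 /* the timed read                  */
    _mm_lfence();                /* wait for the read to complete   */
    end = __rdtscp(&aux);
    return end - start;
}

Averaging many such reads over lines grouped by slice (e.g., with the find_slice() sketch above) reproduces the per-slice latency differences shown on this slide.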
Opportunity Accessing the closer LLC slice can save up to ~20 cycles, i.e., 6.25 ns, for a CPU that is running at 3.2 GHz.
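The cycle-to-time conversion behind these numbers (and the 1.25 ns on the memory-hierarchy slide) is simply cycles divided by clock frequency:

$$ \frac{20\ \text{cycles}}{3.2\ \text{GHz}} = 6.25\ \text{ns}, \qquad \frac{4\ \text{cycles}}{3.2\ \text{GHz}} = 1.25\ \text{ns} $$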
Slice-aware Memory Management Allocate memory from physical memory in such a way that it maps to the appropriate LLC slice(s). [Figure: DRAM]
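One possible way to realize this, sketched below under two assumptions: the buffer is backed by hugepages (so its physical layout stays stable) and a slice lookup such as the find_slice() sketch above is available. Cache lines that map to other slices are simply handed to other cores or left unused.

#include <stddef.h>
#include <stdint.h>

extern int find_slice(volatile uint8_t *addr);   /* e.g., the counter-based lookup above */

#define LINE 64

/* Scans `len` bytes starting at `base` (hugepage-backed) and collects up to
 * `max` cache lines that map to `target_slice`; returns how many were found. */
size_t alloc_for_slice(uint8_t *base, size_t len, int target_slice,
                       void **out, size_t max)
{
    size_t n = 0;
    for (size_t off = 0; off + LINE <= len && n < max; off += LINE) {
        if (find_slice(base + off) == target_slice)
            out[n++] = base + off;
    }
    return n;
}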
Slice-aware Memory Management Use cases: • Isolation • Shared data • Performance
Slice-aware Memory Management For the performance use case, every core is associated with its closest LLC slice.
Slice-aware Memory Management [Figure: cache sizes on the test CPU: L2 = 256 KB per core, one LLC slice = 2.5 MB, total LLC = 20 MB]
Slice-aware Memory Management Beneficial when the working set can fit into a slice. [Same figure.]
Slice-aware Memory Management There are many applications with this characteristic: key-value stores (the frequently accessed keys) and virtualized network functions (the packet's header); in both cases the hot data can fit into a slice.
Slice-aware Memory Management We focus on virtualized network functions in this talk!
CacheDirector A network I/O solution that extends Data Direct I/O (DDIO) by employing slice-aware memory management.
Traditional I/O 1. The NIC DMAs* packets to DRAM. 2. The CPU then fetches them into the LLC. * Direct Memory Access (DMA)
Data Direct I/O (DDIO) DMAs* packets directly to the LLC rather than DRAM. [Figure: sending/receiving packets via DDIO] * Direct Memory Access (DMA)
Data Direct I/O (DDIO) Packets go to random slices! [Figure: sending/receiving packets via DDIO]
CacheDirector [Figure: sending/receiving packets via DDIO vs. via CacheDirector]
CacheDirector • Sends the packet's header to the appropriate LLC slice. • Implemented as part of the user-space NIC drivers in the Data Plane Development Kit (DPDK). • Introduces dynamic headroom in DPDK data structures (sketched below).
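A simplified sketch of the dynamic-headroom idea, not the actual CacheDirector code: before an mbuf is handed to the NIC RX ring, the start of data is slid within the headroom so the packet header lands on a cache line mapping to the desired slice. It assumes a hypothetical addr_to_slice() lookup and glosses over details such as IOVA vs. virtual addresses and descriptor setup.

#include <rte_mbuf.h>

extern int addr_to_slice(const void *addr);   /* hypothetical slice lookup */

#define CACHE_LINE 64

/* Slides the data offset of `m` within its headroom so that the first 64 B
 * (the packet header) map to `target_slice`, i.e., the slice closest to the
 * core that will process the packet. Returns 0 on success, -1 otherwise. */
static int place_header_in_slice(struct rte_mbuf *m, int target_slice)
{
    char *base = (char *)m->buf_addr;

    for (uint16_t off = 0; off + CACHE_LINE <= RTE_PKTMBUF_HEADROOM;
         off += CACHE_LINE) {
        if (addr_to_slice(base + off) == target_slice) {
            m->data_off = off;    /* dynamic headroom */
            return 0;
        }
    }
    return -1;                    /* fall back to the default placement */
}

With DPDK's default 128 B headroom only two cache-line offsets are available, so in practice the headroom would have to be enlarged for every slice to be reachable, which is presumably part of what the dynamic headroom on this slide refers to.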
Evaluation – Testbed A packet generator is connected at 100 Gbps to the device under test (Intel Xeon E5-2667 v3, Mellanox ConnectX-4) running VNFs.
Evaluation – Testbed The generator replays an actual campus trace and timestamps packets; the device under test runs a stateful NFV service chain built with Metron [NSDI '18]*. * Georgios P. Katsikas, Tom Barbette, Dejan Kostić, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.
Evaluation – 100 Gbps Stateful NFV service chain; achieved throughput ~76 Gbps. CacheDirector gives a 21.5% improvement in tail latency.
Evaluation – 100 Gbps Faster access to the packet header means faster processing time per packet, which reduces queueing time, so service time becomes more predictable with fewer SLO* violations. * Service Level Objective (SLO)
Read More … • More NFV results • Slice-aware key-value store • Portability of our solution to the Skylake architecture • Slice isolation vs. Cache Allocation Technology (CAT) • More …
Conclusion • A hidden opportunity that can decrease the average access time to the LLC by ~20% • Useful for other applications • Code: https://github.com/aliireza/slice-aware • Meet us at the poster session. This work is supported by WASP, SSF, and ERC.
Backup
Portability • Intel Xeon Gold 6134 (Skylake) • Mesh architecture • 8 cores and 18 slices • Non-inclusive LLC (this does not affect DDIO)
Packet Header Sizes • IPv4: 14 B (Ethernet) + 20 B (IPv4) + 20 B (TCP) = 54 B < 64 B • IPv6: 14 B (Ethernet) + 40 B (IPv6) + 20 B (TCP) = 74 B > 64 B Any 64 B of the packet can be placed in the appropriate slice.
Limitations and Considerations • Data larger than 64 B: use a linked list and scatter the data (see the sketch below) • Future H/W features: bigger chunks (e.g., 4 kB pages), programmable placement • Slice imbalance: limiting our application to a smaller portion of the LLC, but with faster access.
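For the first point, a minimal illustrative sketch (one possible layout, not necessarily the paper's) of scattering data larger than 64 B across a chain of cache-line-sized chunks, each taken from a pool of lines that all map to the desired slice (e.g., built with the alloc_for_slice() sketch above):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One cache line (64 B): 56 B of payload plus a pointer to the next chunk. */
struct chunk {
    uint8_t      data[56];
    struct chunk *next;
};

/* Copies `len` bytes into a chain of chunks taken from `pool` (an array of
 * cache lines that all map to the desired LLC slice). Returns the head of
 * the chain, or NULL if the pool is too small. */
struct chunk *scatter(const uint8_t *src, size_t len,
                      struct chunk **pool, size_t pool_len)
{
    struct chunk *head = NULL, **tail = &head;
    size_t used = 0;

    while (len > 0) {
        if (used == pool_len)
            return NULL;
        struct chunk *c = pool[used++];
        size_t n = len < sizeof(c->data) ? len : sizeof(c->data);
        memcpy(c->data, src, n);
        c->next = NULL;
        *tail = c;
        tail = &c->next;
        src += n;
        len -= n;
    }
    return head;
}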
Relevant and Future Work • NUCA • Cache-aware memory management (e.g., partitioning and page coloring) • Extending CacheDirector to the whole packet • Slice-aware hypervisor
Evaluation – Low Rate Simple forwarding application, 1000 packets/s.
Evaluation – Tail vs. Throughput CacheDirector slightly shifts the knee, which means it is still beneficial when the system is experiencing a moderate load.