  1. Make the Most out of Last Level Cache in Intel Processors. Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*. *KTH Royal Institute of Technology (EECS/COM), +Ericsson Research

  2. Motivation

  3. Motivation: Some of these services demand bounded low latency and predictable service time.

  4. Motivation

  5. Motivation: A server receiving 64 B packets at 100 Gbps has only 5.12 ns to process a packet before the next one arrives.
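  As a quick sanity check, the 5.12 ns budget follows directly from the packet size and the line rate (this ignores the Ethernet preamble and inter-frame gap, which would add a little extra slack per packet):

```latex
64\,\mathrm{B} \times 8\,\tfrac{\mathrm{bit}}{\mathrm{B}} = 512\,\mathrm{bit},
\qquad
\frac{512\,\mathrm{bit}}{100\,\mathrm{Gbit/s}} = 5.12\,\mathrm{ns}
```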

  6. Motivation (figure: transistor count in thousands and single-thread performance in SpecINT x 10^3 versus year, 1980-2020)

  7. Motivation: It is essential to use our current hardware more efficiently. (same figure: transistor count and single-thread performance versus year, 1980-2020)

  8. Memory Hierarchy: Registers take <4 CPU cycles; cache (L1, L2, LLC) takes 4-40 cycles; DRAM takes >200 cycles (>60 ns); each level down the hierarchy gets slower. For a CPU running at 3.2 GHz, 4 cycles take around 1.25 ns.
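  Spelling out the cycle-to-time conversion used on this slide for a 3.2 GHz clock:

```latex
t_{\mathrm{cycle}} = \frac{1}{3.2\,\mathrm{GHz}} = 0.3125\,\mathrm{ns},
\qquad
4\,t_{\mathrm{cycle}} = 1.25\,\mathrm{ns},
\qquad
200\,t_{\mathrm{cycle}} = 62.5\,\mathrm{ns}
```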

  9. Memory Hierarchy: To keep up with the 100 Gbps time budget (5.12 ns), the cache becomes valuable, as every access to DRAM (>200 cycles, i.e., >60 ns) is expensive.

  10. Memory Hierarchy: We focus on better management of the cache.

  11. Better Cache Management: Reduce tail latencies of NFV service chains running at 100 Gbps by up to 21.5%.

  12. Last Level Cache (LLC) (figure: Intel processor)

  13. Non-uniform Cache Architecture (NUCA): Since Sandy Bridge (~2011), the LLC is not unified any more! (figure: Intel processor)

  14. Non-uniform Cache Architecture (NUCA): Intel's Complex Addressing determines the mapping between the memory address space and the LLC slices; almost every cache line (64 B) maps to a different LLC slice. Known methods for recovering this mapping: performance counters (Clémentine Maurice et al. [RAID '15]*). * Clémentine Maurice, Nicolas Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters.
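  For a power-of-two number of slices, the reverse-engineered mapping boils down to XOR-ing selected physical-address bits. The C sketch below only illustrates the shape of such a function: the masks are placeholders (the real, model-specific masks must be taken from Maurice et al. or re-derived with the uncore performance counters), and slice_of() is a hypothetical helper name reused in later sketches.

```c
#include <stdint.h>

/* Placeholder masks, one per slice-index bit.  The real values are
 * CPU-model specific (see Maurice et al. [RAID '15]); these are NOT them. */
static const uint64_t SLICE_MASKS[3] = { 0x0, 0x0, 0x0 };

/* Each slice-index bit is the parity (XOR) of the physical-address bits
 * selected by the corresponding mask. */
static unsigned slice_of(uint64_t phys_addr)
{
    unsigned slice = 0;

    for (unsigned b = 0; b < 3; b++)   /* 3 index bits -> up to 8 slices */
        slice |= (unsigned)__builtin_parityll(phys_addr & SLICE_MASKS[b]) << b;
    return slice;
}
```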

  15. Measuring Access Time to LLC Slices: Different LLC slices have different access times. (Intel Xeon E5-2667 v3, Haswell)

  16. Measuring Access Time to LLC Slices (figure: read access time from Core 0 to all LLC slices)
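  A minimal sketch of the timing primitive behind such a measurement, assuming the thread is pinned to Core 0 and a separate helper has already placed the target line in the LLC slice of interest (and evicted it from L1/L2); this is not the authors' code:

```c
#include <stdint.h>
#include <x86intrin.h>

/* Time one load in TSC cycles; the fences keep the measured load strictly
 * between the two rdtsc reads. */
static inline uint64_t timed_load(volatile uint64_t *p)
{
    uint64_t start, end;

    _mm_mfence();     /* drain earlier stores                    */
    _mm_lfence();     /* wait for earlier loads                  */
    start = __rdtsc();
    _mm_lfence();     /* do not start the load before the rdtsc  */
    (void)*p;         /* the access being measured               */
    _mm_lfence();     /* wait for the load to complete           */
    end = __rdtsc();
    return end - start;
}
```

  Averaging many such samples per slice and converting cycles to nanoseconds at the core frequency gives a per-slice latency profile like the one plotted on this slide.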

  17. Opportunity: Accessing the closer LLC slice can save up to ~20 cycles, i.e., 6.25 ns for a CPU running at 3.2 GHz.

  18. Slice-aware Memory Management: Allocate physical memory such that it maps to the appropriate LLC slice(s). (figure: DRAM mapping to LLC slices)
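  A minimal sketch of this idea, assuming a pinned, physically contiguous buffer (e.g., a hugepage), its physical base address, and the slice_of() hash sketched earlier; slice_pool and build_pool are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64
#define MAX_LINES  (1u << 15)

struct slice_pool {
    void  *lines[MAX_LINES];   /* 64 B lines that all map to one slice */
    size_t n;
};

unsigned slice_of(uint64_t phys_addr);   /* assumed: complex-addressing hash */

/* Walk the buffer cache line by cache line and keep only the lines whose
 * physical address maps to the target LLC slice. */
static void build_pool(struct slice_pool *pool, uint8_t *virt, uint64_t phys,
                       size_t len, unsigned target_slice)
{
    pool->n = 0;
    for (size_t off = 0; off + CACHE_LINE <= len && pool->n < MAX_LINES;
         off += CACHE_LINE)
        if (slice_of(phys + off) == target_slice)
            pool->lines[pool->n++] = virt + off;
}
```

  An allocator built on top of such a pool hands out memory that is cached in the chosen slice, e.g., the one closest to the core that will use it.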

  19. Slice-aware Memory Management Use Cases: Isolation

  20. Slice-aware Memory Management Use Cases: Isolation, Shared Data

  21. Slice-aware Memory Management Use Cases: Isolation, Shared Data, Performance

  22. Slice-aware Memory Management Use Cases: Isolation, Shared Data, Performance. Every core is associated with its closest LLC slice.

  23. Slice-aware Memory Management (figure: cache sizes of 256 KB, 2.5 MB, and 20 MB)

  24. Slice-aware Memory Management: Beneficial when the working set can fit into a slice.

  25. Slice-aware Memory Management: There are many applications that have this characteristic.

  26. Slice-aware Memory Management: Many applications have this characteristic, e.g., key-value stores (frequently accessed keys).

  27. Slice-aware Memory Management: Many applications have this characteristic, e.g., key-value stores (frequently accessed keys) and virtualized network functions (packet headers).

  28. Slice-aware Memory Management: Many applications have this characteristic, e.g., key-value stores (frequently accessed keys) and virtualized network functions (packet headers); these working sets can fit into a slice.

  29. Slice-aware Memory Management: Many applications have this characteristic, e.g., key-value stores (frequently accessed keys) and virtualized network functions (packet headers). We focus on virtualized network functions in this talk!

  30. CacheDirector: A network I/O solution that extends Data Direct I/O (DDIO) by employing slice-aware memory management.

  31. Traditional I/O: (1) The NIC DMAs* packets to DRAM; (2) the CPU then fetches them into the LLC. * Direct Memory Access (DMA)

  32. Data Direct I/O (DDIO): DMAs* packets directly into the LLC rather than DRAM. (figure: sending/receiving packets via DDIO) * Direct Memory Access (DMA)

  33. Data Direct I/O (DDIO): Packets go to random slices! (figure: sending/receiving packets via DDIO)


  35. CacheDirector (figure: sending/receiving packets via DDIO vs. sending/receiving packets via CacheDirector)


  37. CacheDirector • Sends the packet's header to the appropriate LLC slice. • Implemented as part of the user-space NIC drivers in the Data Plane Development Kit (DPDK). • Introduces dynamic headroom in DPDK data structures (see the sketch below).
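  A hedged sketch of the dynamic-headroom idea on top of DPDK's rte_mbuf (field names as in recent DPDK releases); slice_of() and closest_slice[] are assumed helpers, and this is not the authors' exact implementation:

```c
#include <rte_mbuf.h>

unsigned slice_of(rte_iova_t addr);                 /* assumed: complex-addressing hash */
extern const unsigned closest_slice[RTE_MAX_LCORE]; /* assumed: core -> nearest slice   */

/* Pick a cache-line-aligned data offset inside the headroom so that the
 * first cache line of the packet (its header) maps to the LLC slice
 * closest to the core that will process it. */
static void place_header(struct rte_mbuf *m, unsigned lcore)
{
    unsigned target = closest_slice[lcore];

    for (uint16_t off = 0; off <= RTE_PKTMBUF_HEADROOM; off += RTE_CACHE_LINE_SIZE)
        if (slice_of(m->buf_iova + off) == target) {
            m->data_off = off;              /* trade some headroom for placement */
            return;
        }
    m->data_off = RTE_PKTMBUF_HEADROOM;     /* no match: keep the default */
}
```

  Such an offset could be applied when refilling RX descriptors, so that the NIC's DDIO write lands the header in the desired slice.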

  38. Evaluation - Testbed: A packet generator connected at 100 Gbps to the device under test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4).

  39. Evaluation - Testbed: The packet generator replays an actual campus trace and timestamps packets; 100 Gbps link to the device under test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4).

  40. Evaluation - Testbed: The device under test runs a stateful NFV service chain built with Metron [NSDI '18]*; the packet generator replays an actual campus trace at 100 Gbps (Intel Xeon E5-2667 v3, Mellanox ConnectX-4). * Georgios P. Katsikas, Tom Barbette, Dejan Kostić, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.

  41. Evaluation – 100 Gbps: Stateful NFV service chain; achieved throughput ~76 Gbps.

  42. Evaluation – 100 Gbps: Stateful NFV service chain; achieved throughput ~76 Gbps; 21.5% improvement in tail latency.

  43. Evaluation – 100 Gbps: The 21.5% improvement comes from faster access to the packet header, which leads to faster processing time per packet and reduced queueing time.

  44. Evaluation – 100 Gbps: Faster access to the packet header, faster processing time per packet, and reduced queueing time make performance more predictable, with fewer SLO* violations. * Service Level Objective (SLO)

  45. Read More … • More NFV results • Slice-aware key-value store • Portability of our solution to the Skylake architecture • Slice isolation vs. Cache Allocation Technology (CAT) • More …

  46. Conclusion • A hidden opportunity that can decrease the average access time to the LLC by ~20% • Useful for other applications: https://github.com/aliireza/slice-aware • Meet us at the poster session. This work is supported by WASP, SSF, and ERC.

  47. Backup

  48. Portability • Intel Xeon Gold 6134 (Skylake) • Mesh architecture • 8 cores and 18 slices • Non-inclusive LLC • Does not affect DDIO

  49. Packet Header Sizes • IPv4: 14 B (Ethernet) + 20 B (IPv4) + 20 B (TCP) = 54 B < 64 B • IPv6: 14 B (Ethernet) + 40 B (IPv6) + 20 B (TCP) = 74 B > 64 B • Any 64 B of the packet can be placed in the appropriate slice.

  50. Limitations and Considerations • Data larger than 64 B: use a linked list and scatter the data across slice-local chunks (see the sketch below). • Future H/W features: bigger chunks (e.g., 4 kB pages), programmability. • Slice imbalance: limits our application to a smaller portion of the LLC, but with faster access.
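  A minimal sketch of the linked-list/scatter idea, assuming a hypothetical slice_alloc() that returns one 64 B line from a slice-local pool such as the one built earlier:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_PAYLOAD (64 - sizeof(void *))

/* One cache line = one chunk: a next pointer plus the remaining payload bytes. */
struct chunk {
    struct chunk *next;
    uint8_t data[CHUNK_PAYLOAD];
};

void *slice_alloc(void);   /* assumed: returns one 64 B line mapping to the target slice */

/* Scatter an object larger than 64 B across slice-local cache lines. */
static struct chunk *store_scattered(const uint8_t *src, size_t len)
{
    struct chunk *head = NULL, **tail = &head;

    while (len > 0) {
        struct chunk *c = slice_alloc();
        size_t n = len < CHUNK_PAYLOAD ? len : CHUNK_PAYLOAD;

        memcpy(c->data, src, n);
        c->next = NULL;
        *tail = c;
        tail = &c->next;
        src += n;
        len -= n;
    }
    return head;
}
```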

  51. Relevant and Future Work • NUCA • Cache-aware memory management (e.g., partitioning and page coloring) • Extending CacheDirector to the whole packet • Slice-aware hypervisor

  52. Slice-aware Memory Management

  53. Evaluation – Low Rate: Simple forwarding application at 1000 packets/s.

  54. Evaluation – Tail Latency vs. Throughput: CacheDirector slightly shifts the knee, which means it is still beneficial when the system is experiencing a moderate load.

