Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*
* KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science (EECS)
+ Ericsson Research

Traditional I/O
1. The I/O device DMAs* packets to main memory.
2. The CPU later fetches them into the cache.
Inefficient:
• Large number of accesses to main memory
• High access latency (>60 ns)
• Unnecessary memory bandwidth usage
* Direct Memory Access (DMA)
2020-07-02

Direct Cache Access (DCA)
1. The I/O device DMAs packets to main memory.
2. DCA exploits TPH* to prefetch a portion of the packets into the cache.
3. The CPU later fetches them from the cache.
• Still inefficient in terms of memory bandwidth usage
• Requires OS intervention and support from the processor
* PCIe Transaction protocol Processing Hint (TPH)

Intel Data Direct I/O (DDIO)
• DDIO has been available in Xeon processors since the Xeon E5
• DMAs packets or descriptors directly to/from the Last Level Cache (LLC)

Trends
More in-network computing + offloading capabilities
Push costly calculations into the network and perform stateful functions at the processor, which makes applications more I/O intensive.

Pressure from these trends
• More in-network computing + offloading capabilities
• Faster link speeds: the interarrival time of packets continues to shrink
Every 6.72 ns a new (64-B + 20-B*) packet arrives at 100 Gbps.
Multi-hundred-gigabit networks cannot tolerate memory accesses.
* 7-B preamble + 1-B start-of-frame delimiter + 12-B inter-frame gap = 20 B

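The 6.72 ns figure follows directly from the frame size and link speed given on the slide. The snippet below is a minimal sketch of that arithmetic; the function name and the 60 ns memory latency constant (taken from the ">60 ns" bullet on the Traditional I/O slide) are illustrative choices, not part of the original deck.

```python
def interarrival_ns(frame_bytes, link_gbps, overhead_bytes=20):
    """Time between back-to-back frames on the wire, in nanoseconds.

    overhead_bytes covers the 7-B preamble, 1-B start-of-frame
    delimiter, and 12-B inter-frame gap (20 B in total).
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    # bits divided by Gbit/s gives nanoseconds
    return bits_on_wire / link_gbps


MEMORY_LATENCY_NS = 60  # lower bound quoted on the Traditional I/O slide

for gbps in (100, 200):
    gap = interarrival_ns(64, gbps)
    # How many minimum-size packets arrive during one memory access
    arrivals = MEMORY_LATENCY_NS / gap
    print(f"{gbps} Gbps: {gap:.2f} ns per packet, "
          f"about {arrivals:.1f} arrivals per {MEMORY_LATENCY_NS} ns access")
```

At 100 Gbps this gives 6.72 ns per minimum-size packet, so several packets arrive in the time a single main-memory access takes.
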
DCA matters because
Without DCA, we are unable to process I/O at line rate, thus increasing packet loss or latency when utilizing multi-hundred-gigabit networks.

Forwarding Packets at 100 Gbps
Setup: a packet generator is connected at 100 Gbps to the device under test, which forwards the packets.
Device under test: Intel Xeon Gold 6140 with Mellanox ConnectX-5; each NIC is placed in a PCIe 3.0 x16 slot*.
[Figure: 99th percentile latency (µs) versus rate (100 Gbps, 200 Gbps)]
* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.

What happens at 200 Gbps?
Setup: a packet generator is connected over 2x100 Gbps links to the device under test, which forwards the packets.
Device under test: Intel Xeon Gold 6140 with Mellanox ConnectX-5; each NIC is placed in a PCIe 3.0 x16 slot*.
When forwarding at 200 Gbps, latency is 30% higher for the NIC forwarding at 100 Gbps.
[Figure: 99th percentile latency (µs) of the first NIC when forwarding at the indicated aggregate rate (100 Gbps, 200 Gbps)]
* A PCIe 3.0 x16 slot is capable of providing ~125 Gbps effective full-duplex bandwidth.

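As a rough sanity check on the PCIe footnote, the sketch below computes only the encoding-limited upper bound of a PCIe 3.0 x16 link per direction (8 GT/s per lane with 128b/130b encoding). It does not model packet header or flow-control overhead, so it is an upper bound rather than the effective figure quoted on the slide; the function name is illustrative.

```python
def pcie3_encoded_gbps(lanes=16, gt_per_s=8.0):
    """Upper bound on PCIe 3.0 throughput per direction, in Gbps.

    Accounts only for the 128b/130b line encoding; transaction-layer
    and data-link-layer overheads are deliberately ignored.
    """
    return lanes * gt_per_s * 128 / 130


bound = pcie3_encoded_gbps()
print(f"PCIe 3.0 x16 encoding-limited bound: {bound:.2f} Gbps per direction")
print(f"Headroom over a 100 Gbps NIC: {bound - 100:.2f} Gbps")
```

The bound is in the same range as the ~125 Gbps on the slide, which suggests a single 100 Gbps NIC per slot is not limited by the PCIe link itself.
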
How does DDIO work?
Writing packets/descriptors:
• DDIO overwrites a cache line if it is already present in any LLC way (≡ write update or hit).
• Otherwise, DDIO allocates a cache line in a limited portion of the LLC (≡ write allocate or miss).
Reading packets/descriptors:
• The NIC reads a cache line if it is already present in any LLC way (≡ read hit).
• Otherwise, the NIC reads it from main memory (≡ read miss).
[Diagram: CPU socket whose cores (C) share a logical LLC; packets are sent/received via DDIO]

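The four cases above can be illustrated with a toy model of a single LLC set. This is a sketch under stated assumptions, not a description of the hardware: the class name, the round-robin victim choice, and the particular ways reserved for DDIO allocation are all hypothetical (which ways DDIO really uses is exactly what the micro-benchmarks investigate).

```python
class ToyLLCSet:
    """Toy model of one LLC set, to illustrate DDIO hit/miss cases."""

    def __init__(self, num_ways, ddio_way_indices):
        self.ways = [None] * num_ways            # one tag per way
        self.ddio_way_indices = ddio_way_indices  # ways DDIO may allocate in
        self._next = 0                            # round-robin pointer (assumption)

    def cpu_fill(self, way, tag):
        """Place a line in an arbitrary way, as a core's own access might."""
        self.ways[way] = tag

    def ddio_write(self, tag):
        # Write update: the line is present in ANY way, so overwrite in place
        if tag in self.ways:
            return "write update"
        # Write allocate: only the limited DDIO portion may receive a new line
        victim = self.ddio_way_indices[self._next]
        self.ways[victim] = tag
        self._next = (self._next + 1) % len(self.ddio_way_indices)
        return "write allocate"

    def nic_read(self, tag):
        # Read hit if present in any way; otherwise served from main memory
        return "read hit" if tag in self.ways else "read miss"


# 11 ways as drawn on the slides; DDIO ways (9, 10) are a hypothetical choice
llc = ToyLLCSet(num_ways=11, ddio_way_indices=(9, 10))
trace = [llc.ddio_write("A"), llc.ddio_write("A")]
llc.cpu_fill(0, "B")
trace += [llc.ddio_write("B"), llc.ddio_write("C"), llc.ddio_write("D")]
trace += [llc.nic_read("A"), llc.nic_read("B")]
print(trace)
```

In this trace, line "A" is evicted once two further lines are allocated, because allocations compete for only two ways, while line "B" survives because it lives outside the DDIO portion.
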
How does DDIO work?
We designed a set of micro-benchmarks to learn about DDIO:
• Which ways are used for allocation?
• How does DDIO interact with other applications?
• Does DMA via a remote CPU socket pollute the LLC?

LLC ways used by DDIO
Setup: an I/O application runs on core C0 and a cache-sensitive application+ runs on core C1. Both share the logical LLC (ways 1 to 11), and packets are sent/received via DDIO. Use CAT* to limit code/data.
[Figure: sum of cache misses (million) versus the ways allocated by CAT to the cache-sensitive application (1,2; 2,3; 3,4; 4,5; 5,6; 6,7; 7,8; 8,9; 9,10; 10,11)]
Contention with code/data causes a rise in the cache misses of the I/O application.
* Cache Allocation Technology
+ water_nsquared from Splash-3 benchmark

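The sweep on the figure's x-axis moves a two-way window across the 11 LLC ways. CAT expresses such an allocation as a contiguous capacity bitmask, so the sweep can be sketched as below. The mapping of slide way number i to bit i-1 is an assumption made for illustration, and applying the mask to the hardware is not shown.

```python
TOTAL_WAYS = 11  # ways drawn on the slide


def way_mask(first_way, width):
    """Contiguous bitmask covering `width` ways starting at `first_way`.

    Ways are 1-indexed as on the slide; way i maps to bit i-1
    (an assumed ordering, chosen only for illustration).
    """
    if first_way < 1 or first_way + width - 1 > TOTAL_WAYS:
        raise ValueError("window does not fit in the LLC")
    return ((1 << width) - 1) << (first_way - 1)


# Slide the two-way window: (1,2), (2,3), ..., (10,11)
sweep = [(w, w + 1, way_mask(w, 2)) for w in range(1, TOTAL_WAYS)]
for low, high, mask in sweep:
    print(f"ways {low},{high}: mask 0x{mask:03x} ({mask:011b})")
```

Each of the ten masks has exactly two adjacent bits set, matching the ten positions on the figure's x-axis.
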