Fast Software Cache Design for Network Appliances

  1. Fast Software Cache Design for Network Appliances Dong Zhou, Huacheng Yu, Michael Kaminsky, David G. Andersen

  2. Flow Caching in Open vSwitch: the first layer is the Microflow Cache, an exact-match lookup in a single hash table.

  3. Flow Caching in Open vSwitch: example microflow (exact-match) entries:
     srcAddr=10.1.2.3, dstAddr=12.4.5.6, srcPort=15213, dstPort=80 -> output: 1
     srcAddr=12.4.5.6, dstAddr=10.1.2.3, srcPort=80, dstPort=15213 -> output: 2
     srcAddr=12.4.5.6, dstAddr=13.1.2.3, srcPort=80, dstPort=15213 -> drop
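     To make the exact-match layer concrete, here is a minimal C sketch of a microflow-style lookup keyed on the four header fields above. The table size, the FNV-style hash, and all identifiers are illustrative assumptions, not Open vSwitch code.

     /* Minimal sketch of an exact-match microflow cache: a single hash table
      * keyed on the packet's header tuple. Names and sizes are assumptions. */
     #include <stdint.h>
     #include <string.h>

     #define MICRO_SLOTS (1 << 16)          /* table size: assumption */

     struct flow_key {                       /* the exact fields from the slide */
         uint32_t src_addr, dst_addr;
         uint16_t src_port, dst_port;
     };

     struct micro_entry {
         struct flow_key key;
         int             action;             /* e.g. output port, or -1 = drop */
         int             valid;
     };

     static struct micro_entry micro_cache[MICRO_SLOTS];

     static uint32_t hash_key(const struct flow_key *k)
     {
         uint32_t h = 2166136261u;           /* placeholder FNV-1a over the key bytes */
         const uint8_t *p = (const uint8_t *)k;
         for (size_t i = 0; i < sizeof(*k); i++)
             h = (h ^ p[i]) * 16777619u;
         return h;
     }

     /* Exact match: one hash, one bucket, full-key comparison. */
     static int micro_lookup(const struct flow_key *k, int *action)
     {
         struct micro_entry *e = &micro_cache[hash_key(k) % MICRO_SLOTS];
         if (e->valid && memcmp(&e->key, k, sizeof(*k)) == 0) {
             *action = e->action;
             return 1;                       /* hit */
         }
         return 0;                           /* miss: fall through to the next layer */
     }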

  4. Flow Caching in Open vSwitch: a miss in the Microflow Cache (exact match, single hash table) falls through to the Megaflow Cache, a wildcard match without priority over multiple masked tables.

  5. Flow Caching in Open vSwitch: example megaflow (masked) entries:
     srcAddr=10.0.0.0/8, dstAddr=12.0.0.0/8, srcPort=*, dstPort=* -> output: 1
     srcAddr=12.0.0.0/8, dstAddr=10.0.0.0/8, srcPort=*, dstPort=* -> output: 2
     srcAddr=*, dstAddr=13.0.0.0/8, srcPort=*, dstPort=* -> drop
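     A hedged sketch of how a megaflow-style lookup could work: one exact-match table per mask, with the packet key masked before each probe and the first match winning (no priorities). The linear scan over masks, the function-pointer indirection, and all names are assumptions for illustration; the actual OVS data structures differ.

     #include <stdint.h>

     struct flow_key {
         uint32_t src_addr, dst_addr;
         uint16_t src_port, dst_port;
     };

     struct masked_table {
         struct flow_key mask;               /* e.g. /8 prefixes; 0 for wildcarded ports */
         int (*lookup)(const struct flow_key *masked_key, int *action); /* exact-match table */
     };

     static void apply_mask(struct flow_key *dst, const struct flow_key *k,
                            const struct flow_key *m)
     {
         dst->src_addr = k->src_addr & m->src_addr;
         dst->dst_addr = k->dst_addr & m->dst_addr;
         dst->src_port = k->src_port & m->src_port;
         dst->dst_port = k->dst_port & m->dst_port;
     }

     /* Try every masked table until one hits; a full miss goes to the classifier. */
     static int megaflow_lookup(const struct masked_table *tables, int ntables,
                                const struct flow_key *k, int *action)
     {
         for (int i = 0; i < ntables; i++) {
             struct flow_key masked;
             apply_mask(&masked, k, &tables[i].mask);
             if (tables[i].lookup(&masked, action))
                 return 1;
         }
         return 0;
     }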

  6. Flow Caching in Open vSwitch: a miss in the Megaflow Cache falls through to the Packet Classifier, which consults multiple OpenFlow tables.

  7. Flow Caching in Open vSwitch: example OpenFlow rules in the Packet Classifier:
     Match: srcAddr==10.0.0.0/8, dstAddr==12.0.0.0/8   Action: output:1
     Match: srcAddr==12.0.0.0/8, dstAddr==10.0.0.0/8   Action: output:2

  8. Flow Caching in Open vSwitch: compared with the Microflow Cache, the Megaflow Cache improves the cache hit rate, but its lookup latency is about 8x higher.

  9. Basic Cache Design: a key k hashes to a 4-way set-associative bucket h(k). Define the oversubscription factor α = #keys / #entries. Assuming a uniform workload and random eviction, α = 0.95 gives about an 81% cache hit rate.
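     A minimal sketch of this baseline, assuming 64-bit keys and a placeholder hash: one hash picks a bucket of WAYS slots, and a full bucket evicts a random victim. With WAYS = 4 this is the 4-way design whose hit rate at α = 0.95 is roughly 81%; the table size and hash constant are illustrative, not from the paper.

     #include <stdint.h>
     #include <stdlib.h>

     #define WAYS     4                      /* 4-way buckets; 8 gives the slide's 8-way variant */
     #define NBUCKETS (1 << 15)              /* #entries = NBUCKETS * WAYS; α = #keys / #entries */

     struct entry { uint64_t key; uint64_t val; int valid; };
     static struct entry table[NBUCKETS][WAYS];

     static uint32_t hash64(uint64_t k) { return (uint32_t)(k * 0x9E3779B97F4A7C15ULL >> 40); }

     static int cache_lookup(uint64_t key, uint64_t *val)
     {
         struct entry *b = table[hash64(key) % NBUCKETS];
         for (int w = 0; w < WAYS; w++)
             if (b[w].valid && b[w].key == key) { *val = b[w].val; return 1; }
         return 0;
     }

     static void cache_insert(uint64_t key, uint64_t val)
     {
         struct entry *b = table[hash64(key) % NBUCKETS];
         int victim = -1;
         for (int w = 0; w < WAYS; w++)
             if (!b[w].valid) { victim = w; break; }
         if (victim < 0)
             victim = rand() % WAYS;         /* random eviction, as assumed on the slide */
         b[victim] = (struct entry){ .key = key, .val = val, .valid = 1 };
     }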

  10. Cache Design, Increase Set-Associativity: widening the bucket from 4-way to 8-way set-associative raises the cache hit rate from 81% to 87%.

  11. Cache Design, More Candidate Buckets: give each key two hashed 4-way set-associative buckets h1(k) and h2(k) (cuckoo hashing); the cache hit rate rises from 81% to ~99%.
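     For contrast, a sketch of the two-candidate-bucket (cuckoo-style) lookup: each key may sit in either of two independently hashed buckets, so a miss in the first bucket costs a second, typically non-adjacent, cache-line read. Insertion with cuckoo displacement is omitted, and the hash constants and sizes are placeholder assumptions.

     #include <stdint.h>

     #define WAYS     4
     #define NBUCKETS (1 << 15)

     struct entry { uint64_t key; uint64_t val; int valid; };
     static struct entry table[NBUCKETS][WAYS];

     static uint32_t h1(uint64_t k) { return (uint32_t)(k * 0x9E3779B97F4A7C15ULL >> 40) % NBUCKETS; }
     static uint32_t h2(uint64_t k) { return (uint32_t)(k * 0xC2B2AE3D27D4EB4FULL >> 40) % NBUCKETS; }

     static int probe(struct entry *b, uint64_t key, uint64_t *val)
     {
         for (int w = 0; w < WAYS; w++)
             if (b[w].valid && b[w].key == key) { *val = b[w].val; return 1; }
         return 0;
     }

     static int cuckoo_lookup(uint64_t key, uint64_t *val)
     {
         return probe(table[h1(key)], key, val) ||   /* first candidate bucket        */
                probe(table[h2(key)], key, val);     /* second, usually far away bucket */
     }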

  12. Our Solution: Bounded Linear Probing (BLP). A key k may reside in its home bucket h(k) or in the bucket that immediately follows it, so the two candidate buckets overlap and are consecutive in memory (2,4 BLP: 2 candidate buckets, 4-way set-associative). The cache hit rate rises from 81% to ~94%.
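     A sketch of what a 2,4-BLP lookup could look like, reusing the placeholder types above: the second candidate is simply the next bucket, so the extra probe touches consecutive memory. The wrap-around at the end of the array is my simplification.

     #include <stdint.h>

     #define WAYS     4
     #define NBUCKETS (1 << 15)

     struct entry { uint64_t key; uint64_t val; int valid; };
     static struct entry table[NBUCKETS][WAYS];

     static uint32_t hash64(uint64_t k) { return (uint32_t)(k * 0x9E3779B97F4A7C15ULL >> 40) % NBUCKETS; }

     static int probe(struct entry *b, uint64_t key, uint64_t *val)
     {
         for (int w = 0; w < WAYS; w++)
             if (b[w].valid && b[w].key == key) { *val = b[w].val; return 1; }
         return 0;
     }

     static int blp_lookup(uint64_t key, uint64_t *val)
     {
         uint32_t i = hash64(key);
         return probe(table[i], key, val) ||
                probe(table[(i + 1) % NBUCKETS], key, val);  /* neighbouring, overlapped bucket */
     }

     Because the second probe only happens when the home bucket misses, the comparison table on the next slide lists the cost as about 1.5 consecutive cache-line reads.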

  13. Qualitative Comparison:
     Design             Lookup Speed (cache line reads)   Hit Rate
     4-way set-assoc.   1                                 ~81%
     8-way set-assoc.   1                                 ~87%
     2-4 cuckoo         2, random                         ~99%
     2-4 BLP            1.5, consecutive                  ~94%

  14. Qualitative Comparison (same table as slide 13, repeated).

  15. Why BLP Is Better Than Set-Assoc.?
     [Figure: slot-occupancy diagrams contrasting BLP's overlapping buckets with equally sized non-overlapping set-associative buckets; the two layouts reach occupancy 0.75 and 0.71875 in the worked example.]

  16. Qualitative Comparison (same table as slide 13, repeated).

  17. Qualitative Comparison (same table as slide 13, repeated).

  18. Better Cache Replacement:
     • Traditional LRU has high space overhead; even CLOCK still needs 1 bit per key.
     • Our solution: Probabilistic Bubble LRU (PBLRU).
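     For reference, a generic textbook CLOCK sketch showing where the "1 bit / key" overhead comes from: every entry carries a reference bit, and a sweeping hand clears bits until it finds an unreferenced victim. This is standard CLOCK, not code from the paper, and the sizes and names are assumptions.

     #include <stdint.h>

     #define NENTRIES 1024

     struct clk_entry { uint64_t key; uint8_t ref; uint8_t valid; };
     static struct clk_entry clk[NENTRIES];
     static int hand;                        /* current clock-hand position */

     static void clock_touch(int idx) { clk[idx].ref = 1; }   /* on a cache hit */

     static int clock_pick_victim(void)      /* on a miss that needs an eviction */
     {
         for (;;) {
             int i = hand;
             hand = (hand + 1) % NENTRIES;   /* advance the hand */
             if (!clk[i].valid || clk[i].ref == 0)
                 return i;                   /* not recently used: evict this slot */
             clk[i].ref = 0;                 /* recently used: clear bit, second chance */
         }
     }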

  19. PBLRU, Bubbling (promotion): a hit on entry D in bucket h(D) swaps D one slot toward the front, turning [A B C D] into [A B D C].

  20. PBLRU, Bubbling (eviction): inserting a new key X into bucket h(X) overwrites the entry in the last slot, turning [A B D C] into [A B D X] (see the code sketch after slide 21).

  21. PBLRU:
     • Basic bubbling combines both recency and frequency information.
     • Probabilistic bubbling: only every n-th cache hit triggers a promotion, which reduces the number of memory writes.
     • Applied to 2-4 BLP: a random one of the two candidate buckets is chosen for bubbling.
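     A minimal sketch of bubbling inside one 4-way bucket, under the same assumed types: a sampled hit swaps the matched entry one slot toward the front, and an insertion into a full bucket overwrites the last slot. The modulo counter stands in for a probabilistic 1-in-n choice and PROMOTE_EVERY_N is an arbitrary illustrative value; the random choice between the two BLP candidate buckets mentioned on the slide is omitted here.

     #include <stdint.h>

     #define WAYS 4
     #define PROMOTE_EVERY_N 8               /* probabilistic bubbling: sample 1-in-n hits */

     struct entry { uint64_t key; uint64_t val; int valid; };

     static unsigned hit_counter;            /* cheap stand-in for a random choice */

     /* Lookup in one bucket, with bubbling on a sampled hit. */
     static int pblru_bucket_lookup(struct entry *b, uint64_t key, uint64_t *val)
     {
         for (int w = 0; w < WAYS; w++) {
             if (b[w].valid && b[w].key == key) {
                 *val = b[w].val;
                 if (w > 0 && ++hit_counter % PROMOTE_EVERY_N == 0) {
                     struct entry tmp = b[w - 1];   /* promotion: swap with the slot */
                     b[w - 1] = b[w];               /* one position closer to the    */
                     b[w] = tmp;                    /* front of the bucket           */
                 }
                 return 1;
             }
         }
         return 0;
     }

     /* Insert on a miss: prefer an empty slot, otherwise the last slot is the victim. */
     static void pblru_bucket_insert(struct entry *b, uint64_t key, uint64_t val)
     {
         for (int w = 0; w < WAYS; w++)
             if (!b[w].valid) { b[w] = (struct entry){ key, val, 1 }; return; }
         b[WAYS - 1] = (struct entry){ key, val, 1 };   /* full bucket: evict the last slot */
     }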

  22. Evaluation: a traffic generator (TX cores) connected over Ethernet to the virtual switch (RX cores) through Port 0 and Port 1.

  23. Throughput (Uniform):
     [Plot: throughput (Mpps, roughly 3-10) against an x-axis from 0.6 to 1.8 for 4-way, 4-way w/ SIMD, 8-way w/ SIMD, 2-4 cuckoo-lite, 2-4 BLP, and 2-4 BLP w/ PBLRU; annotation: 15% higher throughput.]

  24. Lookup Latency and Hit Rate:
     [Two plots, Lookup Latency (Cycles) and Cache Hit Rate, each against an x-axis from 0.50 to 1.75, for 4-way, 4-way w/ SIMD, 8-way w/ SIMD, 2-4 cuckoo-lite, 2-4 BLP, and 2-4 BLP w/ PBLRU; annotation: the cache hit rate improvement is not enough to compensate for its higher lookup latency.]

  25. Throughput (Skewed):
     [Plot: throughput (Mpps) against an x-axis from 0.6 to 1.8 for 4-way, 4-way w/ SIMD, 8-way w/ SIMD, 2-4 Cuckoo, 2-4 BLP, and 2-4 BLP w/ PBLRU; annotation: 7.5% higher throughput.]

  26. Lookup Latency and Hit Rate

  27. Summary:
     • Bounded Linear Probing
     • Probabilistic Bubble LRU
     • Balance between cache hit rate and lookup latency

  28. Thank You!
