Fast Software Cache Design for Network Appliances
Dong Zhou, Huacheng Yu, Michael Kaminsky, David G. Andersen
Flow Caching in Open vSwitch
• First level: the microflow cache (exact match; single hash table)
• Example microflow entries:
  – srcAddr=10.1.2.3, dstAddr=12.4.5.6, srcPort=15213, dstPort=80 → output: 1
  – srcAddr=12.4.5.6, dstAddr=10.1.2.3, srcPort=80, dstPort=15213 → output: 2
  – srcAddr=12.4.5.6, dstAddr=13.1.2.3, srcPort=80, dstPort=15213 → drop
Flow Caching in Open vSwitch
• On a microflow-cache miss: the megaflow cache (wildcard match without priority; multiple masked tables)
• Example megaflow entries:
  – srcAddr=10.0.0.0/8, dstAddr=12.0.0.0/8, srcPort=*, dstPort=* → output: 1
  – srcAddr=12.0.0.0/8, dstAddr=10.0.0.0/8, srcPort=*, dstPort=* → output: 2
  – srcAddr=*, dstAddr=13.0.0.0/8, srcPort=*, dstPort=* → drop
Flow Caching in Open vSwitch
• On a megaflow-cache miss: the full packet classifier (multiple OpenFlow tables)
• Example OpenFlow rules:
  – match: srcAddr==10.0.0.0/8, dstAddr==12.0.0.0/8 → action: output:1
  – match: srcAddr==12.0.0.0/8, dstAddr==10.0.0.0/8 → action: output:2
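The megaflow level can be pictured as a tuple-space-style search: one exact-match table per wildcard mask, probed in turn. A minimal sketch under my own assumptions (field names, the octet-level modeling of the /8 prefixes, and all function names are illustrative, not OVS code):

```python
# Illustrative tuple-space-style megaflow lookup: one exact-match dict per
# wildcard mask. A mask maps each field name to True (match) or False (wildcard).

def masked_key(pkt, mask):
    """Project a packet onto a mask: wildcarded fields become None."""
    return tuple(pkt[f] if keep else None for f, keep in sorted(mask.items()))

def megaflow_lookup(tables, pkt):
    """Probe each masked table in turn; return the first action found."""
    for mask, table in tables:
        action = table.get(masked_key(pkt, mask))
        if action is not None:
            return action
    return None  # miss: fall through to the full packet classifier

# One mask matching the slide's /8 prefixes, modeled as first address octets.
mask = {"srcOctet": True, "dstOctet": True, "srcPort": False, "dstPort": False}
table = {
    masked_key({"srcOctet": 10, "dstOctet": 12, "srcPort": 0, "dstPort": 0}, mask): "output:1",
    masked_key({"srcOctet": 12, "dstOctet": 10, "srcPort": 0, "dstPort": 0}, mask): "output:2",
}
tables = [(mask, table)]
```

Any packet from 10.x to 12.x then maps to "output:1" regardless of its ports, since ports are wildcarded by the mask.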
Flow Caching in Open vSwitch
• The three-level hierarchy trades cache hit rate against lookup latency
• A miss to the next level is costly (8x!), so the first-level cache design matters
Basic Cache Design
• 4-way set-associative buckets; key k maps to bucket h(k)
• Oversubscription factor α = # keys / # entries
• Assumptions: uniform workload, random eviction
• With α = 0.95: 81% cache hit rate
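The baseline design hashes each key to a single 4-way bucket and evicts a random victim when that bucket is full. A minimal sketch, with all names invented here rather than taken from the paper's code:

```python
import random

class SetAssocCache:
    """4-way set-associative cache sketch: one candidate bucket per key,
    random eviction when the bucket is full."""

    def __init__(self, n_buckets, ways=4, seed=0):
        self.buckets = [dict() for _ in range(n_buckets)]
        self.ways = ways
        self.rng = random.Random(seed)

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def get(self, key):
        return self._bucket(key).get(key)  # one bucket = one cache-line read

    def put(self, key, value):
        b = self._bucket(key)
        if key not in b and len(b) >= self.ways:
            b.pop(self.rng.choice(list(b)))  # random eviction
        b[key] = value
```

Because a key has exactly one candidate bucket, a lookup touches one cache line; the price is that a hot key can be evicted whenever its bucket overflows.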
Cache Design: Increase Set-Associativity
• 8-way set-associative buckets instead of 4-way
• Cache hit rate: 81% → 87%
Cache Design: More Candidate Buckets
• Cuckoo hashing: each key k has two candidate 4-way buckets, h1(k) and h2(k)
• Cache hit rate: 81% → ~99%
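The lookup path of such a 2-4 cuckoo table probes both candidate buckets, which are at independent (random) positions. A sketch of just the lookup, with insertion-time displacement omitted and the hash functions invented for illustration:

```python
# 2-4 cuckoo lookup sketch: every key has two candidate buckets chosen by
# independent hash functions, so a lookup costs two random cache-line reads.

def h1(key, n):
    return hash(("h1", key)) % n

def h2(key, n):
    return hash(("h2", key)) % n

def cuckoo_lookup(buckets, key):
    """Probe both candidate 4-way buckets (dicts of at most 4 entries)."""
    n = len(buckets)
    for h in (h1, h2):
        value = buckets[h(key, n)].get(key)
        if value is not None:
            return value
    return None
```

The two probes are what the comparison table below counts as "2 random" cache-line reads: they usually fall in unrelated cache lines.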
Our Solution: Bounded Linear Probing (BLP)
• A key k hashed to bucket h(k) may be stored in that bucket or in the next one, so consecutive buckets overlap
• 2-4 BLP: 2 candidate buckets, 4-way set-associative
• Cache hit rate: 81% → ~94%
Qualitative Comparison

  Design            Lookup Speed (cache line reads)   Hit Rate
  4-way set-assoc.  1                                 ~81%
  8-way set-assoc.  1                                 ~87%
  2-4 cuckoo        2 (random)                        ~99%
  2-4 BLP           1.5 (consecutive)                 ~94%
Why Is BLP Better Than Set-Assoc.?
• [Figure: worked ball-placement example over eight 4-way buckets]
• With the same keys, BLP reaches occupancy 0.75 while plain set-associativity reaches only 0.71875: overflow spills into the adjacent bucket instead of forcing an eviction
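The same intuition can be checked with a quick simulation I wrote for illustration: insert α × capacity random keys, letting BLP spill overflow into the next bucket. An eviction replaces an existing entry and so never raises occupancy, so it is modeled here as a dropped insert. The exact numbers differ from the slide's small worked example, but the ordering holds.

```python
import random

def occupancy(n_buckets=1000, ways=4, alpha=0.95, blp=False, seed=42):
    """Fraction of slots filled after inserting alpha * capacity random keys."""
    rng = random.Random(seed)
    load = [0] * n_buckets
    for _ in range(int(alpha * n_buckets * ways)):
        i = rng.randrange(n_buckets)
        if load[i] < ways:
            load[i] += 1
        elif blp and load[(i + 1) % n_buckets] < ways:
            load[(i + 1) % n_buckets] += 1  # BLP: spill into the next bucket
    return sum(load) / (n_buckets * ways)
```

Running both variants at the same α shows the BLP table ends up fuller, which is exactly why its hit rate sits between plain set-associativity and cuckoo hashing.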
Better Cache Replacement
• Traditional LRU: high space overhead
  – CLOCK: 1 bit / key
• Our solution: Probabilistic Bubble LRU (PBLRU)
PBLRU: Bubbling (Promotion)
• On a hit, key D in bucket h(D) swaps one slot toward the front: A B C D → A B D C
PBLRU: Bubbling (Eviction)
• On inserting X into the full bucket h(X), the last slot is evicted: A B D C → A B D X
PBLRU
• Basic bubbling: combines both recency and frequency information
• Probabilistic bubbling: promote only every n-th cache hit, to reduce the number of memory writes
• Applying to 2-4 BLP: choose a random bucket of the two candidates to apply bubbling
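A per-bucket sketch of the bubbling mechanic (class and parameter names are mine; a simple hit counter stands in for the probabilistic promotion decision): slots are ordered, a hit swaps the entry one slot toward the front on every n-th hit, and eviction always takes the last slot, as in the A/B/C/D example on the previous slides.

```python
class PBLRUBucket:
    """One cache bucket under probabilistic bubble LRU (illustrative sketch)."""

    def __init__(self, ways=4, promote_every=2):
        self.slots = []   # front = most protected, back = evicted first
        self.ways = ways
        self.promote_every = promote_every
        self.hits = 0

    def get(self, key):
        for i, (k, v) in enumerate(self.slots):
            if k == key:
                self.hits += 1
                # bubble: swap one slot toward the front, but only on
                # every n-th hit, to limit the extra memory writes
                if i > 0 and self.hits % self.promote_every == 0:
                    self.slots[i - 1], self.slots[i] = self.slots[i], self.slots[i - 1]
                return v
        return None

    def put(self, key, value):
        if len(self.slots) >= self.ways:
            self.slots.pop()  # evict the last (least-protected) entry
        self.slots.append((key, value))
```

With promote_every=1 the slides' example plays out exactly: hitting D in [A, B, C, D] yields [A, B, D, C], and inserting X then evicts C. Unlike LRU or CLOCK, no per-key metadata is needed; the position within the bucket encodes the replacement priority.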
Evaluation
• [Figure: testbed] A traffic generator (TX and RX cores) connects to the virtual switch through two Ethernet ports (port 0 and port 1)
Throughput (Uniform)
• [Figure: throughput (Mpps) vs. oversubscription factor for 4-way, 4-way w/ SIMD, 8-way w/ SIMD, 2-4 cuckoo-lite, 2-4 BLP, and 2-4 BLP w/ PBLRU]
• Under the uniform workload, 2-4 BLP achieves 15% higher throughput
Lookup Latency and Hit Rate
• [Figures: lookup latency (cycles) and cache hit rate vs. oversubscription factor for the same designs; lower latency and higher hit rate are better]
• Cuckoo's cache hit rate improvement is not enough to compensate for its higher lookup latency
Throughput (Skewed)
• [Figure: throughput (Mpps) vs. oversubscription factor under a skewed workload]
• 2-4 BLP w/ PBLRU achieves 7.5% higher throughput
Lookup Latency and Hit Rate
• [Figures only on this slide]
Summary
• Bounded Linear Probing (BLP)
• Probabilistic Bubble LRU (PBLRU)
• Both balance cache hit rate against lookup latency
Thank You!