IX: A Protected Dataplane Operating System for High Throughput and Low Latency
Adam Belay, George Prekas, Samuel Grossman, Ana Klimovic, Christos Kozyrakis, Edouard Bugnion
HW is fast, but SW is a Bottleneck
[Figure: 64-byte TCP echo. Throughput (millions of requests per second) and latency (microseconds) for Linux and IX against the HW limit; Linux trails the HW limit by 4.8x and 8.8x gaps.]
IX Closes the SW Performance Gap
[Figure: the same 64-byte TCP echo chart, with IX approaching the HW limit in both requests per second and microseconds of latency.]
Two Contributions
#1: Protection and direct HW access through virtualization
#2: Execution model for low latency and high throughput
Why is SW Slow?
• Complex interface
• Code paths convoluted by interrupts and scheduling
[Figure: Linux kernel network stack packet flow. Created by Arnout Vandecappelle, http://www.linuxfoundation.org/collaborate/workgroups/networking/kernel_flow]
Problem: 1980s Software Architecture
• Berkeley sockets, designed for CPU time sharing
• Today's large-scale datacenter workloads:
  Hardware: dense multicore + 10 GbE (soon 40)
  – API scalability critical!
  – Gap between compute and RAM -> cache behavior matters
  – Packet inter-arrival times of 50 ns (see the quick calculation below)
  Scale-out access patterns
  – Fan-in -> large connection counts, high request rates
  – Fan-out -> tail latency matters!
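As a quick sanity check on that inter-arrival figure (assuming minimum-size 64-byte packets and ignoring the Ethernet preamble and inter-frame gap, which the slide presumably also ignores):

\[
t_{\text{inter-arrival}} \approx \frac{64 \times 8\ \text{bits}}{10 \times 10^{9}\ \text{bit/s}} = 51.2\ \text{ns} \approx 50\ \text{ns}
\]

so at 10 GbE line rate the software has only a few dozen nanoseconds of per-core budget for each small packet.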
Conventional Wisdom
• Bypass the kernel
  – Move TCP to user-space (Onload, mTCP, Sandstorm)
  – Move TCP to hardware (TOE)
• Avoid the connection scalability bottleneck
  – Use datagrams instead of connections (DIY congestion management)
  – Use proxies at the expense of latency
• Replace classic Ethernet
  – Use a lossless fabric (InfiniBand)
  – Offload memory access (RDMA)
• Common thread: give up on systems software
Our Approach
• Bypass the kernel (Onload, mTCP, Sandstorm, TOE)? -> Keep robust protection between app and netstack
• Avoid the connection scalability bottleneck (datagrams, proxies)? -> Provide connection scalability
• Replace classic Ethernet (InfiniBand, RDMA)? -> Stay on commodity 10Gb Ethernet
• Tackle the problem head on…
Separation of Control and Data Plane
[Figure, built up across several slides: a control plane (CP) and per-application dataplanes (DPs) share the machine's cores, each dataplane with dedicated RX/TX queue pairs. In IX, the control plane (IX CP) and the applications (e.g., HTTPd and memcached, linked against libIX) run in ring 3; each IX dataplane runs in guest ring 0; the Linux kernel plus Dune sit underneath in host ring 0.]
The IX Execution Pipeline
[Figure: (1) packets arrive in the RX FIFO; (2) the dataplane's TCP/IP stack (guest ring 0) processes them; (3) the event-driven application (ring 3, via libIX) consumes the resulting event conditions; (4) the application answers with batched syscalls; (5) transmit-side TCP/IP processing runs, with a timer also feeding the pipeline; (6) packets go out on TX.]
Design (1): Run to Completion
[Figure: the same execution pipeline; each batch is carried all the way from RX through the application and back out to TX before the next batch is started.]
Improves data-cache locality.
Removes scheduling unpredictability.
Design (2): Adaptive Batching
[Figure: the same pipeline, with an adaptive batch calculation deciding how many packets to pull from the RX FIFO on each pass.]
Improves instruction-cache locality and prefetching.
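To make these two design points concrete, here is a minimal sketch of a single core's run-to-completion loop with adaptive batching. All names and the grow/shrink policy (nic_rx_poll, tcp_input, doubling and halving the batch) are illustrative assumptions for this sketch, not the actual IX dataplane code.

    #include <stdio.h>
    #include <stddef.h>

    #define MAX_BATCH 64

    struct pkt { int len; };

    /* Stubs standing in for the NIC RX ring and the TCP/IP stack; here the
     * RX ring always has packets waiting, as on a loaded system. */
    static size_t nic_rx_poll(struct pkt *pkts, size_t max)
    {
        for (size_t i = 0; i < max; i++)
            pkts[i].len = 64;
        return max;
    }
    static void tcp_input(struct pkt *p)  { (void)p; /* TCP/IP receive processing */ }
    static void app_handle_events(void)   { /* app consumes event conditions, queues replies */ }
    static void tcp_output_flush(void)    { /* TX-side TCP/IP work, hand packets to the NIC */ }

    int main(void)
    {
        struct pkt batch[MAX_BATCH];
        size_t batch_size = 1;            /* start small: an idle system stays low-latency */

        for (int iter = 0; iter < 1000; iter++) {
            /* (1) Pull at most batch_size packets from the RX FIFO. */
            size_t n = nic_rx_poll(batch, batch_size);

            /* (2) Run the whole batch through TCP/IP receive processing... */
            for (size_t i = 0; i < n; i++)
                tcp_input(&batch[i]);

            /* (3)+(4) ...then let the app consume the event conditions and
             * answer with batched syscalls... */
            app_handle_events();

            /* (5)+(6) ...and finish with transmit-side TCP/IP work and TX.
             * Only now do we go back for more packets: run to completion. */
            tcp_output_flush();

            /* Adaptive batching: grow the batch only when the whole allowance
             * was used (we are falling behind); shrink it when load drops. */
            if (n == batch_size && batch_size < MAX_BATCH)
                batch_size *= 2;
            else if (n < batch_size / 2 && batch_size > 1)
                batch_size /= 2;
        }
        printf("steady-state batch size: %zu\n", batch_size);
        return 0;
    }

The point of the sketch is the ordering: a batch never yields the core mid-pipeline, and the batch size grows only under backlog, so batching buys instruction-cache locality without inflating latency when the system is idle.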
See the Paper for more Details
• Design (3): Flow-consistent hashing
  – Synchronization- and coherence-free operation
• Design (4): Native zero-copy API
  – Flow control exposed to the application
• libIX: libevent-like event-based programming (sketched below)
• IX prototype implementation
  – Dune, DPDK, lwIP, ~40K SLOC of kernel code
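For flavor, a hedged sketch of an echo server written against a hypothetical libIX-style, zero-copy interface. The names and signatures (ix_set_recv_handler, ix_sendv, ix_recv_done, ix_event_loop) are assumptions invented for this sketch, with trivial local stubs so it compiles; the real libIX API may differ.

    /* Hypothetical libIX-style interface: the app receives a reference into
     * the packet buffer (zero-copy), replies with a send, and must explicitly
     * release the buffer so receive-side flow control can advance.
     * The stubs stand in for the real library so the sketch compiles. */
    #include <stdio.h>
    #include <stddef.h>

    typedef void *ix_conn_t;
    typedef void (*ix_recv_handler_t)(ix_conn_t conn, void *buf, size_t len);

    static ix_recv_handler_t recv_handler;

    static void ix_set_recv_handler(ix_recv_handler_t fn) { recv_handler = fn; }
    static int  ix_sendv(ix_conn_t conn, const void *buf, size_t len)
    { (void)conn; (void)buf; (void)len; return 0; }    /* queue reply for the current batch */
    static void ix_recv_done(ix_conn_t conn, size_t len)
    { (void)conn; (void)len; }                          /* release the zero-copy buffer */
    static void ix_event_loop(void)
    {
        /* Stub: deliver one fake event condition, then return. */
        char msg[] = "ping";
        if (recv_handler)
            recv_handler(NULL, msg, sizeof msg - 1);
    }

    /* Echo handler: runs inside the run-to-completion pipeline, so it must
     * not block; it only queues work and returns. */
    static void on_recv(ix_conn_t conn, void *buf, size_t len)
    {
        ix_sendv(conn, buf, len);   /* echo the payload back */
        ix_recv_done(conn, len);    /* tell the dataplane the buffer is free */
        printf("echoed %zu bytes\n", len);
    }

    int main(void)
    {
        ix_set_recv_handler(on_recv);
        ix_event_loop();            /* in a real dataplane this never returns */
        return 0;
    }

What the sketch tries to show is the programming contract, not the API itself: handlers never block, and buffer ownership is explicit, which is what lets the dataplane keep its zero-copy, run-to-completion pipeline intact.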
Evaluation
• Compare IX to Linux and mTCP [NSDI '14]
• TCP microbenchmarks and memcached
[Setup: ~25 Linux hosts and two IX servers attached to a 10GbE switch; links are 1x10GbE or 4x10GbE with an L3+L4 bond.]
TCP Netpipe
[Figure: goodput (Gbps) vs. message size (KB, up to 500) for IX-IX, Linux-Linux, and mTCP-mTCP; annotations mark half of the full bandwidth reached at 20 KB and at 135 KB messages, and a 5.7 µs half-RTT.]
TCP Echo: Multicore Scalability for Short Connections
[Figure: messages/sec (x10^6) vs. number of CPU cores (1-8) for IX and Linux on 10GbE and 4x10GbE, and mTCP on 10GbE; the annotation marks where the 1x10GbE link saturates.]
Connection Scalability
[Figure: messages/sec (x10^6) vs. connection count (log scale, 10 to 100,000) for IX and Linux at 10Gbps and 40Gbps; beyond ~10,000 connections throughput becomes limited by the L3 cache.]
Memcached over TCP
[Figure: average and 99th-percentile latency (µs) vs. throughput (x10^3 RPS) on the USR workload for IX and Linux, against a 500 µs SLA: 3.6x more RPS, 2x less tail latency, and 6x less tail latency with IX clients.]
IX Conclusion
• A protected dataplane OS for datacenter applications with an event-driven model and demanding connection scalability requirements
• Efficient access to HW, without sacrificing security, through virtualization
• High throughput and low latency enabled by a dataplane execution model