Analysis of Techniques to Improve Protocol Processing Latency
David Mosberger, Patrick Bridges, Larry L. Peterson, and Sean O’Malley
The University of Arizona
e-mail: {davidm,bridges,llp}@cs.arizona.edu, sean@netapp.com
www: http://www.cs.arizona.edu/scout
SIGCOMM ’96

Latency: Where does it come from?
• Speed of light
• Data touching overheads?
  – No: messages (data) are small.
• Execution overheads?
  – Too much code.
  – Badly structured code.

Test Environment
• Protocol stacks
  – TCP/IP
  – RPC
• Hardware platform
  – 175MHz Alpha
  – 100MB/s memory
  – TURBOchannel bus
  – 10Mbps Ethernet
[Figure: the two protocol stacks, top to bottom. TCP/IP: TCPTEST, TCP, IP, VNET, ETH, LANCE. RPC: XRPCTEST, MSELECT, VCHAN, CHAN, BID, BLAST, IP, VNET, ETH, LANCE.]

Starting Point
• Data cache footprint
  – padding
  – stack switching
  – info duplication
• Tiny functions
• Machine idiosyncrasies
  – byte load/store
  – integer division
[Figure: two bar charts comparing Orig and Opt. Cycle count: Orig 18941, Opt 15688. Instruction count: Orig 5821, Opt 4750.]

How fast is TCP/IP?
[Figure: stacked bars of instruction count (axis 0 to 1500) for BSD/386, DUX/Alpha, and xk/Alpha, split into tcp_input, other TCP input, ipintr, and other IP input.]

Latency Bottlenecks
Suspects
• Frequent branching
• Instruction-cache gaps
• Cache collisions
• Layering overheads
Not a suspect: the instruction/data translation buffer.

Techniques
• Outlining attacks:
  – frequent branching
  – i-cache gaps
• Cloning attacks:
  – cache collisions
• Path-inlining attacks:
  – layering overheads

Outlining
• Exception-handling code
  – lots of it (up to 50%)
  – dilutes instruction-cache
  – causes taken branches
• Remove from fast path
  – annotate if-statements with branch probability
  – move unlikely code to end of function

Outlining Example

Source:

    :
    if (bad_case @ 0) {
        panic("bad day");
    }
    printf("good day");
    :

Without outlining:

    :
    load r0, (bad_case)
    jump_if_0 r0, good_day
    load_addr a0, "bad day"
    call panic
    good_day:
    load_addr a0, "good day"
    call printf
    :

With outlining:

    :
    load r0, (bad_case)
    jump_if_not_0 r0, bad_day
    load_addr a0, "good day"
    call printf
    continue:
    :
    return
    bad_day:
    load_addr a0, "bad day"
    call panic
    jump continue

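The annotate-and-move idea from the outlining example can be approximated with current compilers. The sketch below is not the annotation syntax or toolchain used in the talk: it assumes GCC or Clang, whose `__builtin_expect` builtin and `cold`/`noinline` function attributes serve as optimizer hints. The names `UNLIKELY`, `bad_day`, and `greet` are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumption: GCC/Clang builtin. Tells the optimizer the condition is
 * expected to be false, so the fall-through path is the common one. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical error handler. "cold" marks it as rarely executed and
 * "noinline" keeps its body out of the caller; both are hints that the
 * compiler may use to place this code away from the fast path. */
__attribute__((cold, noinline))
static void bad_day(void)
{
    fputs("bad day\n", stderr);
    abort();
}

/* Fast path: prints the greeting and returns the number of characters
 * written. The rare case is handled out of line. */
int greet(int bad_case)
{
    if (UNLIKELY(bad_case))
        bad_day();                  /* rare path, never returns */
    return printf("good day\n");    /* common path, falls through */
}
```

Calling `greet(0)` takes only the fall-through path. Whether the rare code actually ends up at the end of the function or in a separate section is up to the compiler and linker.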
Cloning
• Make copy of functions on fast path
  – relocate to avoid conflict misses
  – specialize for a particular use (partial evaluation)
• Alternative layout algorithms
  – micro-positioning
  – bipartite layout

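The specialization half of cloning can be illustrated in plain C. This is a minimal sketch under assumptions, not code from the talk: `checksum_generic` is a hypothetical one's-complement checksum over 16-bit words, and `checksum_hdr20` is a hypothetical clone specialized for a fixed 20-byte header. Relocating the clone to avoid conflict misses is a linker-level matter and is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic routine: one's-complement sum over n 16-bit words,
 * folded to 16 bits and inverted. */
uint16_t checksum_generic(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += words[i];
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical clone for the fast path: the length is fixed at 10 words
 * (a 20-byte header), so the loop bound is a compile-time constant and
 * two folds are always enough (10 * 0xffff fits in 20 bits). */
uint16_t checksum_hdr20(const uint16_t *words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < 10; i++)
        sum += words[i];
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Both functions return the same value for any 10-word input; the clone simply gives the optimizer a constant to work with, which is the partial-evaluation aspect named on the slide.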
Outlining & Cloning Summary
[Figure: memory layout of function A and function B under three schemes: Standard Layout, After Outlining, and After Cloning. In the cloned layout, frequently executed code is copied and relocated as clone A and clone B. Legend: frequently executed instructions, infrequently executed instructions.]

Path-Inlining
Collapse deeply-nested functions
• Assume fast path is known
• Compile entire function as single unit
Advantages
• Removes call-overheads
• Increases context for optimizer

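A small sketch may help show what collapsing a path means. Everything here is hypothetical and chosen for illustration: a three-layer receive path with fixed header sizes (14, 20, and 20 bytes, assuming no options), first written as layered calls through function pointers and then as a single function for the known fast path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message descriptor: a view onto the unparsed bytes. */
typedef struct { const uint8_t *data; size_t len; } msg_t;

/* Strip n header bytes, or fail if the message is too short. */
static int pop(msg_t *m, size_t n)
{
    if (m->len < n) return -1;
    m->data += n;
    m->len -= n;
    return 0;
}

/* One tiny function per layer (assumed fixed header sizes). */
static int eth_pop(msg_t *m) { return pop(m, 14); }
static int ip_pop(msg_t *m)  { return pop(m, 20); }
static int tcp_pop(msg_t *m) { return pop(m, 20); }

/* Layered version: one indirect call and one length check per layer. */
typedef int (*layer_fn)(msg_t *);
static const layer_fn layers[] = { eth_pop, ip_pop, tcp_pop };

int deliver_layered(msg_t *m)
{
    for (size_t i = 0; i < sizeof layers / sizeof layers[0]; i++)
        if (layers[i](m) != 0)
            return -1;      /* note: m may be partially consumed */
    return 0;
}

/* Path-inlined version: the path is known in advance, so the three
 * layers collapse into one function with a single combined check.
 * On failure the message is left untouched. */
int deliver_inlined(msg_t *m)
{
    enum { PATH_HDR = 14 + 20 + 20 };
    if (m->len < PATH_HDR)
        return -1;
    m->data += PATH_HDR;
    m->len -= PATH_HDR;
    return 0;
}
```

For a message that is long enough, both versions leave `m` pointing at the same payload. The inlined one has no calls and gives the optimizer the whole path at once, which corresponds to the two advantages listed on the slide.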
End-to-End Latency
Roundtrip time in µs:
[Figure: bar charts for TCP and RPC (axis 0 to 500), each with bars BAD, STD, and OPT. Values in extraction order; their assignment to bars is not recoverable from the text: 498.8, 457.1, 399.2, 365.5, 351, 310.8.]

Processing Latency
Processing-time per roundtrip in µs:
[Figure: bar charts for RPC and TCP (axis 0 to 300), each with bars BAD, STD, and OPT. Values in extraction order; their assignment to bars is not recoverable from the text: 288.8, 247.1, 189.2, 155.5, 141, 100.8.]

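The numbers on the End-to-End Latency and Processing Latency slides can be related by simple arithmetic. The sketch below pairs the six values from each slide in the order they appear (an assumption, since the bar labels did not survive extraction) and subtracts them. Every pair differs by the same 210 µs, which suggests, though the slides do not say so explicitly, that a fixed share of each roundtrip is spent outside protocol processing. The function name `non_processing_us` is hypothetical.

```c
#include <stdio.h>

/* Values as they appear on the two latency slides, in extraction order.
 * Assumption: the i-th value on one slide corresponds to the i-th value
 * on the other. */
static const double end_to_end[6] = {498.8, 457.1, 399.2, 365.5, 351.0, 310.8};
static const double processing[6] = {288.8, 247.1, 189.2, 155.5, 141.0, 100.8};

/* Roundtrip time minus processing time for entry i, in microseconds. */
double non_processing_us(int i)
{
    return end_to_end[i] - processing[i];
}

/* Print the difference for all six entries. */
void print_differences(void)
{
    for (int i = 0; i < 6; i++)
        printf("%.1f - %.1f = %.1f us\n",
               end_to_end[i], processing[i], non_processing_us(i));
}
```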
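Memory System Performance
[Figure: stacked horizontal bars of iCPI and mCPI (axis 0 to 7) for BAD, DUX, STD, and OPT, one panel each for TCP and RPC. Values in extraction order; their assignment to bar segments is not recoverable from the text. BAD: 1.61, 4.58, 1.69, 4.66. DUX: 2, 2.3. STD: 1.72, 1.58, 1.78, 1.69. OPT: 1.57, 1.17, 1.67, 0.81.]
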
Outlining Effectiveness
• TCP
[Figure: two pie charts, No Outlining and With Outlining, each split into Used and Unused. Values in extraction order: 21%, 15%, 85%, 79%.]
• RPC
  – Essentially identical performance.

Conclusions
• Instruction cache bandwidth major bottleneck
• Cache collisions not particularly bad
• Processor/Memory gap still growing; now:
  – 300MHz processor
  – 100Mbps Ethernet
  – 80MB/s memory system

Conclusions
• Outlining
  – Readily applicable
  – Relatively convenient
• Cloning and path-inlining
  – Requires “path” notion: see Scout OS
  – Need better (automatic) tools

Dynamics
[Figure: the TCP/IP stack (TCPTEST, TCP, IP, VNET, ETH, LANCE) and the RPC stack (XRPCTEST, MSELECT, VCHAN, CHAN, BID, BLAST, IP, VNET, ETH, LANCE) annotated with xCall(), semWait(), semSignal(), processFrame(), and points labeled a, b, and c.]