Advanced Network Performance Monitoring and Troubleshooting Richard Carlson March 5, 2009 rcarlson@internet2.edu
Basic Premise • Application performance should meet your expectations! • If it doesn't, you should complain! • But you need to complain effectively!
Why is it hard to Find/Fix Problems? Network infrastructure is complex Network infrastructure is shared Network infrastructure consists of multiple components
Example 1 – SCP file transfer Bob and Carol are collaborating on a project. Bob needs to send a copy of the data (50 MB) to Carol every ½ hour. Bob and Carol are 2,000 miles apart. How long should each transfer take? • 5 minutes? • 1 minute? • 5 seconds?
What should we expect? Assumptions: • 100 Mbps Fast Ethernet is the slowest link • 50 msec round trip time Bob & Carol calculate: • 50 MB * 8 = 400 Mbits • 400 Mb / 100 Mb/sec = 4 seconds
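Bob and Carol's estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch under the slide's assumptions (a 100 Mbps bottleneck and decimal megabytes); the function name is illustrative, and the result is a best case that ignores protocol overhead and TCP slow start.

```python
def ideal_transfer_seconds(size_mbytes: float, link_mbps: float) -> float:
    """Best-case transfer time: payload bits divided by the bottleneck rate."""
    size_mbits = size_mbytes * 8  # 1 byte = 8 bits
    return size_mbits / link_mbps


# 50 MB over a 100 Mbps bottleneck link
print(ideal_transfer_seconds(50, 100))  # 4.0
```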
Initial SCP Test Results
Initial Test Results This is unacceptable! First look for a network infrastructure problem • Use the NDT tester to examine both hosts
Initial NDT testing shows Duplex Mismatch at one end
NDT Found Duplex Mismatch Investigation finds that the switch port is configured for 100 Mbps Full-Duplex operation. • Network administrator corrects the configuration and asks for a re-test
Duplex Mismatch Corrected
SCP results after Duplex Mismatch Corrected
Intermediate Results Time dropped from 18 minutes to 40 seconds. But our calculations said it should take 4 seconds! • 400 Mb / 40 sec = 10 Mbps • Why are we limited to 10 Mbps? • Are you satisfied with 1/10th of the possible performance?
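The same arithmetic, run in reverse, turns a measured transfer time into an effective rate. A minimal sketch using the two times quoted on this slide (18 minutes and 40 seconds) for the 50 MB file; the function name is illustrative.

```python
def effective_mbps(size_mbytes: float, seconds: float) -> float:
    """Achieved rate in Mbps for a transfer of size_mbytes taking `seconds`."""
    return size_mbytes * 8 / seconds


before_fix = effective_mbps(50, 18 * 60)  # with the duplex mismatch
after_fix = effective_mbps(50, 40)        # mismatch corrected

print(round(before_fix, 2))  # 0.37
print(after_fix)             # 10.0
```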
Default TCP window settings
Calculating the Window Size
Remember Bob found the round-trip time was 50 msec
Calculate window size limit
• 85.3 KB * 1024 B/KB * 8 b/B ≈ 698,777 b
• 698,777 b / 0.050 s = 13.98 Mbps
Calculate new window size
• (100 Mb/s * 0.050 s) / 8 b/B = 625,000 B ≈ 610.3 KB
• Use 1 MB as a minimum
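Both window calculations follow from the same relation: a TCP sender can have at most one window of data in flight per round trip. The sketch below reproduces the slide's numbers, assuming 1 KB = 1024 bytes and 1 Mb = 10^6 bits as the slide's figures imply; the function names are illustrative.

```python
KB = 1024  # bytes per KB, as implied by the slide's figures


def window_limited_mbps(window_bytes: float, rtt_s: float) -> float:
    """Maximum rate when at most one window is in flight per round trip."""
    return window_bytes * 8 / rtt_s / 1e6


def required_window_bytes(link_mbps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes needed in flight to fill the link."""
    return link_mbps * 1e6 * rtt_s / 8


# 85.3 KB window at 50 msec RTT
print(round(window_limited_mbps(85.3 * KB, 0.050), 2))     # 13.98
# Window needed to fill 100 Mbps at 50 msec RTT, in KB
print(round(required_window_bytes(100, 0.050) / KB, 2))    # 610.35
```

Rounding the result up to 1 MB, as the slide suggests, leaves headroom over the computed minimum.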
Resetting Window Value
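The screenshot for this slide is not recoverable, so the exact commands Bob used are unknown. As an illustration only, one common way for an application to ask for larger buffers on a single socket is `setsockopt` with `SO_SNDBUF` and `SO_RCVBUF`. This is a sketch, not the method shown in the talk: the helper name is hypothetical, and the operating system may clamp the request to a system-wide maximum (Linux also typically reports about double the requested value).

```python
import socket


def request_buffers(sock: socket.socket, nbytes: int) -> tuple:
    """Ask the OS for nbytes of send/receive buffer; return what it granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_snd, granted_rcv


# Request the 1 MB minimum suggested by the window calculation
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    snd, rcv = request_buffers(s, 1024 * 1024)
    print(snd, rcv)  # values depend on the OS and its configured limits
```

If the granted values come back well below the request, the system-wide limits need raising before per-socket settings can take effect.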
With TCP windows tuned
Steps so far Found and fixed Duplex Mismatch • Network Infrastructure problem Found and fixed TCP window values • Host configuration problem Are we done yet?
SCP results with tuned windows
Intermediate Results SCP still runs slower than expected • Hint: SCP uses internal buffers • Patch available from PSC
SCP Results with tuned SCP
Final Results Fixed infrastructure problem Fixed host configuration problem Fixed application configuration problem • Achieved target time of 4 seconds to transfer a 50 MB file over 2,000 miles
Example 2 - PNNL Throughput Problem 950+ Mbps from remote sites to PNNL [Figure: map with measured speeds of 966 Mbps, 930 Mbps, and 328 Mbps] Measured speeds show a problem when PNNL sends
PNNL Throughput Problem 950+ Mbps from remote sites to PNNL [Figure: same map with round-trip times added: 966 Mbps at 6 msec, 930 Mbps at 23 msec, 328 Mbps at 76 msec] Interesting: RTT increases by a factor of 3 and speed decreases by the same factor
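The factor-of-3 observation can be checked from the two longer paths. A minimal sketch, assuming the slide pairs 930 Mbps with 23 msec and 328 Mbps with 76 msec; the variable names are illustrative.

```python
# (speed in Mbps, RTT in msec) for the two longer paths on the slide
mid_path = (930, 23)
long_path = (328, 76)

rtt_ratio = long_path[1] / mid_path[1]    # how much longer the RTT is
speed_ratio = mid_path[0] / long_path[0]  # how much slower the transfer is

print(round(rtt_ratio, 2))    # 3.3
print(round(speed_ratio, 2))  # 2.84
```

Both ratios are close to 3, which is the pattern expected when something other than link capacity limits throughput in proportion to round-trip time.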
PNNL Throughput Problem
950+ Mbps from remote sites to PNNL
• 966 Mbps, 6 msec RTT: 0.0094% loss, 6.04% ooo
• 930 Mbps, 23 msec RTT: 0.0045% loss, 5.5% ooo
• 328 Mbps, 76 msec RTT: 0.0049% loss, 5.15% ooo
Finally: look at loss rate and packet reordering (ooo) rate; the problem exists in the Seattle-PNNL metro net
Advanced user tools • Existing NDT tool • Allows users to test network path for a limited number of common problems • Existing NPAD tool • Allows users to test local network infrastructure while simulating a long path
Network Diagnostic Tool (NDT) • Measure performance to the user's desktop • Identify real problems for real users • Network infrastructure is the problem • Host tuning issues are the problem • Make the tool simple to use and understand • Make the tool useful for users and network administrators
NDT sample Results
Finding a Server • What? You don’t have one running at your site? • Install the Internet2 Network Performance Toolkit Knoppix Disk
NPAD/pathdiag • A new tool from researchers at the Pittsburgh Supercomputing Center • Finds problems that affect long network paths • Uses a Web100-enhanced Linux-based server • Web-based Java client
Long Path Problem [Figure: network diagram with hosts H1, H2, and H3, Switches 1 to 4, and routers R1 to R9, with a fault marked X on the path. RTT from H1 to H2 is 1 msec; RTT from H1 to H3 is 70 msec.]
NPAD Server main page
NPAD Sample results
Finding a Server • What? You don’t have one running at your site? • Install the Internet2 Network Performance Toolkit Knoppix Disk
Sample BWCTL results
OWping Results
NPToolkit Knoppix Disk
Conclusions • OSG VDT will contain client tools • Network operators (campus, regional, national) are standing up servers • OSG site admins need to stand up a server ‘near’ the cluster