
UDP Performance and PCI-X Activity of the Intel 10 Gigabit Ethernet Adapter



  1. UDP Performance and PCI-X Activity of the Intel 10 Gigabit Ethernet Adapter on:
     - HP rx2600 Dual Itanium 2
     - SuperMicro P4DP8-2G Dual Xeon
     - Dell PowerEdge 2650 Dual Xeon
     Richard Hughes-Jones, Manchester. PFLDNet, Argonne, Feb 2004.
     Many people helped, including: Sverre Jarp and Glen Hisdal (CERN Open Lab); Sylvain Ravot, Olivier Martin and Elise Guyot (DataTAG project); Les Cottrell, Connie Logg and Gary Buhrmaster (SLAC); Stephen Dallison (MB-NG).

     Outline:
     - Introduction
     - 10 GigE on Itanium IA64
     - 10 GigE on Xeon IA32
     - 10 GigE on Dell Xeon IA32
     - Tuning the PCI-X bus
     - SC2003 Phoenix

  2. Latency & Throughput Measurements
     - UDP/IP packets sent between back-to-back systems
       - Similar processing to TCP/IP, but no flow control or congestion avoidance algorithms
       - Used the UDPmon test program
     - Latency
       - Round-trip times measured with Request-Response UDP frames
       - Latency as a function of frame size; the slope s is given by
         s = sum over the data paths of 1 / (db/dt),
         where the data paths are: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
       - The intercept indicates processing times + HW latencies
       - Histograms of 'singleton' measurements
       - Tells us about the behavior of the IP stack and the way the HW operates (e.g. interrupt coalescence)
     - UDP Throughput
       - Send a controlled stream of UDP frames spaced at regular intervals (see the sketch after this slide)
       - Vary the frame size and the frame transmit spacing, and measure:
         - the time of first and last frames received
         - the number of packets received, lost, and out of order
         - a histogram of the inter-packet spacing of received packets
         - the packet loss pattern
         - the 1-way delay
         - the CPU load and number of interrupts
       - Tells us about the behavior of the IP stack, the way the HW operates, and the capacity & available throughput of the LAN / MAN / WAN

     The Throughput Measurements
     - UDP throughput: send a controlled stream of UDP frames, n bytes each, at regular intervals (a fixed wait time between frames)
     - Test protocol: zero the remote statistics (OK / done); send the data frames at regular intervals while recording the time to send and the time to receive and histogramming the inter-packet time; get the remote statistics (number received, number lost plus the loss pattern, number out of order, CPU load and number of interrupts, 1-way delay); signal the end of the test (OK / done)
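
The slides do not reproduce UDPmon itself; the following is a minimal Python sketch of the paced-sender idea described above. The receiver address, frame size, spacing and frame count are illustrative, and the busy-wait pacing and simple rate printout stand in for UDPmon's far more careful timing and statistics.

```python
# Minimal sketch of the paced UDP stream described above (not UDPmon itself).
# The receiver address, frame size, spacing and frame count are illustrative.
import socket
import struct
import time

PEER = ("192.168.0.2", 5001)   # hypothetical back-to-back receiver
FRAME_BYTES = 1472             # UDP payload size under test
SPACING_US = 20.0              # requested inter-frame transmit spacing
N_FRAMES = 10000

def send_stream():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytearray(FRAME_BYTES)
    spacing = SPACING_US * 1e-6
    start = time.perf_counter()
    next_send = start
    for seq in range(N_FRAMES):
        # A sequence number in the first 4 bytes lets the receiver count
        # lost and out-of-order frames and histogram inter-packet spacing.
        struct.pack_into("!I", payload, 0, seq)
        sock.sendto(payload, PEER)
        next_send += spacing
        while time.perf_counter() < next_send:   # crude busy-wait pacing
            pass
    elapsed = time.perf_counter() - start
    rate_mbit = N_FRAMES * FRAME_BYTES * 8 / elapsed / 1e6
    print(f"sent {N_FRAMES} frames in {elapsed:.3f} s, "
          f"user-data rate {rate_mbit:.1f} Mbit/s")

if __name__ == "__main__":
    send_stream()
```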

  3. The PCI Bus & Gigabit Ethernet Measurements
     - PCI activity measured with a logic analyser:
       - PCI probe cards in the sending PC
       - a Gigabit Ethernet fiber probe card
       - PCI probe cards in the receiving PC
     [Diagram: CPU / chipset / memory / NIC in each PC, the Gigabit Ethernet probe between them, and the logic analyser display]

     Example: The 1 Gigabit NIC, Intel PRO/1000
     - Motherboard: SuperMicro P4DP6; chipset: E7500 (Plumas)
     - CPU: Dual Xeon 2.2 GHz with 512k L2 cache
     - Mem bus 400 MHz; PCI-X 64 bit 66 MHz
     - HP Linux Kernel 2.4.19 SMP
     - MTU 1500 bytes
     - NIC: Intel PRO/1000 XT
     [Throughput plot: received wire rate (Mbit/s) vs transmit time per frame (µs) for frame sizes of 50 to 1472 bytes]
     [Latency plot: latency (µs) vs message length (bytes); straight-line fits y = 0.0093x + 194.67 and y = 0.0149x + 201.75; a sketch of this fit follows this slide]
     [Latency histograms N(t) for 64, 512, 1024 and 1400 byte frames, Intel NIC on a 64 bit 66 MHz bus]
     [Bus activity traces: send transfer and receive transfer]
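
The straight-line fits quoted on the latency plot are least-squares fits of latency against message length, with the slope giving the per-byte cost of the data paths and the intercept the fixed overheads. A small sketch of that fitting step; the data points below are placeholders, not the measured values.

```python
# Sketch of how slope/intercept figures like those above are obtained:
# a least-squares straight-line fit of latency against message size.
# The data points below are placeholders, not measured values.
import numpy as np

msg_bytes = np.array([64, 512, 1024, 1400, 2048, 3000])
latency_us = np.array([202.7, 209.4, 217.0, 222.6, 232.3, 246.5])

slope, intercept = np.polyfit(msg_bytes, latency_us, 1)
print(f"slope     = {slope:.4f} us/byte  (per-byte cost of all data paths)")
print(f"intercept = {intercept:.1f} us   (processing times + HW latencies)")
```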

  4. Data Flow: SuperMicro 370DLE: SysKonnect
     - Motherboard: SuperMicro 370DLE; chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI: 64 bit 66 MHz
     - RedHat 7.1, Kernel 2.4.14
     [Logic-analyser traces: send CSR setup, send transfer, send PCI, receive PCI, packet on the Ethernet fibre, receive transfer]
     - 1400 bytes sent in ~36 µs, then wait 100 µs
     - ~8 µs for send or receive
     - Stack & application overhead ~10 µs per node

     [Photo: the 10 Gigabit Ethernet NIC with the PCI-X probe card]

  5. [Photo: the Intel PRO/10GbE LR Adapter in the HP rx2600 system]

     10 GigE on Itanium IA64: UDP Latency
     - Motherboard: HP rx2600 IA64; chipset: HP zx1
     - CPU: Dual Itanium 2 1 GHz with 512k L2 cache
     - Mem bus: dual 622 MHz, 4.3 GByte/s
     - PCI-X 133 MHz
     - HP Linux Kernel 2.5.72 SMP
     - Intel PRO/10GbE LR Server Adapter
     - NIC driver settings: RxIntDelay=0, XsumRX=1, XsumTX=1, RxDescriptors=2048, TxDescriptors=2048
     - MTU 1500 bytes
     - Latency ~100 µs and very well behaved
     - Latency slope 0.0033 µs/byte back-to-back; expect 0.00268 µs/byte (a worked check follows this slide) from:
       - PCI 0.00188 **
       - 10GigE 0.0008
       - PCI 0.00188
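
The expected slope follows the formula from the latency slide: the sum of 1 / (transfer rate) over the data paths. Below is a worked sketch of one way to arrive at the 0.00268 µs/byte figure, assuming each host crosses a 64-bit / 133 MHz PCI-X bus in addition to the 10 GigE link; the slide lists its own per-component breakdown above.

```python
# Worked check of the expected latency slope: the sum of 1/(transfer rate)
# over the data paths, here taken as PCI-X on each host plus the 10 GigE link.
# Assumes a 64-bit / 133 MHz PCI-X bus, i.e. roughly 8.5 Gbit/s per crossing.

def us_per_byte(rate_gbit_s):
    """Per-byte transfer cost in microseconds at the given line rate."""
    return 8.0 / (rate_gbit_s * 1e9) * 1e6

pcix = us_per_byte(64 * 133e6 / 1e9)   # 64 bit x 133 MHz = 8.512 Gbit/s
tenge = us_per_byte(10.0)              # 10 GigE line rate

print(f"PCI-X   : {pcix:.5f} us/byte")             # ~0.00094
print(f"10 GigE : {tenge:.5f} us/byte")            # 0.00080
print(f"expected: {pcix + tenge + pcix:.5f} us/byte (slide: 0.00268)")
```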

  6. 10 GigE on Itanium IA64: Latency Histograms
     - Double-peak structure, with the peaks separated by 3-4 µs
     - Peaks are ~1-2 µs wide
     - Similar to that observed with 1 Gbit Ethernet NICs on IA32 architectures

     10 GigE on Itanium IA64: UDP Throughput
     - HP Linux Kernel 2.5.72 SMP
     - MTU 16114 bytes
     - Max throughput 5.749 Gbit/s
     - Interrupt on every packet
     - No packet loss in 10M packets
     [Plot: received wire rate (Mbit/s) vs spacing between frames (µs) for packet sizes of 1472 to 16080 bytes; the wire-rate bookkeeping is sketched after this slide]
     - Sending host: one CPU is idle; for 14000-16080 byte packets one CPU is ~40% in kernel mode; as the packet size decreases the load rises to ~90% for packets of 4000 bytes or less
     [Plot: sender % CPU in kernel mode vs spacing between frames]
     - Receiving host: both CPUs busy; 16114-byte packets ~40% kernel mode; small packets ~80% kernel mode
     [Plot: receiver % CPU in kernel mode vs spacing between frames]
     - TCP gensink data rate was 745 MBytes/s = 5.96 Gbit/s
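
The "Recv Wire rate" axis on these plots counts the full on-the-wire size of each frame, not just the UDP payload. A minimal sketch of that bookkeeping, assuming standard Ethernet, IP and UDP overheads; the exact accounting used by UDPmon is not given in the slides.

```python
# Sketch of a wire-rate calculation: the payload plus IP/UDP headers and
# Ethernet framing overhead, divided by the inter-frame spacing.
# The overhead figures are standard Ethernet/IP/UDP values, assumed here.

ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble + MAC header + FCS + inter-frame gap
IP_UDP_HEADERS = 20 + 8

def wire_rate_mbit(payload_bytes, spacing_us):
    """On-the-wire rate for UDP payloads of payload_bytes sent every spacing_us."""
    wire_bytes = payload_bytes + IP_UDP_HEADERS + ETH_OVERHEAD
    return wire_bytes * 8 / (spacing_us * 1e-6) / 1e6

# e.g. 16080-byte payloads sent every 22 us work out to roughly 5.9 Gbit/s
print(f"{wire_rate_mbit(16080, 22):.0f} Mbit/s")
```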

  7. 10 GigE on Itanium IA64: UDP Throughput [04]
     - HP Linux Kernel 2.6.1 #17 SMP
     - MTU 16114 bytes
     - Max throughput 5.81 Gbit/s
     - Interrupt on every packet
     - Some packet loss for packets < 4000 bytes
     [Plot: received wire rate (Mbit/s) vs spacing between frames (µs) for packet sizes of 1472 to 16080 bytes]
     - Sending host: one CPU is idle, but the CPUs swap over; for 14000-16080 byte packets one CPU is 20-30% in kernel mode; as the packet size decreases the load rises to ~90% for packets of 4000 bytes or less
     [Plot: sender % CPU in kernel mode vs spacing between frames]
     - Receiving host: one CPU is idle, but the CPUs swap over; 16114-byte packets ~40% kernel mode; small packets ~70% kernel mode
     [Plot: receiver % CPU in kernel mode vs spacing between frames]

     10 GigE on Itanium IA64: PCI-X bus Activity
     - 16080-byte packets sent every 200 µs; Intel PRO/10GbE LR Server Adapter; MTU 16114
     - setpci -s 02:1.0 e6.b=2e sets mmrbc to 4096 bytes (values 22, 26, 2a give 512, 1024, 2048 bytes); see the sketch after this slide
     [Logic-analyser trace of the PCI-X signals for the transmit (memory to NIC) transfer of 16114 bytes: CSR accesses with ~700 ns gaps, PCI-X sequences of 4096 bytes made up of 256-byte PCI-X bursts]
     - The transfer is made up of 4 PCI-X sequences of ~4.55 µs each, followed by a gap of 700 ns
     - Each sequence is 4096 bytes long (the mmrbc) and contains 16 PCI-X bursts of 256 bytes
     - Interrupt and processing: 48.4 µs after the start
     - The data transfer takes ~22 µs
     - Data transfer rate over the PCI-X bus: 5.86 Gbit/s
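
The setpci values quoted above (22, 26, 2a, 2e) differ only in bits [3:2] of the PCI-X command register, which encode the maximum memory read byte count (mmrbc). A short sketch that reconstructs those byte values, assuming the rest of the register keeps the 0x22 base value shown on the slide.

```python
# Reconstruct the setpci byte values quoted above: bits [3:2] of the PCI-X
# command register select the maximum memory read byte count (mmrbc).
# The 0x22 base (all the other bits) is taken from the slide and assumed fixed.

MMRBC_CODES = {512: 0, 1024: 1, 2048: 2, 4096: 3}
BASE = 0x22 & ~0x0C     # slide's register value with the mmrbc bits cleared

for size, code in MMRBC_CODES.items():
    value = BASE | (code << 2)
    print(f"mmrbc {size:4d} bytes -> setpci -s 02:1.0 e6.b={value:02x}")
```

Larger mmrbc lets each PCI-X sequence carry more data; the 4096-byte setting matches the 4096-byte sequences seen on the analyser trace above.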
