Huge Data Transfer Experimentation over Lightpaths
Corrie Kost, Steve McDonald (TRIUMF)
Wade Hong (Carleton University)
Motivation
• LHC expected to come on line in 2007
• data rates expected to exceed a petabyte a year
• large Canadian HEP community involved in the ATLAS experiment
• establishment of a Canadian Tier 1 at TRIUMF
  • replicate all or part of the experimental data
• need to be able to transfer “huge data” to our Tier 1
TRIUMF
• Tri-University Meson Facility
• Canada’s Laboratory for Particle and Nuclear Physics
• operated as a joint venture by UofA, UBC, Carleton U, SFU, and UVic
• located on the UBC campus in Vancouver
• five-year funding for 2005-2009 announced in the federal budget
• planned as the Canadian ATLAS Tier 1
TRIUMF
Lightpaths
• a significant design principle of CA*net 4 is the ability to provide dedicated point-to-point bandwidth over lightpaths under user control
• SURFnet’s similar philosophy provides the ability to establish an end-to-end lightpath from Canada to CERN
• optical bypass isolates “huge data transfers” from other users of the R&E networks
• lightpaths permit the extension of Ethernet LANs to the wide area
Ethernet: local to global
• the de facto LAN technology
• original Ethernet
  • shared media, half duplex, distance limited by the protocol
• modern Ethernet
  • point to point, full duplex, switched, distance limited by the optical components
• cost effective
Why native Ethernet Long Haul?
• more than 90% of Internet traffic originates from an Ethernet LAN
• data traffic on the LAN is increasing due to new applications
• Ethernet services with incremental bandwidth offer new business opportunities for carriers
• why not native Ethernet?
  • scalability, reliability, service guarantees
  • all of the above are research areas
• native Ethernet long-haul connections can be used today as a complement to the routed networks, not a replacement
Experimentation
• experimenting with 10 GbE hardware for the past 3 years
  • engaged 10 GbE NIC and network vendors
• mostly interested in disk-to-disk transfers with commodity hardware
• tweaking performance of Linux-based disk servers
  • engaged hardware vendors to help build systems
• testing data transfers over dedicated lightpaths
• engineering solutions for the e2e lightpath last mile
  • especially for 10 GbE
2002 Activities
• established the first end-to-end transatlantic lightpath between TRIUMF and CERN for iGrid 2002
  • bonded dual GbEs transported across a 2.5 Gbps OC-48
• initial experimentation with 10 GbE
  • alpha Intel 10 GbE LR NICs, Extreme BlackDiamond 6808 with 10 GbE LRi blades
• transferred data from the ATLAS DC from TRIUMF to CERN using bbftp and tsunami
Live continent to continent
• e2e lightpath up and running Sept 20, 20:45 CET
traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
iGrid 2002 Topology
Exceeding a Gbps (Tsunami)
2003 Activities
• CANARIE-funded directed research project, CA*net 4 IGT, to continue the experimentation
  • Canadian HEP community and CERN
• GbE lightpath experimentation between CERN and UofA for real-time remote farms
• data transfers over a GbE lightpath between CERN and Carleton U for transferring 700 GB of ATLAS FCAL test beam data
  • took 6.5 hrs versus 67 days (roughly 30 MB/s sustained versus about 1 Mb/s effective)
Current IGT Topology
2003 Activities
• re-establishment of 10 GbE experiments
  • newer Intel 10 GbE NICs and Force10 Networks E600 switches, IXIA network testers, servers from Intel and CERN OpenLab
• established the first native 10 GbE end-to-end transatlantic lightpath between Carleton U and CERN
• demonstrated at ITU Telecom World 2003
Demo during ITU Telecom World 2003
[topology diagram: HP and Intel Itanium-2 servers, an Intel Xeon server, and Ixia 400T testers at Geneva and Ottawa, connected through Force10 E600 switches and Cisco ONS 15454s via Amsterdam, Chicago, and Toronto over 10GE LAN PHY, 10GE WAN PHY, and OC192c links]
• 10 GbE WAN PHY over an OC-192 circuit using lightpaths provided by SURFnet and CA*net 4
• 9.24 Gbps using traffic generators
• 6 Gbps using UDP on PCs
• 5.65 Gbps using TCP on PCs
Results on the transatlantic 10 GbE
[plots: single-stream UDP throughput; single-stream TCP throughput]
• data rates limited by the PC, even for memory-to-memory tests
• UDP uses fewer resources than TCP on high bandwidth-delay product networks
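The slides do not name the measurement tool behind these single-stream numbers; a minimal sketch of such memory-to-memory tests using iperf, assuming the cern-10g host name from the earlier traceroute and illustrative window and rate values (a 10 Gb/s path at ~160 ms RTT needs a TCP window on the order of 200 MB):

# receiver
sysctl -w net.core.rmem_max=268435456     # allow a large TCP receive window
iperf -s -w 200M                          # TCP server (or: iperf -s -u for UDP)

# sender
sysctl -w net.core.wmem_max=268435456
iperf -c cern-10g -w 200M -t 60           # single-stream TCP
iperf -c cern-10g -u -b 6000M -t 60       # single-stream UDP at a 6 Gb/s offered rate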
2004-2005 Activities
• with the arrival of the third CA*net 4 lambda in the summer of 2004, looked at establishing a 10 GbE lightpath from TRIUMF
• Neterion (S2io) Xframe 10 GbE NICs, Foundry NetIron 40Gs, a Foundry NetIron 1500, servers from Sun Microsystems, and custom-built disk servers from Ciara Technologies
• distance problem between TRIUMF and the CA*net 4 OME 6500 in Vancouver
  • XENPAK 10 GbE WAN PHY at 1310 nm
2004-2005 Activities
• testing data transfers between TRIUMF and CERN, and TRIUMF and Carleton U, over a 10 GbE lightpath
• experimenting with robust data transfers
• attempting to maximize disk I/O performance from Linux-based disk servers
  • experimenting with disk controllers and processors
• ATLAS Service Challenges in 2005
2004-2005 Activities
• exploring a more permanent 10 GbE lightpath to CERN and lightpaths to the Canadian Tier 2 ATLAS sites from TRIUMF
  • CANARIE playing a lead role in helping to facilitate this
• still need to solve some last-mile lightpath issues
Experimental Setup at TRIUMF
[diagram: MRV FD optical transport and a Foundry NetIron 1500 (NI1500) connecting the Storm 1 and Storm 2 disk servers and a Sun server to the lightpath]
Xeon-based Servers
• Dual 3.2 GHz Xeons
• 4 GB memory
• 4 x 3ware 9500S-4LP (& -8) controllers
• 16 SATA150 120 GB drives
• 40 GB Hitachi 14R9200 drive
• Intel 10 GbE PXLA8590LR NIC
Some Xeon Server I/O Results
• read a pair of 80 GB (xfs) files for 67 hours – 120 TB – average 524 MB/s (software RAID0 of 8 SATA disks on each of a pair of hardware RAID0 RocketRAID 1820A controllers on Storm 2)
• 10 GbE S2io NICs back-to-back for 17 hrs – 10 TB – average 180 MB/s (from Storm 2 to Storm 1 with software RAID0 of 4 disks on each of 3 3ware 9500S-4 controllers in RAID0)
• 10 GbE lightpath, Storm 2 to an Itanium machine at CERN: 10, 15, 20, 25 bbftp streams averaged 18, 24, 27, 29 MB/s disk-to-disk (only 1 disk at CERN – max write speed 48 MB/s)
• continued Storm 1 to Storm 2 testing – many sustainability problems encountered and resolved; details available on request
• don’t do test flights too close to the ground:
  echo 100000 > /proc/sys/vm/min_free_kbytes
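The exact bbftp invocation is not shown on the slide; a minimal sketch of a multi-stream disk-to-disk transfer, assuming hypothetical file paths and remote account, the cern-10g host name from the earlier traceroute, and window values chosen only to illustrate tuning for the ~160 ms RTT:

# -s launches the remote bbftpd over ssh; -u gives the (hypothetical) remote account;
# setnbstream opens 20 parallel TCP streams; window sizes are in KB
bbftp -s -u atlas \
  -e "setnbstream 20; setrecvwinsize 8192; setsendwinsize 8192; put /raid/atlas-dc2.data /data/atlas-dc2.data" \
  cern-10g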
Opteron-based Servers
• Dual 2.4 GHz Opterons
• 4 GB memory
• 1 WD800JB 80 GB HD
• 16 SATA 300 GB HDs (Seagate ST3300831AS)
• 4 x 4-port InfiniBand-SATA multilane connections
• 2 RocketRAID 1820A controllers
• 10 GbE NIC
• 2 PCI-X at 133 MHz (*)
• 2 PCI-X at 100 MHz (*)
(*) Note: 64 bit x 133 MHz ≈ 8.5 Gb/s
Multilane InfiniBand-SATA
Server Specifications
                           TYAN K8S S2882                        SunFire V40z
dd /dev/zero > /dev/null   60 GB/s                               32 GB/s
CPU                        Dual 2.5 GHz Opterons                 Quad 2.5 GHz Opterons
PCI-X (64 bit)             2 @ 133 MHz (100 when both used)      4 @ 133 MHz full length
                           2 @ 100 MHz (66 when both used)       1 @ 133 MHz full length
                                                                 1 @ 100 MHz half length
                                                                 1 @ 66 MHz half length
Memory                     4 GB                                  8 GB
Disks                      16 x 300 GB SATA                      2 x 73 GB 10K SCSI 320
                                                                 3 x 147 GB 10K SCSI 320
I/O                        see slide “Optimal I/O Results”       3 x 147 GB as RAID0 / JBOD:
                                                                 write 160 to 123 MB/s
                                                                 read 176 to 130 MB/s
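The dd row above is a quick memory-copy sanity check; one way to reproduce such a figure (block size and count are illustrative, not the values used here):

# ~105 GB moved through memory; divide by the elapsed time for the GB/s figure
# (recent GNU dd also prints the rate itself)
time dd if=/dev/zero of=/dev/null bs=1M count=100000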
The Parameters
• 5 types of controllers
• number of controllers to use (1 to 4)
• number of disks per controller (1 to 16)
• RAID0, RAID5, RAID6, JBOD
• dual or quad Opteron systems
• 4-6 possible PCI-X slots (1 reserved for 10 GigE)
• Linux kernels (2.6.9, 2.6.10, 2.6.11)
• many tuning parameters (in addition to WAN tuning), e.g. the following (consolidated in the sketch after this slide):
  • blockdev --setra 8192 /dev/md0
  • chunk-size in mdadm (1024)
  • /sbin/setpci -d 8086:1048 e6.b=2e
    (modifies the MMRBC field in PCI-X configuration space for vendor 8086 and device 1048 to increase the transmit burst length on the bus)
  • echo 100000 > /proc/sys/vm/min_free_kbytes
  • ifconfig eth3 txqueuelen 100000
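As an illustration, the tunings above could be applied together roughly as follows; the /dev/sd[a-d] array devices and the 4-device RAID0 layout are hypothetical placeholders (only /dev/md0, eth3, and the 8086:1048 PCI ID appear on the slide):

# software RAID0 across the hardware-controller arrays, 1024 KB chunk size
mdadm --create /dev/md0 --level=0 --chunk=1024 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
# larger read-ahead on the md device (units are 512-byte sectors)
blockdev --setra 8192 /dev/md0
# raise the MMRBC field in PCI-X config space (vendor 8086, device 1048) for longer transmit bursts
/sbin/setpci -d 8086:1048 e6.b=2e
# keep a generous reserve of free pages so the VM does not stall sustained I/O
echo 100000 > /proc/sys/vm/min_free_kbytes
# deep transmit queue on the 10 GbE interface
ifconfig eth3 txqueuelen 100000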
The SATA Controllers
• 3ware 9500S-4
• 3ware 9500S-8
• Areca 1160
• SuperMicro DAC-SATA-MV8
• HighPoint RocketRAID 1820A
Areca 1160 Details
Extensive tests of the Areca and 8 other controllers were done by tweakers.net:
www.tweakers.net/benchdb/search/product/104629
www.tweakers.net/reviews/557
PROS
• internal & external web access
• many options: display disk temps, SATA300 + NCQ, email alerts
• supports filesystems >2 TB, 16 disks, 64 bit/133 MHz (24-disk PCI-Express x8 version available)
• RAID6 very robust; background rebuilds have low impact on I/O performance
CONS
• flaky – external hangs require a reboot, internal requires starting a new port
• trial and error to use the options, since few examples in the documentation
• JBOD performance mostly equal to a single disk
• background rebuilds (at 20% priority) 50-100 times slower than fast builds
Performance
• 15-disk RAID5: W/R 301/390 MB/s
• 15-disk RAID6: W/R 237/328 MB/s
• 2 x RAID0 (7 & 8 disks): W/R 361/405 MB/s
• RAID0 of 12 disks: W/R 349/306 MB/s
Recommendations
More Recommendations