Disk to Disk Data Transfers at 100Gbps SuperComputing 2011 Azher Mughal Caltech (HEP) CENIC 2012 http://supercomputing.caltech.edu
Agenda • Motivation behind SC 2011 Demo • Collaboration (Caltech, Univ of Victoria, Vendors) • Network & Servers Design • 100G Network testing • Disk to Disk Transfers • PCIe Gen3 Server Performance • How to design a Fast Data Transfer Kit • Questions ?
Motivation behind SC 2011 Demo
• The LHC experiments, with their distributed computing models and world-wide hands-on involvement in LHC physics, have brought a renewed focus on networks, and thus a renewed emphasis on the “capacity” and “reliability” of those networks
• Experiments have seen exponential growth in capacity
– 10X growth in usage every 47 months on ESnet over 18 years
– About 6M times capacity growth over 25 years across the Atlantic (LEP3Net in 1985 to USLHCNet as of today)
• The LHC experiments (CMS / ATLAS) are generating massively large data sets which need to be efficiently transferred to end sites anywhere in the world
• A sustained ability to use ever-larger continental and transoceanic networks effectively: high-throughput transfers
• HEP as a driver of R&E and mission-oriented networks
• Testing the latest innovations in both software and hardware
Harvey Newman, Caltech
CMS Data Transfer Volume (Oct 2010 – Feb 2011): 10 PetaBytes transferred over 4 months = 8.0 Gbps average (15 Gbps peak)
• Fiber distance from the Seattle show floor to the Univ of Victoria: about 217 km
• Optical switches: Ciena OME-6500 using 100GE OTU4 cards
• Brocade MLXe-4 routers
– One 100GE card with LR4 optics
– One 8+8 port 10GE line card
• Force10 Z9000 40GE switch
• Mellanox PCIe Gen3 dual-port 40GE NICs
• Servers with PCIe Gen3 slots using Intel E5 Sandy Bridge processors
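As a quick sanity check on the headline rate (not part of the original slides), the average follows directly from the volume and the transfer window; the ~120-day window is an assumption:

```python
# Back-of-the-envelope check of the quoted average rate.
# Assumptions (not from the slides): decimal petabytes and ~120 days of transfer time.
volume_bits = 10e15 * 8          # 10 PB expressed in bits
duration_s = 120 * 86400         # ~4 months in seconds
avg_gbps = volume_bits / duration_s / 1e9
print(f"Average rate: {avg_gbps:.1f} Gbps")   # ~7.7 Gbps, consistent with the quoted 8.0 Gbps
```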
SuperComputing 2011 Collaborators Caltech Booth 1223 Courtesy of Ciena
SC11: Hardware Facts
Caltech Booth
• SuperMicro Sandy Bridge E5 based servers
• Mellanox 40GE ConnectX-3 NICs, active fiber or passive copper cables
• Dell-Force10 Z9000 switch (all 40GE ports, 54 MB shared buffer)
• Brocade MLXe-4 router with one 100GE LR4 port and 16 x 10GE ports
• LSI 9265-8i RAID controllers (with FastPath), OCZ Vertex 3 SSD drives
SCinet
• Ciena OME6500 OTN switch
BC Net
• Ciena OME6500 OTN switch
Univ of Victoria
• Brocade MLXe-4 router with one 100GE LR4 port and 16 x 10GE ports
• Dell R710 servers with 10GE Intel NICs and OCZ Deneva SSD disks
SC11 - WAN Design for 100G
Caltech Booth - Detailed Network
Key Challenges
• First-hand experience with PCIe Gen3 servers using sample E5 Sandy Bridge processors; not many vendors were available for testing.
• Will FDT be able to get close to the line rate of the Mellanox ConnectX-3 network cards, 39.6 Gbps (theoretical peak)?
• What about the firmware and drivers for both Mellanox and LSI?
• Will the LAG between the Brocade 100G router and the Z9000 (10 x 10GE) work?
• End-to-end 100G and 40G testing: any transport issues?
• What do we know about the BIOS settings for Gen3?
Issues we faced
• SuperMicro servers and Mellanox CX3 drivers were all in BETA stage.
• The Mellanox NIC randomly threw interface errors, though no physical errors were present.
• The QSFP passive copper cables had issues at full rates, with occasional cable errors.
• The LAG between the Brocade and the Z9000 had hashing issues with 10 x 10GE, so we moved to 12 x 10GE.
• The LSI driver is single threaded, maxing out a single core.
How it looks from inside
[Server block diagram: the FDT transfer application drives the Mellanox NIC (MSI interrupts: 2/4/8) and two LSI RAID controllers over PCIe Gen3 x8 slots on a dual-socket Sandy Bridge board: two E5 CPUs, each with its own DDR3 channels, linked by QPI, plus additional PCIe Gen2 and DMI2 lanes.]
SC11: Software and Tuning
• RHEL 6.1 / Scientific Linux 6.1 distribution
• Fast Data Transfer (FDT) utility for moving data among the sites
– Writing to the RAID-0 (SSD disk pool)
– /dev/zero to /dev/null memory-to-memory tests
• Kernel SMP affinity (see the sketch after this list):
– Keep the Mellanox NIC driver queues on the processor cores where the NIC's PCIe lanes are connected
– Move the LSI driver IRQs to the second processor
• Use numactl to bind the FDT application to the second processor
• Change kernel TCP/IP parameters
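A minimal sketch of the affinity and TCP tuning described above, not the original SC11 scripts: the IRQ numbers, core masks, buffer sizes, and FDT invocation are illustrative assumptions and must be adapted from /proc/interrupts and the actual host topology.

```python
#!/usr/bin/env python3
"""Sketch of the IRQ-affinity, sysctl and NUMA-binding steps listed above (assumptions, run as root)."""
import subprocess

# Hypothetical IRQ numbers, as they would appear in /proc/interrupts on the target host.
MLX4_IRQS = [72, 73, 74, 75]     # Mellanox mlx4_en queue interrupts (assumed)
LSI_IRQS = [80]                  # LSI megaraid_sas interrupt (assumed)

SOCKET0_MASK = "00ff"            # cores 0-7: CPU attached to the NIC's PCIe lanes (assumed layout)
SOCKET1_MASK = "ff00"            # cores 8-15: second CPU, runs FDT and the RAID driver

def pin_irqs(irqs, mask):
    """Steer each IRQ to the given CPU mask via /proc/irq/<n>/smp_affinity."""
    for irq in irqs:
        with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
            f.write(mask)

pin_irqs(MLX4_IRQS, SOCKET0_MASK)
pin_irqs(LSI_IRQS, SOCKET1_MASK)

# Kernel TCP parameters; the exact values used at SC11 are not given in the slides.
TCP_SYSCTLS = {
    "net.core.rmem_max": "67108864",
    "net.core.wmem_max": "67108864",
    "net.ipv4.tcp_rmem": "4096 87380 67108864",
    "net.ipv4.tcp_wmem": "4096 65536 67108864",
}
for key, value in TCP_SYSCTLS.items():
    subprocess.run(["sysctl", "-w", f"{key}={value}"], check=True)

# Bind FDT (a Java application) to the second processor and its local memory;
# with no further arguments FDT starts in server mode -- adjust for the actual transfer.
subprocess.run(["numactl", "--cpunodebind=1", "--membind=1",
                "java", "-jar", "fdt.jar"], check=True)
```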
Hardware Settings/Tuning
• SuperMicro motherboards
– PCIe slots need to be manually set to Gen3; otherwise they default to Gen2 (a verification sketch follows below)
– Disable Hyper-Threading
– Change the PCIe payload size to the maximum (for the Mellanox NICs)
• Z9000
– Flow control needs to be turned on for the Mellanox NICs to work properly with the Z9000 switch
– Single-queue compared to 4-queue model
• Mellanox
– Got the latest firmware, which helped reduce the interface errors the NIC was throwing
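One way to confirm that the BIOS changes took effect is to read the negotiated link status and payload size from lspci; the PCI address below is a hypothetical placeholder, not a value from the slides.

```python
#!/usr/bin/env python3
"""Minimal sketch (assumption) for checking that a card trained at PCIe Gen3 x8
and a large payload size after the BIOS changes above."""
import subprocess

BDF = "04:00.0"   # hypothetical PCI address of the Mellanox ConnectX-3 (find via `lspci | grep Mellanox`)

out = subprocess.run(["lspci", "-vv", "-s", BDF],
                     capture_output=True, text=True, check=True).stdout
for line in out.splitlines():
    # LnkSta reports the negotiated speed/width (expect "Speed 8GT/s, Width x8" for Gen3 x8);
    # the DevCap/DevCtl sections report the supported and currently programmed MaxPayload.
    if "LnkSta:" in line or "MaxPayload" in line:
        print(line.strip())
```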
Servers Testing, reaching 100G
[Plot of inbound and outbound traffic in Gbps during the test]
Sustained 186 Gbps; enough to transfer 100,000 Blu-ray discs per day
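A rough check of the Blu-ray figure (not from the slides), assuming about 25 GB per single-layer disc; the slide's rounded 100,000 corresponds to a somewhat smaller effective disc size.

```python
# Rough check of the "100,000 Blu-ray per day" figure (assumption: ~25 GB per single-layer disc).
rate_gbps = 186
bytes_per_day = rate_gbps * 1e9 / 8 * 86400       # ~2.0e15 bytes, i.e. ~2 PB/day
discs_per_day = bytes_per_day / 25e9
print(f"{bytes_per_day / 1e15:.1f} PB/day, ~{discs_per_day:,.0f} Blu-ray discs/day")
```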
Disk-to-Disk Results: Peaks of 60 Gbps
60 Gbps of disk writes across 7 SuperMicro and Dell servers, with a mix of 40GE and 10GE servers.
Total Transfers
Total traffic among all the links during the show: 4.657 PetaBytes
Single Server Gen3 performance: 36.8 Gbps (TCP) during SC11
Post SC11: reaching 37.5 Gbps inbound, with peaks of 38 Gbps
[Throughput plots with CUBIC and HTCP congestion control: ~37 Gbps and 37.5 Gbps sustained, plus a 38 Gbps spike]
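The CUBIC vs. HTCP comparison in the plot comes down to switching the kernel's TCP congestion-control algorithm between runs; a minimal sketch, assuming root access and that the tcp_htcp module is built for the running kernel:

```python
# Minimal sketch: switch the TCP congestion-control algorithm between test runs.
import subprocess

def set_congestion_control(algo: str) -> None:
    """Load the algorithm's module if needed and select it system-wide."""
    subprocess.run(["modprobe", f"tcp_{algo}"], check=False)   # cubic is usually built in
    subprocess.run(["sysctl", "-w", f"net.ipv4.tcp_congestion_control={algo}"], check=True)

for algo in ("cubic", "htcp"):
    set_congestion_control(algo)
    # ... run the FDT transfer here and record the achieved throughput ...
```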
40GE Server Design Kit
Sandy Bridge E5 based servers (SuperMicro X9DRi-F or Dell R720):
• Intel E5-2670 with C1 or C2 stepping
• Mellanox ConnectX-3 PCIe Gen3 NIC
• Mellanox QSFP active fiber cables
• LSI 9265-8i 8-port SATA 6Gb/s RAID controller
• OCZ Vertex 3 SSDs, 6 Gb/s
• Dell-Force10 Z9000 40GE switch
Server cost = ~$10k
Future Directions
• Finding bottlenecks in the LSI RAID card driver; a new driver supporting MSI vectors (many configurable queues) is available
• A more refined approach to distributing the application and drivers among the cores
• Optimizing the Linux kernel, timers, and other unknowns
• A new driver, 1.5.7.2, is available from Mellanox
• Working with the Mellanox team to find the performance limitations keeping us from the theoretical 39.6 Gbps Ethernet rate with 9K packets
• Ways to lower CPU utilization …
• Understand/overcome SSD wear-out problems over time
• Yet to run tests with E5-2670 C2 stepping chips, which arrived a week ago
Summary
• 100Gbps network technology has shown the potential to transfer petascale physics datasets around the world in a matter of hours.
• A couple of highly tuned servers can easily reach the 100GE line rate, effectively utilizing PCIe Gen3 technology.
• Individual server tests using E5 processors and PCIe Gen3 based network cards have shown stable performance close to 37.5 Gbps.
• The Fast Data Transfer (FDT) application achieved an aggregate disk write rate of 60 Gbps.
• The MonALISA intelligent monitoring software effectively recorded and displayed the traffic on the 40/100G and the other 10GE links.
Questions ?