LHCnet: Proposal for LHC Network Infrastructure Extending Globally to Tier2 and Tier3 Sites
Artur Barczyk, Harvey Newman
California Institute of Technology / US LHCNet
LHCT2S Meeting, CERN, January 13th, 2011
THE PROBLEM TO SOLVE
LHC Computing Infrastructure
WLCG in brief:
• 1 Tier-0 (CERN)
• 11 Tier-1s, on 3 continents
• 164 Tier-2s, on 5 (6) continents
• Plus O(300) Tier-3s worldwide
CMS Data Movements (All Sites and Tier1-Tier2)
[Throughput plots, 120 days June-October: daily average total rates reach over 2 GBytes/s; daily average Tier1-Tier2 rates reach 1-1.8 GBytes/s. Last week (132 hours): Tier2-Tier2 traffic at ~25% to ~50% of Tier1-Tier2 during dataset reprocessing and repopulation; 1-hour averages up to 3.5 GBytes/s.]
Worldwide Data Distribution and Analysis (F. Gianotti)
[Plot: total throughput of ATLAS data through the Grid (MB/s per day), 1st January to November; annotations: 6 GB/s, ~2 GB/s (design), peaks of 10 GB/s reached.]
Grid-based analysis in Summer 2010: >1000 different users; >15M analysis jobs.
The excellent Grid performance has been crucial for the fast release of physics results. E.g. for ICHEP, the full data sample taken until Monday was shown at the conference on Friday.
Changing LHC Data Models
• 3 recurring themes:
  – Flat(ter) hierarchy: any site might in the future pull data from any other site hosting it
  – Data caching: analysis sites will pull datasets from other sites "on demand", including from Tier2s in other regions
    • possibly in combination with strategic pre-placement of data sets
  – Remote data access: jobs executing locally, using data cached at a remote site in quasi-real time
    • possibly in combination with local caching
• Expect variations by experiment
[Slide reproduced from Ian Bird, CHEP conference, Oct 2010]
Remote Data Access and Local Processing with Xrootd (CMS)
• Useful for smaller sites with less (or even no) data storage
• Only selected objects are read (with object read-ahead); no transfer of entire data sets
• CMS demonstrator: Omaha diskless Tier3, served data from Caltech and Nebraska (Xrootd)
• Strategic decision: remote access vs. data transfers
• Similar operations in ALICE for years
(Brian Bockelman, September 2010)
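A minimal sketch of what such remote, partial reads look like from the client side, assuming PyROOT and the XRootD protocol plugin are available; the redirector URL, file path, tree name and branch name are hypothetical placeholders, not the actual CMS demonstrator endpoints.

```python
# Sketch only: a job at a diskless site opens a file at a remote Xrootd-served
# site and reads only the branches it needs, instead of transferring the
# whole dataset. All endpoints and object names below are hypothetical.
import ROOT

# Open the file over the network via the Xrootd protocol (root:// URL).
f = ROOT.TFile.Open("root://xrootd.example.org//store/user/sample.root")
tree = f.Get("Events")                 # hypothetical tree name

# Deactivate all branches, then enable only the objects of interest, so that
# only those baskets are requested from the remote storage element.
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("muons_pt", 1)    # hypothetical branch name

for event in tree:
    pass                               # analysis would use event.muons_pt here

f.Close()
```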
[Slide reproduced from Ian Bird, CHEP conference, Oct 2010]
Requirements Summary (from Kors' document)
• Bandwidth:
  – Ranging from 1 Gbps (Minimal site) to 5-10 Gbps (Nominal) to N x 10 Gbps (Leadership)
  – No need for full-mesh @ full-rate, but several full-rate connections between Leadership sites
  – Scalability is important; sites are expected to migrate Minimal -> Nominal -> Leadership
  – Bandwidth growth: Minimal = 2x/yr, Nominal & Leadership = 2x/2yr (see the sketch after this list)
• "Staging": facilitate good connectivity to so far (network-wise) underserved sites
• Flexibility: should be able to include or remove sites at any time
• Budget neutrality: solution should be cost neutral [or at least affordable, A/N]
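As a small illustration of the growth figures above (not part of Kors' document), the following sketch projects the three site classes forward a few years; the starting capacities are illustrative picks from the quoted ranges, not prescriptions.

```python
# Illustrative projection of the quoted growth rates: Minimal sites double
# every year, Nominal and Leadership every two years (factor sqrt(2)/year).
# Starting values (1, 10 and 40 Gbps) are illustrative picks from the ranges.
growth_per_year = {"Minimal": 2.0, "Nominal": 2.0 ** 0.5, "Leadership": 2.0 ** 0.5}
start_gbps      = {"Minimal": 1,   "Nominal": 10,         "Leadership": 40}

for tier, g in growth_per_year.items():
    projection = [round(start_gbps[tier] * g ** year, 1) for year in range(5)]
    print(tier, projection)
# Minimal:    ~[1, 2, 4, 8, 16] Gbps -> outgrows a single 10G link in ~4 years
# Nominal:    ~[10, 14, 20, 28, 40] Gbps
# Leadership: ~[40, 57, 80, 113, 160] Gbps
```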
SOLUTION PROPOSAL
Lessons Learned
• The LHC OPN has proven itself; we shall learn from it
• Simple architecture
  – Point-to-point Layer 2 circuits
  – Flexible and scalable topology
• Grew organically
  – From star to partial mesh
  – Open to several technology choices, each of which satisfies requirements
• Federated governance model
  – Coordination between stakeholders
  – No single administrative body required
  – Made extensions and funding straightforward
• Remaining challenge: monitoring and reporting
  – Calls for more of a systems approach
Design Inputs
• Given the scale, geographical distribution and diversity of the sites, as well as funding, only a federated solution is feasible
• The current LHC OPN is not modified
  – The OPN will become part of a larger whole
  – Some purely Tier2/Tier3 operations
• Architecture has to be open and scalable
  – Scalability in bandwidth, extent and scope
• Resiliency in the core; allow resilient connections at the edge
• Bandwidth guarantees determinism
  – Reward effective use
  – End-to-end systems approach
• Operation at Layer 2 and below
  – Advantages in performance, cost, power consumption
Design Inputs, cont.
• Most/all R&E networks can (technically) offer Layer 2 services
  – Where not, commercial carriers can
  – Some advanced ones offer dynamic (user-controlled) allocation
• Leverage existing infrastructures and collaborations as much as possible
  – GLIF, DICE, GLORIAD, ...
• Last but not least: this would be the perfect occasion to start using IPv6; we should therefore (at least) encourage IPv6, but support IPv4
  – Admittedly the challenge is above Layer 3
Design Proposal
• A design satisfying all requirements: Switched Core with Routed Edge
• Sites interconnected through Lightpaths
  – Site-to-site Layer 2 connections, static or dynamic
• Switching is far more robust and cost-effective for high-capacity interconnects
• Routing (from the end-site viewpoint) is deemed necessary
Switched Core
• Strategically placed core exchange points
  – E.g. start with 2-3 in Europe, 2 in NA, 1 in SA, 1-2 in Asia
  – E.g. existing devices at Tier1s, GOLEs, GEANT nodes, ...
• Interconnected through high-capacity trunks
  – 10-40 Gbps today, soon 100 Gbps
• Trunk links can be CBF, multi-domain Layer 1 / Layer 2 links, ...
  – E.g. Layer 1 circuits with virtualised sub-rate channels, sub-dividing 100G links in early stages
• Resiliency, where needed, provided at Layer 1 / Layer 2
  – E.g. SONET/SDH Automatic Protection Switching, Virtual Concatenation
• At a later stage, automated Lightpath exchanges will enable flexible "stitching" of dynamic circuits
  – See demonstration (proof of principle) at the last GLIF meeting and SC10
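The core topology and its single-failure behaviour can be reasoned about with a simple graph model. The sketch below is purely illustrative, assuming the networkx library; node names and trunk capacities are invented placeholders, not the proposed exchange points.

```python
# Toy model of the switched core: a handful of core exchange points joined by
# high-capacity trunks, plus a check for trunks whose loss would partition the
# core (i.e. where extra resiliency would be needed). Names are placeholders.
import networkx as nx

core = nx.Graph()
trunks = [
    ("EU-1", "EU-2", 100), ("EU-2", "EU-3", 40), ("EU-1", "EU-3", 40),
    ("EU-1", "NA-1", 40),  ("EU-2", "NA-2", 40), ("NA-1", "NA-2", 100),
    ("NA-2", "SA-1", 10),  ("NA-1", "AS-1", 10), ("EU-3", "AS-1", 10),
]
for a, b, gbps in trunks:
    core.add_edge(a, b, capacity_gbps=gbps)

# Single-failure check: does removing any one trunk disconnect the core?
for a, b in list(core.edges):
    test = core.copy()
    test.remove_edge(a, b)
    if not nx.is_connected(test):
        print(f"Trunk {a}-{b} is a single point of failure")
# In this toy topology only the lone SA-1 trunk is unprotected, illustrating
# "resiliency where needed" rather than full redundancy everywhere.
```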
One Possible Core Technology: Carrier Ethernet
• IEEE standard 802.1Qay (PBB-TE)
  – Separation of backbone and customer networks through MAC-in-MAC
  – No flooding, no Spanning Tree
  – Scalable to 16M services
• Provides OAM comparable to SONET/SDH
  – 802.1ag, end-to-end service OAM
    • Continuity Check Messages, loopback, linktrace
  – 802.3ah, link OAM
    • Remote loopback, loopback control, remote failure indication
• Cost effective
  – E.g. an NSP study indicates TCO ~43% lower for COE (PBB-TE) vs. MPLS-TE
• 802.1Qay and the ITU-T G.8031 Ethernet Linear Protection standard provide 1+1 and 1:1 protection switching
  – Similar to SONET/SDH APS
  – Works via Y.1731 message exchange (ITU-T standard)
Routed Edge
• End sites (might) require Layer 3 connectivity in the LAN
  – Otherwise a true Layer 2 solution might be adequate
• Lightpaths terminate on a site's router
  – The site's border router, or, preferably,
  – the router closest to the storage elements
• All IP peerings are point-to-point, site-to-site
  – Reduces convergence time, avoids issues with flapping links
• Each site decides and negotiates with which remote sites it wishes to peer (e.g. based on the experiment's connectivity design)
• The router (BGP) advertises only the SE subnet(s) through the configured Lightpath (see the sketch below)
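As a rough illustration of this peering policy (not an actual router configuration), the sketch below derives, for each configured lightpath peering, the prefix set a site would announce; all site names, AS numbers and prefixes are hypothetical.

```python
# Sketch of the per-peering policy: over each point-to-point lightpath
# peering, a site announces only its storage-element subnet(s); the rest of
# the campus stays on general IP. All identifiers below are hypothetical.
SE_PREFIXES = {
    "SiteA": {"asn": 65001, "prefixes": ["192.0.2.0/25"]},
    "SiteB": {"asn": 65002, "prefixes": ["198.51.100.0/24"]},
}

# Pairs of sites that chose to establish a lightpath peering.
PEERINGS = [("SiteA", "SiteB")]

def announcements(local, remote):
    """Prefixes 'local' should advertise to 'remote' over their lightpath."""
    return SE_PREFIXES[local]["prefixes"]   # only the SE subnets are exported

for local, remote in PEERINGS:
    print(f"{local} (AS{SE_PREFIXES[local]['asn']}) -> {remote}:",
          announcements(local, remote))
```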
Lightpath Termination
• Avoid LAN connectivity issues when terminating a lightpath at the campus edge
• A lightpath should be terminated as close as possible to the Storage Elements, but this can be challenging if not impossible (support a dedicated border router?)
• Alternatively, provide a "local lightpath" (e.g. a VLAN with proper bandwidth, or a dedicated link where possible); the border router does the "stitching"
IP Backup
• Foresee IP routed paths as backup
  – The end-site's border router is configured for both default IP connectivity and direct peering through the Lightpath
  – Direct peering takes precedence
• Works also for dynamic Lightpaths
• For fully dynamic Lightpath setup, dynamic end-site configuration through e.g. LambdaStation or TeraPaths will be used
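A schematic sketch of the intended precedence, reducing the end-site router's view to "is the lightpath peering session up?"; the session-state structure and next-hop labels are hypothetical placeholders.

```python
# Sketch of the backup behaviour: traffic towards a remote SE prefers the
# dedicated lightpath peering whenever that session is up, and falls back to
# the general routed IP path otherwise. Names below are placeholders.
def next_hop(destination_se, lightpath_sessions, default_gw="general-ip-upstream"):
    """Pick the next hop for traffic towards a remote storage element."""
    session = lightpath_sessions.get(destination_se)
    if session and session["state"] == "established":
        return session["next_hop"]      # direct lightpath peering takes precedence
    return default_gw                   # IP routed backup path

sessions = {"SiteB-SE": {"state": "established", "next_hop": "lightpath-to-SiteB"}}
print(next_hop("SiteB-SE", sessions))   # -> lightpath-to-SiteB
sessions["SiteB-SE"]["state"] = "idle"
print(next_hop("SiteB-SE", sessions))   # -> general-ip-upstream
```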
Resiliency
• Resiliency in the core is provided by protection switching, depending on the technology used between core nodes
  – SONET/SDH or OTN protection switching (Layer 1)
  – MPLS failover
  – PBB-TE protection switching
  – Ethernet LAG
• Sites can opt for additional resiliency (e.g. where protected trunk links are not available) by forming transit agreements with other sites, akin to the current LHC OPN use of CBF
[Diagram slide: Layer 1 through Layer 3]
Scalability
• Assuming Layer 2 point-to-point operations, a natural scalability limitation is the 4k VLAN ID space
• This problem is naturally resolved by
  – PBB-TE (802.1Qay), through MAC-in-MAC encapsulation
    • Frame format: B-DA | B-SA | Ethertype 0x88A8 | B-VID | Ethertype 0x88E7 | I-SID | customer frame incl. header+FCS | B-FCS
  – dynamic bandwidth allocation with re-use of VLAN IDs
    • The only constraint is that no two connections through the same network element use the same VLAN
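To make the MAC-in-MAC layout above concrete, here is a simplified byte-packing sketch of the 802.1ah encapsulation; it leaves the B-TAG and I-TAG flag bits at zero and is an illustration of the field layout, not a production encapsulator.

```python
# Simplified illustration of MAC-in-MAC (IEEE 802.1ah / PBB) encapsulation as
# sketched on this slide: backbone DA/SA, B-TAG (Ethertype 0x88A8 + B-VID),
# I-TAG (Ethertype 0x88E7 + 24-bit I-SID), then the customer frame. Flag bits
# in the B-TAG and I-TAG are left at zero for brevity; addresses are examples.
import struct

def pbb_encapsulate(b_da, b_sa, b_vid, i_sid, customer_frame):
    assert i_sid < 2 ** 24          # 24-bit I-SID -> ~16M service instances
    assert b_vid < 2 ** 12          # backbone VLAN ID is still 12 bits
    b_tag = struct.pack("!HH", 0x88A8, b_vid)
    i_tag = struct.pack("!H", 0x88E7) + i_sid.to_bytes(4, "big")  # flags = 0
    return b_da + b_sa + b_tag + i_tag + customer_frame

frame = pbb_encapsulate(
    b_da=bytes.fromhex("020000000001"), b_sa=bytes.fromhex("020000000002"),
    b_vid=100, i_sid=0x0ABCDE, customer_frame=b"\x00" * 64)
print(len(frame), "bytes")          # backbone header plus encapsulated frame
```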