CMS remote data access (AAA) Giacinto DONVITO (INFN-Bari)



  1. CMS remote data access (AAA)
     Giacinto DONVITO (INFN-Bari)
     Thanks to: Brian Bockelman, Ken Bloom (UNL); Tommaso Boccali, Daniele Bonacorsi (INFN)

  2. AAA Project
     • Goal
       - Use resources more effectively through remote data access in CMS
     • Sub-goals
       - Low-ceremony/latency access to any single event
       - Reduce data access error rate
       - Overflow jobs from busy sites to less busy ones
       - Use opportunistic resources
       - Make life at T3s easier
     Any data, Anytime, Anywhere

  3. General overview
     • In CMS the association LFN->PFN is a simple "rule". No DB involved
     • We can use a plain ROOT installation with a given prefix at each site
     • Same namespace, irrespective of data location
       - TFile::Open("root://xrootd.unl.edu//store/foo")
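The LFN->PFN rule above can be sketched as plain string concatenation. This is a minimal illustration, not the CMS implementation: the function name `lfn_to_pfn` and the local prefix `/mnt/hadoop` are assumptions made for the example; only the `root://xrootd.unl.edu/` prefix and the `/store/foo` name come from the slide.

```python
def lfn_to_pfn(lfn: str, prefix: str) -> str:
    """Apply the site rule: physical name = site prefix + logical name."""
    if not lfn.startswith("/store/"):
        raise ValueError(f"expected an LFN under /store/, got {lfn!r}")
    return prefix + lfn


# Remote access through an xrootd endpoint (prefix taken from the slide).
print(lfn_to_pfn("/store/foo", "root://xrootd.unl.edu/"))
# Local access through a POSIX mount (hypothetical prefix).
print(lfn_to_pfn("/store/foo", "/mnt/hadoop"))
```

Because the rule is a pure function of the logical name, no catalogue lookup is needed: the same `/store/...` name is valid at every site and only the prefix changes.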

  4. Xrootd world-wide Federation
     [Diagram. Labels: Global Xrootd Redirector; US Xrootd Redirector; EU Xrootd Redirector; US Tier2 sites; EU Tier2 sites; EOS (CERN); Local storage; CMS jobs]

  5. XRootd: Federating Storage Systems
     • Step 1: deploy seamless global storage interface
     • But preserve site autonomy:
       - xrootd plugin maps from global logical filename to physical filename at site
         (mapping is typically trivial in CMS: /store/* → /store/*)
       - xrootd plugin reads from site storage system
         (examples: HDFS, dCache, Lustre, GPFS, DPM)
       - User authentication is also pluggable
         (but we use standard GSI + lcmaps + GUMS; a VOMS plugin is also used in production)

  6. Status of CMS Federation
     • US:
       - T1 (disk) + 8/8 T2s federated
       - Covers 100% of the data for analysis
       - Does not cover files only on tape
     • IT:
       - Complete deployment of Xrootd on the Tier1 + all the Tier2s
       - A few Tier3s have also joined
       - Both as clients and as servers
     • World:
       - 4 T1s (FNAL, CNAF, RAL, JINR) + 3/4 T2s accessible
       - Monitored but not a "turns your site red" service (yet)

  7. Regulation of Requests
     • On average 1 analysis job on AOD data needs about 250 kb/s
     • To first order, jobs still run at sites with the data
       - ~0.25 GB/s average remote read rate
       - O(10) GB/s average local read rate
       - ~1.5 GB/s PhEDEx transfer rate
     • Cases where data is read remotely:
       - Interactive: limited by # humans
       - Fallback: limited by error rate opening files
       - Overflow: limited by scheduling policy
       - Opportunistic: limited by scheduling policy
       - T3: usually a Tier3 is not that "big"
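To put the rates above in proportion, the aggregate read rates can be converted into "job-equivalents" at the quoted per-job rate. This is back-of-the-envelope arithmetic only, under two assumptions that are not stated on the slide: "kb" is read as kilobytes, and decimal units are used (1 GB = 10^6 kB). The value 10.0 stands in for the slide's "O(10)".

```python
KB_PER_GB = 1_000_000  # assumption: decimal units, "kb" read as kilobytes

PER_JOB_KB_S = 250.0   # per-job rate quoted on the slide
REMOTE_GB_S = 0.25     # average remote read rate quoted on the slide
LOCAL_GB_S = 10.0      # stand-in for the slide's "O(10) GB/s"


def equivalent_jobs(rate_gb_s: float, per_job_kb_s: float = PER_JOB_KB_S) -> float:
    """Number of average AOD jobs that would produce the given aggregate rate."""
    return rate_gb_s * KB_PER_GB / per_job_kb_s


remote_jobs = equivalent_jobs(REMOTE_GB_S)
local_jobs = equivalent_jobs(LOCAL_GB_S)
remote_share = REMOTE_GB_S / (REMOTE_GB_S + LOCAL_GB_S)

print(f"remote: ~{remote_jobs:.0f} job-equivalents")
print(f"local:  ~{local_jobs:.0f} job-equivalents")
print(f"remote share of reads: {remote_share:.1%}")
```

Under these assumptions, remote reads correspond to roughly a thousand average jobs and a few percent of total read traffic, which is consistent with the statement that jobs still mostly run where the data is.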

  8. More on Fallback
     • On file open error, CMS software can retry via alternate location/protocol
       - Configured by site admin
     • We fall back to regional xrootd federation
       - US, EU
       - Could also have inter-region fallback
     • Can recover from a missing file error, but not from corrupted files (more on this later)
     • Has more uses than just error recovery ...
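The fallback behaviour described above can be sketched as an ordered list of locations tried one after another on open errors. This is an illustrative sketch, not the CMS software's actual mechanism: `open_with_fallback`, `fake_opener`, the `/mnt/hadoop` prefix, and the `regional-redirector.example` hostname are all hypothetical names invented for the example.

```python
from typing import Callable, Sequence, Tuple


def open_with_fallback(
    lfn: str,
    prefixes: Sequence[str],
    opener: Callable[[str], object],
) -> Tuple[str, object]:
    """Try each location in order; return (url, handle) for the first that opens."""
    failures = []
    for prefix in prefixes:
        url = prefix + lfn
        try:
            return url, opener(url)
        except OSError as exc:
            # Only open errors trigger the fallback; remember them for the report.
            failures.append(f"{url}: {exc}")
    raise OSError("all locations failed: " + "; ".join(failures))


def fake_opener(url: str) -> str:
    """Stand-in opener: the local copy is missing, remote access works."""
    if not url.startswith("root://"):
        raise FileNotFoundError("no such file")
    return "handle for " + url


# Site-configured order: local storage first, then the regional federation.
LOCATIONS = ["/mnt/hadoop", "root://regional-redirector.example/"]
url, handle = open_with_fallback("/store/foo", LOCATIONS, fake_opener)
print(url)
```

The order of `LOCATIONS` plays the role of the site admin's configuration. An inter-region fallback would simply be a further entry in the list.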

  9. More about Overflow
     • GlideinWMS scheduling policy
     • Candidates for overflow:
       - Idle jobs with wait time above threshold (6h)
       - Desired data available in a region supporting overflow
     • Regulation of overflow:
       - Limited number of overflow glideins submitted per source site
     • Data access
       - No reconfiguration of job required
       - Uses fallback mechanism
       - Try local access, fall back to remote access on failure
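The candidate and regulation rules above can be sketched as a simple filter with a per-site cap. This is a simplified model, not GlideinWMS code: the `Job` fields, `select_overflow`, and the `max_per_site` parameter are hypothetical, and the cap is applied to jobs rather than to glideins for brevity. Only the 6-hour threshold comes from the slide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set

WAIT_THRESHOLD_H = 6.0  # threshold quoted on the slide


@dataclass
class Job:
    job_id: str
    source_site: str    # site holding the data the job originally targeted
    idle_hours: float   # time spent waiting in the queue
    data_region: str    # region where the desired data is available


def select_overflow(
    jobs: Iterable[Job], overflow_regions: Set[str], max_per_site: int
) -> List[Job]:
    """Pick jobs eligible for overflow, capped per source site."""
    taken = Counter()
    selected = []
    for job in jobs:
        if job.idle_hours <= WAIT_THRESHOLD_H:
            continue  # has not waited long enough
        if job.data_region not in overflow_regions:
            continue  # data not in a region supporting overflow
        if taken[job.source_site] >= max_per_site:
            continue  # regulation: cap reached for this source site
        taken[job.source_site] += 1
        selected.append(job)
    return selected


queue = [
    Job("a", "T2_X", 7.5, "US"),
    Job("b", "T2_X", 9.0, "US"),
    Job("c", "T2_X", 2.0, "US"),
    Job("d", "T2_Y", 8.0, "ASIA"),
]
picked = select_overflow(queue, overflow_regions={"US", "EU"}, max_per_site=1)
print([job.job_id for job in picked])
```

Selected jobs need no reconfiguration in this picture: they run elsewhere, fail to open the file locally, and reach the data through the fallback path.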

  10. Running Opportunistically
     • To run CMS jobs at non-CMS sites, we need:
       - Outbound network access
       - Access to CMS datafiles: Xrootd remote access
       - Access to conditions data: http proxy
       - Access to CMS software: CVMFS (also needs http proxy)
     • No need for data pre-placement or any kind of "site preparation"
     • Fully opportunistic use of computing resources
     • This is also useful for running in the cloud

  11. Fallback++
     • Today we can recover when a file is missing from the local storage system
     • But corrupted files cause jobs to fail
       - And the job may come back and fail again ...
       - Admin may need to intervene to recover the data
       - User may need to resubmit the job
     • Can we do better?

  12. Yes, We Hope
     • Concept
       - Fall back on read error
       - Cache remotely read data
       - Insert downloaded data back into storage system
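The three-step concept above can be sketched as follows. This is only an illustration of the idea, which the slides present as a hope rather than a deployed feature: all function names are hypothetical, storage is modelled with in-memory dictionaries, and healing is done per whole file rather than per block.

```python
from typing import Callable, Dict


def read_with_healing(
    lfn: str,
    local_read: Callable[[str], bytes],
    remote_read: Callable[[str], bytes],
    local_write: Callable[[str, bytes], None],
) -> bytes:
    """Read locally; on a read error, fetch remotely and repair the local copy."""
    try:
        return local_read(lfn)
    except OSError:
        data = remote_read(lfn)   # step 1: fall back on read error
        local_write(lfn, data)    # steps 2-3: keep the data and re-insert it
        return data


# Toy storage back ends (hypothetical): the local copy is missing or unreadable.
local_store: Dict[str, bytes] = {}
remote_store: Dict[str, bytes] = {"/store/foo": b"events"}


def local_read(lfn: str) -> bytes:
    if lfn not in local_store:
        raise OSError(f"cannot read {lfn}")
    return local_store[lfn]


def remote_read(lfn: str) -> bytes:
    return remote_store[lfn]


def local_write(lfn: str, data: bytes) -> None:
    local_store[lfn] = data


first = read_with_healing("/store/foo", local_read, remote_read, local_write)
second = local_read("/store/foo")  # succeeds now: the local copy was healed
print(first == second)
```

A real implementation would also need a way to detect corruption that does not surface as a read error, for example a checksum comparison, which this sketch leaves out.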

  13. File Healing

  14. Cross-site Replication
     • Once we have file healing ...
     • Could reduce the replication level from 2 to 1 at sites that have HDFS or a similar infrastructure
     • Use cross-site redundancy instead
       - Would need to enforce the replication policy at a higher level
       - May not be a good idea for hot data
       - Need to consider the impact on performance

  15. Performance
     Mostly CMS application-specific stuff
     • Improved remote read performance by combining multiple reads into vector reads
       - Eliminates many round-trips
     • Working on bit-torrent-like capabilities in CMS application
       - Read from multiple xrootd sources
       - Balance load away from slower source
       - React in O(1) minute time frame
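The round-trip saving from vector reads can be illustrated with a toy remote file that counts requests. This is a sketch under stated assumptions, not the xrootd client API: `FakeRemoteFile` and its `read`/`readv` methods are hypothetical stand-ins, and each call is taken to cost exactly one network round trip.

```python
from typing import List, Sequence, Tuple


class FakeRemoteFile:
    """Toy remote file where every request costs one round trip."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.round_trips = 0

    def read(self, offset: int, length: int) -> bytes:
        """Single read: one round trip per chunk."""
        self.round_trips += 1
        return self._data[offset:offset + length]

    def readv(self, chunks: Sequence[Tuple[int, int]]) -> List[bytes]:
        """Vector read: all (offset, length) chunks travel in one request."""
        self.round_trips += 1
        return [self._data[off:off + n] for off, n in chunks]


payload = bytes(range(256))
wanted = [(0, 4), (100, 8), (200, 16)]

one_by_one = FakeRemoteFile(payload)
separate = [one_by_one.read(off, n) for off, n in wanted]

vectored = FakeRemoteFile(payload)
combined = vectored.readv(wanted)

print(one_by_one.round_trips, vectored.round_trips)
```

Both approaches return the same bytes. Over a wide-area link, where each round trip costs a significant latency, the single request is what makes scattered reads affordable.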

  16. AAA Dashboard Monitoring

  17. AAA Dashboard Monitoring

  18. AAA Dashboard Monitoring

  19. AAA Dashboard Monitoring

  20. AAA Accounting

  21. Examples of using Xrootd
     15/04/2013: T2_Legnaro storage was down for a dCache upgrade. The site kept accepting and running analysis jobs without any problem by exploiting the Xrootd fallback.
     INFN tests using XRootd:
     • Reprocessing @ CNAF reading RAW data with Xrootd-Fallback from FNAL
     • Reprocessing @ T2_IT reading MC RAW with Xrootd-Fallback from CNAF

  22. Regional set-up
     • INFN in Italy has a dedicated Xrootd redirector where all the Italian resources are registered
     • INFN Tier3s can also join the federation as "data providers" and not only as "consumers"
     • All the data available on this redirector can be accessed with good bandwidth and very low latency
     • Thanks to the GARR-X infrastructure

  23. High-availability of redirector
     • Working on high availability for the Xrootd redirector
     • We are exploring the possibility of using an intelligent DNS set-up
     • The idea is to have a few instances of the xrootd redirector distributed geographically
     • If one server fails, the DNS is automatically reconfigured to use the others
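The DNS idea above can be sketched as a health check that decides which redirector instances stay published under the shared name. This is a guess at the general shape only, since the slides describe the set-up as still being explored: `healthy_records`, the probe function, and the `.example` hostnames are hypothetical, and no real DNS update is performed.

```python
from typing import Callable, List, Sequence


def healthy_records(
    instances: Sequence[str], is_alive: Callable[[str], bool]
) -> List[str]:
    """Return the redirector instances that should stay in the DNS alias."""
    alive = [host for host in instances if is_alive(host)]
    # Design choice for this sketch: if every probe fails, keep the full list
    # rather than publish an empty record set (the probe itself may be broken).
    return alive or list(instances)


# Hypothetical, geographically distributed redirector instances.
INSTANCES = [
    "redirector-1.example",
    "redirector-2.example",
    "redirector-3.example",
]

down = {"redirector-2.example"}  # pretend this server has failed
records = healthy_records(INSTANCES, lambda host: host not in down)
print(records)
```

Clients keep using a single redirector name, while the set of addresses behind it shrinks and grows as instances fail and recover.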

  24. Summary
     • The XRootd storage federation is rapidly expanding and proving useful within CMS
     • We hope to do more
       - Automatic error recovery
       - Opportunistic usage
       - Improving performance
     • Work is ongoing to provide geographically distributed high availability for the Xrootd redirector
