Distributed Data Management in OSG
OSG All Hands Meeting - UofU
March 20, 2018
Benedikt Riedel, Rob Gardner, Judith Stephen
University of Chicago
Overview
● Problem Statement
● Sample Scenario
● Rucio
● Why not Globus
● Evaluation Instances
● XENON1T
● Looking Ahead
● Summary
Problem Statement
● OSG is extremely good at providing compute resources
● (Distributed) storage is a complex problem:
  ○ Limited storage (compared to compute) available - Stash, BYOS, institutional storage, etc.
  ○ HEP-specific transfer methods (GridFTP, XRootD, SRM, WebDAV, etc.) are not supported everywhere
  ○ There is no Condor for storage
  ○ Hurdles for users - grid certificates, VO membership, etc.
  ○ Wide variety of storage architectures - dCache, Ceph, Gluster, GPFS, Lustre
  ○ For the most part: no POSIX! - scares users
  ○ Writeable StashCache will solve some of these
● How do we create, for storage, what OSG is for compute?
Sample Scenario
● Experiment A has storage allocations on:
  ○ Institutional cluster(s), an NSF supercomputer (scratch space or dedicated), and NERSC for archiving
● How do we tie all these allocations together?
● How do we automatically move data between sites?
● How do we automatically move data to an archive?
● Solution: Rucio! - a minimal sketch follows below
● We will discuss Globus later
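A minimal sketch of how this scenario could be expressed with the Rucio Python client, assuming a working Rucio server and a configured client. The RSE names, scope, and dataset name are invented for illustration and are not taken from the slides.

```python
# Sketch only: register the scenario's allocations as Rucio Storage Elements
# (RSEs) and pin an archive copy at NERSC. All names here are hypothetical;
# assumes a configured rucio.cfg and an account with the required permissions.
from rucio.client import Client

client = Client()

# One RSE per storage allocation the experiment holds.
for rse in ["CAMPUS_CLUSTER", "NSF_HPC_SCRATCH", "NERSC_ARCHIVE"]:
    client.add_rse(rse)

# Declare that one copy of a dataset must live on the archive RSE;
# Rucio (via FTS) schedules whatever transfers are needed to satisfy the rule.
client.add_replication_rule(
    dids=[{"scope": "experiment_a", "name": "raw_run_001"}],
    copies=1,
    rse_expression="NERSC_ARCHIVE",
)
```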
Why Rucio?
● Provides a single namespace for data, independent of location - see the sketch below
● Automated replication of data through a subscription model
  ○ Currently only FTS is supported as the transfer service - there is a public CERN instance and an OSG instance
  ○ Globus support is planned
● Several APIs - REST, Python, CLI
● Per-user ACLs
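A small sketch of the single namespace from the Python API: data is addressed by a logical scope:name, and Rucio resolves it to physical replicas wherever they live. The scope and file name are hypothetical; the CLI offers an equivalent (roughly `rucio list-file-replicas experiment_a:raw_run_001`).

```python
# Sketch only: resolve a logical name (scope:name) to its physical replicas.
# The scope and file name are hypothetical.
from rucio.client import Client

client = Client()

# Users never deal with site-specific paths; they ask for scope:name and
# Rucio reports which RSEs currently hold a copy.
for replica in client.list_replicas([{"scope": "experiment_a", "name": "raw_run_001"}]):
    print(replica["name"], sorted(replica.get("rses", {})))
```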
Rucio
● Data management software created by the ATLAS experiment at the LHC; used by XENON1T, AMS, and ATLAS
● Automated replication of data through a "subscription" model, i.e. a site is "subscribed" to a certain data set (see the sketch below)
● Built with the future in mind, i.e. scalable database infrastructure, support for common data transfer methods (GridFTP, SRM, XRootD, S3, etc.), monitoring through ELK, etc.
● Being tested by CMS, LIGO, and IceCube
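To make the "subscription" idea concrete, here is a hedged sketch: files are grouped into a dataset, and a rule states how many copies must exist across RSEs matching an expression; Rucio's subscription machinery applies rules of this form automatically to newly created data matching a filter. The scope, names, and the `country=US` RSE attribute are assumptions, not taken from the slides.

```python
# Sketch only: group files into a dataset and "subscribe" sites to it by
# requiring two copies across RSEs that match an expression.
from rucio.client import Client

client = Client()

# Create a dataset in the experiment's scope and attach a file to it.
client.add_dataset(scope="experiment_a", name="calib_2018_03")
client.attach_dids(
    scope="experiment_a",
    name="calib_2018_03",
    dids=[{"scope": "experiment_a", "name": "calib_file_000.root"}],
)

# Keep two copies anywhere the (hypothetical) country=US attribute matches.
client.add_replication_rule(
    dids=[{"scope": "experiment_a", "name": "calib_2018_03"}],
    copies=2,
    rse_expression="country=US",
)
```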
Some Issues
● Most campus clusters only run a Globus endpoint
● Distributed file systems (mainly GPFS, Lustre, and CephFS) have issues with a high number of data transfers - "Please stop"
● Getting users accustomed to the lack of POSIX
Why not Globus?
● Globus went closed-source and is no longer a GridFTP server under the hood - Buh!
● Requires a subscription to fully automate transfers
● Only useful for inter-site transfers, not for jobs
  ○ Globus requires endpoints at each end of the transfer
  ○ Endpoints cannot be automatically generated
● Does not work with multiple protocols without a subscription
Evaluation Instances

Experiment | Rucio Instance                  | DB Location        | DB Type    | Support
CMS        | rucio-cms.grid.uchicago.edu     | UChicago OpenStack | PostgreSQL | UNL, FNAL, UChicago
IceCube    | rucio-icecube.grid.uchicago.edu | UChicago OpenStack | PostgreSQL | UCSD, UNL, UW-Madison, UChicago
LIGO       | rucio-ligo.grid.uchicago.edu    | UChicago OpenStack | PostgreSQL | Georgia Tech, UNL, UChicago
LSST       | rucio-lsst.grid.uchicago.edu    | UChicago OpenStack | PostgreSQL | NCSA, UChicago
FIFE       | rucio-fife.grid.uchicago.edu    | UChicago OpenStack | PostgreSQL | UChicago, FNAL
XENON1T Storage and Processing Challenge
● Storage allocated at European Grid Infrastructure (EGI) and Open Science Grid (OSG) sites - not enough storage at any one site for all the data
● Computing and storage on OSG and EGI sites through a single interface for each
● Could not use Globus Online to automate transfers to/from EGI sites
● How to manage the data?
XENON1T Infrastructure [infrastructure diagram]
Data Movement and Processing
● EGI storage is selected at random to spread out data during large processing campaigns (see the sketch below)
● Jobs are submitted from a single node at UChicago
● If a job lands at an EGI site, it pulls data from EGI storage; the same holds for OSG
● Easily expandable storage and compute pool
● Data movement is automated with Rucio and FTS
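A rough sketch of the "random EGI storage" placement described above, expressed with the Rucio Python client. XENON1T's production workflow drives this through Rucio rules and FTS, so treat the `cloud=EGI` attribute, scope, and dataset name as illustrative assumptions rather than the actual configuration.

```python
# Sketch only: pick an EGI RSE at random and place a processing dataset there.
import random

from rucio.client import Client

client = Client()

# Candidate EGI storage elements, selected via a (hypothetical) RSE attribute.
egi_rses = [entry["rse"] for entry in client.list_rses(rse_expression="cloud=EGI")]
target = random.choice(egi_rses)

# One copy on the randomly chosen RSE; Rucio and FTS carry out the transfer.
client.add_replication_rule(
    dids=[{"scope": "xenon1t", "name": "processed_pax_v6.8.0"}],
    copies=1,
    rse_expression=target,
)
```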
XENON1T Experience
● The first six months were tough:
  ○ Rucio had a lot of ATLAS conventions baked in - worked with the developers to make things more flexible
  ○ Getting the collaboration used to Rucio conventions, OSG/EGI conventions, grid certificates, etc.
  ○ Software differences - Python 2 vs. 3
● After the first hurdles, very positive results:
  ○ Rucio is essential to the XENON1T data management and processing workflow
  ○ Rucio is being adopted for the next-generation experiment (XENONnT) - 2 to 3x more data than XENON1T
Status Today
● OSG has had two blueprint meetings on:
  ○ How to leverage Rucio
  ○ How mid-sized VOs on OSG could use Rucio - LIGO, IceCube, CMS, FIFE
● 1st Rucio Community Workshop at CERN:
  ○ Heard from a number of experiments (CTA, SKA, etc.) about their data management challenges
  ○ Lots of input from the developers - check out their slides for a very good overview
Rucio - Looking Ahead
● More improvements
  ○ Better multi-VO support
  ○ Looking into tiered Rucio
  ○ More authentication methods - SciTokens
  ○ Globus support
● More testing of PostgreSQL in production needed - MariaDB holding up for XENON1T
● Future workshops
● Looking for more adopters!
HL-LHC Challenges
● HL-LHC will bring a new set of challenges - do more with the same or less
● How do we store data?
  ○ Do we store the data in the same format as we use for processing?
  ○ How can we use object stores or key-value stores?
● What about the cloud?
● How do we incorporate HPC centers better?
Data Lakes
● Data Lake - “A single storage repository of raw or lightly processed data collections from which one can derive higher-level data sets”
● How can we orchestrate a Data Lake for HEP?
  ○ Lots of different storage architectures, not a single one (e.g. an object store)
● How do we serve data to compute sites? - Still GridFTP and XRootD? HTTP?
● Where do we cache data?
Summary
● Distributed storage is a complex problem
● Rucio is a candidate to solve some of our distributed data management problems
  ○ Several experiments are evaluating Rucio
  ○ A couple of experiments already run it in production
● Looking ahead to the HL-LHC - we need to take a hard look at storage and see how we can leverage technological trends to our advantage