Data on OSG
Frank Würthwein
OSG Executive Director
Professor of Physics, UCSD/SDSC
The purpose of this presentation is to give the Council a summary of where we stand in supporting data on OSG.
High Level Messages
• First line of defense is HTCondor file transfer (see the sketch below).
• If that’s not sufficient for the needed scale:
  • Pull/put data to/from the job via GridFTP and/or xrdcp.
  • We offer data hosting in some cases.
  • Use caching if the same input data is reused often.
  • We can support “reasonable” privacy of data, but not HIPAA or FISMA.
• If data movement needs to be managed across multiple global locations, independent of jobs:
  • We helped Xenon1T adopt Rucio, the ATLAS data management solution.
  • We expect to help others evaluate this as a possible solution in the future.
  • First potential customers for an evaluation are LSST and LIGO.
The services provided by OSG in support of data on OSG vary with the size of the needs of the communities we work with.
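To make the "first line of defense" concrete, here is a minimal sketch of a job that relies solely on HTCondor's built-in file transfer, with no GridFTP or xrdcp involved. The executable, file names, and resource requests are hypothetical placeholders, not values taken from these slides.

```python
# Minimal sketch: an HTCondor submit description that uses only built-in
# HTCondor file transfer. All names below are hypothetical placeholders.
submit_description = """
executable              = run_analysis.sh
arguments               = input.dat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
transfer_output_files   = output.dat
request_cpus            = 1
request_memory          = 2GB
output                  = job.out
error                   = job.err
log                     = job.log
queue
"""

# Write the description to disk; it would then be submitted with `condor_submit job.sub`.
with open("job.sub", "w") as f:
    f.write(submit_description)
```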
Benchmarking HTCondor File Transfer
Initiated by GlueX and Jefferson Lab, who wanted to know whether a single submit host at JLab could support GlueX operational needs. The concern was primarily the I/O in and out of the system. OSG ran the test on our system, then provided instructions for deployment at JLab, then repeated the test on their system and helped debug until the expected performance was achieved.
GlueX Requirements

Parameter      GlueX Spec    OSG Test
Running jobs   20,000        4,000
Output Size    10-100 MB     250 MB
Input Size     1-10 MB       1-10 MB
Job Runtime    8 h - 9 h     0.5 h

The GlueX specs translate into about 55.5 MB/s of aggregate output and a transaction rate of order 1 Hz:

  IO(nJobs, size, length) = nJobs × size / length = 20,000 × 90 MB / (9 × 3600 s) ≈ 55.5 MB/s

We tested 10× larger I/O and 3× more transactions per second.
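A quick back-of-the-envelope check of that translation, assuming the representative output size of 90 MB and the 9-hour runtime used in the formula above:

```python
# Back-of-the-envelope check of the GlueX throughput and transaction-rate estimate.
n_jobs = 20_000          # concurrently running jobs (GlueX spec)
output_size_mb = 90      # representative output size in MB (within the 10-100 MB range)
runtime_s = 9 * 3600     # job runtime in seconds (upper end of 8-9 h)

throughput_mb_s = n_jobs * output_size_mb / runtime_s    # aggregate output rate
transaction_rate_hz = n_jobs / runtime_s                 # job completions (transfers) per second

print(f"Aggregate output: {throughput_mb_s:.1f} MB/s")   # ~55.6 MB/s
print(f"Transaction rate: {transaction_rate_hz:.2f} Hz") # ~0.6 Hz, i.e. order 1 Hz
```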
Benchmarking Result
Smooth operations at the scale tested.
• Lessons learned:
  • Avoid significantly exceeding half of the 10 Gbps network bandwidth on the submit host.
  • Be careful with TCP/IP settings to avoid latencies in schedd communications with far-away worker nodes (see the illustrative snippet below).
[Plot: submit-host network throughput during the test, with the 10 Gbit interface limit marked]
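The slide does not say which TCP/IP settings were involved. As a hedged illustration only, the snippet below reads a few Linux kernel parameters that are commonly reviewed for high-latency, high-bandwidth links (socket buffer limits and congestion control); the parameter list is an assumption, not a record of what was tuned for GlueX.

```python
# Illustrative only: print Linux TCP settings commonly reviewed when a schedd
# talks to far-away worker nodes over long, fat networks. Which knobs (if any)
# were actually changed for the GlueX benchmark is not stated on the slide.
from pathlib import Path

SYSCTLS = [
    "net/core/rmem_max",               # max receive socket buffer
    "net/core/wmem_max",               # max send socket buffer
    "net/ipv4/tcp_rmem",               # min/default/max TCP receive buffer
    "net/ipv4/tcp_wmem",               # min/default/max TCP send buffer
    "net/ipv4/tcp_congestion_control", # congestion control algorithm in use
]

for name in SYSCTLS:
    path = Path("/proc/sys") / name
    value = path.read_text().strip() if path.exists() else "n/a"
    print(f"{name.replace('/', '.')}: {value}")
```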
Put and Get at 100 Gbps
OSG offers installation instructions for deploying a cluster of GridFTP or XRootD (xrdcp) hosts, each connected at 10 Gbps and seen by clients as a single service via Linux Virtual Server. This is the OSG strategy for replacing SRM. It is also what LIGO used for its first gravitational-wave detection work on OSG.
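For illustration, a job-side stage-out along these lines might shell out to xrdcp, with globus-url-copy as a GridFTP fallback. The hostname and paths below are hypothetical placeholders, not actual OSG service names.

```python
# Hedged sketch: stage a job's output file to a load-balanced transfer endpoint
# from inside a job, trying XRootD (xrdcp) first and GridFTP (globus-url-copy)
# as a fallback. "transfer.example.org" and the paths are placeholders.
import subprocess
from pathlib import Path

local_file = Path("output.dat").resolve()
xrootd_dest = "root://transfer.example.org//store/user/someuser/output.dat"
gridftp_dest = "gsiftp://transfer.example.org/store/user/someuser/output.dat"

try:
    subprocess.run(["xrdcp", str(local_file), xrootd_dest], check=True)
except (subprocess.CalledProcessError, FileNotFoundError):
    # Fall back to GridFTP if the XRootD copy is unavailable or fails.
    subprocess.run(["globus-url-copy", f"file://{local_file}", gridftp_dest], check=True)
```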
Aside on Reducing Complexity
• We are working to reduce the complexity we support for the LHC, in order to sustain it with less effort in the future.
• E.g. SRM:
  • In OSG 3.2 there were 4 SRM clients.
  • In OSG 3.4 there are none.
• E.g. X509:
  • We are working on eliminating the need for X509 in OSG.
  • More on that later.
Caching via StashCache
• OSG now operates its own data federation.
• We support federations inside ours that maintain privacy from each other.
• We also support people who build their own.
  • The advantage of living inside OSG is access to the deployed StashCache infrastructure.
  • If you roll your own, you are on your own.
OSG Data Federation
Applications connect to a regional cache transparently. The regional cache asks the redirector for the location of the file, the redirector redirects to the relevant origin, and the file gets cached in the regional cache.
Caches at: BNL, FZU, UNL, Syracuse, UChicago, UCSD/SDSC, UIUC.
[Diagram: an XRootD OSG redirector in front of per-community XRootD origins (each backed by XRootD data servers) and XRootD regional caches — one data origin per community, multiple caches across the US]
This is a technology transfer from the LHC with some OSG value added.
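To make the "transparent" access concrete: from a worker node, reading a federated file typically amounts to one copy command, e.g. via the stashcp tool referenced on the dashboard slides below. The federation path in this sketch is a hypothetical placeholder.

```python
# Hedged sketch: fetch a file through the OSG data federation from a worker node.
# The /user/... path is a placeholder; stashcp picks a nearby cache, which pulls
# the file from the origin if it is not cached yet.
import subprocess

federation_path = "/user/someuser/public/reference_data.tar.gz"  # hypothetical
subprocess.run(["stashcp", federation_path, "./reference_data.tar.gz"], check=True)
```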
Communities using StashCache
• OSG-Connect
  • See the next slide for examples.
• LIGO
• Nova
• And some expressions of interest:
  • Xenon1T expects future use in front of Comet@SDSC, and potentially elsewhere.
  • GlueX has shown initial interest, not yet concrete.
Big Data beyond Big Science
OSG caching infrastructure used at up to ~10 TB/hour for meta- or exo-genetics.
StashCP Dashboard Info — Last 3 Months
Dashboards hosted at a Kibana instance at MWT2.
StashCache Instances View — 10/1 0:00 to 10/2 19:00
Details on data in/out, connections, errors, timeouts, retries, etc. are monitored for each cache.
Rucio and its use in Xenon1T
• Xenon1T needed something to manage its transfers between the experiment DAQ in Italy and various disk locations in the EU, Israel, and the US (see the sketch below for what driving Rucio looks like in practice).
• Xenon1T adopted Rucio for this after a joint evaluation with OSG.
• Since then, LSST and LIGO have expressed interest in a similar evaluation.
• Next steps:
  • A two-pager to define metrics for an evaluation project with LSST.
  • An OSG Blueprint to better understand the technical concept underlying Rucio.
[Diagram: Xenon1T data flow — raw data from the DAQ is transferred via Rucio, with a Rucio server at Chicago, to storage at Midway/RCC (92 TB) and Stash/Login (300 TB) in Chicago, NIKHEF Amsterdam (200 TB), IN2P3 Lyon (50 TB), and Weizmann Israel (50 TB, not yet used), plus tape backup]
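As a hedged sketch of what "managing transfers with Rucio" looks like in practice, the commands below use the standard Rucio client to register a file and request a replica at a second site. The scope, file, and RSE names are invented placeholders, not Xenon1T's actual configuration.

```python
# Hedged sketch: register data with Rucio and ask it to maintain a replica at a
# second site. Scope, file, and RSE (storage element) names are placeholders.
import subprocess

def rucio(*args):
    """Run a rucio CLI command and fail loudly if it errors."""
    subprocess.run(["rucio", *args], check=True)

# Upload a raw-data file to the local RSE and register it under a scope.
rucio("upload", "--rse", "SITE_A_DISK",
      "--scope", "user.someuser",
      "raw_run_000123.zip")

# Ask Rucio to keep one replica at a remote RSE; Rucio's daemons then schedule
# and monitor the transfer independently of any jobs.
rucio("add-rule", "user.someuser:raw_run_000123.zip", "1", "SITE_B_DISK")
```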
A Future without X509
We have already eliminated X509 for user job submission in OSG. The two remaining use cases are:
• Pilots being authenticated at CEs.
• Users staging out data to storage endpoints from jobs.
Problem Statement
• At present, storage endpoints authenticate users:
  • An X509 certificate is delegated to the job so that the job can stage out data to a storage endpoint from the worker node.
• In the future, we want storage tokens that encode capability rather than personhood (see the sketch below):
  • "You are allowed to store data in your directory at the OSG VO's storage endpoint(s)."
• We are working with the NSF-funded SciTokens project to accomplish this.
  • https://scitokens.org
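To illustrate "capability rather than personhood", the snippet below inspects the claims of a bearer token (a JWT). The claim names and values are illustrative of the SciTokens idea — authorization expressed as a path-scoped capability — and are an assumption, not text from the SciTokens specification.

```python
# Hedged sketch: inspect the claims of a capability-style bearer token (a JWT).
# Claim names/values are illustrative, not taken from the SciTokens spec.
import jwt  # PyJWT: pip install pyjwt

def describe_token(token: str) -> None:
    # Signature verification is skipped because this is only an inspection
    # sketch; a real storage endpoint must verify the signature against the
    # issuer's public key before honoring any capability.
    claims = jwt.decode(token, options={"verify_signature": False})
    print("issuer:    ", claims.get("iss"))    # who minted the token (e.g. the VO)
    print("audience:  ", claims.get("aud"))    # which service should accept it
    print("capability:", claims.get("scope"))  # e.g. "write:/osg-connect/user/someuser"
    print("expires:   ", claims.get("exp"))
```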
Initial “Demo”
• The initial demo showed:
  • The OSG-Connect HTCondor submit host transparently generates a SciToken.
    • Users are oblivious to the existence of such tokens.
  • User jobs put files from the worker node into a user-owned directory at the Stash endpoint using the HTTPS protocol (see the sketch below).
  • Stash is the origin of StashCache, implemented as an XRootD server => data staged out this way can be used for subsequent processing via StashCache.
• Based on the OAuth2 framework
  • The same as when you authorize a 3rd-party website to use your Facebook/Google/DropBox login.
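As a hedged illustration of the demo's stage-out step, the snippet below performs an HTTPS PUT with the token presented as an OAuth2 bearer credential. The endpoint URL, path, and environment variable are placeholders, not the demo's actual configuration.

```python
# Hedged sketch: stage a file out over HTTPS using a bearer token, in the spirit
# of the demo above. URL, path, and token location are assumed placeholders.
import os
import requests

token = os.environ["SCITOKEN"]  # assume the submit host put the token in the job environment
destination = "https://stash.example.org/osg-connect/user/someuser/results/output.dat"

with open("output.dat", "rb") as payload:
    response = requests.put(
        destination,
        data=payload,
        headers={"Authorization": f"Bearer {token}"},  # OAuth2-style bearer credential
        timeout=300,
    )
response.raise_for_status()
print("stage-out complete:", response.status_code)
```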
Status of SciTokens
• The initial hackathon led to the initial demo, and thereby to an understanding of the viability of the basic concept.
• A draft design write-up exists for evaluation by the technical director and for broader discussion.
  • It needs a bit more work before it is ready for sharing.
Summary & Conclusion • OSG made a lot of progress in supporting data on OSG broadly for anybody. • We build on technologies that have broad community support and/or are NSF funded projects. • We reduced, and will continue to reduce the complexity of the software stack required to use data on OSG. • There’s some i’s to dot and t’s to cross, but decent functionality now exists, and the geek gap between Big Science and the rest of scientific endeavors has shrunk, and continues shrinking. 20 October 3rd, 2017