Data Curation at Large Experimental Facilities with Open Source Software


  1. Data Curation at Large Experimental Facilities with Open Source Software
     Line Pouchard, Pavol Juhas, Kerstin Kleese Van Dam (Computational Science Initiative)
     Stuart Campbell (National Synchrotron Light Source II)
     Brookhaven National Laboratory

  2. BNL is a Data-Driven Science Laboratory
     BNL supports data-rich experimental facilities:
     • Relativistic Heavy Ion Collider (RHIC)
     • National Synchrotron Light Source II (NSLS-II)
     • Center for Functional Nanomaterials (CFN)
     • Accelerator Test Facility (ATF)
     • Large Hadron Collider (LHC) ATLAS
     • Belle II: computing for neutrino experiment
     • Quantum chromodynamics (QCD) computing facilities for BNL, RIKEN, and US QCD communities

  3. NSLS-II: Best in class from far-IR to hard X-rays
     Cost and schedule
     • $912 M
     • 800 m in circumference
     • First light Oct 2014
     User Facility
     • Capacity for ~60 beamlines
     • Ultimately will host > 4000 users/yr
     • Proposal access: free if users intend to publish
     • Proprietary fee ($412/hr)

  4. Hard X-Ray Spectroscopy
     • 6-BM (BMM): Beamline for Materials Measurement
     • 7-ID-1 (SST-1): Spectroscopy Soft and Tender
     • 7-ID-2 (SST-2): Spectroscopy Soft and Tender
     • 7-BM (QAS): Quick X-ray Absorption and Scattering
     • 8-ID (ISS): Inner Shell Spectroscopy
     • 8-BM (TES): Tender X-ray Absorption Spectroscopy
     Imaging & Microscopy
     • 3-ID (HXN): Hard X-ray Nanoprobe
     • 4-BM (XFM): X-ray Fluorescence Microscopy
     • 5-ID (SRX): Sub-micron Resolution X-ray Spectroscopy
     • 18-ID (FXI): Full-Field X-ray Imaging
     Structural Biology
     • 16-ID (LIX): X-ray Scattering for Biology
     • 17-ID-1 (AMX): Highly Automated MX
     • 17-ID-2 (FMX): Frontier Microfocusing MX
     • 17-BM (XFP): X-ray Footprinting for Bio Macromolecules
     • 19-ID (NYX): Microdiffraction Beamline
     Soft X-Ray Scattering & Spectroscopy
     • 2-ID (SIX): Soft Inelastic X-ray Scattering
     • 21-ID (ESM): Photoemission-Microscopy Facility
     • 22-IR (FIS/MET): Magneto, Ellips, High-P Infrared
     • 23-ID-1 (CSX-1): Coherent Soft X-ray Scattering
     • 23-ID-2 (CSX-2): Soft X-ray Spectr & Polarization
     Complex Scattering
     • 10-ID (IXS): Inelastic X-ray Scattering
     • 11-ID (CHX): Coherent Hard X-ray Scattering
     • 11-BM (CMS): Complex Materials Scattering
     • 12-ID (SMI): Soft Matter Interfaces
     Diffraction & In Situ Scattering
     • 4-ID (ISR): In-Situ & Resonant X-Ray Studies
     • 27-ID (HEX): High Energy X-ray Diffraction
     • 28-ID-1 (PDF): X-Ray Atomic Pair Distribution Function
     • 28-ID-2 (XPD): X-Ray Powder Diffraction
     Legend
     • 26 Operating/Commissioning
     • 3 Under Development

  5. Key Challenge: Complexity
     • Each beamline has a diverse collection of equipment.
     • Added complexity from a wide range of different analysis workflows.
     • Need to coordinate collection and processing with multiple data sources.

  6. More Key Challenges at our Facilities: Velocity, Volume
     • Real-time analysis and steering of experiments:
       • CFN: 400 images/sec
       • NSLS-II: up to 5 TB/s in burst

  7. Diverse User Community
     • Synchrotrons have evolved from serving mostly expert users
     • Users spend a few 4-hour shifts at the facility
     • Users take data back to their home institution to analyze. Currently many users do not:
       • … have the hardware at home to analyze complex data
       • … have the analysis software to properly analyze the data
       • … have the skills to develop appropriate solutions.
     We need an approach that integrates data processing and data curation for experimental data at the facility.
     Curation tasks embedded in software make curation effective and easy to use.

  8. Data curation tasks at NSLS-II
     • Streaming data acquisition from proprietary detector formats
     • Storing experiment configuration and detector metadata
     • Transforming proprietary detector formats to open formats (ASCII) using a streaming event model
     • Extracting and maintaining sample metadata
     • Relating different data and metadata collections for each sample/experiment:
       • Experiment
       • Instrument
       • Sample
       • Raw, derived, analyzed images
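The transformation step on this slide can be illustrated with a minimal sketch. It assumes a simplified document stream of (name, document) pairs, loosely modelled on the start / descriptor / event / stop documents of a streaming event model; the field names, the `documents_to_ascii` helper, and the example values are hypothetical and are not the facility's actual converter.

```python
import io


def documents_to_ascii(documents, out):
    """Consume (name, doc) pairs one at a time and write a plain-text table.

    Hypothetical helper: nothing is buffered, so rows can be written
    while the experiment is still producing events.
    """
    columns = []
    for name, doc in documents:
        if name == "start":
            # Run-level metadata becomes commented header lines.
            for key in sorted(doc):
                out.write(f"# {key}: {doc[key]}\n")
        elif name == "descriptor":
            # The descriptor announces which data keys the events will carry.
            columns = sorted(doc["data_keys"])
            out.write("# columns: seq_num " + " ".join(columns) + "\n")
        elif name == "event":
            values = [str(doc["data"][c]) for c in columns]
            out.write(" ".join([str(doc["seq_num"])] + values) + "\n")
        elif name == "stop":
            out.write(f"# exit_status: {doc['exit_status']}\n")


def example_stream():
    """Invented example run with three events (illustrative values only)."""
    yield "start", {"uid": "run-0001", "beamline": "demo", "sample": "Ni"}
    yield "descriptor", {"data_keys": {"intensity": {}, "motor": {}}}
    for i in range(1, 4):
        yield "event", {"seq_num": i, "data": {"motor": i * 0.5, "intensity": 100 * i}}
    yield "stop", {"exit_status": "success"}


buffer = io.StringIO()
documents_to_ascii(example_stream(), buffer)
ascii_text = buffer.getvalue()
print(ascii_text)
```

Because the converter only reacts to one document at a time, the same function could in principle write to a file, a socket, or a live display without changes.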

  9. Sample metadata
     • Sample metadata collected through web forms when users apply for beam time
     • Elements: constituents, container for the sample
     • Embedded Crystallographic Information File (CIF) format
       • Standard since 1990
       • Provides CIF-URL for database of origin
       • Refers to CIF ID in that db
     • Reference structures in an external database are available during experiments
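One way to picture the sample record described on this slide is the sketch below. The class names, field names, and the placeholder CIF identifier are assumptions made for illustration; only the elements themselves (constituents, container, CIF-URL, CIF ID) come from the slide.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class CifReference:
    cif_url: str  # database of origin
    cif_id: str   # identifier of the structure within that database


@dataclass
class SampleMetadata:
    name: str
    constituents: List[str]
    container: str
    cif: Optional[CifReference] = None

    def missing_fields(self) -> List[str]:
        """List the slide's metadata elements that are still empty."""
        missing = []
        if not self.constituents:
            missing.append("constituents")
        if not self.container:
            missing.append("container")
        if self.cif is None:
            missing.append("cif")
        return missing


# Illustrative values only; the CIF ID is a placeholder, not a real entry.
sample = SampleMetadata(
    name="example powder",
    constituents=["Ni"],
    container="kapton capillary",
    cif=CifReference(
        cif_url="http://www.crystallography.net/cod",
        cif_id="PLACEHOLDER-ID",
    ),
)
record = asdict(sample)  # plain dict, ready to store next to scan metadata

incomplete = SampleMetadata(name="unlabelled", constituents=[], container="")
print(record)
print(incomplete.missing_fields())
```

Keeping only a URL and an ID, rather than copying the structure itself, matches the slide's point that the reference structure lives in an external database.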

  10. Sample Metadata

  11. Experimental Data Acquisition: Bluesky and Databroker
      This workflow runs independently at each of the beamlines.
      Bluesky interacts with proprietary detector software.
      Once acquired, metadata is stored in a Databroker database and data in a file system partition for each beamline.
      Used at all of the operational NSLS-II beamlines so far.
      Facilitates sharing code across beamlines and facilities.
      Supported by copious user-friendly documentation at https://nsls-ii.github.io
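The split between a metadata database and a per-beamline data partition can be sketched as follows. This is not the Bluesky or Databroker API: SQLite and a temporary directory stand in for the real database and file system, and `save_run`, `load_run`, the table layout, and the example values are all hypothetical.

```python
import json
import sqlite3
import tempfile
import uuid
from pathlib import Path


def save_run(db, data_root, beamline, metadata, payload):
    """Write bulk data to the beamline's partition and metadata to the database."""
    uid = str(uuid.uuid4())
    partition = Path(data_root) / beamline  # one directory per beamline
    partition.mkdir(parents=True, exist_ok=True)
    data_path = partition / f"{uid}.bin"
    data_path.write_bytes(payload)
    db.execute(
        "INSERT INTO runs (uid, beamline, metadata, data_path) VALUES (?, ?, ?, ?)",
        (uid, beamline, json.dumps(metadata), str(data_path)),
    )
    db.commit()
    return uid


def load_run(db, uid):
    """Look up metadata by uid, then follow the stored path to the data."""
    row = db.execute(
        "SELECT metadata, data_path FROM runs WHERE uid = ?", (uid,)
    ).fetchone()
    return json.loads(row[0]), Path(row[1]).read_bytes()


# Stand-ins for the real services: in-memory database, temporary directory.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE runs (uid TEXT PRIMARY KEY, beamline TEXT, metadata TEXT, data_path TEXT)"
)
data_root = tempfile.mkdtemp()

uid = save_run(db, data_root, "xpd", {"sample": "Ni", "plan": "count"}, b"\x00\x01\x02")
metadata, payload = load_run(db, uid)
print(metadata, len(payload))
```

The design point is that searches only touch the small metadata records, while the large detector files are read on demand through the stored path.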

  12. Provenance and Discovery Framework
      Promotes discovery over heterogeneous sources.
      Supports searches over multiple data sources to discover new relations.
      Presents integrated results to the NSLS-II user.
      https://github.com/NSLS-II/sciprovenance

  13. Integrated search across multiple data collections: NSLS-II and external data
      Full-text search engine: https://elastic.co
      1) iss: beamline 8-ID, Inner Shell Spectroscopy
         • 66k run-start documents (scan metadata)
      2) xpd: beamline 28-ID-2, X-Ray Powder Diffraction
         • 41k run-start documents (scan metadata)
      3) COD: subset of CIF fields from Crystallography Open Database, http://www.crystallography.net/cod
         • 390k CIF entries (Crystallographic Information File)
         • extracted fields: cell parameters, space group, mineral name, DOI
         • calculated fields: normalized chemical composition, density
      Note: presented collections are a subset of what exists in DataBroker
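The idea of one query returning hits from several collections can be shown with a toy in-memory inverted index. This is a stand-in for a real full-text search engine, not its API; the three records, their field names, and the `build_index` and `search` helpers are invented for illustration.

```python
def tokenize(text):
    """Lower-case whitespace tokenizer; real engines do much more."""
    return set(str(text).lower().split())


def build_index(collections):
    """Map every token to the (collection, position) pairs that contain it."""
    index = {}
    for name, records in collections.items():
        for position, record in enumerate(records):
            for value in record.values():
                for token in tokenize(value):
                    index.setdefault(token, set()).add((name, position))
    return index


def search(index, collections, query):
    """Return records matching all query tokens, grouped by collection."""
    hits = None
    for token in tokenize(query):
        matches = index.get(token, set())
        hits = matches if hits is None else hits & matches
    results = {}
    for name, position in sorted(hits or ()):
        results.setdefault(name, []).append(collections[name][position])
    return results


# Toy records; the real collections hold tens to hundreds of thousands of entries.
collections = {
    "iss": [{"uid": "iss-1", "sample": "copper oxide", "plan": "xas scan"}],
    "xpd": [{"uid": "xpd-1", "sample": "nickel powder", "plan": "count"}],
    "cod": [{"cod_id": "placeholder-1", "mineral": "nickel", "space_group": "F m -3 m"}],
}
index = build_index(collections)

nickel_hits = search(index, collections, "nickel")
print(nickel_hits)
```

Here a search for a material name returns both a beamline scan and an external reference structure, which is the kind of cross-collection relation the slide describes.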

  14. Data retention policies
      • In contrast to facilities elsewhere, not all US facilities have adopted data policies:
        • Historically, each lab is independent
        • Volume is a major obstacle
      • NSLS-II has kept all data since first light in 2014
      • Formal policies in discussion
      • Tier-based access at some facilities (4 months to 2 years)
        • Raw, derived, analyzed data

  15. Machine Learning introduces new challenges to curation
      • New criteria and/or Research Objects to trace:
        • Training sets, models, hyperparameters, calibrations
        • Volumes of training sets
      • Lack of precision and accuracy in metadata (quality):
        • Datasets with incomplete records are not, or should not be, used
        • Datasets with incorrect records introduce errors
      • Lack of adequate datasets in materials science despite abundance:
        • Not enough diversity of datasets in large dbs
        • Not enough diverse datasets to build robust training models
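A rough sketch of what tracing these research objects could look like: incomplete records are excluded, and the training set, model name, and hyperparameters are captured in one provenance record. The required-field list, function names, and example records are assumptions for illustration, not a facility standard.

```python
import hashlib
import json

# Hypothetical list of metadata fields a record must carry to be usable.
REQUIRED_FIELDS = ("sample", "beamline", "calibration")


def is_complete(record):
    """A record is usable only if every required field is present and non-empty."""
    return all(record.get(key) not in (None, "") for key in REQUIRED_FIELDS)


def fingerprint(records):
    """Stable SHA-256 digest of the training set, so it can be traced later."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def provenance_record(records, model_name, hyperparameters):
    """Bundle what was trained, on which data, and with which settings."""
    usable = [r for r in records if is_complete(r)]
    return {
        "model": model_name,
        "hyperparameters": hyperparameters,
        "training_set_size": len(usable),
        "rejected_incomplete": len(records) - len(usable),
        "training_set_sha256": fingerprint(usable),
    }


# Invented records: the second lacks a calibration, the third has no sample.
records = [
    {"sample": "Ni", "beamline": "xpd", "calibration": "cal-01"},
    {"sample": "Cu", "beamline": "xpd", "calibration": ""},
    {"beamline": "iss", "calibration": "cal-02"},
]
entry = provenance_record(records, "demo-classifier", {"learning_rate": 0.01, "epochs": 5})
print(json.dumps(entry, indent=2))
```

Hashing the filtered training set gives a compact identifier that can be stored with the model, so a later reader can check whether two models were trained on the same data.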

  16. Collaborations through Open Source Software
      • Loosely defined, could be project-based
      • More engaged example: the Australian Synchrotron is testing bluesky and evaluating whether to adopt it. They sent a couple of developers to spend a week with us in the fall
      • Adoption and training:
        • Annual hackathons, NSLS-II User meetings, NOBUGS 2018 at BNL
        • RDA PaNSIG: Photon and Neutron Group
        • Upcoming at 13th Plenary Meeting, Philadelphia

  17. Interest from other facilities
      • LCLS @SLAC will be adopting bluesky
      • Test installations at APS
        • Will be used for APS-U
      • Australian Synchrotron testing and evaluating bluesky
      • Successful test at SNS (HYSPEC)
      • Interest has been shown by
        • European Spallation Source
        • Swiss Light Source
      • Encouraging further collaborations

  18. Lessons Learned
      • Curation tasks should be embedded in scientists’ routines
      • Provenance tracking supports reproducibility
      • Policies are a work in progress at the facilities
      • ML introduces new challenges in data curation
      • Collaborations around Open Source facilitate software re-use

  19. Team members
      L. Pouchard, K. Kleese Van Dam, P. Juhas
      Data Acquisition, Management, and Analysis Group: S. Campbell
      Roles (figure labels): Computational Scientists, Experimentalists, Data Scientists
      Acknowledgements
      The authors gratefully acknowledge the funding support from the U.S. Department of Energy Office of Science/Office of Advanced Scientific Computing Research. This research used resources of the National Synchrotron Light Source II, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Brookhaven National Laboratory. This manuscript has been authored by employees of Brookhaven Science Associates, LLC operated under Contract No. DE-SC0012704.

  20. Questions?
