
Data Management: report & news. PaNDaaS WG 2nd meeting @ESRF


  1. Data Management: report & news. PaNDaaS WG 2nd meeting @ESRF. Jean-François Perrin (ILL), 12th of Dec 2016. Institut Max von Laue - Paul Langevin

  2. Experimental data management. Some results: Dec 2012 – Dec 2016. Co-funded by the European Union: PaNData-Europe (Grant Agreement No 261537) and PaNData-ODI (Grant Agreement No RI-283556).

  3. What has been done so far?
      • 2008: 1st discussion on Data Policy (PaNData)
      • 2011: “Open” Data Policy (DP) published - 3 (max 5) years embargo
      • 2012: 1st experiment under DP
      • 2013: complete set of Data Management Services available for users: search, access, annotate, archive, identify, publish, …
      • Since then: communication with our users …

  4. Data Policy revisited. Based on the PaNData framework. Open data & how to protect and credit our users?
      • The facility shall act as a custodian for the data.
      • All raw data will be curated in a well-defined format with a unique ID (DOI).
      • Metadata is captured automatically and resides within the raw data files and/or in an associated on-line catalogue.
      • Users can release or give access to their data at any time. By default, access to raw data, the associated metadata and the analysis data is restricted to the experimental team for a period of 3 years. During the next 2 years, data are available on request. Thereafter, they become publicly accessible.
      • The embargo period can be extended on request to the directorate.
      • Publications based on data must acknowledge the source of the data and cite its unique identifier (CC-BY licence).
      • The policy also applies to CRG beam time when it uses the ILL data infrastructure.
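The embargo rule on this slide (3 years restricted, 2 further years on request, then public) can be sketched as a small function. This is an illustrative sketch only: the slide does not say from which date the embargo is counted, so the code assumes the end date of the experiment, and the names `access_status` and `add_years` are hypothetical. Embargo extensions granted on request are not modelled.

```python
from datetime import date

RESTRICTED_YEARS = 3  # from the slide: access limited to the experimental team
ON_REQUEST_YEARS = 2  # from the slide: data available on request afterwards


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def access_status(experiment_end: date, today: date, released_early: bool = False) -> str:
    """Classify a data set as 'restricted', 'on_request' or 'public'.

    Assumption: the embargo clock starts at the end of the experiment.
    """
    if released_early:
        # Users may release their data at any time; the release is non-reversible.
        return "public"
    if today < add_years(experiment_end, RESTRICTED_YEARS):
        return "restricted"
    if today < add_years(experiment_end, RESTRICTED_YEARS + ON_REQUEST_YEARS):
        return "on_request"
    return "public"


print(access_status(date(2012, 12, 1), date(2016, 12, 12)))
```

Under these assumptions, an experiment that ended in Dec 2012 would be in the "on request" window in Dec 2016.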

  5. Data portal
      • Provide access to data, metadata, logs, DOI landing pages, …
      • Scientists can contact the experimental team
      • Tools for managing data authorization (authZ)
      • Grant individual access
      • Tailored to ILL needs
      • Release data at any time (non-reversible)
        – User management of data access authorization.
        – Users can decide to publish (open access) their data before the end of the embargo period.
        – Linked to DOIs.
        – Linked to experimental logs.
        – Linked to user annotation tool.
        – Linked with proposal system.
        – Download of data.
        – Full-text search: index all available information (proposal, experimental report, data file annotation, publications, …)

  6. Data Portal results
      • 3 data sets publicly released before the end of the embargo
      • 26 access grants to external scientists (peer review)
      • 0 requests for access to data sets under embargo (at least through the portal)
      • 760,651 data files downloaded (90% by external users), concerning 376 unique data sets

  7. DOIs
      • Linking data and people through ORCID/ResearcherID
      • Collaboration with DataCite/INIST
      • Linking data with publications

  8. DOI communications: we ask our users to cite data sets in the reference section of their articles.
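As an illustration of what such a reference-list entry could look like, the sketch below assembles a data citation string following the pattern of the ILL example shown on slide 16 of this deck. The function name `format_data_citation` and its parameters are hypothetical, and the layout is copied from that one example rather than from any official ILL or DataCite specification.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a reference-list entry for a data set.

    Layout assumed from the single example on slide 16:
    "A; B and C. (year). Title. Publisher doi:..."
    """
    if len(creators) > 1:
        names = "; ".join(creators[:-1]) + " and " + creators[-1]
    else:
        names = creators[0]
    return f"{names}. ({year}). {title}. {publisher} doi:{doi}"


citation = format_data_citation(
    ["WITHERS Philip J.", "ISHIGAMI Atsushi", "PIRLING Thilo", "ROY Matthew", "WALSH Joanna"],
    2014,
    "The effect of weld bead shape on residual stress in novel low heat input welding of steel",
    "Institut Laue-Langevin (ILL)",
    "10.5291/ILL-DATA.1-02-145",
)
print(citation)
```

Placing this string in the reference section, rather than in a figure or in running text, is what makes the citation findable by indexing services.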

  9. Issue #1: Awareness of the scientists
      • This is still new for most of the scientists: “What are DOIs? What are you talking about?”
      • We (ILL/ISIS) currently feel a bit alone – need to reach critical mass (ESRF, PSI, ESS … are joining).
      • We need more communication – mentoring – cultural change – education.
      Need to bridge the gap between what we hear in RDA-like meetings and the daily reality of the scientists. Still need to convince scientists that a change is happening regarding experimental data.

  10. Issue #2: Difficulty collecting the articles that exploit the experimental data
      • Technical reasons: DOIs in figures instead of references, partial citations, …
      • No tools yet available to easily collect references
        – CrossRef Cited-by linking (currently only for article, not data, publishers?), OpenAIRE.
        – This is a business for the publishers.
      • Difficulties in getting metrics: how successful are we?
        – As of Dec 2016, we have collected fewer than 50 peer-reviewed articles referencing a data DOI.
        – How many are we missing? We need free access to information for building metrics.
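To make the collection problem concrete, here is a minimal sketch of scanning the text of an article for ILL data DOIs. It assumes that the DOIs follow the `10.5291/ILL-DATA.` form seen in the example on slide 16; the function name `find_ill_data_dois` is hypothetical. A scan like this only works on extracted text, so a DOI that appears inside an image remains invisible to it.

```python
import re

# Assumed pattern, based on the single example DOI in this deck:
# 10.5291/ILL-DATA.1-02-145. The last character must be alphanumeric so that
# sentence punctuation following the DOI is not captured.
ILL_DATA_DOI = re.compile(
    r"10\.5291/ILL-DATA\.[A-Z0-9][A-Z0-9.\-]*[A-Z0-9]", re.IGNORECASE
)


def find_ill_data_dois(text: str) -> list[str]:
    """Return the distinct ILL data DOIs found in `text`, upper-cased and sorted."""
    return sorted({m.group(0).upper() for m in ILL_DATA_DOI.finditer(text)})


sample = (
    "Data are available at doi:10.5291/ILL-DATA.1-02-145. "
    "See also https://doi.org/10.5291/ill-data.1-02-145 and 10.1000/xyz."
)
print(find_ill_data_dois(sample))
```

DOIs are compared case-insensitively here, which is why both spellings in the sample collapse to one entry.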

  11. Text not in the reference section. Not easily findable through most search services (WoS, Scopus, …). Only findable through Google Scholar.

  12. Cited in an image instead of … Not findable at all.

  13. Data DOI vs article DOI: should be the DOI of the article instead of that of the data.

  14. Issue #3: time
      - Time for understanding data & analyses
      - Time for writing articles
      - Time for publishing
      - On our side: time for explaining & convincing
      This is by nature a long process but, given the level of investment needed, we need to convince, and we urgently need evidence of success.

  15. Results as of Dec 2016
      [Figure: % of ILL users' publications citing the data sets, per year, 2012–2016; vertical axis from 0 to 7 %.]
      • The referencing of data sets through DOIs in scientific articles has recently been improving.
      • Real interest from the publishers
      • More user feedback: “Why don’t I get a DOI for experiment XYZ?”
      Scientists' name disambiguation:
      • 378 scientist “publication names”
      • 184 ORCID
      • 141 ResearcherID

  16. One more issue: other repositories in the middle.
      Cite as: M. J. Roy. (2016). Contour method and neutron diffraction dataset to determine the weld fusion zone shape on residual stress in submerged arc welding [Data set]. Zenodo.
      Instead of: WITHERS Philip J.; ISHIGAMI Atsushi; PIRLING Thilo; ROY Matthew and WALSH Joanna. (2014). The effect of weld bead shape on residual stress in novel low heat input welding of steel. Institut Laue-Langevin (ILL) doi:10.5291/ILL-DATA.1-02-145
      Licence?

  17. Data Analysis As a Service.

  18. Data volume evolution
      [Figure: volume of experimental data per cycle, in TB (vertical axis from 0 to 60), from 2000 to 2016-2017; series: Raw (TB), Processed (TB), Forecast (TB).]
      Annotations on the plot: evaluation of new detectors leading to permanent instruments starting from Dec 2016; moving to list mode (vs histogram mode).

  19. Impacts of the data volume evolution. Example of the EXILL campaign
      • Storage (2 experiments = 70 TB)
        – ILL archive capacity & performance
        – Storage on the users' side is becoming almost impossible
      • Moving data
        – Today, how do we carry 40 TB to 10 different labs?
        – Why carry them?
      • Analysis
        – Almost impossible in most users’ labs with such data sets.
      • But
        – 32 direct (h-index 4) peer-reviewed articles published
        – 2 PhD theses
        – 10+ international conferences
        – …
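A back-of-the-envelope calculation shows why moving 40 TB is a problem. The sketch below estimates the transfer time over a network link; the 1 Gbit/s bandwidth is an assumption chosen for illustration (the slides give no network figures), decimal terabytes are used, and the function name `transfer_days` is hypothetical.

```python
def transfer_days(volume_tb: float, link_gbit_s: float, efficiency: float = 1.0) -> float:
    """Estimate the days needed to move `volume_tb` terabytes over one link.

    `efficiency` is the fraction of the nominal bandwidth actually achieved
    (1.0 is the ideal case; real transfers are usually slower).
    """
    bits = volume_tb * 1e12 * 8                     # decimal TB -> bits
    seconds = bits / (link_gbit_s * 1e9 * efficiency)
    return seconds / 86400                          # seconds per day


# 40 TB over an assumed, fully used 1 Gbit/s link: roughly 3.7 days per lab.
print(round(transfer_days(40, 1.0), 1))
```

Even in this ideal case, serving 10 labs one after the other over the same link would take more than a month, which supports the question of why the data should be moved at all.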

  20. Data analysis as a Service
      • The aim is to offer users remote access to analysis services (data, software, IT capacity and expertise) using standard tools (ideally only a web browser).
      • Typical workflow:
        1) The user connects remotely using a web browser and their credentials (federated identity management).
        2) The user then selects one of the experiments they have performed from the list.
        3) The user is then connected to a service where the necessary analysis applications have been installed and configured for direct access to the experimental data.
        4) If necessary, the user can receive help and support from a facility expert during the analysis.
        5) Analysis data are published.
      As of Dec 2016:
      • OpenStack testbeds
      • Evaluation of the management APIs
      • More resources to come … soon
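Steps 1 to 3 of the workflow above can be sketched as a toy model. Everything in it is hypothetical (class names, function names, the `demo-001` proposal, the user IDs and the data path); it does not describe the actual ILL service or any OpenStack API, and steps 4 and 5 (expert support, publication of analysis data) are not modelled. The default applications are two of those named on the last slide of this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    proposal_id: str
    team: set[str]      # federated user IDs of the experimental team
    data_path: str      # location of the experimental data


@dataclass
class AnalysisSession:
    user: str
    proposal_id: str
    data_mount: str     # experimental data made directly accessible
    applications: list[str] = field(default_factory=list)


def list_experiments(user: str, catalogue: list[Experiment]) -> list[Experiment]:
    """Step 2: show only the experiments the authenticated user took part in."""
    return [exp for exp in catalogue if user in exp.team]


def open_session(user, proposal_id, catalogue, applications=("LAMP", "Mantid")):
    """Step 3: connect the user to a preconfigured analysis environment."""
    for exp in list_experiments(user, catalogue):
        if exp.proposal_id == proposal_id:
            return AnalysisSession(user, proposal_id, exp.data_path, list(applications))
    raise PermissionError(f"{user} has no access to experiment {proposal_id}")


catalogue = [Experiment("demo-001", {"alice@example.org"}, "/data/demo-001")]
session = open_session("alice@example.org", "demo-001", catalogue)
print(session.applications, session.data_mount)
```

The point of the design is that the data never leave the facility: the user only receives a session in which applications and data are already wired together.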

  21. Homework by Andy
      • Top 3 data analysis applications …
        – LAMP, Mantid, Matlab through a private cloud + remote desktop
      • What services could the e-infrastructures provide?
        – OpenAIRE/DataCite: help us to communicate, collect metrics of data usage
        – GEANT: global AAI? Hybrid cloud?
        – EGI/EUDAT: ???
      • If we submit a new PaNDaaS proposal … what to solve?
        – DaaS (volume and ease), analysis preservation, metrics
      • NX as an immediate, temporary (scalability?) solution

