1. Data Format and Packaging, An Update
   Kurt Biery
   18 March 2020
   DUNE DAQ Dataflow Working Group Meeting

2. Data ‘Format’
   At the DAQ workshop, it was proposed to focus our data format investigations on:
   • A DUNE-specific binary format stored in HDF5 files
   In the (admittedly small number of) subsequent discussions, this has been received positively.
   Eric Flumerfelt has done preliminary work demonstrating the writing of artdaq::Fragments (a la PDSP) in HDF5.
   Next steps: share what has been done so far with a few more technical experts from offline and online, gather feedback, and run tests (encoding/decoding speed, etc.).
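
   As a concrete illustration of what storing binary fragment data in HDF5 could look like at the lowest level, here is a minimal sketch using the HDF5 C++ API. The file name, dataset name, and attribute are invented placeholders, not the layout from Eric's prototype.

```cpp
#include "H5Cpp.h"
#include <cstdint>
#include <vector>

int main() {
  // Create a new HDF5 file (truncating any existing one).
  H5::H5File file("trigger_record.hdf5", H5F_ACC_TRUNC);

  // Stand-in for a fragment's raw binary payload.
  std::vector<uint8_t> payload(1024, 0);

  // A one-dimensional dataspace sized to the payload.
  hsize_t dims[1] = {payload.size()};
  H5::DataSpace space(1, dims);

  // Store the bytes as a uint8 dataset; the dataset name is a placeholder.
  H5::DataSet dset =
      file.createDataSet("Fragment_APA01", H5::PredType::NATIVE_UINT8, space);
  dset.write(payload.data(), H5::PredType::NATIVE_UINT8);

  // Attach an identifying attribute to the dataset.
  uint64_t trigger_number = 1;
  H5::Attribute attr = dset.createAttribute(
      "trigger_number", H5::PredType::NATIVE_UINT64, H5::DataSpace(H5S_SCALAR));
  attr.write(H5::PredType::NATIVE_UINT64, &trigger_number);
  return 0;
}
```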

3. Data ‘Packaging’
   ‘Packaging’ ~= ‘grouping and subdividing’:
   • determining how file boundaries are managed…
   • what appears in each file…
   At the DAQ workshop, we used different ‘types of data’ as a starting point for discussion, which was possibly misleading. Here, I’d like to start with different types of packaging and come back to different types of data later…

4. Data packaging choices (parameters)
   Some of the parameters that can be used to specify how data is grouped into files:
   1. Whether or not the data in each file on disk will have geographically complete coverage (a superset; the Trigger Decision has details)
      - If not, what subdivision will be used
   2. The maximum size of the files that will be created
   3. The maximum time interval/duration that will be stored in a single file (data time and wall-clock time both seem possible)
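
   One way to picture these parameters is as a plain struct. A hypothetical sketch follows, with all names invented for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical grouping of the packaging parameters above (names invented).
struct PackagingParams {
  bool geo_complete_coverage;    // #1: geographically complete data per file?
  std::string subdivision;       // #1a: if not complete, how to subdivide (e.g. "APA")
  uint64_t max_file_size_bytes;  // #2: maximum file size
  uint64_t max_time_window_sec;  // #3: maximum time interval stored in one file
};
```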

5. Priority of the choices
   Given the choices described on the previous slide, we can imagine sets of answers/values for #1, #2, and #3 that can’t simultaneously be satisfied. So, we would need to specify which one(s) are the most important. For example:
   • For trigger type Y during normal running, the file size specification is the most important.
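
   A priority among the choices could be expressed as simply as an ordered list of constraints. A hypothetical sketch (names invented):

```cpp
#include <vector>

// The constraints from the previous slide, as an enumeration.
enum class PackagingConstraint { GeoCoverage, FileSize, TimeWindow };

// For trigger type Y during normal running, file size comes first.
const std::vector<PackagingConstraint> priority_for_type_Y = {
    PackagingConstraint::FileSize,
    PackagingConstraint::GeoCoverage,
    PackagingConstraint::TimeWindow,
};
```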

6. Part of the configuration for DF?
   Should we (the Dataflow subsystem) support a set of configuration parameters, keyed by data type, that specifies how the data for that data type is packaged?
   I believe that we can identify the set of parameters that will be needed to specify how the data files for a given data type should be handled. Discussion of the parameter values can, and should, be deferred until closer to data taking.
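
   A hypothetical sketch of "keyed by data type", reusing the PackagingParams struct sketched on slide 4; the keys and values are placeholders.

```cpp
#include <cstdint>
#include <map>
#include <string>

// One packaging configuration per data type, using the hypothetical
// PackagingParams struct from the slide-4 sketch. All entries are placeholders.
const std::map<std::string, PackagingParams> packaging_config = {
    {"beam_trigger", {true, "", 8ULL << 30, 1800}},   // complete coverage
    {"snb_trigger", {false, "APA", 8ULL << 30, 0}},   // split by APA
};
```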

7. Easing back into data types…
   My sense is that we have two high-level types:
   • Triggered data
     - Trigger Records that are produced in response to a Trigger Decision
   • Streaming data
     - Data that is collected without Trigger Record boundaries
     - E.g. WIB debugging data
     - The Trigger Primitive stream might also fit in this category

8. Data packaging choices, take 2
   For Triggered data:
   1. Whether each file on disk will have an integer number of Trigger Records, or whether each file can have a fractional number of Trigger Record(s)
   For both Triggered and Streaming data:
   2. Whether or not the data in each file on disk will have geographically complete coverage (a superset; the Trigger Decision has details)
      - If not, what subdivision will be used
   3. The maximum size of the files that will be created
   4. The maximum time interval/duration that will be stored in a single file (data time and wall-clock time both seem possible)
   We will still need to specify a priority among these…
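
   Extending the hypothetical slide-4 sketch, the triggered-data-only choice (#1) could sit alongside the shared parameters:

```cpp
// Extends the hypothetical PackagingParams sketch from slide 4.
struct TriggeredPackagingParams : PackagingParams {
  // Choice #1: whole Trigger Records per file (true), or
  // allow a Trigger Record to be split across files (false).
  bool integer_trigger_records;
};
```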

9. Different ‘types of data’
   Types mentioned in earlier discussions:
   • Local Trigger Records – e.g. beam triggers
   • Extended Trigger Records – e.g. SNB triggers
   • Trigger Primitive stream – all TPs
   • WIB debugging stream – a temporary stream that can be enabled for debugging
   Others may be mentioned/proposed over time…

10. Possible choices for one data type
    Beam Trigger Records:
    • Integer number of Trigger Records per file: Yes
    • Geographically complete data in each file: Yes (TPC, PDS, Trigger, Timing; superset, Trigger specifies details in the Trigger Decision)
    • Maximum file size: <optimized for offline use>
    • Maximum time duration per file: TBD (0.5 hour?)
    • Priorities: TBD
      1. If TR size < max_file_size, an integer number of TRs; otherwise, file size
      2. Etc.
    ** These value choices are for illustration only. If we support configurable data packaging in the Dataflow subsystem, then the values can be changed under the direction of the appropriate physics groups, offline folks, online folks, etc.
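
    Using the hypothetical structs sketched on slides 4 and 8, the illustrative beam-trigger values above might be captured like this (every value is a placeholder):

```cpp
// Illustrative values only, mirroring the bullet list above.
TriggeredPackagingParams make_beam_trigger_params() {
  TriggeredPackagingParams p;
  p.integer_trigger_records = true;      // integer number of TRs per file: Yes
  p.geo_complete_coverage = true;        // complete coverage: Yes
  p.subdivision = "";                    // n/a when coverage is complete
  p.max_file_size_bytes = 8ULL << 30;    // placeholder for <optimized for offline use>
  p.max_time_window_sec = 1800;          // TBD (0.5 hour?)
  return p;
}
```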

11. Possible choices for a 2nd data type
    Supernova Burst Trigger Records:
    • Integer number of Trigger Records per file: No
    • Geographically complete data in each file: No
      - Files split by APA (for example); PDS, etc. details TBD
    • Maximum file size: <optimized for offline use>
    • Maximum time window per file: TBD
    • Priorities: TBD
      1. File size
      2. Etc.
    ** These value choices are for illustration only.

12. Possible choices for a 3rd data type
    The Trigger Primitive Stream:
    • Integer number of Trigger Records per file: n/a
    • Geographically complete data in each file: Yes (TPC, PDS, Trigger, Timing; superset, subdetector components which don’t have TPs won’t contribute)
    • Maximum file size: <optimized for offline use>
    • Maximum time window per file: TBD
    • Priorities:
      1. File size
      2. Etc.
    ** These value choices are for illustration only. If we support configurable data packaging in the Dataflow subsystem, then the values can be changed under the direction of the appropriate physics groups, offline folks, online folks, etc.

13. Possible choices for a 4th data type
    The WIB Debug Stream:
    • Integer number of Trigger Records per file: n/a
    • Geographically complete data in each file: No
      - Files split by <TBD>
    • Maximum file size: <optimized for offline use>
    • Maximum time window per file: TBD
    • Priorities:
      - File size
      - Etc.
    ** These value choices are for illustration only.

14. Comments
    1. Choosing to support this configurability does not necessarily mean that we will need to build a general-purpose rules engine. The options aren’t that numerous; we could simply encapsulate them in a class (see the sketch after this list).
    2. Remember that we’re talking about interfaces here: the data handoff and the specification of the packaging.
       - Implementation details within both the online and the offline have freedom…
    3. New trigger types that have readout windows in the range of 10-100 seconds can easily be supported by a configurable DF data packaging system; the packaging config would be part of the proposal from the physics group or whomever.
    4. Files wouldn’t necessarily need to have consistent “spans” (time window or number of TRs) [metadata files are discussed on the next slide]
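
    As a sketch of comment #1, the packaging options and their priority could be wrapped in one small class; the names and logic here are illustrative, not a proposed design.

```cpp
#include <cstdint>

// Hypothetical sketch (not a design) of encapsulating the packaging choices
// in a class rather than a general-purpose rules engine. Names are invented.
class FileBoundaryPolicy {
public:
  FileBoundaryPolicy(uint64_t max_bytes, uint64_t max_window_sec)
      : max_bytes_(max_bytes), max_window_sec_(max_window_sec) {}

  // Close the current file before the next Trigger Record (or piece of one)
  // if adding it would violate the size limit or the time-window limit.
  bool should_close_file(uint64_t bytes_in_file, uint64_t next_tr_bytes,
                         uint64_t window_sec_in_file) const {
    if (window_sec_in_file >= max_window_sec_) return true;
    if (bytes_in_file + next_tr_bytes > max_bytes_) return true;
    return false;
  }

private:
  uint64_t max_bytes_;
  uint64_t max_window_sec_;
};
```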

15. Ideas
    1. Data challenge in Feb 2021
    2. Metadata and manifest files…
       - A metadata file for each raw data file
       - A manifest file for each TR that spans multiple files
       - Metadata could instead be internal to the raw data file
       - Sample metadata information for SNB files (see the sketch after this list):
         • the trigger number/identifier
         • the APA number (or whatever geographic identifier(s) are appropriate)
         • the beginning and ending timestamps of the trigger window (or start time and window size)
         • the beginning and ending timestamps of the interval that is covered by the individual file (or start time and window size)
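
    A hypothetical sketch of the per-file SNB metadata listed above; the field names are invented.

```cpp
#include <cstdint>

// Hypothetical per-file metadata for an SNB raw data file (names invented),
// mirroring the sample fields listed above.
struct SNBFileMetadata {
  uint64_t trigger_number;   // the trigger number/identifier
  uint32_t apa_number;       // geographic identifier (APA, for example)
  uint64_t window_begin_ts;  // beginning timestamp of the trigger window
  uint64_t window_end_ts;    // ending timestamp of the trigger window
  uint64_t file_begin_ts;    // beginning of the interval covered by this file
  uint64_t file_end_ts;      // ending of the interval covered by this file
};
```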

16. Backup slides

17. Some topics that have come up
    Where to save information about which components are in the partition: in each data file? In each metadata file? (In the configuration archive, for sure.)
    Reminder: partitions do not span detector cryomodules.

18. Reminder about Tom’s requirements
    Tom has summarized the following requirements:
    1. longevity of support
    2. integrity checks – for the file format as well as the data fragments
    3. ability to read in small subsets of the trigger records and drop from memory data no longer being used
    4. ability to navigate through a trigger record to get the adjacent time or space samples
    5. compression tools
    6. browsable with a lightweight, interactive tool
    7. ability to handle evolution of data formats and structure gracefully, with backward compatibility ensured
    https://wiki.dunescience.org/wiki/Project_Requirement_Brainstorming#Data_Format
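
    Requirement 3 maps naturally onto HDF5's partial-I/O support. A minimal sketch using the HDF5 C++ API, with file and dataset names reused from the earlier illustrative example:

```cpp
#include "H5Cpp.h"
#include <cstdint>
#include <vector>

int main() {
  // Open an existing file and dataset (placeholder names from the earlier sketch).
  H5::H5File file("trigger_record.hdf5", H5F_ACC_RDONLY);
  H5::DataSet dset = file.openDataSet("Fragment_APA01");

  // Select a 128-byte slice starting at offset 512 in the file's dataspace.
  H5::DataSpace file_space = dset.getSpace();
  hsize_t offset[1] = {512};
  hsize_t count[1] = {128};
  file_space.selectHyperslab(H5S_SELECT_SET, count, offset);

  // Read only that slice into memory; the rest of the dataset stays on disk.
  H5::DataSpace mem_space(1, count);
  std::vector<uint8_t> slice(count[0]);
  dset.read(slice.data(), H5::PredType::NATIVE_UINT8, mem_space, file_space);
  return 0;
}
```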
