ProtoDUNE Data Flow Protocol
For discussion at the DAQ meeting
Kurt Biery, Giovanna Lehmann Miotto
21-Nov-2016
Route fragments from each trigger to the same EventBuilder
Support dynamic control of dataflow into EventBuilders
[Diagram: Timing System, Dataflow Manager (DFO), and multi-core nodes, each running BoardReader processes (with a FragmentGenerator) and EventBuilder processes (with EventStore and art). Flows are numbered 1 to 4; flow labels: Trigger Messages, Event Readout Requests, Fragment Readout Requests, Data Fragment Flow.]
Trigger/Event counter forwarded through DFO to EB to BR to be put into fragment and event headers
Data requested from BRs by TIMESTAMP from EB
2 21/11/16 KAB, GLM | protoDUNE Dataflow Protocol
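A minimal Python sketch of the request chain on this slide, for discussion only. All class and method names (`Trigger`, `BoardReader.request_fragment`, `EventBuilder.build`, `DataflowManager.on_trigger`) are hypothetical and are not artdaq APIs. Round-robin assignment is an assumed policy, and plain method calls stand in for the network messages.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass(frozen=True)
class Trigger:
    counter: int      # trigger/event counter from the timing system
    timestamp: int    # timestamp used to look up data in the BoardReaders


@dataclass
class BoardReader:
    name: str
    data: dict = field(default_factory=dict)  # timestamp -> payload (assumed layout)

    def request_fragment(self, counter, timestamp):
        # Fragment readout request: data is looked up by timestamp, and the
        # forwarded counter is written into the fragment header.
        return {"source": self.name, "counter": counter,
                "timestamp": timestamp, "payload": self.data[timestamp]}


@dataclass
class EventBuilder:
    name: str
    readers: list
    events: dict = field(default_factory=dict)  # counter -> list of fragments

    def build(self, trigger):
        # Event readout request: ask every BoardReader for the fragment
        # matching the trigger timestamp.
        frags = [br.request_fragment(trigger.counter, trigger.timestamp)
                 for br in self.readers]
        self.events[trigger.counter] = frags
        return frags


class DataflowManager:
    def __init__(self, builders):
        self._next = cycle(builders)  # round-robin is an assumed policy

    def on_trigger(self, trigger):
        # One trigger goes to exactly one EventBuilder, so all of its
        # fragments are routed to the same place.
        eb = next(self._next)
        eb.build(trigger)
        return eb.name
```

Because the DFO is the only component that picks an EventBuilder, changing the selection policy there is enough to steer the dataflow without touching the BoardReaders.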
Additional Dataflow Considerations
Building on the dataflow slides shown by Karol last week…
Some proposals:
1. One BoardReader per RCE, one per SSP, etc.
2. Which entity should handle the possibility that triggers “arrive” before the data?
   1. Propose that the BoardReader/FragmentGenerator handle this (with a timeout)
   2. Support for this already exists in the FragmentGenerator base class
   3. Other options are possible (DFO, EB), but adding artificial delays or retries there adds complication
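To illustrate proposal 2.1, here is a small Python sketch of a BoardReader-side wait with a timeout. It is not the FragmentGenerator base class implementation. `FragmentBuffer`, `add`, and `wait_for` are hypothetical names, and the timestamp-keyed dictionary is an assumed buffer layout.

```python
import threading


class FragmentBuffer:
    """Hypothetical buffer: timestamp -> payload, filled by a readout thread."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def add(self, timestamp, payload):
        # Called by the readout side when data arrives from the hardware.
        with self._cond:
            self._data[timestamp] = payload
            self._cond.notify_all()

    def wait_for(self, timestamp, timeout_s):
        # Block until the data for `timestamp` arrives or the timeout expires.
        # Returns the payload (removing it from the buffer), or None on timeout.
        with self._cond:
            found = self._cond.wait_for(lambda: timestamp in self._data,
                                        timeout=timeout_s)
            return self._data.pop(timestamp) if found else None
```

Keeping the wait inside the BoardReader means the DFO and EB need no retry logic: a request either returns data or reports a timeout.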
Additional Dataflow Considerations, continued
3. When can data be cleared from various buffers in the system?
   1. Since TCP/IP will be used, propose to delete data as soon as they have been sent on (BoardReader -> EB, EB -> Aggregator, Aggregator -> disk)
4. When will EB notify DFO that event is complete or finished?
   1. When full event is queued for output?
   2. When event has been sent to Aggregator?
Data Flow Error Conditions
1. A BoardReader never finds a match between a request and the data
   1. Detection: do we base the detection on a timeout or on the availability of fragments associated with (much) higher timestamps?
   2. Reaction: should the BoardReader create an empty fragment to send to the EB? (propose Yes) If this is done, the EB can assume that it will ALWAYS build complete events (except if a BoardReader crashes)
2. A BoardReader crashes
   1. Question: should this be considered a Fatal Error?
   2. Detection: the process management application(s) detect that the BR process is gone; EBs detect that the connection to the BR has been lost
   3. Reaction: end the current run, build incomplete events, or create empty fragments for the missing pieces?
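A sketch of the "higher timestamps" detection option from item 1.1, combined with the empty-fragment reaction proposed in item 1.2. `Fragment`, `answer_request`, and the `margin` parameter are hypothetical, and the dictionary buffer is an assumed layout. A timeout-based detection would replace only the second test.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    counter: int
    timestamp: int
    payload: bytes
    empty: bool = False   # True marks a placeholder for missing data


def answer_request(buffer, counter, timestamp, margin):
    """Return a Fragment, or None if the request should keep waiting.

    buffer: dict mapping timestamp -> payload (hypothetical layout).
    margin: how far newer data must be ahead of the request before the
            requested data is declared missing (assumed tuning parameter).
    """
    if timestamp in buffer:
        return Fragment(counter, timestamp, buffer[timestamp])
    if buffer and max(buffer) > timestamp + margin:
        # Much newer data is already here, so the requested data is not
        # coming: send an empty fragment so the event can still be completed.
        return Fragment(counter, timestamp, b"", empty=True)
    return None  # data may still arrive; the caller retries or times out
```

With this reaction the EB always receives one fragment per BoardReader and can tell real data from placeholders by the `empty` flag.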
Data Flow Error Conditions, continued
3. A BoardReader restarts
   1. Do we want to make a BoardReader crash a recoverable error or not?
   2. Reconfiguring and re-syncing the BoardReader and its associated hardware with the rest of the system seems like it will be non-trivial
   3. There would also be reconnection with EBs to be done
   4. This may be a great longer-term goal, but maybe we consider this a low priority for protoDUNE. We expect that this would involve interaction with RunControl.
4. DFO crash: FATAL, start new run
Data Flow Error Conditions, continued
5. An EB node crashes:
   1. Question: should this be considered a Fatal Error? Does the answer depend on the configuration of the system (whether the EBs are writing data to disk, whether all of the EBs are needed to handle the full rate, etc.)?
   2. Detection: the process management application(s) detect the failure, the TCP connection to the DFO will be closed, and events assigned to the bad node will not be eliminated in the DFO
   3. Reaction: the DFO continues assigning events to other EB nodes. Events assigned to the crashed EB will be lost.
6. An EB node restarts:
   1. Do we want to foresee a recovery scenario? How this can be done depends very much on how EBs announce themselves to the DFO nodes and how the connections to the BRs are handled. EB processes will also need to be properly configured.
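The reaction in item 5.3 can be sketched as DFO-side bookkeeping. This is an illustration under assumptions, not a proposed implementation. `DfoBookkeeping` and its methods are hypothetical names, round-robin is an assumed assignment policy, and treating "no EventBuilders left" as fatal is an assumption.

```python
class DfoBookkeeping:
    """Hypothetical DFO state: which EventBuilder owns which event."""

    def __init__(self, builder_names):
        self.active = list(builder_names)
        self.in_flight = {}   # event counter -> EventBuilder name
        self.lost = []        # counters of events lost to crashed builders
        self._turn = 0

    def assign(self, counter):
        if not self.active:
            raise RuntimeError("no EventBuilders left")  # assumed to be fatal
        name = self.active[self._turn % len(self.active)]
        self._turn += 1
        self.in_flight[counter] = name
        return name

    def on_event_complete(self, counter):
        # The EventBuilder reported the event as finished.
        self.in_flight.pop(counter, None)

    def on_builder_lost(self, name):
        # Called when the TCP connection to an EventBuilder closes.
        if name in self.active:
            self.active.remove(name)
        dead = sorted(c for c, n in self.in_flight.items() if n == name)
        for c in dead:
            del self.in_flight[c]   # otherwise these would never be cleared
        self.lost.extend(dead)
        return dead
```

Recording the lost counters explicitly keeps the DFO's in-flight table from growing with events that will never complete, and gives the run a list of what was dropped.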
Data Flow Error Conditions, continued
7. An event is never requested:
   1. How will we clear it from the BoardReaders eventually? Timeout, circular buffer?
   2. This depends somewhat on the implementation of the FragGen, e.g. the basic operation of the FELIX FragGen will include dropping of unwanted data.
8. Aggregator cannot write more data
   1. Detection: the EventBuilders assigned to this Aggregator will not be able to send more data and will no longer be assigned new events by the DFO.
   2. Recovery: continue writing to other Aggregators, or Fatal Error?
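For item 7.1, the two options (timeout, circular buffer) can be combined in one structure. The sketch below is only an illustration. `ExpiringBuffer` is a hypothetical name, `max_age` is measured in timestamp ticks (an assumed unit), and data is assumed to arrive in increasing timestamp order.

```python
from collections import OrderedDict


class ExpiringBuffer:
    """Hypothetical BoardReader buffer bounded by size and by data age."""

    def __init__(self, max_items, max_age):
        self.max_items = max_items
        self.max_age = max_age          # in timestamp ticks (assumed unit)
        self._data = OrderedDict()      # timestamp -> payload, oldest first

    def add(self, timestamp, payload):
        self._data[timestamp] = payload
        # Circular-buffer behaviour: discard the oldest entries when full.
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)
        # Timeout behaviour: drop anything older than max_age.
        while self._data and next(iter(self._data)) < timestamp - self.max_age:
            self._data.popitem(last=False)

    def take(self, timestamp):
        # Requested data leaves the buffer as soon as it is handed over.
        return self._data.pop(timestamp, None)

    def __len__(self):
        return len(self._data)
```

Either bound alone is enough to guarantee that unrequested data is eventually dropped; using both caps memory even when the trigger rate changes.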