DAQ development and operations at PDSP
DUNE Collaboration Week, 2019-05-20
Roland Sipos, CERN EP-DT

Overview
This talk is about the development and operations of the ProtoDUNE-SP DAQ. Three main elements:
● Detector operations
  ○ Ensure system stability for data taking
  ○ Support for DAQ users
● Interventions and developments
  ○ Understanding limitations and issues, eliminating problems
  ○ Tests of new features during the dedicated development days
● R&D towards DUNE DAQ
  ○ DUNE DAQ components development
  ○ Integration of the stable new features

DAQ operations
● Dedicated periods for DAQ tests + weekly DAQ Development Day
  ○ Development Fridays moved to Monday
  ○ In order to avoid starting the week with hidden issues in the system
● Several problems reported by DAQ users
  ○ System-wide stability issues in January
  ○ Efforts for better issue tracking
● Requirements from Detector Operations
  ○ Extremely useful for understanding limitations of the system

DAQ support approach
Currently the DAQ support approach is informal and operates on a best-effort basis.
This is not sustainable. We are aware of it and are continuously working on improvements.
Planning the re-introduction of an on-call DAQ shift for ProtoDUNE:
● More formal
● Fairer sharing of the load
● Full-remote is feasible (first level support)
● Better understanding of hidden problems
● Perfect crash course for new developers

NP04 DAQ JIRA
Slack is a great tool for communication, not so much for keeping track of progress.
We introduced an issue tracker for ongoing developments and pending problems.
● JIRA seems to help with tracking issues
  ○ Still not too many users, but slowly growing
Long-time open To-Do tickets:
● We need to encourage developers to follow up on issues, and also to track their progress
● Clear indication of components that lost manpower
The lost manpower of critical components needs to be compensated.

New features
● Prepared ColdBox readout
  ○ Finalized: the DAQ is fully prepared for APA7
  ○ Partition 6 with RCE readout
  ○ This might change, as we gradually move to full FELIX readout
● New hardware triggers
  ○ HV current limit threshold
  ○ Ground plane signals
  ○ Purity monitor signals
Under development:
● Disable triggers from the DCS
  ○ For automated purity monitor runs

Interventions
DAQ servers
● Partition issues of servers eliminated
● Kernel upgrade campaign
  ○ To avoid early Meltdown/Spectre mitigation (retpoline)
  ○ Aligned configuration
● Cold restart test
  ○ Some servers have reboot/poweroff issues
● Device maintenance
  ○ SSD firmware upgrade of FELIX servers
  ○ Intel QAT driver automation (with user support)
Services
● Supervised OP mon. restart procedures
  ○ WIB and SSP operational monitoring scripts
● Kibana log aggregation
  ○ Still have some issues

High-rate runs
Needed for several noise study runs, which take substantial time (max.: 3 hours)
● Stabilized at 40 Hz
  ○ Design goal: 25 Hz
Still there are some hidden issues under the hood!
● Investigating the 10Gb network
● UDP messages get lost in RoutingTable update acknowledgements
  ○ The introduction of the DFO (DataFlow Orchestrator) will eliminate the use of UDP messaging for the routing table

DAQ development
● There were/are several activities to improve the DAQ:
  ○ ArtDAQ
    ■ Several parts of the framework (details on next slide)
  ○ FELIX
    ■ Align software versions to newest ATLAS FELIX suite
    ■ Better operational monitoring and automated error recovery
  ○ Run Control
    ■ Alarm system improvements
  ○ System administration
    ■ Automation of missing elements (e.g.: FELIX)
● New features of different components
  ○ Feature requests
  ○ Continuously discussed and followed up

ArtDAQ
Substantial improvements in the DAQ software framework (many thanks to Kurt and the ArtDAQ developers)
● Routing Master improvements
● RoundRobin routing policy issues fixed
● EventBuilder fault-tolerance
  ○ Crashed EBs can be restarted in the same run!
● EventBuilder FragmentWatcher plugin
  ○ Event integrity check
● Geographic grouping
  ○ Ongoing work to group FELIX data from each APA in its own art/ROOT data product
● Studies on better development workflow
  ○ And on work area packaging/handling
● Components for the self-triggering chain are under development

Event integrity metrics
Offline experts reported incomplete events in data, therefore a new plugin (FragmentWatcher) for the EventBuilder was introduced
● EB reports on event completeness
  ○ Missing fragments
  ○ Empty fragments
● ~1.5% of events have empty fragments!
  ○ Mostly from SSP BoardReaders

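The completeness check described on this slide can be illustrated with a minimal sketch. This is not the FragmentWatcher code: the `Fragment` type, the `source_id` field and the source names are hypothetical, chosen only to show what "missing" and "empty" mean for one built event.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    source_id: str   # hypothetical BoardReader identifier
    payload: bytes   # raw data carried by the fragment


def check_event(fragments, expected_sources):
    """Return (missing, empty) source IDs for one built event.

    missing: expected sources that delivered no fragment at all
    empty:   sources that delivered a fragment with a zero-length payload
    """
    seen = {f.source_id: f for f in fragments}
    missing = sorted(set(expected_sources) - set(seen))
    empty = sorted(sid for sid, f in seen.items() if len(f.payload) == 0)
    return missing, empty


# Hypothetical event: one FELIX fragment with data, one empty SSP fragment,
# and nothing at all from a second SSP source.
event = [Fragment("felix501", b"\x01\x02"), Fragment("ssp101", b"")]
missing, empty = check_event(event, ["felix501", "ssp101", "ssp102"])
```

Counting the events for which `empty` is non-empty, divided by the total number of events, gives the kind of fraction quoted above.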
RunControl improvements
To improve the user experience and warn users when the system is in an error state:
● Bug-fixes
● Separated logical elements from GUI
● EventBuilder metrics integrated with OP monitoring
● Improved alarms
  ○ Introduction of a RunControl bot on Slack

Extending the FELIX readout
RCE APAs are gradually moving to FELIX
● Planning of resources
  ○ APA4 moves to FELIX in June (one half of the detector then read out by FELIX)
● Performance and topology evaluation of servers

R&D towards DUNE
● ProtoDUNE is the potential test facility for DUNE DAQ prototypes
  ○ HitFinding
  ○ Self-triggering chain
  ○ Single host FELIX setup
  ○ Co-processor
  ○ Control and Configuration Management (CCM studies)
  ○ Fault-recovery
  ○ … and many more!
● The only (currently) available platform for system level integration

HitFinding
From Philip Rodrigues
● Software implementation, using Intel AVX2 registers and instructions
● Keeps up with the dataflow!
1 WIB frame (464 B) => 256 ADCs + headers @ 2 MHz

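The frame figures above translate into a data rate that the hit finder has to sustain. The short calculation below is a sketch under one assumption not stated on this slide: that the 256 ADC samples are tightly packed at 12 bits each (the 12-bit width is taken from the TPC trigger slide).

```python
FRAME_BYTES = 464          # size of one WIB frame (from the slide)
ADCS_PER_FRAME = 256       # ADC samples per frame (from the slide)
FRAME_RATE_HZ = 2_000_000  # frames per second (from the slide)
ADC_BITS = 12              # assumption: samples are tightly packed at 12 bits

payload_bytes = ADCS_PER_FRAME * ADC_BITS // 8         # bytes of ADC data per frame
header_bytes = FRAME_BYTES - payload_bytes             # what is left for headers
stream_rate_mb_s = FRAME_BYTES * FRAME_RATE_HZ / 1e6   # MB/s per stream of frames
samples_per_s = ADCS_PER_FRAME * FRAME_RATE_HZ         # ADC samples per second

print(payload_bytes, header_bytes, stream_rate_mb_s, samples_per_s)
# prints: 384 80 928.0 512000000
```

Under this assumption each stream of frames carries 928 MB/s and 512 million samples per second, which is the load the AVX2 implementation has to keep up with.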
Data reordering
From Giovanna’s plenary DAQ talk
● Unpack and extend collection channels with AVX2 code
● This is a heavy operation for the CPU
● FELIX modified to perform channel reordering in FW
  ○ This implies the need to extend the FELIX Overlay with another version
Preliminary tests show an improvement in CPU utilization

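To show what "unpack and extend" involves, here is a scalar sketch of expanding packed 12-bit values into 16-bit words. The bit layout used (two samples in three bytes, low bits first) is an assumption for illustration only; the real WIB frame layout is different and interleaves channels, and the production code uses AVX2 rather than a Python loop.

```python
from array import array


def pack12(values):
    """Pack an even number of 12-bit values into bytes (assumed layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for i in range(0, len(values), 2):
        a, b = values[i] & 0xFFF, values[i + 1] & 0xFFF
        out += bytes((a & 0xFF, (a >> 8) | ((b & 0x0F) << 4), b >> 4))
    return bytes(out)


def unpack12(packed):
    """Expand packed 12-bit values into an array of unsigned 16-bit words."""
    if len(packed) % 3:
        raise ValueError("length must be a multiple of 3 bytes")
    out = array("H")
    for i in range(0, len(packed), 3):
        b0, b1, b2 = packed[i:i + 3]
        out.append(b0 | ((b1 & 0x0F) << 8))  # first sample: b0 plus low nibble of b1
        out.append((b1 >> 4) | (b2 << 4))    # second sample: high nibble of b1 plus b2
    return out


samples = [0xABC, 0x123, 0x000, 0xFFF]
roundtrip = list(unpack12(pack12(samples)))
```

Every sample costs several shifts and masks, which is why doing this for the full stream is heavy on the CPU and why moving the reordering into firmware is attractive.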
TPC trigger
From Giovanna’s plenary DAQ talk
● Get the complete stream of TPC raw data
● Reformat WIB frames to
  ○ Expand 12 bit ADCs into 16 bits
  ○ Reorder wires in order to select only collection plane
● Find “hits” in the stream
● Combine information of hits in order to form track candidates
● Implement a software-based trigger logic
Full chain tests are already ongoing: FELIX BR -> HitFinder BR -> SoftwareTrigger BR
This work is ongoing. Close to reality: the full chain will be tested during the next DAQ testing periods at NP04

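The "find hits in the stream" step can be pictured with a generic threshold-based sketch on a single channel. This is not the algorithm used in the ProtoDUNE hit finder: the median pedestal estimate, the fixed threshold and the `(start, end, charge)` hit tuple are all assumptions made for illustration.

```python
from statistics import median


def find_hits(waveform, threshold):
    """Return a list of (start_tick, end_tick, summed_excess) tuples.

    A hit is a contiguous run of samples exceeding the pedestal
    (estimated here as the median of the waveform) by more than `threshold`.
    """
    pedestal = median(waveform)
    hits = []
    start = None
    charge = 0
    for tick, adc in enumerate(waveform):
        excess = adc - pedestal
        if excess > threshold:
            if start is None:          # a new hit begins
                start, charge = tick, 0
            charge += excess
        elif start is not None:        # the current hit ends
            hits.append((start, tick - 1, charge))
            start = None
    if start is not None:              # hit still open at end of waveform
        hits.append((start, len(waveform) - 1, charge))
    return hits


# Hypothetical collection-channel waveform: flat pedestal with one pulse.
waveform = [500] * 10 + [520, 560, 530] + [500] * 10
hits = find_hits(waveform, threshold=10)
```

Hits from neighbouring collection wires and nearby ticks would then be combined downstream to form the track candidates mentioned above.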
OnHost FELIX BoardReader
Goal: Eliminate the 100Gb/s peer-to-peer connection between the FELIX host server and the BoardReader application, by merging the FELIX data processing software with the BoardReader’s data selection.
Gain: Less space needed (1 server fewer) at lower cost (1 x server and 2 x NICs saved). Also R&D towards the DUNE approach.
First working version. (Not production ready, needs manual adjustments)

Outlook
● Organizing the weekly development days in advance
● Preparation for the June/July DAQ testing period has the highest priority
Main goals:
  ○ Introduction of the DataFlow Orchestrator in the EB
  ○ Eliminate the Routing Master
  ○ Still support only full event building
  ○ Introduce software hit finding, trigger candidate and module trigger applications
  ○ Introduce FELIX firmware with data reordered by planes
  ○ Provide FELIX readout for APA4
● Collecting requirements from the WGs for the sprint

End
Thank you for your attention!