
What to Learn From MicroBooNE DAQ? Wesley Ketchum



  1. What to Learn From MicroBooNE DAQ?
     - Wesley Ketchum, with input of lots of MicroBooNE people
     - 30 October 2017

  2. First things first
     - MicroBooNE Detector Paper: JINST 12, P02017 (2017)
       - https://arxiv.org/abs/1612.05824
       - (basically) everything in this talk that is not my opinion comes from there
     - MicroBooNE continues running well
       - Starting third year of data-taking
       - >95% of POT delivered is recorded to tape
       - That's integrated, so 5% loss not (all) due to DAQ (typical uptime >97%)
     - [Figure: time axis running from Oct '15 through Oct '16 to Oct '17]

  3. Design: Electronics
     - [Figure label: 36 PMT channels]

  4. TPC Readout Electronics

  5. PMT/Trigger Readout Electronics
     - PMT readout
       - Beam disc: unbiased readout for 23.4 us around trigger
       - Cosmic disc: threshold requirement, readout for 625 ns
     - Clock
       - Common "frame number" (1.6 ms counter) from start of run
       - Pulse-per-second signal from GPS latches the time, allowing a lookup map to real time
       - Used for matching auxiliary data (like beam and cosmic ray tagger)
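
The frame-counter-plus-GPS-latch scheme on the clock slide can be sketched as a small lookup. This is only an illustration under stated assumptions: the function name `frame_to_time`, the latch table layout (pairs of frame count at the pulse and the wall-clock second it marks), and the fractional frame counts are all hypothetical; only the 1.6 ms frame length comes from the slide.

```python
from bisect import bisect_right

FRAME_PERIOD_S = 1.6e-3  # frame length quoted on the slide (1.6 ms counter)


def frame_to_time(frame, pps_latches):
    """Map a frame count to wall-clock seconds.

    pps_latches is a hypothetical table: a list of
    (frame_count_at_pulse, wall_clock_second) pairs sorted by frame count,
    one entry per GPS pulse-per-second that was latched.
    """
    latched_frames = [f for f, _ in pps_latches]
    # Find the most recent latch at or before the requested frame.
    i = bisect_right(latched_frames, frame) - 1
    if i < 0:
        raise ValueError("frame precedes the first latched pulse")
    latched_frame, second = pps_latches[i]
    # Extrapolate from the latch using the nominal frame period.
    return second + (frame - latched_frame) * FRAME_PERIOD_S
```

With latches at frame 0 and frame 625 (one second is 625 frames of 1.6 ms), frame 700 lands 75 frames, or 0.12 s, after the second latch. Auxiliary data stamped in real time can then be matched against the result.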

  6. Two data streams
     - "Triggered" (NU):
       - TPC lossless Huffman compressed
       - PMT has no compression applied, readout 4 frames (no trimming)
       - Data ~150 MB before compression, ~35 MB after compression
     - "Continuous" (SN):
       - TPC lossy zero-suppression, read out frame-by-frame (15 MB/s per crate)
       - PMT just reads out (~7 MB/s)
       - Preference to triggered stream data
     - Additional data
       - Cosmic tagger panels added around MicroBooNE detector, being (or will be) combined in offline process
       - Read out continuously, matched to TPC data on timestamp
     - [Figure label: Operating point]
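
The lossy zero-suppression used for the continuous stream can be illustrated with a minimal sketch. The function name and the `baseline`, `threshold`, and `pad` parameters are assumptions for illustration; the slides do not describe the actual firmware algorithm at this level.

```python
def zero_suppress(samples, baseline, threshold, pad=2):
    """Keep only regions where the waveform departs from the baseline.

    Returns a list of (start_index, kept_samples) regions. `pad` extra
    samples are kept on each side of every over-threshold sample.
    All parameters are hypothetical, not MicroBooNE settings.
    """
    keep = [False] * len(samples)
    for i, s in enumerate(samples):
        if abs(s - baseline) > threshold:
            lo = max(0, i - pad)
            hi = min(len(samples), i + pad + 1)
            for j in range(lo, hi):
                keep[j] = True

    # Group consecutive kept samples into regions.
    regions = []
    start = None
    for i, flag in enumerate(keep):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, samples[start:i]))
            start = None
    if start is not None:
        regions.append((start, samples[start:]))
    return regions
```

Everything outside the returned regions is discarded, which is what makes the scheme lossy, in contrast to the Huffman coding of the triggered stream.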

  7. Thoughts on readout design
     - First/foremost: it works for the needs of the experiment
       - And works pretty darn well
     - Largest struggles in dealing with real data
       - PMT rate higher than expected → modifications of thresholds/buffers
       - Likely leading cause of DAQ crashes is FIFO overflow on PMT readout
       - TPC noise higher/generally not as expected
       - Huffman compression factor x5 instead of hoped-for x10
       - More complications on continuous readout mode rates
     - Continuous stream competition for resources
       - Despite dedicated readout stream, still some shared resources (data transmission on crate, go to same server)
     - Lacked parasitic data-taking modes for testing DAQ components
       - Hardware-based PMT trigger and continuous stream not online at start of beam → difficulty in commissioning without losing data
       - Also, you really really need to use real data for commissioning

  8. Software data flow
     - MicroBooNE doesn't use artdaq, but shares the overall design
       - I'll translate to artdaq names
     - BoardReaders
       - Receive data from hardware
       - Move to large circular buffer
       - Process, identify data belonging to single event, move to outbound queue
       - Send to EventBuilder
     - EventBuilder+Aggregator (one multi-threaded process for us)
       - Collect fragments
       - When event complete, transfer fragments to raw event queue
       - Process raw events, apply software trigger, write to disk
       - 50 events per file, no filtering into separate files
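
The "collect fragments, release when complete" step can be sketched as below. This is not the MicroBooNE or artdaq code; the class, method, and source names are hypothetical, and the real process is multi-threaded with queues rather than a single dictionary.

```python
from collections import defaultdict


class EventBuilder:
    """Toy event builder: an event is released only when every expected
    source has reported a fragment for it (no partial events)."""

    def __init__(self, expected_sources):
        self.expected = frozenset(expected_sources)
        self.pending = defaultdict(dict)  # event_id -> {source: payload}

    def add_fragment(self, event_id, source, payload):
        """Store one fragment; return the complete event or None."""
        if source not in self.expected:
            raise ValueError(f"unexpected source: {source}")
        fragments = self.pending[event_id]
        fragments[source] = payload
        if set(fragments) == self.expected:
            # All fragments present: hand the event downstream.
            return self.pending.pop(event_id)
        return None
```

A builder created with `EventBuilder(["tpc0", "tpc1", "pmt"])` returns `None` for the first two fragments of an event and the full dictionary on the third. An event with a missing fragment simply stays in `pending`.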

  9. Software trigger
     - High-level software trigger to reduce rate
       - Low-level trigger from neutrino beam gates
       - High-level trigger looks for coincident PMT signals above threshold
       - Accepts prescaled unbiased data
       - <~10 ms per event total algorithm time
       - ~factor 20 reduction in data rate
     - Trigger applied after event-building
       - Limits low-level trigger rate to network bandwidth (20 Hz)/readout crate stability
       - Better to have PMT info at low-level trigger…
     - [Diagram labels: PMT Readout, TPC Readout, Event Builder, Software Trigger (Fail/Pass), Data Logger]
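
A rough sketch of the two accept paths on this slide, a PMT coincidence above threshold plus a prescaled unbiased sample, might look like this. The function names, the sliding-window definition of "coincident", and all parameter values are assumptions; the slide does not specify the actual algorithm.

```python
def pmt_coincidence(waveforms, threshold, min_channels, window):
    """True if at least `min_channels` PMT waveforms exceed `threshold`
    somewhere inside a common window of `window` consecutive samples."""
    if not waveforms:
        return False
    n_ticks = min(len(w) for w in waveforms)
    for start in range(n_ticks):
        stop = min(start + window, n_ticks)
        hits = sum(
            1 for w in waveforms if any(s > threshold for s in w[start:stop])
        )
        if hits >= min_channels:
            return True
    return False


class Prescaler:
    """Accept every n-th event regardless of its content."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def accept(self):
        self.count += 1
        return self.count % self.n == 0


def software_trigger(waveforms, prescaler, threshold, min_channels, window):
    """Return the list of reasons an event passes; an empty list means fail."""
    reasons = []
    # The prescaler is consulted for every event, so its sample is unbiased.
    if prescaler.accept():
        reasons.append("prescale")
    if pmt_coincidence(waveforms, threshold, min_channels, window):
        reasons.append("pmt")
    return reasons
```

Because the decision runs on built events, every low-level trigger still has to be read out and shipped over the network before it can be rejected, which is the rate limitation noted above.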

  10. Thoughts on software design
      - General strategy: everything needs to work, or we get nothing
      - We rely on …
        - Well-formatted data (well, with hard-coded exceptions)
        - In-sync fragments, all fragments report
      - Pros: simpler (no partial events to handle/monitor, everything in shared state); when it works you trust it
      - Cons: one piece goes down, you have nothing; special modes really a bit special
      - This has generally worked well for MicroBooNE
        - Things much more often than not work! But it's a simple system
      - Data format: binary data
        - Needs conversion to offline format, which didn't really happen until later in commissioning → hectic moments in early commissioning to understand data

  11. Additional software
      - Run control
        - Simple console-based python/shell scripts in VNC
        - Highly automated
        - Automatic relaunching of runs, no selection of components, etc.: pick configuration, run length, and go
        - Music to wake shifter in case of major errors
      - Monitoring
        - Custom metrics reported to real-time database with ganglia
        - Some reported to SlowMonitoring / central alarm area
        - The ones that aren't are "expert" level
        - Online data processing to monitor basic PMT and TPC waveforms/activity rates
        - Runs off of spying data in shared memory, processes binary data
      - Logging
        - We just write log files out for history
      - Configuration database
        - PSQL ↔ FCL tool: upload new configs by making new fcl files

  12. Thoughts on those additional elements
      - MicroBooNE gets away with a highly-automated console-based DAQ because there are not too many components, and the overall system is simple
      - Configurations must be carefully maintained… can create high load on experts
      - Online monitoring runs off of raw binary data, separate from the offline data format → changes to the monitored quantities come only slowly
        - In periods of duress, we demand the swift conversion of files and dedicated people to continuously analyze the data
      - Don't collect enough run information into databases
        - E.g. local log text files written with run uptime, to be used for POT integration information
        - → If it's worth having, plan to store in a database

  13. Data Management
      - DAQ responsibility ends once file hits local disk
      - Online DM takes over for getting file from disk to tape-backed storage
        - Automated local processes for …
        - Search for new files
        - Generate metadata/auxiliary files
        - Copy to outbound dropbox
        - Monitor when data is whisked away
        - Cleanup local files
      - Nearline/Offline DM takes it from there
        - Automated processes on grid for ...
        - Keep-up "swizzling" (reformatting) and reconstruction
        - Occupies ~100 nodes for "normal" (1 Hz) data rates
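
The online DM steps listed on this slide (find new files, generate metadata, copy to the dropbox, clean up once the data has left) could be sketched as two small functions. Everything here is hypothetical: the `*.bin` pattern, the `.shipped` marker files, the JSON metadata fields, and the function names are illustrative choices, not the experiment's tooling.

```python
import hashlib
import json
import shutil
from pathlib import Path


def ship_new_files(data_dir, dropbox_dir, pattern="*.bin"):
    """Copy not-yet-shipped files and a metadata sidecar to the dropbox."""
    data_dir, dropbox_dir = Path(data_dir), Path(dropbox_dir)
    shipped = []
    for f in sorted(data_dir.glob(pattern)):
        marker = f.with_name(f.name + ".shipped")
        if marker.exists():
            continue  # already handled in an earlier pass
        meta = {
            "file_name": f.name,
            "size_bytes": f.stat().st_size,
            "sha256": hashlib.sha256(f.read_bytes()).hexdigest(),
        }
        (dropbox_dir / (f.name + ".json")).write_text(json.dumps(meta))
        shutil.copy2(f, dropbox_dir / f.name)
        marker.touch()  # remember that this file has been shipped
        shipped.append(f.name)
    return shipped


def cleanup_local(data_dir, dropbox_dir, pattern="*.bin"):
    """Delete local files whose dropbox copy has been taken away."""
    data_dir, dropbox_dir = Path(data_dir), Path(dropbox_dir)
    removed = []
    for f in sorted(data_dir.glob(pattern)):
        marker = f.with_name(f.name + ".shipped")
        if marker.exists() and not (dropbox_dir / f.name).exists():
            f.unlink()
            marker.unlink()
            removed.append(f.name)
    return removed
```

Note that the checksum and the copy both read the file from the DAQ machine's disk, the kind of local resource use that can compete with DAQ functions.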

  14. General thoughts
      - Requires very close coordination of DAQ and DM groups
        - DM group needs local DAQ cluster resources (CPU, disk read/write, network bandwidth) that can compete with DAQ functions
      - MicroBooNE woefully underestimated its data rate, volume, and resource needs
        - From TDR:
          - Expected final compression was x10 (we achieved only x5)
          - Expected recorded data rate from BNB was 0.05 Hz (actual: ~0.15 Hz)
          - No careful accounting of any additional trigger sources (reality → ~0.7 Hz total rate)
        - And physics groups demand more data still...
        - Need carefully validated and realistic data volume and resource estimates
      - Additional considerations to ease offline DM?
        - MicroBooNE DAQ writes everything to one file
          - e.g. Filtering on trigger streams likely would help offline re-swizzling/reconstruction
        - "Swizzling" takes significant resources
          - Reduce reformatting? Improve/make less necessary decompression routines?
      - To borrow from Josh Klein: try to be less paranoid and greedy
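
A back-of-the-envelope comparison shows how the two misestimates compound. This sketch combines the ~150 MB uncompressed and ~35 MB compressed event sizes from the "Two data streams" slide with the rates quoted here; treating 150 MB as the event size the TDR also assumed is my assumption, not something the slides state.

```python
def data_rate_mb_per_s(event_size_mb, trigger_rate_hz):
    """Average data rate to disk for a given event size and trigger rate."""
    return event_size_mb * trigger_rate_hz


RAW_EVENT_MB = 150.0        # uncompressed triggered event (slide 6)
COMPRESSED_EVENT_MB = 35.0  # observed size after compression (slide 6)

# Planned: x10 compression and 0.05 Hz recorded rate (TDR numbers on this slide).
planned = data_rate_mb_per_s(RAW_EVENT_MB / 10, 0.05)

# Actual: observed compressed size and ~0.7 Hz total trigger rate.
actual = data_rate_mb_per_s(COMPRESSED_EVENT_MB, 0.7)

ratio = actual / planned
```

Under these assumptions the planned rate is about 0.75 MB/s and the actual rate about 24.5 MB/s, a factor of roughly 30, with most of it coming from the trigger rate rather than the compression shortfall.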

  15. Conclusions/discussion
      - MicroBooNE DAQ is running, running well, and fits our physics needs
      - Very useful experience running a real physics experiment
        - With real results! And MORE COMING!
      - Elements to learn, as discussed, from design point of view
        - For multiple data streams, need careful evaluation of shared resources
        - Need flexibility in data handling, compression, and triggering
        - Early and close integration with data management
        - Design for realistic data rates/volume at a global/integrated level (DAQ+DM)
        - And then resist pressure for changes without complete reevaluation of entire chain
      - Also, MicroBooNE has loads of operational experience/advice, but won't dwell on that here…
      - DISCUSSION TIME
