A Dispatcher crash can cause upstream senders to stop processing events
At protoDUNE, I noticed that when the Dispatcher exited soon after the start of a run (for example, because broadcast_buffer_size was too small to handle the Init message), some fraction of the EventBuilders that were configured to send events to the Dispatcher would stop processing events after the first few.
This was traced to a simple bug in DataSenderManager, in which a retry counter was not being incremented.
I'll try to describe how to reproduce this on a teststand at Fermilab later, but for now, I'm primarily filing this issue so that I have an Issue number to put in the branch name with the fix.
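For illustration only, here is a minimal sketch of the kind of bounded retry loop in which a missing counter increment causes this symptom. It is not the DataSenderManager code: send_with_retries, SendResult, and try_send are hypothetical names, and the loop structure is an assumption. If the increment is omitted and the receiver has gone away, the loop condition never changes and the sender stays stuck on one event.

```cpp
#include <functional>

// Hypothetical result type for this sketch (not an artdaq type).
struct SendResult {
    bool delivered;  // true if some attempt succeeded
    int attempts;    // total number of calls made to try_send
};

// Hypothetical helper: make one initial attempt, then at most
// max_retries further attempts while the send keeps failing.
SendResult send_with_retries(const std::function<bool()>& try_send,
                             int max_retries)
{
    int retries = 0;
    bool ok = try_send();  // initial attempt

    while (!ok && retries < max_retries) {
        // Without this increment, a receiver that never comes back
        // keeps the loop running forever and blocks the sender.
        ++retries;
        ok = try_send();
    }
    return SendResult{ok, retries + 1};
}
```

With a destination that always fails and max_retries set to 2, this sketch gives up after three attempts instead of blocking.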
#1 Updated by Kurt Biery almost 2 years ago
The branch name is bugfix/22119_DataSenderManager_RetriesIncrement in the artdaq repo, and it was branched from the for_dune-artdaq branch. I've also committed a change to RootNetOutput_module.cc (on this branch) to add its app_name to its TRACE_NAME to help with debugging.
#2 Updated by Kurt Biery almost 2 years ago
Related to this issue...
When I tried setting the send_retry_count parameter in the EB rootNetOut config block to zero, Online Monitoring would never get the Init message or any events.
Looking at the code, a value of zero for this parameter should be valid (one initial try, but no retries).
However, there appeared to be a misunderstanding in the code as to whether an initial attempt had already been made. I've made a second commit to DataSenderManager (DSM) with a fix for this.
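As a rough sketch of how this kind of misunderstanding can show up (an assumption about the general pattern, not a copy of the DSM code), compare a loop that treats send_retry_count as the total number of tries with one that treats it as the number of tries after the first. Both function names and the try_send callback are hypothetical.

```cpp
#include <functional>

// Hypothetical: written as if an initial attempt had already been made
// elsewhere, so the loop only covers the retries. With a count of zero,
// nothing is ever sent.
int attempts_retry_only(const std::function<bool()>& try_send,
                        int send_retry_count)
{
    int attempts = 0;
    for (int retry = 0; retry < send_retry_count; ++retry) {
        ++attempts;
        if (try_send()) break;
    }
    return attempts;
}

// Hypothetical: the loop itself makes the initial attempt, followed by
// up to send_retry_count retries. A count of zero still sends once.
int attempts_initial_plus_retries(const std::function<bool()>& try_send,
                                  int send_retry_count)
{
    int attempts = 0;
    for (int attempt = 0; attempt <= send_retry_count; ++attempt) {
        ++attempts;
        if (try_send()) break;
    }
    return attempts;
}
```

In the first form, a count of zero means the Init message is never sent at all, which matches the symptom seen in Online Monitoring; in the second form, it is sent exactly once.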