Bug #22119

A Dispatcher crash can cause upstream senders to stop processing events

Added by Kurt Biery almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 03/13/2019
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

At protoDUNE, I noticed that when the Dispatcher exited soon after the start of a run (for example, because broadcast_buffer_size was too small to handle the Init message), some fraction of the EventBuilders that were configured to send events to the Dispatcher would stop processing events after the first few.

This was traced to a simple bug in DataSenderManager, in which a retry counter was not being incremented.
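To illustrate the class of bug, here is a minimal sketch of a bounded retry loop. It is not the DataSenderManager code: send_with_retries, its parameters, and the loop structure are all hypothetical. The point is only that if the marked increment is missing, a send that keeps failing (for example, to a Dispatcher that has gone away) never exhausts its retries, and the caller stays stuck in the loop instead of moving on to the next event.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper, NOT the artdaq API.
// Makes one initial attempt, then retries up to `retry_limit` more times.
// Returns true as soon as any attempt succeeds, false once retries run out.
bool send_with_retries(const std::function<bool()>& send, std::size_t retry_limit)
{
    assert(send);  // a callable must be supplied

    std::size_t retries = 0;
    bool ok = send();  // initial attempt

    while (!ok && retries < retry_limit) {
        ++retries;  // without this increment the loop never ends on a dead receiver
        ok = send();
    }
    return ok;
}
```

With the increment in place, a receiver that never accepts data costs the sender at most retry_limit + 1 attempts, after which it can give up on that destination and continue.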

I'll try to describe how to reproduce this on a teststand at Fermilab later, but for now, I'm primarily filing this issue so that I have an Issue number to put in the branch name with the fix.

History

#1 Updated by Kurt Biery almost 2 years ago

The branch name is bugfix/22119_DataSenderManager_RetriesIncrement in the artdaq repo, and it was branched from the for_dune-artdaq branch. I've also committed a change to RootNetOutput_module.cc (on this branch) to add its app_name to its TRACE_NAME to help with debugging.

#2 Updated by Kurt Biery almost 2 years ago

Related to this issue...

When I tried setting the send_retry_count parameter in the EB rootNetOut config block to zero, Online Monitoring would never get the Init message or any events.

Looking at the code, a value of zero for this parameter should be valid (one initial try, but no retries).

However, there appeared to be a misunderstanding in the code as to whether an initial attempt had already been made. I've made a second commit to DSM with a fix for this.
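The intended meaning of a zero retry count can be sketched as follows. This is an assumption-laden illustration, not the DSM implementation: total_attempts and attempts_made are hypothetical names, and the only thing taken from the discussion above is that send_retry_count counts retries after the initial try. Counting attempts explicitly, rather than retries, avoids any ambiguity about whether the first try has already happened.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical: one initial try plus `send_retry_count` retries.
std::size_t total_attempts(std::size_t send_retry_count)
{
    return send_retry_count + 1;
}

// Hypothetical sender loop, NOT the artdaq API.
// Calls `send` until it succeeds or the attempt budget is used up,
// and returns how many attempts were actually made.
std::size_t attempts_made(const std::function<bool()>& send,
                          std::size_t send_retry_count)
{
    const std::size_t limit = total_attempts(send_retry_count);
    std::size_t attempts = 0;

    while (attempts < limit) {
        ++attempts;      // count every attempt, including the first
        if (send()) {
            break;       // stop as soon as a send succeeds
        }
    }
    return attempts;
}
```

In this sketch, a send_retry_count of zero still results in exactly one call to send, so an Init message would be attempted once rather than never.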

#3 Updated by Eric Flumerfelt almost 2 years ago

  • Status changed from Assigned to Resolved

With the latest DAQInterface, testing this is as trivial as sending a Dispatcher SIGSTOP during a demo run.

#4 Updated by Eric Flumerfelt almost 2 years ago

  • Status changed from Resolved to Reviewed
  • Co-Assignees Eric Flumerfelt added

#5 Updated by Eric Flumerfelt almost 2 years ago

  • Target version set to artdaq v3_04_01
  • Status changed from Reviewed to Closed
