Provide moderate synchronization when sending fragments from multiple BoardReaders
In systems that don't have hardware triggers, it's possible for the sending of fragments from some of the BoardReaders to get rather far ahead of other BoardReaders.
The place where we first noticed this (long ago) was the artdaq-demo before we implemented the 0.1 second sleep in the BRs. This can easily be reproduced in the demo with version 1.12.12 of artdaq by setting the ToyGenerator sleep times to zero. To see the problem, look in the EventBuilder DAQ metrics that are sent to the FileMetric reporting destination (e.g. /tmp/eventbuilder/evb_*_metrics.log and search for the reporting of "Incomplete Event Count" and see that it can become quite large. Also, when you try to stop the run, there is a good chance that it will take a really long time to complete because the lagging BoardReader has to finish sending all of its fragments, complete events need to be made in the EBs, etc.
We have also seen the problem in real experiment environments. For example, in the DUNE 35-ton DAQ, if the readout of the data from one of the RCEs or SSPs stops for some reason, the readout of the data from the remaining components can continue to run, and the EventBuilder EventStores can consume lots of memory for incomplete events. As in the demo, stopping the run can take a long time, and often fails, when this happens.
For quite a while, we were thinking that the way to address this problem was to time out incomplete events (Issue #3187) or exert backpressure when a configurable number of incomplete events was reached in the EventStore (Issue #6727). Given our experience with the 35t DAQ, the first option seemed like yet another silent failure that shifters could fail to notice. The second option might be better, because it could provide warning messages when the EventStore is full, but selectively throttling fragments from certain BoardReaders would probably be tricky to implement.
An easier solution is to use MPI_Barrier(s) to periodically synchronize the sending of fragments from the BoardReaders. Historically, this option was not used because of the blocking nature of MPI_Barrier and the possibility that an EndRun message might not get processed because a BoardReader's process_fragments thread was blocked on the barrier. However, the latest versions of MPICH and mvapich2 have a non-blocking barrier call available: MPI_Ibarrier.
This issue requests that we investigate whether MPI_Ibarrier can be used to implment a non-blocking synchronization of the sending of fragments from the BoardReaders.
#5 Updated by Kurt Biery over 5 years ago
- Status changed from Assigned to Resolved
Changes have been made to BoardReaderCore.h/cc to implement this feature.
The scheme that was implemented calls MPI_Ibarrier every N fragments but allows more fragments to flow until a threshold is reached (before the next N fragments are processed). At that point, the code waits for the other BoardReaders to catch up (all BoardReaders call MPI_Ibarrier), and then it continues. During the time that it is waiting, the code prints out configurable messages to say that it is waiting.There are a number of configuration parameters that control how this feature behaves:
- mpi_sync_interval - this is the number of fragments that will be used by the BoardReaders to decide when to call MPI_Ibarrier. That is, MPI_Ibarrier is called every N fragments, where N==mpi_sync_interval. A value of zero for this parameter is the default, and this value turns off the Ibarrier functionality.
- mpi_sync_wait_threshold_fraction - this is the fraction of the N fragments per Ibarrier call that are allowed to be processed before the BoardReader decides to wait for its siblings to catch up. This value should be GE 0.0 and LT 1.0. The default value is 0.5.
- mpi_sync_wait_interval_usec_ - when a BoardReader is waiting for its siblings to catch up, this is the number of microseconds that it will wait between calls to MPI_Test to see if all BoardReaders have called MPI_Ibarrier. The default value is 100 usec. Users should be careful not to make this value too large, lest it affect overall performance.
- mpi_sync_wait_log_level - this parameter specifies whether the BoardReader should log the "I'm waiting" messages as Warnings (value=2) or Errors (value=3). The default is 2.
- mpi_sync_wait_log_interval_sec - the time interval between the "I'm waiting" messages. The default value is 10 seconds. Actually, each time the code prints out a message, it increases the interval by this amount. So, a value of 10 will cause messages to be printed out after 10 seconds, then after 20 seconds, then after 30 seconds, etc.