Occasionally, online monitoring processes get stale data
I've seen this happen both in the artdaq-demo, running on mu2edaq01, and at protoDUNE.
One symptom is that an OnMon process reports that the first event that it got was of type 'EndOfData', and it immediately exits (before any data has actually been sent from the Dispatcher). Another symptom is that it reports seeing several 'Data' events and then gets garbage. This condition will persist until the shared memory segment that the OnMon process is using to talk to the Dispatcher art process gets cleaned up.
I've found that I can trigger this situation by running the demo with OnMon enabled, closing one or both of the OnMon windows (no ctrl-c, just close the x-term), and then running the demo again. The command that I've been using to run the demo is the following:
- sh ./run_demo.sh --config mediumsystem_with_routing_master --bootfile `pwd`/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component03 --runduration 40 --partition=4
I believe that the cause of this issue is a stale shared memory segment between the Dispatcher art process and the OnMon process. Typically, this shmem segment gets cleaned up, so there is no problem. I've also seen that if I ctrl-c in the OnMon window, the shmem segment does get cleaned up. However, there are cases in which the segment does not get cleaned up, and with the current code, the bad state can persist for a number of invocations of run_demo (e.g. if the user doesn't run 'ipcrm -a' between them).
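For reference, here is a rough sketch of how leftover segments could be spotted between runs without removing everything via 'ipcrm -a'. It assumes the Linux (util-linux) 'ipcs -m' column layout (key, shmid, owner, perms, bytes, nattch, status) and treats a segment with zero attached processes as stale, which is only a heuristic. The function name stale_shm_cleanup_cmds is made up for this example; it is not part of artdaq or run_demo.sh.

```sh
#!/bin/sh
# Sketch only: read `ipcs -m` output on stdin and print one `ipcrm -m <shmid>`
# command for each segment that belongs to the given owner and has no
# attached processes (nattch == 0). Nothing is removed here.
# Assumed columns: key shmid owner perms bytes nattch [status]
stale_shm_cleanup_cmds() {
    # Default to the current user; note that ipcs may truncate long user names.
    shm_owner="${1:-$(id -un)}"
    awk -v owner="$shm_owner" '
        # Header and separator lines are skipped because field 2 is not numeric.
        NF >= 6 && $2 ~ /^[0-9]+$/ && $3 == owner && $6 == "0" {
            print "ipcrm -m " $2
        }
    '
}

# Show candidate cleanup commands for the current user, if ipcs is available.
if command -v ipcs >/dev/null 2>&1; then
    ipcs -m | stale_shm_cleanup_cmds
fi
```

The output is just a list of commands to review. A segment with zero attaches may simply be waiting for its first reader, so I would only run the printed commands while no DAQ processes are active.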
#1 Updated by Kurt Biery 11 months ago
Eric and I have talked about a couple of options to handle this, but nothing seems ideal so far.
One option is to have SharedMemoryManager warn the user when a shared memory segment owner re-uses a stale shmem segment (instead of creating a new one) and mark the segment for future removal. The code for this has been committed on the feature/21937_SMM_StaleSegmentWarnAndContinue branch in the artdaq_core repository. This seems like a work-around.
There is a race condition with this approach: if the owner marks the segment for future removal before any other processes attach, then those other processes won't be able to attach. To avoid that, we could skip the removal, just warn the user, and suggest manual cleanup.
For now, this code is available, but as I say, it's not yet clear whether this is the best option.
#2 Updated by Kurt Biery 11 months ago
Another option is to simply say that processes that are supposed to own shared memory always take ownership, even if the shared memory segment already exists.
Code to do that has been committed to feature/21937_SMM_StaleSegmentTakeOwnershipAnyway.
This new code also prints out the commands to clean up the stale segment, if the user wants to try that.