We should protect against all module failures at end run so that files get closed correctly
In testing at LNGS, I've seen online monitoring modules throw exceptions at end run time, and this causes the disk file to not get closed correctly. We should protect against such problems so that disk files get closed correctly.
#2 Updated by Kurt Biery almost 8 years ago
I understand from talking with Chris that this functionality is not currently available on a module-by-module basis. So, it would be useful to understand what it would take to make it available on a module-by-module basis.
Please summarize that and give a ballpark estimate of the time that it would take.
#3 Updated by Paul Russo almost 8 years ago
A view from 30,000 feet of things that can be done, and places of attack.
So there are different layers of the software which each need to have
their own robustness solutions implemented, and implemented in such a
way that the layers cooperate with each other in a nice way.
At the lowest layer we have the individual art module, which needs to
make some choices when it determines that it cannot proceed properly.
It can simply return with success and hide the error condition from
the rest of the system. This is most likely the right thing to do for a
monitoring process which is in-line with the data taking system. This
doesn't mean it cannot complain to the log; it can certainly be as noisy
as it wants!
It can also choose to throw an art exception of a particular kind, which
can be configured from the fhicl file to cause the art event loop to
consider the module to have failed, or the path to have failed, or to
skip the whole event.
Or it can choose to throw a fatal error, which will cause the whole art
loop to shut down permanently and require the entire daq system
to be restarted (because art cannot be re-entered once it has exited
from the run_art() routine).
Unfortunately there are error conditions which can arise during the
execution of an art module which are not so easily dealt with. A
floating point exception can be caught and handled, with difficulty, but
an integer divide by zero, or a null pointer dereference is going to
cause the process to crash, which will result in the whole daq system
coming down. This can be dealt with using the more advanced features of
MPI, such as dynamic process management, but this is not a small
project. A little more on this later.
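As a rough illustration of avoiding these crashes inside a module rather than recovering from them, here is a minimal sketch. The helper names checked_div and checked_fdiv are hypothetical and not part of art or artdaq. The integer version tests the operands up front, and the floating point version inspects the IEEE status flags after the operation instead of enabling traps. It does nothing for null pointer dereferences.

```cpp
#include <cfenv>
#include <limits>

// Hypothetical helper: integer division that reports failure instead of
// executing an undefined operation. Division by zero and LONG_MIN / -1
// are both undefined behaviour in C++ and raise SIGFPE on x86.
bool checked_div(long num, long den, long& out) {
    if (den == 0) return false;
    if (num == std::numeric_limits<long>::min() && den == -1) return false;
    out = num / den;
    return true;
}

// Hypothetical helper: floating point division that checks the status
// flags afterwards. The volatile locals keep the compiler from folding
// the division away at compile time.
bool checked_fdiv(double num, double den, double& out) {
    volatile double n = num;
    volatile double d = den;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double q = n / d;
    if (std::fetestexcept(FE_DIVBYZERO | FE_INVALID)) return false;
    out = q;
    return true;
}
```

A module using helpers like these could log the failure and return normally, which matches the "hide the error and complain to the log" choice described earlier.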
At the next layer up is the art path. With some fhicl file gymnastics
you can play games with allowing a path to fail, but not all the paths,
and the meaning of failure is different between trigger paths and end
paths.
At the next layer up are the different threads of control, the board
reader thread, the eb communication/event queue inserter thread, the eb
to ag communication thread, the ag input thread, the ag art event loop
thread, and all of the threads doing xmlrpc communication with the
You would like to be able to detect if one of these threads gets stuck
and stops doing work. You would also like to be able to recover if one
of them crashes and needs to be restarted. This becomes an exercise in
singleton/global data management, and managing communication states
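One common way to detect a stuck thread is a heartbeat counter that the worker bumps on every loop iteration and a monitor checks periodically. The sketch below assumes that approach. Heartbeat and StallDetector are hypothetical names, not existing artdaq classes, and restarting a thread after a stall is left out entirely.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical heartbeat: the monitored thread calls beat() once per
// iteration of its work loop.
class Heartbeat {
public:
    void beat() { count_.fetch_add(1, std::memory_order_relaxed); }
    std::uint64_t count() const { return count_.load(std::memory_order_relaxed); }
private:
    std::atomic<std::uint64_t> count_{0};
};

// Hypothetical monitor-side check. stalled() returns true when the
// counter has not advanced for longer than the timeout.
class StallDetector {
public:
    using Clock = std::chrono::steady_clock;

    StallDetector(const Heartbeat& hb, Clock::duration timeout,
                  Clock::time_point now = Clock::now())
        : hb_(hb), timeout_(timeout), last_count_(hb.count()), last_change_(now) {}

    bool stalled(Clock::time_point now = Clock::now()) {
        const std::uint64_t c = hb_.count();
        if (c != last_count_) {
            // Progress was made since the last check; restart the timer.
            last_count_ = c;
            last_change_ = now;
            return false;
        }
        return now - last_change_ > timeout_;
    }

private:
    const Heartbeat& hb_;
    Clock::duration timeout_;
    std::uint64_t last_count_;
    Clock::time_point last_change_;
};
```

The time point is passed in explicitly so the check can be exercised without sleeping. In a running system the monitor would call stalled() with the default argument.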
At the next layer up are the MPI processes. The MPI process manager,
mpi_run/mpi_rsh/mpiexec/hydra, when used in the simplest way, makes
sure that when one MPI process dies the whole MPI job dies. You can
change this behavior, for example you can set the MPI library to return
error codes instead of dying, but you then have to check every MPI call
for errors, and have some policy for what to do in the case of a fatal
As mentioned before, using the more advanced features of the library it
is possible to control processes individually, starting and stopping
them dynamically. The difficulty here is arranging for a restarted
process to re-establish a valid communication channel and state with its
partners. This can be done, but it is not a small project.
#5 Updated by Kurt Biery almost 8 years ago
Notes from Alessandro/Jim/Kurt discussion:
Alessandro would like files to always get closed, even if there is an exception or a signal.
Division by zero is a special case. With other signals, the current work is finished, and then things are shut down.
Does the Aggregator have enough information internally to close the file with sufficient metadata when a signal is received?
We need to check how mpirun propagates a ctrl-c to the individual processes.
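For the exception half of "files always get closed", the idea can be sketched as a guard around the end-of-run calls. This is only an illustration: EndRunHook and end_run_then_close are hypothetical names, and the sketch does not reflect how art actually sequences its end-run callbacks or its output module. Signals are not covered here.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one module's end-of-run callback.
struct EndRunHook {
    std::string label;
    std::function<void()> call;
};

// Runs every hook, records failures instead of letting them propagate,
// and then always invokes close_file. Returns one message per hook
// that threw.
std::vector<std::string> end_run_then_close(const std::vector<EndRunHook>& hooks,
                                            const std::function<void()>& close_file) {
    std::vector<std::string> failed;
    for (const auto& h : hooks) {
        try {
            h.call();
        } catch (const std::exception& e) {
            failed.push_back(h.label + ": " + e.what());
        } catch (...) {
            // Catch-all, since user code may throw anything.
            failed.push_back(h.label + ": unknown exception");
        }
    }
    close_file();
    return failed;
}
```

With this shape, a monitoring hook that throws std::runtime_error("histogram missing") ends up in the returned list, and the file is still closed after the remaining hooks have run.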
#6 Updated by Kurt Biery almost 8 years ago
- Status changed from Resolved to Assigned
In a follow-up discussion on 02-Jul between Chris, Marc, and Kurt, this topic was discussed a little bit more. Here are my notes from that discussion:
1) exceptions are already handled by art, but in the case of artdaq/ds50daq, art is run in a thread, and it may not be clearly defined how signals are sent to the different threads
1.1) the recommendation was made to set the thread mask so that only the main thread gets signals, and it puts the right thing(s) on the queue to tell art how to react
1.2) Jim's MPI/PMT shim may be needed to get the most reliability that we can
1.3) (internal) questions include: how could a fatal error in one part of the MPI program get turned into a graceful shutdown in another part?
1.4) Possible action items (lower priority than other things that we discussed, IMO):
1.4.1) investigate/improve how signal and interrupt handling is done
1.4.2) improve the way that PMT responds to errors and signals, including Jim's shim
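The recommendation in 1.1 could look roughly like the following on a POSIX system. This is a sketch under assumptions: Command, CommandQueue, block_shutdown_signals and wait_and_forward are hypothetical names, and how art would consume the queue is not shown. The POSIX calls used are pthread_sigmask and sigwait.

```cpp
#include <deque>
#include <mutex>
#include <pthread.h>
#include <signal.h>

// Hypothetical command that the art thread would act on.
enum class Command { EndRunAndClose };

// Hypothetical thread-safe queue between the main thread and art.
class CommandQueue {
public:
    void push(Command c) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(c);
    }
    bool try_pop(Command& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<Command> q_;
};

// Call in main() before starting any other thread. New threads inherit
// the mask, so SIGINT and SIGTERM end up blocked in every thread.
sigset_t block_shutdown_signals() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &set, nullptr);
    return set;
}

// Call from the main thread. Waits synchronously for one of the blocked
// signals and turns it into a queue entry. Returns the signal number,
// or -1 if sigwait fails.
int wait_and_forward(const sigset_t& set, CommandQueue& q) {
    int sig = 0;
    if (sigwait(&set, &sig) != 0) return -1;
    q.push(Command::EndRunAndClose);
    return sig;
}
```

Because the signal is consumed with sigwait rather than an asynchronous handler, the code that reacts to it runs in ordinary thread context and is free to take locks.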
2) Chris suggested that "workers" could be changed to optionally swallow exceptions based on a configuration parameter. We talked through various types of exceptions, and we agreed that there will probably always need to be a catch-all case (since we can't foresee all exceptions that user code might throw). I made my usual request for module-based configurability (in addition to the exception-based configurability that already exists).
2.1) Possible action items
2.1.1) a quick solution might be to allow all non-art exceptions to be ignored. This would be configurable and would allow "next event" and "next module" options.
2.1.2) a medium-term solution might be to allow module-based configurability of suppressing exceptions and continuing
Marc's correction to these notes was the following:
I think by "ignored" that you mean art can be configured to not shut down upon the throwing of such an exception. If this is the case, then I would prefer not to call this "ignored", because art has to take positive action to prevent the shutdown. I'd rather say that we could make art configurable to handle all non-art exceptions in the same way we handle art exceptions.
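To make the module-based configurability in item 2 concrete, here is a minimal sketch of a worker wrapper with a per-module policy. OnException, Outcome, Worker and process_event are all hypothetical and are not art's classes or configuration names. The policy values simply mirror the "next module" and "next event" options from the notes. In line with Marc's point, the exception is not ignored: the wrapper catches it and takes an explicit action.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-module policy for exceptions thrown by module code.
enum class OnException { Rethrow, NextModule, NextEvent };
enum class Outcome { Ok, ModuleSkipped, EventSkipped };

class Worker {
public:
    Worker(std::string label, std::function<void()> body, OnException policy)
        : label_(std::move(label)), body_(std::move(body)), policy_(policy) {}

    const std::string& label() const { return label_; }

    // Runs the module body and maps any exception onto the policy.
    Outcome run() const {
        try {
            body_();
            return Outcome::Ok;
        } catch (...) {
            switch (policy_) {
                case OnException::NextModule: return Outcome::ModuleSkipped;
                case OnException::NextEvent:  return Outcome::EventSkipped;
                case OnException::Rethrow:    break;
            }
            throw;  // Rethrow: propagate the original exception unchanged.
        }
    }

private:
    std::string label_;
    std::function<void()> body_;
    OnException policy_;
};

// Runs one event through the workers in order. Returns false if the
// event was abandoned; 'completed' collects the labels that ran cleanly.
bool process_event(const std::vector<Worker>& path,
                   std::vector<std::string>& completed) {
    for (const auto& w : path) {
        const Outcome o = w.run();
        if (o == Outcome::EventSkipped) return false;
        if (o == Outcome::Ok) completed.push_back(w.label());
    }
    return true;
}
```

A monitoring module could then be given NextModule while the output module keeps Rethrow, so a monitoring failure does not stop the event from being written.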