Additional robustness for picking run number
Our offline colleagues really don't want to see duplicate run numbers under any circumstances. We should have additional machinery in DAQInterface to ensure that it never uses a run number that has already been used for an experiment.
JCF: Issue #23482: device identifier for run records differs across nodes so remove it from the run records integrity check
JCF: Issue #23482: check_run_record_integrity has been added at the beginning of start, to check that the run record has the expected inode
#1 Updated by John Freeman over 1 year ago
To think about how to do this, let's first address how run numbers are currently handled in DAQInterface:
The actual DAQInterface instance doesn't determine the run number; it only receives it as an argument for the start transition. However, an experiment may decide to use the convenience script "send_transition.sh" which comes in the DAQInterface package. In that case the experiment can provide any run number it wants, and the script will just pass it on to the DAQInterface instance via an XML-RPC call, or it can let the script determine the run number. The script does this by finding the highest-numbered run in the run records directory (as defined in $DAQINTERFACE_SETTINGS) and adding 1 to it.
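To make the "highest run plus 1" rule concrete, here's a minimal Python sketch. It is not the code in send_transition.sh (which is a shell script); the function name next_run_number and the assumption that each run record is a subdirectory named by its run number are mine, for illustration only.

```python
import os
import re


def next_run_number(record_dir):
    """Return 1 + the highest run number found in record_dir.

    Assumption (not confirmed by this Issue): every run record is a
    subdirectory whose name is just the run number, e.g. "101".
    """
    highest = 0
    for name in os.listdir(record_dir):
        path = os.path.join(record_dir, name)
        # Skip anything that isn't an all-digits subdirectory
        if re.fullmatch(r"\d+", name) and os.path.isdir(path):
            highest = max(highest, int(name))
    return highest + 1
```

Note that in this sketch the answer depends entirely on what's currently in the directory: an empty directory yields 1.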
Needless to say, if something terrible happens, like a disk failure where the run records directory gets partially or completely wiped out, this algorithm will not generate a unique run number. And if an experiment decides not to use the algorithm but instead to pass its own run number on to send_transition.sh (or not even use send_transition.sh at all), the phase space of failure modes for run number choice is essentially infinite.
The questions then to ask are:
-What information do we want DAQInterface to access so it can determine the likelihood that its run number may be a repeat?
-How do we want that information to be stored?
If an experiment's using the artdaq_database, then Issue #23490 will help: the database will provide a function which tells DAQInterface what the last run number is, and DAQInterface can consult that function on run start. In the event of damage to the run records directory, hopefully there isn't also damage to the database, or at least it's been backed up more reliably than the run records directory.
If an experiment's not using the database, it gets a bit trickier. There's information like the run records directory's disk ID and the directory's inode, which DAQInterface could crosscheck against the current run records directory to see whether it looks like damage occurred to the directory. You could even think of taking the checksum of the directory after a run finishes and saving it somewhere as well. But then the question becomes, where should that information get stored that's safer than the run records directory?
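As a rough illustration of the kind of information being discussed, here's a Python sketch that collects the disk (device) ID, the inode and a checksum for a directory, and reports which of them have changed relative to a saved copy. The function names, the dictionary layout, and the choice to checksum only the sorted entry names are all assumptions of this sketch; where the saved copy should live is exactly the open question above.

```python
import hashlib
import os

FIELDS = ("device", "inode", "checksum")


def directory_fingerprint(record_dir):
    """Collect the device ID, inode and a simple checksum of record_dir."""
    st = os.stat(record_dir)
    digest = hashlib.sha256()
    # Assumption: hashing the sorted entry names is an adequate "checksum
    # of the directory"; file contents are not read in this sketch.
    for name in sorted(os.listdir(record_dir)):
        digest.update(name.encode("utf-8") + b"\0")
    return {
        "device": st.st_dev,
        "inode": st.st_ino,
        "checksum": digest.hexdigest(),
    }


def changed_fields(saved, record_dir):
    """Return the names of the fingerprint fields that no longer match."""
    current = directory_fingerprint(record_dir)
    return [field for field in FIELDS if saved[field] != current[field]]
```

A changed checksum on its own just means entries were added or removed since the fingerprint was taken, so it would have to be refreshed after every run. A changed inode is the stronger hint that the directory was recreated.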
Final thought: one of the experiment-overridable functions described in Issue #22806 is start_datataking, which DAQInterface calls after sending the start transition to the artdaq processes. An experiment which generates its own run number without using send_transition.sh could put logic in that function to guard against a repeated run number, although the fact that it's called after sending the start transition to the processes may mean it's too late.
#2 Updated by John Freeman over 1 year ago
- % Done changed from 0 to 90
This Issue is partially resolved in that it's waiting on Issue #23490; however, for users who aren't working with the database, it's fully resolved. With the head of feature/23482_unique_run_number, commit a5f8bb812d17221be27d0e28074f599030cc98c1, the integrity of the run records directory is checked on run start. If the directory passes that check, an exception is thrown if the requested run number already has a record in the directory.
The integrity of the run records directory is checked in the following manner (we'll call the run records directory "/run_records"):
If a file called /run_records/.record_directory_info exists, its contents (the inode of the record directory when .record_directory_info was created) are compared with the result of a call to a function called "record_directory_info", which provides the current inode of the record directory. If they match, then the test passes. If they don't match, an exception is thrown, and users are instructed on how to create an up-to-date .record_directory_info file...but only after being warned about the possibility of run duplicates.
If /run_records/.record_directory_info doesn't exist, it gets created. However, if the run records directory already contained run records, an exception is thrown and users are warned about the possibility of run duplicates.
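The two cases above, plus the duplicate run number check, can be sketched in Python roughly as follows. This is not the code from the commit. The names check_run_record_integrity and record_directory_info come from this Issue, but their bodies here are guesses, as are the file format (the inode as a single line of text), the use of RuntimeError, the helper check_run_number_unused, the assumption that a run record is an entry named by its run number, and the choice to write the info file before raising when the directory already holds records.

```python
import os

INFO_FILENAME = ".record_directory_info"


def record_directory_info(record_dir):
    """Return the current inode of the record directory, as text."""
    return str(os.stat(record_dir).st_ino)


def check_run_record_integrity(record_dir):
    """Raise RuntimeError if record_dir looks like it was replaced."""
    info_path = os.path.join(record_dir, INFO_FILENAME)
    if os.path.exists(info_path):
        with open(info_path) as f:
            saved = f.read().strip()
        if saved != record_directory_info(record_dir):
            raise RuntimeError(
                "Inode of %s doesn't match %s; run numbers may be "
                "duplicated" % (record_dir, INFO_FILENAME))
        return
    # No info file yet: note whether records were already there, then
    # create the file (assumption: it's written even when we go on to raise)
    had_records = any(n != INFO_FILENAME for n in os.listdir(record_dir))
    with open(info_path, "w") as f:
        f.write(record_directory_info(record_dir) + "\n")
    if had_records:
        raise RuntimeError(
            "%s was missing from %s although run records exist; run "
            "numbers may be duplicated" % (INFO_FILENAME, record_dir))


def check_run_number_unused(record_dir, run_number):
    """Hypothetical helper: integrity check, then duplicate check."""
    check_run_record_integrity(record_dir)
    # Assumption: a run's record is an entry named by the run number
    if os.path.exists(os.path.join(record_dir, str(run_number))):
        raise RuntimeError("Run %d already has a record in %s"
                           % (run_number, record_dir))
```

In this sketch a call like check_run_number_unused("/run_records", 101) would sit at the beginning of the start transition, before anything is sent to the artdaq processes.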