Data Logger / RC communication: Subruns skipped due to latency, resulting in empty (no events and no metadata) size 0 files in the middle of a run
Louise determined that DataLogger does not respond to an RC new-subrun request in time, which causes RC to issue an additional request. Two subruns are then opened simultaneously, but only one of them is actually written to. In addition, the adjacent subruns can end up with incorrect time information, because this introduces a skew between the accounting of "current subrun" by RunControl and by DataLogger. The first of these two subruns never gets closed. In these cases it is possible that the last subrun in the run also gets corrupted.
Hopefully a change in the timeout for DataLogger could fix this issue.
#1 Updated by Keith Matera over 3 years ago
Jon looked through RunControl and found the function (CreateNewSubrun) that calls DataLogger. He identified a structure that allows DataLogger to be issued two subrun rollover commands if it fails to respond within a set wait time (was 5 seconds, now increased to 10). Two solutions were proposed: stop the run under this condition (voted down by Jon and Keith), or modify the method in some way. Jon suggested that a logic fix in the code would likely take a few hours, and that simply increasing the wait time from 5 seconds to 10 seconds might patch the problem for now. The changes are documented in ECL 73891. Jon cautions that even 5 seconds is a very long time to wait for files to be written, and suggests that a network latency problem is likely a major contributor to this error condition.
"Increasing Resource Manager --> DataLogger --> Resource Timeout Time value from 5 seconds to 10 seconds.
Changing RunControl Wait Time for Message Retry from 5001 ms to 10000 ms."
Current status: watch how the Ganglia metrics are affected (fewer missing files?) and whether fewer corrupted / size 0 files are produced in future runs. We will want a permanent (code logic) solution at some point, to prevent this from becoming a problem again later on.
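One possible shape for the permanent (code logic) fix, as a minimal sketch only: if every rollover request carries the subrun number it wants to open, a retried request that arrives after a slow first reply becomes a no-op instead of a second rollover. The class and function names below (DataLoggerSketch, new_subrun, rollover_with_retry) are hypothetical and do not reflect the actual CreateNewSubrun or DataLogger code.

```python
class DataLoggerSketch:
    """Hypothetical stand-in for DataLogger; not the real class."""

    def __init__(self):
        self.current_subrun = 0
        self.opened = [0]  # subruns actually opened, in order

    def new_subrun(self, target_subrun):
        """Roll over to target_subrun; a repeated request is a no-op."""
        if target_subrun <= self.current_subrun:
            # Duplicate or stale retry: report the current state, open nothing.
            return self.current_subrun
        self.current_subrun = target_subrun
        self.opened.append(target_subrun)
        return self.current_subrun


def rollover_with_retry(logger, target_subrun, attempts=2):
    """Model the worst case: every attempt reaches the logger, replies are late.

    Each retry repeats the same target subrun, so a slow first reply
    cannot trigger a second rollover.
    """
    result = None
    for _ in range(attempts):
        result = logger.new_subrun(target_subrun)
    return result
```

With this scheme the length of the wait time only affects how long RC waits, not whether DataLogger and RC disagree about the subrun number afterwards.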
#2 Updated by Keith Matera over 3 years ago
Trying to keep this information together so that we can figure it out.
Latency should be increased to 25 s to match known timeout on the raid controller that we hit.
Known to whom? I hadn’t heard of this. Is this something that will be fixed? 25 s is a really long time for RC to be completely unresponsive. [I will note that the unresponsiveness feature can be improved, but this will require some code changes + tests.]
I wonder if instead the DL should immediately respond that it got the message, and if it fails after some time to close the file it can send an error back to the RC via a StatusRequest message. But we also need to think about how we want to deal with this; is it catastrophic enough to require an EndRun from RC?
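The acknowledge-then-report idea above could look roughly like the following sketch. It assumes a hypothetical close_file callable that does the slow file close, and a hypothetical status_request method standing in for the reply to a StatusRequest message; the reply strings are illustrative and none of this is the existing DL interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AckingLogger:
    """Hypothetical DL sketch: acknowledge at once, report failures later."""

    def __init__(self, close_file, close_timeout_s):
        self._close_file = close_file      # slow operation, e.g. closing the file
        self._timeout = close_timeout_s    # how long a close may take
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = None               # (subrun, future, start time)

    def new_subrun(self, closing_subrun):
        # Hand the slow close to a worker thread and reply immediately,
        # so RC never has a reason to resend the request.
        future = self._pool.submit(self._close_file, closing_subrun)
        self._pending = (closing_subrun, future, time.monotonic())
        return "ACK"

    def status_request(self):
        if self._pending is None:
            return "OK"
        subrun, future, started = self._pending
        if future.done():
            if future.exception() is not None:
                return f"ERROR: closing subrun {subrun} failed"
            return "OK"
        if time.monotonic() - started > self._timeout:
            return f"ERROR: closing subrun {subrun} timed out"
        return "BUSY"
```

Whether an ERROR reply should lead to an EndRun from RC is a policy question that this sketch leaves open.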
Regardless, one way or another we need to figure out a safe way for the DL and RC to remain synced in terms of subrun number.
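As a starting point for detecting a loss of sync, RC could compare its own subrun number with the one DL reports back. The sketch below assumes, hypothetically, that DL includes its current subrun number in its replies; the function names are illustrative.

```python
def subrun_skew(rc_subrun, dl_subrun):
    """How many subruns RunControl is ahead of DataLogger (negative if behind)."""
    return rc_subrun - dl_subrun


def check_sync(rc_subrun, dl_subrun):
    """Return a short human-readable sync status for logging or alarms."""
    skew = subrun_skew(rc_subrun, dl_subrun)
    if skew == 0:
        return "in sync"
    return f"out of sync: RC={rc_subrun}, DL={dl_subrun} (skew {skew:+d})"
```

This only detects the skew; deciding which side's number to trust afterwards is a separate choice.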