Individual reconfiguration of single BR sometimes causes a crash of that BR
At protoDUNE, folks have noticed that the mysterious crashes of WIB BoardReaders at begin-run (Start) time are sometimes correlated with a second configuration of that BR.
Basically what happens is that the Configure step works for all of the BRs, EBs, etc, except for one BR (because of an issue with the upstream electronics). The user picks the single troublesome BR in the RunControl GUI and sends a second Configure command to that single BR, which works. But, when the subsequent Begin Run is attempted, that BR crashes.
#1 Updated by Eric Flumerfelt over 1 year ago
Does clicking on configure again send another "initialize" XMLRPC command, or does it send a "reinitialize"? "reinitialize" currently has no implementation in BoardReaderCore, it simply logs a message and returns true.
Also, I think that we should add
generator_ptr_.reset(nullptr); at BoardReaderCore.cc:128.
#7 Updated by John Freeman 9 months ago
Concerning the issue with File metric trying to lock a mutex during destruction: if I look at commit a51d87d861aab023fb937d6212cbbfc14bd40ce9 on bugfix/21163_MetricCrashes, in artdaq-utilities/Plugins/file_metric.cc I see that these first two lines of the FileMetric::getTime_ function have been removed:
static std::mutex timeMutex; std::unique_lock<std::mutex> lk(timeMutex);
...so it's safe to say, just based on code review, that the stated desire to remove a problem with File metric locking a mutex during destruction would have been solved with this code change! It should be pointed out that FileMetric::getTime_ alters the contents of the file to which the File metric is writing, but as long as we don't have multiple processes writing to the same file this shouldn't be a problem (plus the mutex wouldn't help in that situation anyway, since it wouldn't be shared among processes).
#8 Updated by John Freeman 9 months ago
And concerning the problem with the graphite metric destructor: I confirmed that an exception was getting thrown from the graphite metric's destructor on the develop branch, specifically due to a call to
socket_.shutdown(boost::asio::socket_base::shutdown_send) within GraphiteMetric::stopMetrics_(). Commit a51d87d861aab023fb937d6212cbbfc14bd40ce9 on bugfix/21163_MetricCrashes handles this by calling the variant of the shutdown function which provides an error code rather than throwing an exception and then ignoring it. During my review, I added a bit of code via commit 9354f05882ba7d65bd8cbbd65914a63196845b54, in which the contents of the error are printed out, e.g., from /home/jcfree/daqlogs/component01-mu2edaq13-11101/component01-mu2edaq13-11101-20190911112526-17061.log on the mu2edaq cluster:
%MSG-w GraphiteMetric: Shutting Down 11-Sep-2019 11:26:00 CDT Sequence ID 83 In destructor of GraphiteMetric instance associated with localhost:2003, the following boost::system::system_error exception was thrown out of a call to stopMetrics() and caught: system:107, "shutdown: Transport endpoint is not connected" %MSG
#9 Updated by John Freeman 9 months ago
- Status changed from Remission to Resolved
Final part of the review. Commit 9c492d57c6f3d93bb46333636bdc9512f3997e3b on artdaq's feature/21163_BoardReaderCore_Reinitialize branch contains the forwarding to initialize() in soft_initialize() and reinitialize(). The code changes in that commit are crystal clear, so just looking at them is enough for me to consider them reviewed.