Project

General

Profile

Bug #21163

Individual reconfiguration of single BR sometimes causes a crash of that BR

Added by Kurt Biery 12 months ago. Updated 22 days ago.

Status:
Closed
Priority:
High
Category:
Known Issues
Target version:
Start date:
10/16/2018
Due date:
% Done:

100%

Estimated time:
Experiment:
-
Co-Assignees:
Duration:

Description

At protoDUNE, folks have noticed that the mysterious crashes of WIB BoardReaders at begin-run (Start) time are sometimes correlated with a second configuration of that BR.

Basically what happens is that the Configure step works for all of the BRs, EBs, etc, except for one BR (because of an issue with the upstream electronics). The user picks the single troublesome BR in the RunControl GUI and sends a second Configure command to that single BR, which works. But, when the subsequent Begin Run is attempted, that BR crashes.


Related issues

Related to artdaq - Idea #21164: A simple-test-config that has configure-only BRs might be useful to help mimic protoDUNE-like systemsNew10/16/2018

History

#1 Updated by Eric Flumerfelt 12 months ago

Question:
Does clicking on configure again send another "initialize" XMLRPC command, or does it send a "reinitialize"? "reinitialize" currently has no implementation in BoardReaderCore, it simply logs a message and returns true.

Also, I think that we should add generator_ptr_.reset(nullptr); at BoardReaderCore.cc:128.

#2 Updated by Kurt Biery 12 months ago

The truthful answer is 'I don't know', but I strongly suspect that it is a second Init command, rather than a re-init.

#3 Updated by Eric Flumerfelt 12 months ago

  • Related to Idea #21164: A simple-test-config that has configure-only BRs might be useful to help mimic protoDUNE-like systems added

#4 Updated by Eric Flumerfelt 4 months ago

  • Status changed from New to Remission

This issue has not reoccurred in 8 months, and we have direct evidence that sending multiple initialize transitions to BoardReaders works as expected.

#5 Updated by Eric Flumerfelt 4 months ago

As part of my investigation of this Issue, I created artdaq:feature/21163_BoardReaderCore_Reinitialize, which lets the reinitialize and soft_initialize methods of BoardReaderCore simply call initialize.

#6 Updated by Eric Flumerfelt 4 months ago

Also found during investigation: problems with destructing Graphite metric when it never was connected, and problems with File metric trying to lock a mutex during destruction. These have been fixed on artdaq-utilities:bugfix/21163_MetricCrashes.

#7 Updated by John Freeman about 1 month ago

Concerning the issue with File metric trying to lock a mutex during destruction: if I look at commit a51d87d861aab023fb937d6212cbbfc14bd40ce9 on bugfix/21163_MetricCrashes, in artdaq-utilities/Plugins/file_metric.cc I see that these first two lines of the FileMetric::getTime_ function have been removed:

                static std::mutex timeMutex;
                std::unique_lock<std::mutex> lk(timeMutex);

...so it's safe to say, just based on code review, that the stated desire to remove a problem with File metric locking a mutex during destruction would have been solved with this code change! It should be pointed out that FileMetric::getTime_ alters the contents of the file to which the File metric is writing, but as long as we don't have multiple processes writing to the same file this shouldn't be a problem (plus the mutex wouldn't help in that situation anyway, since it wouldn't be shared among processes).

#8 Updated by John Freeman about 1 month ago

And concerning the problem with the graphite metric destructor: I confirmed that an exception was getting thrown from the graphite metric's destructor on the develop branch, specifically due to a call to socket_.shutdown(boost::asio::socket_base::shutdown_send) within GraphiteMetric::stopMetrics_(). Commit a51d87d861aab023fb937d6212cbbfc14bd40ce9 on bugfix/21163_MetricCrashes handles this by calling the variant of the shutdown function which provides an error code rather than throwing an exception and then ignoring it. During my review, I added a bit of code via commit 9354f05882ba7d65bd8cbbd65914a63196845b54, in which the contents of the error are printed out, e.g., from /home/jcfree/daqlogs/component01-mu2edaq13-11101/component01-mu2edaq13-11101-20190911112526-17061.log on the mu2edaq cluster:

%MSG-w GraphiteMetric:  Shutting Down 11-Sep-2019 11:26:00 CDT Sequence ID 83
In destructor of GraphiteMetric instance associated with localhost:2003, the following boost::system::system_error exception was thrown out of a call to stopMetrics() and caught: system:107, "shutdown: Transport endpoint is not connected" 
%MSG

#9 Updated by John Freeman about 1 month ago

  • Status changed from Remission to Resolved

Final part of the review. Commit 9c492d57c6f3d93bb46333636bdc9512f3997e3b on artdaq's feature/21163_BoardReaderCore_Reinitialize branch contains the forwarding to initialize() in soft_initialize() and reinitialize(). The code changes in that commit are crystal clear, so just looking at them is enough for me to consider them reviewed.

#10 Updated by John Freeman about 1 month ago

  • % Done changed from 0 to 100
  • Status changed from Resolved to Reviewed

#11 Updated by Eric Flumerfelt 22 days ago

  • Target version set to artdaq v3_06_01
  • Assignee set to Eric Flumerfelt
  • Status changed from Reviewed to Closed
  • Category set to Known Issues
  • Co-Assignees John Freeman added


Also available in: Atom PDF