Bug #8891

Unexpected art thread exception thrown at end of running

Added by John Freeman over 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 05/21/2015
Due date:
% Done: 100%
Estimated time: 40.00 h
Spent time:
Occurs In:
Scope: Internal
Experiment: DarkSide
SSI Package: art
Duration:

Description

Kurt and I have noticed that artdaq-based DAQ systems which use e6:s7:eth builds of artdaq v1_12_09 (and consequently depend on art v1_13_01) have their EventBuilderMain processes throw exceptions (and subsequently terminate) when they're sent the shutdown transition. Some code-level details; note that paths are given relative to their package's root (so, e.g., artdaq paths are given relative to ./artdaq):

  • The art thread in EventBuilderMain is referred to by an std::future object called "reader_thread_" in the EventStore class; its creation appears at artdaq/DAQrate/EventStore.cc:71
  • When a shutdown is sent, EventBuilderCore::shutdown() is called in EventBuilderMain at artdaq/Application/MPI2/EventBuilderCore.cc:298
  • This function in turn calls endOfData() on the EventBuilderCore's instance of the EventStore class; this function appears at artdaq/DAQrate/EventStore.cc:203
  • The problem appears in EventStore::endOfData() when the art thread is closed out via a call to reader_thread_.get()
  • By looking at core dumps of the program, I've determined that the problem occurs in the destructor of the NetMonOutput module run on the EventBuilderMain's art thread, defined at artdaq/ArtModules/NetMonOutput_module.cc:86. Specifically, in this line of code:
    ServiceHandle<NetMonTransportService> transport;
    

    the ServiceHandle constructor defined in the art package at art/Framework/Services/Registry/ServiceHandle.h:33 is invoked, which in turn calls the ServiceRegistry::get() function defined in art/Framework/Services/Registry/ServiceRegistry.h:60 -- where it's found that the pointer to the ServicesManager class is null and an exception is thrown.

To get your hands on this code and troubleshoot it, I've created a special branch of artdaq-demo called "artthread_exception". To get artdaq-demo up and running, simply do the following:

git clone ssh://p-artdaq-demo@cdcvs.fnal.gov/cvs/projects/artdaq-demo
cd artdaq-demo
git checkout artthread_exception
cd ..
./artdaq-demo/tools/quick-start.sh --tag=artthread_exception

…and respond "y" to the "Are you sure" question with which you'll be prompted. A bunch of packages will be downloaded, and some repositories will be checked out and built; the entire process takes about 12 minutes on woof.fnal.gov, and takes up just under 7 GB of space. Please note that you'll see complaints about there being different versions of cetbuildtools; this only occurs because the source build of art uses an older version of cetbuildtools than the other packages, and doesn't appear to affect whether the builds are successful. Once this has been done, you can run a demo of the system following the instructions at https://cdcvs.fnal.gov/redmine/projects/artdaq-demo/wiki/Running_a_sample_artdaq-demo_system, where you'll see the exception when you issue a shutdown.

The reasons for this special branch are that (A) rather than just downloading a prebuilt art from a zip file, it builds art from the source in the local area, to allow for line-by-line troubleshooting, and (B) the FHiCL code controlling the artdaq processes when you run the demo as described at https://cdcvs.fnal.gov/redmine/projects/artdaq-demo/wiki/Running_a_sample_artdaq-demo_system has been simplified, so you can focus just on the problem described in this issue.

History

#1 Updated by John Freeman over 4 years ago

Apologies; a section above appeared crossed out due to a bug in my Redmine wiki syntax. That section is reprinted here:

  • The art thread in EventBuilderMain is referred to by an std::future object called "reader_thread_" in the EventStore class; its creation appears at artdaq/DAQrate/EventStore.cc:71
  • When a shutdown is sent, EventBuilderCore::shutdown() is called in EventBuilderMain at artdaq/Application/MPI2/EventBuilderCore.cc:298
  • This function in turn calls endOfData() on the EventBuilderCore's instance of the EventStore class; this function appears at artdaq/DAQrate/EventStore.cc:203
  • The problem appears in EventStore::endOfData() when the art thread is closed out via a call to reader_thread_.get()
  • By looking at core dumps of the program, I've determined that the problem occurs in the destructor of the NetMonOutput module run on the EventBuilderMain's art thread, defined at artdaq/ArtModules/NetMonOutput_module.cc:86. Specifically, in this line of code:

#2 Updated by Christopher Green over 4 years ago

  • Description updated (diff)
  • Category set to Infrastructure
  • Status changed from New to Accepted
  • Target version set to 1.18.00
  • Estimated time changed from 6.00 h to 40.00 h
  • Experiment DarkSide added
  • Experiment deleted (-)
  • SSI Package art added
  • SSI Package deleted ()

#3 Updated by Kyle Knoepfel about 4 years ago

John, can you check that the build instructions/scripts are current? I was not able to get a consistent build following instructions as listed on the Wiki - in particular, artdaq-code-demo does not seem to have a version v1_03_00.

#4 Updated by John Freeman about 4 years ago

Either I didn't run "git tag" on v1_03_00 of artdaq-code-demo or, if I did, I forgot to push it. That's been fixed, and I've gotten the instructions to work on woof.fnal.gov.

#5 Updated by Kyle Knoepfel about 4 years ago

Thanks, John, for making the push. I am now able to build everything per your instructions, and I confirm the error you have reported.

#6 Updated by Kyle Knoepfel about 4 years ago

  • Subject changed from Unexpected art thread exception throws at end of running to Unexpected art thread exception thrown at end of running

#7 Updated by Kyle Knoepfel about 4 years ago

  • Assignee set to Kyle Knoepfel
  • % Done changed from 0 to 70

Problem and solution summary

We now understand why the failure is happening. The issue can be understood heuristically by looking at this toy class:

class EP {
  // many data members
  ModulesType  entity_that_creates_modules_;
  ServicesType entity_that_creates_services_;
  // more data members
};

Although the actual code is more nuanced than this, it illustrates the issue. According to the C++ standard, data members are constructed in declaration order and destroyed in reverse order, so entity_that_creates_services_ is constructed after and destructed before entity_that_creates_modules_. In other words, whenever the NetMonOutput module destructor is called, the art::ServiceHandle within it cannot be valid since NetMonTransportService no longer exists. The solution is to reverse the order of declaration of the two data members, which reverses the construction/destruction behavior.

Obviously this is an oversimplification of the issue since art::ServiceHandle objects can validly be retrieved in module constructors. Discussing the subtleties of how this happens, though, would only confuse the issue.

We should be able to recreate this problem within art. We will add a test to ensure this does not happen in the future. I suspect we will release a fix for this when Alpha Centauri is released later this summer or in early fall. If the fix is desired before then, please talk with us and we can discuss the priority of releasing a bug-fix version of art.

Apparently, artdaq is the first to try to resolve an art::ServiceHandle in one of its module destructors.


Expert notes

A snippet of the original art::EventProcessor class definition shows the following member order:


class art::EventProcessor : public art::IEventProcessor {

  // down to the private member data block
  // ... skipping some members

  ServiceToken serviceToken_;
  tbb::task_scheduler_init tbbManager_;
  PathManager pathManager_; // Must outlive schedules.
  ServiceDirector serviceDirector_;
  // destructorOperate_ should be populated in destructor only!
  std::unique_ptr<ServiceRegistry::Operate> destructorOperate_; // [ADDED NOTE: This must be before 'pathManager_' !]
  std::unique_ptr<InputSource> input_;

  // etc.

};

The 'ADDED NOTE' is mine. Even though the EventProcessor destructor resets the destructorOperate_ member so that services can, in principle, be used in module destructors, the destructorOperate_ member is destroyed before the destructors of the modules are called (members are destroyed in reverse order of declaration), which produces the exception. The problem is solved by simply moving the destructorOperate_ declaration before the pathManager_ declaration.

#8 Updated by Kyle Knoepfel about 4 years ago

  • Status changed from Accepted to Resolved
  • % Done changed from 70 to 100

I was able to reproduce the error within art, outside of artdaq. The fix has been implemented and a test added.

Implemented with commit art:f30c37a0b3b69af20be2a3b8d8cb087bf9742ab7.

#9 Updated by Kyle Knoepfel about 4 years ago

Credit goes to Paul Russo for spending significant time helping me debug this problem.

#10 Updated by Kyle Knoepfel about 4 years ago

  • Target version changed from 1.18.00 to 1.15.02

#11 Updated by Kyle Knoepfel almost 4 years ago

  • Status changed from Resolved to Closed

