Bug #21870

Bug #21863: mediumsystem_with_routing_master fails during integration testing

DataReceiverManager::stop_threads ends before all threads are actually stopped

Added by Eric Flumerfelt over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Category: Known Issues
Target version:
Start date: 02/08/2019
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

While debugging Bug #21863, I noticed that in the backlogged DataLogger, not all of the receiver threads are actually joined in DataReceiverManager::stop_threads.
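For readers unfamiliar with the pattern, here is a minimal sketch of a stop routine that returns only after every receiver thread has been joined. It is not the artdaq implementation: ReceiverPool, start_threads, run_receiver and running_count are hypothetical names invented for illustration, and the receive loop is reduced to a sleep.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a receiver manager; NOT the artdaq class.
class ReceiverPool {
public:
    ~ReceiverPool() { stop_threads(); }

    // Start one receiver thread per source.
    void start_threads(std::size_t n_sources) {
        stop_requested_ = false;
        for (std::size_t i = 0; i < n_sources; ++i) {
            threads_.emplace_back([this] { run_receiver(); });
        }
    }

    // Request a stop, then block until every receiver thread has exited.
    void stop_threads() {
        stop_requested_ = true;
        for (auto& t : threads_) {
            if (t.joinable()) t.join();
            assert(!t.joinable());  // this thread is finished
        }
        threads_.clear();
    }

    // Number of receive loops that are currently running.
    int running_count() const { return running_.load(); }

private:
    void run_receiver() {
        ++running_;
        while (!stop_requested_) {
            // Placeholder for the real receive work.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        --running_;  // cleanup phase: must be reached on every exit path
    }

    std::atomic<bool> stop_requested_{false};
    std::atomic<int> running_{0};
    std::vector<std::thread> threads_;
};
```

After stop_threads returns in this sketch, running_count() is zero because each thread's cleanup step runs before its join completes.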

History

#1 Updated by Eric Flumerfelt over 1 year ago

  • Assignee set to Eric Flumerfelt
  • Status changed from New to Work in progress
  • Category set to Known Issues

Work in progress on artdaq/bugfix/21870_DRM_StopThreads_WaitForAllThreads

#2 Updated by Eric Flumerfelt over 1 year ago

The problem actually appeared to be an errant "return" in runReceiver_, which prevented it from reaching the cleanup phase.
Since this only became apparent using the other changes in this branch, I think it is still a worthwhile change.
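The kind of defect described above can be sketched as follows. This is an illustration only, not the actual runReceiver_ code: Event, ReceiverState and both functions are hypothetical, and the "cleanup phase" is reduced to setting a flag. The only difference between the two versions is "return" versus "break" inside the loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bookkeeping; stands in for whatever the real cleanup does.
struct ReceiverState {
    bool loop_exited = false;  // set by the cleanup phase
};

enum class Event { Data, EndOfData };

// Buggy shape: the early "return" leaves the whole function from inside
// the loop, so the cleanup phase below is never reached.
void run_receiver_buggy(const std::vector<Event>& events, ReceiverState& state) {
    for (Event e : events) {
        if (e == Event::EndOfData) {
            return;  // errant: skips cleanup
        }
    }
    state.loop_exited = true;  // cleanup phase
}

// Fixed shape: "break" leaves only the loop, so cleanup always runs.
void run_receiver_fixed(const std::vector<Event>& events, ReceiverState& state) {
    for (Event e : events) {
        if (e == Event::EndOfData) {
            break;
        }
    }
    state.loop_exited = true;  // cleanup phase
}

// Usage: the same input reaches cleanup only in the fixed version.
void demo_errant_return() {
    const std::vector<Event> events = {Event::Data, Event::EndOfData};
    ReceiverState buggy, fixed;
    run_receiver_buggy(events, buggy);
    run_receiver_fixed(events, fixed);
    assert(!buggy.loop_exited);
    assert(fixed.loop_exited);
}
```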

#3 Updated by Eric Flumerfelt over 1 year ago

  • Status changed from Work in progress to Resolved

#4 Updated by John Freeman over 1 year ago

  • Status changed from Resolved to Reviewed

On mu2edaq12, my runs 2176, 2177, and 2178 were performed before I merged bugfix/21870_DRM_StopThreads_WaitForAllThreads into /home/jcfree/artdaq-demo_test_fixes_to_v3_03_02/srcs/artdaq and rebuilt (details in mu2edaq12:/home/jcfree/run_records, and mu2edaq12:/tmp/daqinterface_jcfree/DAQInterface_partition0.log). Each run had three eventbuilders, ranks 10-12. If I set TRACE_FILE to /tmp/trace_buffer_jcfree.run217X and then run

tshow | egrep "DataLogger.*receive loop" | tdelta -ct 1 -d 1

then, clearly, we're not reaching the end of DataReceiverManager::runReceiver_ for all three eventbuilders in the datalogger on stop:
31001 02-14 16:08:27.481850         0 24664 15750   6         DataLogger1_DataReceiverManager dbg . runReceiver_ 12 receive loop exited
54775 02-14 16:06:54.553228  92928622 24664 54030  10         DataLogger1_DataReceiverManager dbg . runReceiver_ 11 receive loop exited
79798 02-14 16:05:15.570905  98982323 24664 33695  14         DataLogger1_DataReceiverManager dbg . runReceiver_ 11 receive loop exited
79804 02-14 16:05:15.484929     85976 24664 33694  27         DataLogger1_DataReceiverManager dbg . runReceiver_ 10 receive loop exited

On the other hand, runs 2181, 2182 and 2183 were performed after merging in the bugfix branch, and if I set TRACE_FILE to /tmp/trace_buffer_jcfree.run218X, then I see that for each run stop, there are three "receive loop exited" messages, as expected:
2742 02-14 17:00:02.550815         0 46616 35940  15         DataLogger1_DataReceiverManager dbg . runReceiver_ 12 receive loop exited
 2743 02-14 17:00:02.549863       952 46616 35938  22         DataLogger1_DataReceiverManager dbg . runReceiver_ 10 receive loop exited
 2904 02-14 17:00:00.829380   1720483 46616 35939  18         DataLogger1_DataReceiverManager dbg . runReceiver_ 11 receive loop exited
29767 02-14 16:58:25.167646  95661734 46616 17766   9         DataLogger1_DataReceiverManager dbg . runReceiver_ 11 receive loop exited
29768 02-14 16:58:25.166633      1013 46616 17765  10         DataLogger1_DataReceiverManager dbg . runReceiver_ 10 receive loop exited
29769 02-14 16:58:25.166508       125 46616 17767  24         DataLogger1_DataReceiverManager dbg . runReceiver_ 12 receive loop exited
58014 02-14 16:56:47.788704  97377804 46616 55574  13         DataLogger1_DataReceiverManager dbg . runReceiver_ 10 receive loop exited
58015 02-14 16:56:47.788110       594 46616 55575  14         DataLogger1_DataReceiverManager dbg . runReceiver_ 11 receive loop exited
58016 02-14 16:56:47.788079        31 46616 55576  15         DataLogger1_DataReceiverManager dbg . runReceiver_ 12 receive loop exited

and similarly, I see the "Joining thread for source rank" message I'd expect for each of the three sources if I grep accordingly.
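The check done by eye above (one "receive loop exited" message per source rank per stop) could be automated along these lines. This is a rough sketch that assumes each trace line contains the text "runReceiver_ <rank> receive loop exited" as in the output shown; count_exits_per_rank and all_ranks_exited are hypothetical helpers, not part of TRACE or artdaq.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Count "receive loop exited" messages per source rank.
// Assumes the layout "... runReceiver_ <rank> receive loop exited".
std::map<int, int> count_exits_per_rank(const std::vector<std::string>& lines) {
    const std::string marker = "runReceiver_ ";
    std::map<int, int> counts;
    for (const auto& line : lines) {
        if (line.find("receive loop exited") == std::string::npos) continue;
        const auto pos = line.find(marker);
        if (pos == std::string::npos) continue;
        // The rank is the first integer after the marker.
        std::istringstream rest(line.substr(pos + marker.size()));
        int rank = 0;
        if (rest >> rank) ++counts[rank];
    }
    return counts;
}

// True if every expected rank logged exactly `stops` exit messages.
bool all_ranks_exited(const std::map<int, int>& counts,
                      const std::vector<int>& ranks, int stops) {
    for (int r : ranks) {
        const auto it = counts.find(r);
        if (it == counts.end() || it->second != stops) return false;
    }
    return true;
}
```

Fed the four pre-merge lines quoted above, this gives one message for rank 10, two for rank 11 and one for rank 12, so all_ranks_exited(counts, {10, 11, 12}, 3) is false. The nine post-merge lines give three messages for each rank.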

#5 Updated by Eric Flumerfelt over 1 year ago

  • Target version set to artdaq v3_04_00
  • Status changed from Reviewed to Closed

