Project

General

Profile

Support #23517

Lots of SharedMemoryManager errors in TRACE buffer at end of artdaq-demo test run

Added by Kurt Biery about 1 month ago. Updated 20 days ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Start date:
11/04/2019
Due date:
% Done:

0%

Estimated time:
Experiment:
-
Co-Assignees:
Duration:

Description

This is with an artdaq-demo v3_06_02 system on mu2edaq13.
Are these expected/normal? (It seems that these have not always been there...)

idx                us_tod    pid    tid cpu                                 trcname lvl r msg                       
------ ---------------- ------ ------ --- --------------------------------------- --- - --------------------------
0 11-04 12:45:13.369614 309813 309813 18 SharedMemoryManager err . Calling default signal handler
1 11-04 12:45:13.369613 309813 309813 18 SharedMemoryManager err . A signal of type 11 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
2 11-04 12:45:13.369161 309810 328236 27 SharedMemoryManager err . Calling default signal handler
3 11-04 12:45:13.369155 309823 328238 16 SharedMemoryManager err . Calling default signal handler
4 11-04 12:45:13.369152 309810 328236 27 SharedMemoryManager err . A signal of type 11 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
5 11-04 12:45:13.369145 309813 309813 18 SharedMemoryManager err . Calling default signal handler
6 11-04 12:45:13.369100 309831 309831 23 SharedMemoryManager err . Calling default signal handler
7 11-04 12:45:13.369087 309846 309846 0 SharedMemoryManager err . Calling default signal handler
8 11-04 12:45:13.369079 309838 309838 42 SharedMemoryManager err . Calling default signal handler
9 11-04 12:45:13.369051 309846 328179 44 SharedMemoryManager err . Calling default signal handler
10 11-04 12:45:13.369051 309851 309851 40 SharedMemoryManager err . Calling default signal handler
11 11-04 12:45:13.369050 309846 309846 0 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
12 11-04 12:45:13.369025 309822 328242 8 SharedMemoryManager err . Calling default signal handler
13 11-04 12:45:13.369025 309838 309838 42 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
14 11-04 12:45:13.369024 309851 328174 30 SharedMemoryManager err . Calling default signal handler
15 0000000 0 0 0 MPISentry err .
16 11-04 12:45:13.369022 309818 328239 6 SharedMemoryManager err . Calling default signal handler
17 11-04 12:45:13.369016 309831 309831 23 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
18 11-04 12:45:13.369011 309851 309851 40 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
19 11-04 12:45:13.369010 309836 328244 22 SharedMemoryManager err . Calling default signal handler
20 0000000 0 0 0 MPISentry err .
21 11-04 12:45:13.369008 309824 309824 43 SharedMemoryManager err . Calling default signal handler
22 11-04 12:45:13.369002 309823 309823 17 SharedMemoryManager err . Calling default signal handler
23 11-04 12:45:13.368991 309847 328193 11 SharedMemoryManager err . Calling default signal handler
24 11-04 12:45:13.368987 309847 328193 11 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
25 11-04 12:45:13.368975 309846 328179 44 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
26 11-04 12:45:13.368970 309822 309822 16 SharedMemoryManager err . Calling default signal handler
27 11-04 12:45:13.368970 309836 309836 26 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
28 11-04 12:45:13.368967 309823 309823 17 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
29 11-04 12:45:13.368966 309846 309846 0 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
30 11-04 12:45:13.368964 309824 328243 46 SharedMemoryManager err . Calling default signal handler
31 11-04 12:45:13.368963 309847 309847 41 SharedMemoryManager err . Calling default signal handler
32 11-04 12:45:13.368958 309824 309824 43 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
33 11-04 12:45:13.368958 309822 328242 8 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
34 11-04 12:45:13.368958 309851 328174 30 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
35 11-04 12:45:13.368953 309818 328239 6 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
36 11-04 12:45:13.368953 309818 309818 2 SharedMemoryManager err . Calling default signal handler
37 11-04 12:45:13.368950 309851 309851 40 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
38 11-04 12:45:13.368949 309826 328240 34 SharedMemoryManager err . Calling default signal handler
39 11-04 12:45:13.368946 309826 309826 39 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
40 11-04 12:45:13.368944 309838 309838 42 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
41 11-04 12:45:13.368940 309838 328192 14 SharedMemoryManager err . Calling default signal handler
42 11-04 12:45:13.368931 309823 328238 16 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
43 11-04 12:45:13.368927 309836 328244 22 SharedMemoryManager err 1 A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
44 0000000 309808 309808 15 SharedMemoryManager err . Calling default
45 11-04 12:45:13.368923 309808 328237 10 SharedMemoryManager err . Calling default signal handler
46 11-04 12:45:13.368920 309838 328192 36 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
47 11-04 12:45:13.368916 309847 309847 41 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
48 11-04 12:45:13.368916 309813 328241 1 SharedMemoryManager err . Calling default signal handler
49 11-04 12:45:13.368915 309808 309808 15 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
50 11-04 12:45:13.368909 309851 309851 40 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
51 11-04 12:45:13.368907 309813 328241 1 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
52 11-04 12:45:13.368903 309810 309810 4 SharedMemoryManager err . Calling default signal handler
53 11-04 12:45:13.368902 309810 309810 4 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
54 11-04 12:45:13.368901 309838 309838 42 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
55 11-04 12:45:13.368859 309824 328243 46 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
56 0000000 0 0 0 MPISentry err .
57 11-04 12:45:13.368856 309810 328236 27 SharedMemoryManager err . Calling default signal handler
58 11-04 12:45:13.368854 309810 328236 27 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
59 11-04 12:45:13.368842 309823 309823 17 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
60 11-04 12:45:13.368841 309822 328242 8 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
61 11-04 12:45:13.368834 309824 309824 43 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
62 11-04 12:45:13.368829 309836 309836 26 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
63 11-04 12:45:13.368817 309826 328240 34 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
64 11-04 12:45:13.368807 309826 309826 39 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
65 11-04 12:45:13.368793 309818 328239 6 SharedMemoryManager err 1 A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
66 11-04 12:45:13.368793 309818 309818 2 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
67 11-04 12:45:13.368790 309808 328237 10 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
68 11-04 12:45:13.368775 309822 309822 16 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
69 11-04 12:45:13.368772 309810 328236 27 SharedMemoryManager err 1 A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
70 11-04 12:45:13.368772 309810 309810 4 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
71 11-04 12:45:13.368752 309813 328241 1 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
72 11-04 12:45:13.368742 309813 309813 18 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
73 11-04 12:45:13.368733 309808 309808 15 SharedMemoryManager err . A signal of type 15 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
74 11-04 12:45:12.275492 309849 339485 25 RoutingMaster1_CommandableInterface nfo . Shutdown transition complete
75 11-04 12:45:12.275446 309849 339485 25 MetricManager wrn . Attempted to send metric when MetricManager has not yet been initialized!
76 11-04 12:45:12.275443 309849 339485 25 RoutingMaster1_CommandableInterface dbg . States before and after a shutdown transition: InitializedMap::Ready and Main::Booted. Transition Duration: 0.00239009 s.
77 11-04 12:45:12.275414 309849 339485 25 RoutingMaster1_RoutingMasterCore dbg . Destructor
78 11-04 12:45:12.275403 309849 339485 25 RoutingMaster1App dbg . Booted state entry action called.
79 11-04 12:45:12.275395 309849 339485 25 MetricManager dbg . Metric Plugin file shutdown.
80 11-04 12:45:12.275049 309849 339485 25 MetricManager dbg . do_stop Complete
81 11-04 12:45:12.275047 309849 339485 25 MetricManager dbg . Joining Metric-Sending thread
82 11-04 12:45:12.275029 309849 339485 25 MetricManager dbg . Stopping Metrics
83 11-04 12:45:12.275015 309849 339485 25 MetricManager dbg . MetricManager is shutting down...
84 11-04 12:45:12.274912 309849 328169 21 RoutingMaster1_TokenReceiver dbg . Received 0 events
85 11-04 12:45:12.273074 309849 339485 25 RoutingMaster1_TokenReceiver dbg . Joining tokenThread
86 11-04 12:45:12.272923 309849 339485 25 RoutingMaster1_CommandableInterface nfo . Shutdown transition started
87 11-04 12:45:12.267839 309851 339483 41 Dispatcher1_CommandableInterface nfo . Shutdown transition complete
88 11-04 12:45:12.267781 309851 339483 41 MetricManager wrn . Attempted to send metric when MetricManager has not yet been initialized!
89 11-04 12:45:12.267780 309851 339483 41 Dispatcher1_CommandableInterface dbg . States before and after a shutdown transition: InitializedMap::Ready and Main::Booted. Transition Duration: 0.0547202 s.
90 11-04 12:45:12.267774 309851 339483 41 Dispatcher1_CommandableInterface dbg . BootedEnter called.
91 11-04 12:45:12.267699 309851 339483 41 Dispatcher1_DataReceiverCore nfo . Completed Shutdown transition
92 11-04 12:45:12.267692 309851 339483 41 Dispatcher1_DataReceiverCore dbg . shutdown: Complete

History

#1 Updated by Eric Flumerfelt about 1 month ago

In my testing, I found that these messages were being produced by BoardReaders which were using shared memory transfers. The problem is caused by the fact that SharedMemoryManager never removed it's rather verbose signal handler when it was shutting down.

I have a candidate fix on artdaq-core:bugfix/23517_SMM_UnregisterSignalHandlers. Please give it a try and see if that fixes your issue without undesired side-effects (especially check for leftover shared memory segments).

#2 Updated by Eric Flumerfelt about 1 month ago

  • Assignee set to Eric Flumerfelt
  • Status changed from New to Resolved

#3 Updated by Kurt Biery about 1 month ago

I have run artdaq-demo tests of this code on mu2edaq13 using the mediumsystem_with_routing_master sample config. The artdaq-demo software area was based on artdaq_demo/artdaq v3_06_02. To get the code on the artdaq-core:bugfix/23517_SMM_UnregisterSignalHandlers branch to work in that software area, I modified the artdaq_core product_deps file to pretend that the version number was v3_05_07 (which is what the software area wanted) instead of v3_05_08 (which was what the product_deps file on the bugfix branch has).

I've run individual tests and tests with more than one run in a single DAQ session (e.g. run_demo.sh --runs 5), and the results look good [no unexpected errors; data files look consistently good], and I didn't see any shared memory segments left over after any of the runs.

I also took a quick look at the code changes. They looked good. A side question: SharedMemoryManager uses the static boolean sighandler_init to control whether the signal handlers are installed on startup and removed on shutdown. The use of a static variable implies that more than one instance of SharedMemoryManager can existing in a given process. And, the boolean seems like a good way to ensure that only the first one installs the special signal handlers. However, I wonder if the lifetimes of the different SMM instances might not perfectly align. If they do align, then it doesn't matter which one removes the special signal handlers at process shutdown. If they don't align, though, then it seems like we would want the last one that is destructed to remove the special signal handlers, not the first one

Eric, could you provide some information on the possible lifetimes of different SMM instances?
Thanks!
Kurt

#4 Updated by Eric Flumerfelt about 1 month ago

We have Shmem_transfer SharedMemoryManagers and we have SharedMemoryEventManager SharedMemoryManagers.

BR: Can only have Shmem_transfer SMM instances. All are constructed/destructed at approximately the same time, at begin and end run.
EB/DL/DI: Can have receive Shmem_transfer SMM and SMEM SMM. Receiver transfer plugins are not destructed until the shutdown transition
EB/DL/DI art: Can have send Shmem_transfer SMM and SMEM SMM. art processes are terminated at run end and restarted clean at run start.

The one that worries me is the EventBuilder...currently, it's not a problem because DataReceiverManager has the same lifetime as SharedMemoryEventManager, but if in the future this is no longer true, it may break the signal handler functionality. It may be better to gate the signal handler removal on instances.size() == 0 (as the removal code occurs after the instance is removed from that list).

#5 Updated by Kurt Biery about 1 month ago

Switching to the use of the number of instances to determine when to remove (and maybe add) the special signal handlers sounds good to me.
For reference, I added a temporary TLOG to SMM at destruction time to print out the number of instances of SMM, and it showed an initial number of 11 (eleven) in each EventBuilder in a run of the artdaq-demo with mediumsystem_with_routing_master.

#6 Updated by Kurt Biery 27 days ago

  • Status changed from Resolved to Reviewed
  • Co-Assignees Kurt Biery added

I have confirmed that the latest code changes (that include the instances.size() == 0 condition) do not have the error messages in the TRACE log. And that the shared memory segments are cleaned up in single-run and multi-run tests with run_demo.sh.

Marking this Issue Reviewed.

#7 Updated by Eric Flumerfelt 20 days ago

  • Target version set to artdaq_core v3_05_09
  • Status changed from Reviewed to Closed


Also available in: Atom PDF