Bug #20548

EB terminate crash

Added by Eric Flumerfelt about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
Start date: 08/07/2018
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

Ron:

I'm getting one of these practically every time the run terminates/shuts down.

If the BR terminates and closes a connection to the EB, will it cause something like this?

Core was generated by `EventBuilderMain -c id: 5235 commanderPluginType: xmlrpc application_name: Even'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fc016631b20 in cling::Interpreter::runAndRemoveStaticDestructors() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCling.so
[Current thread is 1 (Thread 0x7fc5b5f71f00 (LWP 21562))]
(gdb) bt
#0 0x00007fc016631b20 in cling::Interpreter::runAndRemoveStaticDestructors() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCling.so
#1 0x00007fc0165d9ed6 in TCling::ResetGlobals() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCling.so
#2 0x00007fc5b4a7a953 in TROOT::EndOfProcessCleanups() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCore.so
#3 0x00007fc5b4b802fd in TUnixSystem::Exit(int, bool) () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCore.so
#4 0x00007fc5b4b840b2 in TUnixSystem::DispatchSignals(ESignals) () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCore.so
#5 <signal handler called>
#6 0x00007fc5b0644a3d in poll () from /lib64/libc.so.6
#7 0x00007fc321acac3a in waitForConnection (listenSocketP=0x13ee4f0, listenSocketP=0x13ee4f0, errorP=0x7ffccd6b1e60, interruptedP=<synthetic pointer>)
at socket_unix.c:694
#8 chanSwitchAccept (chanSwitchP=<optimized out>, channelPP=0x7ffccd6b1e68, channelInfoPP=0x7ffccd6b1e70, errorP=0x7ffccd6b1e60) at socket_unix.c:804
#9 0x00007fc321ac1a7f in ChanSwitchAccept (chanSwitchP=0x13ee510, channelPP=0x7ffccd6b1e68, channelInfoPP=0x7ffccd6b1e70, errorP=<optimized out>) at chanswitch.c:159
#10 0x00007fc321ac94ef in acceptAndProcessNextConnection (errorP=0x7ffccd6b1e58, outstandingConnListP=0x13ee3e0, serverP=0x13ee4c0) at server.c:1191
#11 serverRun2 (errorP=0x7ffccd6b1e58, serverP=0x13ee4c0) at server.c:1242
#12 ServerRun (serverP=serverP@entry=0x13ee4c0) at server.c:1280
#13 0x00007fc32250a3f8 in xmlrpc_c::setupSignalsAndRunAbyss (abyssServerP=0x13ee4c0) at server_abyss.cpp:760
#14 0x00007fc32250b219 in xmlrpc_c::serverAbyss_impl::run (this=<optimized out>) at server_abyss.cpp:771
#15 0x00007fc32250b6bd in xmlrpc_c::serverAbyss::run (this=<optimized out>) at server_abyss.cpp:873
#16 0x00007fc3227361eb in artdaq::xmlrpc_commander::run_server (this=0x13ea160)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/ExternalComms/xmlrpc_commander.cc:1133
#17 0x0000000000414d11 in main (argc=<optimized out>, argv=<optimized out>)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq_mpich_plugin/artdaq-mpich-plugin/Application/EventBuilderMain.cc:67
(gdb)

History

#1 Updated by Eric Flumerfelt about 1 year ago

More info:

Fix in artdaq 93a83a8

Eric L. Flumerfelt
Computational Physics Developer

Fermi National Accelerator Laboratory
www.fnal.gov

-----Original Message-----
From: Ronald D Rechenmacher
Sent: Monday, August 6, 2018 10:48 PM
To: Eric Flumerfelt
Subject: Re: EB terminate crash

I'm thinking somewhere between the TLOGs:
    TLOG << "endOfData: Flushing " << initialStoreSize
         << " stale events from the SharedMemoryEventManager.";
    int counter = initialStoreSize;
    while (active_buffers_.size() > 0 && counter > 0) {
        complete_buffer_(*active_buffers_.begin());
        counter--;
    }
    TLOG << "endOfData: Done flushing, there are now " << GetIncompleteEventCount()
         << " stale events in the SharedMemoryEventManager.";

The real thread's backtrace might be the following (except that I should then see an additional TLOG, so I don't understand):
#25 <signal handler called>
#26 std::__atomic_base<unsigned long>::load (_m=std::memory_order_seq_cst, this=<optimized out>)
at /cvmfs/fermilab.opensciencegrid.org/products/artdaq/gcc/v6_4_0/Linux64bit+3.10-2.17/include/c++/6.4.0/bits/atomic_base.h:396
#27 artdaq::RequestSender::GetSentTokenCount (this=<optimized out>) at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/DAQrate/RequestSender.hh:134
#28 artdaq::SharedMemoryEventManager::check_pending_buffers
(this=this@entry=0x7fc298003680, lock=...)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/DAQrate/SharedMemoryEventManager.cc:1184
#29 0x00007fc5b3c67161 in artdaq::SharedMemoryEventManager::complete_buffer_ (this=this@entry=0x7fc298003680, buffer=<optimized out>)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/DAQrate/SharedMemoryEventManager.cc:1035
#30 0x00007fc5b3c682b1 in artdaq::SharedMemoryEventManager::endOfData (this=0x7fc298003680)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/DAQrate/SharedMemoryEventManager.cc:600
#31 0x00007fc5b405cd1c in artdaq::DataReceiverCore::shutdown (this=0x7fc29803a8f0)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/Application/DataReceiverCore.cc:242
#32 0x00007fc5b4048fd8 in artdaq::EventBuilderApp::do_shutdown (this=0x7ffccd6b8b10)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/Application/EventBuilderApp.cc:96
#33 0x00007fc5b3fdb0a7 in artdaq::Main_Initialized::shutdown (this=0x7fc5b42c84b0 <artdaq::Main::Initialized>, context=..., timeout=0x2d)
at /home/ron/work/artdaqPrj/demo25-kurt2/build_slf7.x86_64/artdaq/artdaq/Application/Commandable_sm.cpp:306
#34 0x00007fc5b3fd615b in artdaq::CommandableContext::shutdown (timeout=0x2d, this=0x7ffccd6b8b18)
at /home/ron/work/artdaqPrj/demo25-kurt2/build_slf7.x86_64/artdaq/artdaq/Application/Commandable_sm.h:296
#35 artdaq::InitializedMap_Ready::shutdown (this=<optimized out>, context=..., timeout=0x2d)
at /home/ron/work/artdaqPrj/demo25-kurt2/build_slf7.x86_64/artdaq/artdaq/Application/Commandable_sm.cpp:494
#36 0x00007fc5b3fe747c in artdaq::CommandableContext::shutdown (timeout=0x2d, this=0x7ffccd6b8b18)
at /home/ron/work/artdaqPrj/demo25-kurt2/build_slf7.x86_64/artdaq/artdaq/Application/Commandable_sm.h:296
#37 artdaq::Commandable::shutdown (this=0x7ffccd6b8b10, timeout=0x2d) at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/Application/Commandable.cc:176
#38 0x00007fc322743016 in artdaq::shutdown_::execute_ (this=0x13edf30, paramList=...)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/ExternalComms/xmlrpc_commander.cc:614
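
One possible reading of frames #26 to #28, which the ticket does not confirm, is that check_pending_buffers reached GetSentTokenCount through a sender object that was no longer valid during shutdown, so the atomic load touched bad memory. The sketch below only illustrates that failure mode and one defensive pattern against it. TokenSender and EventManagerSketch are hypothetical stand-ins, not artdaq classes, and this is not a description of the change made in artdaq 93a83a8.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a request/token sender; only the counter is modeled.
class TokenSender {
public:
    void SendToken() { tokens_sent_.fetch_add(1); }
    std::size_t GetSentTokenCount() const { return tokens_sent_.load(); }

private:
    std::atomic<std::size_t> tokens_sent_{0};
};

// Hypothetical stand-in for an event manager that owns the sender.
class EventManagerSketch {
public:
    EventManagerSketch() : sender_(new TokenSender()) {}

    void SendToken() {
        if (sender_) sender_->SendToken();
    }

    // Models the sender being torn down before the final flush runs.
    void ReleaseSender() { sender_.reset(); }

    // Guarded read: reports 0 instead of dereferencing a null pointer.
    std::size_t SentTokenCount() const {
        return sender_ ? sender_->GetSentTokenCount() : 0;
    }

private:
    std::unique_ptr<TokenSender> sender_;
};
```

A null check of this kind only covers the case where the owner has reset its own pointer. It would not help if another thread destroyed the object concurrently, which would need locking or shared ownership instead.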

Ron Rechenmacher wrote on 08/06/2018 09:55 PM:

The corresponding trace entries are:
342 08-06 21:45:27.652980 0 21562 30677 43 SharedMemoryManager err . Calling default signal handler
344 08-06 21:45:27.567103 85877 21562 30677 43 SharedMemoryManager err . A signal of type 11 was caught by SharedMemoryManager. Detaching all Shared Memory segments, then proceeding with default handlers!
345 08-06 21:45:27.567028 75 21562 30677 43 EventBuilder1_SharedMemoryEventManager dbg . endOfData: Flushing 1 stale events from the SharedMemoryEventManager.
346 08-06 21:45:27.567022 6 21562 30677 43 EventBuilder1_SharedMemoryEventManager dbg . SharedMemoryEventManager::endOfData
347 08-06 21:45:27.567002 20 21562 30677 43 EventBuilder1_DataReceiverCore dbg . shutdown: Calling EventStore::endOfData
348 08-06 21:45:27.566973 29 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_0_and_20_RECV: End of Destructor
349 08-06 21:45:27.566764 209 21562 24945 27 EventBuilder1_TCPSocketTransfer nfo . listen_: Shutting down connection listener
350 08-06 21:45:27.243201 323563 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_0_and_20_RECV: Shutting down TCPSocketTransfer
351 08-06 21:45:27.243200 1 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_1_and_20_RECV: End of Destructor
352 08-06 21:45:27.243198 2 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_1_and_20_RECV: Shutting down TCPSocketTransfer
353 08-06 21:45:27.243197 1 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_2_and_20_RECV: End of Destructor
354 08-06 21:45:27.243196 1 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_2_and_20_RECV: Shutting down TCPSocketTransfer
355 08-06 21:45:27.243196 0 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_3_and_20_RECV: End of Destructor
356 08-06 21:45:27.243196 0 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_3_and_20_RECV: Shutting down TCPSocketTransfer
357 08-06 21:45:27.243195 1 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_4_and_20_RECV: End of Destructor
358 08-06 21:45:27.243194 1 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_4_and_20_RECV: Shutting down TCPSocketTransfer
359 08-06 21:45:27.243194 0 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_5_and_20_RECV: End of Destructor
360 08-06 21:45:27.243193 1 21562 30677 43 EventBuilder1_TCPSocketTransfer dbg . transfer_between_5_and_20_RECV: Shutting down TCPSocketTransfer
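
Entries 344 and 342 above describe a handler that catches signal 11, detaches shared memory, and then hands the signal on to the handlers that were installed before it. The sketch below shows one generic POSIX way to write such a "clean up, then forward" handler with sigaction. It is an illustration under stated assumptions, not the SharedMemoryManager code: all function and variable names are hypothetical, the cleanup is reduced to setting a flag, and SIGUSR1 would be used in place of SIGSEGV when exercising it so that the process survives.

```cpp
#include <cassert>
#include <csignal>
#include <signal.h>

// Flags written from the handlers; sig_atomic_t keeps the writes signal-safe.
volatile std::sig_atomic_t cleanup_ran = 0;
volatile std::sig_atomic_t earlier_ran = 0;

// Disposition that was in effect before the cleanup handler was installed.
struct sigaction previous_action;

// Stand-in for a handler installed earlier by some other library.
void earlier_handler(int) { earlier_ran = 1; }

// Cleanup handler: do the cleanup, restore the earlier disposition, and
// re-raise so that the earlier handler (or the default action) takes over.
void cleanup_then_forward(int signum) {
    cleanup_ran = 1;  // a real handler would detach shared memory here
    sigaction(signum, &previous_action, nullptr);
    raise(signum);  // delivered once this handler returns and unblocks signum
}

// Hypothetical helper: installs the stand-in "earlier" handler.
void install_earlier_handler(int signum) {
    struct sigaction earlier {};
    earlier.sa_handler = earlier_handler;
    sigemptyset(&earlier.sa_mask);
    sigaction(signum, &earlier, nullptr);
}

// Hypothetical helper: installs the cleanup handler and remembers what it replaced.
void install_cleanup_handler(int signum) {
    struct sigaction ours {};
    ours.sa_handler = cleanup_then_forward;
    sigemptyset(&ours.sa_mask);
    sigaction(signum, &ours, &previous_action);
}
```

With this pattern the earlier handler still runs in the thread that received the signal, after the cleanup has finished. If that earlier handler itself starts process-exit cleanup, that work runs while other threads may still be active.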

Ron Rechenmacher wrote on 08/06/2018 09:52 PM:

I'm getting one of these practically every time the run terminates/shuts down.

If the BR terminates and closes a connection to the EB, will it cause something like this?

Core was generated by `EventBuilderMain -c id: 5235 commanderPluginType: xmlrpc application_name: Even'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007fc016631b20 in cling::Interpreter::runAndRemoveStaticDestructors() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCling.so
[Current thread is 1 (Thread 0x7fc5b5f71f00 (LWP 21562))]
(gdb) bt
#0 0x00007fc016631b20 in cling::Interpreter::runAndRemoveStaticDestructors() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCling.so
#1 0x00007fc0165d9ed6 in TCling::ResetGlobals() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCling.so
#2 0x00007fc5b4a7a953 in TROOT::EndOfProcessCleanups() () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCore.so
#3 0x00007fc5b4b802fd in TUnixSystem::Exit(int, bool) () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCore.so
#4 0x00007fc5b4b840b2 in TUnixSystem::DispatchSignals(ESignals) () from /mu2e/ups/root/v6_12_04e/Linux64bit+3.10-2.17-e15-prof/lib/libCore.so
#5 <signal handler called>
#6 0x00007fc5b0644a3d in poll () from /lib64/libc.so.6
#7 0x00007fc321acac3a in waitForConnection (listenSocketP=0x13ee4f0, listenSocketP=0x13ee4f0, errorP=0x7ffccd6b1e60, interruptedP=<synthetic pointer>)
at socket_unix.c:694
#8 chanSwitchAccept (chanSwitchP=<optimized out>, channelPP=0x7ffccd6b1e68, channelInfoPP=0x7ffccd6b1e70, errorP=0x7ffccd6b1e60) at socket_unix.c:804
#9 0x00007fc321ac1a7f in ChanSwitchAccept (chanSwitchP=0x13ee510, channelPP=0x7ffccd6b1e68, channelInfoPP=0x7ffccd6b1e70, errorP=<optimized out>) at chanswitch.c:159
#10 0x00007fc321ac94ef in acceptAndProcessNextConnection (errorP=0x7ffccd6b1e58, outstandingConnListP=0x13ee3e0, serverP=0x13ee4c0) at server.c:1191
#11 serverRun2 (errorP=0x7ffccd6b1e58, serverP=0x13ee4c0) at server.c:1242
#12 ServerRun (serverP=serverP@entry=0x13ee4c0) at server.c:1280
#13 0x00007fc32250a3f8 in xmlrpc_c::setupSignalsAndRunAbyss (abyssServerP=0x13ee4c0) at server_abyss.cpp:760
#14 0x00007fc32250b219 in xmlrpc_c::serverAbyss_impl::run (this=<optimized out>) at server_abyss.cpp:771
#15 0x00007fc32250b6bd in xmlrpc_c::serverAbyss::run (this=<optimized out>) at server_abyss.cpp:873
#16 0x00007fc3227361eb in artdaq::xmlrpc_commander::run_server (this=0x13ea160)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq/artdaq/ExternalComms/xmlrpc_commander.cc:1133
#17 0x0000000000414d11 in main (argc=<optimized out>, argv=<optimized out>)
at /home/ron/work/artdaqPrj/demo25-kurt2/srcs/artdaq_mpich_plugin/artdaq-mpich-plugin/Application/EventBuilderMain.cc:67
(gdb)

--
Ron Rechenmacher
Engineer, Group Leader - Real-Time Software Infrastructure
Fermi National Accelerator Laboratory
Batavia, IL 60510


