Feature #23640

Monitoring of online monitors

Added by Gennadiy Lukhanin 2 months ago. Updated 12 days ago.

Status:
Reviewed
Priority:
Normal
Category:
-
Target version:
-
Start date:
11/20/2019
Due date:
% Done:

0%

Estimated time:
Experiment:
-
Co-Assignees:
Duration:

Description

When online monitoring software does not disconnect cleanly from the dispatcher process, the resources associated with that online monitor inside the Dispatcher process are not cleaned up, which prevents subsequent reconnections of online monitors with the same unique_label. The dispatcher process could clean up stale online monitoring connections itself: by monitoring the rate at which each online monitor reads events and proactively disconnecting monitors that time out, it would make subsequent reconnections with the same unique_label possible.
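
The timeout-based cleanup proposed here could look roughly like the sketch below. It is only an illustration of the idea, not artdaq code: the class MonitorRegistry, its member functions, and the timeout parameter are all hypothetical names. The registry remembers when each unique_label last read an event and drops any label that has been idle for longer than the timeout, so the same label can register again.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical registry of online monitors keyed by unique_label.
// Not part of artdaq; a sketch of timeout-based cleanup only.
class MonitorRegistry {
public:
    using Clock = std::chrono::steady_clock;

    explicit MonitorRegistry(std::chrono::milliseconds timeout) : timeout_(timeout) {}

    // Returns false if the label is already registered
    // (the "Unique label already exists!" situation).
    bool register_monitor(const std::string& label, Clock::time_point now) {
        return last_read_.emplace(label, now).second;
    }

    // Call whenever the monitor with this label reads an event.
    void record_read(const std::string& label, Clock::time_point now) {
        auto it = last_read_.find(label);
        if (it != last_read_.end()) it->second = now;
    }

    // Removes every monitor that has been idle longer than the timeout
    // and returns the labels that were removed.
    std::vector<std::string> reap_stale(Clock::time_point now) {
        std::vector<std::string> removed;
        for (auto it = last_read_.begin(); it != last_read_.end();) {
            if (now - it->second > timeout_) {
                removed.push_back(it->first);
                it = last_read_.erase(it);
            } else {
                ++it;
            }
        }
        return removed;
    }

private:
    std::chrono::milliseconds timeout_;
    std::map<std::string, Clock::time_point> last_read_;
};
```

Passing the current time in as an argument, rather than reading the clock inside the class, keeps the sketch easy to exercise without sleeping. A real dispatcher would presumably call something like reap_stale periodically from its own loop.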

An example of the error message produced by the dispatcher process is given below.

%MSG-i TransferWrapper:  DAQ 19-Nov-2019 18:38:29 CST Booted TransferWrapper.cc:90
Response from dispatcher is "Unable to create a Transfer plugin with the FHiCL code "outputs:{out:{module_type:"TransferOutput" transfer_plugin:{destination_rank:6 first_event_builder_rank:0 max_fragment_size_words:3.3554432e7 shm_key:1.078400083e9 transferPluginType:"Shmem" unique_label:"ExampleOnlineMonitor"}}} path:["out"] physics:{} unique_label:"ExampleOnlineMonitor"", a new monitor has not been registered
Exception: ---- DispatcherCore BEGIN
  Unique label already exists!
---- DispatcherCore END
" 
%MSG
%MSG-w TransferWrapper:  DAQ 19-Nov-2019 18:38:29 CST Booted TransferWrapper.cc:99
Error in TransferWrapper: attempt to register with dispatcher did not result in the "Success" response
%MSG
%MSG-i TransferWrapper:  DAQ 19-Nov-2019 18:38:29 CST Booted TransferWrapper.cc:85
Attempting to register this monitor ("ExampleOnlineMonitor_2crates") with the dispatcher aggregator
%MSG
%MSG-i TransferWrapper:  DAQ 19-Nov-2019 18:38:29 CST Booted TransferWrapper.cc:90
Response from dispatcher is "Unable to create a Transfer plugin with the FHiCL code "outputs:{out:{module_type:"TransferOutput" transfer_plugin:{destination_rank:6 first_event_builder_rank:0 max_fragment_size_words:3.3554432e7 shm_key:1.078400083e9 transferPluginType:"Shmem" unique_label:"ExampleOnlineMonitor"}}} path:["out"] physics:{} unique_label:"ExampleOnlineMonitor"", a new monitor has not been registered
Exception: ---- DispatcherCore BEGIN
  Unique label already exists!
---- DispatcherCore END
" 
%MSG
%MSG-w TransferWrapper:  DAQ 19-Nov-2019 18:38:29 CST Booted TransferWrapper.cc:99
Error in TransferWrapper: attempt to register with dispatcher did not result in the "Success" response

Associated revisions

Revision 4e3cce15 (diff)
Added by Eric Flumerfelt 26 days ago

This commit satisfies Issue #23640 by making three changes:
1. DispatcherCore always sets "restart_crashed_art_processes" to false.
If any Dispatcher art process fails, it will have to be restarted via
another register_monitor command.
2. TransferOutput will self-destruct after reaching its send retry
count. This allows the Dispatcher art process to quickly exit when the
online monitor art process disappears
3. Shmem_transfer returns kErrorNotRequiringException when it cannot
connect to the shared memory segment (instead of failing silently with
kSuccess). This change does not break the broken_transfer_driver tests,
but more testing should probably be done to ensure there are no other
side-effects.
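
Items 2 and 3 of this commit message can be illustrated with a minimal sketch. It assumes a send function that reports failure instead of silently reporting success, and a caller that gives up once a retry limit is reached. The enum below only borrows the status names mentioned in the commit message; send_with_retries and its parameters are hypothetical and do not reproduce the actual TransferOutput code.

```cpp
#include <cstddef>
#include <functional>

// Status names borrowed from the commit message; this local enum is a
// stand-in for illustration, not the artdaq definition.
enum class CopyStatus { kSuccess, kErrorNotRequiringException };

// Hypothetical helper: calls try_send up to max_attempts times.
// Returns true on the first success, false once the limit is reached,
// at which point the caller can shut itself down.
bool send_with_retries(const std::function<CopyStatus()>& try_send,
                       std::size_t max_attempts) {
    for (std::size_t attempt = 0; attempt < max_attempts; ++attempt) {
        if (try_send() == CopyStatus::kSuccess) {
            return true;
        }
    }
    return false;
}
```

The retry limit only takes effect if a failed connection is reported as an error. If the transfer returned kSuccess when the shared memory segment was unavailable, the loop would stop after the first attempt and the sender would never notice that the monitor had gone away.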

History

#1 Updated by Eric Flumerfelt 26 days ago

  • Status changed from New to Resolved

On artdaq:feature/23640_TransferOutput_DetectErrorsAndClose, I have made the changes needed for the Dispatcher to detect that the online monitor has gone away and clear its unique_label list appropriately.

#2 Updated by John Freeman 12 days ago

  • Status changed from Resolved to Reviewed

I modified the ToyDump module so that after 100 events it would segfault via "raise(SIGSEGV);". Then, in run 3395 on mu2edaq13 (/home/jcfree/run_records/3395), while the system was in the running state, I launched

art -c ${MRB_TOP}/srcs/artdaq_demo/tools/fcl/TransferInputShmem.fcl

a few times (with the port, of course, correctly set), ending it each time either by hitting Ctrl-c before the 100-event limit was reached or by letting it run until the segfault. Every time I launched it, it registered correctly, and I saw sensible data from the ToyDump module.


