Support #23111

Restarting a crashed EventBuilder doesn't automatically result in data flowing through that EB again

Added by Kurt Biery 3 months ago. Updated about 2 months ago.

Status: Closed
Priority: Normal
Category: Needed Enhancements
Target version:
Start date: 08/13/2019
Due date:
% Done: 0%
Estimated time:
Experiment: -
Duration:

Description

Here are the steps that I used for this:
• installed the demo on sbnd-daq33
◦ wget https://cdcvs.fnal.gov/redmine/projects/artdaq-demo/repository/revisions/develop/raw/tools/quick-mrb-start.sh
◦ chmod +x quick-mrb-start.sh
◦ ./quick-mrb-start.sh --tag=v3_06_00
• added export DAQINTERFACE_PROCESS_MANAGEMENT_METHOD="direct" to the DAQInterface/user_sourcefile_example file
• modified the mediumsystem_with_routing_master configuration to correctly handle multicasts over the private network interface on sbnd-daq33
• switched to the develop branch in artdaq-utilities-daqinterface to pick up the changes for a list of PRODUCTS areas
• added a fourth and fifth EventBuilder to the mediumsystem_with_routing_master boot.txt file; created the FCL file for EventBuilder5 in that area
• reduced the event creation rate to 2 Hz
• reduced the number of buffers in the EventBuilders from 20 to 5
• started a half-hour run with the following command
◦ sh ./run_demo.sh --config mediumsystem_with_routing_master --bootfile `pwd`/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component02 component03 component04 component05 component06 component07 component08 component09 component10 --runduration 1800 --partition 0 --no_om
• killed the EventBuilder3 process with "kill" (it’s useful to note the time of day when you do this)
• confirmed that data continued to flow through the system
• restarted EB3 with a command like the following:
◦ eventbuilder -c "id: 5237 commanderPluginType: xmlrpc rank: 12 application_name: EventBuilder3 partition_number: 5" &
• xmlrpc http://localhost:5237/RPC2 daq.status
• cd run_records/<run number>
• export EB3_CFG=`cat EventBuilder3.fcl | grep -v '^\#' | sed 's/\:/\\\\:/g' | sed 's/{/\\\\{/g' | sed 's/}/\\\\}/g' | sed 's/"/\\\\"/g' | sed 's/,/\\\\,/g'`
• xmlrpc http://localhost:5237/RPC2 daq.init "$EB3_CFG"
• xmlrpc http://localhost:5237/RPC2 daq.start <run number>
• note that EB3 never gets any events after it is restarted (e.g. “tshowt | grep Releasing | grep EventBuilder3 | more”)

• modify DAQInterface/settings to set “transfer_plugin_to_use: TCPSocket”
• re-run the test listed above (starting with the run_demo command)
• note that events now arrive at EB3 after it is restarted, although this condition may be only temporary…
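The EB3_CFG step above is the least obvious part of the procedure, so here is a minimal Python sketch of what that grep/sed pipeline appears to do. It assumes that, once the backticks have consumed one level of backslash escaping, each sed stage prefixes its character with a single backslash. The function name `escape_fcl` is hypothetical and is not part of artdaq or DAQInterface.

```python
# Characters that the sed stages in the EB3_CFG command escape.
SPECIAL_CHARS = ':{}",'


def escape_fcl(text: str) -> str:
    """Drop lines that start with '#' and backslash-escape : { } " and ,

    Assumption: this mirrors the shell pipeline only under the reading that
    each special character ends up prefixed by exactly one backslash.
    """
    # Equivalent of: grep -v '^#'
    kept = [line for line in text.splitlines() if not line.startswith("#")]
    joined = "\n".join(kept)
    # Equivalent of the chained sed substitutions. A single pass is enough
    # because no stage produces a character that a later stage escapes.
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in joined)
```

Passing the contents of EventBuilder3.fcl through such a function should give a string comparable to $EB3_CFG, which is then handed to daq.init.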

Related issues

Related to artdaq - Bug #23050: The number of EB buffers seems to affect whether a demo system continues to take data after an EB crash (New, 08/02/2019)

Related to artdaq - Support #21621: Notes from testing the killing of an EB process on a teststand (Assigned, 01/11/2019)

Related to artdaq - Feature #23422: Automatically test various TransferPlugin failure modes (Reviewed, 10/14/2019)

History

#1 Updated by Kurt Biery 3 months ago

  • Related to Bug #23050: The number of EB buffers seems to affect whether a demo system continues to take data after an EB crash added

#2 Updated by Kurt Biery 3 months ago

  • Related to Support #21621: Notes from testing the killing of an EB process on a teststand added

#3 Updated by Eric Flumerfelt 3 months ago

While investigating this issue, I have made the following branches:

  • artdaq-core:bugfix/23111_SharedMemoryManager_ReconnectionImprovements
    • In SharedMemoryFragmentManager::WriteFragment, check if shared memory is connected, and attempt to reconnect before writing
    • Added explicit timeout parameter and boolean status to SharedMemoryManager::Attach. Returning false means that shared memory is not connected
    • Reset manager_id_ to -1 in Detach
    • Added Reattach test case to SharedMemoryFragmentManager_t, which tests both sender and receiver disconnection/reconnection
  • artdaq:bugfix/23111_ShmemTransfer_isRunningFix
    • Added call to SharedMemoryManager::IsEndOfData to Shmem_transfer::isRunning.
    • Use isRunning to determine shared memory validity when sending a Fragment. NOTE: The code currently drops data automatically if not connected. We may want to pass this status back up to DataSenderManager and let it make its own determination (i.e. based on config)
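The reconnect-before-write behavior listed above can be illustrated with a small Python toy model. This is only a sketch of the idea, not the artdaq-core C++ code: `FakeSegment`, `Sender`, and all method names are hypothetical stand-ins. It shows an attach call that takes a timeout and returns a boolean, a detach that resets the id to -1, and a write that tries to reattach first and drops the data if that fails.

```python
import time


class FakeSegment:
    """Hypothetical stand-in for a shared memory segment.

    Setting `available` to False simulates the receiving process going away.
    """

    def __init__(self):
        self.available = True
        self.written = []


class Sender:
    """Toy sender that reattaches before writing (illustrative only)."""

    def __init__(self, segment):
        self.segment = segment
        self.manager_id = -1  # -1 means "not attached"

    def attach(self, timeout_s=0.0, poll_s=0.01):
        """Try to attach until the timeout expires; return True on success."""
        deadline = time.monotonic() + timeout_s
        while True:
            if self.segment.available:
                self.manager_id = 0
                return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)

    def detach(self):
        """Forget the attachment so that a later attach starts cleanly."""
        self.manager_id = -1

    def is_attached(self):
        return self.manager_id != -1 and self.segment.available

    def write_fragment(self, fragment, timeout_s=0.05):
        """Write if connected, reattaching first if needed.

        Returns False when the fragment was dropped because the segment
        could not be reattached within the timeout.
        """
        if not self.is_attached():
            self.detach()
            if not self.attach(timeout_s):
                return False
        self.segment.written.append(fragment)
        return True
```

Returning the status from `write_fragment` is one way the caller could decide for itself whether to drop or retry, along the lines of the NOTE above.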

#4 Updated by Eric Flumerfelt 3 months ago

I have started to implement a test application which will eventually be able to detect issues like this earlier in our workflow on artdaq:feature/23111_BrokenTransferTest.

#5 Updated by Gennadiy Lukhanin 2 months ago

  • Status changed from New to Feedback

I was not able to reproduce this issue on mu2edaq11. The restarted EB3 process reconnected to the shared memory and the tshow command reported that new events were released to art. I repeated this test 4 times. The behavior remained the same after I switched to the bug fix branches, rebased them to the corresponding develop branches, rebuilt the code and re-ran the tests. Code changes appear to be addressing the issue as described above. However, repeating the kill-restart sequence several times (say 2 or more) causes the majority of events to be built by EB3.

14:43:44mu2etrg@mu2edaq11:~/issue23111
$ tshow | grep Releasing | grep EventBuilder
   19 1568144627995591  56480  56683  40  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 975 in buffer 19 to art, event_size=4584, buffer_size=16777216
   97 1568144627495314  56480  56686   3  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 974 in buffer 18 to art, event_size=4584, buffer_size=16777216
  283 1568144626994137  56480  56687  25  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 973 in buffer 17 to art, event_size=4584, buffer_size=16777216
  360 1568144626494598  56480  56685  27  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 972 in buffer 16 to art, event_size=4584, buffer_size=16777216
  548 1568144625993748  56480  56688   2  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 971 in buffer 15 to art, event_size=4584, buffer_size=16777216
  626 1568144625494719  56480  56690  36  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 970 in buffer 14 to art, event_size=4584, buffer_size=16777216
  812 1568144624993043  56480  56691  48  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 969 in buffer 13 to art, event_size=4584, buffer_size=16777216
  891 1568144624493081  56480  56688  25  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 968 in buffer 12 to art, event_size=4584, buffer_size=16777216
 1080 1568144623992302  56480  56687   2  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 967 in buffer 11 to art, event_size=4584, buffer_size=16777216
 1158 1568144623492598  56480  56691  48  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 966 in buffer 10 to art, event_size=4584, buffer_size=16777216
 1345 1568144622991292  56480  56691  48  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 965 in buffer 9 to art, event_size=4584, buffer_size=16777216
 1424 1568144622490993  56480  56688  25  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 964 in buffer 8 to art, event_size=4584, buffer_size=16777216
 1615 1568144621991348  56480  56684  18  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 963 in buffer 7 to art, event_size=4584, buffer_size=16777216
 1692 1568144621491253  56480  56686  26  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 962 in buffer 6 to art, event_size=4584, buffer_size=16777216
 1883 1568144620991271  56480  56693   7  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 961 in buffer 5 to art, event_size=4584, buffer_size=16777216
 1960 1568144620491319  56480  56693  26  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 960 in buffer 4 to art, event_size=4584, buffer_size=16777216
 2150 1568144619990382  56480  56687  13  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 959 in buffer 3 to art, event_size=4584, buffer_size=16777216
 2229 1568144619489415  56480  56691  48  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 958 in buffer 2 to art, event_size=4584, buffer_size=16777216
 2417 1568144618989737  56480  56686  44  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 957 in buffer 1 to art, event_size=4584, buffer_size=16777216
 2493 1568144618489855  56480  56688  18  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 956 in buffer 0 to art, event_size=4584, buffer_size=16777216
 2683 1568144617988387  56480  56687   1  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 955 in buffer 19 to art, event_size=4584, buffer_size=16777216
 2762 1568144617487753  56480  56684  20  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 954 in buffer 18 to art, event_size=4584, buffer_size=16777216
 2951 1568144616989116  56480  56691  38  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 953 in buffer 17 to art, event_size=4584, buffer_size=16777216
 3030 1568144616487371  56480  56686   3  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 952 in buffer 16 to art, event_size=4584, buffer_size=16777216
 3220 1568144615987813  56480  56684  48  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 951 in buffer 15 to art, event_size=4584, buffer_size=16777216
 3298 1568144615487163  56480  56688  20  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 950 in buffer 14 to art, event_size=4584, buffer_size=16777216
 3488 1568144614987305  56480  56691  53  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 949 in buffer 13 to art, event_size=4584, buffer_size=16777216
 3567 1568144614485988  56480  56688  54  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 948 in buffer 12 to art, event_size=4584, buffer_size=16777216
 3757 1568144613985956  56480  56686   0  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 947 in buffer 11 to art, event_size=4584, buffer_size=16777216
 3836 1568144613485424  56480  56683  13  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 946 in buffer 10 to art, event_size=4584, buffer_size=16777216
 4026 1568144612985276  56480  56688  54  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 945 in buffer 9 to art, event_size=4584, buffer_size=16777216
 4103 1568144612485385  56480  56691  20  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 944 in buffer 8 to art, event_size=4584, buffer_size=16777216
 4294 1568144611985359  56480  56688  54  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 943 in buffer 7 to art, event_size=4584, buffer_size=16777216
 4372 1568144611483958  56480  56683  16  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 942 in buffer 6 to art, event_size=4584, buffer_size=16777216
 4562 1568144610985377  56480  56685  23  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 941 in buffer 5 to art, event_size=4584, buffer_size=16777216
 4641 1568144610482954  56480  56683  12  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 940 in buffer 4 to art, event_size=4584, buffer_size=16777216
 4832 1568144609983161  56480  56688  54  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 939 in buffer 3 to art, event_size=4584, buffer_size=16777216
 4911 1568144609482850  56480  56693  20  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 938 in buffer 2 to art, event_size=4584, buffer_size=16777216
 5099 1568144608983845  56480  56687  50  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 937 in buffer 1 to art, event_size=4584, buffer_size=16777216
 5177 1568144608482802  56480  56685  22  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 936 in buffer 0 to art, event_size=4584, buffer_size=16777216
 5367 1568144607982826  39470  30684   2  EventBuilder5_SharedMemoryEventManager dbg . Releasing event 935 in buffer 2 to art, event_size=4584, buffer_size=16777216
 5445 1568144607481731  39467  30716  24  EventBuilder2_SharedMemoryEventManager dbg . Releasing event 934 in buffer 3 to art, event_size=4584, buffer_size=16777216
 5634 1568144606981747  39466  30675  13  EventBuilder1_SharedMemoryEventManager dbg . Releasing event 933 in buffer 2 to art, event_size=4584, buffer_size=16777216
 5714 1568144606480471  39470  30670  13  EventBuilder5_SharedMemoryEventManager dbg . Releasing event 932 in buffer 1 to art, event_size=4584, buffer_size=16777216
 5905 1568144605981499  39469  30705  14  EventBuilder4_SharedMemoryEventManager dbg . Releasing event 931 in buffer 1 to art, event_size=4584, buffer_size=16777216
 5987 1568144605482537  39466  30683   9  EventBuilder1_SharedMemoryEventManager dbg . Releasing event 930 in buffer 1 to art, event_size=4584, buffer_size=16777216
 6178 1568144604980224  39469  30705  22  EventBuilder4_SharedMemoryEventManager dbg . Releasing event 929 in buffer 0 to art, event_size=4584, buffer_size=16777216
 6275 1568144604479100  39467  30719  20  EventBuilder2_SharedMemoryEventManager dbg . Releasing event 928 in buffer 2 to art, event_size=4584, buffer_size=16777216
 6466 1568144603979769  39467  30724   3  EventBuilder2_SharedMemoryEventManager dbg . Releasing event 927 in buffer 1 to art, event_size=4584, buffer_size=16777216
 6543 1568144603480812  39470  30695  19  EventBuilder5_SharedMemoryEventManager dbg . Releasing event 926 in buffer 0 to art, event_size=4584, buffer_size=16777216
 6734 1568144602979761  39466  30683   7  EventBuilder1_SharedMemoryEventManager dbg . Releasing event 925 in buffer 0 to art, event_size=4584, buffer_size=16777216
 6812 1568144602477953  39467  30721  16  EventBuilder2_SharedMemoryEventManager dbg . Releasing event 924 in buffer 0 to art, event_size=4584, buffer_size=16777216
 7000 1568144601977970  39469  30702   7  EventBuilder4_SharedMemoryEventManager dbg . Releasing event 923 in buffer 4 to art, event_size=4584, buffer_size=16777216
 7079 1568144601477829  39470  30684  26  EventBuilder5_SharedMemoryEventManager dbg . Releasing event 922 in buffer 4 to art, event_size=4584, buffer_size=16777216
 7270 1568144600977309  39470  30670  28  EventBuilder5_SharedMemoryEventManager dbg . Releasing event 921 in buffer 3 to art, event_size=4584, buffer_size=16777216
 7348 1568144600476269  39466  30663   1  EventBuilder1_SharedMemoryEventManager dbg . Releasing event 920 in buffer 4 to art, event_size=4584, buffer_size=16777216
 7538 1568144599977630  39466  30680   3  EventBuilder1_SharedMemoryEventManager dbg . Releasing event 919 in buffer 3 to art, event_size=4584, buffer_size=16777216
 7617 1568144599476337  39469  30698  48  EventBuilder4_SharedMemoryEventManager dbg . Releasing event 918 in buffer 3 to art, event_size=4584, buffer_size=16777216
 7808 1568144598975614  56480  56688  51  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 917 in buffer 19 to art, event_size=4584, buffer_size=16777216
 7884 1568144598475940  56480  56688  51  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 916 in buffer 18 to art, event_size=4584, buffer_size=16777216
 8075 1568144597975049  56480  56683  49  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 915 in buffer 17 to art, event_size=4584, buffer_size=16777216
 8154 1568144597474709  56480  56686  34  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 914 in buffer 16 to art, event_size=4584, buffer_size=16777216
 8343 1568144596974178  56480  56691  24  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 913 in buffer 15 to art, event_size=4584, buffer_size=16777216
 8422 1568144596473887  56480  56685   2  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 912 in buffer 14 to art, event_size=4584, buffer_size=16777216
 8613 1568144595973771  56480  56688  21  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 911 in buffer 13 to art, event_size=4584, buffer_size=16777216
 8690 1568144595475078  39467  30717  28  EventBuilder2_SharedMemoryEventManager dbg . Releasing event 910 in buffer 4 to art, event_size=4584, buffer_size=16777216
 8881 1568144594973573  56480  56693   5  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 909 in buffer 12 to art, event_size=4584, buffer_size=16777216
 8960 1568144594473701  39469  30699  20  EventBuilder4_SharedMemoryEventManager dbg . Releasing event 908 in buffer 2 to art, event_size=4584, buffer_size=16777216
 9147 1568144593972447  56480  56688  27  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 907 in buffer 11 to art, event_size=4584, buffer_size=16777216
 9225 1568144593472511  56480  56693   4  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 906 in buffer 10 to art, event_size=4584, buffer_size=16777216
 9415 1568144592972459  56480  56691  27  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 905 in buffer 9 to art, event_size=4584, buffer_size=16777216
 9495 1568144592472269  56480  56685  17  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 904 in buffer 8 to art, event_size=4584, buffer_size=16777216
 9686 1568144591972293  56480  56688  26  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 903 in buffer 7 to art, event_size=4584, buffer_size=16777216
 9765 1568144591471498  56480  56688  22  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 902 in buffer 6 to art, event_size=4584, buffer_size=16777216
 9956 1568144590970693  56480  56688  19  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 901 in buffer 5 to art, event_size=4584, buffer_size=16777216
10033 1568144590471017  56480  56691  21  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 900 in buffer 4 to art, event_size=4584, buffer_size=16777216
10222 1568144589969799  56480  56693  12  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 899 in buffer 3 to art, event_size=4584, buffer_size=16777216
10299 1568144589470804  56480  56683   7  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 898 in buffer 2 to art, event_size=4584, buffer_size=16777216
10490 1568144588970895  56480  56687  24  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 897 in buffer 1 to art, event_size=4584, buffer_size=16777216
10568 1568144588469801  56480  56685  13  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 896 in buffer 0 to art, event_size=4584, buffer_size=16777216
10757 1568144587970218  56480  56688   0  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 895 in buffer 19 to art, event_size=4584, buffer_size=16777216
10836 1568144587469070  56480  56693  25  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 894 in buffer 18 to art, event_size=4584, buffer_size=16777216
11027 1568144586971894  56480  56691  14  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 893 in buffer 17 to art, event_size=4584, buffer_size=16777216
11106 1568144586468250  56480  56685  22  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 892 in buffer 16 to art, event_size=4584, buffer_size=16777216
11294 1568144585969166  56480  56691  18  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 891 in buffer 15 to art, event_size=4584, buffer_size=16777216
11372 1568144585468254  56480  56687  30  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 890 in buffer 14 to art, event_size=4584, buffer_size=16777216
11561 1568144584968096  56480  56691  24  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 889 in buffer 13 to art, event_size=4584, buffer_size=16777216
11640 1568144584467708  56480  56691  42  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 888 in buffer 12 to art, event_size=4584, buffer_size=16777216
11831 1568144583969050  56480  56687  39  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 887 in buffer 11 to art, event_size=4584, buffer_size=16777216
11909 1568144583466535  56480  56693  24  EventBuilder3_SharedMemoryEventManager dbg . Releasing event 886 in buffer 10 to art, event_size=4584, buffer_size=16777216
14:43:48mu2etrg@mu2edaq11:~/issue23111
$ %MSG-i EventBuilder3_SharedMemoryEventManager:  InRunMap::Running  10-Sep-2019 14:44:12 CDT Sequence ID 1024 SharedMemoryEventManager.cc:1162
EventBuilder3 statistics:
  Event statistics: 80 events released at 1.33323 events/sec, effective data rate = 0.0466274 MB/sec, monitor window = 60.0045 sec, min::max event size = 0.0349731::0.0349731 MB
  Average time per event:  elapsed time = 0.750056 sec
  Fragment statistics: 800 fragments received at 13.3323 fragments/sec, effective data rate = 0.0463019 MB/sec, monitor window = 60.0045 sec, min::max fragment size = 0.000427246::0.0308838 MB
  Event counts: Run -- 349 Total, 0 Incomplete.  Subrun -- 0 Total, 0 Incomplete. 
%MSG
%MSG-i EventBuilder3_SharedMemoryEventManager:  InRunMap::Running  10-Sep-2019 14:44:42 CDT Sequence ID 1084 SharedMemoryEventManager.cc:1162
EventBuilder3 statistics:
  Event statistics: 80 events released at 1.33324 events/sec, effective data rate = 0.0466275 MB/sec, monitor window = 60.0043 sec, min::max event size = 0.0349731::0.0349731 MB
  Average time per event:  elapsed time = 0.750054 sec
  Fragment statistics: 800 fragments received at 13.3324 fragments/sec, effective data rate = 0.046302 MB/sec, monitor window = 60.0043 sec, min::max fragment size = 0.000427246::0
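To put a number on "the majority of events" being built by EB3, the "Releasing event" lines in output like the above can be tallied per EventBuilder. The following is a minimal Python sketch, assuming the line layout shown in this comment; `count_releases` and `share` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... EventBuilder3_SharedMemoryEventManager dbg . Releasing event 975 in buffer 19 to art, ..."
RELEASE_RE = re.compile(
    r"(EventBuilder\d+)_SharedMemoryEventManager\b.*\bReleasing event (\d+)"
)


def count_releases(lines):
    """Count 'Releasing event' lines per EventBuilder name."""
    counts = Counter()
    for line in lines:
        match = RELEASE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def share(counts, name):
    """Fraction of all counted releases that belong to `name`."""
    total = sum(counts.values())
    return counts[name] / total if total else 0.0
```

Feeding the captured tshow lines to `count_releases` and then calling `share(counts, "EventBuilder3")` gives the fraction of released events that went through EB3 in that window.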

#6 Updated by Kurt Biery 2 months ago

I re-ran the tests on sbnd-daq33 and mu2edaq13, and I observed different behavior. (So, I agree that there seems to be something on the mu2edaq cluster that prevents us from reproducing the problem.)

On sbnd-daq33, I reproduced the behavior that I described in the original description: events did not flow through EB3 after it was restarted.
On mu2edaq13, events did seem to flow through EB3 after it was restarted, but those events all seemed to be incomplete.

Just to be clear, this was all done with the v3_06_00 code. I haven't tried any tests with the new code.

I haven't yet figured out why the two systems would behave differently.

For reference, I used artdaq-utilities-daqinterface branch "feature/Issue21769_SBN_Multicast_Tests" in today's tests.

#7 Updated by Kurt Biery 2 months ago

In some initial tests with the new code, I see improved behavior on sbnd-daq33 (I see events flow through EB3 after it is restarted) and improved behavior on mu2edaq13 (I don't see the large numbers of complaints about partial events after the EB3 restart). These tests are with the Autodetect Transfer plugin.

With the TCPSocket transfer plugin on sbnd-daq33, I see a few events get sent to EB3 after it is restarted, but those stop fairly quickly because the link between EB3 and the downstream DataLogger doesn't seem to get restored after EB3 is restarted. (That is a different issue; I merely note it here for completeness.)

I'm tempted to try the new code at protoDUNE. As I may have mentioned, the symptom of the problem that I saw there was that data fragments from the DFO to an EB would not be resumed after an EB was restarted (when using the Autodetect/Shmem transfer plugin).

#8 Updated by Kurt Biery 2 months ago

The test at protoDUNE was successful. That is, when I used the existing artdaq and artdaq_core software, and I killed an EB running on the same host as the DFO, dataflow stopped. When I updated to the software on the bugfix/23111_* branches and repeated the same test, dataflow continued past the killing and restarting of an EB.

A couple of side notes:
  • after an EB is killed and restarted, the time-series plot of the event rate into the EventBuilders that is shown in the protoDUNE Run Control is less smooth. I believe that this is because I've got RoutingDestinationHelper greedily pulling destinations out of the RoundRobin policy (and the tokens are bunched up when the dead EB is restarted). Something to think about is how to return more quickly to the desired routing behavior.
  • I didn't see any problems in the protoDUNE tests when the EB was first killed because the number of buffers in the EBs is configured to be 5. As we noted in Issue #23050, if the number of buffers in the EBs had been greater than 10, then I might have seen problems when the 10 shared memory buffers in the Shmem_transfer between the DFO and the dead EB became full.
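The token-bunching explanation in the first side note can be illustrated with a toy model in Python. This is a sketch under strong assumptions and not the artdaq RoundRobin policy or RoutingDestinationHelper: destinations are taken greedily from a FIFO queue of tokens, each EB returns its token immediately, and the restarted EB posts all of its tokens at once. The names `route_events` and `longest_run` are hypothetical.

```python
from collections import deque


def route_events(tokens, n_events):
    """Assign n_events to EBs by always consuming the oldest token.

    Assumption: the EB hands its token back right away, so the token
    rejoins the back of the queue and the queue order never changes.
    `tokens` must not be empty.
    """
    queue = deque(tokens)
    destinations = []
    for _ in range(n_events):
        eb = queue.popleft()
        destinations.append(eb)
        queue.append(eb)
    return destinations


def longest_run(destinations, name):
    """Length of the longest stretch of consecutive events sent to `name`."""
    best = current = 0
    for dest in destinations:
        current = current + 1 if dest == name else 0
        best = max(best, current)
    return best


# Steady state: tokens from four EBs are interleaved in the queue.
steady = ["EB1", "EB2", "EB4", "EB5"] * 2
# The restarted EB3 posts a token for every one of its buffers at once.
restarted = ["EB3"] * 5
dests = route_events(steady + restarted, 26)
```

In this model the EB3 tokens stay adjacent in the queue, so EB3 keeps receiving events in bursts until something reorders the tokens, which is consistent with the less smooth rate plot described above.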

#9 Updated by Eric Flumerfelt about 2 months ago

  • Target version set to artdaq v3_06_01
  • Assignee set to Eric Flumerfelt
  • Status changed from Feedback to Closed
  • Category set to Needed Enhancements
  • Co-Assignees Gennadiy Lukhanin, Kurt Biery added

#10 Updated by Eric Flumerfelt about 1 month ago

  • Related to Feature #23422: Automatically test various TransferPlugin failure modes added
