Bug #20976

multiple_art_processes_example broken

Added by Eric Flumerfelt about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 09/28/2018
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

Using artdaq v3_03_00.
During the multiple-run test of multiple_art_processes_example, the following messages were seen:

2018-09-28 13:05:36 -0500: %MSG-i component02_CommandableInterface:  Early pre-events Commandable.cc:94
2018-09-28 13:05:36 -0500: Stop transition started
2018-09-28 13:05:36 -0500: %MSG
2018-09-28 13:05:36 -0500: %MSG-i component01_CommandableInterface:  Early pre-events Commandable.cc:94
2018-09-28 13:05:36 -0500: Stop transition started
2018-09-28 13:05:36 -0500: %MSG
2018-09-28 13:05:36 -0500: %MSG-i component02_BoardReaderCore:  Early pre-events BoardReaderCore.cc:546
2018-09-28 13:05:36 -0500: Stopping run 22 after 75 fragments.
2018-09-28 13:05:36 -0500: %MSG
2018-09-28 13:05:36 -0500: %MSG-i component02_BoardReaderCore:  Early pre-events BoardReaderCore.cc:546
2018-09-28 13:05:36 -0500: Completed the Stop transition for run 22
2018-09-28 13:05:36 -0500: %MSG
2018-09-28 13:05:36 -0500: %MSG-i component01_BoardReaderCore:  Early pre-events BoardReaderCore.cc:546
2018-09-28 13:05:36 -0500: Stopping run 22 after 75 fragments.
2018-09-28 13:05:36 -0500: %MSG
2018-09-28 13:05:36 -0500: %MSG-i component01_BoardReaderCore:  Early pre-events BoardReaderCore.cc:546
2018-09-28 13:05:36 -0500: Completed the Stop transition for run 22
2018-09-28 13:05:36 -0500: %MSG
2018-09-28 13:06:28 -0500: %MSG-w SharedMemoryManager:  Early pre-events SharedMemoryManager.cc:766
2018-09-28 13:06:28 -0500: Stale Read buffer 3 at 0x7facca1a70b8 ( 100000576 / 100000000 us ) detected! (seqid=294) Resetting... Reading-->Full
2018-09-28 13:06:28 -0500: %MSG
2018-09-28 13:06:28 -0500: %MSG-w SharedMemoryManager:  Early pre-events SharedMemoryManager.cc:766
2018-09-28 13:06:28 -0500: Stale Read buffer 3 at 0x7f52c9ee70b8 ( 100000535 / 100000000 us ) detected! (seqid=294) Resetting... Reading-->Full
2018-09-28 13:06:28 -0500: %MSG
2018-09-28 13:06:37 -0500: %MSG-i EventBuilder2_CommandableInterface:  Early pre-events Commandable.cc:94
2018-09-28 13:06:37 -0500: Stop transition started
2018-09-28 13:06:37 -0500: %MSG

DAQInterface failed to stop the second run: it reported a timeout while sending the stop transition to component01.
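
When checking other run logs for the same symptom, the warning lines can be picked out mechanically. The following is a minimal Python sketch, not part of artdaq or DAQInterface; the names `StaleRead` and `find_stale_reads` are hypothetical. It assumes the warning text has exactly the shape shown in the log above, and it assumes the two numbers in parentheses are the time since the buffer was last touched and the stale-buffer limit, both in microseconds (the log itself does not label them).

```python
import re
from typing import List, NamedTuple

# Matches the warning text as it appears in the log excerpt above.
STALE_RE = re.compile(
    r"Stale Read buffer (?P<buffer>\d+) at (?P<addr>0x[0-9a-fA-F]+) "
    r"\(\s*(?P<elapsed>\d+)\s*/\s*(?P<limit>\d+)\s*us\s*\) detected! "
    r"\(seqid=(?P<seqid>\d+)\)"
)


class StaleRead(NamedTuple):
    timestamp: str   # text preceding the warning on the same line
    buffer: int      # buffer index reported in the warning
    address: str     # buffer address as printed (hex string)
    elapsed_us: int  # ASSUMPTION: first number = time since last touch
    limit_us: int    # ASSUMPTION: second number = stale-buffer limit
    seqid: int       # sequence ID reported in the warning


def find_stale_reads(log_text: str) -> List[StaleRead]:
    """Return one StaleRead per 'Stale Read buffer' warning in log_text."""
    hits = []
    for line in log_text.splitlines():
        m = STALE_RE.search(line)
        if m is None:
            continue
        # Everything before the match is the timestamp prefix plus ": ".
        timestamp = line[: m.start()].strip().rstrip(":")
        hits.append(
            StaleRead(
                timestamp=timestamp,
                buffer=int(m.group("buffer")),
                address=m.group("addr"),
                elapsed_us=int(m.group("elapsed")),
                limit_us=int(m.group("limit")),
                seqid=int(m.group("seqid")),
            )
        )
    return hits


# Three lines copied from the log excerpt in this report.
SAMPLE = """\
2018-09-28 13:06:28 -0500: %MSG-w SharedMemoryManager:  Early pre-events SharedMemoryManager.cc:766
2018-09-28 13:06:28 -0500: Stale Read buffer 3 at 0x7facca1a70b8 ( 100000576 / 100000000 us ) detected! (seqid=294) Resetting... Reading-->Full
2018-09-28 13:06:28 -0500: %MSG
"""

if __name__ == "__main__":
    for hit in find_stale_reads(SAMPLE):
        overshoot = hit.elapsed_us - hit.limit_us
        print(hit.timestamp, "buffer", hit.buffer, "seqid", hit.seqid,
              "over limit by", overshoot, "us")
```

Passing the contents of a run's log file to `find_stale_reads` gives a quick list of which buffers and sequence IDs were reset, which can then be compared against the time of the failed stop transition.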


Related issues

Related to artdaq - Bug #21077: SharedMemoryFragmentManager issues with multiple readers (Closed, 10/09/2018)

Related to artdaq - Bug #21075: Broadcast Buffers reset before seen by process (Closed, 09/28/2018)

History

#1 Updated by Eric Flumerfelt about 1 year ago

Command line was:

killall -9 art;treset;ipcrm -a;reset;./run_demo.sh --config multiple_art_processes_example --comps component{01..02} -- --runs 4 --runduration 50

on ironwork

#2 Updated by Eric Flumerfelt about 1 year ago

The example works when DAQInterface is forced to use TCPSocket transfers.

#3 Updated by Eric Flumerfelt about 1 year ago

  • Related to Bug #21075: Broadcast Buffers reset before seen by process added

#4 Updated by Eric Flumerfelt about 1 year ago

  • Related to Bug #21077: SharedMemoryFragmentManager issues with multiple readers added

#5 Updated by Eric Flumerfelt about 1 year ago

  • Status changed from New to Resolved
  • Parent task set to #21075

I merged artdaq_core/feature/SMM_DontResetUnseenBroadcasts and artdaq-core/feature/SharedMemoryReader_GetBufferForReading_LimitedRetries into working/Issue20976, and was able to run the example successfully using ShmemTransfer.

#6 Updated by Eric Flumerfelt about 1 year ago

  • Parent task deleted (#21075)

#7 Updated by Eric Flumerfelt about 1 year ago

  • Related to Bug #21075: Broadcast Buffers reset before seen by process added

#8 Updated by Eric Flumerfelt about 1 year ago

  • Target version set to artdaq_core v3_04_03

#9 Updated by John Freeman about 1 year ago

  • Status changed from Resolved to Work in progress

It seems the stale read buffer problem still crops up, despite my using the head of release/v3_04_03 (b54b523f5a50d132b54774a9440124aa40365ca2) in artdaq-core. Details are as follows:

  • The installation I'm working with is on woof, in /home/jcfree/scratch/artdaq-demo_test_artdaq-core_shmem; the installation was performed using the quick-mrb-start.sh script in that directory, which I modified to use release branches where possible.
  • I used variants of the command listed above (the one which begins with killall): I added the --no_om argument to take online monitoring out of the picture, and at one point used --runs 10 rather than --runs 4.
  • I saw Stale Read buffer warnings, accompanied by timeouts on DAQInterface's stop transition, for runs 16 and 22 (details can be found in the run records directory, /home/jcfree/scratch/artdaq-demo_test_artdaq-core_shmem/run_records)

#10 Updated by John Freeman about 1 year ago

The problem seems to persist. I've tested the changes made to artdaq-core and artdaq in the last day: specifically, artdaq-core commit 914b0221c0f3b1b5021569bdd95841a2df07ad65 (the HEAD of release/v3_04_03_WithHotfixes) and artdaq commit f92af7c5d2bdb442e620b96a2722e2a99d2c5606 (the HEAD of release/v3_03_01_WithHotfixes). Details:

  • Installation area used for the tests in this entry is woof:/home/jcfree/scratch/artdaq-demo_test_artdaq-core_shmem_try2; installation performed by quick-mrb-start.sh in the directory.
  • Runs 10 and 13 both had messages of the form
    2018-10-17 15:58:00 -0500: %MSG-w SharedMemoryManager:  Early pre-events SharedMemoryManager.cc:800
    2018-10-17 15:58:00 -0500: Stale Read buffer 1 at 0x7fe9b3b2e068 ( 100000288 / 100000000 us ) detected! (seqid=582) Resetting... Reading-->Full
    

    resulting in a timeout on the stop transition sent to the boardreaders.
  • Further info can be found in the run records directory, /home/jcfree/scratch/artdaq-demo_test_artdaq-core_shmem_try2/run_records

#11 Updated by Eric Flumerfelt about 1 year ago

  • Status changed from Work in progress to Closed

