Project

General

Profile

Bug #21908

Inconsistent results with 10 1-minute runs using mediumsystem_with_routing_master

Added by John Freeman over 1 year ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Category:
Known Issues
Target version:
Start date:
02/16/2019
Due date:
% Done:

100%

Estimated time:
Experiment:
-
Co-Assignees:
Duration:

Description

As part of testing Issue #21863, I've been performing sets of 10 runs (stop-start-stop-start-etc) of 1 minute, with all processes on mu2edaq12. As mu2edaq12 is known to have back pressure with this configuration, it's not surprising that things don't behave perfectly, but since my intuition is that the buildup of back pressure would always reset to zero on a stop it's a little surprising that the runs behave differently within sets. E.g., the output root files from runs 2226-2235 look like:

-rw-r--r-- 1 jcfree mu2e 2387501693 Feb 15 16:32 /home/jcfree/daqdata/artdaqdemo_r002226_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2648707283 Feb 15 16:34 /home/jcfree/daqdata/artdaqdemo_r002227_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2774810733 Feb 15 16:37 /home/jcfree/daqdata/artdaqdemo_r002228_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 1560547980 Feb 15 16:38 /home/jcfree/daqdata/artdaqdemo_r002229_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4711963 Feb 15 16:40 /home/jcfree/daqdata/artdaqdemo_r002230_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4818143 Feb 15 16:42 /home/jcfree/daqdata/artdaqdemo_r002231_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4724161 Feb 15 16:44 /home/jcfree/daqdata/artdaqdemo_r002232_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4673021 Feb 15 16:45 /home/jcfree/daqdata/artdaqdemo_r002233_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4711963 Feb 15 16:47 /home/jcfree/daqdata/artdaqdemo_r002234_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4818143 Feb 15 16:49 /home/jcfree/daqdata/artdaqdemo_r002235_sr01_1_dl1.root

where the smallness of the later files is due to empty fragments (recall the config uses one ToySimulator in push mode and nine in pull mode). Perhaps even more interesting, from today:
-rw-r--r-- 1 jcfree mu2e 2774806637 Feb 16 11:34 /home/jcfree/daqdata/artdaqdemo_r002236_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2810838256 Feb 16 11:36 /home/jcfree/daqdata/artdaqdemo_r002237_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2792830956 Feb 16 11:39 /home/jcfree/daqdata/artdaqdemo_r002238_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 1937876778 Feb 16 11:41 /home/jcfree/daqdata/artdaqdemo_r002239_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4770625 Feb 16 11:43 /home/jcfree/daqdata/artdaqdemo_r002240_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2766449 Feb 16 11:45 /home/jcfree/daqdata/artdaqdemo_r002241_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    4720065 Feb 16 11:47 /home/jcfree/daqdata/artdaqdemo_r002242_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2466814192 Feb 16 11:50 /home/jcfree/daqdata/artdaqdemo_r002243_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2747083772 Feb 16 11:53 /home/jcfree/daqdata/artdaqdemo_r002244_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2346676599 Feb 16 11:56 /home/jcfree/daqdata/artdaqdemo_r002245_sr01_1_dl1.root

It's important to note here that for both sets of ten runs, the stop timeout was set to a very high value (10 minutes) and therefore the stop went smoothly in all cases. I'm wondering whether there's something in the artdaq code which is causing this run-by-run inconsistency, or if it's something at system level related to resource usage that, say, ganglia would reveal.

History

#1 Updated by Eric Flumerfelt over 1 year ago

  • Priority changed from Low to Normal
  • Assignee set to Eric Flumerfelt
  • Status changed from New to Resolved
  • Category set to Known Issues

This was due to a bug in CommandableFragmentGenerator where the dataBufferDepthBytes_ member was not being correctly reset between runs. After a few runs of data, it reached the limit and would not allow new Fragments to be generated.

I have corrected the issue on artdaq/bugfix/21908_CFG_ResetDataBufferDepthBytes.

#2 Updated by John Freeman over 1 year ago

  • % Done changed from 0 to 100
  • Status changed from Resolved to Reviewed

Using the pre-bugfix code, I recreated the problem:

-rw-r--r-- 1 jcfree mu2e 2531548772 Feb 19 10:22 artdaqdemo_r002254_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2504528862 Feb 19 10:23 artdaqdemo_r002255_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2540620711 Feb 19 10:25 artdaqdemo_r002256_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2618269 Feb 19 10:27 artdaqdemo_r002257_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2618269 Feb 19 10:29 artdaqdemo_r002258_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2618269 Feb 19 10:30 artdaqdemo_r002259_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2605633 Feb 19 10:32 artdaqdemo_r002260_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2605633 Feb 19 10:34 artdaqdemo_r002261_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2618269 Feb 19 10:35 artdaqdemo_r002262_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e    2605633 Feb 19 10:37 artdaqdemo_r002263_sr01_1_dl1.root

and then after merging the branch in, things look fine:
-rw-r--r-- 1 jcfree mu2e 2513540850 Feb 19 11:06 artdaqdemo_r002264_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2549573128 Feb 19 11:07 artdaqdemo_r002265_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2423453007 Feb 19 11:09 artdaqdemo_r002266_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2270328060 Feb 19 11:11 artdaqdemo_r002267_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2414440668 Feb 19 11:12 artdaqdemo_r002268_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2324384256 Feb 19 11:14 artdaqdemo_r002269_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2189268590 Feb 19 11:16 artdaqdemo_r002270_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2567581053 Feb 19 11:19 artdaqdemo_r002271_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2450472547 Feb 19 11:20 artdaqdemo_r002272_sr01_1_dl1.root
-rw-r--r-- 1 jcfree mu2e 2351404389 Feb 19 11:21 artdaqdemo_r002273_sr01_1_dl1.root

#3 Updated by Kurt Biery over 1 year ago

dumb question...

On line 842 of CFG (https://cdcvs.fnal.gov/redmine/projects/artdaq/repository/revisions/develop/entry/artdaq/Application/CommandableFragmentGenerator.cc#L842), there is a call to the clear() function in dataBuffer_, and there is a corresponding reset of dataBufferDepthBytes_ to zero, but there is not a resetting of dataBufferDepthFragments_ to zero. Is the latter not needed?
Thanks,
Kurt

#4 Updated by Eric Flumerfelt over 1 year ago

It could be added for code clarity, but dataBufferDepthFragments_ is set in getDataBufferStats(), which is always called before dataBufferDepthFragments_ is read. The reason the variable exists at all is that dataBuffer_.size() should only be called while dataBufferMutex_ is held. Making a function which locks the mutex and returns the size has problems when called from contexts where the mutex is already held.

#5 Updated by Kurt Biery over 1 year ago

OK, thanks. Maybe it is fine without that reset.

I'm not sure that I understand all of the subtleties, though.

It looks like checkDataBuffer accesses dataBufferDepthFragments before calling getDataBufferStats().

And, std::recursive_mutex can help with re-locking the same mutex in a single thread.

#6 Updated by Eric Flumerfelt over 1 year ago

  • Target version set to artdaq v3_04_00
  • Status changed from Reviewed to Closed


Also available in: Atom PDF