Feature #22267

Should we gracefully support a RoundRobin routing policy config in which the missing EB count is larger than the number of EBs in the system?

Added by Kurt Biery over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: Known Issues
Target version:
Start date: 04/02/2019
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

In the testing of the RoundRobin routing policy, prior to rolling it out on protoDUNE, I noticed that a large negative value for the minimum number of participants can prevent data flow through the system.

Actually, I wasn't trying to specify a large negative number for this value. I was only trying to specify a reasonable number of missing EBs (for example, 2 or 3 allowed to be missing out of a pool of 10-16) while also ensuring that the system configuration would still work if someone accidentally used only a single EB. For a minimum_participants value of -2, a system with only one EB did not work.

Looking at the code in RoundRobin_policy.cc, I found this behavior surprising. I saw that there is a test to ensure that the minimum number of participants is at least one. Given that, I expected that the system would still work when the number of allowed missing EBs is larger than the number of EBs in the system.

I traced this to the automatic declaration of the "minimum" variable in the code. The compiler makes this an unsigned int, which ruins its usefulness when minimum_participants is less than (-1.0 * numberOfEBs). I'll attach some TRACE messages to this issue and commit a candidate fix to a branch.
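The type deduction described above can be sketched as follows. This is a minimal illustration, not the code from RoundRobin_policy.cc: the function names, the parameter names, and the assumption that the receiver count is a std::size_t are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical stand-ins for the policy's computation; the names are
// illustrative and are not copied from RoundRobin_policy.cc.

// What 1 + (-2) becomes once the sum is carried out in std::size_t.
constexpr std::size_t kWrappedMinimum = std::numeric_limits<std::size_t>::max();

// 'auto' version: std::size_t + int is evaluated as std::size_t, so a
// negative sum wraps around to a huge positive value.
std::size_t minimum_with_auto(int minimum_participants, std::size_t receiver_count)
{
    if (minimum_participants > 0) return static_cast<std::size_t>(minimum_participants);
    auto minimum = receiver_count + minimum_participants;  // deduced as std::size_t
    if (minimum < 1) minimum = 1;  // only fires when the sum is exactly zero
    return minimum;
}

// 'int' version: the sum stays signed, so the "at least one" check works.
int minimum_with_int(int minimum_participants, std::size_t receiver_count)
{
    assert(receiver_count <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    int minimum = minimum_participants > 0
                      ? minimum_participants
                      : static_cast<int>(receiver_count) + minimum_participants;
    if (minimum < 1) minimum = 1;
    return minimum;
}
```

Under these assumptions, minimum_with_auto(-2, 1) returns kWrappedMinimum while minimum_with_int(-2, 1) returns 1. A value of -1 with one EB sums to exactly zero, which is why only values below -numberOfEBs trigger the problem.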

History

#1 Updated by Kurt Biery over 1 year ago

To test this, I started with a v3_04_01 artdaq_demo system.

I ran the following command:
  • sh ./run_demo.sh --config mediumsystem_with_routing_master --bootfile `pwd`/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component02 component03 component04 component05 component06 component07 component08 component09 component10 --runduration 40 --partition=4 --no_om

With the default mediumsystem_with_routing_master configuration, I saw data successfully written to disk.

I then modified the mediumsystem_with_routing_master configuration to add a fourth EB and switch to the RoundRobin routing policy:

diff --git a/simple_test_config/mediumsystem_with_routing_master/RoutingMaster1.fcl b/simple_test_config/mediumsystem_with_routing_master/R
index 77778eb..b8e9b4f 100644
--- a/simple_test_config/mediumsystem_with_routing_master/RoutingMaster1.fcl
+++ b/simple_test_config/mediumsystem_with_routing_master/RoutingMaster1.fcl
@@ -1,7 +1,8 @@
   daq: {
   policy: {
-         policy: "NoOp" 
+         policy: "RoundRobin" 
          receiver_ranks: [2,3]
+          minimum_participants: -6
   }

   sender_ranks: [0,1]
diff --git a/simple_test_config/mediumsystem_with_routing_master/boot.txt b/simple_test_config/mediumsystem_with_routing_master/boot.txt
index 004f9b8..2201902 100644
--- a/simple_test_config/mediumsystem_with_routing_master/boot.txt
+++ b/simple_test_config/mediumsystem_with_routing_master/boot.txt
@@ -19,6 +19,10 @@ EventBuilder host: localhost
 EventBuilder port: 5237
 EventBuilder label: EventBuilder3

+EventBuilder host: localhost
+EventBuilder port: 5238
+EventBuilder label: EventBuilder4
+
 DataLogger host: localhost
 DataLogger port: 5265
 DataLogger label: DataLogger1

With this configuration, there were no events written to disk when using the "run_demo" command listed above.

There are lots of ways to see the problem in the TRACE log, but a simple one is to look at the end-of-run summary. In run 3 (using the RoundRobin policy with the large negative minimum_participants), shown below, we see that there were 0 table updates. In run 2, with the NoOp policy, there were 44 table updates.

[biery@mu2edaq01 341Demo]$ tshowt | grep Routing | grep Stopping
   668 04-02 09:57:20.671682 17967  5982  19        RoutingMaster1_RoutingMasterCore nfo . Stopping run 3 after 0 table updates. and 80 received tokens.
  9591 04-02 09:51:53.560014  8407 32154  23        RoutingMaster1_RoutingMasterCore nfo . Stopping run 2 after 44 table updates. and 499 received tokens.
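To illustrate why a wrapped minimum leads to zero table updates even though tokens keep arriving, here is a simplified model of a round-robin table build. It is a sketch only: build_table, its parameters, and the loop structure are assumptions made for illustration and are not the artdaq implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <vector>

// Simplified, hypothetical model of a round-robin table build.
// 'tokens' holds one entry per free buffer, identified by receiver rank.
// Returns the routing table as an ordered list of receiver ranks.
std::vector<int> build_table(const std::vector<int>& tokens, std::size_t minimum)
{
    std::map<int, int> available;  // receiver rank -> number of tokens held
    for (int rank : tokens) ++available[rank];

    std::vector<int> table;
    // Hand out one slot per receiver per pass, but only while at least
    // 'minimum' receivers still have a token to offer.
    while (!available.empty() && available.size() >= minimum) {
        for (auto it = available.begin(); it != available.end();) {
            assert(it->second > 0);  // every map entry holds at least one token
            table.push_back(it->first);
            if (--(it->second) == 0) {
                it = available.erase(it);
            } else {
                ++it;
            }
        }
    }
    return table;  // leftover tokens are simply dropped in this toy model
}
```

In this model, build_table({2, 2, 3}, 1) yields the ranks 2, 3, 2, whereas passing std::numeric_limits<std::size_t>::max() as the minimum yields an empty table on every call, no matter how many tokens were received.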

#2 Updated by Eric Flumerfelt over 1 year ago

This should also be reproducible in artdaq/test/Application/Routing/RoundRobin_policy_t.cc, with a specialized test-case.

#3 Updated by Kurt Biery over 1 year ago

I have committed code changes to

  • artdaq/artdaq/Application/Routing/RoundRobin_policy.cc

and

  • artdaq/test/Application/Routing/RoundRobin_policy_t.cc

to fix the issue described here, and to implement a unit test to verify the expected behavior, as suggested by Eric.

In addition to changing the declared type of the "minimum" variable in RoundRobin_policy.cc from 'auto' to 'int', I changed the type of the "endCondition" variable from 'auto' to 'bool'. (Yes, I was feeling a little paranoid and wanted to avoid unexpected behavior later.)

I also added the app_name to the TRACE_NAME for RoundRobin_policy.

#4 Updated by Eric Flumerfelt over 1 year ago

  • Status changed from New to Resolved

#5 Updated by Eric Flumerfelt over 1 year ago

  • Assignee set to Kurt Biery
  • Status changed from Resolved to Reviewed
  • Category set to Known Issues
  • Co-Assignees Eric Flumerfelt added

I added a one-line change to ensure that the minimum variable is also clamped at the upper end, when the user specifies an absolute minimum number of participants.

One possibility for a to-do is to move the minimum variable to a class member and calculate it in the constructor, with TLVL_WARNING messages when it has to be clamped one way or the other. Adding those messages now would result in them being printed for every routing update, which would be quite a flood.
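A rough sketch of that to-do, assuming the receiver count is known and fixed at construction time. The class name MinimumParticipants and its interface are hypothetical, and std::cerr stands in for a TLVL_WARNING TRACE message so that the example stays self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>

// Hypothetical helper: computes the effective minimum once, at construction,
// so that any clamping warning is printed a single time rather than on
// every routing update.
class MinimumParticipants {
public:
    MinimumParticipants(int requested, std::size_t receiver_count)
    {
        assert(receiver_count >= 1);
        const int count = static_cast<int>(receiver_count);
        // Positive values are an absolute minimum; zero or negative values
        // give the number of receivers that are allowed to be missing.
        const int raw = requested > 0 ? requested : count + requested;
        value_ = std::min(std::max(raw, 1), count);  // clamp to [1, count]
        clamped_ = (value_ != raw);
        if (clamped_) {
            // Stand-in for a TLVL_WARNING message.
            std::cerr << "minimum_participants=" << requested << " with "
                      << count << " receivers clamped to " << value_ << '\n';
        }
    }

    int value() const { return value_; }
    bool was_clamped() const { return clamped_; }

private:
    int value_ = 1;
    bool clamped_ = false;
};
```

With this sketch, MinimumParticipants(-2, 1) and MinimumParticipants(20, 4) are clamped to 1 and 4 respectively, each with one warning, while MinimumParticipants(-2, 10) yields 8 silently.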

Verified code by code review and testing with/without change to observe undesirable behavior. Because I've made further changes, I won't merge into develop until someone else takes a quick look.

#6 Updated by Eric Flumerfelt over 1 year ago

  • Target version set to artdaq v3_06_00
  • Status changed from Reviewed to Closed
