Feature #22226

Request to add DAQInterface book-keeping for private-network Routing Master multicast addresses

Added by Kurt Biery 6 months ago. Updated 9 days ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 03/26/2019
Due date:
% Done: 100%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

As part of investigating the sending of multicast messages over private-network interfaces on SBN clusters (ICARUS and DAB) [described in Issue #21769], some additional RoutingMaster-related configuration parameters have been identified (and added to the code on the feature/MulticastMinorTweaks branch in the artdaq repo).

Something that I'm not sure about is whether the existing "RoutingMaster host" parameter in the DAQInterface boot.txt file is the right one to use for this bookkeeping, or whether a new parameter is needed.

It would be great to get together sometime (John, Eric, me, and possibly others) to talk about what the "RoutingMaster host" parameter is already used for (and whether it can also be used for this bookkeeping) and for me to describe the new RoutingMaster multicast parameters.


Related issues

Related to artdaq - Support #23036: Observations from attempting to install and run the artdaq_demo v3_6_0 (New, 08/01/2019)

Related to artdaq - Support #21769: Notes on getting single-node, private-network multicasts to work on SBN DAQ computers (Closed, 01/25/2019)

Associated revisions

Revision 62bc3dba (diff)
Added by John Freeman about 1 month ago

JCF: lightly-tested (i.e., don't use it yet) bookkeeping which prioritizes private network use

With this commit, DAQInterface will try to get routing to happen over
a private network if available. This is described in more detail in
the comment from Aug. 8, 5:29 PM in Issue #22226. I successfully
performed a run where boardreaders, eventbuilders and a routing_master
were used, all on the same subsystem
(sbnd-daq33.fnal.gov:/home/nfs/jcfree/run_records/25) using this
commit's code. More testing needs to be done, and in particular, from
the perspective of Issue #22226 I have yet to deal with the
multicast_interface_ip used for requests, or to support the convention
whereby a parameter is only bookkept if its value is set to
"BOOKKEPT_BY_DAQINTERFACE"

Revision b082ade6 (diff)
Added by John Freeman about 1 month ago

JCF: as discussed with Kurt yesterday, add option "disable_private_network_bookkeeping" to switch off Issue #22226 bookkeeping

Revision 6273dc5a (diff)
Added by John Freeman about 1 month ago

JCF: include non-DFO eventbuilders in parent subsystems when looking for RM group processes (see Issue #22226 for what an RM group is)

History

#1 Updated by John Freeman about 2 months ago

Discussing this with Kurt, it seems like a good approach would be the following:
  • Have users continue to set the routing_master's host in the boot file as the public address (or, if desired, "localhost", which DAQInterface expands internally into the public address). This is what gets used for ssh calls to the routing_master node (e.g., to launch the routing_master process)
  • Come up with an algorithm that DAQInterface could run to determine whether a private address is available on the routing_master node. If one is, use it when bookkeeping the "routing_master_hostname" parameter. Otherwise, fall back to the public hostname.
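
A minimal sketch of the fallback in the second bullet, assuming the node's IPv4 addresses have already been collected (for example, over ssh). The function name and its arguments are hypothetical and are not part of DAQInterface; "private" here means whatever Python's ipaddress module classifies as a private range.

```python
import ipaddress


def choose_routing_master_hostname(public_hostname, node_ipv4_addresses):
    """Return a private IPv4 address of the node if it has one,
    otherwise fall back to the public hostname.

    Hypothetical helper: if the node has several private addresses,
    the first one in the list wins, which may not be the right choice.
    """
    for addr in node_ipv4_addresses:
        ip = ipaddress.ip_address(addr)
        # Loopback and link-local addresses also count as "private"
        # for the ipaddress module, so exclude them explicitly.
        if ip.is_private and not ip.is_loopback and not ip.is_link_local:
            return addr
    return public_hostname
```

The public hostname from the boot file would still be the one used for ssh calls; only the bookkept "routing_master_hostname" value would change.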

#2 Updated by Kurt Biery about 2 months ago

To help with the implementation of the model that John describes (or whatever one we come up with as a group), I will update the branches that I created as part of Issue #21769 and document some instructions for using them. (The point is that I recall that there are new FCL parameters that need to be book-kept, and it would be good to have them included in whatever testing is done.)

#3 Updated by Kurt Biery about 2 months ago

  • Related to Support #23036: Observations from attempting to install and run the artdaq_demo v3_6_0 added

#4 Updated by Kurt Biery about 2 months ago

OK, the good news is that the develop branch in artdaq already has the needed code changes for this.

In the artdaq-utilities-daqinterface repo, I've updated the feature/Issue21769_SBN_Multicast_Tests branch by merging in the develop branch and modifying the relevant values in the mediumsystem_with_routing_master sample config so that the system works on sbnd-daq34.

From what I've determined, there are seven parameters of interest:
  1. the routing master hostname in the boot.txt file
  2. the "routing_master_hostname" parameter that is used by the RoutingMaster itself
  3. the "routing_master_hostname" parameter that is used by the BRs
  4. the "routing_master_hostname" parameter that is used by the EBs
  5. the "table_update_multicast_interface" parameter that is used by the BRs
  6. the "multicast_interface_ip" parameter that is used by the BRs
  7. the "multicast_interface_ip" parameter that is used by the EBs

First, let's focus on multicasts. The sending of the routing table updates is done via multicast, and the 2nd and 5th parameters in the list above are the ones that are relevant for that. As you might imagine, the RM routing_master_hostname needs to be set to the private-network interface of the computer that is hosting the RM. And the table_update_multicast_interface parameter for each BR needs to be set to the appropriate private-network interface for each of the BRs.

Also, the sending of the DataRequests is done via multicast. The relevant parameters for this are the 6th and 7th ones in the list. The values of these parameters need to be the private-network interface addresses (or hostnames) of the computers on which each of the BRs or EBs is running.

The 3rd and 4th parameters in the list are not used in multicasts. The 3rd is used when the BRs send UDP messages back to the RM to acknowledge routing table updates, and the 4th is used when the EBs send TCP messages to the RM to report their number of available tokens.

#5 Updated by Kurt Biery about 2 months ago

If we say that we want all four types of messages,

  1. routing table update multicasts
  2. data request multicasts
  3. routing table update acknowledgement UDP messages
  4. event builder token update TCP messages

to be sent over the private-network interfaces of the computers in the DAQ cluster, then John's proposed scheme for handling the "routing_master_hostname" likely works.

That still leaves the other parameters, though (the 5th, 6th, and 7th ones). For specific experiments, the scheme that Eric described to David Rivera (using 192.168.x.0) seems like it should work. For the demo, though, it might be nice to have some help from DAQInterface book-keeping.

Would it be reasonable to request that DAQInterface book-keep table_update_multicast_interface and multicast_interface_ip, if they are set to "localhost"? It would convert them to the private-network interface (if available) of the appropriate computer.

#6 Updated by John Freeman about 2 months ago

From a technical perspective, bookkeeping these variables would be quite easy, but concerning the value for the private network: if there are multiple networks on a node, what algorithm should I use to resolve this? E.g., if I type ifconfig on mu2edaq11, I get the following:

br-be37deb57424: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:ed:85:de:3a  txqueuelen 0  (Ethernet)
        RX packets 3092407  bytes 308181092 (293.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 910 (910.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:83:c1:a3:22  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.157.11  netmask 255.255.255.128  broadcast 192.168.157.127
        inet6 fe80::ec4:7aff:fe79:acda  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:79:ac:da  txqueuelen 1000  (Ethernet)
        RX packets 30012868  bytes 16686572552 (15.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 23935764  bytes 9642539692 (8.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.226.9.25  netmask 255.255.255.0  broadcast 10.226.9.255
        inet6 fe80::ec4:7aff:fe79:acdb  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:79:ac:db  txqueuelen 1000  (Ethernet)
        RX packets 3092407  bytes 308181092 (293.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 910 (910.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 2558065  bytes 717760815 (684.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2558065  bytes 717760815 (684.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
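
One possible first filtering step, sketched under assumptions: parse ifconfig output in the format shown above and keep only interfaces that are flagged RUNNING and carry a private, non-loopback IPv4 address. The function name is hypothetical, and the RUNNING filter is only a heuristic (it happens to drop docker0 and the br-* bridge in this listing). It does not settle the question on its own: on this node both eno1 and eno2 survive, so a tie-breaking rule is still needed.

```python
import ipaddress
import re


def private_ipv4_candidates(ifconfig_text):
    """Map interface name -> private IPv4 address for every RUNNING,
    non-loopback interface found in ifconfig-style output."""
    candidates = {}
    name, running = None, False
    for line in ifconfig_text.splitlines():
        # Interface header, e.g. "eno1: flags=4163<UP,...,RUNNING,...>  mtu 9000"
        header = re.match(r"^(\S+): flags=\d+<([^>]*)>", line)
        if header:
            name = header.group(1)
            running = "RUNNING" in header.group(2).split(",")
            continue
        # IPv4 address line; "inet6" lines do not match because of the space.
        inet = re.match(r"^\s+inet (\d+\.\d+\.\d+\.\d+)\s", line)
        if inet and name and running:
            ip = ipaddress.ip_address(inet.group(1))
            if ip.is_private and not ip.is_loopback:
                candidates[name] = str(ip)
    return candidates


# Abbreviated copy of the mu2edaq11 listing above.
SAMPLE = """\
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.157.11  netmask 255.255.255.128  broadcast 192.168.157.127
        inet6 fe80::ec4:7aff:fe79:acda  prefixlen 64  scopeid 0x20<link>
eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.226.9.25  netmask 255.255.255.0  broadcast 10.226.9.255
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
"""

print(private_ipv4_candidates(SAMPLE))
```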

#7 Updated by John Freeman about 2 months ago

After discussions with Kurt, we've come up with detailed requirements for how the bookkeeping should work.

For a given run with a set of artdaq processes, we can think in terms of collections of processes which would all need to see the same private network. There are three separate types of collection: boardreaders and eventbuilders mediated by a routing_master (henceforth called an RM collection), boardreaders and eventbuilders involved in requests (henceforth called an RQ collection), and processes mediated by a DFO (a DFO collection).

RM collection:
To take a simple example of an RM collection, say we have a single subsystem consisting of boardreaders, eventbuilders, and a routing_master. If and only if all nodes spanned by the boardreaders and routing_master see the same private network do we set table_update_multicast_interface (for the boardreaders) and routing_master_hostname (for the routing_master) to the private network address; otherwise we set them to the public network address.

Because routing_master_hostname in the case of eventbuilders and boardreaders isn't used for multicast, we can be more flexible. Given a private network as described above, each eventbuilder or boardreader that can see that network gets routing_master_hostname set to the routing_master's address on it; any that cannot gets the public address of the node the routing_master is on.

If we add a parent subsystem (or subsystems) to the mix, then the way things change is that we need to add the eventbuilders in the parent subsystem(s) to the boardreaders and routing_master when it comes to the set of processes spanning nodes which all share the same private network.

RQ collection:
In a given subsystem, if and only if the boardreaders which receive requests and the eventbuilders can all see the same private network do we set multicast_interface_ip to the private network address. Otherwise, we set it to the public network address. The good news is that we don't need to think outside of individual subsystems here; also good news is that this is all orthogonal to the RM collection logic.
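
The RM and RQ rules above both reduce to the same test: is there a private network that every node in the collection can see? A minimal sketch of that test follows, assuming each node's interfaces are already known as "address/prefix" strings. The function name and input layout are hypothetical, not DAQInterface's actual code, and matching subnet numbers is only a proxy for two nodes really being on the same physical network.

```python
import ipaddress


def shared_private_network(node_interfaces):
    """node_interfaces maps host -> list of 'address/prefix' strings.

    Return (network, {host: address on that network}) if some private
    network is visible on every host, otherwise None.
    """
    per_node = []
    for host, cidrs in node_interfaces.items():
        nets = {}
        for cidr in cidrs:
            iface = ipaddress.ip_interface(cidr)
            if iface.ip.is_private and not iface.ip.is_loopback:
                nets[iface.network] = str(iface.ip)
        per_node.append((host, nets))
    if not per_node:
        return None
    # Intersect the sets of private networks seen by each node.
    common = set(per_node[0][1])
    for _, nets in per_node[1:]:
        common &= set(nets)
    if not common:
        return None
    # If several networks are shared, pick one deterministically; which
    # one is "right" is a policy decision this sketch does not make.
    network = min(common, key=str)
    return network, {host: nets[network] for host, nets in per_node}
```

For an RM collection the input would hold the boardreader and routing_master nodes (plus parent-subsystem eventbuilder nodes, where applicable); for an RQ collection, the request-receiving boardreader nodes and eventbuilder nodes of one subsystem. A None result corresponds to falling back to the public address.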

DFO collection:
I'm largely going to punt here, besides pointing out that there's a lot of the same logic here as there would be for RM collections.

One thing to keep in mind is that, particularly for large systems, the ssh calls involved in figuring out whether all processes in a collection can see a private network could slow things down. This seems especially wasteful in the context of experiments which are stably running - why should DAQInterface rediscover what everyone already knows every time a config transition is sent? For this reason, Kurt and I agreed that there should be some convention by which we could tell DAQInterface whether or not we wanted a parameter bookkept. E.g., if DAQInterface saw something like

routing_master_hostname: "BOOKKEPT_BY_DAQINTERFACE" 

it would bookkeep the routing_master_hostname parameter, but if it saw anything else, e.g.,

routing_master_hostname: "192.168.230.33"

it would take no action. In this way, we could take advantage of DAQInterface bookkeeping when setting up an experiment (or even just running artdaq-demo's quick-mrb-start.sh script), but once parameters were bookkept we could leave the overwrites in place so bookkeeping wouldn't take place each and every time a config transition was sent.
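
The convention can be sketched as a text substitution that fires only on the sentinel value. This is an illustration under assumptions: the function name is hypothetical, and treating the FHiCL document as plain text matched by a regular expression is a simplification, not a description of how DAQInterface actually parses FHiCL.

```python
import re

SENTINEL = "BOOKKEPT_BY_DAQINTERFACE"


def bookkeep_if_requested(fhicl_text, parameter, value):
    """Replace `parameter: "BOOKKEPT_BY_DAQINTERFACE"` with
    `parameter: "<value>"`; leave any other setting untouched."""
    pattern = re.compile(
        r'(\b%s\s*:\s*)"%s"' % (re.escape(parameter), re.escape(SENTINEL))
    )
    # A function replacement avoids backslash/group surprises in `value`.
    return pattern.sub(lambda m: '%s"%s"' % (m.group(1), value), fhicl_text)


requested = 'routing_master_hostname: "BOOKKEPT_BY_DAQINTERFACE"'
explicit = 'routing_master_hostname: "192.168.230.33"'

print(bookkeep_if_requested(requested, "routing_master_hostname", "192.168.230.40"))
print(bookkeep_if_requested(explicit, "routing_master_hostname", "192.168.230.40"))
```

The first call rewrites the value; the second returns its input unchanged, which is the "take no action" case.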

#8 Updated by John Freeman about 1 month ago

  • % Done changed from 0 to 80

With commit 2efdfa34ba55b40956e02e45286531d7c9028cc8 (the current head of the feature/Issue21769_SBN_Multicast_Tests branch), DAQInterface can bookkeep the following with the mediumsystem_with_routing_master configuration:

-component01, component02, and an eventbuilder on sbn-daq01-priv.fnal.gov
-component{03..10}, the other eventbuilders, a datalogger and a dispatcher on sbnd-daq33-priv.fnal.gov

Details are in sbnd-daq33:/home/nfs/jcfree/run_records/62 . The run was successful in the sense that we got no major warnings, and all the fragments you'd expect to see appeared in the output root file. Note that I modified the # of ADC counts per fragment in mediumsystem_with_routing_master from 500000 to 5000 as the larger value was causing timeouts on stop. To see exactly what happened to the parameters of interest, you can run

for token in routing_master_hostname multicast_interface_ip table_update_multicast_interface; do grep -H $token /home/nfs/jcfree/run_records/62/*.fcl ; done

#9 Updated by John Freeman about 1 month ago

  • % Done changed from 80 to 90

Update since the last comment: now, with commit 6273dc5a6ed2b9cd4e46ef340e036636c0151430 at the HEAD of feature/Issue21769_SBN_Multicast_Tests, the following's been added:

  • You can set "disable_private_network_bookkeeping: true" in the $DAQINTERFACE_SETTINGS file to prevent DAQInterface from searching for, and bookkeeping with, the private networks it finds on the hosts of the processes intended for a run
  • Along with requiring that a routing_master and the boardreaders in its subsystem all see the same private network if it's to use that private network in bookkeeping, DAQInterface will also require that any non-DFO eventbuilders in parent subsystems of the subsystem see the network
  • When private network bookkeeping is disabled as described in the first bullet point, then DAQInterface bookkeeps multicast_interface_ip to "0.0.0.0" and table_update_multicast_interface to "localhost" in the FHiCL documents. These are, in fact, their artdaq v3_06_00 defaults.

#10 Updated by John Freeman about 1 month ago

  • % Done changed from 90 to 100
  • Status changed from New to Resolved

I'm marking this issue as resolved. The current head of feature/Issue21769_SBN_Multicast_Tests is b6f97c7f0ccf0ea93cc980d1ff08f617cc04df92. The phase space of possible tests is clearly quite large, but some "Hello, world!" tests would be:

  • Make sure that DAQInterface properly takes advantage of a set of nodes which share a private network
  • Make sure that putting "disable_private_network_bookkeeping: true" in the $DAQINTERFACE_SETTINGS file will do what it suggests it will do

#11 Updated by Kurt Biery 12 days ago

  • Related to Support #21769: Notes on getting single-node, private-network multicasts to work on SBN DAQ computers added

#12 Updated by Kurt Biery 10 days ago

I've tested these changes on several computers using the following command (and the version of the mediumsystem_with_routing_master sample config that is on the feature/Issue21769_SBN_Multicast_Tests branch):
  • sh ./run_demo.sh --config mediumsystem_with_routing_master --bootfile `pwd`/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component02 component03 component04 component05 component06 component07 component08 component09 component10 --runduration 40 --partition 5 --no_om
  1. ICARUS vst01
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work, as expected.
  2. sbnd-daq33
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work, as expected.
  3. mu2edaq13
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work, as expected.
  4. mu2edaq01
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work. I wasn't sure what to expect in that case, but this must mean that multicasts are disabled on the public network on mu2edaq01. However, when I disabled the slam-firewall, it still didn't work, and I'm not sure why not. The existing daqinterface code did work, even without disabling the slam-firewall.
  5. protodune np04-srv-015
    • running the artdaq_demo with the existing daqinterface code worked. The new daqinterface code (along with the older mediumsystem_with_routing_master config) didn't work with disable_private_network_bookkeeping set to either true or false.

I'm working on trying to understand why situations that work now don't work with the new code. It seems desirable to have situations that work now continue to work.

#13 Updated by Kurt Biery 10 days ago

Maybe I just need to specify "0.0.0.0" in the "mediumsystem_with_routing_master" config files, instead of "localhost". I'll check...

#14 Updated by John Freeman 9 days ago

Probably a good time to mention this: when working with feature/Issue21769_SBN_Multicast_Tests, if you define an environment variable DAQINTERFACE_DISABLE_BOOKKEEPING and set it to anything other than false, then bookkeeping won't happen. You can do this if you want to manually tweak FHiCL parameters which would otherwise be clobbered in bookkeeping, potentially useful for reviewing this issue. Personally, when I've done this, I've set DAQINTERFACE_FHICL_DIRECTORY to the run records base directory, and then given the configuration as the run number whose FHiCL documents I wish to use verbatim.

One technique you can use is to create a subdirectory of $DAQINTERFACE_FHICL_DIRECTORY with a name like "mediumsystem_with_routing_master_no_bookkeeping_needed" and copy the FHiCL documents from the run record of a run that used the configuration "mediumsystem_with_routing_master" into that subdirectory. Provided you use the same boot file and known boardreaders list as in the run in question (which, of course, are also saved in the run record), you'll exactly recreate the run, unless you decide to alter a parameter or two for your study.
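
To make the "anything other than false" rule for DAQINTERFACE_DISABLE_BOOKKEEPING concrete, here is a minimal sketch of such a check. The function name is hypothetical, and the exact comparison (a case-sensitive match against the string "false") is an assumption rather than something confirmed from the DAQInterface source.

```python
import os


def bookkeeping_disabled(environ=os.environ):
    """True if DAQINTERFACE_DISABLE_BOOKKEEPING is defined and set to
    anything other than the literal string "false" (assumed comparison)."""
    value = environ.get("DAQINTERFACE_DISABLE_BOOKKEEPING")
    return value is not None and value != "false"
```

Under this reading, a value such as "0" or an empty string would still disable bookkeeping, so the safest way to re-enable it is to unset the variable or set it to "false".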


