Feature #22226

Request to add DAQInterface book-keeping for private-network Routing Master multicast addresses

Added by Kurt Biery 8 months ago. Updated about 2 months ago.

Status: Reviewed
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 03/26/2019
Due date:
% Done: 100%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

As part of investigating the sending of multicast messages over private-network interfaces on SBN clusters (ICARUS and DAB) [described in Issue #21769], some additional RoutingMaster-related configuration parameters have been identified (and added to the code on the feature/MulticastMinorTweaks branch in the artdaq repo).

Something that I'm not sure about is whether the existing "RoutingMaster host" parameter in the DAQInterface boot.txt file is the right one to use for this bookkeeping, or whether a new parameter is needed.

It would be great to get together sometime (John, Eric, me, and possibly others) to talk about what the "RoutingMaster host" parameter is already used for (and can be also used for this bookkeeping) and for me to describe the new RoutingMaster multicast parameters.


Related issues

Related to artdaq - Support #23036: Observations from attempting to install and run the artdaq_demo v3_6_0 (New, 08/01/2019)

Related to artdaq - Support #21769: Notes on getting single-node, private-network multicasts to work on SBN DAQ computers (Closed, 01/25/2019)

Associated revisions

Revision 62bc3dba (diff)
Added by John Freeman 3 months ago

JCF: lightly-tested (i.e., don't use it yet) bookkeeping which prioritizes private network use

With this commit, DAQInterface will try to get routing to happen over
a private network if available. This is described in more detail in
the comment from Aug. 8, 5:29 PM in Issue #22226. I successfully
performed a run where boardreaders, eventbuilders and a routing_master
were used, all on the same subsystem
(sbnd-daq33.fnal.gov:/home/nfs/jcfree/run_records/25) using this
commit's code. More testing needs to be done, and in particular, from
the perspective of Issue #22226 I have yet to deal with the
multicast_interface_ip used for requests, or to support the convention
whereby a parameter is only bookkept if its value is set to
"BOOKKEPT_BY_DAQINTERFACE"

Revision b082ade6 (diff)
Added by John Freeman 3 months ago

JCF: as discussed with Kurt yesterday, add option "disable_private_network_bookkeeping" to switch off Issue #22226 bookkeeping

Revision 6273dc5a (diff)
Added by John Freeman 3 months ago

JCF: include non-DFO eventbuilders in parent subsystems when looking for RM group processes (see Issue #22226 for what an RM group is)

Revision 08225150 (diff)
Added by John Freeman about 2 months ago

JCF: Issue #22226: based on Kurt's studies from the last couple of days, if we're not using private network bookkeeping, then don't touch table_update_multicast_interface or multicast_interface_ip

Revision 13cb0958 (diff)
Added by John Freeman about 2 months ago

JCF: Issue #22226: have the get_private_networks() function pick out 10.x.y.z as well as 192.168.y.z

Revision 399c22b9 (diff)
Added by John Freeman about 2 months ago

JCF: Issue #22226: if private network bookkeeping is disabled, remove bookkeeping of a routing_master's routing_master_hostname parameter for reasons discussed today on the issue page

History

#1 Updated by John Freeman 4 months ago

Discussing this with Kurt, it seems like a good approach would be the following:
  • Have users continue to set the routing_master's host in the boot file as the public address (or, if desired, "localhost", which DAQInterface expands internally into the public address). This is what gets used for ssh calls to the routing_master node (e.g., to launch the routing_master process)
  • Come up with an algorithm DAQInterface could run to determine whether a private address is available on the routing_master node. If one is, use it when bookkeeping the "routing_master_hostname" parameter. Otherwise, fall back to the public hostname.
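
A minimal sketch of the fallback described in the second bullet, for illustration only. The function name and the idea of passing in an already-discovered list of private addresses are assumptions, not DAQInterface code; picking the first address in sorted order is an arbitrary tie-break.

```python
def choose_routing_master_hostname(public_hostname, private_addresses):
    """Return the value to bookkeep as routing_master_hostname.

    public_hostname   -- the host from the boot file (still used for ssh)
    private_addresses -- private-network addresses found on that host
                         (hypothetical input; discovery is not shown here)
    """
    if private_addresses:
        # Prefer a private address when one is available; the sorted()
        # tie-break is an assumption made for determinism.
        return sorted(private_addresses)[0]
    # Otherwise fall back to the public hostname.
    return public_hostname


print(choose_routing_master_hostname("sbnd-daq33.fnal.gov", ["192.168.230.33"]))
print(choose_routing_master_hostname("sbnd-daq33.fnal.gov", []))
```

The boot file entry itself is left untouched in this scheme, so ssh calls keep using the public address.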

#2 Updated by Kurt Biery 4 months ago

To help with the implementation of the model that John describes (or whatever one we come up with as a group), I will update the branches that I created as part of Issue #21769 and document some instructions for using them. (The point is that I recall that there are new FCL parameters that need to be book-kept, and it would be good to have them included in whatever testing is done.)

#3 Updated by Kurt Biery 4 months ago

  • Related to Support #23036: Observations from attempting to install and run the artdaq_demo v3_6_0 added

#4 Updated by Kurt Biery 4 months ago

OK, the good news is that the develop branch in artdaq already has the needed code changes for this.

In the artdaq-utilities-daqinterface repo, I've updated the feature/Issue21769_SBN_Multicast_Tests branch by merging in the develop branch and modifying the relevant values in the mediumsystem_with_routing_master sample config so that the system works on sbnd-daq34.

From what I've determined, there are seven parameters of interest:
  1. the routing master hostname in the boot.txt file
  2. the "routing_master_hostname" parameter that is used by the RoutingMaster itself
  3. the "routing_master_hostname" parameter that is used by the BRs
  4. the "routing_master_hostname" parameter that is used by the EBs
  5. the "table_update_multicast_interface" parameter that is used by the BRs
  6. the "multicast_interface_ip" parameter that is used by the BRs
  7. the "multicast_interface_ip" parameter that is used by the EBs

First, let's focus on multicasts. The sending of the routing table updates is done via multicast, and the 2nd and 5th parameters in the list above are the ones that are relevant for that. As you might imagine, the RM routing_master_hostname needs to be set to the private-network interface of the computer that is hosting the RM. And the table_update_multicast_interface parameter for each BR needs to be set to the appropriate private-network interface for each of the BRs.

Also, the sending of the DataRequests is done via multicast. The relevant parameters for this are the 6th and 7th ones in the list. The values of these parameters need to be the private-network interface addresses (or hostnames) of the computers on which each of the BRs or EBs is running.

The 3rd and 4th parameters in the list are not used in multicasts. The 3rd is used when the BRs send UDP broadcasts back to the RM to acknowledge routing table updates, and the 4th is used when the EBs send TCP messages to the RM to report their number of available tokens.

#5 Updated by Kurt Biery 4 months ago

If we say that we want all four types of messages,

  1. routing table update multicasts
  2. data request multicasts
  3. routing table update acknowledgement UDP messages
  4. event builder token update TCP messages

to be sent over the private-network interfaces of the computers in the DAQ cluster, then John's proposed scheme for handling the "routing_master_hostname" likely works.

That still leaves the other parameters, though (the 5th, 6th, and 7th ones). For specific experiments, the scheme that Eric described to David Rivera (using 192.168.x.0) seems like it should work. For the demo, though, it might be nice to have some help from DAQInterface book-keeping.

Would it be reasonable to request that DAQInterface book-keep table_update_multicast_interface and multicast_interface_ip, if they are set to "localhost"? It would convert them to the private-network interface (if available) of the appropriate computer.

#6 Updated by John Freeman 4 months ago

From a technical perspective, bookkeeping these variables would be quite easy, but concerning the value for the private network: if there are multiple networks on a node, what algorithm should I use to resolve this? E.g., if I type ifconfig on mu2edaq11, I get the following:

br-be37deb57424: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255
        ether 02:42:ed:85:de:3a  txqueuelen 0  (Ethernet)
        RX packets 3092407  bytes 308181092 (293.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 910 (910.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:83:c1:a3:22  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.157.11  netmask 255.255.255.128  broadcast 192.168.157.127
        inet6 fe80::ec4:7aff:fe79:acda  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:79:ac:da  txqueuelen 1000  (Ethernet)
        RX packets 30012868  bytes 16686572552 (15.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 23935764  bytes 9642539692 (8.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.226.9.25  netmask 255.255.255.0  broadcast 10.226.9.255
        inet6 fe80::ec4:7aff:fe79:acdb  prefixlen 64  scopeid 0x20<link>
        ether 0c:c4:7a:79:ac:db  txqueuelen 1000  (Ethernet)
        RX packets 3092407  bytes 308181092 (293.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11  bytes 910 (910.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 2558065  bytes 717760815 (684.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2558065  bytes 717760815 (684.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

#7 Updated by John Freeman 3 months ago

After discussions with Kurt, we've come up with detailed requirements for how the bookkeeping should work.

For a given run with a set of artdaq processes, we can think in terms of collections of processes which would all need to see the same private network. There are three separate types of such collections: boardreaders and eventbuilders mediated by a routing_master (henceforth called an RM collection), boardreaders and eventbuilders involved in requests (henceforth called an RQ collection) and processes mediated by a DFO (a DFO collection).

RM collection:
To take a simple example of an RM collection, say we have a single subsystem, consisting of boardreaders, eventbuilders, and a routing_master. If and only if all nodes spanned by the boardreaders and routing_master saw the same private network would we set table_update_multicast_interface (for the boardreaders) and routing_master_hostname (for the routing_master) to the private network address; otherwise we'd set them to the public network address.

Because routing_master_hostname in the case of eventbuilders and boardreaders isn't used for multicast, we can be more flexible: if we have a private network as described above and a given eventbuilder or boardreader can see that network, we can set routing_master_hostname to the private network address; otherwise we can set routing_master_hostname to the public address of the node the routing_master is on.

If we add a parent subsystem (or subsystems) to the mix, then the way things change is that we need to add the eventbuilders in the parent subsystem(s) to the boardreaders and routing_master when it comes to the set of processes spanning nodes which all share the same private network.
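
The "if and only if all nodes see the same private network" rule amounts to a set intersection over the hosts in the collection. The sketch below assumes each host's visible private networks have already been gathered into a dictionary; the function names, the host names, and the use of a three-octet prefix to identify a network are illustrative assumptions rather than DAQInterface's actual data structures.

```python
def common_private_network(networks_by_host):
    """Return a private network seen by every host, or None if there is none.

    networks_by_host maps hostname -> set of private network prefixes
    (e.g. "192.168.230") visible on that host.
    """
    if not networks_by_host:
        return None
    shared = set.intersection(*(set(n) for n in networks_by_host.values()))
    # If several networks are shared, pick one deterministically
    # (an arbitrary choice made for this sketch).
    return min(shared) if shared else None


def routing_master_hostname_for_client(client_networks, shared_network,
                                       rm_private_address, rm_public_hostname):
    """Non-multicast case: use the private address only if this
    boardreader or eventbuilder host can see the shared network."""
    if shared_network is not None and shared_network in client_networks:
        return rm_private_address
    return rm_public_hostname


rm_collection = {"hostA": {"192.168.230"}, "hostB": {"192.168.230", "10.226.9"}}
print(common_private_network(rm_collection))
```

With a parent subsystem in play, the eventbuilder hosts of the parent would simply be added to the dictionary before the intersection is taken.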

RQ collection:
In a given subsystem, if and only if the boardreaders which receive requests and the eventbuilders can all see the same private network do we set multicast_interface_ip to the private network address. Otherwise, we set it to the public network address. The good news is that we don't need to think outside of individual subsystems here; more good news is that this is all orthogonal to the RM collection logic.

DFO collection:
I'm largely going to punt here, besides pointing out that there's a lot of the same logic here as there would be for RM collections.

One thing to keep in mind is that, particularly for large systems, the ssh calls involved in figuring out whether all processes in a collection can see a private network could slow things down. This seems especially wasteful in the context of experiments which are stably running - why should DAQInterface rediscover what everyone already knows every time a config transition is sent? For this reason, Kurt and I agreed that there should be some convention by which we could tell DAQInterface whether or not we wanted a parameter bookkept. E.g., if DAQInterface saw something like

routing_master_hostname: "BOOKKEPT_BY_DAQINTERFACE" 

it would bookkeep the routing_master_hostname parameter, but if it saw anything else, e.g.,
routing_master_hostname: "192.168.230.33" 

it would take no action. In this way, we could take advantage of DAQInterface bookkeeping when setting up an experiment (or even just running artdaq-demo's quick-mrb-start.sh script), but once parameters were bookkept we could leave the overwrites in place so bookkeeping wouldn't take place each and every time a config transition was sent.
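
For illustration, the sentinel convention could be implemented along the following lines. This is a sketch under assumptions: the function name is hypothetical, and the FHiCL document is treated as plain text with the parameter written as `name: "value"`.

```python
import re

SENTINEL = "BOOKKEPT_BY_DAQINTERFACE"


def bookkeep_if_requested(fhicl_text, parameter, value):
    """Replace `parameter: "BOOKKEPT_BY_DAQINTERFACE"` with `value`.

    Any other setting of the parameter (e.g. an explicit address) is
    left untouched, so no bookkeeping is applied to it.
    """
    pattern = r'\b(%s\s*:\s*)"%s"' % (re.escape(parameter), SENTINEL)
    # A function is used as the replacement so that `value` is inserted
    # literally, without backslash or group-reference interpretation.
    return re.sub(pattern, lambda m: '%s"%s"' % (m.group(1), value), fhicl_text)


print(bookkeep_if_requested('routing_master_hostname: "BOOKKEPT_BY_DAQINTERFACE"',
                            "routing_master_hostname", "192.168.230.33"))
```

A caller could also check for the sentinel before doing any ssh-based network discovery, which is where the time savings for stably running systems would come from.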

#8 Updated by John Freeman 3 months ago

  • % Done changed from 0 to 80

With commit 2efdfa34ba55b40956e02e45286531d7c9028cc8 (the current head of the feature/Issue21769_SBN_Multicast_Tests branch), DAQInterface can bookkeep the following with the mediumsystem_with_routing_master configuration:

-component01, component02, and an eventbuilder on sbn-daq01-priv.fnal.gov
-component{03..10}, the other eventbuilders, a datalogger and a dispatcher on sbnd-daq33-priv.fnal.gov

Details are in sbnd-daq33:/home/nfs/jcfree/run_records/62 . The run was successful in the sense that we got no major warnings, and all the fragments you'd expect to see appeared in the output root file. Note that I modified the # of ADC counts per fragment in mediumsystem_with_routing_master from 500000 to 5000 as the larger value was causing timeouts on stop. To see exactly what happened to the parameters of interest, you can run

for token in routing_master_hostname multicast_interface_ip table_update_multicast_interface; do grep -H $token /home/nfs/jcfree/run_records/62/*.fcl ; done

#9 Updated by John Freeman 3 months ago

  • % Done changed from 80 to 90

Update since the last comment: now, with commit 6273dc5a6ed2b9cd4e46ef340e036636c0151430 at the HEAD of feature/Issue21769_SBN_Multicast_Tests, the following's been added:

  • You can set "disable_private_network_bookkeeping: true" in the $DAQINTERFACE_SETTINGS file to prevent DAQInterface from searching for, and bookkeeping with, the private networks it finds on the hosts of the processes intended for a run
  • Along with requiring that a routing_master and the boardreaders in its subsystem all see the same private network if that network is to be used in bookkeeping, DAQInterface will also require that any non-DFO eventbuilders in parent subsystems of that subsystem can see the network
  • When private network bookkeeping is disabled as described in the first bullet point, then DAQInterface bookkeeps multicast_interface_ip to "0.0.0.0" and table_update_multicast_interface to "localhost" in the FHiCL documents. These are, in fact, their artdaq v3_06_00 defaults.

#10 Updated by John Freeman 3 months ago

  • % Done changed from 90 to 100
  • Status changed from New to Resolved

I'm marking this issue as resolved. The current head of feature/Issue21769_SBN_Multicast_Tests is b6f97c7f0ccf0ea93cc980d1ff08f617cc04df92. The phase space of possible tests is clearly quite large, but some "Hello, world!" tests would be:

  • Make sure that DAQInterface properly takes advantage of a set of nodes which share a private network
  • Make sure that putting "disable_private_network_bookkeeping: true" in the $DAQINTERFACE_SETTINGS file will do what it suggests it will do

#11 Updated by Kurt Biery 2 months ago

  • Related to Support #21769: Notes on getting single-node, private-network multicasts to work on SBN DAQ computers added

#12 Updated by Kurt Biery 2 months ago

I've tested these changes on several computers using the following command (and the version of the mediumsystem_with_routing_master sample config that is on the feature/Issue21769_SBN_Multicast_Tests branch):
  • sh ./run_demo.sh --config mediumsystem_with_routing_master --bootfile `pwd`/artdaq-utilities-daqinterface/simple_test_config/mediumsystem_with_routing_master/boot.txt --comps component01 component02 component03 component04 component05 component06 component07 component08 component09 component10 --runduration 40 --partition 5 --no_om
  1. ICARUS vst01
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work, as expected.
  2. sbnd-daq33
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work, as expected.
  3. mu2edaq13
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work, as expected.
  4. mu2edaq01
    • with the new code, the artdaq_demo worked, as expected. With the new code and "disable_private_network_bookkeeping" set to "true", the artdaq_demo didn't work. I wasn't sure what to expect in that case, but this must mean that multicasts are disabled on the public network on mu2edaq01. However, when I disabled the slam-firewall, it still didn't work, and I'm not sure why not. The existing daqinterface code did work, even without disabling the slam-firewall.
  5. protodune np04-srv-015
    • running the artdaq_demo with the existing daqinterface code worked. The new daqinterface code (along with the older mediumsystem_with_routing_master config) didn't work with either disable_private_network_bookkeeping set to true or false.

I'm working on trying to understand why situations that work now don't work with the new code. It seems desirable to have situations that work now continue to work.

#13 Updated by Kurt Biery 2 months ago

Maybe I just need to specify "0.0.0.0" in the mediumsystem_with_routing_master config files, instead of "localhost". I'll check...

#14 Updated by John Freeman 2 months ago

Probably a good time to mention this: when working with feature/Issue21769_SBN_Multicast_Tests, if you define an environment variable DAQINTERFACE_DISABLE_BOOKKEEPING and set it to anything other than false, then bookkeeping won't happen. You can do this if you want to manually tweak FHiCL parameters which would otherwise be clobbered in bookkeeping, potentially useful for reviewing this issue. Personally, when I've done this, I've set DAQINTERFACE_FHICL_DIRECTORY to the run records base directory, and then given the configuration as the run number whose FHiCL documents I wish to use verbatim.
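
As a rough sketch of the rule "defined and set to anything other than false", something like the following would do. The function name is hypothetical, and the exact, case-sensitive comparison against the string "false" is an assumption; the branch's actual check may differ.

```python
import os


def bookkeeping_disabled(environ=None):
    """Return True if DAQINTERFACE_DISABLE_BOOKKEEPING asks for no bookkeeping."""
    environ = os.environ if environ is None else environ
    value = environ.get("DAQINTERFACE_DISABLE_BOOKKEEPING")
    # Undefined -> bookkeeping happens as usual.
    # Defined and not "false" -> bookkeeping is skipped.
    return value is not None and value != "false"


print(bookkeeping_disabled({"DAQINTERFACE_DISABLE_BOOKKEEPING": "true"}))
```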

One technique you can use is to create a subdirectory of $DAQINTERFACE_FHICL_DIRECTORY with a name like "mediumsystem_with_routing_master_no_bookkeeping_needed" and copy the FHiCL documents from the run record of a run that used the configuration "mediumsystem_with_routing_master" into that subdirectory; provided you use the same boot file and known boardreaders list as in the run in question (which, of course, are also saved in the run record), you'll exactly recreate the run, unless you decide to alter a parameter or two for your study.

#15 Updated by Kurt Biery about 2 months ago

I'm realizing that "localhost" never seems to be useful as a default multicast_interface address in our configurations. That gets translated to 127.0.0.1, and that doesn't seem to work on many hosts. As such, I've set all multicast_interface_ip and table_update_multicast_interface values to "0.0.0.0" in the mediumsystem_with_routing_master config on the feature/Issue21769_SBN_Multicast_Tests branch.

To help debug multicast issues, I've created two simple extra FCL files in the mediumsystem_with_routing_master sample config directory: receiveRequest.fcl and sendRequest.fcl. These are to be used with the requestReceiver and requestSender utility applications.

With the multicast_interface_ip parameters set to "localhost" in these two FCL files, the sender and receiver apps do not successfully transfer messages on mu2edaq01, whereas with "0.0.0.0", they do successfully transfer messages.

#16 Updated by Kurt Biery about 2 months ago

Getting back to the observed behavior where artdaq-demo systems don't seem to work when disable_private_network_bookkeeping is set to true...

There seem to be a couple of issues...
  1. it seems that the table_update_multicast_interface parameter is being set to a value of "localhost". I believe that it would be better set to "0.0.0.0".
  2. user-specified values for the multicast_interface parameters in the configuration files seem to get overwritten. Is the right way to have DAQInterface leave user-specified values alone to use disable_private_network_bookkeeping = false?

A separate issue that I've noticed is that with disable_private_network_bookkeeping = false, book-keeping doesn't seem to consider 10.x.y.z private networks. Can those be added?

#17 Updated by John Freeman about 2 months ago

Concerning the two issues, I'll cover the "is" and then the "ought":

  • The "is":

DAQInterface at the head of feature/Issue21769_SBN_Multicast_Tests will bookkeep multicast_interface_ip and table_update_multicast_interface both when private network bookkeeping is enabled (because that's essentially what private network bookkeeping entails) but then also when it's disabled. In the case of private network bookkeeping being disabled, in commit f4bc2a3efe46f1136d365149d8bba851c34849e7, I implemented logic whereby DAQInterface would set multicast_interface_ip and table_update_multicast_interface to the defaults artdaq v3_06_00 gives them when they're not mentioned at all in the FHiCL; these defaults are "0.0.0.0" for multicast_interface_ip and "localhost" for table_update_multicast_interface

Also, the separate issue: when searching for private networks, DAQInterface only looks for networks which begin with "192.168"

  • The "ought":

First, the uncontroversial:

  • When searching for private networks, DAQInterface should also look for networks which begin with "10" as well as those which begin with "192.168"
  • If we decide that DAQInterface should continue bookkeeping multicast_interface_ip and table_update_multicast_interface even with private network bookkeeping disabled, then table_update_multicast_interface should be set to "0.0.0.0" rather than "localhost"

Then, open for discussion:

  • Do we want any bookkeeping of table_update_multicast_interface and multicast_interface_ip at all if we've got private network bookkeeping disabled? My thinking when I implemented this was that in ordinary situations these parameters wouldn't need to be set to anything besides their defaults, but I may have been mistaken, in which case I can remove all bookkeeping of those parameters if private network bookkeeping's been disabled. If we don't remove bookkeeping in this scenario, then the only way to prevent DAQInterface from touching those parameters is if we set "DAQINTERFACE_DISABLE_BOOKKEEPING" (which is something of an extreme, developer-only action).

#18 Updated by Kurt Biery about 2 months ago

I've also been wondering if the right option is to not book-keep these parameters when private network bookkeeping is disabled. I'm just not sure (yet) whether that causes other unintended consequences.

One course of action would be to move in that direction and see if the validation tests on various platforms work. Another one would be to get together and think through various scenarios.

(I think that we would all agree that setting DAQINTERFACE_DISABLE_BOOKKEEPING to true is not a viable option since it is more for developer testing.)

#19 Updated by John Freeman about 2 months ago

Based on our discussion at today's artdaq meeting, I've modified things so that at the new head of feature/Issue21769_SBN_Multicast_Tests (13cb09580517d10019d286eb2c18c87047f56a37), the following applies:

  • If disable_private_network_bookkeeping is true, then the multicast_interface_ip and table_update_multicast_interface are left untouched.
  • Addresses listed by ifconfig of the form 10.x.y.z are also considered to be private network addresses, along with the already-included 192.168.y.z
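
To make the second bullet concrete, here is a small sketch of picking out 192.168.y.z and 10.x.y.z addresses from ifconfig output. It is not the branch's get_private_networks() function; the name find_private_addresses and the parsing approach are assumptions, and it expects the `inet <address>` layout shown in the mu2edaq11 listing earlier in this issue (older ifconfig versions print `inet addr:` instead).

```python
import re


def find_private_addresses(ifconfig_output):
    """Return IPv4 addresses of the form 192.168.y.z or 10.x.y.z."""
    # "inet6" lines are skipped because "inet" must be followed by whitespace.
    addresses = re.findall(r"inet\s+(\d+\.\d+\.\d+\.\d+)", ifconfig_output)
    return [a for a in addresses
            if a.startswith("192.168.") or a.startswith("10.")]


sample = """
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.157.11  netmask 255.255.255.128  broadcast 192.168.157.127
eno2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.226.9.25  netmask 255.255.255.0  broadcast 10.226.9.255
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
"""
print(find_private_addresses(sample))  # ['192.168.157.11', '10.226.9.25']
```

The 172.x bridge addresses (docker0 and the like) are deliberately not matched here, since only the two forms named above are treated as private networks.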

#20 Updated by Kurt Biery about 2 months ago

Initial tests of the latest code are looking good.

One wrinkle is that when
  • I set a specific subnet for multicast_interface_ip and table_update_multicast_interface (for example, 192.168.157.0 on mu2edaq01),
  • I've specified disable_private_network_bookkeeping: true in the DAQInterface settings file, and
  • I set the routing_master_hostname to a specific subnet (e.g. 192.168.157.0) in RoutingMaster1.fcl,

the routing_master_hostname parameter gets over-written and the new value doesn't match the multicast interface that the BoardReaders are using. (In my testing on mu2edaq01, the RM hostname in boot.txt is 'localhost' and the hostname that is used by the RoutingMaster code is 'mu2edaq01.fnal.gov'.)

How can I specify a RoutingMaster1.fcl/routing_master_hostname that doesn't get over-written? Is this another candidate for doing nothing when disable_private_network_bookkeeping is true?
Thanks,
Kurt

#21 Updated by John Freeman about 2 months ago

Using "traditional" (i.e., disable_private_network_bookkeeping == true) bookkeeping, DAQInterface sets the routing_master_hostname parameter automatically so that if users move a routing_master from one host to another in their boot file they don't need to worry about manually changing the value. I would argue that this is bookkeeping functionality we would want to retain. In the use case that the user really does want to override DAQInterface's bookkeeping, they can set the parameter explicitly for an override in the boot file.

#22 Updated by Kurt Biery about 2 months ago

If RoutingMaster1.fcl/routing_master_hostname were instead named RoutingMaster1.fcl/table_update_multicast_interface, would that change your point of view?

#23 Updated by Kurt Biery about 2 months ago

Also, the most likely specific value that a user would use for RoutingMaster1.fcl/table_update_multicast_interface (aka RoutingMaster1.fcl/routing_master_hostname) is x.y.z.0, which, by design, is transportable.

#24 Updated by John Freeman about 2 months ago

The proposal seems reasonable, in that it would achieve the goal of having private network bookkeeping be a toggle for whether the parameters involved in routing table updates were bookkept or not. There would, of course, be the issue of the routing_master FHiCL people used needing to be changed (routing_master_hostname -> table_update_multicast_interface) in order to take advantage of the change to artdaq, but perhaps the effort's worth it.

#25 Updated by Kurt Biery about 2 months ago

Good. We/I can look into changing the name.

In the meantime, I need a way to specify a value for RoutingMaster1.fcl/routing_master_hostname that is not book-kept. You mentioned something about specifying a value in the boot.txt file, but I'm not sure how I would do that. Could you provide a sample?
Thanks

#26 Updated by John Freeman about 2 months ago

Basically you just add a line to the boot file which is identical to the sort of line you'd have in a FHiCL document. E.g.:

routing_master_hostname: "1.2.3.4" 

...and what would happen is that at the very end of bookkeeping, DAQInterface would look for every FHiCL document where routing_master_hostname was a parameter and overwrite its value with "1.2.3.4". This technique is currently used, e.g., on ProtoDUNE.
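
In sketch form, the boot-file override behaves roughly as below. The function name, the dictionary of documents, and the document name component01.fcl are hypothetical; only the "overwrite the parameter wherever it appears" behavior is taken from the description above.

```python
import re


def apply_boot_file_override(fhicl_documents, parameter, value):
    """Overwrite `parameter` in every FHiCL document that sets it.

    fhicl_documents maps a document name to its text; a new dict is returned.
    """
    pattern = r"\b%s\s*:\s*\S+" % re.escape(parameter)
    replacement = '%s: "%s"' % (parameter, value)
    # A function replacement inserts the text literally.
    return {name: re.sub(pattern, lambda m: replacement, text)
            for name, text in fhicl_documents.items()}


docs = {
    "RoutingMaster1.fcl": 'routing_master_hostname: "mu2edaq01.fnal.gov"',
    "component01.fcl": 'routing_master_hostname: "localhost"\n'
                       'multicast_interface_ip: "0.0.0.0"',
}
for name, text in sorted(apply_boot_file_override(docs, "routing_master_hostname", "1.2.3.4").items()):
    print(name, "->", text.replace("\n", " | "))
```

Every document that mentions the parameter ends up with the same value, and other parameters are left alone.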

#27 Updated by Kurt Biery about 2 months ago

OK, but unfortunately, that's not what is needed. I only need to overwrite the one instance of routing_master_hostname (which, admittedly, could be better named in RoutingMaster1.fcl).

#28 Updated by John Freeman about 2 months ago

Ugh, right - routing_master_hostname isn't exactly the same thing in the boardreaders and eventbuilders, as you pointed out near the top of the issue, so it would damage things to assign the same value across the documents. In that case, with the current code on the branch I'm afraid you're stuck having to use DAQINTERFACE_DISABLE_BOOKKEEPING, as described in https://cdcvs.fnal.gov/redmine/projects/artdaq-utilities/wiki/Daqinterface_for_developers .

Now, if we want to say that a requirement for this Issue is that the general public should have to manually set routing_master_hostname for the routing_master in case private network bookkeeping is disabled, it of course would be quite simple to change the code to do this. We'd change this line:

        table_to_bookkeep = re.sub("routing_master_hostname\s*:\s*\S+",
                                      "routing_master_hostname: \"%s\"" % (router_process_hostnames[router_process_subsystem].strip("\"")),

to
if "RoutingMaster" not in self.procinfos[i_proc].name or not self.disable_private_network_bookkeeping:
        table_to_bookkeep = re.sub("routing_master_hostname\s*:\s*\S+",
                                      "routing_master_hostname: \"%s\"" % (router_process_hostnames[router_process_subsystem].strip("\"")),

#29 Updated by Kurt Biery about 2 months ago

That sounds excellent; that change would be much appreciated.

#30 Updated by John Freeman about 2 months ago

Change has been made on the feature branch via commit 399c22b9a129e3cb50ca3f1591ce6dd0e7853bf7

#31 Updated by Kurt Biery about 2 months ago

To test this, I'll try a couple of variations of settings on several different computers. The variations in settings that I have in mind are:
  1. default values of parameters in the mediumsystem_with_routing_master sample config and the default value of disable_private_network_bookkeeping (which is false)
  2. default values of parameters in the mediumsystem_with_routing_master sample config and a value for disable_private_network_bookkeeping of true
  3. special values of parameters in the mediumsystem_with_routing_master sample config and a value for disable_private_network_bookkeeping of true, where the special sample config values are multicast_interface ones like x.y.z.0

I'll use an artdaq_demo v3_06_01 software area, with an artdaq branch of feature/23362_ReportNetworkInterface and an artdaq-utilities-daqinterface branch of feature/Issue21769_SBN_Multicast_Tests.

#32 Updated by Kurt Biery about 2 months ago

[Useful command: tshow | grep -i successfully | egrep -i 'token|multicast|acknow' | grep <process label> | head | tdelta -ct 1]

mu2edaq13
  • all defaults - succeeded (run 9) [the test used address 10.226.9.27]
  • disable_private_network_bookkeeping: true - succeeded (run 10)
  • PrivNetBookkeeping off and special network interface values of 192.168.157.0 - succeeded (run 13) after an artdaq code change to translate the subnet to a specific address within RoutingMasterCore.
mu2edaq01
  • all defaults - succeeded (run 16) [the test used address 10.226.9.16]
  • disable_private_network_bookkeeping: true - succeeded (run 17)
  • PrivNetBookkeeping off and special network interface values of 131.225.80.0 - succeeded (run 21)
sbnd-daq33
  • all defaults - succeeded (run 1) [the test used address 192.168.230.33]
  • disable_private_network_bookkeeping: true - failed, as expected, since multicasts are not enabled on the public network
  • PrivNetBookkeeping off and special network interface values of 192.168.230.0 - succeeded (run 4)
icarus-vst01
  • all defaults - succeeded (run 2) [the test used address 192.168.184.103]
  • disable_private_network_bookkeeping: true - failed, as expected, since multicasts are not enabled on the public network
  • PrivNetBookkeeping off and special network interface values of 192.168.184.0 - succeeded (run 4)
np04-srv-015
  • all defaults - succeeded (run 1) [the test used address 10.73.136.35]
  • disable_private_network_bookkeeping: true - succeeded (run 2), since 0.0.0.0 translates to a valid private-network address
  • PrivNetBookkeeping off and special network interface values of 10.73.136.0 - succeeded (run 3)
np04-srv-001
  • PrivNetBookkeeping off and special network interface values of 10.73.138.0 - succeeded (run 4)
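
For readers wondering what "translate the subnet to a specific address" might look like, here is a Python sketch of the idea. It is not the artdaq RoutingMasterCore change itself; the function name, the list of local addresses, and the /24 prefix length are all assumptions made for illustration.

```python
import ipaddress


def resolve_interface(setting, local_addresses, prefix_length=24):
    """Translate a subnet-style setting such as "192.168.157.0" into the
    local address lying in that subnet.

    Returns the setting unchanged if no local address matches.
    The default prefix length of 24 is an assumption.
    """
    network = ipaddress.ip_network("%s/%d" % (setting, prefix_length), strict=False)
    for address in local_addresses:
        if ipaddress.ip_address(address) in network:
            return address
    return setting


print(resolve_interface("192.168.157.0", ["10.226.9.25", "192.168.157.11"]))
```

Because the x.y.z.0 form names a network rather than a host, the same configuration value can be reused on any node attached to that network.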

#33 Updated by Kurt Biery about 2 months ago

  • Status changed from Resolved to Reviewed

