Feature #23404

If there's an XML-RPC communication issue between nodes, DAQInterface should make it more obvious that that's the problem

Added by John Freeman about 1 month ago. Updated 24 days ago.

Status: Reviewed
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 10/09/2019
Due date:
% Done: 100%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

This issue is based on Ron's experience using DAQInterface at Icarus earlier this week. He tried running a boardreader on a different host than the one DAQInterface was running on, while there was a problem with the XML-RPC connection between the two hosts. DAQInterface's output when it encounters this failure mode could be clearer. In a nutshell, here's what appeared:

Tue Oct  8 07:54:25 CDT 2019: BOOT transition complete

Exception caught in DAQInterface attempt to query status of artdaq process
pmtx01 at icarus-vst02:14100; most likely reason is process no longer
exists

Tue Oct  8 07:54:28 CDT 2019: CONFIG transition underway

That is, DAQInterface asked for the boardreader's status on the other host, and an exception was thrown and then swallowed. This message typically appears because a boardreader crashed, but it can also appear if there's a connectivity issue.

Only later, when DAQInterface tried sending the init transition to the boardreader, did the following exception, thrown within the external "socket" module, cause DAQInterface to return to the "Stopped" state:

error: [Errno 113] No route to host

A couple of observations:

  • It would have been ideal if DAQInterface hadn't even begun the config transition after it saw that it couldn't communicate with the boardreader. It swallowed the first exception and continued so that it can support systems where certain artdaq processes are allowed to die (see #22061 for more info on this). I think I can get DAQInterface to be smart enough to know when it is and isn't appropriate to quit after a failed query.
  • The "No route to host" exception, while not terribly worded, isn't as enlightening as it could be. I could get DAQInterface to look for that exception and then print out an error along the lines of: 'Attempt to communicate with process X on node Y caused a "No route to host" exception; this is likely due to a network issue preventing XML-RPC from working between this node and node Y'
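As a rough illustration of the second observation, here is a minimal Python 3 sketch of how a low-level socket error could be mapped to a more descriptive message. The function name describe_xmlrpc_failure and its arguments are hypothetical and are not taken from daqinterface.py; the only assumption about the real failure is that it surfaces as an OSError carrying an errno, as "[Errno 113] No route to host" does on Linux.

```python
import errno


def describe_xmlrpc_failure(exc, process_label, host):
    """Return a friendlier explanation for a failed status query.

    Hypothetical helper: 'process_label' and 'host' identify the artdaq
    process being contacted; 'exc' is the exception that was caught.
    """
    code = getattr(exc, "errno", None)

    if code == errno.EHOSTUNREACH:
        # "No route to host": the remote node cannot be reached at all.
        return (
            'Attempt to communicate with process %s on node %s caused a '
            '"No route to host" exception; this is likely due to a network '
            'issue preventing XML-RPC from working between this node and '
            'node %s' % (process_label, host, host)
        )

    if code == errno.ECONNREFUSED:
        # "Connection refused": the node answered, but nothing is listening.
        return (
            'Attempt to communicate with process %s on node %s caused a '
            '"Connection refused" exception; most likely reason is that the '
            'process no longer exists' % (process_label, host)
        )

    # Anything else: report the raw exception rather than guess at a cause.
    return 'Attempt to communicate with process %s on node %s failed: %r' % (
        process_label, host, exc)
```

A caller would wrap the XML-RPC request in a try/except block and print the returned string in place of the bare socket error. Comparing against the errno constants rather than the number 113 avoids hard-coding a platform-specific value.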

Associated revisions

Revision 601c99d5 (diff)
Added by John Freeman about 1 month ago

JCF: Issue #23404: engineer errors which reasonably recreate the error conditions described in this issue

Revision 573c7d34 (diff)
Added by John Freeman about 1 month ago

JCF: Issue #23404: come up with a better description when an exception is thrown during an attempt to query the status of an artdaq process

Revision c7164606 (diff)
Added by John Freeman about 1 month ago

JCF: Issue #23404: if self.exception has been set, then the next call to the runner() function should result in a recover

Revision 096f0793 (diff)
Added by John Freeman about 1 month ago

JCF: Issue #23404: changes to daqinterface.py to make it clearer what's happening when you can't communicate via XML-RPC with a process

Revision a75761a9 (diff)
Added by Ron Rechenmacher 25 days ago

Issue #23404 - remove unused check_proc_exceptions_number_of_status_failures

History

#1 Updated by John Freeman about 1 month ago

  • % Done changed from 0 to 100
  • Status changed from New to Resolved

Issue resolved. With commit 096f0793f00aeee451323614089eef474a1f7ccc at the head of feature/23404_describe_xmlrpc_problem, the following is now the case:

  • If an exception is thrown when querying the status of an artdaq process, either because the process has died or because you can't communicate with it via XML-RPC, DAQInterface alerts you to the likely cause.
  • If any exception is thrown when querying an artdaq process's status, any subsequent commands in the queue - e.g., a config request - get discarded, and DAQInterface puts itself back into the Stopped state via a recover. Note that it remains the case that a "Connection refused" exception - caused by process death - needs to occur twice, so that DAQInterface has a chance to remove a process from the existing processes list in the event that the user wants a run to be robust against a given process's death (again, see #22061 for more on this use case).
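One possible reading of the rule in the second bullet, sketched in Python 3. This is not the code from the commit: the class StatusFailurePolicy and its methods are hypothetical names, and the sketch assumes a "Connection refused" failure arrives as an OSError whose errno is ECONNREFUSED.

```python
import errno


class StatusFailurePolicy:
    """Hypothetical bookkeeping for failed status queries.

    A "Connection refused" failure must be seen twice for the same process
    before a recover is requested; any other failure requests one at once.
    """

    def __init__(self, refused_limit=2):
        self.refused_limit = refused_limit
        self.refused_counts = {}  # process label -> refusals seen so far

    def should_recover(self, process_label, exc):
        """Return True if this failure should trigger a recover."""
        if getattr(exc, "errno", None) == errno.ECONNREFUSED:
            count = self.refused_counts.get(process_label, 0) + 1
            self.refused_counts[process_label] = count
            # The first refusal is tolerated, leaving time to drop the dead
            # process from the list if the run may continue without it.
            return count >= self.refused_limit
        # Any other exception, e.g. "No route to host", recovers immediately.
        return True

    def forget(self, process_label):
        """Stop tracking a process that was removed from the process list."""
        self.refused_counts.pop(process_label, None)
```

With a policy like this, the first refused query for a given process returns False and the second returns True, while an unreachable host returns True on the first failure.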

#2 Updated by Ron Rechenmacher 24 days ago

  • Status changed from Resolved to Reviewed

On the mu2edaq cluster, ran kdestroy and ssh-agent -k, and specified a remote host for component01...
Got a timeout message (asking for password).
All worked when ssh-agent was running.
Merged feature/23404_describe_xmlrpc_problem into develop.


