Bug #21114

DPMs are exiting unexpectedly

Added by Richard Neswold over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: High
Category: Data Pool Manager
Target version:
Start date: 10/11/2018
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Denise, Jim and Brian all reported that console applications are getting DPM_PEND errors and then, eventually, NO_DPM errors. I logged into each CLX machine that runs a DPM and found that none of the DPMs were running except the one on CLX5. (The DPM on CLX5 only accepts requests from MCR consoles, so the MCR wasn't seeing any disruption.) I restarted all the DPMs besides CLX5's. Not long afterward, they started terminating again.

One strength of the DPMs is that they use ACNET's multicast request feature for service discovery (and load balancing). We can add or remove DPMs and the clients will adjust to the changes automatically. The weakness, however, is that if a client triggers a bug that shuts down a DPM, the client fails over to the next DPM and triggers the same bug there, until it has cycled through all of them. This is why we're very cautious when releasing new code and also why we restrict who can use the MCR's DPM.
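To illustrate the failure mode, here is a minimal sketch of a client failing over through a pool of services. It is a toy model, not ACNET or DPM code: the `Dpm` class, the `send` function and the "poison" request are all hypothetical names made up for the example.

```python
class Dpm:
    """Hypothetical stand-in for a DPM process; not the real DPM API."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def handle(self, request):
        # Assumption: a "poison" request triggers a bug that kills the process.
        if request == "poison":
            self.running = False
            raise ConnectionError(f"{self.name} terminated")
        return f"{self.name} served {request}"


def send(pool, request):
    """Try each running DPM in turn, the way a client fails over."""
    for dpm in pool:
        if not dpm.running:
            continue
        try:
            return dpm.handle(request)
        except ConnectionError:
            continue  # fail over to the next DPM in the pool
    return None  # nothing left to answer, comparable to NO_DPM


pool = [Dpm("CLX18"), Dpm("CLX25")]
```

A normal request is answered by the first running DPM. A single "poison" request walks the whole pool and leaves nothing running, so every later request from any client fails as well.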

Denise checked to see if they were using so much memory that the OS killed them. She reported that the memory usage looked stable.

History

#1 Updated by Richard Neswold over 1 year ago

I shut down CLX25 and CLX18's DPMs and then restarted them on the command line so I could see the errors as they occurred.

A client is connecting and then canceling quickly. When DPM tries to return a list ID to the client, it uses part of the reply ID in the calculation. The code was written to assume the reply ID would remain valid. In this case, however, an exception is being thrown because the request was already canceled and the reply ID isn't valid. When enough of these unhandled exceptions happen, the DPM application terminates.

Commits b0caa5ed and 29746c9e check for this condition and simply ignore the request.
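The shape of that check can be sketched as follows. This is not the code from those commits: the `Replies` table, the `RequestCanceled` exception and the way the list ID is derived from the reply ID are assumptions made for the example.

```python
class RequestCanceled(Exception):
    """Raised when a reply ID no longer refers to a live request."""


class Replies:
    """Hypothetical table of active reply IDs; not the real DPM structure."""

    def __init__(self):
        self._active = set()

    def open(self, reply_id):
        self._active.add(reply_id)

    def cancel(self, reply_id):
        self._active.discard(reply_id)

    def list_id_for(self, reply_id):
        if reply_id not in self._active:
            raise RequestCanceled(reply_id)
        # Assumption: the list ID uses the low 16 bits of the reply ID.
        return reply_id & 0xFFFF


def send_list_id(replies, reply_id, log):
    """Return the list ID, or None if the client already canceled."""
    try:
        return replies.list_id_for(reply_id)
    except RequestCanceled:
        # Report the occurrence and ignore the request instead of
        # letting the exception propagate.
        log.append(f"request {reply_id:#x} canceled before list ID was sent")
        return None
```

The point of the sketch is only that the canceled case is handled at the call site and logged, so a burst of early cancels no longer produces unhandled exceptions.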

I've run these commits on CLX25 and CLX18 and they occasionally get a burst of these cancels. But the new code simply reports the occurrences.

#2 Updated by Richard Neswold over 1 year ago

I added code (2596319b) to report the address of the client that immediately cancels. I'm logged into the DPMs on both CLX25 and CLX18 and I'm seeing a steady stream of these coming from a client running on CLXSRV. Sometimes they're a few seconds apart and sometimes I see four per second. But they're all from CLXSRV.
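Tallying the reported addresses makes the source obvious at a glance. A minimal sketch, assuming each report can be reduced to a (timestamp, client node) pair; that record shape and the sample events are made up for illustration and are not the DPM log format or real log data.

```python
from collections import Counter


def tally_cancels(events):
    """Count immediate cancels per client node.

    `events` is an iterable of (timestamp_seconds, client_node) pairs.
    This record shape is an assumption, not the DPM log format.
    """
    return Counter(node for _, node in events)


# Illustrative events only: a burst of four within one second, then a gap.
events = [
    (0.00, "CLXSRV"),
    (0.25, "CLXSRV"),
    (0.50, "CLXSRV"),
    (0.75, "CLXSRV"),
    (4.10, "CLXSRV"),
]
counts = tally_cancels(events)
```

`counts.most_common(1)` then names the node responsible for most of the cancels.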

#3 Updated by Richard Neswold over 1 year ago

  • Status changed from Assigned to Closed

Wally is using a library that Beau wrote to look up a device index (DI) in his web app. Beau's routine invokes an ACL script to do it. A side effect of running anything that uses CLIB is that a connection is made to DPM (the assumption being that anyone using CLIB wants control system data). In this case, the ACL script returns the DI quickly and terminates, which caused the thrashing on the DPMs.

Beau is changing his library since he knows another way to get the DI without using ACL. I'm currently restarting all the DPMs with the new code that ignores clients that immediately cancel their requests.


