DPMs are exiting unexpectedly
Denise, Jim, and Brian all reported that console applications are getting DPM_PEND errors and then, eventually, NO_DPM errors. I logged into each CLX machine that runs a DPM and found that all of them except CLX5's were not running. (The DPM on CLX5 only accepts requests from MCR consoles.) I restarted all the DPMs besides CLX5's. Not long afterward, they started terminating again.
One strength of the DPMs is that they use ACNET's multicast request feature for service discovery (and load balancing). We can add or remove DPMs and the clients will adjust to the changes automatically. The weakness, however, is that if a client triggers a bug that shuts down a DPM, the client will cycle through all of them and trigger the same bug in each. This is why we're very cautious when releasing new code and also why we restrict who can use the MCR's DPM.
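The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Dpm` class, `discover()`, and `send_with_retry()` are hypothetical stand-ins, not the real ACNET or DPM code, and the "poison" request simply represents whatever client behavior triggers the bug.

```python
class Dpm:
    """Hypothetical stand-in for one DPM process."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def handle(self, request):
        if not self.running:
            raise ConnectionError(self.name)
        if request == "poison":
            # The request triggers a bug and this DPM goes down.
            self.running = False
            raise ConnectionError(self.name)
        return f"{self.name}: ok"


def discover(pool):
    """Stand-in for multicast discovery: only running DPMs answer."""
    return [dpm for dpm in pool if dpm.running]


def send_with_retry(pool, request):
    """Send a request, moving to the next DPM whenever one fails."""
    while True:
        candidates = discover(pool)
        if not candidates:
            return None  # assumed to correspond to the NO_DPM condition
        try:
            return candidates[0].handle(request)
        except ConnectionError:
            continue  # rediscover and try whichever DPMs remain


pool = [Dpm("dpm-a"), Dpm("dpm-b"), Dpm("dpm-c")]
print(send_with_retry(pool, "read"))    # a normal request is served
print(send_with_retry(pool, "poison"))  # the bad request walks the whole pool
print([dpm.running for dpm in pool])
```

The same automatic failover that makes adding and removing DPMs painless is what lets a single misbehaving client reach every instance.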
Denise checked to see if they were using so much memory that the OS killed them. She reported that the memory usage looked stable.
#1 Updated by Richard Neswold about 2 years ago
I shut down CLX18's DPMs and then restarted them on the command line so I could see the errors as they occurred.
A client is connecting and then canceling quickly. When DPM tries to return a list ID to the client, it uses part of the reply ID in the calculation. The code was written to assume the reply ID would remain valid. In this case, however, an exception is being thrown because the request was already canceled and the reply ID isn't valid. When enough of these unhandled exceptions happen, the DPM application terminates.
I've run these commits on CLX18, and the DPMs there occasionally get a burst of these cancels, but the new code simply reports the occurrences.
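The race described in this update, and the "report instead of fail" behavior of the new code, can be sketched roughly as follows. This is not the DPM source: `RequestTable`, `StaleReplyId`, `send_list_id()`, and the 16-bit mask used to derive the list ID are all assumptions made for illustration.

```python
import logging

log = logging.getLogger("dpm-sketch")


class StaleReplyId(Exception):
    """Raised when a reply ID is used after its request was canceled."""


class RequestTable:
    """Hypothetical bookkeeping of the requests that are still active."""

    def __init__(self):
        self._active = set()

    def open(self, reply_id):
        self._active.add(reply_id)

    def cancel(self, reply_id):
        self._active.discard(reply_id)

    def list_id_for(self, reply_id):
        # The list ID is derived from part of the reply ID, so the reply
        # ID must still be valid here. The mask is an assumed detail.
        if reply_id not in self._active:
            raise StaleReplyId(reply_id)
        return reply_id & 0xFFFF


def send_list_id(table, reply_id, client):
    """Return the list ID, or None if the client already canceled."""
    try:
        return table.list_id_for(reply_id)
    except StaleReplyId:
        # Report the occurrence instead of letting the exception escape.
        log.warning("client %s canceled before its list ID was returned", client)
        return None


table = RequestTable()
table.open(0x12345)
print(send_list_id(table, 0x12345, "client-1"))  # request still active
table.cancel(0x12345)
print(send_list_id(table, 0x12345, "client-1"))  # canceled first: logged, None
```

Handling the stale ID at the point of use keeps one client's quick cancel from affecting other requests served by the same process.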
#2 Updated by Richard Neswold about 2 years ago
I added code (2596319b) to report the address of the client that immediately cancels. I'm logged into the DPMs on both CLX18 and I'm seeing a steady stream of these coming from a client running on CLXSRV. Sometimes they're a few seconds apart and sometimes I see four per second. But they're all from
#3 Updated by Richard Neswold about 2 years ago
- Status changed from Assigned to Closed
Wally is using a library that Beau wrote to look up a device index (DI) in his web app. Beau's routine invokes an ACL script to do it. A side effect of running anything that uses CLIB is that a connection is made to DPM (the assumption being that anyone using CLIB wants control system data). In this case, the ACL script returns the DI quickly and terminates, which caused the thrashing on the DPMs.
Beau is changing his library since he knows another way to get the DI without using ACL. I'm currently restarting all the DPMs with the new code that ignores clients that immediately cancel their requests.