Inconsistent behavior with D105
Bill Marsh reported that he's getting inconsistent results on
D105 (Beam Budget Monitor) with the new DPM. With two instances running on
CLX4, he sees one work perfectly fine and the other getting 100%
#1 Updated by Richard Neswold over 4 years ago
We weren't able to reproduce the problem. Today, however, Bill said it was happening again. His two instances were using
CLX22 as their DPM. After poking around a bit, I discovered there were several one-shot requests that weren't getting data (some one-shots were started on Sept 20th -- two weeks ago!) It appears there may be a race condition when merging one-shots.
#2 Updated by Richard Neswold over 4 years ago
Bill reported it's happened agin this morning. This time, each instance of
D105 is using a different DPM. The working instance is on
CLX20's and the non-working one is on
CLX37's. This is interesting because I was going under the theory that the second instance on the same DPM would cause the problem.
#3 Updated by Richard Neswold over 4 years ago
Charlie and I discovered that
acnetd has been reporting
ACNET_QUEUEFULL errors on some nodes. This error means that an ACNET service (in this case
DPMD) isn't responding to requests fast enough. Looking at the DPM code, I embarrassingly realized all DPM traffic is going through one ACNET handle! All service discovery messages, DPM list management, DPM returned data, and all front-end communications!
I released a new version of DPM this morning with two important changes:
- DPMs now open two, client-only handles,
FE_ONE, which are used for front-end periodic and one-shot communications, respectively. Now the only incoming messages to the
DPMDhandle are requests, so the new DPMs should be more responsive to requests.
- Some of our CLX machines seem to have a nonfunctioning load average computation (i.e. when
topshows 100% activity, the load average shows 0.04, or something way too low.) We were using the load average to delay responses to the service discovery message. Since some nodes were using load averages that were always low, they were taking all the DPM traffic. Today's version computes a random delay between 0 and 100ms.
The unresponsiveness due to these two behaviors may have played a part in this issue.
D105 does a series of one-shots, which are fairly request-intensive. The load average bug resulted in only one or two nodes handling all the Control System's requests. Using only one ACNET handle made those nodes run out of resources.
We will run with this configuration and see if
D105 still has problems.
#8 Updated by Richard Neswold about 4 years ago
DPM takes several precautions to make sure clients receive the data they requested. One delicate condition is when a client restarts a list. If the client reuses a ref ID, we don't want to return device data that was previously associated with the ref ID. We correctly filter that data so the client sees a consistent stream of data.
Unfortunately, this strategy doesn't work well with one-shot data.
D105 spawns an
SA that wakes up once a minute and does a flurry of one-shot reads. We see that the
SA can ask for a one-shot before the previous result is delivered. In this case, the previous, in-flight one-shot may get filtered.
I committed a change where one-shots don't get filtered if lists are restarted. This should fix the problem for any CLIB application. For non-CLIB clients that restart their list before they received their one-shot data, they may get stale data. Fortunately, for non-CLIB clients, this is an unusual and incorrect use-case, so it shouldn't be a problem.
Charlie ran BBM for 24 hours and saw a few pending errors. He then ran it another 24 hours after my change and didn't see any pending errors.
Right now, this new code is running on
FGATE. On Monday, I'll restart the others.
#10 Updated by Richard Neswold almost 4 years ago
Well, it survived for 3 months before the problem re-occurred. I'm taking that as a good sign that we've found the problem, corrected most of it, and yet there's a tiny race condition that we haven't yet fixed. If we had finer-grained categories in Redmine, I'd move this issue to CLIB's interface to new DPM. Charlie and I will have to instrument CLIB to see why it's getting stuck.
#11 Updated by Richard Neswold over 3 years ago
Brian has an ACL script that reads a device once every two minutes. The reading is done as a one-shot request. After a few days, his script gets "stuck" in that it no longer receives data from the device -- even when he makes a new one-shot request. He found that, if the reading takes too long to arrive and he cancels the one-shot, he can start another one-shot on the device and it returns data.
This seems like DPM isn't getting the one-shot from the front-end. If another one-shot is requested, DPM sees that it matches the current, pending one-shot read which results in two pending reads. If the one-shot is removed, DPM cleans up the request properly and new requests for the device will work.
We could verify this using the following steps:
- Wait for the script to get "stuck" again (the script shouldn't cancel the request during this test.)
- Start another task/script that makes a one-shot request to the device. The task needs to use the same DPM as the stuck script so the request will get merged.
- At this point, the new task shouldn't be receiving data either.
- Restart the original, stuck script. It shouldn't receive data.
- Restart both, the data should start flowing.
#12 Updated by Richard Neswold over 3 years ago
I'm running test code on
FGATE. In this version, one-shot reads will timeout in a short time and the error will propagate back to the client. Previously we were waiting for
acnetd to time out the request, which could take up to 6 minutes. In that time, more one shots could be requested and they would get merged with the pending request.
I wrote a simple client that mimics Brian's client to see if I can get it to happen.
#13 Updated by Richard Neswold about 3 years ago
- Category set to Data Pool Manager
- Target version set to DPM v1.6
Stretched the timeout to 1.5 seconds (commit 4f4888b2). Some front-ends are notoriously slow and the 1/4 second timeout was too short for them. This code is running on
FGATE so, if you want to test your app against this change, set your new DPM node to
$ lnmctl NEWDPM_NODE=FGATE
#14 Updated by Richard Neswold about 3 years ago
This fix: 945cf54f
Fixed a bug when a request timed out, it wasn't restarted (so DPM_PENDs were reported forever.)
Fixed a bug when requests were being removed from a combined request.
Please test your application against
FGATE by entering the following on your console and then running your app (the title bar of the window should show
FGATE as the name of the DPM being used.)
Let me know if this seems to fix your problem.
#17 Updated by Richard Neswold about 3 years ago
- Status changed from Closed to Assigned
- Target version changed from DPM v1.6 to DPM v1.7
It's still happening with this application. The difference between this issue and the other
DPM_PEND issues is that this application starts up fine but, after a while, it'll go to
DPM_PEND and never recover.
#18 Updated by Richard Neswold almost 3 years ago
Bill reported an instance where the problems occurred and, after restarting his application, it occurred again 8 hours later. They then restarted
acnetd and things seemed much better. Maybe there's an issue in
acnetd or in the
CLIB interface rather than new DPM. We're widening the search.