Project

General

Profile

Bug #13703

Inconsistent behavior with D105

Added by Richard Neswold over 3 years ago. Updated over 1 year ago.

Status:
Assigned
Priority:
Normal
Category:
Data Pool Manager
Target version:
Start date:
08/29/2016
Due date:
% Done:

0%

Estimated time:
Duration:

Description

Bill Marsh reported that he's getting inconsistent results on D105 (Beam Budget Monitor) with the new DPM. With two instances running on CLX4, he sees one work perfectly fine and the other getting 100% DPM_PEND errors.


Related issues

Related to ACNET Service - Bug #14368: acnetd is frequently reporting QUEUE_FULL on idle clientsClosed11/02/2016

Related to DPM - Support #14521: W1 is having problems with DPMClosed11/16/2016

Related to DPM - Bug #15629: PA4187 (R52, I52) DPM_PEND using new DPMClosed02/22/2017

Related to ACNET Service - Bug #19829: Found a bug in replier clean-up.Closed04/27/2018

Related to ACNET Service - Bug #20222: Assigned DPM that has an IssueClosed06/25/2018

History

#1 Updated by Richard Neswold about 3 years ago

We weren't able to reproduce the problem. Today, however, Bill said it was happening again. His two instances were using CLX22 as their DPM. After poking around a bit, I discovered there were several one-shot requests that weren't getting data (some one-shots were started on Sept 20th -- two weeks ago!) It appears there may be a race condition when merging one-shots.

#2 Updated by Richard Neswold about 3 years ago

Bill reported it's happened agin this morning. This time, each instance of D105 is using a different DPM. The working instance is on CLX20's and the non-working one is on CLX37's. This is interesting because I was going under the theory that the second instance on the same DPM would cause the problem.

#3 Updated by Richard Neswold about 3 years ago

Charlie and I discovered that acnetd has been reporting ACNET_QUEUEFULL errors on some nodes. This error means that an ACNET service (in this case DPMD) isn't responding to requests fast enough. Looking at the DPM code, I embarrassingly realized all DPM traffic is going through one ACNET handle! All service discovery messages, DPM list management, DPM returned data, and all front-end communications!

I released a new version of DPM this morning with two important changes:

  1. DPMs now open two, client-only handles, FE_PER and FE_ONE, which are used for front-end periodic and one-shot communications, respectively. Now the only incoming messages to the DPMD handle are requests, so the new DPMs should be more responsive to requests.
  1. Some of our CLX machines seem to have a nonfunctioning load average computation (i.e. when top shows 100% activity, the load average shows 0.04, or something way too low.) We were using the load average to delay responses to the service discovery message. Since some nodes were using load averages that were always low, they were taking all the DPM traffic. Today's version computes a random delay between 0 and 100ms.

The unresponsiveness due to these two behaviors may have played a part in this issue. D105 does a series of one-shots, which are fairly request-intensive. The load average bug resulted in only one or two nodes handling all the Control System's requests. Using only one ACNET handle made those nodes run out of resources.

We will run with this configuration and see if D105 still has problems.

#4 Updated by Richard Neswold about 3 years ago

  • Status changed from Assigned to Feedback

Last update should have changed this to 'Feedback' state.

#5 Updated by Richard Neswold about 3 years ago

  • Related to Bug #14368: acnetd is frequently reporting QUEUE_FULL on idle clients added

#6 Updated by Richard Neswold about 3 years ago

We feel #14368 is what's causing this issue. We're in the process of restarting the new DPMs using the fixed acnetd. We'll need to run for several weeks before we're truly confident it's solved.

#7 Updated by Richard Neswold about 3 years ago

#8 Updated by Richard Neswold about 3 years ago

DPM takes several precautions to make sure clients receive the data they requested. One delicate condition is when a client restarts a list. If the client reuses a ref ID, we don't want to return device data that was previously associated with the ref ID. We correctly filter that data so the client sees a consistent stream of data.

Unfortunately, this strategy doesn't work well with one-shot data.

D105 spawns an SA that wakes up once a minute and does a flurry of one-shot reads. We see that the SA can ask for a one-shot before the previous result is delivered. In this case, the previous, in-flight one-shot may get filtered.

I committed a change where one-shots don't get filtered if lists are restarted. This should fix the problem for any CLIB application. For non-CLIB clients that restart their list before they received their one-shot data, they may get stale data. Fortunately, for non-CLIB clients, this is an unusual and incorrect use-case, so it shouldn't be a problem.

Charlie ran BBM for 24 hours and saw a few pending errors. He then ran it another 24 hours after my change and didn't see any pending errors.

Right now, this new code is running on CLX18, CLX19, and FGATE. On Monday, I'll restart the others.

commit: 8324371b

#9 Updated by Richard Neswold about 3 years ago

DPMs have been restarted on CLX5, CLX20, CLX21, CLX22, CLX23, and CLX25. All DPMs are running the new code. I'll close this issue if D105 runs cleanly for a week.

#10 Updated by Richard Neswold over 2 years ago

Well, it survived for 3 months before the problem re-occurred. I'm taking that as a good sign that we've found the problem, corrected most of it, and yet there's a tiny race condition that we haven't yet fixed. If we had finer-grained categories in Redmine, I'd move this issue to CLIB's interface to new DPM. Charlie and I will have to instrument CLIB to see why it's getting stuck.

#11 Updated by Richard Neswold over 2 years ago

Brian has an ACL script that reads a device once every two minutes. The reading is done as a one-shot request. After a few days, his script gets "stuck" in that it no longer receives data from the device -- even when he makes a new one-shot request. He found that, if the reading takes too long to arrive and he cancels the one-shot, he can start another one-shot on the device and it returns data.

This seems like DPM isn't getting the one-shot from the front-end. If another one-shot is requested, DPM sees that it matches the current, pending one-shot read which results in two pending reads. If the one-shot is removed, DPM cleans up the request properly and new requests for the device will work.

We could verify this using the following steps:

  • Wait for the script to get "stuck" again (the script shouldn't cancel the request during this test.)
  • Start another task/script that makes a one-shot request to the device. The task needs to use the same DPM as the stuck script so the request will get merged.
  • At this point, the new task shouldn't be receiving data either.
  • Restart the original, stuck script. It shouldn't receive data.
  • Restart both, the data should start flowing.

#12 Updated by Richard Neswold over 2 years ago

I'm running test code on FGATE. In this version, one-shot reads will timeout in a short time and the error will propagate back to the client. Previously we were waiting for acnetd to time out the request, which could take up to 6 minutes. In that time, more one shots could be requested and they would get merged with the pending request.

I wrote a simple client that mimics Brian's client to see if I can get it to happen.

#13 Updated by Richard Neswold about 2 years ago

  • Category set to Data Pool Manager
  • Target version set to DPM v1.6

Stretched the timeout to 1.5 seconds (commit 4f4888b2). Some front-ends are notoriously slow and the 1/4 second timeout was too short for them. This code is running on FGATE so, if you want to test your app against this change, set your new DPM node to FGATE:

$ lnmctl NEWDPM_NODE=FGATE

#14 Updated by Richard Neswold about 2 years ago

This fix: 945cf54f

Fixed a bug when a request timed out, it wasn't restarted (so DPM_PENDs were reported forever.)

These fixes: 50673006 and 60705a39

Fixed a bug when requests were being removed from a combined request.

Please test your application against FGATE by entering the following on your console and then running your app (the title bar of the window should show FGATE as the name of the DPM being used.)

lnmctl NEWDPM_NODE=FGATE

Let me know if this seems to fix your problem.

#15 Updated by Richard Neswold about 2 years ago

  • Related to Bug #15629: PA4187 (R52, I52) DPM_PEND using new DPM added

#16 Updated by Richard Neswold about 2 years ago

  • Status changed from Feedback to Closed

All the modifications we did while trying to resolve this issue were worthwhile. The last changes, however, probably fixed the problem. The code will be available to DPM clients when I release v1.6 of DPM.

#17 Updated by Richard Neswold about 2 years ago

  • Status changed from Closed to Assigned
  • Target version changed from DPM v1.6 to DPM v1.7

It's still happening with this application. The difference between this issue and the other DPM_PEND issues is that this application starts up fine but, after a while, it'll go to DPM_PEND and never recover.

#18 Updated by Richard Neswold almost 2 years ago

Bill reported an instance where the problems occurred and, after restarting his application, it occurred again 8 hours later. They then restarted acnetd and things seemed much better. Maybe there's an issue in acnetd or in the CLIB interface rather than new DPM. We're widening the search.

#19 Updated by Richard Neswold over 1 year ago

  • Target version changed from DPM v1.7 to DPM v1.9

Since v1.7 has been replaced with v1.8, we'll push this issue to v1.9.

#20 Updated by Richard Neswold over 1 year ago

We're optimistic that this isn't a DPM problem at all but is an acnetd problem: #19829. Next week we'll roll out a new version of acnetd and see if this problem persists.

#21 Updated by Richard Neswold over 1 year ago

  • Related to Bug #19829: Found a bug in replier clean-up. added

#22 Updated by Richard Neswold over 1 year ago

We need to get one of Bill's applications to run with the new acnetd. Unfortunately it took weeks, to months, for this symptom to show up. So the sooner we get this set-up, the sooner we can see if the problem is fixed.

#23 Updated by Richard Neswold over 1 year ago

  • Related to Bug #20222: Assigned DPM that has an Issue added


Also available in: Atom PDF