Support #14521

W1 is having problems with DPM

Added by Richard Neswold over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Category: -
Target version: -
Start date: 11/16/2016
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Ming-Jen reported that W1 is having problems with old and new DPM.

He first reported he was getting a LARGE LIST error on new DPM. The LARGE LIST status is only a warning that you're asking for more than 1,000 devices. It still allows the request to go through, however. We've modified new DPM to not emit this warning until 2,000+ devices are requested.
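
To illustrate the warning-only behavior described above, here is a minimal sketch. It is not DPM's actual code: the function name, the return shape, and the choice of `>=` for "2,000+" are assumptions made for the example.

```python
# Hypothetical threshold, taken from the "2,000+ devices" wording above.
LARGE_LIST_THRESHOLD = 2000


def check_list_size(device_count, threshold=LARGE_LIST_THRESHOLD):
    """Return (accepted, warning) for a request of device_count devices.

    The request is always accepted; a large list only adds a warning.
    """
    warning = "LARGE LIST" if device_count >= threshold else None
    return True, warning
```

Under these assumptions, a 1,200-device request is accepted with no warning, while a 2,000-device request is still accepted but carries the LARGE LIST warning.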

He also reported that on new DPM, he's getting DPM_PEND and, occasionally, ACNET_NOREMMEM errors. On old DPM, he's getting DPM_PEND errors.


Related issues

Related to DPM - Bug #13703: Inconsistent behavior with D105 (Assigned, 08/29/2016)

History

#1 Updated by Richard Neswold over 3 years ago

I think the DPM_PEND status indicates an issue with the front-end, since both old and new are reporting it.

On the other hand, the ACNET_NOREMMEM status is generated by acnetd when it's out of reply IDs. Charlie says when CLIB builds the device list, it doesn't wait for the previous reply before sending the next request; there is a burst of requests. We'll have to see if W2 slows down significantly if we serialize the requests that build the list.

Ming-Jen also mentioned his application has a mix of periodic requests along with one-shots. It's the one-shots that are getting the ACNET_NOREMMEM status, which makes sense if we're running out of reply IDs.

So here are three proposed solutions:

  1. Make CLIB serialize its requests.
  2. Increase the number of reply and request IDs in acnetd (requires a re-compile and re-installation)
  3. Add more DPMs (we have four running now and Ming-Jen's app is using 25% of the reply IDs)
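
The difference between a burst and serialized requests (option 1) can be sketched with a toy model of a fixed pool of reply IDs. This is only an illustration: `ReplyIdPool`, `send_burst`, `send_serialized`, and the pool size used below are hypothetical and do not reflect acnetd's real data structures or limits.

```python
class ReplyIdPool:
    """Toy stand-in for a fixed table of reply IDs (not acnetd's real code)."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        """Take one ID; return False when the table is exhausted."""
        if self.in_use >= self.size:
            return False
        self.in_use += 1
        return True

    def release(self):
        """Give one ID back once its reply has been received."""
        if self.in_use > 0:
            self.in_use -= 1


def send_burst(pool, count):
    """Issue every request before any reply returns; count the rejections."""
    return sum(0 if pool.acquire() else 1 for _ in range(count))


def send_serialized(pool, count):
    """Wait for each reply (freeing its ID) before issuing the next request."""
    rejected = 0
    for _ in range(count):
        if pool.acquire():
            pool.release()
        else:
            rejected += 1
    return rejected
```

With an assumed pool of 256 IDs, `send_burst(ReplyIdPool(256), 1200)` rejects 944 requests, while `send_serialized(ReplyIdPool(256), 1200)` rejects none. The cost of serializing is latency, which this model does not capture.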

#2 Updated by Richard Neswold over 3 years ago

Ming-Jen is going to convert his one-shot requests into repetitive requests. We'll see if this improves the situation.

#3 Updated by Richard Neswold over 3 years ago

Worked with Ming-Jen this morning. We think there's an issue, still, with one-shot requests. W1's repetitive requests are working fine. This application, however, sends 1,200+ one-shots when a clock event fires. All the errors are associated with the one-shot requests.

Bill Marsh mentioned that his BBM application started having problems with one-shots today. I've added him to this issue.

#4 Updated by Richard Neswold over 3 years ago

We see that the 1,000 one-shot requests are being sent to front-ends individually instead of grouped together! This regression happened recently, probably when we separated one-shots from repetitive requests in CLIB. Now I'm trying to determine which layer of DPM isn't grouping the requests together, because the fe_worker layer is seeing them one at a time.

#5 Updated by Richard Neswold over 3 years ago

  • Status changed from Assigned to Feedback

We think we found and fixed (bc0b2ed4) the problem.

Although no client interface supports it yet, DPM has always supported a "fixed" data acquisition list. A fixed list can't be modified and restarted with other clients' requests. We added it because we thought some clients may want to have an uninterruptible stream of data. When Charlie and I split off one-shot requests (a month or two ago), we categorized the one-shots as "fixed" because once the one-shot is made, there's no sense in restarting it for another one-shot; just send the other one-shot.

The bug was that none of the fixed lists were getting merged at all, so every one-shot request to a front-end was creating a new request consisting of one device. In Ming-Jen's case, his 1,016 devices were creating 1,016 requests! MOOC front-ends can only handle ~200 active requests. Some busy front-ends, like MI2 and MI3, already have many requests and couldn't handle the hundreds of new requests from Ming-Jen's application, hence the ACNET_NOREMMEM status.

This commit fixes the problem so that, if a fixed list hasn't yet been sent, we can add more device requests to it. This greatly reduces the number of temporarily active requests on a front-end.
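
The effect of merging can be sketched as follows. This is a simplified illustration, not the code in the commit: `build_requests`, its `merge_fixed` flag, and the front-end and device names in the usage note are hypothetical.

```python
from collections import defaultdict


def build_requests(devices, merge_fixed=True):
    """Group (front_end, device) pairs into per-front-end requests.

    With merge_fixed=False, every device becomes its own request,
    which mimics the bug. With merge_fixed=True, devices bound for
    the same front-end share one not-yet-sent list.
    """
    if not merge_fixed:
        return [(front_end, [device]) for front_end, device in devices]

    pending = defaultdict(list)
    for front_end, device in devices:
        pending[front_end].append(device)
    return list(pending.items())
```

For example, 1,016 devices spread evenly over four made-up front-ends produce 1,016 requests with `merge_fixed=False` but only four with merging enabled, each carrying 254 devices.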

I will release new DPMs on Monday.

#6 Updated by Richard Neswold over 3 years ago

  • Related to Bug #13703: Inconsistent behavior with D105 added

#7 Updated by Richard Neswold over 3 years ago

All DPMs have been restarted with this new code. Feedback from Ming-Jen and Bill Marsh will determine the next step.

#8 Updated by Richard Neswold over 3 years ago

Ming-Jen has confirmed the patch fixes his problem in W1.

#9 Updated by Richard Neswold over 3 years ago

  • Status changed from Feedback to Closed

I'm closing this issue since Ming-Jen confirmed the problem has been fixed. I'll wait for Bill's feedback before closing #13703.


