W1 is having problems with DPM
Ming-Jen reported that W1 is having problems with old and new DPM.
He first reported he was getting a LARGE LIST error on new DPM. The LARGE LIST status is only a warning that you're asking for more than 1,000 devices. It still allows the request to go through, however. We've modified new DPM to not emit this warning until 2,000+ devices are requested.
He also reported that on new DPM, he's getting DPM_PEND and, occasionally, ACNET_NOREMMEM errors. On old DPM, he's getting
#1 Updated by Richard Neswold over 3 years ago
I think the DPM_PEND status indicates an issue with the front-end, since both old and new are reporting it.
On the other hand, the ACNET_NOREMMEM status is generated by acnetd when it's out of reply IDs. Charlie says when CLIB builds the device list, it doesn't wait for the previous reply before sending the next request; there is a burst of requests. We'll have to see if W2 slows down significantly if we serialize the requests that build the list.
Ming-Jen also mentioned his application has a mix of periodic requests along with one-shots. It's the one-shots that are getting the ACNET_NOREMMEM status, which makes sense if we're running out of reply IDs.
So here are three proposed solutions:
- Make CLIB serialize its requests.
- Increase the number of reply and request IDs in acnetd (requires a re-compile and re-installation)
- Add more DPMs (we have four running now and Ming-Jen's app is using 25% of the reply IDs)
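To illustrate why a burst of requests can exhaust reply IDs while serialized requests do not, here is a toy model. It is only a sketch: `ReplyIdPool`, `send_burst`, `send_serialized`, and the pool size of 256 are hypothetical names and numbers chosen for illustration, not the actual acnetd implementation or its limits.

```python
class ReplyIdPool:
    """Toy model of a fixed pool of reply IDs (the size is hypothetical)."""

    def __init__(self, size):
        self.free = size

    def acquire(self):
        # Returns True if an ID was available, False when the pool is empty.
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


def send_burst(pool, n_requests):
    """Send every request before any reply arrives; count the rejections."""
    held = 0
    rejected = 0
    for _ in range(n_requests):
        if pool.acquire():
            held += 1
        else:
            rejected += 1  # would surface as an out-of-reply-IDs status
    # Replies arrive only after the whole burst has been sent.
    for _ in range(held):
        pool.release()
    return rejected


def send_serialized(pool, n_requests):
    """Wait for each reply before sending the next request."""
    rejected = 0
    for _ in range(n_requests):
        if pool.acquire():
            pool.release()  # the reply arrives and frees the ID
        else:
            rejected += 1
    return rejected


pool = ReplyIdPool(256)             # hypothetical pool size
print(send_burst(pool, 1200))       # 944 requests find no free ID
print(send_serialized(pool, 1200))  # 0 rejections, at most one ID in use
```

In this model a larger pool (the second proposal) only moves the point at which a burst fails, while serializing removes the failure at the cost of one round trip per request.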
#3 Updated by Richard Neswold over 3 years ago
Worked with Ming-Jen this morning. We think there's still an issue with one-shot requests. W1's repetitive requests are working fine. This application, however, sends 1,200+ one-shots when a clock event fires. All the errors are associated with the one-shot requests.
Bill Marsh mentioned that his BBM application started having problems with one-shots today. I've added him to this issue.
#4 Updated by Richard Neswold over 3 years ago
We see that the 1,000 one-shot requests are being sent to front-ends individually instead of grouped together! This regression happened recently, probably when we separated one-shots from repetitive requests in CLIB. Now I'm trying to determine which layer of DPM isn't grouping the requests together (the fe_worker layer is seeing them one at a time).
#5 Updated by Richard Neswold over 3 years ago
- Status changed from Assigned to Feedback
We think we found and fixed (bc0b2ed4) the problem.
Although no client interface supports it yet, DPM has always supported a "fixed" data acquisition list. A fixed list can't be modified and restarted with other clients' requests. We added it because we thought some clients may want to have an uninterruptible stream of data. When Charlie and I split off one-shot requests (a month or two ago), we categorized the one-shots as "fixed" because once the one-shot is made, there's no sense in restarting it for another one-shot; just send the other one-shot.
The bug was that none of the fixed lists were getting merged at all, so every one-shot request to a front-end was creating a new request consisting of one device. In Ming-Jen's case, his 1,016 devices were creating 1,016 requests! MOOC front-ends can only handle ~200 active requests. Some busy front-ends, like MI3, already have many requests and couldn't handle the hundreds of new requests from Ming-Jen's application, hence the
This commit fixes the problem so that, if a fixed list hasn't yet been sent, we can add more device requests to it. This greatly reduces the number of temporarily active requests on a front-end.
I will release new DPMs on Monday.
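The difference between the buggy and fixed behavior described above can be sketched as follows. This is an illustration only: `PendingList`, `build_requests`, the `merge_unsent` flag, and the device names are hypothetical, and the example assumes all 1,016 devices live on a single front-end, which the report does not state.

```python
from collections import defaultdict


class PendingList:
    """A per-front-end request list; `sent` flips once it goes on the wire."""

    def __init__(self):
        self.devices = []
        self.sent = False


def build_requests(one_shots, merge_unsent):
    """Group (front_end, device) one-shots into front-end requests.

    merge_unsent=False models the bug: every one-shot gets its own list.
    merge_unsent=True models the fix: a device joins the front-end's
    fixed list as long as that list has not been sent yet.
    Returns the number of requests created per front-end.
    """
    lists = defaultdict(list)  # front_end -> [PendingList, ...]
    for front_end, device in one_shots:
        pending = lists[front_end]
        if merge_unsent and pending and not pending[-1].sent:
            pending[-1].devices.append(device)
        else:
            new_list = PendingList()
            new_list.devices.append(device)
            pending.append(new_list)
    return {fe: len(ls) for fe, ls in lists.items()}


# Hypothetical device names, all assumed to be on one front-end.
one_shots = [("MI3", "DEV%d" % i) for i in range(1016)]
print(build_requests(one_shots, merge_unsent=False))  # {'MI3': 1016}
print(build_requests(one_shots, merge_unsent=True))   # {'MI3': 1}
```

With merging enabled, the front-end sees a single request instead of 1,016, which stays well under the ~200 active-request limit mentioned above.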