Bug #22390

MACALC devices stop updating after some amount of time

Added by Dennis Nicklaus 8 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Assignee: -
Start date: 04/16/2019
Due date:
% Done: 0%
Estimated time:
Duration:

Description

This has been noticed lately with the CMTS devices T:LGSUM, T:LGSUMC, and T:LPSUMC.
They simply stop updating their values after anywhere from a few hours to a few days. There are lots of unknowns. Is it just these devices? I don't think so; I think these devices are just monitored more closely, so the problem gets noticed.
The calculation jobs should be continually receiving readings, which in turn lets them update their values.
Resetting a device (which restarts its data acq. job) clears the problem and makes it update again, but we shouldn't have to do that routinely.
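
Until the root cause is found, a freeze like this could at least be detected automatically instead of by someone watching the devices. Below is a minimal sketch of a staleness check, not anything that exists in MACALC or DCE10. The class name `StalenessWatchdog`, the `record`/`stale` methods, and the 60-second threshold in the usage note are all hypothetical and chosen only for illustration.

```python
import time


class StalenessWatchdog:
    """Track the last update time per device and report devices that went quiet.

    Hypothetical helper for illustration; not part of MACALC.
    """

    def __init__(self, max_age_s, clock=time.monotonic):
        self.max_age_s = max_age_s   # allowed gap between updates, in seconds
        self.clock = clock           # injectable so the check can be tested
        self.last_update = {}        # device name -> time of last update

    def record(self, device):
        """Call whenever a device publishes a new calculated value."""
        self.last_update[device] = self.clock()

    def stale(self):
        """Return the sorted names of devices whose last update is too old."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_update.items()
            if now - seen > self.max_age_s
        )
```

Calling `record("T:LGSUM")` on every new value and polling `stale()` periodically (with, say, `max_age_s=60`) would flag a frozen device within about a minute, which would also give a tighter freeze timestamp to compare against the logs.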

History

#1 Updated by Dennis Nicklaus 8 months ago

I started a little experiment this afternoon: a calculation similar to T:LGSUM running on the erlang front-end clx29e, using the erlang dpmclient interface, with the result going into N:M1LSGV. I want to see whether its readings stop, or get canceled and restarted, at all, and whether that correlates with the MACALC T:LGSUM. Also, there aren't a million devices in clx29e, so I can see what is happening a little more clearly.
I reset the three noted macalc devices at around the same time that I started the erlang job.

#2 Updated by Dennis Nicklaus 8 months ago

OK, that experiment wasn't obviously helpful. No problems with the clx29 erlang data acq. (I believe it would log something if the dpm_client jobs had to be restarted.)
T:LGSUM died/got frozen at 4/16 20:11:29. There's nothing in the dce10/macalc log between 20:10:56 (PoolPingRestart of an MI2 job) and 20:11:56 (PoolPingRestart of an MRF job).
T:LGSUMC froze at 4/17 04:52:56. In the log at that second are these two apparently unrelated msgs: "AcnetNodeInfo, nodeAnnouncement test: false, DI: 140214, DCE10 DAE_is_up"
and "StateEventDecoder, sequenceNumber: 574, lastSequenceNumber: 5678, numLost: 19348 at Wed, Apr 17 04:53:48 CDT 2019" (Yes, it seems to be logging a timestamp from the future?)
Now, at 04:53:56 (exactly one minute after the freeze of LGSUMC) there are several PoolPingRestart messages for various frontends, but none of those frontends are the origin nodes for LGSUMC's components. These PoolPingRestart messages also seem to recur regularly (and annoyingly) every 3 minutes.
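
To make the timing comparison above easier to repeat for other freezes, here is a small sketch that computes how far each noted log event is from a freeze time. It uses only the timestamps quoted in this comment; the function name `offsets_to_events` and the event labels are made up for illustration, and the year 2019 is taken from the quoted log message.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def offsets_to_events(freeze, events):
    """Return (label, seconds) pairs, where seconds = event time - freeze time.

    Negative means the event was logged before the freeze. The result is
    sorted so the event closest in time to the freeze comes first.
    """
    frozen_at = datetime.strptime(freeze, FMT)
    pairs = []
    for stamp, label in events:
        delta = datetime.strptime(stamp, FMT) - frozen_at
        pairs.append((label, int(delta.total_seconds())))
    return sorted(pairs, key=lambda pair: abs(pair[1]))


# Timestamps copied from the observations in this comment.
LGSUM_FREEZE = "2019-04-16 20:11:29"
LGSUM_EVENTS = [
    ("2019-04-16 20:10:56", "PoolPingRestart MI2"),
    ("2019-04-16 20:11:56", "PoolPingRestart MRF"),
]

LGSUMC_FREEZE = "2019-04-17 04:52:56"
LGSUMC_EVENTS = [
    ("2019-04-17 04:53:48", "StateEventDecoder embedded timestamp"),
    ("2019-04-17 04:53:56", "PoolPingRestart (several frontends)"),
]

if __name__ == "__main__":
    for name, freeze, events in [
        ("T:LGSUM", LGSUM_FREEZE, LGSUM_EVENTS),
        ("T:LGSUMC", LGSUMC_FREEZE, LGSUMC_EVENTS),
    ]:
        for label, seconds in offsets_to_events(freeze, events):
            print(f"{name}: {label} at {seconds:+d} s")
```

For T:LGSUM the two PoolPingRestart entries fall 33 s before and 27 s after the freeze; for T:LGSUMC the timestamp inside the StateEventDecoder message is 52 s after the freeze and the PoolPingRestart burst is 60 s after.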

#3 Updated by Dennis Nicklaus 8 months ago

Just for completeness: T:LPSUMC froze at 4/17 09:09:22. On DCE10 I did a "Dump Pools" of DerivedClientByEvent (superclass of MACALC), and it shows last-set times matching the freeze times for these three devices, which is not surprising.


