Bug #21672

ps_lxi_driver downloads problem at startup, alarms fail

Added by Dennis Nicklaus 10 months ago. Updated 23 days ago.

Status: Feedback
Priority: High
Category: Sorenson XG/SG Power Supply Driver
Target version:
Start date: 01/11/2019
Due date:
% Done: 0%
Estimated time:
Duration:

Description

When starting clx30e, there appear to be startup order/race conditions.
I upgraded it to Erlang 21.1 and daq 1.7. Daq 1.7 now prints a message to the log when an alarm scan dies.
At startup, a lot (all?) of the alarm scans die, apparently. The 'DOWN' message shows that they exit with status=normal.
That would generally happen when the driver returns ERR_DEVFAILED for a reading.

I attempted to make things work by using acl to run "download node=clx30e" and restart all the alarms. When I did that, I noticed a lot of setting messages ("Set current to ...") printed out that otherwise didn't get printed at startup.

So it appears that the driver isn't really ready for downloads when the downloads happen.

I don't think going to 21.1 or daq 1.7 had anything to do with causing this failure. The only difference is that the new version prints the 'DOWN' messages when the alarm scans die, and that makes the problem apparent.

History

#1 Updated by Dennis Nicklaus 10 months ago

Just to clarify -- generally, it is the framework that would return the DEVFAILED error when it cannot find a driver process, not the individual driver code itself that would return that error.

#2 Updated by Richard Neswold 10 months ago

  • Category changed from ACSys/FE Framework to Sorenson XG/SG Power Supply Driver

The driver doesn't accept settings until it's communicating with the power supply. If we don't have communications with the power supply when a setting comes in, the driver needs to remember and apply it when we regain communications.
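The "remember and apply" behavior described above could look roughly like the following. This is a minimal Python sketch for illustration only (the actual driver is Erlang); the class name `SettingBuffer`, the `send` callable, and the choice to keep only the newest value per device are all assumptions, not the driver's design.

```python
class SettingBuffer:
    """Holds the most recent setting per device while the supply is unreachable."""

    def __init__(self, send):
        self._send = send        # hypothetical callable(device, value) that talks to the supply
        self._connected = False
        self._pending = {}       # device -> latest value not yet applied

    def set(self, device, value):
        if self._connected:
            self._send(device, value)
        else:
            # Assumption: only the newest value matters; older pending ones are superseded.
            self._pending[device] = value

    def on_connect(self):
        """Call when communications are (re)established; flushes pending settings."""
        self._connected = True
        pending, self._pending = self._pending, {}
        for device, value in pending.items():
            self._send(device, value)

    def on_disconnect(self):
        self._connected = False
```

With this shape, a download that arrives before the supply is reachable is not lost; it is applied once when `on_connect` runs.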

#3 Updated by Richard Neswold 10 months ago

Here's a proposed fix:

Delete the setting record property in the database for each device so that settings are not sent down when the front-end is restarted.

  • For front-end restarts, it connects to each power supply and reads the current setting.
  • When the front-end and power supplies are off for a long time (a shutdown, for example) and then turned back on, we won't try to use a setting that's weeks (or months) old. The supply will come up off and the operators will decide what the setting should be.

Chip: comments?

#4 Updated by Richard Neswold 10 months ago

We should retry this with the latest version of the framework. Dennis and I found code that was still using sync:to_ms/1. We fixed it here: b35acc53.

#5 Updated by Richard Neswold 10 months ago

  • Status changed from New to Feedback

Two fixes:

commit 14f337f8 -- In this change, after the front-end software is up and running, we start a background task which sleeps for 5 seconds before asking for settings and alarms to be downloaded. This gives drivers a little time to initialize themselves before the settings start appearing.
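As an illustration of the delayed-download idea in 14f337f8, here is a minimal Python sketch (the framework itself is Erlang, and this is not its code). The function name `schedule_download` and the `request_download` callback are hypothetical; only the 5 second default comes from the commit description.

```python
import threading

def schedule_download(request_download, delay_s=5.0):
    """Start a background timer that calls request_download after delay_s seconds.

    request_download is a hypothetical zero-argument callable standing in for
    "ask for settings and alarms to be downloaded".
    """
    timer = threading.Timer(delay_s, request_download)
    timer.daemon = True   # do not keep the process alive just for this timer
    timer.start()
    return timer
```

A fixed delay only narrows the race window; drivers that take longer than the delay to initialize would still miss the download.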

commit dev-ps_lxi|e6016c96 -- In this change, the driver no longer waits 15 seconds before attempting communications; it tries right away. If that attempt fails, it uses the 15 second delay before trying again. Together with the previous change, this should remove the race condition.
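The retry schedule from e6016c96 (first attempt immediately, delay only after a failure) can be sketched as below. This is illustrative Python, not the Erlang driver code; `connect_with_retry`, `try_connect`, `max_attempts`, and the injectable `sleep` are hypothetical names added for the example.

```python
import time

def connect_with_retry(try_connect, retry_delay_s=15.0, max_attempts=None, sleep=time.sleep):
    """Call try_connect right away; sleep retry_delay_s only after a failed attempt.

    try_connect is a hypothetical callable returning a connection or None.
    max_attempts=None retries forever.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        conn = try_connect()
        if conn is not None:
            return conn
        # Delay only if another attempt will follow.
        if max_attempts is None or attempts < max_attempts:
            sleep(retry_delay_s)
    return None
```

The point of the ordering is that a supply which is already reachable at boot is connected before any download arrives, instead of 15 seconds later.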

If this works, I'll close out the issue.

#6 Updated by Richard Neswold 2 months ago

  • Target version set to dev-ps_lsi v1.0

Did these changes help the situation? Are we still seeing boot problems?

#7 Updated by Dennis Nicklaus about 1 month ago

Still seeing the problem, but to a lesser degree. See the erlang.log.2 file from a boot today 10/16/19 15:52.
The front-end starts and gives the "connected to supply" messages, then 2 seconds later the "Restored communications with " messages. Mixed in, before it is completely up, are two "alarmloop received unknown message {'DOWN..." messages, which mean that an alarm scan died.

There were other problems earlier today, which I think were caused by not having the recent version of the ps_lxi software installed on clx30. I rebuilt and installed it and it seems OK now.

#8 Updated by Richard Neswold 23 days ago

dev-ps_lxi|4eea2826 : Added an option to gen_tcp:connect/4 to only wait 2 seconds for a connection. This lets the worker process wake up much sooner to respond to requests from the driver process.

dev-ps_lxi|c836d2ef : Removed the initial 1 second delay so the worker process tries to connect to the power supply sooner.
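For readers unfamiliar with bounded connect timeouts, the effect of the 2 second limit in 4eea2826 can be illustrated in Python (the driver itself uses Erlang's gen_tcp:connect/4, whose fourth argument is the timeout). The function name `try_connect` and the decision to return None on failure are assumptions made for this sketch.

```python
import socket

def try_connect(host, port, timeout_s=2.0):
    """Return a connected socket, or None if no connection is made within timeout_s."""
    try:
        # The timeout bounds the connect attempt; it also remains set on the
        # returned socket for later operations unless changed with settimeout().
        return socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts
        # (socket.timeout is an OSError subclass).
        return None
```

With a short bound like this, a worker blocked on an unreachable supply returns to its caller within about 2 seconds rather than waiting for the operating system's default connect timeout.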


