Project

General

Profile

Bug #18101

"Tries needed" not working properly

Added by Richard Neswold about 2 years ago. Updated over 1 year ago.

Status:
Feedback
Priority:
Normal
Category:
-
Target version:
Start date:
11/02/2017
Due date:
% Done:

0%

Estimated time:
Duration:

Description

Wally reports that the alarm block's "tries needed" isn't working as intended. He noticed the problem with S:SUMPS. The sump pump would run for ~12 minutes and turn off for ~3 minutes. He set the "tries needed" field to 100 to detect when the pump was on for a long time. He saw the "tries" increase until it hit 100 and the alarm was reported. Then he saw it count down to 68 tries when the pump was off. Then, when it kicked on again, it immediately went to 100 and back into alarm. Attached is a movie showing the behavior.

There are two ways the "tries needed" field should be used (the described behavior is not either of them):

  • Device is in a good state. The device has to be in a bad state for "tries" scans before entering the alarm state. If the device "goes good", the "tries" count is reset to zero.
  • Similar to above except that the device has to be good for "tries needed" scans before the alarm is removed. In other words the "tries" field has to count up and down "tries needed" scans before the alarm state changes. This gives hysteresis for both states. Wally believes this is the intended behavior.

Which algorithm should it be? And make MOOC implement it properly.

digital_alarm.mov (21.2 MB) digital_alarm.mov Richard Neswold, 11/02/2017 10:32 AM

History

#1 Updated by Richard Neswold about 2 years ago

The comments in MOOC agree with the second description (which Wally believed to be the correct behavior.) Now to check if the code implements it properly...

#2 Updated by Richard Neswold about 2 years ago

  • Status changed from New to Feedback
  • Assignee set to Richard Neswold

Proposed fix: 9a72f81c

The old code would partially update the state of the alarm while tnow was being updated so it would get confused when the status changed from good to bad, or vice versa. It handled it by slamming tnow to either 0 or tneeded, whichever was more appropriate.

The new code only updates the alarm state when tnow reaches one of its limits.

Could someone review the fix? We need to test this on a test front-end, too.

#3 Updated by Dennis Nicklaus about 2 years ago

Summary: If the device really did go out of active alarm (off the alarm screen, "gone good") when it went from 100 to 99 (or 68 or whatever), then that is incorrect behaviour. But jumping back to TNow=100 after one bad reading is not necessarily wrong. See below.

I reviewed the code, and I conclude:
  1. I understand (kill me now, please) the "jump back to TNow=TNeeded" behaviour and agree with Rich, above.
  2. The above modified code will miss one case: where the acrsend/dcrsend fails. The block is marked as active/inactive, depending on the direction, but the going good/bad message will not be sent ever, whereas with the previous code, I think it would re-attempt the send each time the block re-scans as bad.

And I have a question for Wally about his observation, and about the behaviour in general about what Wally observed, and Rich fixed. When you say, "Then he saw it count down to 68 tries when the pump was off. Then, when it kicked on again, it immediately went to 100 and back into alarm" and, in particular, by back into alarm do you mean to imply that it was removed from the alarm screen ("went good") as soon as it came down one from 100, and thus was restored to the alarm screen when it jumped back to 100? Or are you just complaining about the fact that it suddenly jumped back from 67 to 100?

Because the effect of the old code was that it has to be TNeeded in a row in order to make the state change. With the old/existing code, you should see the similar behaviour when counting up. With TNeeded=100, it might go from 0 to 5 with bad readings, then get one good reading, and you're back to starting over at 0.

The "in a row" vs. cumulative count is a somewhat significant difference, and if that's the focus of this issue, then I think that is a rather significant change that deserves more than cursory discussion. The old/existing Mooc implementation intends the "in a row" interpretation.

#4 Updated by Richard Neswold about 2 years ago

Dennis Nicklaus wrote:

The above modified code will miss one case: where the acrsend/dcrsend fails. The block is marked as active/inactive, depending on the direction, but the going good/bad message will not be sent ever, whereas with the previous code, I think it would re-attempt the send each time the block re-scans as bad.

Oops! Good catch. This should fix it: af0eb2d2

#5 Updated by Richard Neswold about 2 years ago

Dennis Nicklaus wrote:

Because the effect of the old code was that it has to be TNeeded in a row in order to make the state change. With the old/existing code, you should see the similar behaviour when counting up. With TNeeded=100, it might go from 0 to 5 with bad readings, then get one good reading, and you're back to starting over at 0.

The "in a row" vs. cumulative count is a somewhat significant difference, and if that's the focus of this issue, then I think that is a rather significant change that deserves more than cursory discussion. The old/existing Mooc implementation intends the "in a row" interpretation.

Here's a comment from the MOOC code:

 *   -- Alarm block state (GB, HI, LO, TNOW) should always track the
 *      current conditions (except for bypassed alarms).
 *
 *   -- If an alarm is bypassed it enters a good state: meaning that
 *      GB, HI, LO, and TNOW are cleared and any outstanding alarm message
 *      must be cancelled by a 'good' event report.
 *
 *   -- If a channel goes within bounds, TNOW counts down to but not beyond
 *      zero. As soon as TNOW == 0 a 'good' event report is issued only if
 *      a 'bad' one was previously posted.
 *
 *   -- If a channel goes out of bounds, TNOW counts up to but not beyond
 *      TNEEDED.  As soon as TNOW == TNEEDED we post a 'bad' event report.
 *
 *   -- If we are unable to send an event report, and the channel remains
 *      in the same state, we will notice this fact the next time we scan
 *      it.  The event report will be sent at that time.  No special processing
 *      is necessary.

It doesn't mention reseting tnow which makes it seem MOOC has been broken for a while. Today's fix brings it in line with Wally's interpretation and with the comment's description.

#6 Updated by Richard Neswold about 2 years ago

Dennis Nicklaus wrote:

The "in a row" vs. cumulative count is a somewhat significant difference, and if that's the focus of this issue, then I think that is a rather significant change that deserves more than cursory discussion. The old/existing Mooc implementation intends the "in a row" interpretation.

FWIW, I prefer the "in a row" interpretation rather than the "counting up and down" version. In fact, since most alarms have low "tries needed" values, I don't think I ever realized that we count down to zero to "go good". I think the "tries now" should only count from 0 to "tried needed" to change the state of the alarm. Once the alarm state has changed, "tries now" should reset to 0.

But like Dennis says, this requires more discussion. Maybe others can chime in?

#7 Updated by Dennis Nicklaus about 2 years ago

Kucera says that Linac frontends implement the "in a row" philosophy for tries needed.

#8 Updated by Richard Neswold almost 2 years ago

This (5b2525d1) commit makes it so that "tries now" always counts from 0 to "tries needed" (instead of the up/down counting.)

If someone sees "tries now" at 60 and "tries needed" is 100, are they 60% on the way to going bad? Or 40% on the way to going good? They'd have to look at the current state of the alarm to figure out which direction "tries now" needed to go. That's too complicated and it's silly that "tries now"'s value has alarm state cooked into it; it should only indicate how far left there is to go.

#9 Updated by Richard Neswold over 1 year ago

  • Target version changed from MOOC v4.8 to MOOC v5.0

Push to next version.



Also available in: Atom PDF