Feature #6397

Add detection of missing spill server to watchDataDisk.sh

Added by Peter Shanahan over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Start date: 06/02/2014
Due date:
% Done: 100%
Estimated time:
Spent time:
Duration:

Description

This issue is a test drive of the Software Quality Assurance (SQA) procedure for the Online Support Group, which will serve as a template for the NOvA DAQ SQA procedure. In this case, the "Change Management" section applies.

An SQA level has not yet been applied for watchDataDisk.sh, but Low would be appropriate, given that this is a monitoring system designed to provide extra checks.

Generation: This change adds functionality to watchDataDisk.sh to detect when SpillServer is not running. Shifters seem to consistently miss other cues, such as red and pink boxes in the SpillServer monitor. The need for additional means of bringing SpillServer outages to the shifters' attention has been discussed in NOvA FarDet Outfitting meetings, and in this ECL entry.
Disk Watcher is a candidate for this, since it knows whether a run is going for each partition.

SQALevel: Low

Approval: No prior approval needed - 2 hours of work (<32) anticipated.

Requirements: The script must detect when Pulse-Per-Second data is not incrementing for a partition, despite a run being in progress, and issue a well-defined warning message to the message service for that partition. There is no requirements document for this script.

Design: There are no major logic changes. An additional check for a recent "t05" file update is added to the function checkStatusChange. For each partition, if a run is detected to be going but the t05 file has not been updated in activeTime seconds, a warning message is sent to the message facility backbone. The text is
  Partition <partitionnumber> PPS stream not updating. SpillServer might be broken.
This message can be caught by the message analyzer, if it is running.
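
As a rough illustration of the decision described above, here is a minimal bash sketch. The function names pps_warning_text and check_pps_for_partition, and the 0/1 flag arguments, are hypothetical and are not taken from watchDataDisk.sh. The sketch only prints the warning; how the real script hands the text to the message facility is not shown here.

```shell
# Hypothetical helper: format the warning text quoted in the Design section.
pps_warning_text() {
    local partition="$1"
    printf 'Partition %s PPS stream not updating. SpillServer might be broken.\n' "$partition"
}

# Hypothetical decision step (not the actual checkStatusChange code).
#   $1 = partition number
#   $2 = 1 if a run is in progress for the partition, 0 otherwise
#   $3 = 1 if the t05 file was updated recently enough, 0 otherwise
# Prints the warning only when a run is going and the stream is stale.
check_pps_for_partition() {
    local partition="$1" run_active="$2" pps_recent="$3"
    if [ "$run_active" -eq 1 ] && [ "$pps_recent" -eq 0 ]; then
        # Assumption: the real script would send this to the message
        # facility; this sketch just writes it to stdout.
        pps_warning_text "$partition"
    fi
}
```

For example, `check_pps_for_partition 1 1 0` prints the warning for partition 1, while `check_pps_for_partition 1 0 0` prints nothing because no run is in progress.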

Test Plan: This will be tested first on NDOS, with the following tests:
  1. Does the warning get generated if the PPS stream is disabled in the configuration?
  2. Does the warning not get generated if the PPS stream is enabled and incrementing?

User Documentation: This is a simple script providing additional backup. User Documentation does not exist at this time. Shifter instructions will be updated when this feature is rolled out.

Further steps await testing. Documentation time to this point: 15 minutes. Coding time: 15 minutes.

History

#1 Updated by Peter Shanahan over 5 years ago

Peter Shanahan wrote:

Design: There are no changes to major logic changes. An additional check for a recent "t05" file update is used in function checkStatusChange. For each partition, if a run is detected to be going, but the t05 has not updated in activeTime seconds, a warning message is sent to the message facility backbone. The text is

Design: The check for a recent "t05" file is replaced by a check for "t05" or "PPS", since the latter is used at NDOS. To clarify, this check is done by grepping for this pattern in the names of all recently touched files.
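
One way such a name-based check could look in bash is sketched below. The function name has_recent_pps_file, the directory argument, and the use of find -mmin to select recently touched files are assumptions for illustration; the actual script may gather its list of recent files differently.

```shell
# has_recent_pps_file DIR MINUTES   (hypothetical helper)
# Succeeds (exit status 0) if any regular file under DIR that was modified
# within the last MINUTES minutes has "t05" or "PPS" in its file name.
has_recent_pps_file() {
    local dir="$1" minutes="$2"
    # basename strips the directory part so that only file names are
    # matched, not parent directories that happen to contain the pattern.
    find "$dir" -type f -mmin "-$minutes" -exec basename {} \; 2>/dev/null \
        | grep -E 't05|PPS' >/dev/null
}
```

A caller could then write `if ! has_recent_pps_file "$dataDir" 3; then ...` to take action when no matching file has been touched, where dataDir is again a placeholder name.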

#2 Updated by Peter Shanahan over 5 years ago

Testing: Tests were successfully completed on June 3, 2014.

Design (Update): During testing, it became clear that simply checking for the existence of a recent t05/PPS stream is not sufficient if activeTime+1 minutes [3] is close to the subrun duration, since a new file is opened for every subrun. There is now an additional check that the t05/PPS file has a timestamp less than ppsLagMax seconds [20] in the past.
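
The timestamp comparison can be sketched in bash as follows. The function name pps_file_is_fresh is hypothetical, and the sketch assumes GNU stat (stat -c %Y) for reading the modification time; it is not the code from watchDataDisk.sh.

```shell
# pps_file_is_fresh FILE MAX_LAG_SECONDS   (hypothetical helper)
# Succeeds (exit status 0) if FILE was last modified less than
# MAX_LAG_SECONDS seconds ago; returns 2 if the file cannot be read.
pps_file_is_fresh() {
    local file="$1" max_lag="$2"
    local now mtime
    now=$(date +%s)
    # Assumption: GNU coreutils stat. On BSD systems the equivalent
    # would be: stat -f %m "$file"
    mtime=$(stat -c %Y "$file" 2>/dev/null) || return 2
    [ $(( now - mtime )) -lt "$max_lag" ]
}
```

With ppsLagMax set to 20, `pps_file_is_fresh "$ppsFile" "$ppsLagMax"` would fail for a file whose last update is 20 or more seconds old, which is the condition under which the warning should be raised during a run. Here ppsFile is a placeholder for the most recent t05/PPS file.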

#3 Updated by Peter Shanahan over 4 years ago

  • Status changed from New to Resolved

#4 Updated by Peter Shanahan over 4 years ago

  • % Done changed from 0 to 100

