SISPI ICS Alarms and Telemetry Guide

This document describes the software used to log ICS alarms and telemetry into the SISPI database.

Software Organization

The ICS software consists of Labview applications for the different control and monitoring tasks and a set of shell and python scripts that provide the necessary interface to allow alarm messages and telemetry information from the Labview applications to be archived in the DECam database. The Labview software is managed by the ICS team and is not further discussed in this Wiki. The shell and python scripts, however, are maintained in the SISPI svn repository and are managed as SISPI (eups) products. The following provides a brief overview of the repository structure.

The software for all ICS components is kept in the ICS sub-repository. The base URL is

svn+ssh://p-sispi@cdcvs.fnal.gov/cvs/projects/sispi/ICS

Software for the individual ICS components can be found under this base address. Currently we have the following devices: ICC, Heaters, CageTemps, ICSAlarms, LN2, Ionpump, FacilityUtilities and Photodiode. An additional directory named ICS is set up for general ICS scripts and documentation. There is also the ICSTelemetry package which contains the scripts for starting the Telemetry components. Each of these components is its own SISPI product.

Following SISPI conventions, each of these components has trunk, tags and branches sub-directories in the repository. Under these we have the standard ups directory for the table file and a bin directory for executables, links and shell scripts. Some products also have a python, src or doc directory for python scripts, source code and documentation, respectively.

The ICS scripts produce log files and hence need access to a writable directory. The environment variable $ICS_LOG_DIR is defined in the Site product table file. The alarm script writes to $ICS_LOG_DIR/alarms and the telemetry scripts write to $ICS_LOG_DIR/telemetry. Each script creates its own log file each time it starts. The log files are named as follows: Component-YYYY-MM-DD_hh-mm-ss.log
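The naming convention above can be illustrated with a short sketch. The function name log_file_path and its arguments are hypothetical and do not come from the ICS packages; only the directory layout and the file name pattern are taken from the text.

```python
import os
from datetime import datetime


def log_file_path(log_dir, subsystem, component, started=None):
    """Build a log file path of the form
    <log_dir>/<subsystem>/Component-YYYY-MM-DD_hh-mm-ss.log.

    log_dir   -- value of $ICS_LOG_DIR
    subsystem -- "alarms" or "telemetry"
    component -- e.g. "CageTemps"
    started   -- start time of the script (defaults to now)
    """
    started = started or datetime.now()
    stamp = started.strftime("%Y-%m-%d_%H-%M-%S")
    return os.path.join(log_dir, subsystem, f"{component}-{stamp}.log")
```

Because the time stamp has one-second resolution, two starts of the same component within the same second would map to the same file name in this sketch.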

A TCP/IP-based communications protocol is used to communicate with the Labview controllers and the ICS hardware. The IP addresses of these controllers are defined in the Site product.

Alarms

The process that gets alarms from the RIO and sends them to the database is $ICSAlarms_DIR/python/alarm_server_receive.py. It also checks that all the ICS telemetry scripts are running and issues an alarm if any of them are missing. In addition, it uses the class DBCheckICSTelemetry from the DB/telemetry package to check that data has been written to the ICS telemetry tables in the database within a time period that is defined in the code (currently 30 minutes). If no data has been stored within that interval, it issues an alarm.
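The freshness check can be sketched as follows. This is not the DBCheckICSTelemetry implementation, whose interface is not described here; the function stale_tables, the table names in the usage note and the input dictionary are assumptions made for illustration. It assumes the time of the most recent row of each table has already been fetched from the database.

```python
from datetime import datetime, timedelta

# Interval taken from the text above: currently 30 minutes.
MAX_AGE = timedelta(minutes=30)


def stale_tables(latest_rows, now, max_age=MAX_AGE):
    """Return the names of tables that should raise an alarm.

    latest_rows maps a table name to the datetime of its newest row,
    or to None if the table has no rows at all.
    """
    stale = []
    for table, latest in sorted(latest_rows.items()):
        # A table is stale if it is empty or its newest row is too old.
        if latest is None or now - latest > max_age:
            stale.append(table)
    return stale
```

For example, with hypothetical tables ln2 (last row 5 minutes ago) and ion_pump (last row 45 minutes ago), only ion_pump would be reported.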

The alarm process is started automatically at boot time by a script called $ICSAlarms_DIR/bin/alarmserver (a copy of this script is installed in /etc/init.d). This script runs $ICSAlarms_DIR/bin/run_alarmserver.sh, which sets up the correct environment and then runs the python code. The process id of the python code is stored in $ICS_LOCK_DIR/alarms/alarmserver.pid. The alarmserver script must be told which version of the Site package to use, because the Site product defines the connection information for the database. This is done by placing a file called Alarms_Site_version in /data_local/ICS. This file contains the version of the Site package to use, e.g. fnaldev or ctioprd (the dev or prd refers to the version of the database). Similarly, the version of the ICSAlarms package that will be run is controlled by a file in /data_local/ICS called ICSAlarms_version. The alarmserver script reads both files and passes each version as an argument to the run_alarmserver.sh script.
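Reading a version file of this kind can be sketched as below. The real alarmserver script is a shell script; this Python helper, its name read_version and its default argument are hypothetical and only illustrate the idea of a one-word version file.

```python
from pathlib import Path


def read_version(config_dir, name, default=None):
    """Return the first word of <config_dir>/<name>, e.g. "ctioprd".

    Falls back to default if the file is missing or empty, so a caller
    can decide what to do when no version has been configured.
    """
    path = Path(config_dir) / name
    try:
        text = path.read_text()
    except FileNotFoundError:
        return default
    words = text.split()
    return words[0] if words else default
```

A call such as read_version("/data_local/ICS", "Alarms_Site_version") would return the Site version, and the same helper could be used for ICSAlarms_version.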

The alarmserver script is a standard System V start/stop script. To use it, first run

setup ICSAlarms

then

alarmserver stop
alarmserver start
alarmserver status
alarmserver restart (stop and then start)

There is also a script $ICSAlarms_DIR/bin/alarmserver_status.sh. This uses the alarmserver script to monitor the status of the python code and restarts it if it finds that it is not running. This runs every 15 minutes and is controlled by a crontab entry in the sispi account. If it does a restart it sends email to an alias decam-alarms that is defined in the .mailrc file in the sispi account.


alias decam-alarms buckley@fnal.gov kuhlmann@anl.gov cease@fnal.gov mbonati@ctio.noao.edu estrada@fnal.gov
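How the alarmserver script determines its status is not described in this document. As a rough illustration only, a status check based on the pid file could look like the sketch below. The function name pid_is_running is hypothetical, and the approach (sending signal 0 on a POSIX system) is an assumption rather than the method used by the script.

```python
import os


def pid_is_running(pid_file):
    """Return True if pid_file names a process that currently exists."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        # No pid file, or it does not contain a number.
        return False
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks
        # without delivering a signal to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A monitoring job could call this with $ICS_LOCK_DIR/alarms/alarmserver.pid and trigger a restart when it returns False. Note that a stale pid file whose number has been reused by an unrelated process would still be reported as running.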

There is an external alarm handler (ExternAlarmHandler) that runs on ics1.ctio.noao.edu and triggers on ICS alarms that are sent to the database. The handler sends email to the appropriate list of people when a given alarm happens. This mapping between alarm types and the people who should be notified is stored in the alarm schema in the database.

If for some reason you want to turn off alarms into the database, you will need to disable the crontab entry; otherwise it will keep restarting the process.

Telemetry

The ICS produces various telemetry data that need to be logged into the database. For the ICS components CageTemps, Heaters, Ionpump and LN2 there are python scripts in the DB/telemetry package that store the data in the appropriate tables in the database. These python scripts

ICS_CageTemperatures.py
ICS_FocalPlaneTemperatureCntrl.py (invoked by run_Heaters.sh)
ICS_LN2_process.py
ICS_IonPump.py
ICS_FacilityUtilities.py

are invoked by shell scripts that are in the corresponding ICS package

run_CageTemps.sh
run_Heaters.sh
run_LN2.sh
run_Ionpump.sh
run_FacilityUtilities.sh

All the telemetry shell scripts set up their environment using eups. Each script has an associated data directory, whose location is defined in the individual script, e.g. $ICS_CAGETEMPS_DATA = /data/ICS/CageTemps. The script changes directory to the new sub-directory of the data directory and then contacts the CompactRIO to copy files and store their contents in the database. Each file is moved to the done sub-directory after it has been processed. The script records log messages in timestamped files in $ICS_LOG_DIR/telemetry; a new file is created each time the process is started.
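The new/done handling described above can be sketched as follows. The function process_new_files and the store callback are hypothetical; the actual scripts use the DB/telemetry package to write to the database, and the step that copies files from the CompactRIO is left out here.

```python
import shutil
from pathlib import Path


def process_new_files(data_dir, store):
    """Store every file found in <data_dir>/new, then move it to
    <data_dir>/done. Returns the names of the files processed.

    store is a callable that receives the file contents as text;
    it stands in for the database insert.
    """
    new_dir = Path(data_dir) / "new"
    done_dir = Path(data_dir) / "done"
    done_dir.mkdir(parents=True, exist_ok=True)

    processed = []
    for path in sorted(new_dir.glob("*")):
        if not path.is_file():
            continue
        store(path.read_text())
        # Move only after the contents were stored, so a failure in
        # store() leaves the file in new/ for the next attempt.
        shutil.move(str(path), str(done_dir / path.name))
        processed.append(path.name)
    return processed
```

Moving the file only after a successful store means that an interrupted run can simply be repeated, at the cost of possibly storing the same file twice if the move itself fails.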

The telemetry for the ICC and Photodiode is read from the CompactRIO over a socket. These python scripts live in the python sub-directory of the ICC and Photodiode products respectively and use the class DBStoreICSTelemetry from the DB/telemetry package to store the data in the appropriate table. The ICC script also detects faults on the crates and issues alarm messages.

All the telemetry scripts are started automatically at boot time by a script called $ICSTelemetry_DIR/bin/telemetryserver (a copy of this script is installed in /etc/init.d). It is necessary to tell the telemetryserver script which version of the Site package to use. This is because the Site product defines the connection information for the database. This is done by placing a file in /data_local/ICS that is called Telemetry_Site_version. This file contains the version of the Site package to use, e.g. fnaldev, ctioprd (the dev or prd refers to the version of the database). The telemetryserver script reads this file and passes the version as an argument to the run_<process>.sh scripts.

The telemetryserver script is a standard System V start/stop script. To use it, first run

setup ICSTelemetry

then

telemetryserver start - starts everything
telemetryserver stop - stops everything
telemetryserver status - shows status on everything
telemetryserver restart - stops everything and then starts everything

It can also be used to start, stop, restart or check each component individually.

setup ICSTelemetry

then

telemetryserver start <process>
telemetryserver stop <process>
telemetryserver status <process>
telemetryserver restart <process>

process = CageTemps Heaters Ionpump ICC LN2 Photodiode FacilityUtilities

Each process has its own pid file in $ICS_LOCK_DIR/telemetry.

There is also a script $ICSTelemetry_DIR/bin/telemetryserver_status.sh. This uses the telemetryserver script to monitor the status of the scripts. If it finds that one of them is not running it restarts it. This runs every 15 minutes and is controlled by a crontab entry in the sispi account. If it does a restart it sends email to an alias decam-telemetry that is defined in the .mailrc file in the sispi account.

alias decam-telemetry buckley@fnal.gov kuhlmann@anl.gov kkuehn@hep.anl.gov estrada@fnal.gov

If for some reason you want to turn off telemetry into the database, you will need to disable the crontab entry; otherwise it will keep restarting the processes. I don't think this should be necessary, as these processes do not send emails like the alarms and can be left running all the time unless there is a good reason not to.

Current known issues - it is not known whether any of these issues still exist as of 2018-12-28

  • Sometimes the telemetry scripts get stuck trying to copy files from the CompactRIO or the CompactFieldpoint. It is not clear why this happens. Restarting the offending script usually clears the problem although sometimes it may take several restarts.
  • For some reason, that I have been unable to debug, the telemetry and alarmserver scripts do not actually get started properly at boot time. They will start when cron runs the telemetryserver_status.sh and alarmserver_status.sh monitoring scripts.
  • There is a problem with the COMPONENT status message from the RIO sometimes containing no data. This causes the alarm_server_receive.py script to crash. It will be restarted when cron runs the alarmserver_status.sh monitoring script. This seems to be related to the enabling/disabling of the ALARM_STATUS flag but Marco has not been able to track it down yet.

CTIO-Specific configuration

All the components are installed in /usr/remote/software/products. These directories are just for the code and are owned by the codemanager account. Things such as data, log files, pid files or anything else that is transient should not be stored there. The working directories are located in /data_local/ICS. This is where all things like data, logfiles etc should be stored. All the scripts run on ics1.ctio.noao.edu. Both $ICS_LOCK_DIR and $ICS_LOG_DIR point at /data_local/ICS.

Crontab entries

alarmserver_status.sh (sispi account on ics1)

0-59/15 * * * * source /usr/remote/software/products/eups/bin/setups.sh; setup ICSAlarms; alarmserver_status.sh

telemetryserver_status.sh (sispi account on ics1)

0-59/15 * * * * source /usr/remote/software/products/eups/bin/setups.sh; setup ICSTelemetry; telemetryserver_status.sh

ExternAlarmHandler (sispi account on ics1)

2 * * * * source /usr/remote/software/products/eups/bin/setups.sh; setup python; setup Site; setup ExternAlarmHandler; ExternAlarmHandler decam_alarms@ctio.noao.edu mail1.ctio.noao.edu -i extern -p /tmp/extern_alarm_handler.pid -v &> /tmp/extern_alarm_handler.log &