SLC - Experts Only » History » Version 32

Andrew Mogan, 03/05/2020 02:35 PM


Slow monitoring and control - Experts Only

Caution!

  • This page contains information for experts and links to information for experts.
  • Even if something on this page looks like a procedure, it most likely assumes additional knowledge and/or requires additional steps not included on this page.
  • Some of the information here might be misleading or confusing to non-experts. Confusion, loss of time, lost data, or even damage might result if misused. (That said, most of this page is harmless.)

There are some non-expert trouble-shooting procedures: see SLC - Troubleshooting.

Background references

IOC setup info

There are a number of different I/O controllers that serve EPICS channel data.

PMT HV controller

The PMT HV is now supplied by a Wiener power supply. Previously it was supplied by a VME-based BiRa system with its own VME controller. If you are interested in reading the old instructions about setup and troubleshooting of the BiRa system, see revision 26 or earlier of this wiki page.

IOC for Wiener supplies + soft channels

Multiple IOCs run on ubdaq-prod-smc that collect Wiener power supply data via SNMP and provide "soft" channels for various calculations and for data "pushed" from various scripts (see Data import scripts on this page).

The main setup for this is the EPICS "db" and "dbd" files defining the channels, plus a fairly standard IOC startup script to load them. See EPICS db setup info on this page, and also the startup script smc-uB.iocsh.

The IOCs are automatically started by the run_smc.sh script -- see the run_smc.sh section below.

In the event that ubdaq-prod-smc suffers a hardware failure, these IOCs and the run_smc.sh script will run just as well on ubdaq-prod-smc2.

IOC for Power Distribution Units

An IOC runs on ubdaq-prod-ws01 to collect data from the Power Distribution Units (PDUs) in the DAQ room. The communication is via SNMP. Network configuration currently prevents connecting to the PDUs from any computer other than ubdaq-prod-ws01, ubdaq-prod-ws02, or ubdaq-prod-ipmi.

The IOC is automatically started by the run_ws01.sh script -- see the run_ws01.sh section below.

In the event that ubdaq-prod-ws01 suffers a hardware failure, the IOC and the run_ws01.sh script will run just as well on ubdaq-prod-ws02.

Rackmon ("Glomation") boxes

Basic orientation:

  • The Glomation boxes start their IOCs using a script start_ioc.sh in the home directory of the uboonedaq user.
  • The script is executed as the uboonedaq user at boot via a crontab entry.
  • It runs the IOC in a "screen" session.
  • There is a file named RACK in the uboonedaq user home directory that says what rack the Glomation is in, using the rack short names used in EPICS variable names. (E.g., PM01 for purity monitor 1.)

Installing the software and updating Glomations with new software is sufficiently complicated that it has its own wiki page: SMC - Installing software on Glomations - Experts Only. That wiki page also has information about swapping rackmon boxes.

Data import scripts

There are several data import programs for various things. Almost all run on the uboonesmc account on ubdaq-prod-smc.fnal.gov. One runs continuously -- CaputServer -- and it is started by the run_smc.sh script. (See the run_smc.sh section below.) The rest are run periodically by a cron job on the uboonesmc account. The crontab file is kept in slowmoncon/apps/ubdaq-prod-smc_run/crontab.

EPICS db setup info -- make_db

The EPICS database setup is somewhat complicated, but almost all of it can be made by typing "make" in the slowmoncon/make_db directory, and then "make install" in the same directory. After doing this, it is also necessary to "make" in slowmoncon/apps/Wiener. The PMTHV database is not made this way; changes to it must be made by editing the files in slowmoncon/apps/PMTHV directly.

The make_db Makefile also makes a number of "PV Table Save" (.pvs) files in the make_db/pvs directory. Some are templates for saving current values, and some are useful for "hot install" of new alarm ranges without rebooting the IOC. See the SMC - Expert - hot alarm range install page.

Gui (Operator Panel Interface)

Editing the gui displays is mostly harmless, assuming you don't do something unreasonable like mislabeling a control.

Almost any trouble-shooting procedure we have for the Gui really belongs on the SLC - Troubleshooting page.

If we ever have some expert procedures, we'll note them here.

Archiver setup

CAUTION! Be very careful regarding the SQL database containing archived data CAUTION!

Overall, the setup follows the guidance of the CSS installer/maintainers manual, chapters 10. Relational Database and 11. Archive System.

Here are some key points:

  • The configuration table of channels to archive, and under what conditions, as well as the tables of archived values, are all kept in a PostgreSQL database hosted on ubdaq-prod-smc. The database is named slowmoncon_archive. The table names and schema are exactly as described in the above CSS reference.
  • Adding channels to the database requires doing the right "INSERT" command to the channels table in the slowmoncon_archive database.
  • Do not attempt to delete channels from the archive database. It is ok to disable them by changing their sampling mode, but deleting them is likely to mess up the channel_id numbers.
  • Do not use the CSS-provided tools for adding channels. Use the tool described below.

How to add a channel to the archiver, MicroBooNE style

In slowmoncon/apps/archiver_setup in our git repository, there is a file named add_channels.py. This is a utility for making SQL scripts to add channels to the archiver. It is intended to be safer than using the ArchiveConfigTool, mostly because an expert can check the SQL statements by hand before executing them.
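
Part of that by-hand check can be mechanised. The following is a minimal sketch and is not part of add_channels.py: it assumes the generated file is plain SQL with "--" line comments and ";" statement terminators, and the helper name risky_statements and the sample statements are made up for illustration. It lists every statement whose first keyword is not INSERT, so that anything like a DELETE stands out before psql is run. A flagged statement is only a prompt for closer reading, not necessarily an error.

```python
import re


def risky_statements(sql_text):
    """Return the statements whose first keyword is not INSERT.

    Assumes '--' line comments and ';' terminators. Semicolons or '--'
    inside quoted strings are not handled (this is only a sketch).
    """
    # Drop '--' comments so they are not mistaken for statements.
    no_comments = re.sub(r"--[^\n]*", "", sql_text)
    flagged = []
    for stmt in no_comments.split(";"):
        stmt = " ".join(stmt.split())  # collapse runs of whitespace
        if stmt and not stmt.upper().startswith("INSERT"):
            flagged.append(stmt)
    return flagged


# Hypothetical file contents, written by hand for this example.
sample = """
-- example only
INSERT INTO channel (name) VALUES ('a');
DELETE FROM channel WHERE name = 'b';
"""

for stmt in risky_statements(sample):
    print("CHECK BY HAND:", stmt)
```

This does not replace reading the file; it only narrows down where to look first.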

You first need to understand the archiver table schema in chapters 10 and 11 of the CSS manual (see reference above). In particular, you need to know that there are archiving "groups", and that there is a chan_grp table that assigns each group a number. Here is how to list the groups with their numbers using psql:

  $ psql -h smc-priv -U smcreader slowmoncon_archive
  Password for user smcreader: 
  psql (9.2.4)
  Type "help" for help.

  slowmoncon_archive=> select * from chan_grp;
   grp_id |   name    | eng_id |               descr                | enabling_chan_id 
  --------+-----------+--------+------------------------------------+------------------
        1 | Weather   |      1 | Weather Group                      |             
        2 | Beam      |      1 | Beam data group                    |             
        3 | Cryo      |      1 | Cryogenics data                    |             
        4 | Rack      |      1 | Rack monitoring data               |             
        7 | Power     |      1 | Power supplies                     |                 
        5 | DAQ       |      1 | DAQ monitoring data                |                 
        6 | Computers |      1 | Computer status data ("PCStatus")  |                 
        8 | ZMON      |      1 | Ground Isolation Impedance Monitor |                 
  (8 rows)

There are two possible ways to invoke add_channels.py:

    python add_channels.py (epics-db-file) (archiver-grp-id)
    python add_channels.py (csv-file)

There are two standard use cases for the add_channels.py tool:

  • Case 1: Add records found in an EPICS .db file to an existing group. Example of adding channels to group ID "1" (Weather):

       python add_channels.py  MyDbFile.db  1  >  add_my_pvs.sql
    
       # (inspect add_my_pvs.sql, possibly edit -- change smpl_mode, remove records not desired, etc.)
    
       psql < add_my_pvs.sql
      

    (The actual change is made by that last step, when the psql command is run.)
  • Case 2: Add records found in an EPICS .db file when there is NO existing group yet. Also useful if you prefer editing CSV files to SQL command files. Example of adding channels to group ID "9":

       python add_channels.py  MyDbFile.db 9  > /dev/null
    
       # the above will create a file MyDbFile.db.temp.csv.
       #
       # - manually edit MyDbFile.db.temp.csv
       # - If group 9 does not exist in the archiver yet, insert the following lines
       #   at the top of the .csv file:
       #     $TABLE,chan_grp
       #     grp_id,name,eng_id,descr,enabling_chan_id
       #     9,"MyNewGroupName",1,"My new archiver group",NULL
       #
       # You can also make other changes to the .csv file.
       # Then do the following
    
       python add_channels.py MyDbFile.db.temp.csv > add_my_pvs.sql
    
       # (inspect add_my_pvs.sql, possibly edit as above)
    
       psql < add_my_pvs.sql
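
The manual edit in Case 2 (inserting the chan_grp lines at the top of the .csv file) can be sketched as below. This is an illustration under assumptions, not part of add_channels.py: the helper name prepend_group_rows is hypothetical, the three header lines are copied from the comments above, and the placeholder string stands in for whatever MyDbFile.db.temp.csv actually contains.

```python
def prepend_group_rows(csv_text, grp_id, name, descr, eng_id=1):
    """Return csv_text with a chan_grp section inserted at the top.

    The three lines follow the layout quoted in Case 2 above.
    """
    header = [
        "$TABLE,chan_grp",
        "grp_id,name,eng_id,descr,enabling_chan_id",
        '{},"{}",{},"{}",NULL'.format(grp_id, name, eng_id, descr),
    ]
    return "\n".join(header) + "\n" + csv_text


# Placeholder for the real contents of MyDbFile.db.temp.csv.
existing = "(existing contents of MyDbFile.db.temp.csv)\n"

result = prepend_group_rows(existing, 9, "MyNewGroupName", "My new archiver group")
print(result)
```

The result would still go through add_channels.py and a by-hand inspection of the generated SQL before anything reaches psql.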
      

How to keep the archiver from receiving/saving too many samples

  • Adjust ADEL, the "archiver deadband". This will reduce the number of updates that the archiver receives, and is the preferred solution. (See the EPICS record reference manual: https://wiki-ext.aps.anl.gov/epics/index.php/RRM_3-14)
  • Change smpl_mode and related values in the archiver configuration table channel -- see archiver documentation. (And be careful!)
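
To see why raising ADEL cuts the sample count, here is a small simulation of the deadband rule. It is a sketch under stated assumptions: it models the usual record behaviour of posting an archive update only when the value has moved by more than ADEL since the last posted value, it always posts the first sample (a simplification), and the readings are invented numbers.

```python
def archive_updates(samples, adel):
    """Return the samples that would be posted to archive monitors.

    Rule modelled: post when |value - last posted value| > adel.
    The first sample is always posted, a simplification for this sketch.
    """
    posted = []
    last = None
    for value in samples:
        if last is None or abs(value - last) > adel:
            posted.append(value)
            last = value
    return posted


# Invented readings, e.g. a slowly drifting temperature.
readings = [20.0, 20.1, 20.4, 20.6, 21.2, 21.3]

print(len(archive_updates(readings, adel=0.0)))  # every change is posted
print(archive_updates(readings, adel=0.5))       # only larger moves are posted
```

With these numbers, ADEL = 0.5 keeps three of the six readings. The right value for a real channel depends on its noise level and on how much resolution the archived history needs.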

Alarm handler setup

In case of a need to rebuild the whole alarm server database, we have a MicroBooNE-specific tool that will build a useful alarm tree database from our database files. It can be found in slowmoncon/apps/setup_archiver. Please read the CSS ArchiveEngine documentation in the Control System Studio Guide and the comments at the top of the source code in setup_archiver carefully before using.

In more normal circumstances, we can add, subtract, and reconfigure channels through the GUI, after authenticating as someone with the right privileges. This is documented in the CSS built-in help, and there is an instructional video here: https://www.youtube.com/watch?v=fusDOZJ4hN0 -- the video just demonstrates deleting a channel from the alarm system, but the same technique can be used to add channels, add guidance to channels, etc. (Be careful: deleting a channel is an operation that can only be reversed by restoring a backup version of the alarm database.)

run_smc.sh

The script slowmoncon/apps/ubdaq-prod-smc_run/run_smc.sh will start any slowmoncon background program that should always be running, if it is not already running. This job is run every two minutes by cron (see crontab below), but it can be run at any time to restart something faster. Only execute run_smc.sh as the uboonesmc user on ubdaq-prod-smc. Executing it as any user on any other host, or as any other user on ubdaq-prod-smc, will result in duplicate copies of the Archiver, the AlarmServer, etc.
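
The "right user on the right host" rule can be expressed as a small guard. This is only a sketch of the idea, not something run_smc.sh is known to contain. The function names are hypothetical, and it covers the normal case in which ubdaq-prod-smc is up (the failover to smc2 mentioned elsewhere on this page is deliberately not modelled).

```python
import getpass
import socket

ALLOWED_USER = "uboonesmc"
ALLOWED_HOST = "ubdaq-prod-smc"


def may_run(user, host):
    """True only for the one account/host pair allowed to run run_smc.sh."""
    short_host = host.split(".")[0]  # also accept ubdaq-prod-smc.fnal.gov
    return user == ALLOWED_USER and short_host == ALLOWED_HOST


def current_identity():
    """Return (user, host) for the current session."""
    return getpass.getuser(), socket.gethostname()


# Explicit values so the example gives the same answer anywhere.
print(may_run("uboonesmc", "ubdaq-prod-smc.fnal.gov"))
print(may_run("uboonesmc", "ubdaq-prod-smc2"))
print(may_run("ubooneshift", "ubdaq-prod-smc"))
```

In real use one would call may_run(*current_identity()) and refuse to go on when it returns False.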

As of 2017/01/19, here is the list of programs the run_smc.sh keeps running:

  • Programs requiring database or Java Message System access:
    • AlarmServer
    • ArchiveEngine
    • JMS2RDB -- for log messages
    • alarm_messager -- sends annunciated alarms to slack
    • alarmsounder -- makes a sound on a web page if annunciated alarm happens, if someone has the web page up
  • Other programs:
    • AlarmSumsIoc -- IOC used for alarm sum calculation
    • ArPurityIoc -- IOC used for purity monitors (do we still need this?)
    • BeamDataIoc -- IOC used for beam data
    • CryoIoc -- IOC used for cryo data
    • GangliaIoc -- IOC used for ganglia data from DAQ and ComputerStatus
    • GangliaMonitorToEPICS -- monitors Ganglia and updates EPICS variables quickly
    • InFluxDBIoc -- IOC for CRT data
    • java-opc-client -- imports Cryo data to a text file
    • snmpIoc -- The IOC for the Wiener power supply channels and other soft channels
    • CaputServer -- for receiving values for ArPurity [no longer used]

All of these are currently run in "screen" sessions. (See man screen.)

You can get a list of all the processes running under screen by using "screen -ls" as uboonesmc@ubdaq-prod-smc.

In the event that ubdaq-prod-smc is down for some reason, this script will run just as well on smc2. Take care not to run the same processes on more than one host.

run_ws01.sh

The script slowmoncon/apps/ubdaq-prod-smc_run/run_ws01.sh is analogous to run_smc.sh but for starting any slowmoncon background program that should always be running on ws01. Currently the only such program is the PDU IOC.

In the event that ubdaq-prod-ws01 is down for some reason, this script will run just as well on ws02. Take care not to run the same processes on more than one host.

crontab

The correct crontab settings for the uboonesmc account on ubdaq-prod-smc are in slowmoncon/apps/ubdaq-prod-smc_run/crontab. This crontab should only be used for the uboonesmc account on ubdaq-prod-smc. Any other use will cause confusion from conflicting programs.

As of 2015/07/17, here is a list of things done by crontab:

  • script that keeps everything running that should keep running, once every 2 minutes
  • script that reads the weather data, once every 2 minutes
  • script that reads the beam data, once every 2 minutes
  • script that reads the cryo data from a text file in ~bcarls/javaOPCClientLArTF, once every 2 minutes
  • script that reads the ganglia data, once every minute. (This may be replaced by a continuously running server with very low latency in the future.)

switching slowmon processes to a different host

If the slowmon processes have to be switched to a different host for some reason, it is simply a matter of removing the crontab for user uboonesmc on smc (to keep processes from being restarted -- see crontab above), stopping all the running processes (see run_smc.sh above), and then installing the crontab under the uboonesmc account on some other machine. If the database is moved off of ubdaq-prod-smc too (e.g., if ubdaq-prod-smc has simply died), then see also "switching to a different master database server" below.

Removing crontab

On the computer on which you want to stop slowmon processes (e.g., ubdaq-prod-smc), do

uboonesmc@ubdaq-prod-smc $ crontab -r

Stopping slowmon processes

  • First, stop archiver cleanly by pointing any browser to http://ubdaq-prod-smc:5912/main. Then change "main" to "stop" in the previous URL. Then close that window so you don't accidentally reload and kill the archiver later.
  • Then as the uboonesmc user on ubdaq-prod-smc do
    uboonesmc@ubdaq-prod-smc $ screen -ls
    
  • Issue "kill" for each process number listed in the screen output. Then repeat "screen -ls" to confirm they are all down.
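
The process numbers come from the leading digits of each session entry in the "screen -ls" listing. The sketch below pulls them out of a listing. It assumes the common "PID.name" layout of the session lines, the function name is hypothetical, and the sample listing (PIDs and session names) is invented for illustration.

```python
import re

# A session line looks like: <whitespace>PID.name<whitespace>(state)
SESSION_RE = re.compile(r"^\s+(\d+)\.(\S+)\s")


def screen_sessions(listing):
    """Return (pid, session_name) pairs found in 'screen -ls' output."""
    sessions = []
    for line in listing.splitlines():
        match = SESSION_RE.match(line)
        if match:
            sessions.append((int(match.group(1)), match.group(2)))
    return sessions


# Invented listing; real PIDs and session names will differ.
sample_listing = (
    "There are screens on:\n"
    "\t12345.AlarmServer\t(Detached)\n"
    "\t12399.ArchiveEngine\t(Detached)\n"
    "2 Sockets in /var/run/screen/S-uboonesmc.\n"
)

for pid, name in screen_sessions(sample_listing):
    print(pid, name)
```

An empty result after the kills is the same confirmation as an empty "screen -ls".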

Starting up on a different server

Log into uboonesmc@(new host) and install the appropriate crontab there. E.g., if starting on near2, do

uboonesmc@ubdaq-prod-near2 $ crontab ~/slowmoncon/apps/ubdaq-prod-smc_run/crontab

Within 2 minutes the crontab should run the run_smc.sh command on that machine and start all processes. (You can run it manually if you're impatient, as described in the run_smc.sh section above.)

switching to a different master database server

There are three servers that need access to the SQL database. Each has a configuration file. All you have to do is replace the host name in each file. Go to the slowmoncon/apps directory in the uboonesmc account and edit these three files:

alarmserver_setup/alarmserver_settings.ini
archiver_setup/uboonedaq-smc_archiver.ini
JMS2RDB_setup/plugin_customization.ini

Each should have exactly one uncommented line with the string "url=jdbc:postgresql://smc-priv/slowmoncon_alarm" in it. Change the hostname following "//" in each file. Then restart the archiver, alarmserver, and JMS2RDB processes. The easiest way to do this is to do "screen -ls" to see the process numbers and then kill those processes. (However, there is a nicer way to stop the archiver, described in "Stopping slowmon processes" above.)
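
The hostname edit itself can be sketched as follows. This is an illustration rather than an existing tool: the function name is hypothetical, it assumes that commented-out lines start with "#", and the sample text is a cut-down stand-in for a real .ini file. The count it returns makes it easy to confirm the "exactly one uncommented line" expectation.

```python
import re

# Matches the scheme and the host part of a PostgreSQL JDBC URL.
URL_RE = re.compile(r"(jdbc:postgresql://)([^/:\s]+)")


def switch_db_host(ini_text, new_host):
    """Replace the host in uncommented jdbc:postgresql URLs.

    Returns (new_text, number_of_urls_changed). Lines whose first
    non-blank character is '#' are treated as comments (an assumption).
    """
    out, changed = [], 0
    for line in ini_text.splitlines(keepends=True):
        if not line.lstrip().startswith("#"):
            line, n = URL_RE.subn(lambda m: m.group(1) + new_host, line)
            changed += n
        out.append(line)
    return "".join(out), changed


# Cut-down stand-in for one of the three files.
sample_ini = (
    "#url=jdbc:postgresql://oldhost/slowmoncon_alarm\n"
    "url=jdbc:postgresql://smc-priv/slowmoncon_alarm\n"
)

new_text, changed = switch_db_host(sample_ini, "near2-priv")
print(new_text)
print("URLs changed:", changed)
```

If the count is anything other than 1 for a file, stop and look at that file by hand before restarting anything.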

See also elog entries 38903, 38928, and 38962.

Updating CSS to point to the right database host:

The CSS also needs some updating after doing the above changes. Once all is done, before restarting the CSS GUI, we need to edit the CSS preferences and change any reference to "smc-priv:61616" to the new database server (for example, "near2-priv:61616") and then restart the CSS GUI with correct settings. Without doing this the shifter will keep getting the "server timeout" errors due to not looking at the right JMS server. The changes should be made in "Preferences" under the CSS "Edit" menu.
(Note that it is hard to list all the settings a priori, but the main idea is to go through the list and change any URLs that have "postgresql" or "jms" in them. So what is given below is mainly for guidance.)
1. Under "CSS Core", under "Shared JMS Connection", change needed from OLD to NEW host
2. Under Trends, Data Browser
3. Change needed under "jms settings" under "alarm server" both for "RDB server" and "JMS server URL"
4. Annunciator and Message history
5. Note: no need to change the IP under the CSS Core address list as we want this to point to the whole subnet.
6. Look at other places as well.

After changing all the settings, cleanly exit CSS and restart it.

Saved plots history data problem

Whenever there is a database switch, the saved plot files (typically with a .plt extension) need some tweaking as well, since they remember the OLD database they pull their data from (even after CSS -> Edit -> Preferences has been updated with the correct database server URLs). This is a glitch in CSS. The symptom shows up after a database change: if you open a saved plot and try to go back in history, say to the last 10 days, no history data loads, because the plot is looking at the wrong database server. The simple fix is the following:
1. In CSS, for each saved plot, open the properties tab of the plot (if it is not already open)
2. Select "Traces" tab under it and select all traces listed there. You can do that by clicking on the first trace, scrolling down to the last trace, and shift-clicking to select all.
3. Right click on the selected traces and select “Use default data source”

Also note that if someone just tries to make a NEW plot of a variable after the database change, that will come out fine. The mentioned problem is only for saved plots.

Stopping/Starting slow controls processes before/after power outage

Here are the steps to systematically stop and start slow controls processes around a power outage.
Note: The following should be done ONLY by a Slow Controls expert. In an emergency, when no expert is available, the run coordinators should do it.

Stopping Slow Controls Processes before power outage:

Note: in the following steps, be sure to note the different accounts (ubooneshift vs uboonesmc) and hosts (ubdaq-prod-ws01 vs ubdaq-prod-smc) for each of the steps.

1. We bring the slow controls system down 30 minutes (or 1 hour) before the DAQ machines come down. So, we need to make a note of the DAQ shut down time to be prepared. Contact Run coordinator and make a note of this and coordinate with DAQ expert.
2. Take a screen shot of the Keithley current and pickoff point monitor settings. These get wiped out during a power outage.
3. Null out the crontab (that resides on the uboonesmc account on ubdaq-prod-smc) that restarts the slow controls processes every two minutes, so it doesn’t restart the processes.
Do “crontab -r” on ubdaq-prod-smc while logged in as uboonesmc.
4. Once step 3 is done, stop all individual slow controls processes running in “screen” (on the uboonesmc account on ubdaq-prod-smc) one after another.
To see what is running under screen: logged in as uboonesmc on ubdaq-prod-smc, do: screen -ls
4a. Reconnect to each screen process using screen -r.
For example, run screen -r Alarm Server, then stop the process cleanly: use “exit” if it is an EPICS process, or “ctrl-C” for other processes. It's ok to kill them with ctrl-C, but not with kill -KILL or an abrupt power-down.
5. Shut down CSS cleanly, i.e., the shared CSS GUI needs to be exited/closed by the shifter.
6. After step 5, log in as ubooneshift on ubdaq-prod-ws01 and kill the VNC using: "vncserver -kill :2"
7. Done!

Starting Slow Controls Processes after Power outage:

Note: in the following steps, be sure to note the different accounts (ubooneshift vs uboonesmc) and hosts (ubdaq-prod-ws01 vs ubdaq-prod-smc) for each of the steps.

1. Make sure DAQ systems and network racks are turned on and good.
2. Log in to ubdaq-prod-smc as uboonesmc and first reinstall the crontab: "crontab ~/slowmoncon/apps/ubdaq-prod-smc_run/crontab"
3. Once this is done, the crontab should automatically restart all slow controls processes in about 2 minutes.
4. Verify this using "screen -ls" (without the quotes) after 2 or 3 minutes.
5. Logged in as ubooneshift on ubdaq-prod-ws01, restart the vncserver using the “startVNC.sh” script as follows: source startVNC.sh
6. Follow the remaining instructions on the Slow Controls Guide wiki page for launching the slow controls screen. In case of problems, see SLC_-_Troubleshooting.
7. This should bring up the Shifter GUI. It might take a couple of minutes for things to come up and get green. The Slow Controls expert needs to work with the shifter to take a closer look at the alarms and make sure that the system is restored to the previous state (before the power outage).
8. Reset the settings on the Keithley pickoff and current monitor to those prior to the power outage.
9. Set the current monitor Keithley to one-shot mode and clear the error queue using the "*RST" and "*CLS" commands entered one at a time into the "Send command" box. Verify afterwards that the voltage auto range is still off and voltage range is 0.1 V. (See elog entries and 67187 for additional explanation.)
10. If the shifters see sustained magenta/red/yellow alarms (apart from those that are affected) after the power outage, immediately notify the Slow Controls expert, who will determine whether or not it is a slow controls failure.
11. Done! Beer o'clock!

CRTDAQ to EPICS connection: starting and restarting

This issue manifests as a pink "disconnected" status box in the CRT DAQ.

A process running under uboonepro account on ubdaq-prod-crtevb.fnal.gov transfers data from the InfluxDB monitoring in CRTDAQ to the EPICS IOC. This process should be automatically started by the following line in uboonepro's crontab:

*/30 * * * * /home/uboonepro/sc_daemon_start.sh>/dev/null 2>&1

This starts a process running python /home/uboonepro/.local/bin/sc2epicsdaemon.py. If that daemon process dies, the crontab line will restart it within 30 minutes. However, if the process hangs or otherwise fails, the sc_daemon_start.sh script will not restart it. If that happens, simply log in to ubdaq-prod-crtevb as uboonepro, use ps xw to find the process running python /home/uboonepro/.local/bin/sc2epicsdaemon.py start, kill that process, and then manually execute /home/uboonepro/sc_daemon_start.sh.
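
Finding the daemon in the "ps xw" output can be sketched as below. This is an illustration only: the function name is hypothetical, it assumes the default five-column layout (PID TTY STAT TIME COMMAND) with a header row, and the sample output (including the PIDs) is invented.

```python
def daemon_pids(ps_output, marker="sc2epicsdaemon.py"):
    """Return the PIDs of processes whose command line contains marker.

    Assumes 'ps xw' style output: one header row, then
    PID TTY STAT TIME COMMAND on each line.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 4)  # the command keeps its spaces
        if len(fields) == 5 and marker in fields[4]:
            pids.append(int(fields[0]))
    return pids


# Invented sample; real output will differ.
sample_ps = (
    "  PID TTY      STAT   TIME COMMAND\n"
    " 4321 ?        S      0:12 python /home/uboonepro/.local/bin/sc2epicsdaemon.py start\n"
    " 4400 pts/0    Ss     0:00 -bash\n"
)

print(daemon_pids(sample_ps))
```

Anything that mentions the script name on its command line (an editor, a grep) would also match, so check the full line before killing a PID.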

Note that in order to ssh as uboonepro as noted above, you must first have your username added to the appropriate k5login file. If you try to log in and receive the error Permission denied (gssapi-keyex,gssapi-with-mic)., then you're not yet on that login file. To have your username added, contact the current CRT expert and ask that they add you to the appropriate k5login file to access ubdaq-prod-crtevb as uboonepro. If the current CRT expert can't access this file either, first loudly complain to the RunCo's and other CRT experts for having an outdated list (again), then contact any of the people listed below as a backup. NOTE: This list is valid as of Mar. 5th, 2020, but will almost certainly need updating in the near future as people leave or change institutions.