Slow monitoring and control - Experts Only¶
- Table of contents
- Slow monitoring and control - Experts Only
- Background references
- IOC setup info
- Data import scripts
- EPICS db setup info -- make_db
- Gui (Operator Panel Interface)
- Archiver setup
- Alarm handler setup
- switching slowmon processes to a different host
- switching to a different master database server
- Stopping/Starting slow controls processes before/after power outage
- CRTDAQ to EPICS connection: starting and restarting
- VNC authentication running but rejecting all connections or failing on every authorization attempt
- This page contains information for experts and links to information for experts.
- Even if something on this page looks like a procedure, it most likely assumes additional knowledge and/or requires additional steps not included on this page.
- NOTE: As of early 2020, all slow controls processes running on ubdaq-prod-smc were moved to ubdaq-prod-smc2. Much of this documentation still refers to ubdaq-prod-smc, but you should replace that with ubdaq-prod-smc2.
- Some of the information here might be misleading or confusing to non-experts. Confusion, loss of time, lost data, or even damage might result if misused. (That said, most of this page is harmless.)
There are some non-expert trouble-shooting procedures: see SLC - Troubleshooting.
Background references¶
- Record reference manual: https://wiki-ext.aps.anl.gov/epics/index.php/RRM_3-14
- Device support, including SNMP, stream, save/restore: http://www.aps.anl.gov/epics/modules/soft.php
- Index of other documentation: http://www.aps.anl.gov/epics/docs/index.php
- Control System Studio (CSS)
IOC setup info¶
There are a number of different I/O controllers that serve EPICS channel data.
PMT HV controller¶
The PMT HV is now supplied by a Wiener power supply. Previously it was supplied by a VME-based BiRa system with its own VME controller. If you are interested in reading the old instructions about setup and troubleshooting of the BiRa system, see revision 26 or earlier of this wiki page.
IOC for Wiener supplies + soft channels¶
Multiple IOCs run on ubdaq-prod-smc that collect Wiener power supply data via SNMP and provide "soft" channels for various calculations and for data "pushed" from various scripts (see Data import scripts on this page).
The main setup for this is the EPICS "db" and "dbd" files defining the channels, plus a fairly standard IOC startup script to load them. See EPICS db setup info on this page, and also the startup script smc-uB.iocsh.
The IOCs are automatically started by the run_smc.sh script -- see the run_smc.sh section below.
In the event that ubdaq-prod-smc suffers a hardware failure, these IOCs and the run_smc.sh script will run just as well on ubdaq-prod-smc2.
IOC for Power Distribution Units¶
An IOC runs on ubdaq-prod-ws01 to collect data from the Power Distribution Units (PDUs) in the DAQ room. The communication is via SNMP. Network configuration currently prevents connecting to the PDUs from any computer other than ubdaq-prod-ws01, ubdaq-prod-ws02, or ubdaq-prod-ipmi.
The IOC is automatically started by the run_ws01.sh script -- see the run_ws01.sh section below.
In the event that ubdaq-prod-ws01 suffers a hardware failure, the IOC and the run_ws01.sh script will run just as well on ubdaq-prod-ws02.
Rackmon ("Glomation") boxes¶
- The Glomation boxes start their IOCs using a script start_ioc.sh in the home directory of the uboonedaq user.
- The script is executed as the uboonedaq user at boot via a crontab entry.
- It runs the IOC in a "screen" session.
- There is a file named RACK in the uboonedaq user home directory that says what rack the Glomation is in, using the rack short names used in EPICS variable names. (E.g., PM01 for purity monitor 1.)
Getting the software installed and updating Glomations with new software is sufficiently complicated that it has its own wiki page: SMC - Installing software on Glomations - Experts Only. That wiki page also has information about swapping rackmon boxes.
Data import scripts¶
There are several data import programs for various things. Almost all run on the uboonesmc account on ubdaq-prod-smc.fnal.gov. One of them, CaputServer, runs continuously and is started by the run_smc.sh script. (See the run_smc.sh section below.) The rest are run periodically by a cron job on the uboonesmc account. The crontab file is kept in slowmoncon/apps/ubdaq-prod-smc_run/crontab.
EPICS db setup info -- make_db¶
The EPICS database setup is somewhat complicated, but almost all of it can be made by typing "make" in the slowmoncon/make_db directory, and then "make install" in the same directory. After doing this, it is also necessary to "make" in slowmoncon/apps/Wiener. The PMTHV database is not made this way; changes to it must be made by editing the files in slowmoncon/apps/PMTHV directly.
The make_db Makefile also makes a number of "PV Table Save" (.pvs) files in the make_db/pvs directory. Some are templates for saving current values, and some are useful for "hot install" of new alarm ranges without rebooting the IOC. See the SMC - Expert - hot alarm range install page.
Gui (Operator Panel Interface)¶
Editing the gui displays is mostly harmless, assuming you don't do something unreasonable like mislabeling a control.
Almost any trouble-shooting procedure we have for the Gui really belongs on the SLC - Troubleshooting page.
If we ever have some expert procedures, we'll note them here.
Archiver setup¶
CAUTION! Be very careful regarding the SQL database containing archived data CAUTION!
Here are some key points:
- The configuration table of channels to archive, and under what conditions, as well as the tables of archived values, are all kept in a PostgreSQL database hosted on ubdaq-prod-smc. The database is named slowmoncon_archive. The table names and schema are exactly as described in the above CSS reference.
- Adding channels to the database requires doing the right "INSERT" command to the channels table in the
- Do not attempt to delete channels from the archive database. It is ok to disable them by changing their sampling mode, but deleting them is likely to mess up the
- Do not use the CSS-provided tools for adding channels. Use the tool described below.
How to add a channel to the archiver, MicroBooNE style¶
In slowmoncon/apps/archiver_setup in our git repository, there is a file named add_channels.py. This is a utility for making SQL scripts to add channels to the archiver. It is intended to be safer than using the ArchiveConfigTool, mostly because an expert can check the SQL statements by hand before executing them.
You first need to understand the archiver table schema in chapters 10 and 11 of the CSS manual (see reference above). In particular, you need to know that there are archiving "groups", and that there is a chan_grp table that assigns each group a number. Here is how to list the groups with their numbers using psql:
$ psql -h smc-priv -U smcreader slowmoncon_archive
Password for user smcreader:
psql (9.2.4)
Type "help" for help.

slowmoncon_archive=> select * from chan_grp;
 grp_id |   name    | eng_id |               descr                | enabling_chan_id
--------+-----------+--------+------------------------------------+------------------
      1 | Weather   |      1 | Weather Group                      |
      2 | Beam      |      1 | Beam data group                    |
      3 | Cryo      |      1 | Cryogenics data                    |
      4 | Rack      |      1 | Rack monitoring data               |
      7 | Power     |      1 | Power supplies                     |
      5 | DAQ       |      1 | DAQ monitoring data                |
      6 | Computers |      1 | Computer status data ("PCStatus")  |
      8 | ZMON      |      1 | Ground Isolation Impedance Monitor |
(8 rows)
There are two possible ways to invoke add_channels.py:
python add_channels.py (epics-db-file) (archiver-grp-id)
python add_channels.py (csv-file)
There are two standard use cases:
- Case 1: Add records found in an EPICS .db file to an existing group. Example of adding channels to group ID "1" (Weather):
python add_channels.py MyDbFile.db 1 > add_my_pvs.sql
# (inspect add_my_pvs.sql, possibly edit -- change smpl_mode, remove records not desired, etc.)
psql < add_my_pvs.sql
(The actual change is made by that last step, when the psql command is run.)
- Case 2: Add records found in an EPICS .db file when there is NO existing group yet. Also useful if you prefer editing CSV files to SQL command files. Example of adding channels to group ID "9":
python add_channels.py MyDbFile.db 9 > /dev/null
# the above will create a file MyDbFile.db.temp.csv.
#
# - manually edit MyDbFile.db.temp.csv
# - If group 9 does not exist in the archiver yet, insert the following lines
#   at the top of the .csv file:
#     $TABLE,chan_grp
#     grp_id,name,eng_id,descr,enabling_chan_id
#     9,"MyNewGroupName",1,"My new archiver group",NULL
#
# You can also make other changes to the .csv file.
# Then do the following
python add_channels.py MyDbFile.db.temp.csv > add_my_pvs.sql
# (inspect add_my_pvs.sql, possibly edit as above)
psql < add_my_pvs.sql
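The manual CSV edit in Case 2 (inserting the three chan_grp lines at the top of the file) could be scripted. The following is only a minimal sketch: the helper name prepend_group_rows is hypothetical and is not part of add_channels.py, and the three header lines are copied from the example above. It works on text in memory, so an expert can still inspect the result before writing it anywhere.

```python
def prepend_group_rows(csv_text, grp_id, name, descr, eng_id=1):
    """Return csv_text with a chan_grp table section inserted at the top.

    Hypothetical helper: it only reproduces the three lines shown in the
    Case 2 example and does not check whether the group already exists.
    """
    header = [
        "$TABLE,chan_grp",
        "grp_id,name,eng_id,descr,enabling_chan_id",
        '{0},"{1}",{2},"{3}",NULL'.format(grp_id, name, eng_id, descr),
    ]
    return "\n".join(header) + "\n" + csv_text


if __name__ == "__main__":
    # Placeholder standing in for the real contents of MyDbFile.db.temp.csv.
    original = "placeholder,rows,written,by,add_channels.py\n"
    print(prepend_group_rows(original, 9, "MyNewGroupName", "My new archiver group"))
```

The output would still be passed through add_channels.py and inspected by hand, exactly as in the procedure above.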
How to keep the archiver from receiving/saving too many samples¶
- Set ADEL, the "archiver deadband", on the record. This will reduce the number of updates that the archiver receives, and is the preferred solution. (See the EPICS record reference manual: https://wiki-ext.aps.anl.gov/epics/index.php/RRM_3-14)
- Change smpl_mode and related values in the archiver configuration table channel -- see archiver documentation. (And be careful!)
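To get a feel for how much an archiver deadband thins out the data, here is a simplified model of the ADEL behavior, not the EPICS implementation itself. It assumes that an archive update is posted only when the value differs from the last posted value by more than ADEL, and that the first sample is always posted; the function name and sample values are made up for illustration.

```python
def archive_updates(samples, adel):
    """Return the subset of samples that would be posted to the archiver.

    Simplified model: post a sample when it differs from the last posted
    value by more than adel. The first sample is always posted.
    """
    posted = []
    last = None
    for value in samples:
        if last is None or abs(value - last) > adel:
            posted.append(value)
            last = value
    return posted


if __name__ == "__main__":
    readings = [1.0, 1.05, 1.2, 1.25, 0.9]  # made-up readings
    print(archive_updates(readings, 0.1))   # 3 of the 5 samples are kept
```

Trying a few ADEL values against a representative stretch of real readings is a quick way to pick a deadband before touching the record.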
Alarm handler setup¶
In case of a need to rebuild the whole alarm server database, we have a MicroBooNE-specific tool that will build a useful alarm tree database from our database files. It can be found in slowmoncon/apps/setup_archiver. Please read the CSS ArchiveEngine documentation in the Control System Studio Guide and the comments at the top of the source code in setup_archiver carefully before using it.
In more normal circumstances, we can add, subtract, and reconfigure channels through the GUI, after authenticating as someone with the right privileges. This is documented in the CSS built-in help, and there is an instructional video here: https://www.youtube.com/watch?v=fusDOZJ4hN0 -- the video just demonstrates deleting a channel from the alarm system, but the same technique can be used to add channels, add guidance to channels, etc. (Be careful: deleting a channel is an operation that can only be reversed by restoring a backup version of the alarm database.)
run_smc.sh¶
The script slowmoncon/apps/ubdaq-prod-smc_run/run_smc.sh will start any slowmoncon background program that should always be running, if it is not already running. This job is run every two minutes by cron (see crontab below), but it can be run at any time to restart something faster. Only execute run_smc.sh as the uboonesmc user on ubdaq-prod-smc. Executing it as any user on any other host, or as any other user on ubdaq-prod-smc, will result in duplicate copies of the Archiver, the AlarmServer, etc.
As of 2017/01/19, here is the list of programs that run_smc.sh keeps running:
- Programs requiring database or Java Message System access:
- JMS2RDB -- for log messages
- alarm_messager -- sends annunciated alarms to slack
- alarmsounder -- makes a sound on a web page if annunciated alarm happens, if someone has the web page up
- Other programs:
- AlarmSumsIoc -- IOC used for alarm sum calculation
- ArPurityIoc -- IOC used for purity monitors (do we still need this?)
- BeamDataIoc -- IOC used for beam data
- CryoIoc -- IOC used for cryo data
- GangliaIoc -- IOC used for ganglia data from DAQ and ComputerStatus
- GangliaMonitorToEPICS -- monitors Ganglia and updates EPICS variables quickly
- InFluxDBIoc -- IOC for CRT data
- java-opc-client -- imports Cryo data to a text file
- snmpIoc -- The IOC for the Wiener power supply channels and other soft channels
- CaputServer -- for receiving values for ArPurity [no longer used]
All of these are currently run in "screen" sessions. (See man screen.)
You can get a list of all the processes running under screen by using "screen -ls" as uboonesmc@ubdaq-prod-smc.
In the event that ubdaq-prod-smc is down for some reason, this script will run just as well on smc2. Take care not to run the same processes on more than one host.
run_ws01.sh¶
The script slowmoncon/apps/ubdaq-prod-smc_run/run_ws01.sh is analogous to run_smc.sh but for starting any slowmoncon background program that should always be running on ws01. Currently the only such program is the PDU IOC.
In the event that ubdaq-prod-ws01 is down for some reason, this script will run just as well on ws02. Take care not to run the same processes on more than one host.
crontab¶
The correct crontab settings for the uboonesmc account on ubdaq-prod-smc are in slowmoncon/apps/ubdaq-prod-smc_run/crontab. This crontab should only be used for the uboonesmc account on ubdaq-prod-smc. Any other use will cause confusion from conflicting programs.
As of 2015/07/17, here is a list of things done by crontab:
- script that keeps everything running that should keep running, once every 2 minutes
- script that reads the weather data, once every 2 minutes
- script that reads the beam data, once every 2 minutes
- script that reads the cryo data from a text file in ~bcarls/javaOPCClientLArTF, once every 2 minutes
- script that reads the ganglia data, once every minute. (This may be replaced by a continuously running server with very low latency in the future.)
switching slowmon processes to a different host¶
If the slowmon processes have to be switched to a different host for some reason, it is simply a matter of removing the crontab for user uboonesmc on smc (to keep processes from being restarted -- see crontab above), stopping all the running processes (see run_smc.sh above), and then installing the crontab under the uboonesmc account on some other machine. If the database is moved off of ubdaq-prod-smc too (e.g., if ubdaq-prod-smc has simply died), then see also "switching to a different master database server" below.
On the computer on which you want to stop slowmon processes (e.g., ubdaq-prod-smc), do
uboonesmc@ubdaq-prod-smc $ crontab -r
Stopping slowmon processes
- First, stop archiver cleanly by pointing any browser to http://ubdaq-prod-smc:5912/main. Then change "main" to "stop" in the previous URL. Then close that window so you don't accidentally reload and kill the archiver later.
- Then as the uboonesmc user on ubdaq-prod-smc do
uboonesmc@ubdaq-prod-smc $ screen -ls
- Issue "kill" for each process number listed in the screen output. Then repeat "screen -ls" to confirm they are all down.
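When many sessions are running, it can help to pull the process numbers out of the "screen -ls" listing before issuing the kills. The sketch below only parses text and does not kill anything. The sample listing and its session names are hypothetical, and the regular expression assumes the usual "PID.name (state)" layout of GNU screen output.

```python
import re

# Matches lines such as "<tab>12345.snmpIoc<tab>(Detached)".
SESSION_RE = re.compile(r"^\s+(\d+)\.(\S+)\s+\(", re.MULTILINE)

# Hypothetical example of what "screen -ls" might print.
SAMPLE = (
    "There are screens on:\n"
    "\t12345.snmpIoc\t(Detached)\n"
    "\t12399.CryoIoc\t(Detached)\n"
    "2 Sockets in /var/run/screen/S-uboonesmc.\n"
)


def parse_screen_ls(output):
    """Return a list of (pid, session_name) tuples found in screen -ls output."""
    return [(int(pid), name) for pid, name in SESSION_RE.findall(output)]


if __name__ == "__main__":
    for pid, name in parse_screen_ls(SAMPLE):
        print(pid, name)
```

Printing the list first and killing by hand keeps a human in the loop, which matters here because the archiver should already have been stopped through its web interface.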
Starting up on a different server
Log into uboonesmc@(new host) and install the appropriate crontab there. E.g., if starting on near2, do
uboonesmc@ubdaq-prod-near2 $ crontab ~/slowmoncon/apps/ubdaq-prod-smc_run/crontab
Within 2 minutes the crontab should run the run_smc.sh command on that machine and start all processes. (You can run it manually if you're impatient, as described in the run_smc.sh section above.)
switching to a different master database server¶
There are three servers that need access to the SQL database. They each have a configuration file. All you have to do is replace the host name in each file. Go to the slowmoncon/apps directory in the uboonesmc account and edit these three files:
Each should have exactly one uncommented line with the string "url=jdbc:postgresql://smc-priv/slowmoncon_alarm" in it. Change the hostname following "//" in each file. Then restart the archiver, alarmserver, and JMS2RDB processes. The easiest way to do this is to do "screen -ls" to see the process numbers and then kill those processes. (However, there is a nicer way to stop the archiver described in "starting up on a different server" above.)
See also elog entries 38903, 38928, and 38962.
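The hostname change in each configuration file amounts to a one-line substitution. Here is a minimal sketch of that substitution on a single line of text. The function name is hypothetical, and the assumption that commented-out lines start with "#" has not been checked against the actual files, so review the result by hand before saving anything.

```python
import re

# Host part of a JDBC PostgreSQL URL: everything between "//" and the next "/".
JDBC_HOST_RE = re.compile(r"(url=jdbc:postgresql://)[^/\s]+")


def replace_jdbc_host(line, new_host):
    """Return line with the JDBC host replaced; commented lines are unchanged.

    Assumption: a leading "#" marks a commented-out line.
    """
    if line.lstrip().startswith("#"):
        return line
    return JDBC_HOST_RE.sub(lambda m: m.group(1) + new_host, line)


if __name__ == "__main__":
    old = "url=jdbc:postgresql://smc-priv/slowmoncon_alarm"
    print(replace_jdbc_host(old, "near2-priv"))
```

Only the host between "//" and the following "/" is touched, so the database name at the end of the URL stays as it was.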
Updating CSS to point to the right database host:¶
The CSS also needs some updating after doing the above changes. Once all is done, before restarting the CSS GUI, we need to edit the CSS preferences and change any reference to "smc-priv:61616" to the new database server (for example, "near2-priv:61616") and then restart the CSS GUI with correct settings. Without doing this the shifter will keep getting the "server timeout" errors due to not looking at the right JMS server. The changes should be made in "Preferences" under the CSS "Edit" menu.
(Note that it is hard to list all the settings a priori, but the main idea is to go through the list and change any URLs that have "postgresql" or "jms" in them. So, what is given below is mainly for guidance.)
1. Under "CSS Core", under "Shared JMS Connection", change needed from OLD to NEW host
2. Under Trends, Data Browser
3. Change needed under "jms settings" under "alarm server" both for "RDB server" and "JMS server URL"
4. Annunciator and Message history
5. Note: no need to change the IP under the CSS Core address list as we want this to point to the whole subnet.
6. Look at other places as well.
After changing all the settings, cleanly exit CSS and restart it.
Saved plots history data problem¶
Whenever there is a database switch, the saved plot files (typically with a .plt extension) need some tweaking as well, since they remember the OLD database they were pulling data from (even after updating CSS > Edit > Preferences with the correct database server URLs). This is a glitch in CSS. You can notice the problem after a database change: if you open a saved plot and try to go back in history, say to the last 10 days, it doesn't load any history data because it is looking at the wrong database server. The simple way to fix it is the following:
1. In CSS, for each saved plot, open the properties tab of the plot (if it is not already open)
2. Select "Traces" tab under it and select all traces listed there. You can do that by clicking on the first trace, scrolling down to the last trace, and shift-clicking to select all.
3. Right click on the selected traces and select “Use default data source”
Also note that if someone just tries to make a NEW plot of a variable after the database change, that will come out fine. The mentioned problem is only for saved plots.
Stopping/Starting slow controls processes before/after power outage¶
Here are the steps to systematically stop and start slow controls processes around a power outage.
Note: The following should be done ONLY by a Slow Controls expert. In an emergency, when no expert is available, the run coordinators should do it.
Stopping Slow Controls Processes before power outage:¶
Note: in the following steps, be sure to note the different accounts (ubooneshift vs uboonesmc) and hosts (ubdaq-prod-ws01 vs ubdaq-prod-smc) for each of the steps.
Also note: In early 2020, SLC processes running on ubdaq-prod-smc were moved to ubdaq-prod-smc2. If this changes again in the future, replace the following instances of ubdaq-prod-smc2 with ubdaq-prod-smc.
1. We bring the slow controls system down 30 minutes (or 1 hour) before the DAQ machines come down. So, we need to make a note of the DAQ shut down time to be prepared. Contact Run coordinator and make a note of this and coordinate with DAQ expert.
2. Take a screen shot of the Keithley current and pickoff point monitor settings. These get wiped out during a power outage. To find these panels, navigate to
TPCDrift_HV.opi and click on the two orange boxes on the right labeled
3. Null out the crontab (that resides on the uboonesmc account on ubdaq-prod-smc2) that restarts the slow controls processes every two minutes, so it doesn’t restart the processes.
Do “crontab -r” on ubdaq-prod-smc2 while logged in as uboonesmc.
4. Once step 3 is done, stop all individual slow controls processes running in "screen" (on the uboonesmc account on ubdaq-prod-smc2) one after another.
To see what is running under screen, log in as uboonesmc on ubdaq-prod-smc2 and do:
screen -ls
4a. Reconnect to each screen process using screen -r <PID> (where <PID> is listed when doing screen -ls) and stop it cleanly: use "exit" if it is an EPICS process, or "ctrl-C" for other processes. It's OK to kill them with ctrl-C, but not with kill -KILL or an abrupt power-down.
5. Shut down CSS cleanly, i.e., the shared CSS GUI needs to be exited/closed by the shifter.
6. After step 4, log in as ubooneshift on ubdaq-prod-ws01 and kill the VNC server using: "vncserver -kill :2"
Starting Slow Controls Processes after Power outage:¶
Note: in the following steps, be sure to note the different accounts (ubooneshift vs uboonesmc) and hosts (ubdaq-prod-ws01 vs ubdaq-prod-smc2) for each of the steps.
1. Make sure DAQ systems and network racks are turned on and good.
2. Log in to ubdaq-prod-smc2 as uboonesmc and first reinstall the crontab: "crontab ~/slowmoncon/apps/ubdaq-prod-smc_run/crontab"
3. Once this is done, the crontab should automatically restart all slow controls processes in about 2 minutes
4. Verify this using "screen -ls" (without the quotes) after 2 or 3 minutes
5. Logged in as ubooneshift on ubdaq-prod-ws01, restart the vncserver using the “startVNC.sh” script as follows:
6. Follow the remaining instructions on the Slow Controls Guide wiki page for launching the slow controls screen. In case of problems, see SLC_-_Troubleshooting.
7. This should bring up the Shifter GUI. It might take a couple of minutes for things to come up and turn green. The Slow Controls expert needs to work with the shifter to take a closer look at the alarms and make sure that the system is restored to the previous state (before the power outage).
8. Referring to the screenshots you took earlier, reset the settings on the Keithley pickoff and current monitor to those prior to the power outage. Note that you'll need to set the "control permit time" box to some positive value (e.g. 1.0) to edit the settings. Remember to set it back to 0.0 when you're done.
9. Enter the "*RST" and "*CLS" commands one at a time into the "Send command" box for the current monitor Keithley. This resets the Keithley to single measurement ("one-shot") mode and clears the error queue. Verify afterwards that the voltage auto range is off (0) and voltage range is 0.1 V. (See elog entry 67187 for additional explanation.)
10. If the shifters see sustained magenta/red/yellow alarms (apart from those that are affected) after the power outage, they should immediately notify the Slow Controls expert, who will determine whether it is a slow controls failure.
11. Done! Beer o'clock!
CRTDAQ to EPICS connection: starting and restarting¶
This issue manifests as a pink "disconnected" status box in the CRT DAQ.
A process running under the uboonepro account on ubdaq-prod-crtevb.fnal.gov transfers data from the InfluxDB monitoring in CRTDAQ to the EPICS IOC. This process should be automatically started by the following line in uboonepro's crontab:
*/30 * * * * /home/uboonepro/sc_daemon_start.sh>/dev/null 2>&1
This starts a process running python /home/uboonepro/.local/bin/sc2epicsdaemon.py. If that daemon process dies, the crontab line will restart it within 30 minutes. However, if the process hangs or otherwise fails, the sc_daemon_start.sh script will not restart it. If that happens, simply log in as uboonepro on ubdaq-prod-crtevb.fnal.gov, use ps xw to find the process running
python /home/uboonepro/.local/bin/sc2epicsdaemon.py start, kill that process, and then manually execute
Note that in order to ssh to ubdaq-prod-crtevb as uboonepro as noted above, you must first have your username added to the appropriate k5login file. If you try to log in and receive the error "Permission denied (gssapi-keyex,gssapi-with-mic).", then you're not yet in that file. To have your username added, contact the current CRT expert and ask that they add you to the appropriate k5login file to access ubdaq-prod-crtevb as uboonepro. If the current CRT expert can't access this file either, first loudly complain to the RunCo's and other CRT experts for having an outdated list (again), then contact any of the people listed below as a backup. NOTE: This list is valid as of Mar. 5th, 2020, but will almost certainly need updating in the near future as people leave or change institutions.
- Mike Kirby firstname.lastname@example.org
- Wes Ketchum email@example.com
- Thomas Mettler firstname.lastname@example.org
- Rui An email@example.com
VNC authentication running but rejecting all connections or failing on every authorization attempt¶
If the VNC server is running but no one can log in, check the tail of the log in /home/ubooneshift/.vnc/ubdaq-prod-ws01.fnal.gov\:2.log for messages like this one:
Tue Mar 24 14:54:25 2020 Connections: blacklisted: 127.0.0.1
This indicates that the vnc server received 5 or more failed password attempts in a row. If it only received exactly 5 failed attempts, then the problem will clear in 10 seconds; but if it gets more than 5 failed attempts during the blacklist period then it will take 10*(2**n) seconds, where n is the number of failures after the 5th. If no one has been able to connect for a while, it is likely n is large enough that you don't want to wait that long. In this case, you should stop the VNC gui and restart the vnc server as follows:
- first get the process numbers of the CSS gui. Example:
$ ps uxw|grep css
49642 302044 0.0 0.0 106236 924 pts/13 S Mar09 0:00 /uboonenew/epics_css/v3_3_10a_nsls2//css-nsls2 -pluginCustomization /uboonenew/epics_css/v3_3_10a_nsls2//CSS_plugin_customization.ini -data /home/ubooneshift/.ControlSystemStudio/krb5--as-ubooneshift-on-ubdaq-prod-ws01/CSS
49642 302045 4.0 3.6 11534380 854956 pts/13 Sl Mar09 848:56 /usr/bin/java -Xmx1024m -Xms128m -XX:MaxPermSize=128M -jar /uboonenew/epics_css/v3_3_10a_nsls2//plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar -os linux -ws gtk -arch x86_64 -showsplash -launcher /uboonenew/epics_css/v3_3_10a_nsls2/css-nsls2 -name Css-nsls2 --launcher.library /uboonenew/epics_css/v3_3_10a_nsls2//plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.100.v20110505/eclipse_1407.so -startup /uboonenew/epics_css/v3_3_10a_nsls2//plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar --launcher.overrideVmargs -exitdata 1b9002b - -pluginCustomization /uboonenew/epics_css/v3_3_10a_nsls2//CSS_plugin_customization.ini -data /home/ubooneshift/.ControlSystemStudio/krb5--as-ubooneshift-on-ubdaq-prod-ws01/CSS -vm /usr/bin/java -vmargs -Xmx1024m -Xms128m -XX:MaxPermSize=128M -jar /uboonenew/epics_css/v3_3_10a_nsls2//plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
49642 362095 0.0 0.0 103328 852 pts/6 S+ 10:13 0:00 grep css
- Then use the Linux "kill" command to kill first the child (302045 above) and then, once the child is gone, the parent (302044).
- After that, you can kill the vncserver using
vncserver -kill :2
(note only one dash on the option)
- Finally just restart vncserver from the bare terminal command line and then start CSS gui as usual.
[ubooneshift@ubdaq-prod-ws01 ~]$ ./startVNC.sh
All the rigamarole of killing the CSS processes that way was to try to give it a chance to save its state cleanly. Rarely we will get some sort of corruption that will make the CSS gui freeze on starting.
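To judge whether waiting out the blacklist is practical, the wait time described above, 10*(2**n) seconds with n the number of failures after the 5th, can be tabulated. This sketch simply evaluates the formula as stated on this page; it has not been checked against the VNC server source, and the function name is made up.

```python
def blacklist_wait_seconds(n):
    """Wait time in seconds after n failed attempts beyond the 5th.

    Uses the 10 * 2**n rule quoted in the text above.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return 10 * 2 ** n


if __name__ == "__main__":
    for n in (0, 3, 6, 10):
        seconds = blacklist_wait_seconds(n)
        print("n=%d: %d s (about %.1f min)" % (n, seconds, seconds / 60.0))
```

Because the wait doubles with each extra failure, ten further failed attempts already mean a wait of several hours, which is why restarting the VNC server is usually the faster fix.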