- Table of contents
- Experts how-to
- DCS Introduction
- VNC to control room (by Filip)
- DCS Processes
- DCS Shortcuts on NOVA Shift Handbook
- Useful Information about PVs
- What to do if...
- Other experts
- Useful links
DCS stands for Detector Control Systems. It is the combined system that monitors and controls the pieces of the detector and lets them talk to each other. Each FEB, temperature sensor, and HV rack is connected to the DCS in some way.
The FD DCS is made up of three main subsystems (ACNET, EPICS CSS, and iFix), each using a different underlying set of software maintained by a separate group. As a DCS expert, it is important to understand what each of these systems does, how they work together, and who the developers are for each, so that when a DCS error occurs you know which system is at fault and whom to contact. The ACNET synoptic displays can be reached at:
- http://synoptic.fnal.gov/ or
- http://novadcs-far-master-01.fnal.gov:8000/synoptic/ (may need VPN if off site), or
- http://synoptic.fnal.gov/novafar for a display-only web view from off site (a quick view of the data without control/set capabilities).
The synoptic page is one of the monitors always running in the control room, and a DCS expert should be familiar with the synoptic pages and what is displayed on them. The ACNET server computers are novadcs-far-readout-0X where X is 1, 2, 3, or 4. There is also a data logger for viewing historical data (docdb 8900 or the Sensor-log Analysis expert's page here).
EPICS CSS (http://controlsystemstudio.github.io/)¶
Control System Studio (CSS) is a system developed off site, based on EPICS, and adapted to the NOvA experiment. It directly controls the FEBs and DCMs. CSS keeps track of things through Process Variables (PVs), which are updated as new information comes in. Some PVs come directly from the FEBs and DCMs, and others are copied from the ACNET system (and vice versa) as the different systems talk to each other. The main product of the CSS system is the APD Temperature Monitor, which is always running in the control room. It constantly monitors the temperature of each FEB, whether the TECs are on or off, and whether the cooling system is operating properly. This critical system still has some hiccups, and if there is a DCS problem it will often be here. The APD Temperature Monitor can be configured to run remotely with some work. The developers responsible for this system for NOvA are Gennadiy Lukhanin (email@example.com) and Kirk Bays (firstname.lastname@example.org). According to the process list below, FD CSS runs on novadcs-far-master-02 and NDOS CSS on novadcs-ctrl-master.
CSS was used for NDOS and the temperature display for the FD is similar (see docdb 6472 for manual). A DCS expert should be familiar with each page in the APD Temperature monitor. Setting it up to run off site is quite a chore, but at least visit the control room and see it and how each page works. It has a detector view, a diblock view, a DCM view, and a page with extra information for a particular FEB.
Each DCM is a computer that has a running program that reads in the data of each FEB on the DCM. This program is called a 'DCM IOC'. A global program called the 'Regional IOC' deals with diblock specific and detector wide information. A list of the PVs each handle as well as more information can be found at https://cdcvs.fnal.gov/redmine/projects/novadcs/wiki/CSS-EPICS_on_FD/. These IOC programs are critical for normal operations. If they fail, the CSS monitor will not get the information it needs, and we will be blind to the status of the hardware.
iFix is another Fermilab software system that runs on many experiments. It is mostly responsible for the details of the water cooling and dry gas system, monitoring every one of the many flow meters, dew point sensors, pressure sensors, and more. If any of these fail, logic exists that should automatically shut off the parts of the detector that are at risk and alert the experts. There is no iFix display in the control room, as this system generally just monitors things and only acts if there is a serious problem. Relevant information is also passed to ACNET and CSS and can be seen in their displays. The iFix program can easily be run remotely by using Windows Remote Desktop to connect to a dedicated computer in the Ash River control room, but an account is required. A DCS problem is usually not in the iFix system, so this subsystem can largely be ignored when troubleshooting. However, the extensive details in iFix can be very helpful for understanding the hardware layout and for determining whether an error reflects a real hardware failure at the FD site or just a software glitch. The important people working on the iFix system are Erik Voirin (email@example.com), who designed the NOvA iFix system and logic, and Mark Knapp (firstname.lastname@example.org), who administers the system and maintains the accounts.
Here is a list of the computers that host DCS processes. These are all located in the control room in the far detector hall at Ash River. They are 8-core PCs running Scientific Linux Fermi 6. When connecting from off site, append .fnal.gov to each host name.
novadcs-far-master-01: Major ACNET processes, synoptic server
novadcs-far-master-02: FD CSS, regional IOC
novadcs-ctrl-master: NDOS CSS
novadcs-readout-01: Some ACNET things, EPICS alarm handling
novadcs-readout-02: EPICS-ACNET bridge, ACNET data logger, database
novadcs-readout-03: Database backup, testing
VNC to control room (by Filip)¶
If you run Windows and would like to be able to connect to the VNC session running the DCS displays:
You need 3 things: port forwarding (a tunnel) on your machine, privileges to log in to the CR DAQ machine, and a VNC client.
I use PuTTY to log in and set up the tunnel, then I run the RealVNC viewer (or TightVNC viewer).
The DCS Synoptic cluster of screens actually runs LOCALLY on the DAQ-05 machine in the Fermilab control room.
All other sessions are actually VNC screens of sessions running on the Far Detector control room machines.
For some reason (unable to run Firefox separately), the VNC server is started on DAQ-05 and then we log on to that session from the same machine (weird, I know).
To access that machine, your principal (your username@FNAL.GOV) has to be listed in a config file. To get that, talk to the DAQ people.
Assuming you can use PuTTY with Kerberos (I use Network Identity Manager), try to log in to email@example.com.
Did it get through? Good.
Now to set the tunnel.
Go to the Connection section of your PuTTY session (it's always good practice to save those sessions...).
In SSH/Auth/GSSAPI I have both "Attempt GSSAPI authentication (SSH-2 only)" and "Allow GSSAPI credential delegation" enabled (I am not sure you need both, but it works this way...).
In X11, enable "Enable X11 forwarding".
In Tunnels, set a tunnel from source port 5985 to destination localhost:5955 (and click on "Add").
That last step created the tunnel.
Now let's set up the VNC client.
Download the RealVNC viewer from http://www.realvnc.com/download/ (sadly, they ask you for your e-mail...).
Launch the RealVNC viewer, connect to "localhost:5985", and enter the password (ask me for that personally).
Now your VNC client is connected to local port 5985, which PuTTY tunnels over SSH to port 5955 on the remote machine, where the VNC server is listening. The graphics and everything else travel through that tunnel to your VNC application.
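For Linux or Mac users, the same tunnel can probably be expressed with a plain OpenSSH command. The sketch below only prints that command, so nothing is run by accident. The user name and host are hypothetical placeholders (the real host is the CR DAQ machine you were given access to), and it assumes that the VNC traffic needs only the forwarded port, so X11 forwarding is left out.

```shell
#!/usr/bin/env bash
# Print the OpenSSH command that mirrors the PuTTY tunnel settings above.
# Arguments (both hypothetical placeholders): user name and gateway host.
vnc_tunnel_cmd() {
    local user="$1" host="$2"
    local local_port=5985    # port the VNC viewer connects to on your machine
    local remote_port=5955   # port the VNC server listens on at the far end
    # -K: GSSAPI (Kerberos) authentication with credential delegation
    # -N: do not run a remote command, only forward ports
    # -L: forward local_port to localhost:remote_port as seen from the host
    printf 'ssh -K -N -L %s:localhost:%s %s@%s\n' \
        "$local_port" "$remote_port" "$user" "$host"
}

# Example with placeholder values; replace them before use.
vnc_tunnel_cmd myuser daq-host.example
```

Once the printed command is running in a terminal, the VNC viewer connects to localhost:5985 exactly as in the PuTTY case.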
Please, test that so we can perk it to perfection.
Marianna can actually try to put together the Mac/Chicken how-to.
The Regional IOC handles diblock and detector wide variables. It is critical for proper operations. It runs on novadcs-far-master-02.
Restarting the FAR DET regIOC binary from a test release at the Far detector (by Genn).¶
1. Log in to the DCS machine (novadcs-far-master-02).
2. Reconnect to the screen session with
screen -R regioc2
or find the proper session name from the list of running screen sessions (for example with screen -ls).
3. Once in the screen session, execute the following commands:
source /home/novadcs/DAQOperationsTools/novadaq_setup.sh --opt
cd /home/novadcs/testRelForOperations/
srt_setup -a
cd $SRT_PRIVATE_CONTEXT/NovaDaqDcs/epics/iocBoot/iocregIOC_linux-x86_64
export DET=2
regIOC ./st-far-det.cmd
4. Detach from the screen session by pressing Ctrl-a followed by d (the default screen detach keys).
5. Exit the ssh session.
Restarting the NEAR DET regIOC binary from a test release at the near detector.¶
1. Log in to the DCS machine.
2. Reconnect to the screen session with
screen -R regIOC
or find the proper session name from the list of running screen sessions (for example with screen -ls).
2.1 If the regIOC screen session doesn't exist, start it with
screen -d -m -S regIOC
3. Once in the screen session, execute the following command
4. Detach from the screen session by pressing Ctrl-a followed by d (the default screen detach keys).
5. Exit the ssh session.
Each of the 84 DCMs runs a dcmIOC process that handles the PVs associated with the DCMs and FEBs. These processes are critical to proper operations.
If you suspect that something is wrong with one of the dcmIOC processes, you can check whether it is running by querying a PV that is handled by that DCM. For example, for FD diblock 1 DCM 5:
If you don't get back an array with 64 entries, something is wrong.
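The 64-entry check can be scripted. This is a minimal sketch, assuming that caget prints an array PV on one line in its default format, "<pv name> <count> <v1> <v2> ...". The helper names are made up for illustration, and the PV in the usage comment is an assumed example built from the naming rules in the PV section of this page, not necessarily the PV used in the original check.

```shell
#!/usr/bin/env bash
# Read one line of caget output on stdin and print the element count.
# Assumes the default array format: "<pv name> <count> <v1> <v2> ...".
caget_array_count() {
    awk 'NR == 1 { print $2 }'
}

# Succeed only if the line on stdin reports the expected 64 FEB entries.
# Empty input (for example, caget failed) counts as a failure.
has_64_entries() {
    [ "$(caget_array_count)" = "64" ]
}

# Intended use on a DCS machine (PV name is an assumed example):
#   caget dcm-2-01-05:febtemp_array | has_64_entries || echo "check dcmIOC"
```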
To restart a dcmIOC process, the shifter should know how, or you can do the following.
This example is for det-2-02-01.
DO NOT forget to export the detector (DET), diblock (DIB), and DCM index (NDX) variables so that the proper DCM is controlled.
Starting at novadcs-far-master-02:
source /nova/novadaq/setup/setup_novadaq_nt1.sh --opt
export DET=2 ; export DIB=02; export NDX=01
This is a process that logs EPICS PV values into a database. Before doing anything with the PVArchiver, get the Archive Engine summary. The web page for that is at:
This web page shows a detailed summary of the Archive Engine.
The interesting part is the channel count, which is normally 9216.
The count is shown in black if everything is normal and in red if something is wrong.
Also of interest is the Disconnected row, which shows any disconnected channels in red.
If fewer than 50% of the channels are disconnected, that is acceptable; keep monitoring.
After a restart it takes several hours to reconnect to all the monitored channels, so you do not want to restart too often.
A small percentage of disconnected channels (judge by experience) means that some channels have dropped off; they will reconnect automatically after a while.
Only when the disconnected fraction crosses 50% should you fix it, by following these steps:
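The 50% rule above is simple arithmetic, sketched here as two small helpers. The function names are hypothetical, the default total of 9216 is the usual channel count quoted above, and "crosses 50%" is interpreted as strictly more than half.

```shell
#!/usr/bin/env bash
# Integer percentage of disconnected channels (rounded down).
disconnected_percent() {
    local disconnected="$1" total="${2:-9216}"   # 9216 is the usual channel count
    echo $(( disconnected * 100 / total ))
}

# Print "restart" only when more than half of the channels are disconnected,
# otherwise "monitor".
archiver_action() {
    local disconnected="$1" total="${2:-9216}"
    if [ $(( disconnected * 2 )) -gt "$total" ]; then
        echo restart
    else
        echo monitor
    fi
}

archiver_action 300     # a small number of dropped channels: prints "monitor"
archiver_action 5000    # more than half of 9216: prints "restart"
```

The disconnected count itself still has to be read off the Archive Engine summary page by hand.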
1. Get a Kerberos ticket, log in to the DCS machine, and set up the DCS environment:
kinit
ssh -X firstname.lastname@example.org
setup_novadcs
After this one will see:
Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled.
2. Check the Archiver status
This will show: Checking for ArchiveEngine: running
If it is running, it is okay. If you are unlucky, it will show unused.
3. To fix this:
archiver.server stop
archiver.server start
DCS Shortcuts on NOVA Shift Handbook¶
NOVA power supply control through DCS for the FD¶
Manual for PDB channels checkout through DCS Display¶
On detector APD checkout procedure¶
Regulating APD bias voltages using voltage divider setting¶
Start Cooling APD Temperature Monitor¶
1. APD temperature monitor is in disconnected state (pink outline) on overview page.
2. Cooled APD goes into alarm state (Red FEB box on DCM display of APD temperature monitor)
ACNET data logger¶
This is a process that logs ACNET variables into a database.
Useful Information about PVs¶
How to read a PV¶
There are many PVs, and they are easy to read with the caget command. To do this, log into a DCS computer such as novadcs-far-master-02. This requires a Kerberos ticket and novadcs permissions, and you have to ssh to the DCS computers from another fnal.gov computer. If off site, start at nova01.fnal.gov, novatest01, or similar. Currently I (Kirk) have permission to log in as novadcs@ but cannot log on under my own user name. If you log in as novadcs, be careful: you have permission to do many things that could negatively affect operations. Once logged on, type
setup_novadcs
to set up the DCS environment. Now you can use caget (to read a PV) and caput (to set a PV; warning, use with care, as this can affect operations). You can also use camonitor to get a list of PVs that updates every 5 seconds (ctrl-c to quit).
For instance, to read the temperature of FEB 10 on DCM 02 of FD diblock 02, this would work:
caget dcm-2-02-02:10_febtemp
PV values can be single numbers or arrays.
List of important PVs¶
Many PVs start with det (detector), meaning it is a detector wide variable (regional IOC), or dcm, meaning it is a PV belonging to a particular DCM (DCM IOC).
Some definitions of variables (the names are as used in EPICS), which are all integers:
$(DET) means the detector number (one digit). 1 is ND, 2 is FD, 3 is NDOS (?).
$(DIB) means the diblock number (2 digits). On the FD it goes from 01 to 14 (though parts of the detector that are not yet installed may not be enabled) (?)
$(NDX) means the index of the DCM on the diblock (2 digits). On the FD it goes from 01 to 12.
$(FEB) means the FEB number (1 or 2 digits). It goes from 0 to 63. For some PVs that have :$(FEB) in them, the option exists to view all 64 FEB channels at once as an array by changing :$(FEB)_xxx to :xxx_array
dcm-$(DET)-$(DIB)-$(NDX):$(FEB)_febtemp - the temperature of the FEB (degrees C?) (febtemp_array)
dcm-$(DET)-$(DIB)-$(NDX):_feb_status - The state of all the FEBs in the DCM. Determines the DCM color in CSS. See possibilities below. Array.
dib-$(DET)-$(DIB):_dcm_status - similar to feb_status but for a DCM. Turns color if any FEB inside has a problem. Not an array.
dcm-$(DET)-$(DIB)-$(NDX):$(FEB)_alarm - if FEB is in alarm.
dcm-$(DET)-$(DIB)-$(NDX):_alarm - if DCM is in alarm, propagated from if FEB is in alarm.
dib-$(DET)-$(DIB):_alarm - if diblock is in alarm, propagated from DCM.
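The naming rules above (one-digit detector, two-digit diblock and DCM index, unpadded FEB number) are easy to get wrong by hand, so here is a small sketch that assembles the names. The function names are hypothetical, and the arguments are assumed to be plain integers without leading zeros.

```shell
#!/usr/bin/env bash
# Build "dcm-$(DET)-$(DIB)-$(NDX)" with a two-digit diblock and DCM index.
# Arguments: detector, diblock, DCM index (plain integers, no leading zeros).
dcm_prefix() {
    printf 'dcm-%d-%02d-%02d\n' "$1" "$2" "$3"
}

# FEB temperature PV for one FEB (0 to 63).
# Arguments: detector, diblock, DCM index, FEB number.
febtemp_pv() {
    printf '%s:%d_febtemp\n' "$(dcm_prefix "$1" "$2" "$3")" "$4"
}

# Diblock-level alarm PV. Arguments: detector, diblock.
dib_alarm_pv() {
    printf 'dib-%d-%02d:_alarm\n' "$1" "$2"
}

febtemp_pv 2 2 2 10   # FD, diblock 02, DCM 02, FEB 10
```

The last line prints dcm-2-02-02:10_febtemp, and the result can be passed straight to caget on a DCS machine.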
The int value of the _status variables goes from 1-8 in order of the legend in CSS.
So, for instance, for _feb_status:
1 - cooled, OK
2 - uncooled, OK
3 - alarm
4 - inactive
5 - no TECC
6 - disconnected
7 - dry air off, tecc off
8 - dry air on, tecc on
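When reading _feb_status values from the command line, a small lookup saves a trip to the CSS legend. This sketch simply encodes the table above; the function name is hypothetical, and any value outside 1 to 8 is reported as unknown.

```shell
#!/usr/bin/env bash
# Translate a _feb_status integer into the label used in the CSS legend.
feb_status_label() {
    case "$1" in
        1) echo "cooled, OK" ;;
        2) echo "uncooled, OK" ;;
        3) echo "alarm" ;;
        4) echo "inactive" ;;
        5) echo "no TECC" ;;
        6) echo "disconnected" ;;
        7) echo "dry air off, tecc off" ;;
        8) echo "dry air on, tecc on" ;;
        *) echo "unknown" ;;   # not in the legend
    esac
}

feb_status_label 3
```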
A comprehensive list of PVs is here: https://cdcvs.fnal.gov/redmine/projects/novadcs/wiki/CSS-EPICS_on_FD
What to do if...¶
The APD Temperature monitor shows 'disconnected'¶
First, check if there is a network error with the computer hosting the regional IOC at Ash River, or if the computer itself is down. To check this, simply see if you can log onto novadcs-far-master-02.
If that isn't the problem, then check if the regional IOC is running. This is frequently the issue. To check, simply do, as an example,
caget det-2:_da_flow_status_0
If there is a problem reading this PV, then the regional IOC is not running. In this case, restart the regional IOC.
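The regional IOC check can be wrapped in a small helper. This is a sketch that assumes caget exits with a non-zero status when the PV cannot be reached within the wait time given by -w; the function names are hypothetical, and the CAGET variable exists only so that the command can be swapped out for testing.

```shell
#!/usr/bin/env bash
# Return success if the PV can be read within the timeout (default 5 s).
# CAGET can be overridden for testing; it defaults to the EPICS caget tool.
pv_responds() {
    "${CAGET:-caget}" -w "${2:-5}" "$1" >/dev/null 2>&1
}

# Report whether the regional IOC answers, using the example PV from above.
regional_ioc_status() {
    if pv_responds det-2:_da_flow_status_0; then
        echo "regional IOC answers"
    else
        echo "regional IOC not answering"
    fi
}

# Intended use on a DCS machine after setup_novadcs:
#   regional_ioc_status
```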
If only one particular DCM is down, you can check if the DCM IOC for that DCM is running as well by checking if the PVs associated with that DCM are working correctly. The DCM IOCs should be running and can be queried even if the regional IOC is down. To restart a DCM that is down, see the shifter instructions https://cdcvs.fnal.gov/redmine/projects/novaoperations/wiki/NOvA_Shifts.
The FD APD Temperature monitor is being slow¶
Usually, just a restart will fix this.
The FD APD Temperature monitor is frozen or won't start¶
There are a few frequent issues that can cause problems like this.
First, there may be something wrong with the Kerberos connection. If the shifter tries to open CSS and nothing at all happens, this is likely the problem. Have the shifter check their ticket. If they open a terminal on their machine (novadaq-far-master-02) and try to log in to a DCS computer with ssh novadcs@novadcs-far-master-02, it should work. If there is a permission denied error, something is wrong with Kerberos, and if getting a new ticket doesn't solve it, they should ask Jaroslav or Peter.
Second, CSS may be frozen and refuse to stop. There is a desktop icon to 'stop CSS' that they should try first. Hopefully this kills it. If not, you can manually log in to novadcs-far-master-02 (as user novadcs; as a DCS expert, you hopefully have Kerberos permissions to do this) and kill any running instance of CSS. Run ps -ef | grep -i css and look at the running processes; you'll see things listed as:
<user name> <process ID> <ID#> <#> <date started> <something> <time running> <description>
The description will be very long, but will start with the path of the program running, like /nova/ups/epics... If it ends in 'ArchiveManager' you can ignore it, but if it says 'eclipse/css' then kill it with kill -9 <process ID>. Kill all of them like this, then have the shifter restart CSS.
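Picking the right process IDs out of the ps listing can be sketched as a filter. It assumes the ps -ef column layout shown above (process ID in the second column) and follows the rule just described: keep lines containing 'eclipse/css', skip anything mentioning ArchiveManager. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Read `ps -ef` output on stdin and print the PIDs (second column) of CSS
# GUI processes, skipping the ArchiveManager, which should be left alone.
css_pids() {
    awk '/eclipse\/css/ && !/ArchiveManager/ { print $2 }'
}

# Intended use on novadcs-far-master-02 (review the list before killing):
#   ps -ef | css_pids
#   kill -9 <each process ID printed above>
```

Printing the IDs first and killing them by hand keeps a human in the loop, which seems safer on a machine that also runs the regional IOC.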
If that STILL doesn't work, or if there is an error like 'workspace already in use', you can try cleaning the CSS metadata. Log onto novadcs-far-master-02, then type:
cssgui.server clean
and type Yes please! to confirm.
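If that STILL doesn't work, there is a final solution to the 'workspace in use' error. Have the shifter open a terminal, log onto novadcs-far-master-02, then have them change KRB5_USER_NAME from the standard novashift to anything else with a simple export:
export KRB5_USER_NAME=<any other name>
then run the script /home/novadcs/bin/start-CSS.sh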
If the problem is anything more significant than just having them restart, send an email to Athans and Kirk so we can keep track of these instances, or make a note of them and present them at the Thursday meeting.
The basic explanation of these solutions is as follows. The 'start CSS' icon on the shifter computer is just a shortcut to a script that logs into the DCS computer (novadcs-far-master-02) and runs another script that is basically 'cssgui.server start', while the 'stop CSS' icon does the same except with a 'cssgui.server stop'. If the Kerberos ticket isn't working the connection to the DCS computer won't work. If CSS won't start but Kerberos works, usually too many instances of CSS are running and it is confused, so one needs to kill them and/or clean the metadata with cssgui.server clean. Lastly, for the 'workspace in use' error, the instance of CSS is tied to the environment variable KRB5_USER_NAME. Each name has a unique workspace. As a last resort changing the KRB5_USER_NAME to anything else with a simple export can work if the standard (KRB5_USER_NAME=novashift) workspace is having trouble.
Besides the general on call phone list, here are some other experts for specific things:
DCS Computers expert:
1. Accessing the ACNET datalogger: http://nova-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=8900
2. DCS manual for NDOS: http://nova-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=6472
3. On Call Phone List: http://nova-docdb.fnal.gov:8080/cgi-bin/RetrieveFile?docid=8806;filename=NOvA_Phones-20130919.pdf;version=5
4. EPICS APD Monitor info: http://nova-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=8935
5. FD alarms and interlocks logic: http://nova-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=8916
6. FD dry air PVs and info: http://nova-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=7667
7. Shifter information: https://cdcvs.fnal.gov/redmine/projects/novaoperations/wiki/NOvA_Shifts
8. Offsite FNAL VPN: https://vpn.fnal.gov
9. Control Room Snapshot Server (VPN required): http://novadaq-ctrl-datamon.fnal.gov:8083/snapshot/ShowImageList.jsp
10. EPICS code (regional and DCM IOC): https://cdcvs.fnal.gov/cgi-bin/public-cvs/cvsweb-public.cgi/novacvs/Online/pkgs/NovaDaqDcs/epics/