Previous Page What to do while on shift How To ...

This page is obsolete. Do not follow these instructions. L.Suter April 3rd 2015
Links to new instructions can be found here https://cdcvs.fnal.gov/redmine/projects/novaoperations/wiki

DAQ Trouble Shooting Guide

Updated 2013-08-02

Problem Symptom Cause Solution
"Resource Discovery" fails
Symptom: When attempting to start a new run, resources are missing when you move to the "Select Resources" stage.
Cause: The resource manager is holding the resources in an unused (or irredeemably crashed) partition.
Solution: Look for a partition tab in the resource manager. Right-click on the partition tab and select "release partition". Then, from RCMainWindow, select "Rediscover Resources" and "Select Resources". The hardware should now be available. NOTE: MAKE SURE THAT THE PARTITION YOU ARE RELEASING IS NOT IN USE BY SOMEONE ELSE. Call the DAQ Expert if problems persist.
"Hardware Configure" fails
Symptom: Hardware config appears to hang for more than a minute at more than 90% complete, with no error.
Cause: One of the DCMs probably did not receive its configuration message.
Solution: Hit "Abort" on the progress popup. If the "Configure Hardware" Run Control button turns green, try again. This will usually result in the remaining DCM configuring properly.

Symptom: At the "Configure Hardware" stage of starting a run, one or more DCMs report a timing-link error.
Cause: Either the master or slave TDU may be in a bad state.
Solution: Contact the DAQ (Timing) expert.

Symptom: SERDES lock error on one or more FEBs.
Cause: Attempting to configure broken or unplugged FEBs.
Solution: Make sure a correct channel mask is being applied (see /nova/config/run_history).
Run starts, but few or no good events
Symptom: DCMs complain about being unable to read mmap, and/or DCMs configure 0 FEBs.
Cause: The database has FEB and/or Pixel masks that effectively turn off an entire DCM, probably due to a database load problem.
Solution: Try loading from the command line.

Symptom: Starting a run just does not work. Problems persist after multiple tries and the CPU load on (at least) one of the DCMs is very high. You can check the CPU load for user processes on DCMs in Ganglia.
Cause: Maybe more than one dcm-app is running on a DCM.
Solution: Log into that DCM and kill the dcm-apps, or reboot just the problem DCM (example for DCM 1-1):
> stopRunControl.sh
> stopSystem
> reboot-dcm-1-1 && sleep 120
> check-dcm-all
or if not good enough, all the DCMs:
> reboot-dcm-all && sleep 120
> check-dcm-all
or, if still not good enough, power cycle the problem dcm (example of 1-1):
> power-off-1-1 && sleep 30
> power-on-1-1 && sleep 120
> check-dcm-all

or finally, power cycle ALL the dcms:
> power-off-instrumented && sleep 30
> power-on-instrumented && sleep 120
> check-dcm-all
Once you finally pass checkSystem, start over by restarting the DAQ and the run:
> startSystem
> checkSystem
> startRunControl.sh
Symptom: You followed the above procedures, but runs are still crashing shortly after starting.
Cause: Possible TDU problem.
Solution: Follow the same procedures as above, but also reboot the TDU.
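The recovery sequences above share one shape: run a fixed list of commands in order, and continue only while each one succeeds. As a minimal sketch of that pattern, the hypothetical helper `run_steps` below (not part of the DAQ tools) stops at the first failing step. It assumes the site commands named on this page return a non-zero exit status on failure, which this page does not confirm.

```shell
# Hypothetical helper: run each argument as one command line, in order,
# and stop at the first step that exits non-zero.
run_steps() {
    for step in "$@"; do
        echo "running: $step"
        # Unquoted on purpose: the shell splits the step into command + arguments.
        $step || { echo "FAILED: $step" >&2; return 1; }
    done
}

# Example mirroring the single-DCM reboot listed above (commands as named on this page):
# run_steps "stopRunControl.sh" "stopSystem" "reboot-dcm-1-1" "sleep 120" "check-dcm-all"
```

Because each step is split on whitespace, this sketch only suits simple commands without quoted arguments, pipes, or redirections.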
Run in progress fails
Symptom: Data appears to have stopped flowing.
Cause: Some component has gotten irredeemably confused.
Solution: FIRST, stop the run by clicking on the "End Run" button in the RC GUI. THEN:
> stopRunControl.sh
> restartSystem
> checkSystem
> startRunControl.sh
See "Start a data taking run" for fuller details.
Trigger Problems
Symptom: DAQ running, but no beam spills seen (EavesDropper window not updating with new spills; no NuMI spill seen roughly every third beep in rcMainWindow).
It's possible that the EavesDropper messaging connection is stale, in which case EavesDropper needs to be restarted, too. Since the messaging system is stopped and restarted whenever startSystem, stopSystem, or restartSystem is run, this will be the case after any of these actions.

Check that beam is running (you should see "3"s on channel 13 and hear beeps). Ask the MINOS or Minerva shifter if in doubt, or look at the MINOS "big green button", where "Time since last NuMI kicker fire - $A9 signal" is the time since the last NuMI spill.
If beam is running, but no beam spills are being served, you need to restart the Spill Server.
In any window logged into the DAQ cluster (e.g., novadaq-ctrl-master.fnal.gov), and with the DAQ environment set up:
> startBeamSpills.sh

To restart the EavesDropper, issue a ^c in the terminal window that is running it. Then restart it, which is usually possible by just hitting the up-arrow once and hitting return. If need be, you can type:
> NssSpillReceiverEavesDropper
Symptom: Beam spill trigger data is empty for both NuMI and Booster spills.
Cause: There are two different Master Timing Distribution Units (TDUs), which are located in physically different places. If one of the TDUs becomes confused over what time it is (e.g. loses GPS lock during a reinitialization), it can think it is a random date like 2015-Jan-1 or 2025-July-4 or... The system needs to have its GPS manually reset.
Solution: Follow the instructions here to reset the GPS lock.
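To make the "random date" symptom concrete, here is a minimal sketch of a sanity check on a TDU-reported date. The helper name `tdu_year_plausible` is hypothetical, and how the date is read from the TDU is not covered on this page; the sketch only assumes a year-first string in the style of the examples above (such as 2015-Jan-1) and compares its year with the local clock.

```shell
# Hypothetical helper: judge whether a TDU-reported date looks sane.
# Input is a year-first string such as "2015-Jan-1"; only the year is checked.
# Returns 0 if the year equals the local clock's year, 1 if it differs,
# 2 if no numeric year could be extracted.
tdu_year_plausible() {
    reported_year=${1%%-*}            # everything before the first '-'
    case "$reported_year" in
        ''|*[!0-9]*) return 2 ;;      # empty or non-numeric
    esac
    [ "$reported_year" -eq "$(date +%Y)" ]
}

# Example: tdu_year_plausible "1776-July-4" || echo "TDU date looks wrong"
```

Comparing only the year is deliberately coarse: it catches the wildly wrong dates described here, but it can raise a false alarm around New Year if the two clocks use different time zones.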
Other
Symptom: DAQApplicationManager shows many processes as "yellow" (running but not responding), and the system otherwise appears normal.
Cause: Possibly a stale connection to the DDS Message Servers.
Solution: Restart DAQApplicationManager from the file/restart menu.

Symptom: Ganglia shows an anomalously high microslice rate on a DCM (expect 20 kHz at the moment).
Cause: Some FEBs have lost their sync.
Solution:
> stopRunControl.sh
> restartSystem
> checkSystem
> startRunControl.sh
"The Nuclear Option"; What to try if all else fails and before contacting experts late at night Other solutions don't resolve persistent problems.
From a terminal window logged into novadaq-ctrl-master:
> stopRunControl.sh
> stopSystem
> stopDDSEverywhere.sh

TDU control interface window: file->exit
DAQApplication Manager window: file->exit
Everything is stopped at this point.
Power cycle the DCMs.
Double-click on the Kerberos (key) icon.
From any terminal window logged into novadaq-ctrl-master, restart DDS:
> startDDSEverywhere.sh

From the novadaq-ctrl-master where the TDU Control Interface was running, restart the TDU Control Interface:
> TDUControl -p -r

Double-click on the DAQApp Manager icon.
From another novadaq-ctrl-master console:
> startDDSEverywhere.sh
> startSystem
> startRunControl.sh

Start a run from the Run Control GUI window.
Symptom: My TDU thinks it is some crazy date (2015-Jan-1, 1776-July-4....).
Cause: The GPS has lost sync and needs to be reset.
Solution: Follow the instructions here

Symptom: The event display and/or the online monitor update slowly, or are hard to interact with.
Cause: The problem could be the memory viewer.
Solution: Reduce the width to 6 columns and increase the update period to 250 ms.

Symptom: Can't log in to a DCM via the network; it hangs or is unreachable.
Cause: Various.
Solution: View the DCM console port to check error messages. Try this: use the icon on the desktop on nova-02 for accessing the consoles on the DCMs. To access the consoles, just double-click the icon. You will be presented with a screen session that has them all listed and that lets you switch back and forth between them at will. All of the consoles are in logging mode, so they run continuously even if no one is connected. In the event of a failure, just double-click the icon and go to the DCM in question to see what happened. Alternatively, if you want to connect directly, use: ssh -t /home/novadaq/dcm_consoles and you will connect to the session. Currently there is no limit on the number of simultaneous connections this supports.
Symptom: Can't start PedestalDataRunner; it shows the following message:
Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
X11 connection rejected because of wrong authentication.
pedestaldatarunner: Fatal IO error: client killed
Cause: The /home area is full.
Solution: Call the DAQ Expert, who will need to free some space in the /home area.
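Before calling the expert, it may help to confirm how full the filesystem actually is. The sketch below uses the standard `df -P` output; the helper names `fs_used_percent` and `warn_if_full` and the default 95% threshold are illustrative assumptions, not part of the DAQ tools.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem holding a path.
fs_used_percent() {
    # df -P prints one POSIX-format line per filesystem; column 5 is the capacity in percent.
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Warn (and return 1) when usage is at or above a threshold; default 95 is an arbitrary choice.
warn_if_full() {
    used=$(fs_used_percent "$1")
    limit=${2:-95}
    [ -n "$used" ] || return 2        # df produced nothing usable
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $1 is ${used}% full"
        return 1
    fi
    echo "$1 is ${used}% full"
}

# Example: warn_if_full /home
```

The column parsing assumes a mount point without spaces in its name, which holds for /home.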
Symptom: Can't start either OnMon or EventDisplay. The window pops up and then immediately disappears.
Cause: The Kerberos ticket has expired.
Solution: Renew the Kerberos ticket on the machine running OnMon and the EVD (click on the key icon on the toolbar).
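An expired ticket can also be checked from a terminal. The sketch below assumes the machine has the standard Kerberos client tools, where `klist -s` prints nothing and exits non-zero if the credential cache is missing or expired. The helper name `have_valid_ticket` and the `KLIST` override variable are hypothetical, added only so the check can be exercised without a Kerberos installation.

```shell
# Return 0 if a valid (unexpired) Kerberos credential cache exists, non-zero otherwise.
# KLIST is a hypothetical override for the klist binary, used for testing the sketch.
have_valid_ticket() {
    "${KLIST:-klist}" -s
}

# Example:
# if have_valid_ticket; then echo "ticket OK"; else echo "no valid ticket, renew it"; fi
```

This only reports the ticket state; renewing it still goes through the key icon described above.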