ShiftBulletinBoard » History » Version 551
Shift Bulletin Board¶
Matt Strait is acting Run Coordinator until Monday October 10, 8am. He is reachable at 630-840-4031 (office), 612-501-2520 (cell), or 630-377-0063 (home).
Make sure to read these before every shift.
Far Detector DiskWatcher¶
It is OK if StatusHeartbeat messages in the FD DiskWatcher (novacr02) update only about every 20 minutes.
Ignore errors about FEB 15 on DCM-1-02-04. No APD will be plugged in there until the afternoon of Oct 6, or possibly later.
On Saturday Oct 8 and Saturday Oct 15, there will be firefighter drills in and around Wilson Hall. Don't be alarmed by uniformed personnel running around the building. We don't expect the alarms to go off.
Ganglia running on farm-46 beginning Mar 4 2016:¶
DAQ Ganglia monitoring systems are operating in a degraded state. Many of the checklist plots are produced by Ganglia, which may be down.
Ganglia web pages: we are currently using a backup machine for Ganglia: https://novadaq-far-farm-46.fnal.gov/ganglia/. The first time you open it in a web browser, you need to accept the 'risky' certificate warning and allow the connection. You will then be prompted for authentication: the login is 'nova' and the password is the OLD nova password (without 'nu' at the beginning).
- Some links will be broken. If you change novadaq-far-daqmon to novadaq-far-farm-46 in the link, they should work!
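The host-name substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_backup_ganglia` is hypothetical, and the example URL on the old host is an assumption modeled on the farm-46 address given above, not a link taken from this page.

```python
from urllib.parse import urlsplit, urlunsplit

# Host-name prefixes taken from the bulletin text above.
OLD_HOST_PREFIX = "novadaq-far-daqmon"
NEW_HOST_PREFIX = "novadaq-far-farm-46"


def to_backup_ganglia(url: str) -> str:
    """Return url with the daqmon host prefix swapped for the farm-46 backup.

    URLs whose host does not start with the old prefix are returned unchanged.
    """
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith(OLD_HOST_PREFIX):
        host = NEW_HOST_PREFIX + host[len(OLD_HOST_PREFIX):]
    return urlunsplit(parts._replace(netloc=host))


# Hypothetical broken link, assumed to follow the same pattern as the backup URL.
print(to_backup_ganglia("https://novadaq-far-daqmon.fnal.gov/ganglia/?c=x"))
```

Only the host part of the link is touched, so the path and query string of the original plot link are preserved.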
Are you aware of the What to do on Shift website?¶
If this site is new to you, please review it! It is very comprehensive, and contains answers to most questions that you will have on shift.
Need to troubleshoot a DAQ issue?¶
Follow the link to an interactive site that will guide you through the troubleshooting process and through contacting experts.
DataLogger crashed, error, not running:¶
When the DataLogger is not running, it appears pink in the DAQ Application Manager. Please do not try to restart it from the DAQ Application Manager GUI window. In this case you always need to stop the run, then release and re-reserve resources from the Run Control windows. If that does not work, stop/kill Run Control and restart the whole DAQ, or call a DAQ expert.
FD DSO scan disk¶
Use data5-a for both reading and writing DSO scan results. Hopefully this will be selected for you automatically.
If you accidentally check "Select All"¶
If you accidentally check "Select All" when selecting resources during a start run procedure, please refer to the experts bulletin board for a complete list of valid resources for FD.
After starting a run, you may have to issue a hardware sync¶
After starting a new run for either detector, please check the following metrics for the following error modes: The Nearline Good Subruns plot showing partial detector subrun failures:
If you see either one of these error modes, please attempt to resync the detector using the big "sync detector to gps" button:
Wait a few minutes and see if the issue resolves itself. If not, please contact the DAQ on-call expert.
If the Spill Server Monitor GUI window has pink or red boxes and the trigger scalars are stopped, try restarting the Spill Server before calling a DAQ expert: click the Spill Server Backbone Restart icon on CR-05 (NearDet DAQ-01). This works for both detectors at once. Wait a minute or two; you should see all processes turn green and the scalars running again on both detectors. If that is not the case, call a DAQ expert.
Do not restart the run due to pink DDS processes!¶
If the DDS Daemons tab is red (but the DAQ Processes tab is black and data is flowing) in the DAQApplicationManager:
and manager process DDS daemons are pink when you select this tab:
OR if a buffer node (bnevb) DDS process is pink, do not attempt to restart the process or run.
In general, if a DDS process is pink, and there are no accompanying error messages (an example of an accompanying error message would be missing heartbeats reported in Run Control for that named process), then data taking is not affected--the shifter should not be concerned. Simply make a note in the ECL (possibly with an accompanying picture).
If the CERN Vidyo conference room disconnects:¶
Please record in the ECL both the time of the disconnect and any symptoms leading up to or following the disconnect (audio/video cutting out or stuttering, etc.). Include which side of the connection you are on (which ROC), and who dropped out. Then email Harry Ferguson <email@example.com> and Sheila Cisko <firstname.lastname@example.org> with a copy of this information. CC the NOvA run coordinators <email@example.com>. This will allow us to better diagnose, and subsequently address, the cause of these dropouts.
Please make certain to fill out the Downtime Logger whenever it pops up¶
When a run is stopped or started (not including runtime rollovers), a Downtime Logger window will pop up. Shifters are required to fill this out. This gives us a standing record of downtimes and their causes, allowing us to determine the best way to minimize downtime in the future. For more information on how to use the Downtime Logger, see DocDB 14166.
Please use Tags and Categories when you type an entry to the ECL¶
When you encounter a problem, especially a crash, make certain to tag the relevant entries. This makes the ECL more searchable, allowing you (and experts) to more quickly determine when problems started, and whether/how they have been addressed in the past. Please be especially careful to apply Data Logger and DAQ Run Control tags; these systems are currently our greatest source of downtime.
When you select a tag, you have to click the "add" link for it to actually take effect! If you miss a tag, make a mistake, or want to update the tags later, click on the ECL entry, then click "edit metadata" in the upper right corner and update the tag. Again, you have to click the "add" link and then the "Update tags" button. You can also change other metadata here.
Please be sure that all expert work is documented in the ECL¶
If experts do not document their work, please ask them to do so both before and after the work. All underground entries related to the NOvA experiment must also be documented in the ECL! And all calls to and from the Control Room must be recorded with the name, the time, and the reason for the call.
Please do not spend more than 15 minutes solving any DAQ problem on your own¶
If you find yourself spending more than 15 minutes trying to solve any DAQ problem, call an on-call DAQ expert. This is just as true at 3 AM as it is at 3 PM. Your job as a shifter is to make certain that the detectors and DAQ are running, or that the right people are working to make it so.
What OnMon plot should I rest on?¶
While a shifter should periodically browse many OnMon plots, it is often the case that a single plot is left up on novacr02 (FD OnMon) and novacr06 (ND OnMon) for extended periods of time. Perhaps one of the more useful plots to leave up as the OnMon 'ground' state is the PixelHitRateMapMipADC, as displayed here:
(This plot is in a good state!)
This plot is found in the RatePlots/HitMaps folder, rather than the more commonly-used Shift folder. You can trace its path in the directory tree to the left of the plot shown above. The Hit Rate plots available in the Shift folder are certainly quite useful, but summarize at the level of FEBs. If, say, half of the pixels in a given FEB were unintentionally disabled, the FEB-level plot would only show a reduced hit rate for that FEB — which might mask an underlying problem.
The PixelHitRateMapMipADC plot shown above is in an example good state. If a few scattered FEBs are missing (white), it is likely that they have dropped out of the run due to typical noise issues. If a large set of adjacent FEBs are all or mostly white for several 'refresh' cycles — or if they are of a drastically different color than their near neighbors, and a DCM has not recently been brought back into the run by a sync — this is almost certainly a problem (see the example plot below):
(This plot is in a bad state!)
Please call your DAQ expert to report any such state, especially if you have recently cooled the detector.
At the beginning of a run, this plot may look very odd; that is normal. Please wait until 10 minutes after the start of a run to consider an oddity in this plot as symptomatic of an underlying problem.
At the start of every hour and every half hour (e.g., at XX:00 and XX:30), the entire PixelHitRateMapMipADC plot will appear white. This is also normal. Wait just a few moments, and things should go back to normal.
The Shift folder may be updated to include a pixel-level plot in the near future. For now, however, please keep an eye on the RatePlots/HitMaps/PixelHitRateMapMipADC plot as your OnMon 'ground' state, and review its status frequently during your shift.
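The two timing caveats above (the first 10 minutes of a run, and the blank plot at XX:00 and XX:30) can be summarized as a small check. This is a minimal sketch, not a tool that exists on the control room machines: the function name is hypothetical, and the 2-minute blank window is an assumption standing in for "a few moments", which the text does not quantify.

```python
from datetime import datetime, timedelta

# From the text: oddities in the first 10 minutes of a run are normal.
RUN_START_SETTLE = timedelta(minutes=10)
# ASSUMPTION: "a few moments" of white plot after XX:00 / XX:30 is taken as 2 minutes.
REFRESH_BLANK = timedelta(minutes=2)


def plot_is_meaningful(now: datetime, run_start: datetime) -> bool:
    """Return True if an odd-looking PixelHitRateMapMipADC plot deserves attention.

    Returns False while the run is still settling, or shortly after the
    hourly / half-hourly reset when the whole plot is expected to be white.
    """
    if now - run_start < RUN_START_SETTLE:
        return False
    # Time elapsed since the most recent XX:00 or XX:30.
    since_reset = timedelta(minutes=now.minute % 30, seconds=now.second)
    return since_reset >= REFRESH_BLANK


run_start = datetime(2016, 10, 6, 12, 0, 0)
print(plot_is_meaningful(datetime(2016, 10, 6, 12, 45, 0), run_start))
```

A False result only means "wait and look again"; it says nothing about whether the detector is healthy.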
Main Control Room¶
If the MCR calls about the beam being off, NOvA shifters are responsible for informing the other NuMI experiment(s) (currently MINERvA) by talking to their shifters. The name, location and phone number for these shifter(s) should be recorded in your Start Shift form.
Correct responses to known crash modes¶
A run crashes with "Error in dcm-2-XX-YY due to Ctrl Reg General Status Error" or "Error in DataLogger due to DataLogger not responsive"¶
Release resources. Stop Run Control (using the desktop icon). Release the active partition (usually 1) from the Resource Manager window, and start Run Control (again using the desktop icon). Then start a run following the usual sequence (see the shifter How-To guide if you're uncertain how to do this). If releasing resources is not possible, or if you cannot stop and start Run Control, see the RunControl tab on the Troubleshooting page.
Note: when you release resources, windows may begin to disappear. This is normal; it's just a sign that the resources associated with those windows are being released!
How many scattered FEBs dropped out is a problem?¶
If you see 10 or more FEBs on FarDet (3 or more on NearDet) dropped out on Nearline, or showing up as white on the OnMon display, you should issue an Enable FEB Flow using the green button on the TDU Control Interface.
If you see a whole DCM missing, or many short tracks that end on DCM boundaries¶
This may indicate that the detector is out of sync. This can be addressed by issuing a hardware SYNC. Use the red 'Sync detector to GPS' button on the TDU Control Interface only if you see a whole DCM missing or many short tracks that end on DCM boundaries. This happens mainly when a new run is started: after resources are reserved, it may take up to 5 minutes at the beginning of a new run for the FarDet to stabilize.
After issuing a sync, wait a couple of minutes while watching the Event Display and OnMon to confirm that tracks go through the previously empty DCM. If this does not help within a few minutes, you may issue a sync once more; if the problem persists, call the DAQ expert.
If you experience some problem related to dropping FEBs that does not fall into the two categories above, and if the above actions have no effect (watch Event Display, OnMon or for a Nearline update), please contact the on-call DAQ expert.
When reporting on FEB dropouts in the ECL, make certain to record all symptoms noticed and all actions taken to correct the problem. State explicitly whether you chose "Enable FEB flow" or "Sync detector to GPS" at each step. This will help us to track the effectiveness of each under certain failure modes.
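The choice between the two buttons described in this section can be written down as a short decision function. This is only a sketch of the rules as stated above: the function and argument names are hypothetical, and the "Keep monitoring" result for counts below the threshold is an assumption, since the text does not say what to do in that case.

```python
# Thresholds from the text: 10 or more FEBs on FarDet, 3 or more on NearDet.
FEB_DROPOUT_THRESHOLD = {"FarDet": 10, "NearDet": 3}


def recommend_action(detector: str,
                     dropped_febs: int,
                     whole_dcm_missing: bool = False,
                     short_tracks_at_dcm_boundaries: bool = False) -> str:
    """Map observed symptoms to the TDU Control Interface action named in the text."""
    # A missing DCM or tracks ending on DCM boundaries points to lost sync.
    if whole_dcm_missing or short_tracks_at_dcm_boundaries:
        return "Sync detector to GPS"
    # Many scattered dropouts call for re-enabling FEB flow.
    if dropped_febs >= FEB_DROPOUT_THRESHOLD[detector]:
        return "Enable FEB Flow"
    # ASSUMPTION: below threshold, no button press is needed.
    return "Keep monitoring"


print(recommend_action("FarDet", 12))
print(recommend_action("NearDet", 0, whole_dcm_missing=True))
```

Whatever the outcome, the symptoms and the action taken still need to be recorded in the ECL, and anything that does not fit these cases goes to the on-call DAQ expert.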
Window message: FEB Timestamp diff error detected! You should issue a sync¶
If you see the message "FEB Timestamp diff error detected! You should issue a sync" in the NOvA Message Analyzer window, issue a sync using the big RED button Sync Detector to Current Time (Hardware) in the TDU Control Interface window (for the given partition). Afterward, reset all Rules (r1-r6) in the Message Analyzer window (Rst column).
To disable an alarm in the Message Analyzer¶
In the rare situation that you need to disable an alarm, you can do the following. An example is when there are no accelerator 1 Hz triggers and you want to remove the orange pop-up. In the Message Analyzer, click on the box to the left of an alarm (for spill server alarms, 'warnNoSpillServer Spill Server Problem'). Then, at the top right of that pane, click on Actions and then on Disable selections. You can enable an alarm from the same menu in the same way.
Recovering the computers in the ROC-West:¶
If the computers in ROC-West were restarted due to a power outage, or need to be recovered for any other reason, follow this document in DocDB.