ShiftBulletinBoard » History » Version 983

Shih-kai Lin, 01/17/2019 10:11 AM



Shift Bulletin Board

Don't forget to read the permanent conditions listed below the temporary conditions!

Current Running Conditions | Temporary Conditions | New & Permanent Conditions | Pay Special Attention

General Comments from The Run Coordinators:

Run Coordinator | Until (Date) | Phone Number | Favourite tree
Shih-Kai Lin | Tuesday, January 22nd, 8 am | (331) 250-0100 | No comment

Not sure which expert to call? Check the on-call contact sheet here.

If you have any feedback during your shift, please post it here.

Screenshots for ECL forms: When an ECL form asks for a single plot, please do not post a screenshot of the whole desktop. You might also consider using the Snapshot image server for taking screenshots more easily.

Current running conditions:

Period 8 begins.
NuMI beam returned on the evening of October 20, 2018. The nearline has been switched back to beam conditions.

Refresh this webpage before following ANY instructions.
FD: 14 diblocks in partition 1, with configuration FarDet-period8-prod-v0
ND: 4 diblocks in partition 1, with configuration NearDet-period8-prod-v0
Beam: Beam for period 8 has been delivered since Oct 20, 2018.
Beam Mode: RHC.
Cosmic: FD - normally 10 Hz, ND - normally 1 Hz
DDT: running: all enabled (including michele, fast monopole, slow monopole and supernova)
Power outages scheduled: NONE

Last updated: 2018-11-28 07:15

Temporary Conditions - Make sure to read these before every shift.

OnMon Crashes (updated January 16th, 2019)

Recently we have seen occasional OnMon crashes. Recovering OnMon is something the shifter can easily do. If you feel comfortable recovering OnMon, please go ahead.
Before you kill and restart OnMon, please first try to identify whether it is the viewer or the producer that crashed. You can tell by checking whether the OnMon producer terminal is still printing messages.
If it is just a viewer crash, please restart only the viewer, since this way the producer can still log crash-related messages.
After you restore OnMon, leave an ECL entry detailing what you have done.

DAQ Dashboard (updated June 13th, 2018)

The main thing to remember about the Dashboard at the moment is not to leave messages in the acknowledged column as this suppresses further messages in the same category. Shifters should not use the Big CleanUp button unless specifically instructed to do so by a DAQ on-call expert or run coordinator. If you find messages left in the Acked column, contact the DAQ on-call expert to clear them.

We have added a new question to the Start-shift form that asks you to "List the number of entries in the Alarm, Warning and Acked columns, in both the FD and ND Dashboard.:". This additional question is in response to a suggestion received via the shift feedback form and will be helpful to shifters when filling out the section of the stable-running-checklist that asks whether there are any new Dashboard Alarms or Warnings.

The DAQ dashboard documentation for shifters and experts can be found here.

bnevb errors and incomplete events (updated October 1st, 2018)

As part of the effort to fix instabilities in the FD DAQ, we have changed the severity of the "bnevb undefined error" (note this is still temporary and may be changed back). This error was previously listed as fatal (which caused the run to stop) but has been downgraded to a warning, which allows the run to continue at the expense of a few tens of seconds of incomplete events (much better than the 30 minutes of downtime we would have while recovering the run). When this happens, you may see the following:

1) A popup window from Run Control saying something like "Error in bnevbXXX due to Process Unresponsive - Call your DAQ expert."
2) A buffer node may turn yellow in DAQ Application Manager.

Your instructions are as follows:

1) Please follow the checklist instructions outlined on this page: Buffer Node Error Checklist
2) Call the DAQ expert and have them investigate the incomplete event rate from the buffer node in question.

We have also added a question to the stable run checklist regarding the AEvsHour plot found in the Shift folder in OnMon. Please check this plot when you are doing the stable run checklist and if you see the "incomplete event" error persist for more than 5 minutes, please contact the DAQ expert immediately.

Note: there is a known issue where we typically get a burst of NearDet incomplete events lasting ~10 minutes, from about 8:30 am, when the daily SNEWS trigger fires. You do not need to call your DAQ on-call expert about this. You are safe to assume any NearDet burst of incomplete events lasting more than 5 mins, between 8:30 and 9 am, is OK and does not need to be reported.
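The call/no-call rule above can be sketched as a small decision function. This is only an illustration of the written rule, not part of any NOvA tool: the function name, the detector labels and the choice to require that the whole burst fall inside the 8:30-9:00 window are assumptions.

```python
from datetime import datetime, time, timedelta

# Values taken from the text above; the names are hypothetical.
PERSIST_LIMIT = timedelta(minutes=5)   # "persist for more than 5 minutes"
SNEWS_WINDOW_START = time(8, 30)       # daily SNEWS trigger, about 8:30 am
SNEWS_WINDOW_END = time(9, 0)


def should_call_daq_expert(detector, burst_start, duration):
    """Return True if an incomplete-event burst warrants calling the DAQ expert."""
    if duration <= PERSIST_LIMIT:
        return False  # short bursts are the expected cost of the downgraded error

    burst_end = burst_start + duration
    # Assumption: the known NearDet burst must lie entirely within the window.
    inside_snews_window = (
        burst_end.date() == burst_start.date()
        and SNEWS_WINDOW_START <= burst_start.time()
        and burst_end.time() <= SNEWS_WINDOW_END
    )
    if detector == "NearDet" and inside_snews_window:
        return False  # known issue, does not need to be reported

    return True
```

For example, a 10-minute NearDet burst starting at 8:32 am would not trigger a call, while the same burst on the FarDet would.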

New Zoom connection info (updated August 21st 2018)

We use Zoom for video conferencing in ROC-West, and the Zoom ID is available on the expert contact sheet in every ROC. The maximum call length for Zoom is 24 hours, so don’t be surprised if the meeting ends automatically after 24 hours and you have to reconnect. We have it set up so anyone can connect and start the meeting, so even if the meeting ends and you are shifting from somewhere other than ROC-West, you should still be able to connect and restart the meeting.


New/Updated Permanent Information

Disconnects in the synoptic displays (updated January 24th)

Some of the synoptic displays can become temporarily interrupted (showing pink or purple in the display) as part of normal running. If you see this happen for one of the displays for a couple of minutes, log it in the ECL, but there is no need to contact an expert (as long as things return to normal). If all displays go pink/purple or if they stay pink/purple for more than a few minutes, contact an expert.

FD datadisk permanently switched to dd05

Having resolved the issues with datadisk05, we have switched back to long-term running on datadisk05. OnMon and the event display should default to using datadisk05 upon startup.

Muon monitor 4 no longer works. ("new" Dec. 2017)

This is on the NuMI Status Display webpage on CR-04 (the rightmost of the 4 monitors at the bottom of the page). This monitor no longer works, and this is NOT a NOvA-maintained webpage (so we have no control over removing it).

No terminals open on ND CR-05 (ND run control) ("new" Oct. 2017)

Please make sure there are no terminal windows open on this machine unless an expert is currently doing work or you have been instructed by an expert to leave one open. This includes checking that no windows are minimized. Simply keep typing "exit" until the windows disappear.

Procedure for handling a temperature event (new Jan 2017)

If there is a far detector server room temperature event, you will need to follow these instructions.


Permanent and Stable Information

Are you aware of the What to do on Shift website?

If this site is new to you, please review it! It is very comprehensive and contains answers to most questions that you will have on shift.

Need to troubleshoot a DAQ issue?

Follow the link to an interactive site that will direct you through the troubleshooting process and initiating contact with experts.

Who is the current DAQ On-Call expert?

Stable Runs

A stable runs checklist should be filled out for both detectors once every two hours, and 10 minutes after any time a run is started manually.
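As a rough illustration of this schedule, the sketch below computes when the next stable runs checklist is due for one detector. The function and constant names are hypothetical, and it assumes you track the time of the last checklist and of the most recent manual run start yourself.

```python
from datetime import datetime, timedelta

CHECK_INTERVAL = timedelta(hours=2)        # "once every two hours"
POST_START_DELAY = timedelta(minutes=10)   # "10 minutes after ... a run is started manually"


def next_checklist_due(last_checklist, manual_run_start=None):
    """Return the time at which the next stable runs checklist is due."""
    due = last_checklist + CHECK_INTERVAL
    # A manual run start after the last checklist can pull the due time earlier.
    if manual_run_start is not None and manual_run_start > last_checklist:
        due = min(due, manual_run_start + POST_START_DELAY)
    return due
```

The same calculation applies separately to the FD and the ND, since a checklist is required for both detectors.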

Calls from the Main Control Room (updated 15 Nov 2016)

If the MCR calls about the beam being off, NOvA shifters are responsible for informing other experiments that use the NuMI beam. Currently, this means MINERvA and MicroBooNE. (MicroBooNE is a Booster Neutrino Beam experiment, but sees an off-axis component of NuMI.)

MINERvA runs shifts only from 10am-4pm. During these hours, pass along all information about NuMI status to the MINERvA shifter. Outside these hours, it is not necessary to contact anyone at MINERvA unless specifically requested. In that case, call the MINERvA detector expert.

MicroBooNE shifters, regardless of whether they are remote or in ROC-West, can be reached at 937-582-6663.
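The notification rules above can be summarised in a short sketch. It is an illustration only: the function name and contact labels are hypothetical, and treating 4 pm itself as outside the MINERvA shift is an assumption.

```python
from datetime import time

MINERVA_SHIFT_START = time(10, 0)  # MINERvA shifts run 10am-4pm
MINERVA_SHIFT_END = time(16, 0)


def contacts_for_beam_off(now, minerva_requested_updates=False):
    """Return who to inform after the MCR calls about the beam being off."""
    contacts = ["MicroBooNE shifter"]  # reachable at all hours
    if MINERVA_SHIFT_START <= now < MINERVA_SHIFT_END:
        contacts.append("MINERvA shifter")
    elif minerva_requested_updates:
        # Outside shift hours, contact MINERvA only if specifically requested.
        contacts.append("MINERvA detector expert")
    return contacts
```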

DataLogger crashed, error, not running:

When the DataLogger is not running, it appears pink in the DAQ Application Manager. Please do not try to restart it from the DAQ Application Manager. In this case, you always need to stop the run, then release and re-reserve resources from the Run Control window, if it works. If not, you need to stop/kill Run Control and restart the whole DAQ, or call a DAQ expert.

FD DSO scan disk

Use disk data5-a for both reading and writing DSO scan results. Hopefully, this will be selected for you automatically.

After starting a run, you may have to issue a hardware sync

After starting a new run for either detector, please check the following metrics for the following error modes: The Nearline Good Subruns plot showing partial detector subrun failures:

and the Ganglia corrupt microslices metric − FD or ND − showing >1kHz rate for more than a few minutes after the start of the run.

If you see either one of these error modes, please attempt to resync the detector using the big "Sync detector to GPS" button:

Wait a few minutes and see if the issue resolves itself. If not, please contact the DAQ on-call expert.

If the Spill Server Monitor has pink or red boxes

Before calling a DAQ expert, try to restart the Spill Server with the Spill Server Backbone Restart icon on CR-05 (NearDet DAQ-01) and/or try restarting the SNEWS system with the SNEWS Backbone Restart icon. This works for both detectors at once. Wait a minute or two and you should see all processes turn green.

VNC session not responding to mouse clicks or keyboard input

Each encounter might be different. However, this is what worked in the past:
In the advanced options of the VNC viewer (option->advanced->expert->colorlevel), change the color scheme to pal8 or rgb222 and back.

Shift Etiquette

Please make certain to fill out the Downtime Logger whenever it pops up

When a run is stopped or started (not including runtime rollovers), a Downtime Logger window will pop up. Shifters are required to fill this out. This gives us a standing record of downtimes and their causes, allowing us to determine the best way to minimize downtime in the future. For more information on how to use the Downtime Logger, see DocDB 14166.

Please use Tags and Categories when you type an entry to the ECL

When you encounter a problem, especially a crash, make certain to tag the relevant entries. This makes the ECL more searchable, allowing you (and experts) to more quickly determine when problems started, and whether/how they have been addressed in the past. Please be especially careful to apply Data Logger and DAQ Run Control tags, since these systems are currently our greatest source of downtime.

When you select a tag, you have to click the "add" link for it to actually become active! If you miss a tag, make a mistake, or want to update an entry, click on the ECL entry, then on "edit metadata" in the upper right corner, and update the tag. Again, you have to click the "add" link and then the "Update tags" button. You can also change other metadata here.

Please be sure that all expert work is documented in the ECL

If experts do not do it, please ask them to do so before and after work. Also, every underground entry related to the NOvA experiment must be documented in the ECL!

Phone calls received in the control room

All calls to and from the control room must be recorded with the name, the time and the reason for the call.

You need to report who has called. Ask them to repeat their name if needed!

Please do not spend more than 15 minutes solving any DAQ problem on your own

If you find yourself spending more than 15 minutes trying to solve any DAQ problem, call an on-call DAQ expert. This is just as true at 3 AM as it is at 3 PM. Your job as a shifter is to make certain that the detectors and DAQ are running, or that the right people are working to make it so.

What OnMon plot should I rest on?

While a shifter should periodically browse many OnMon plots, it is often the case that a single plot is left up on novacr02 (FD OnMon) and novacr06 (ND OnMon) for extended periods of time. Perhaps one of the more useful plots to leave up as the OnMon 'ground' state is the PixelHitRateMapMipADC, as displayed here:


(This plot is in a good state!)

This plot is found in the Shift folder and also in the RatePlots/HitMaps folder. You can trace its path in the directory tree to the left of the plot shown above.

The PixelHitRateMapMipADC plot shown above is in an example good state. If a few scattered FEBs are missing (white), it is likely that they have dropped out of the run due to typical noise issues. If a large set of adjacent FEBs are all or mostly white for several 'refresh' cycles — or if they are of a drastically different color than their near neighbors, and a DCM has not recently been brought back into the run by a sync — this is almost certainly a problem (see the example plot below):


(This plot is in a bad state!)

Please call your DAQ expert to report any such state, especially if you have recently cooled the detector.

At the beginning of a run, this plot may look very odd; that is normal. Please wait until 10 minutes after the start of a run to consider an oddity in this plot as symptomatic of an underlying problem.

At the start of every hour and every half hour (i.e., at XX:00 and XX:30), the entire PixelHitRateMapMipADC plot will appear white. This is also normal. Wait just a few moments, and things should go back to normal.
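The two "this is normal" cases above can be expressed as a simple check before treating an odd-looking plot as a problem. This is a sketch under stated assumptions: the function name is hypothetical, and the two-minute grace period after XX:00 and XX:30 is an assumed stand-in for "a few moments".

```python
from datetime import datetime, timedelta

SETTLE_TIME = timedelta(minutes=10)  # wait 10 minutes after the start of a run
BLANK_GRACE = timedelta(minutes=2)   # assumption: "a few moments" after XX:00 / XX:30


def oddity_is_meaningful(now, run_start):
    """Return True if an odd PixelHitRateMapMipADC plot should be taken seriously."""
    if now - run_start < SETTLE_TIME:
        return False  # the plot may look very odd at the beginning of a run
    # Time elapsed since the most recent XX:00 or XX:30.
    since_half_hour = timedelta(minutes=now.minute % 30, seconds=now.second)
    if since_half_hour < BLANK_GRACE:
        return False  # the plot appears entirely white right after the half hour
    return True
```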

The Shift folder may be updated to include a pixel-level plot in the near future. For now, however, please keep an eye on the RatePlots/HitMaps/PixelHitRateMapMipADC plot as your OnMon 'ground' state, and review its status frequently during your shift.

Correct responses to known crash modes

A run crashes with "Error in dcm-2-XX-YY due to Ctrl Reg General Status Error" or "Error in DataLogger due to DataLogger not responsive"

Release resources. Stop Run Control using the desktop icon. Release partition 1 from the Resource Manager window, and start Run Control using the desktop icon. Then start a run following the usual sequence (see the manual). If releasing resources is not possible, or if you cannot stop and start Run Control, see the troubleshooting page.

Note: when you release resources, windows may disappear. This is normal − it's a sign that the resources associated with those windows are being released!

Dropping FEBs

How many scattered dropped-out FEBs count as a problem?

If you see 10 (6) or more FEBs dropped out on FarDet (NearDet) Nearline, or show up as white on the OnMon display, you should issue an Enable FEB Flow using the green button on the TDU Control Interface.
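The thresholds above, written out as a minimal sketch. The dictionary and function names are hypothetical; only the numbers 10 (FarDet) and 6 (NearDet) come from the text.

```python
# Number of dropped-out FEBs at which Enable FEB Flow should be issued.
FEB_DROPOUT_THRESHOLD = {"FarDet": 10, "NearDet": 6}


def needs_enable_feb_flow(detector, dropped_febs):
    """Return True if the count of dropped FEBs calls for Enable FEB Flow."""
    return dropped_febs >= FEB_DROPOUT_THRESHOLD[detector]
```

So 9 dropped FEBs on the FarDet is still below the threshold, while 6 on the NearDet already calls for action.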

If you see a whole DCM missing, or many short tracks that end on DCM boundaries

This may indicate that the detector is out of sync. This can be addressed by issuing a hardware SYNC. Use the red button 'Sync detector to GPS' on the TDU Control Interface only if you see a whole DCM missing or many short tracks that end on DCM boundaries. This happens mainly when a new run is started (after resources are reserved at the beginning of a new run, it may take up to 5 minutes for the FarDet to stabilize).

After issuing a sync, wait a couple of minutes while watching the Event Display and OnMon to confirm that tracks go through the previously empty DCM. If this does not help within a few minutes, you may issue a sync again. If that still does not help, call the DAQ On-Call expert.

Other problems

If you experience some problem related to dropping FEBs that does not fall into the two categories above, or if the above actions have no effect (watch the Event Display, OnMon, or wait for a Nearline update), please contact the DAQ On-Call expert.

Keeping records

When reporting on FEB dropouts in the ECL, make certain to record all symptoms noticed and all actions that have been taken to correct the problem. State explicitly whether you chose "Enable FEB flow" or "Sync detector to GPS" at each step. This will help us track the effectiveness of each under certain failure modes.

Message Analyzer:

Window message: FEB Timestamp diff error detected! You should issue a sync

If you see the message "FEB Timestamp diff error detected!" in the NOvA Message Analyzer window, you should issue a sync using the big RED button Sync Detector to Current Time (Hardware) in the TDU Control Interface window (for the given partition). Afterward, reset all Rules (r1-r6) in the Message Analyzer window (1st column).

To disable an alarm in the Message Analyzer

In the rare situation that you need to disable an alarm, you can do the following. (An example is when there are no accelerator 1 Hz triggers and you want to remove the orange pop-up.) In the Message Analyzer, click on the box to the left of the alarm (for spill server alarms, 'warnNoSpillServer Spill Server Problem'). Then, at the top right of that pane, click on Actions and then on Disable selections. Similarly, you can enable an alarm from the same menu.

Recovering the computers in the ROC-West:

If the computers in ROC-West were restarted due to a power outage, or need to be recovered for any other reason, follow this document in DocDB.
