Version 1092 (Leonidas Aliaga Soplin, 11/22/2019 05:33 PM) → Version 1093/1187 (Leonidas Aliaga Soplin, 11/22/2019 05:34 PM)

|[[NOvA_Shifts|Main Page]]|

h1. Shift Bulletin Board

h1. *Don't forget to read the permanent conditions listed below the temporary conditions!*

| *{color:DeepPink}Current Running Conditions*| *{color:orange}Temporary Conditions* | *{color:ForestGreen}New & Permanent Conditions* | *{color: DeepSkyBlue} Pay Special Attention* |

h1. General Comments from The Run Coordinators:

| *Run Coordinator* | *Until (Date)* | *Phone Number* | *Currently reading* |
| *{color: DeepSkyBlue}Leo Aliaga*| December 1st, 8am | (630) 340-9333 | ok|

*Not sure what expert to call?* Check the on-call contact sheet "here":

If you have any feedback during your shift, please post it "here":

Screenshots for ECL forms: When an ECL form asks for a single plot, please do not post a screenshot of the whole desktop. You might also consider using "this page": for taking screenshots more easily. It is updated every 5 minutes. Alternatively, you can also use the "Snapshot image server": (accessible only within Fermilab network).

h1. *{color: DeepPink} Current running conditions:*

*{color: DeepSkyBlue}Period 10 begins.*
NuMI horn polarity is FHC starting on October 29, 2019.

*{color:DeepSkyBlue}Refresh this webpage before following ANY instructions.*
*FD*: 14 diblocks in partition 1, with configuration *FarDet-period10-prod-ddtart2-v0*
*ND*: 4 diblocks in partition 1, with configuration *NearDet-period10-prod-ddtart2-v0*
*Beam Mode*: FHC
*Cosmic*: FD - normally 10 Hz, ND - normally 1 Hz
*DDT*: running: most enabled (including fast monopole, slow monopole and supernova) - Michel and NN Fast Monopole are currently disabled.

Last updated: 2019-10-07 16:00

h2. Temporary Conditions - *{color:DeepSkyBlue} Make sure to read these before every shift.*

h3. *{color:orange} Multiple failures in ND good runs (Updated November 22 2019)*
We are seeing multiple failures in ND good runs; the most troubling are failed timestamp and failed slice. The Data Quality group is working to understand this problem.

h3. *{color:orange}NuMI Beam returned (Updated November 18 2019)*
Beam is back. Since November 15 it has been running steadily at ~250 kW.

h3. *{color:orange}OnMon Crashes (updated January 16th, 2019)*

Recently we have seen occasional OnMon crashes. Restarting OnMon is something the shifter can easily do. If you feel comfortable recovering OnMon, please go ahead.
Before you kill and restart OnMon, please first identify whether it is the viewer or the producer that crashed. You can tell by checking whether the OnMon producer terminal is still printing messages.
If it is just a viewer crash, please restart only the viewer, so that the producer can still log crash-related messages.
After you restore OnMon, leave an ECL entry detailing what you have done.

h3. *{color:orange}DAQ Dashboard (updated June 13th, 2018)*

The main thing to remember about the Dashboard at the moment is not to leave messages in the acknowledged column as this suppresses further messages in the same category. *Shifters should not use the Big CleanUp button unless specifically instructed to do so by a DAQ on-call expert or run coordinator*. If you find messages left in the Acked column, contact the DAQ on-call expert to clear them.

We have added a new question to the Start-shift form that asks you to "List the number of entries in the Alarm, Warning and Acked columns, in both the FD and ND Dashboard.:". This additional question is in response to a suggestion received via the shift feedback form and will be helpful to shifters when filling out the section of the stable-running-checklist that asks whether there are any new Dashboard Alarms or Warnings.

The DAQ dashboard documentation for shifters and experts can be found "here":

h3. *{color:orange}bnevb errors and incomplete events (updated January 24th, 2019)*

As part of the effort to fix instabilities in the FD DAQ we have changed the handling of the "bnevb undefined error" (note this is still temporary and may be changed back). This error was previously listed as fatal (which caused the run to stop) but has been downgraded to a warning, which allows the run to continue at the expense of a few tens of seconds of incomplete events (much better than the 30 minutes of downtime we would have while recovering the run). When this happens, you may see the following:

1) A popup window from Run Control saying something like "Error in bnevbXXX due to Process Unresponsive - Call your DAQ expert."
2) A buffer node may turn yellow in DAQ Application Manager.

Your instructions are as follows:

1) Please follow the checklist instructions outlined on this page: [[Buffer Node Error Checklist]]
2) Call the DAQ expert and have them investigate the incomplete event rate from the buffer node in question.

We have also added a question to the stable run checklist regarding the AEvsHour plot found in the Shift folder in OnMon. Please check this plot when you are doing the stable run checklist and *if you see the "incomplete event" error persist for more than 5 minutes, please contact the DAQ expert immediately*.

*Note: there is a known issue where we typically get a burst of NearDet incomplete events lasting ~10 minutes whenever a long trigger is being read out. This occurs every morning at 8:30am Central when the SNEWS trigger fires, and every time our supernova trigger fires. You do not need to call your DAQ on-call expert about this.*
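The decision rule in this section can be summarized as a small sketch. This is only an illustration of the written instructions, not a tool that exists in the control room; the function name, its arguments, and the detector labels are hypothetical.

```python
from datetime import timedelta

# From the text: call the DAQ expert if the "incomplete event" error
# persists for more than 5 minutes.
PERSIST_LIMIT = timedelta(minutes=5)


def should_call_daq_expert(error_duration, detector, long_trigger_readout=False):
    """Return True if the incomplete-event rule says to call the DAQ expert.

    error_duration: how long the error has been visible (timedelta).
    detector: "FarDet" or "NearDet" (hypothetical labels).
    long_trigger_readout: True if a long trigger (the 8:30am Central SNEWS
        trigger or a supernova trigger) is being read out.
    """
    if detector == "NearDet" and long_trigger_readout:
        # Known issue: NearDet bursts of ~10 minutes during long readouts.
        # ASSUMPTION: this sketch does not model how long such a burst may last.
        return False
    return error_duration > PERSIST_LIMIT
```

When in doubt, the written instructions above take precedence over this sketch.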

h3. *{color:ForestGreen}New Zoom connection info (updated August 21st 2018)*

We use Zoom for video conferencing in ROC-West, and the Zoom ID is available on the expert contact sheet in every ROC. The maximum call length for Zoom is 24 hours, so don’t be surprised if the meeting ends automatically after 24 hours and you have to reconnect. We have it set up so anyone can connect and start the meeting, so even if the meeting ends and you are shifting from somewhere other than ROC-West, you should still be able to connect and restart the meeting.

If you read this, and you send Anne a picture of a pangolin, she will give you candy.

h2. *New/Updated Permanent Information*

h3. *{color:ForestGreen} New Baseline feature in ND and FD dashboards (updated February 11th)*

A new Baseline feature has been implemented in the dashboards for both detectors. This feature highlights any alarm or warning values that have changed by changing their colour, making it easier to spot new alarms and warnings when filling out checklists. Using the Baseline button will reset all highlighted values, and you (the shifter) should feel free to use it whenever convenient to help keep track of new alarms and warnings as they appear.

h3. *{color:ForestGreen} Disconnects in the synoptic displays (updated January 24th)*

Some of the synoptic displays can become temporarily interrupted (showing pink or purple in the display) as part of normal running. If you see this happen for one of the displays for a couple of minutes, log it in the ECL, but there is no need to contact an expert (as long as things return to normal). If all displays go pink/purple or if they stay pink/purple for more than a few minutes, contact an expert.

h3. *{color:ForestGreen}FD datadisk permanently switched to dd05*

Having resolved the issues with datadisk05, we have switched back to long-term running on datadisk05. OnMon and the event display should default to using datadisk05 upon startup.

h3. *{color:ForestGreen}Muon monitor 4 no longer works. ("new" Dec. 2017)*

This is on the NuMI Status Display webpage on CR-04 (the rightmost of the 4 monitors at the bottom of the page). This monitor no longer works, and this is NOT a NOvA-maintained webpage (so we have no control over removing it).

h3. *{color:ForestGreen} No terminals open on ND CR-05 (ND run control) ("new" Oct. 2017)*

Please make sure there are no terminal windows open on this machine unless an expert is currently doing work or you have been instructed by an expert to leave one open. This includes checking that no windows are minimized. Simply keep typing "exit" until the windows disappear.

h3. *{color:ForestGreen} Procedure for handling a temperature event (new Jan 2017)*

If there is a far detector server room temperature event, you will need to "follow these instructions":


h2. Permanent and Stable Information

h3. Are you aware of the "What to do on Shift website":

If this site is new to you, please review it! It is very comprehensive and contains answers to most questions that you will have on shift.

h3. Need to "troubleshoot a DAQ issue": ?

Follow the link to an interactive site that will direct you through the troubleshooting process and initiating contact with experts.

h3. Who is "the current DAQ On-Call expert?":

h3. Stable Runs

A stable runs checklist should be filled out for both detectors once every two hours, and 10 minutes after any time a run is started manually.
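As an illustration of the timing rule above, here is a minimal sketch that computes when the next checklist is due for one detector. The function and constant names are hypothetical, and the choice to take the earlier of the two deadlines is an assumption about how the two rules combine.

```python
from datetime import datetime, timedelta

CHECKLIST_INTERVAL = timedelta(hours=2)     # "once every two hours"
AFTER_MANUAL_START = timedelta(minutes=10)  # "10 minutes after" a manual start


def next_checklist_due(last_checklist, manual_run_start=None):
    """Return the time the next stable runs checklist is due.

    last_checklist: datetime of the most recent checklist.
    manual_run_start: datetime of a manual run start since then, or None.
    """
    due = last_checklist + CHECKLIST_INTERVAL
    if manual_run_start is not None and manual_run_start > last_checklist:
        # ASSUMPTION: a manual start adds its own deadline, and whichever
        # deadline comes first applies.
        due = min(due, manual_run_start + AFTER_MANUAL_START)
    return due
```

For example, with a checklist at 08:00 and a manual run start at 08:30, the sketch gives 08:40 as the next due time.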

h3. NuMI Beep

If the beep accompanying NuMI spills is not working in ROC West, you can open it on the computer that runs zoom with this link:

h3. Calls from the Main Control Room (updated 15 Nov 2016)

If the MCR calls about the beam being off, NOvA shifters are responsible for informing other experiments that use the NuMI beam. Currently, this means MicroBooNE. (MicroBooNE is a Booster Neutrino Beam experiment, but sees an off-axis component of NuMI.)

MicroBooNE shifters, regardless of whether they are remote or in ROC-West, can be reached at 937-582-6663.

(MINERvA has ceased operations, but our start shift form still asks about who the MINERvA Shifter is. Please feel free to leave this blank. Removing the question entirely causes problems with the history of the field in ECL. Yes, we know it is annoying. Sigh. )

h3. DataLogger crashed, error, not running:

When the DataLogger is not running, it shows pink in the DAQ Application Manager. Please do not try to restart it from the DAQ Application Manager. In this case, you always need to stop the run and *{color:green} release and re-reserve resources* from the Run Control window, if that works. If not, you need to stop/kill Run Control and restart the whole DAQ, or call a DAQ expert.

h3. FD DSO scan disk

Use disk @data5-a@ for both reading and writing DSO scan results. Hopefully, this will be selected for you automatically.

h3. After starting a run, you may have to issue a hardware sync

After starting a new run for either detector, please check the following metrics for these error modes:

1) The "Nearline Good Subruns": plot showing partial detector subrun failures.
2) The Ganglia corrupt microslices metric ("FD": or "ND":) showing a >1 kHz rate for more than a few minutes after the start of the run.


If you see either one of these error modes, please attempt to resync the detector using the big "Sync detector to GPS" button:


Wait a few minutes and see if the issue resolves itself. If not, please contact the DAQ on-call expert.
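The corrupt microslices criterion in this section (a rate above 1 kHz lasting more than a few minutes after the start of a run) can be sketched as follows. This is only an illustration: the sample format and function name are hypothetical, this is not how Ganglia exposes the metric, and the value used for "a few minutes" is an assumption.

```python
CORRUPT_RATE_LIMIT_HZ = 1000.0  # ">1kHz" from the text
FEW_MINUTES = 3.0               # ASSUMPTION: the text only says "a few minutes"


def needs_resync(samples, limit_hz=CORRUPT_RATE_LIMIT_HZ, few_minutes=FEW_MINUTES):
    """Return True if the rate stays above limit_hz for longer than few_minutes.

    samples: time-ordered list of (minutes_since_run_start, rate_hz) pairs.
    """
    high_since = None  # time at which the current high-rate stretch began
    for minutes, rate in samples:
        if rate > limit_hz:
            if high_since is None:
                high_since = minutes
            if minutes - high_since > few_minutes:
                return True
        else:
            high_since = None  # rate dropped back, so restart the clock
    return False
```

A short spike right after the run starts that falls back below 1 kHz does not trigger the rule in this sketch; only a sustained high rate does.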

h3. If the Spill Server Monitor has pink or red boxes

Before calling a DAQ expert, try to restart the Spill Server with the *{color:green} Spill Server Backbone Restart icon* on CR-05 (NearDet DAQ-01) and/or try restarting the SNEWS system with the *{color:green}SNEWS Backbone Restart icon*. This works for both detectors at once. Wait a minute or two and you should see all processes turn green.

h3. VNC session not responding to mouse clicks or keyboard input

Each encounter might be different. However, this is what worked in the past:
In the VNC viewer's advanced options (option->advanced->expert->colorlevel), change the color level to pal8 or rgb222 and back.

h1. *{color:DeepSkyBlue}Shift Etiquette*

h3. Please make certain to fill out the Downtime Logger whenever it pops up

When a run is stopped or started (not including runtime rollovers), a Downtime Logger window will pop up. Shifters are required to fill this out. This gives us a standing record of downtimes and their causes, allowing us to determine the best way to minimize downtime in the future. For more information as to how to use the Downtime Logger, see "DocDB 14166":

h3. Please use Tags and Categories when you type an entry to the ECL

When you encounter a problem-- *{color:red} especially a crash*--make certain to tag the relevant entries. This makes the ECL more searchable, allowing you (and experts) to more quickly determine when problems started, and whether/how they have been addressed in the past. Please be especially careful to apply Data Logger and DAQ Run Control tags -- these systems are currently our greatest source of downtime.

When you select a tag, you have to click the "add" link for it to take effect! If you miss a tag, make a mistake, or want to update an entry, click on the ECL entry, then on "edit metadata" in the upper right corner, and update the tag; again, you have to click the "add" link and then the "Update tags" button. You can also change other metadata there.

h3. Please be sure that *all* expert work is documented in the ECL

If experts do not do it, please ask them to do it before and after work. Also, all underground entries related to the NOvA experiment must be documented in the ECL!

h2. Phone calls received in the control room

All calls to and from the control room must be recorded with the name, time, and reason for the call.

You need to report _*who*_ has called. Ask them to repeat their name if needed!

h3. Please do not spend more than *{color:red}15 minutes* solving any DAQ problem on *{color:red} your own*

If you find yourself spending more than 15 minutes trying to solve any DAQ problem, *{color:red}call an on-call DAQ expert.* This is just as true at 3 AM as it is at 3 PM. Your job as a shifter is to make certain that the detectors and DAQ are running, or that the right people are working to make it so.

h2. What OnMon plot should I rest on?

While a shifter should periodically browse many OnMon plots, it is often the case that a single plot is left up on novacr02 (FD OnMon) and novacr06 (ND OnMon) for extended periods of time. Perhaps one of the more useful plots to leave up as the OnMon 'ground' state is the *PixelHitRateMapMipADC*, as displayed here:

(This plot is in a good state!)

This plot is found in the Shift folder and also in the RatePlots/HitMaps folder. You can trace its path in the directory tree to the left of the plot shown above.

The *PixelHitRateMapMipADC* plot shown *above* is in an example *good* state. If a few scattered FEBs are missing (white), it is likely that they have dropped out of the run due to typical noise issues. _If a large set of adjacent FEBs are all or mostly white for several 'refresh' cycles — or if they are of a drastically different color than their near neighbors, and a DCM has not recently been brought back into the run by a sync_ — this is almost certainly a *problem* (see the example plot *below*):

(This plot is in a bad state!)

Please call your DAQ expert to report any such state, *especially if you have recently cooled the detector*.

At the beginning of a run, this plot may look very odd; that is normal. Please wait until 10 minutes after the start of a run to consider an oddity in this plot as symptomatic of an underlying problem.

At the start of every hour and every half hour (i.e., at XX:00 and XX:30), the entire PixelHitRateMapMipADC plot will appear white. This is also normal. Wait just a few moments, and things should go back to normal.
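The two "this is normal" cases above (the first 10 minutes of a run, and the blank plot at XX:00 and XX:30) can be expressed as a small check. This is only a sketch with hypothetical names; in particular, the one-minute window after each half hour is an assumption, since the text only says "a few moments".

```python
from datetime import datetime, timedelta

RUN_START_GRACE = timedelta(minutes=10)  # from the text
RESET_GRACE = timedelta(minutes=1)       # ASSUMPTION: "a few moments"


def plot_oddity_is_expected(now, run_start):
    """Return True if an odd-looking PixelHitRateMapMipADC plot is expected."""
    if now - run_start < RUN_START_GRACE:
        # The plot may look very odd at the beginning of a run.
        return True
    # Time elapsed since the most recent XX:00 or XX:30.
    since_half_hour = timedelta(minutes=now.minute % 30, seconds=now.second)
    return since_half_hour < RESET_GRACE
```

Outside these windows, an odd plot should be judged against the good and bad example states described in this section.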

The shift folder may be updated to include a pixel level plot in the near future. For now, however, please keep an eye on the RatePlots/HitMaps/ *PixelHitRateMapMipADC* plot as your OnMon 'ground' state, and review its status frequently during your shift.

h2. Correct responses to known crash modes

h3. A run crashes with *{color:orange}"Error in dcm-2-XX-YY due to Ctrl Reg General Status Error"* or *{color:orange}"Error in DataLogger due to DataLogger not responsive"*

Release resources. Stop Run Control using the desktop icon. Release partition 1 from the Resource Manager window, and start Run Control using the desktop icon. Then start a run following the usual sequence (see "the manual":). If releasing resources is not possible, or if you cannot stop and start Run Control, see the "troubleshooting page":

Note: when you release resources, windows may disappear. This is normal − it's a sign that the resources associated with those windows are being released!

h2. Dropping FEBs

h3. How many scattered dropped FEBs are a problem?

If you see 10 (6) or more FEBs dropped out on FarDet (NearDet) Nearline, or show up as white on the OnMon display, you should issue an Enable FEB Flow using the *{color:green}green button* on the TDU Control Interface.
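The thresholds above amount to a simple per-detector comparison, sketched here for illustration. The function name and detector labels are hypothetical; only the numbers 10 (FarDet) and 6 (NearDet) come from the text.

```python
# Number of dropped FEBs at which an Enable FEB Flow should be issued.
FEB_DROPOUT_THRESHOLD = {"FarDet": 10, "NearDet": 6}


def should_enable_feb_flow(detector, dropped_febs):
    """Return True if the dropped-FEB count meets the detector's threshold."""
    return dropped_febs >= FEB_DROPOUT_THRESHOLD[detector]
```

For instance, 9 dropped FEBs on FarDet is below the threshold, while 6 dropped FEBs on NearDet meets it.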

h3. If you see a whole DCM missing, or many short tracks that end on DCM boundaries

This may indicate that the detector is out of sync, which can be addressed by issuing a hardware SYNC. Use the red button *{color:red}'Sync detector to GPS'* on the TDU Control Interface *only* if you see a whole DCM missing or many short tracks that end on DCM boundaries. This happens mainly when a new run is started: after resources are reserved, it may take up to 5 minutes at the beginning of a new run for the FarDet to stabilize.

After issuing a sync, wait a couple of minutes while watching the Event Display and OnMon to confirm that tracks go through the previously empty DCM. If this does not help within a few minutes, you may issue a sync again; if that still does not help, call the DAQ On-Call expert.

h3. Other problems

If you experience some problem related to dropping FEBs that does not fall into the two categories above, or if the above actions have no effect (watch Event Display, OnMon or for a Nearline update), please contact the DAQ On-Call expert.

h3. Keeping records

When reporting on FEB dropouts in the ECL, make certain to record all symptoms noticed and all actions that have been taken to correct the problem. State explicitly whether you chose "Enable FEB flow" or "Sync detector to GPS" at each step. This will help us to track the effectiveness of each under certain failure modes.

h2. Message Analyzer:

h3. Window message: *{color:blue} FEB Timestamp diff error detected! You should issue a sync*

If you see the message *{color:blue} FEB Timestamp diff error detected! You should issue a sync* in the NOvA Message Analyzer window, issue a sync using the big RED button *{color:blue} Sync Detector to Current Time (Hardware)* in the TDU Control Interface window (for the given partition). Afterward, reset all Rules (r1-r6) in the Message Analyzer window (1st column).

h3. To disable an alarm in the Message Analyzer

In the rare situation that you need to disable an alarm (for example, if there are no accelerator 1 Hz triggers and you want to remove the orange pop-up), do the following. In the Message Analyzer, click on the box to the left of the alarm (for spill server alarms, 'warnNoSpillServer Spill Server Problem'). Then, at the top right of that pane, click on Actions and then on Disable selections. You can re-enable an alarm from the same menu.

h2. Recovering the computers in the ROC-West:

If the computers in ROC-West were restarted due to a power outage, or need to be recovered for any other reason, follow "this document": in DocDB.

h2. Recovering the Detector Monitoring Cameras:

NOvA is responsible for all of the cameras, at both the near and far detector. The Main Control Room has no responsibility for these cameras. Often they recover on their own, but please do record outages in the logbook and let your DAQ Expert and Run Coordinator know via Slack or email. Unless all of the cameras go out at once, it is generally not an emergency situation. If all of the cameras go out at once, please call your run coordinator.

|[[NOvA_Shifts|Main Page]]|