ShiftBulletinBoard » History » Version 444

Xuebing Bu, 03/12/2016 11:28 AM



Shift Bulletin Board - Mar 11, 2016

Jaroslav Zalesak is now acting Run Coordinator (Mar 7 8am - Mar 21 8am). He is reachable at 630-251-1374 (cell) and 630-840-3186 (work).

 

On Fri, Sat and Sun (Mar 11-13), 8pm-2am, the DAQ expert to call is Xuebing Bu: 331-330-5988 (cell), 630-840-3218 (work).

 

Beam down for 4+ hours:

If MCR announces that the beam will be down for at least 4 hours next week during working hours (8am-3pm, not on holidays), please call Jon Paley (617-504-4005).
 

Special run conditions beginning Mar 4 2016:

DAQ Ganglia monitoring systems are operating in a degraded state. Shifters should be aware of this when performing checklists. Many of the checklist plots are produced by Ganglia, which may be down.

Ganglia web pages: since the weekend we have had a backup machine for Ganglia: https://novadaq-far-farm-46.fnal.gov/ganglia/. The first time you open it in a web browser you need to accept the 'risky' certificate and allow it; afterwards you will be prompted for authentication. The login is 'nova' and the password is the OLD nova password (without 'nu' at the beginning).

Many links, especially those from the ECL forms, will not work for now. However, if you replace novadaq-far-daqmon with novadaq-far-farm-46 in a link, it should work!
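For shifters who prefer to rewrite such links programmatically, the substitution described above can be sketched in Python as follows. This is only an illustration: the function name and the example URL path are hypothetical, and the sketch assumes the broken links differ from the working ones only in the host name prefix.

```python
from urllib.parse import urlsplit, urlunsplit

# Host name prefixes taken from the bulletin text above.
OLD_HOST_PREFIX = "novadaq-far-daqmon"
BACKUP_HOST_PREFIX = "novadaq-far-farm-46"


def to_backup_ganglia(url: str) -> str:
    """Return the URL with the old Ganglia host swapped for the backup host.

    URLs that do not point at the old host are returned unchanged.
    (Hypothetical helper, not part of any NOvA tool.)
    """
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith(OLD_HOST_PREFIX):
        host = BACKUP_HOST_PREFIX + host[len(OLD_HOST_PREFIX):]
    return urlunsplit(parts._replace(netloc=host))


# Example with a made-up path and query string:
print(to_backup_ganglia("https://novadaq-far-daqmon.fnal.gov/ganglia/?c=x"))
```

Only the host part of the URL is touched, so any path or query string in the original link is preserved.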
 

Special run conditions beginning Feb 3 2016:

We are running with 130+ buffer nodes in the FarDet DAQ read-out. The full list is at the bottom of this page.

 

After starting a run, you may have to issue a hardware sync

After starting a new run for either detector, please check the following metrics for the following error modes:

The Nearline Good Subruns plot showing partial detector subrun failures:

and the Ganglia corrupt microslices metric showing very high (kHz) rate for more than a few minutes after the start of the run (note that this link is for the FD corrupt microslices metric; please look at the ND corrupt microslices metric if you restart the ND run, available from the ND Ganglia Checklist).

If you see either one of these error modes, please attempt to resync the detector using the big "sync detector to gps" button:

Wait a few minutes and see if the issue resolves itself. If not, please contact the DAQ on-call expert.

Remember to use tags:

When you encounter a problem--especially a crash--make certain to tag the relevant entries.

This guarantees that information is sent to the parties responsible for that system. Please be especially careful to apply Data Logger and DAQ Run Control tags -- these systems are currently our greatest source of downtime. See Shift Etiquette below this section for more information on how to tag an entry.

 

NuMI TRIGGERS:

If beam is ON, you should see NuMI trigger scalars incrementing on both detectors.

If NOT, check the Spill Server Monitor GUI window (on CR-01 for FarDet, CR-05 for NearDet). ALL boxes should be GREEN. If ANY box (on either detector) is pink or red rather than green and the trigger scalars have stopped, try restarting the Spill Server before calling a DAQ expert: click the Spill Server Backbone Restart icon on the CR-05 (NearDet DAQ-01) machine (this works for both detectors at once). Wait a minute or two; all processes should turn green and the scalars should resume on both detectors (the 1Hz trigger at 1 Hz and the NuMI trigger at about 0.7 Hz if beam is up). If that is not the case, call a DAQ expert.

In any case, whether beam is ON or OFF, the 1Hz Accel trigger scalars should be running at 1 Hz.
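As a rough illustration of the rate check described above, the following Python sketch converts two scalar readings into a rate and compares it with the expected value. The expected rates (1 Hz and about 0.7 Hz) come from the text; the function names and the 20% tolerance are assumptions made for this example only, not official limits.

```python
# Expected trigger rates quoted in the bulletin (NuMI value applies only when beam is up).
EXPECTED_RATE_HZ = {"1Hz": 1.0, "NuMI": 0.7}


def scaler_rate(count_start: int, count_end: int, elapsed_s: float) -> float:
    """Average trigger rate in Hz between two scalar readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (count_end - count_start) / elapsed_s


def rate_ok(trigger: str, rate_hz: float, rel_tol: float = 0.2) -> bool:
    """True if the rate is within rel_tol of the expected value.

    rel_tol=0.2 is an assumed tolerance chosen for illustration.
    """
    expected = EXPECTED_RATE_HZ[trigger]
    return abs(rate_hz - expected) <= rel_tol * expected


# Example: the 1Hz scalar advanced by 60 counts in one minute.
print(rate_ok("1Hz", scaler_rate(100, 160, 60)))
```

A stopped scalar gives a rate of 0 Hz and fails the check for either trigger, which corresponds to the situation where the Spill Server restart should be tried.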

 

Do not restart the run due to pink DDS processes!

If the DDS Daemons tab is red (but the DAQ Processes tab is black and data is flowing) in the DAQApplicationManager:

and manager process DDS daemons are pink when you select this tab:

OR if a buffer node (bnevb) DDS process is pink, do not attempt to restart the process or run.

In general, if a DDS process is pink, and there are no accompanying error messages (an example of an accompanying error message would be missing heartbeats reported in Run Control for that named process), then data taking is not affected--the shifter should not be concerned. Simply make a note in the ECL (possibly with an accompanying picture).

 

If Run Control crashes / disconnects:

If a run crashes with an error message that explicitly names RunControl, or if the RunControl GUI turns red:

Once the DAQ is up and running again, email Jon Paley with as much information as possible about the crash, including the run number(s), the range of ECL entries relating to the crash, and its solution. Alternatively, tag the relevant entries with the DAQ Run Control tag (see Shift Etiquette below for more information about tagging). We need data to solve this problem, and you are the trigger system that tells us we've got signal.

If the CERN Vidyo conference room disconnects:

Please record in the ECL both the time of the disconnect and any symptoms leading up to or following it (audio/video cutting out or stuttering, etc.). Include which side of the connection you are on (which ROC) and who dropped out. Then email Harry Ferguson and Sheila Cisko with a copy of this information, and CC the NOvA run coordinators. This will allow us to better diagnose, and subsequently address, the cause of these dropouts.

 
 
 
 

Shift Etiquette

Please make certain to fill out the Downtime Logger whenever it pops up

When a run is stopped or started (not including runtime rollovers), a Downtime Logger window will pop up. Shifters are required to fill this out. This gives us a standing record of downtimes and their causes, allowing us to determine the best way to minimize downtime in the future. For more information on how to use the Downtime Logger, see DocDB 14166.

Please use TAGs and Categories when you type an entry to the ECL!

This makes the ECL more searchable, allowing you (and experts) to more quickly determine when problems started, and whether/how they have been addressed in the past.

When you select a tag, you have to click the "add" link for it to become active! If you missed a tag, made a mistake, or want to update it, click on the ECL entry, then click "edit metadata" in the upper right corner and update the tag; again, you have to click the "add" link and then the "Update tags" button. You can also change other metadata there.

Please be sure that ALL expert work is documented in the ECL

If experts do not do it, please ask them to do so before and after their work. Also, every underground entry related to the NOvA experiment must be documented in the ECL! All calls to and from the Control Room must also be recorded with the name, time and reason for the call.

Please do not spend more than 15 minutes solving any DAQ problem on your own

If you find yourself spending more than 15 minutes trying to solve any DAQ problem, call an on-call DAQ expert. This is just as true at 3 AM as it is at 3 PM. Your job as a shifter is to make certain that the detectors and DAQ are running, or that the right people are working to make it so.

 
 
 

What OnMon plot should I rest on?

While a shifter should periodically browse many OnMon plots, it is often the case that a single plot is left up on novacr02 (FD OnMon) and novacr06 (ND OnMon) for extended periods of time. Perhaps one of the more useful plots to leave up as the OnMon 'ground' state is the PixelHitRateMapMipADC, as displayed here:


(This plot is in a good state!)

This plot is found in the RatePlots/HitMaps folder, rather than the more commonly-used Shift folder. You can trace its path in the directory tree to the left of the plot shown above. The Hit Rate plots available in the Shift folder are certainly quite useful, but summarize at the level of FEBs. If, say, half of the pixels in a given FEB were unintentionally disabled, the FEB-level plot would only show a reduced hit rate for that FEB -- which might mask an underlying problem.

The PixelHitRateMapMipADC plot shown above is in an example good state. If a few scattered FEBs are missing (white), it is likely that they have dropped out of the run due to typical noise issues. If a large set of adjacent FEBs are all or mostly white for several 'refresh' cycles--or if they are of a drastically different color than their near neighbors, and a DCM has not recently been brought back into the run by a sync--this is almost certainly a problem (see the example plot below):


(This plot is in a bad state!)

Please call your DAQ expert to report any such state, especially if you have recently cooled the detector.

At the beginning of a run, this plot may look very odd; that is normal. Please wait until 10 minutes after the start of a run to consider an oddity in this plot as symptomatic of an underlying problem.

At the start of every hour and every half hour (e.g., at XX:00 and XX:30), the entire PixelHitRateMapMipADC plot will appear white. This is also normal. Wait just a few moments, and things should go back to normal.
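The two timing caveats above (ignore the plot during the first 10 minutes of a run, and around XX:00 and XX:30) can be summarized in a small Python sketch. The 10-minute settling time is from the text; the one-minute blank window after each half hour is an assumed stand-in for "a few moments", and the function name is hypothetical.

```python
from datetime import datetime, timedelta

RUN_START_SETTLE = timedelta(minutes=10)  # from the text: wait 10 minutes after run start
REFRESH_BLANK = timedelta(minutes=1)      # assumption: plot is blank for about a minute


def plot_is_meaningful(run_start: datetime, now: datetime) -> bool:
    """True if an oddity in PixelHitRateMapMipADC should be taken seriously."""
    if now - run_start < RUN_START_SETTLE:
        return False  # plot may still look odd at the beginning of a run
    # Most recent XX:00 or XX:30 boundary at or before 'now'.
    last_half_hour = now.replace(minute=0 if now.minute < 30 else 30,
                                 second=0, microsecond=0)
    return now - last_half_hour >= REFRESH_BLANK


run_start = datetime(2016, 3, 11, 8, 0)
print(plot_is_meaningful(run_start, datetime(2016, 3, 11, 8, 45)))
```

If the function returns False, an all-white or odd-looking plot is expected behavior; wait and look again before contacting an expert.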

The shift folder may be updated to include a pixel level plot in the near future. For now, however, please keep an eye on the RatePlots/HitMaps/ PixelHitRateMapMipADC plot as your OnMon 'ground' state, and review its status frequently during your shift.

 
 
 

Calls

Main Control room:

If MCR calls to report that beam is OFF: NOvA shifters are responsible for informing the other experiments (MINOS and MINERvA) by talking to their shifters. Their name, location and phone number should now be recorded in your Start Shift form; if MCR calls, use this information to contact MINOS and MINERvA.

Expert calling:

If you call any expert, please fill out the new ECL form called Expert Contact.

 
 
 

Are you aware of the What to do on Shift website?

If this site is new to you, please review it! It is very comprehensive, and contains answers to most questions that you will have on shift.

 
 
 

Correct responses to known crash modes

A run crashes with "Error in dcm-2-XX-YY due to Ctrl Reg General Status Error Stopping run."

Release resources. Stop Run Control (using the desktop icon). Release the active partition (usually 1) from the Resource Manager window, and start Run Control (again using the desktop icon). Then start a run following the usual sequence (see the shifter How-To guide if you're uncertain how to do this). If releasing resources is not possible, or if you cannot stop and start Run Control, see the RunControl tab on the Troubleshooting page.

Note: when you release resources, windows may begin to disappear. This is normal--it's just a sign that the resources associated with those windows are being released!

A run crashes with "Error in DataLogger due to DataLogger not responsive Stopping run"

Release resources. Stop Run Control (using the desktop icon). Release the active partition (usually 1) from the Resource Manager window, and start Run Control (again using the desktop icon). Then start a run following the usual sequence (see the shifter How-To guide if you're uncertain how to do this). If releasing resources is not possible, or if you cannot stop and start Run Control, see the RunControl tab on the Troubleshooting page.

Note: when you release resources, windows may begin to disappear. This is normal--it's just a sign that the resources associated with those windows are being released!

RunControl Disconnects:

Follow the procedures for recovery as shown on the RunControl Troubleshooting page. If you have not recently done so, please review this document on the roles Run Control, Resource Manager, and DAQApplication Manager have in controlling the system.

Once the DAQ is up and running again, email Jon Paley with as much information as possible about the crash, including the run number(s), the range of ECL entries relating to the crash, and its solution.

Run Control crash:

Follow the procedures for recovery as shown on the RunControl Troubleshooting page. Again, if you have not recently done so, please review this document on the roles Run Control, Resource Manager, and DAQApplication Manager have in controlling the system.

Once the DAQ is up and running again, email Jon Paley with as much information as possible about the crash, including the run number(s), the range of ECL entries relating to the crash, and its solution.

Anything else

See the shift Troubleshooting page. It is quite comprehensive. If you encounter an error mode that is not described on this page, please contact your DAQ expert. If it takes you longer than 15 minutes to solve the problem--including any time invested in reading the Troubleshooting page--call the on-call DAQ expert.

 
 
 

High temperature in the Computer Room at Ash River:

Alert message: "Temperature Event Detected!"

If you see a pop-up warning window on the DAQ-01,02 screen with "Temperature Event Detected! If the temperature continues to rise, Run Control will be shutdown!", follow the instructions in DocDB-11381.

 
 
 

Dropping FEBs

10 or more FEBs dropped out on FarDet

If you see 10 or more FEBs dropped out on FarDet Nearline, or showing up as white FEBs on the OnMon display, you should issue an Enable FEB flow using the green button on the TDU Control Interface. Record this in the ECL.

3 or more FEBs dropped out on NearDet

If you see 3 or more FEBs dropped out on NearDet Nearline, or showing up as white FEBs on the OnMon display, you should issue an Enable FEB flow using the green button on the TDU Control Interface. Record this in the ECL.

If you see a whole DCM missing, or many short tracks that end on DCM boundaries

This may indicate that the detector is out of sync, which can be addressed by issuing a hardware SYNC. Issue a SYNC using the red 'Sync detector to GPS' button on the TDU Control Interface ONLY if you see a whole DCM missing or many short tracks that end on DCM boundaries. This mainly happens when a new Run Control run is started (after resources are reserved, at the beginning of a new run, it may take up to 5 minutes for the FarDet to stabilize).

After issuing a sync, wait a couple of minutes while watching the Event Display and OnMon to confirm that tracks go through the previously empty DCM. If this does not help within a few minutes, you may issue a sync again; if that does not help either, call the DAQ expert.

Other problems

If you experience some problem related to dropping FEBs that does not fall into the two categories above, and if the above actions have no effect (watch Event Display, OnMon or for a Nearline update), please contact the on-call DAQ expert.
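The decision rules in this section can be summarized in a short Python sketch. The thresholds (10 FEBs for FarDet, 3 for NearDet) and the two button names come from the text; the function name, the "Keep monitoring" fallback for counts below threshold, and the choice to check the sync condition first are assumptions made for illustration.

```python
# Dropout thresholds quoted in the text above.
FEB_DROPOUT_THRESHOLD = {"FarDet": 10, "NearDet": 3}


def recommended_action(detector: str, dropped_febs: int,
                       whole_dcm_missing: bool = False,
                       short_tracks_at_dcm_boundaries: bool = False) -> str:
    """Suggest which TDU Control Interface action applies (illustrative only)."""
    if detector not in FEB_DROPOUT_THRESHOLD:
        raise ValueError(f"unknown detector: {detector!r}")
    # A missing DCM or tracks ending on DCM boundaries suggests loss of sync.
    if whole_dcm_missing or short_tracks_at_dcm_boundaries:
        return "Sync detector to GPS"
    if dropped_febs >= FEB_DROPOUT_THRESHOLD[detector]:
        return "Enable FEB flow"
    # Assumption: below threshold, no manual action is described in the text.
    return "Keep monitoring"


print(recommended_action("FarDet", 12))
print(recommended_action("NearDet", 1, whole_dcm_missing=True))
```

Whichever action is taken, it still needs to be recorded in the ECL, and the on-call DAQ expert remains the fallback when neither action has an effect.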

Keeping records

When reporting on FEB dropouts in the ECL, make certain to record all symptoms noticed and all actions taken to correct the problem. State explicitly whether you chose "Enable FEB flow" or "Sync detector to GPS" at each step. This will help us to track the effectiveness of each under certain failure modes.

 
 
 

Message Analyzer:

Window message: FEB Timestamp diff error detected! You should issue a sync

If you see the message "FEB Timestamp diff error detected! You should issue a sync" in the NOvA Message Analyzer window, issue a SYNC using the big RED button "Sync Detector to Current Time (Hardware)" in the TDU Control Interface GUI window (for the given partition). Afterwards, reset all Rules (r1-r6) in the Message Analyzer window (Rst column). Note: this applies ONLY to this ONE specific error, "FEB Timestamp diff error detected".

To disable an alarm in the Message Analyzer

In the rare situation that you need to disable an alarm, you can do the following. An example is when there are no accelerator 1 Hz triggers and you want to remove the orange pop-up. In the Message Analyzer, click on the box to the left of the alarm (for spill server alarms, 'warnNoSpillServer Spill Server Problem'). Then, at the top right of that pane, click on Actions and then on Disable selections. You can enable an alarm from the same menu in a similar way.

 
 
 

Synoptic Displays:

Synoptic displays are not running

The synoptic displays (FarDet & NearDet) should be running in the VNC session on novacr03. These displays provide information on power supplies for the Near and Far detectors, the dry gas systems, alarms, etc. (Still confused as to what kind of displays you should be seeing? See this page, navigate to the Nova tab, and click on some of the display options listed there.) If the Synoptic Displays are not running, use the Connect to Synoptic VNC icon on the nova@nova-cr-03 desktop. After the VNC session is launched, you should see the displays; if not, start them by clicking the Start Synoptic Viewer icon twice, once each for the NearDet and FarDet displays.

 
 
 

Recovering the computers in the ROC-West:

If the computers in ROC-West were restarted due to a power outage, or need to be recovered for any other reason, follow this document in DocDB.

 
 
 

OnMon & EvD:

Which datadisk should I choose?

When starting OnMon and Event Display on FarDet, a new popup box appears that asks you to pick which disk to use - select "datadisk-4" for the FD and "datadisk-1" for the ND.

I can't start OnMon or the EVD

If you can't start either OnMon or the EVD, try renewing the Kerberos ticket before calling an expert.

 
 
 

DDT trigger problems:

For information on how to run DDT and solve problems, see Shifter_Instructions.

 
 
 

Recent changes (experienced shifters especially, take note!)

Auto StartDAQs:

We are now running automatic StartDAQs (automatic enable FEB commands) on both detectors; these should begin automatically at the start of a run. You should therefore see a green LED indicator called "Auto StartDAQ" in the TDU Control Interface window (on both detectors). If this indicator is red, start "Auto StartDAQ" manually: go to the 'Timing' menu in the TDU Control window and select 'Auto StartDAQ'; in the pop-up window that appears, check the 'Enable AutoStartDAQs' box and click OK. The LED indicator should turn green. The values are 1200s for FarDet and 600s for NearDet.

When the indicators are green, this means shifters do NOT normally have to manually issue FEBenables (startDAQ commands) or SYNC commands (red and green buttons in TDU Control Interface Gui window).

Choosing detector type for ND DAQ and computing checklist forms:

The DAQ and computing checklist has been updated to be both a NearDet and FarDet checklist form. This means you will need to fill out this form twice per shift, once per detector. There is a drop-down option to choose which detector you are filling out the form for.

 
 
 

Configurations

Partitions & Cooling
> FarDet: Partition 1 - Diblocks 01-14, HV ON, COOLED. We use the Cold Configuration for the APD settings in the CSS APD Temperature Monitor window.
> NearDet: Partition 1 - Diblocks 01-04, HV ON, COOLED. We use the Cold Configuration for the APD settings in the CSS APD Temperature Monitor window.

TDU timing chain in use :
> FarDet: FD chain 2 (TDU-Master-ARM-02)
> NearDet: ND chain 2 (tdu-near-master-arm-02)

DAQ Named Configurations to use in FarDet and NearDet runs :
> FarDet: Partition1 - FarDetGlobalConfigP1
> NearDet: Partition1 - NearDetGlobalConfigP1

Resource List
> FarDet:
>> Managers: ConfigurationManager, DDTManager, DataLogger, EventDispatcher4, GlobalTrigger, MessageAnalyzer, MessageFacilityServer, MessageViewer, RunControlServer, SNEWSMessage, SpillServer, TDUManager, TriggerScalrs4
>> BNEVB NEW Lists: bng07 (061-070), bng08 (071-080), bng09 (081-090), bng10 (091-100), bng11 (101-110), bng12 (111,112,114-120), bng13 (121-130), bng14 (131-140), bng15 (141,142-150), bng16 (151-153,155-160; disabled 154), bng17 (161-170), bng18 (171-180), bng19 (181-190), bng20 (191-195,197,198; disabled 196,199,200) - in total 134 buffer nodes in read-out.
>> BNEVB OLD-safe Lists: bng07 (061-070), bng08 (071-080), bng09 (081-090), bng10 (091-100), bng11 (101-110), bng12 (111,112,114-120), bng13 (121-130), bng14 (131-140)
>> Timing chains: ALL - DiB-{01-14}{s,t}
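The buffer node groups above are written in a compact notation such as "111,112,114-120". A minimal Python sketch for expanding one such entry into individual node numbers is shown below; the function name is hypothetical and the sketch assumes each entry contains only comma-separated numbers and inclusive ranges (the "disabled ..." notes are not handled).

```python
def expand_nodes(spec: str) -> list[int]:
    """Expand a spec like '111,112,114-120' into a list of node numbers."""
    nodes: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            # Ranges in the resource list are inclusive at both ends.
            nodes.extend(range(int(first), int(last) + 1))
        else:
            nodes.append(int(part))
    return nodes


# Example: the bng12 entry from the list above.
print(expand_nodes("111,112,114-120"))
```

Applied to the bng12 entry, this yields nine node numbers, with 113 absent from the group.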
