Start a data taking run » History » Version 91

Chad Johnson, 12/08/2011 11:45 PM


Previous Page Main Page What to do while on shift DAQ Trouble Shooting Guide

Start a data taking run

updated 2011-12-02

These instructions are for starting a run when Run Control hasn't been started, or is in an uncooperative state from a previous run. For starting a new run after cleanly finishing another, see here.

Under normal circumstances, skip down to the Normal Data Taking section below.

First make sure that all of the hardware is powered on. This includes the data concentrator modules, timing distribution units and front end boards. Check the Detector Controls monitoring interface for this information.

UPDATE (makes most sense to people who have shifted previously): We DO NOT like to power-cycle DCMs. We only do it when we have exhausted our other options. The order in which we like to try things:
  1. Restart the run within Run Control (RC). If something goes wrong while running, try "End run", "Break Connections", "Detach from Partitions" from RC (try a reset in the execute line if these steps fail). Then restart the process that is causing trouble from the Application Manager and proceed with run taking. This option often fails, so it is not mandatory to try it if you have had bad experiences with it.
  2. Next, do a stopRunControl.sh, stopSystem, startSystem, startRunControl.sh (these are explained fully below).
  3. If that doesn't work, try to reboot DCMs. You can reboot just the problematic DCMs or all of them at once. To reboot a single DCM, right-click on the DCM in question in the Application Manager and select the reboot option. If this option is unavailable, a power cycle is most likely needed. To reboot all DCMs, do a stopSystem followed by reboot-dcm-all from the command line. Then do startSystem, etc.
  4. If this doesn't work, see experts, because it is unexpected behavior. Next steps will likely be power-cycling the DCMs, dealing with the TDU, etc.
  5. If problems persist after these reboots and power cycles, DDS may need a restart. After a stopSystem, run stopDDSEverywhere.sh from the command line. startDDSEverywhere.sh will then restart this server, and the normal run start procedure can be followed.

Also, please see DAQApplicationManager for tips on using this GUI to tell what is going wrong and for how to restart a DCM through the GUI instead of by command line.

So, to begin the process after your badly-ended run:

On novadaq-ctrl-master, execute the following commands in a terminal as user "novadaq" (command: ssh master):
  1. source /home/novadaq/DAQOperationsTools/novadaq_setup.sh (aliased as setup_online; you only need to do this once, right after the terminal is opened)
  2. stopRunControl.sh
  3. restartSystem (OR do stopSystem followed by startSystem. This is exactly the same as restartSystem, but if the system didn't stop cleanly after stopSystem, you can deal with it right away instead of having to wait for startSystem before you take further action.)
    1. If you see a message containing something like
       ................X11 connection rejected because of wrong authentication.
       .............................### Forcibly stopping a BackgroundProcess...
       at this point, don't panic and just (re)issue the command: restartSystem
  4. Check the Application Manager to validate that all processes are running.
  5. startRunControl.sh (ignore the message "klist: You have no tickets cached")
    1. If you are starting a data taking run after a pedestal run, you need to restart the TDU Control Client. To do this, click on "Config" in the TDU Control Interface window and select "Restart Runcontrol listen".
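
The command order above can be rehearsed as a dry run before touching the real system. This is only a sketch: recover_daq and the RUNNER variable are hypothetical names introduced here, and it assumes setup_online has already been sourced in the terminal. By default it prints each command instead of running it; whether stopRunControl.sh, restartSystem and startRunControl.sh report failure through their exit status is not stated on this page, so the "|| return 1" checks are an assumption.

```shell
# Dry-run sketch of the recovery order listed above.
# RUNNER is a hypothetical knob: unset means "echo" (print only);
# set it to an empty string to actually execute the commands.
recover_daq() {
    run="${RUNNER-echo}"
    $run stopRunControl.sh || return 1
    $run restartSystem || return 1
    # Manual step: this check is done by eye in the GUI, not by the script.
    echo "Check the Application Manager: all processes should be running."
    $run startRunControl.sh
}
```

Calling recover_daq with RUNNER unset prints the three commands and the manual check in order, which makes it easy to confirm the sequence before typing it by hand.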

If everything goes without errors and there is nothing to investigate, you should see the following after stopSystem:

.........................
************************************************************
Successfully stopped all 32 applications.
************************************************************

If instead you get some error messages followed by something like:

************************************************************
ERROR: 1 of 32 expected applications did NOT stop.
************************************************************

this means more steps are needed; see the DCM rebooting section below.
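
If you have saved the stopSystem output to a variable or file, the two banners shown above can be told apart mechanically. A minimal sketch, assuming the banner wording is exactly as printed on this page; classify_stop_output is a hypothetical helper name, not part of the DAQ tools.

```shell
# Classify captured stopSystem output by its banner line.
# Prints "ok <total>", "failed <n> of <total>", or "unknown".
classify_stop_output() {
    # $1: text captured from stopSystem
    printf '%s\n' "$1" | awk '
        /^ERROR: [0-9]+ of [0-9]+ expected applications did NOT stop\./ {
            print "failed " $2 " of " $4; found = 1; exit
        }
        /^Successfully stopped all [0-9]+ applications\./ {
            print "ok " $4; found = 1; exit
        }
        END { if (!found) print "unknown" }
    '
}
```

An "unknown" result means neither banner was found, which is itself a reason to read the full output rather than trust the summary.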

Problem solving with DCM rebooting (if a DCM is reporting errors)

As a first try, do a dcm reboot. This is NOT the same as power-cycling the dcm's. To do a dcm-reboot:

Right-click on the problem DCM in DAQApplicationManager. A window will pop up - choose "Reboot DCM". A series of windows will pop up asking if you want to continue with the reboot process - click yes until you get the window saying the reboot was successful. At this point the DCM should look red. Once it goes back to pink you know it has rebooted. See DAQApplicationManager for more information.

If for some reason you do not want to do this through the GUI, you can reboot the problem DCM's through the command line as well:

reboot-dcm-1-1

for DCM 1-1 (aka dcm 6), or substitute your desired DCM. Remember to use the position name, not the hardware name (i.e. 1-1, not 06). Validate the state of the system in the Application Manager. The current theory is that if you have just one problem DCM, you should reboot JUST that one instead of all of them. So, if possible, reboot one at a time; only when everything is badly broken should you reboot them all.
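
Mixing up position names and hardware names is an easy mistake, so a small guard can help. The sketch below only builds the command name and refuses anything that does not look like a position name; reboot_dcm_command is a hypothetical helper, and the single-digit pattern is an assumption based on the positions mentioned on this page (1-1 through 4-2).

```shell
# Build the per-DCM reboot command name from a position name such as 1-1.
# Rejects hardware-style names such as 06. Does not run anything.
reboot_dcm_command() {
    case "$1" in
        [0-9]-[0-9])
            echo "reboot-dcm-$1"
            ;;
        *)
            echo "not a position name (expected something like 1-1): $1" >&2
            return 1
            ;;
    esac
}
```

For example, reboot_dcm_command 1-1 prints reboot-dcm-1-1, while reboot_dcm_command 06 prints an error and returns a non-zero status.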

IF you are going to reboot them all, you do:

reboot-dcm-all

BUT -- because three DCMs are currently skipped in the readout (in positions 3-2, 3-3, and 4-2), IF you ever reboot or power-cycle ALL the DCMs, there is an extra step required to deal with the skipped state. After they are all rebooted, you must click on the "Bypass DCM Timing" icon seen on the bottom left screen of nova-daq-1.

Now, after you have done your rebooting through the command line, you should check that the reboot worked and is finished. (This is required ONLY if you are doing this through the command line, not the GUI. The GUI color changing to pink corresponds to passing this test.) There is some lag time here -- your first attempts to check might fail but keep trying for a few minutes. To do this:

check-dcm-all

Or for the individual one:

check-dcm-1-1 etc. for your dcm name.
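
Since the first few checks may fail while the DCMs come back up, the "keep trying for a few minutes" advice can be written as a retry loop. This is a generic sketch: retry is a hypothetical helper, and it assumes the check command is an executable on the PATH that returns a non-zero exit status when the check fails, neither of which is confirmed on this page.

```shell
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # No need to sleep after the final failed attempt.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

Under those assumptions, something like retry 10 30 check-dcm-all would try for roughly five minutes before giving up.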

Once the check dcm works, you should see output like:

root@dcm-06= Mon Apr  4 14:06:43 CDT 2011
 14:06:43 up 0 min,  0 users,  load average: 0.08, 0.02, 0.01

root@dcm-08= Mon Apr  4 14:06:42 CDT 2011
 14:06:42 up 0 min,  0 users,  load average: 0.16, 0.03, 0.01

root@dcm-09= Mon Apr  4 14:06:42 CDT 2011
 14:06:42 up 0 min,  0 users,  load average: 0.08, 0.02, 0.01

root@dcm-11= Mon Apr  4 14:06:42 CDT 2011
 14:06:42 up 0 min,  0 users,  load average: 0.08, 0.02, 0.01

root@dcm-12= Mon Apr  4 14:06:42 CDT 2011
 14:06:42 up 0 min,  0 users,  load average: 0.00, 0.00, 0.00

root@dcm-13= Mon Apr  4 14:06:42 CDT 2011
 14:06:43 up 0 min,  0 users,  load average: 0.08, 0.02, 0.01
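
With many DCMs it is easy to miss one that did not answer. The sketch below reads check output in the format of the sample above and lists which DCMs responded, or which of the names you expected are missing. responding_dcms and missing_dcms are hypothetical helper names, and the list of expected hardware names is something you must supply yourself, since this page does not give the full position-to-hardware mapping.

```shell
# Print the hardware name (e.g. dcm-06) of every DCM that answered,
# reading check output in the format shown above from stdin.
responding_dcms() {
    awk '/^root@dcm-[0-9]+=/ {
        sub(/^root@/, "", $1)   # drop the user prefix
        sub(/=$/, "", $1)       # drop the trailing "="
        print $1
    }'
}

# Print each expected name (given as arguments) that did NOT answer.
missing_dcms() {
    seen=$(responding_dcms)
    for name in "$@"; do
        printf '%s\n' "$seen" | grep -qxF -e "$name" || echo "$name"
    done
}
```

Typical use would be to pipe the saved check output into missing_dcms followed by the names you expect; empty output means every expected DCM responded.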

After DCM's are in a happy place, continue with ...

Now you need to go back to startSystem. If this goes right, you should see:

..................................
************************************************************
Successfully started all 32 applications.
************************************************************

The following notes may apply if you have trouble with startSystem:
  • Repeat the steps above if problems persist.
  • Perhaps you want to reboot the DCMs again. Now might be a good time to call the experts.
  • Otherwise, you could also try a DCM power cycle. Details on how to do this are listed here.
  • ALTERNATELY: You may try to start an individual application on the problem machine from the Application Manager by right-clicking on the machine and selecting the start process option.
  • NOTE: A full system start and stop is not needed using this method.
Note: if you have rebooted the DCMs you will need to re-enable cooling (which is otherwise not necessary):
  • Click the "START Cooling" button on the "DCS Home" tab in the DCS-APD Temperature Monitor GUI.
  • The DCS/APD monitor window will be unresponsive for about 80 seconds after you 'START Cooling'.
  • Wait for the indicator light next to that button to turn green, in about 3 minutes
  • You can follow the cooldown status in detail under the NDOS Overview tab

WHEW !!! .... now let's recap what would normally happen if there are no DCM problems.
Skipping the above debugging discussion :

Normal Data Taking

  1. On the novadaq-ctrl-master terminal that is normally open
    • startSystem
    • startRunControl.sh
  2. Verify that the applications are running successfully in the Application Manager
    • You should see several GUIs (MsgViewer, Resource Manager, Resource Viewer and RCMainWindow).
  3. If you are starting a data taking run after a pedestal run, you need to restart the TDU Control Client.
    • Click on "Config" in the TDU Control Interface window.
    • Select "Restart Runcontrol listen".
  4. In the RC main window :
    1. Click on "Change" and put your name in as the person who started the run.
      • It will remember the last person entered as a default so you should only need to do it once at the beginning of your shift.
    2. There will likely be a Partition 0 tab in the Resource Manager window. It should not be there.
      • Right click on the tab and select release partition
    3. click on "Discover Resources"
    4. click on "Select Resources"; another window should pop open.
      • Click all the + icons, to get a full list of resources
      • Select all Buffer Nodes (bnevb01 through bnevb12.)
      • Select all Managers except SimulationManager:
        • ConfigurationManager
        • DataLogger
        • GlobalTrigger
        • TDUManager
      • Select all the DCMs listed in "tdu01" (eg, 1-1, 1-2, 1-3, 2-1, 2-2, 2-3, 3-1, 3-2, 3-3, 4-1, and 4-2)
      • Click on OK when the selections have been made.
      • If you do not see the resources, you might have to release them from "Partition 0" in the Resource Manager, see DAQ Trouble Shooting Guide for details.
    5. click on "Reserve Resources"
    6. click on "Establish Partition"
    7. click on "Prepare Connections"
    8. click on "Load Connections"
    9. click on "Make Connections"
    10. execute the following command in the "Execute command" text entry box:
      prepare_hardware DCMApplication DCSMode_DCMHwCfgNamedSet
      (and hit enter)
      • You can hit up-arrow with the cursor in the "Execute command" box to recall previous commands. Be careful to select the one you want.
      • When successful, the Load Hardware Config button will turn green
    11. click on "Load Hardware Config."
    12. click on "Configure Hardware"
      • this will take a couple of minutes to step through the DCM's
      • when done, the Prepare Run Config button will turn green
    13. execute the following command in the "Execute command" text entry box:
      prepare_run DCMApplication DCSMode_DCMRunCfgNamedSet GlobalTrigger NUMIw500ud0u_BNBw500ud0u_Cosmicsw500ur100 DataLogger Sample1
      (and hit enter)
      • when done, the Load Run Config button will turn green
    14. click on "Load Run Config."
    15. click on "Configure Run"
    16. Click the "Begin Run" button when you are ready to start the run !
      • Fill out a "Start Run" form in the ECL log.
  5. When the run has started, a "sync to current time" command WILL BE AUTOMATICALLY EXECUTED. If not, a manual sync can be issued to the Near Detector Master from the TDUControl window.
  6. Now you should be taking data. See How to know whether a run is producing (usable) data.
  7. Verify that you are now receiving NuMI, Booster and Calibration triggers on the scalars window.
    • If the TriggerScalars aren't updating new spills every few seconds then:
      • Verify that the beam is on ( check Channel 13 )
      • Attempt to restart the spill server again:
         startBeamSpills.sh
      • If you are still not receiving spill triggers after a few tries call an expert
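
The two long entries typed into the "Execute command" box in steps 10 and 13 above look like a verb followed by application/configuration pairs. That pairing is inferred from the shape of the commands on this page, not from Run Control documentation, so treat it as an assumption. Under that assumption, a small helper can assemble the string and catch a dropped argument before you paste it; rc_command is a hypothetical name and nothing here talks to Run Control.

```shell
# Assemble an "Execute command" string: a verb followed by
# application/configuration pairs. Fails if an argument is missing.
rc_command() {
    verb="$1"
    shift
    if [ "$#" -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
        echo "expected application/configuration pairs after $verb" >&2
        return 1
    fi
    # "$*" joins the remaining arguments with single spaces.
    echo "$verb $*"
}
```

For example, rc_command prepare_hardware DCMApplication DCSMode_DCMHwCfgNamedSet prints the step 10 command, and leaving off the last configuration name produces an error instead of a truncated command.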