Starting and stopping DAQ SW applications on the DAQ cluster

08-Oct-2010, KAB

First some caveats:
  1. In the medium term, I'd like us to have a graphical interface to monitor, start, and stop the DAQ applications. However, we're not at that point yet. The scripts described below are incremental steps toward such a tool, and I hope that they are sufficient for the short term.
  2. Also in the medium term, we should switch to using the database to store the information about which processes should run on which nodes. Currently, this information is stored in XML files.
  3. I'd also like to have a graphical tool that could assist experts in defining which applications should run where. This GUI is probably a lower priority than the start/stop one, though.
  4. Currently, the applications and scripts that are used to start and stop DAQ applications are stored in a CVS module (Online/pkgs/DAQOperationsTools). I'm not sure that this is the right choice, but it is a convenient one, since it lets us leverage the SRT build environment. Note that this CVS package is not intended to be included in base releases of the DAQ software.
  5. The script and alias names listed below are not etched in stone, and better suggestions are welcome.
  6. Currently, a Message Facility msgserver is not started automatically. We need to add this soon.

Setup Instructions

To prepare for starting and stopping DAQ applications on the DAQ cluster, please use the following steps:
  1. log into novadaq-ctrl-master as user novadaq
  2. run 'source DAQOperationsTools/novadaq_setup.sh'
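As a quick sanity check after these two steps, a small bash function can confirm that the operations commands described in the next section can actually be found by your shell. This is a minimal sketch under stated assumptions, not part of DAQOperationsTools: the function name missing_daq_commands is hypothetical, and it assumes an interactive bash shell in which novadaq_setup.sh has already been sourced.

```shell
# Hypothetical helper (not part of DAQOperationsTools): print every name given
# as an argument that the current shell cannot resolve as a command, alias,
# or function. Returns 0 if all names resolve, 1 otherwise.
missing_daq_commands() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}

# Example use on novadaq-ctrl-master, after the setup steps above:
#   source DAQOperationsTools/novadaq_setup.sh
#   missing_daq_commands checkDDSEverywhere.sh stopDDSEverywhere.sh \
#       startSystem checkSystem stopSystem restartSystem \
#     || echo "setup incomplete"
```

If nothing is printed, all of the listed commands are available in the current shell.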

Available Commands

Once the setup instructions have been run, the following commands are available:
  1. checkDDSEverywhere.sh
    • The name of this script is slightly misleading. What it actually does is check whether the DDS daemons are running on all of the farm and "control" nodes in the cluster. It does not check whether DDS daemons are running on real DCMs. (This is not currently a big problem, since the DDS daemons on real DCMs are started and stopped on an as-needed basis.)
    • This script checks a fixed set of hosts. It does not look at which nodes are part of the current deployment configuration.
    • The script uses RGANG to run 'ospl list' on the cluster nodes. The output shows each node name followed by the result of 'ospl list' on that node. When DDS is running, you will see a message saying "Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running". When it is not running, there is no such message. Example outputs for both cases are included at the end of this page.
  2. stopDDSEverywhere.sh
    • The name of this script is slightly misleading. What it actually does is shut down the DDS daemons on all of the farm and "control" nodes in the cluster. It does not stop any DDS daemons that are running on real DCMs. (Stopping DDS daemons on real DCMs is done on an as-needed basis by the stopSystem command described below.)
    • This script acts on a fixed set of hosts. It does not look at which nodes are part of the current deployment configuration.
    • If shutting down the DDS daemons fails on a particular host, the right thing to do is probably to log into that host as novadaq, run 'source DAQOperationsTools/novadaq_setup.sh', and try 'ospl stop' by hand.
    • There is currently no corresponding startDDSEverywhere.sh script. Instead, cron jobs on each of the farm and control nodes check every five minutes whether the DDS daemons need to be started, and start them if needed.
  3. startSystem
    • This command starts all of the needed DAQ applications based on the deployment configuration files stored in /nova/config/NDOS/appmgr.
    • The DAQ applications are started in parallel, and the command waits until all operations have completed or 30 seconds have passed, whichever comes first. If the command times out while operations are still in progress, it kills them and prints a summary of what happened.
    • Please note that when using real DCM hardware, it is "normal" for the startup of the DDS daemons and the DCMApplication on each DCM to time out. This is due to a problem in getting 'ospl start' to return gracefully when it is run using rgang.py. If you see such timeouts, you can inspect the verbose output to verify that everything really was started successfully, and/or run the checkSystem command to verify that the DCMApplication was started correctly.
    • By default, this command prints a summary of whether the various applications were started successfully; specifying the "-v" (verbose) option generates more output.
  4. checkSystem
    • This command tests all of the expected DAQ applications based on the deployment configuration files stored in /nova/config/NDOS/appmgr. The test sends an RMS message to each application and verifies that a valid response is returned.
    • The DAQ applications are tested in parallel, and the command waits until all operations have completed or 30 seconds have passed, whichever comes first. If the command times out while operations are still in progress, it kills them and prints a summary of what happened.
    • By default, this command prints a summary of whether the various applications responded successfully; specifying the "-v" (verbose) option generates more output.
  5. stopSystem
    • This command stops all of the running DAQ applications based on the deployment configuration files stored in /nova/config/NDOS/appmgr.
    • The DAQ applications are stopped in parallel, and the command waits until all operations have completed or 30 seconds have passed, whichever comes first. If the command times out while operations are still in progress, it kills them and prints a summary of what happened.
    • By default, this command prints a summary of whether the various applications were stopped successfully; specifying the "-v" (verbose) option generates more output.
  6. restartSystem
    • Runs stopSystem and then startSystem.
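startSystem, checkSystem, and stopSystem all follow the same pattern: launch the operations in parallel, wait up to 30 seconds, kill anything still running, and print a summary. The bash sketch below illustrates that pattern only. It is not the DAQOperationsTools implementation (which drives the remote hosts through RGANG/rgang.py), and the function name and the OK/FAILED/TIMEOUT labels are hypothetical.

```shell
# Illustrative sketch of the "run in parallel, wait up to N seconds, kill
# stragglers, print a summary" pattern. Hypothetical; not DAQOperationsTools code.
# Usage: run_parallel_with_timeout SECONDS "command 1" "command 2" ...
run_parallel_with_timeout() {
  local timeout=$1; shift
  local -a pids=() names=()
  local cmd i alive waited=0

  # Launch every command in the background and remember its PID.
  for cmd in "$@"; do
    bash -c "$cmd" >/dev/null 2>&1 &
    pids+=("$!")
    names+=("$cmd")
  done

  # Poll once per second until all commands finish or the timeout expires.
  while :; do
    alive=0
    for i in "${!pids[@]}"; do
      if kill -0 "${pids[$i]}" 2>/dev/null; then alive=1; fi
    done
    if [ "$alive" -eq 0 ] || [ "$waited" -ge "$timeout" ]; then break; fi
    sleep 1
    waited=$((waited + 1))
  done

  # Kill anything still running, then report each command's outcome.
  for i in "${!pids[@]}"; do
    if kill -0 "${pids[$i]}" 2>/dev/null; then
      kill "${pids[$i]}" 2>/dev/null
      wait "${pids[$i]}" 2>/dev/null || true
      echo "TIMEOUT: ${names[$i]}"
    elif wait "${pids[$i]}"; then
      echo "OK: ${names[$i]}"
    else
      echo "FAILED: ${names[$i]}"
    fi
  done
}
```

For example, `run_parallel_with_timeout 30 "true" "sleep 60"` reports the first command as OK and the second as TIMEOUT after about 30 seconds. Killing stragglers rather than waiting indefinitely is what makes a hung remote start (such as the DCM DDS case noted above) show up as a timeout instead of blocking the whole command.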

Starting Run Control and the Message Viewer

In the system testing that I've done so far, the Run Control GUI (rcWindow) and the Message Facility GUI (msgviewer) have been started manually. This allowed me to start these GUIs on whichever display I wanted. Starting these applications manually also allows us to run as many msgviewer instances as we want.

To start these applications, use the following steps:
  1. follow the setup instructions listed above
  2. run 'msgviewer' to start an instance of the message viewer
  3. run 'rcWindow' to start an instance of Run Control
    • OR run 'rcWindow -e <cmd_file_name>' to start an instance of Run Control with a list of commands that are run automatically when Run Control starts

As of 07-Oct-2010, the following Run Control command files are available:
  1. DCSMode_2Realx4.cmd
  2. DCSMode_3Realx4.cmd
  3. PatternData_1Realx4.cmd
  4. PatternData_2Realx4.cmd
  5. PatternData_3Realx4.cmd
  6. SDPMode_2Realx4.cmd
  7. SDPMode_3Realx4.cmd
  8. SimMode1_3x4.cmd
  9. SimMode1_4x4.cmd
  10. DCSMode_WithReservedResources.cmd
  11. PatternData_WithReservedResources.cmd
  12. SDPMode_WithReservedResources.cmd
  13. SimMode1_WithReservedResources.cmd

The first 9 of these command files reserve fixed DCMs and BNEVBs. The last 4 assume that the user has already selected and reserved the resources that he/she wants to use (and set SimMode=1 for the last one).

The first set has the advantage that these files can be specified when you start up Run Control (e.g. "rcWindow -e DCSMode_3Realx4 &"), but they force you to use predefined buffer nodes and DCMs.

The second set has to be run from the Run Control GUI after you have started Run Control (and selected and reserved resources), but these files are more reusable, since the buffer nodes and DCMs are not predefined.
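The rule above (fixed-resource files can be passed with -e at startup, while the *_WithReservedResources files must be run from the RCGUI after reserving resources) could be captured in a small bash wrapper. This is a hypothetical sketch: the function name start_run_control and the RCWINDOW override variable are assumptions, and the command file name is passed to rcWindow -e exactly as given.

```shell
# Hypothetical wrapper around rcWindow (not part of DAQOperationsTools).
# RCWINDOW may be set to substitute another program, e.g. for a dry run.
start_run_control() {
  local rc=${RCWINDOW:-rcWindow}
  case "${1:-}" in
    "")
      # No command file: just start Run Control in the background.
      "$rc" &
      ;;
    *_WithReservedResources|*_WithReservedResources.cmd)
      # These files assume resources are already reserved in the RCGUI.
      echo "start Run Control first, reserve resources, then run $1 from the RCGUI" >&2
      return 1
      ;;
    *)
      # Fixed-resource files can be run automatically at startup.
      "$rc" -e "$1" &
      ;;
  esac
}

# Example: start_run_control DCSMode_3Realx4.cmd
```

Refusing the *_WithReservedResources files up front avoids starting Run Control with a command file that expects a partition which has not been reserved yet.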

Application Deployment Configuration Files

These files are currently stored in /nova/config/NDOS/appmgr. The relevant files are the following:
  • ProcessList.xml - this file contains the map of which application will run on which host
  • HostList.xml - the list of available hosts (e.g. "novadaq-ctrl-farm-01" and "dcm-06")
  • ApplicationTypeList.xml - the list of supported application types (e.g. "Buffer Node EVB" and "DCM Application")

I expect the Host and ApplicationType files to be relatively static. The ProcessList file will be the one that we change most as we move applications around on the cluster.
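Since ProcessList.xml is the file that changes most often, it may be worth keeping a copy before each edit. The bash function below is a hypothetical helper, not an existing tool: it copies ProcessList.xml to a timestamped backup, opens it in $EDITOR (vi if unset), and, if xmllint happens to be installed, checks that the edited file is still well-formed XML. It does not check the content against anything the DAQ software expects.

```shell
# Hypothetical helper: back up, edit, and sanity-check ProcessList.xml.
# Usage: edit_process_list [config_dir]   (default: /nova/config/NDOS/appmgr)
edit_process_list() {
  local dir=${1:-/nova/config/NDOS/appmgr}
  local file="$dir/ProcessList.xml"
  local backup

  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi

  # Keep a timestamped copy so the previous deployment can be restored.
  backup="$file.$(date +%Y%m%d-%H%M%S).bak"
  cp -p "$file" "$backup" || return 1
  echo "backup: $backup"

  "${EDITOR:-vi}" "$file" || return 1

  # Well-formedness check only; this does not validate the DAQ schema.
  if command -v xmllint >/dev/null 2>&1; then
    if ! xmllint --noout "$file"; then
      echo "$file is not well-formed XML; restore it from $backup" >&2
      return 1
    fi
  fi
}
```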

Log File Locations

Log files for applications running on x86 hosts are available under /daqlogs (accessible from any node in the cluster). For DCM logs, log into the appropriate DCM and look under /daqlogs on that DCM.

Log file paths have the form /daqlogs/NDOS/<applicationType>/<hostname>/<applicationName><timestamp>.log.
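To find the most recent log for a given application type and host, something like the following bash function could be used. It is a hypothetical sketch that relies only on the path layout described above. The exact <applicationType> directory names are not listed on this page, so substitute the real ones. For DCM logs, run it on the DCM itself.

```shell
# Hypothetical helper: print the most recently modified .log file under
#   <root>/NDOS/<applicationType>/<hostname>/
# Usage: newest_daq_log APPLICATION_TYPE HOSTNAME [root]   (root: /daqlogs)
newest_daq_log() {
  local app_type=$1 host=$2 root=${3:-/daqlogs}
  local dir="$root/NDOS/$app_type/$host"
  local newest

  if [ ! -d "$dir" ]; then
    echo "no log directory: $dir" >&2
    return 1
  fi

  # ls -t sorts by modification time, newest first.
  newest=$(ls -1t "$dir"/*.log 2>/dev/null | head -n 1)
  if [ -z "$newest" ]; then
    echo "no .log files in $dir" >&2
    return 1
  fi
  echo "$newest"
}

# Example (the application type directory name is a placeholder):
#   tail -f "$(newest_daq_log "$APP_TYPE" novadaq-ctrl-farm-01)"
```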

A Sample Data-Taking Session

Here is a list of steps that might be used during a typical data-taking session.
  1. follow the setup instructions listed above
  2. run stopSystem to stop any stale application instances
  3. stop all of the DDS daemons running on the cluster using the stopDDSEverywhere.sh script
  4. wait 5 minutes for the cron jobs on the cluster to restart the DDS daemons
  5. verify that the DDS daemons were restarted using the checkDDSEverywhere.sh script
  6. if desired, edit /nova/config/NDOS/appmgr/ProcessList.xml to change the locations for various applications (e.g. switch DCMApplication instances from x86 to PPC hardware)
  7. run startSystem to start the DAQ applications on the relevant hosts (recall that the commands to start applications on real DCMs may time out; this is probably just a side effect of the way that we are currently starting DDS on real DCMs)
  8. run checkSystem to verify that the DAQ applications are running on the relevant hosts (the GT and DL may be erroneously reported as not running when they really are; I need to talk to Alec and Steve about tweaking their code to enable this monitoring)
  9. run "msgviewer &" to start the Message Facility viewer
  10. run "rcWindow &" to start the RCGUI
  11. select the DCMs and BNEVBs that you want in your partition using the RCGUI, and reserve them
  12. run one of the following command files from the command line input in the RCGUI:
    • DCSMode_WithReservedResources.cmd
    • PatternData_WithReservedResources.cmd
    • SDPMode_WithReservedResources.cmd
  13. stop the run whenever you want using the RCGUI
  14. click on the Break Connections, Detach, and Release Resources buttons in the RCGUI
  15. quit Run Control using File->Quit in the RCGUI
  16. stop all of the distributed applications using stopSystem
  17. repeat steps 6-16 as needed (restarting the DDS daemons is worth doing occasionally, and whenever things are really in a mess, but it doesn't need to be done every time)
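Steps 2 to 8 of this session are command-line operations, so the non-optional ones could be collected in a bash function like the hypothetical one below. It assumes the setup script has already been sourced (step 1), leaves the optional ProcessList.xml edit (step 6) and all RCGUI steps to the user, waits the full five minutes for the cron jobs rather than polling, and does not check exit statuses, since DCM startup timeouts and GT/DL false negatives are currently expected; read the output of each command yourself. Setting DRY_RUN=1 prints the commands instead of running them. The function names are illustrative only.

```shell
# Hypothetical wrapper for steps 2-8 of the sample session (step 6, editing
# ProcessList.xml, is left to the user). Assumes novadaq_setup.sh was sourced.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_step() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

prepare_daq_session() {
  run_step stopSystem              # step 2: clear stale application instances
  run_step stopDDSEverywhere.sh    # step 3: stop DDS on farm and control nodes
  run_step sleep 300               # step 4: cron jobs restart DDS within 5 minutes
  run_step checkDDSEverywhere.sh   # step 5: confirm DDS is running again
  run_step startSystem             # step 7: DCM timeouts here may be harmless
  run_step checkSystem             # step 8: GT and DL may be misreported
}
```

Running `DRY_RUN=1 prepare_daq_session` first is a cheap way to review the sequence before committing to the five-minute wait.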

Please note that if a previous data-taking session did not cleanly release the DCMs and BNEVBs, you may need to edit /nova/config/nova_daq_resource_manager.dat and set the partition for all resources to -1 (negative one).

Sample checkDDSEverywhere.sh output when DDS daemons are running

[novadaq@novadaq-ctrl-master ~]$ checkDDSEverywhere.sh
novadaq@novadaq-ctrl-farm-01= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-02= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-03= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-04= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-05= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-06= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-07= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-08= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-09= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-10= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-11= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-12= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-13= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-14= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-15= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-farm-16= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-datalogger= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-datamon= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-master= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-msglogger= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-runcontrol= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

novadaq@novadaq-ctrl-trigger= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
Splice System with domain name "NOvA DAQ NDOS DDS Domain" is found running

Sample checkDDSEverywhere.sh output when DDS daemons are NOT running

[novadaq@novadaq-ctrl-master ~]$ checkDDSEverywhere.sh
novadaq@novadaq-ctrl-farm-01= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-02= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-03= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-04= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-05= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-06= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-07= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-08= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-09= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-10= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-11= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-12= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-13= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-14= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-15= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-farm-16= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-datalogger= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-datamon= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-master= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-msglogger= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-runcontrol= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled

novadaq@novadaq-ctrl-trigger= Setting Up the NOVA-DAQ Environment
NOVA-DAQ Environment Enabled
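With this many nodes, it can be tedious to scan the output by eye. The awk-based bash function below is a hypothetical sketch (not part of DAQOperationsTools) that reads checkDDSEverywhere.sh output on stdin and prints the name of every host whose block lacks the "is found running" message. It assumes the output format shown in the two samples above: a "user@host= ..." header line per node, followed by that node's 'ospl list' output.

```shell
# Hypothetical filter: list hosts on which DDS does not appear to be running.
# Usage: checkDDSEverywhere.sh | dds_down_hosts
dds_down_hosts() {
  awk '
    # Header line for each node, e.g. "novadaq@novadaq-ctrl-farm-01= Setting ..."
    /^[^ ]+@[^ =]+= / {
      if (host != "" && !running) print host
      host = $1
      sub(/^[^@]*@/, "", host)   # drop the "novadaq@" prefix
      sub(/=$/, "", host)        # drop the trailing "="
      running = 0
      next
    }
    /is found running/ { running = 1 }
    END { if (host != "" && !running) print host }
  '
}
```

Applied to the "NOT running" sample above, this prints every host from novadaq-ctrl-farm-01 through novadaq-ctrl-trigger; applied to the "running" sample, it prints nothing.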