Ganglia

Updated 2011-12-27

Overview

"Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids." We use it to monitor the health and performance of the various NOvA computers. The total DAQ cluster for a NOvA detector is a "Grid", which is subdivided into "Clusters". For more background on the NOvA implementation of Ganglia, see the DAQ Expert Wiki.

How to use Ganglia

What to look at

A large number of performance metrics are logged for each machine. The sections below explain how to start Ganglia, browse the plots, and create your own. Some of the more interesting metrics for judging the quality of data taking are:

  • Microslice rate
    • A microslice is a collection of all hits (nanoslices) for all Front End Boards (FEBs) on a DCM for a fixed time interval. Normally, a new microslice is produced every 50 μs, which corresponds to a rate of 20 kHz. Any deviation from this rate indicates a problem with the current run.
  • Microslice size
    • The size of a microslice with no data is 12 bytes. If the microslice size stays flat at 12 bytes for a DCM, that DCM is not contributing hits to the data.
    • A microslice size of more than a few hundred bytes will start to stress the DCM CPU.
  • load_one (the one-minute load average)
  • CPU report (currently broken - 2011-12-27 - Peter Shanahan)
  • Corrupt Microslice report
    • Various problems can cause the DCM to label a microslice as corrupt. Many of these are related to the beginning and ending timestamps of the microslice, so there's some correlation with high microslice rates.
    • An (atomic) resync can be tried, but starting a new run, going back at least to "Reprepare Hardware", will likely be needed.
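
The numbers quoted in the list above can be turned into a quick sanity check. The following is only a minimal sketch, not part of the DAQ or Ganglia software: the function names, the 1% rate tolerance, and the 300-byte "stress" threshold are assumptions chosen for illustration (the text only says "any deviation" and "a few hundred bytes").

```python
# Values taken from the text above.
MICROSLICE_PERIOD_US = 50.0      # one microslice every 50 microseconds
EMPTY_MICROSLICE_BYTES = 12      # size of a microslice with no data

# Assumed values, for illustration only.
STRESS_SIZE_BYTES = 300          # stand-in for "a few hundred bytes"
RATE_TOLERANCE = 0.01            # allow 1% jitter around the nominal rate


def nominal_rate_hz(period_us=MICROSLICE_PERIOD_US):
    """Convert the microslice period (microseconds) to a rate in Hz."""
    return 1e6 / period_us


def check_dcm(rate_hz, size_bytes):
    """Return a list of notes for one DCM; an empty list means nothing odd."""
    notes = []
    nominal = nominal_rate_hz()
    if abs(rate_hz - nominal) > RATE_TOLERANCE * nominal:
        notes.append("microslice rate deviates from nominal")
    if size_bytes <= EMPTY_MICROSLICE_BYTES:
        notes.append("DCM is not contributing hits")
    elif size_bytes > STRESS_SIZE_BYTES:
        notes.append("microslice size may stress the DCM CPU")
    return notes


if __name__ == "__main__":
    print(nominal_rate_hz())          # 20000.0, i.e. 20 kHz
    print(check_dcm(20000.0, 150))    # []
    print(check_dcm(18000.0, 12))     # two notes: rate and empty microslices
```

A check like this does not replace looking at the plots; it only restates the rules of thumb above in one place.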

Ideally there are no corrupt microslices, and all the other metrics stay roughly constant throughout a run. Please point out anything even slightly odd to a DAQ expert.

If something looks particularly heinous (rates in one DCM are way out of whack, say), ending the run and starting a new one may make things look better. Issuing a sync to current time may or may not be helpful. If you do, please make a note in the CRL, including whether you saw it have any effect.

The Checklist plots are also very important.

How to start Ganglia

If you're not in the control room, follow some of the links above.

If you are in the control room, look at the appropriate control room machine (nova-daq-03 for NDOS). If the Ganglia pages are not visible, start a new Firefox window by clicking on the toolbar icon. This should bring up tabs including the top-level Ganglia page, https://novadaq-ctrl-datamon.fnal.gov/ganglia/, and the Checklist page.

Using

At the top of the Ganglia page, you can make major selections, such as switching between the Main page and the Checklist plots.

You can drill down into a cluster (e.g., DCMs) by scrolling down the Main page and selecting the desired cluster on the left side of the page.

Once you select a cluster (e.g., DCMs), you will see several summary plots for that cluster, followed by a specific metric for each node in the cluster.

From either the Main page or a cluster view, you can select various parameters affecting the displays, such as the metric to display, the partition, the time window, and the sorting of nodes.

There are also some Bookmarks defined (at least on NDOS), which can be accessed from the bookmark toolbar in Firefox on nova-daq-03.

If you select "Checklist Plots", you'll see some plots with annotations that guide you through the shifter checklist.

Troubleshooting

  • Looking at a cluster view, but no nodes are shown? Try selecting a different partition or the "All" partition.
  • Various error messages and/or problems displaying any Ganglia view? Try closing and relaunching the browser.