Idea #5979

Provide DAQ monitoring similar to what is done with Ganglia for NOvA DAQ

Added by Kurt Biery over 6 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 04/21/2014
Due date:
% Done: 0%
Estimated time: 160.00 h
Experiment:
Duration:

Description

In the NOvA DAQ system, Ganglia is used to provide DAQ monitoring. ("DAQ monitoring" refers to watching over the health of the DAQ system and ensuring that it is performing well. This is in contrast to "Data Quality Monitoring", which monitors the quality of the data that is being taken and the health of the detector.)

The Ganglia plots provide information on the CPU usage, memory usage, network activity, etc. of the computers that are part of the DAQ system, and they also provide information on the performance of the DAQ system itself, such as event rates, the frequency of certain errors, and the sizes of events, buffers, and files. These latter quantities are provided by Ganglia custom metrics, which are defined, calculated, and provided in the experiment software. Having plots that allow users and experts to correlate DAQ performance with computer system performance can be very helpful in tracking down problems.
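As an illustration of what "defined, calculated, and provided in the experiment software" could look like, here is a minimal sketch of accumulating counters between reports and turning them into rate and size metrics. The class name `DaqMetricAccumulator`, the metric names, and the reporting interval are all hypothetical and are not taken from NOvA or artdaq code; the sketch only computes the values and does not talk to Ganglia.

```python
from dataclasses import dataclass


@dataclass
class DaqMetricAccumulator:
    """Hypothetical helper: accumulates per-event counters between reports."""

    event_count: int = 0
    total_bytes: int = 0
    error_count: int = 0

    def add_event(self, size_bytes: int, had_error: bool = False) -> None:
        """Record one event and, optionally, one error seen while handling it."""
        self.event_count += 1
        self.total_bytes += size_bytes
        if had_error:
            self.error_count += 1

    def report(self, interval_seconds: float) -> dict:
        """Return the custom metrics for one interval and reset the counters."""
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        avg_size = self.total_bytes / self.event_count if self.event_count else 0.0
        metrics = {
            "event_rate_hz": self.event_count / interval_seconds,
            "avg_event_size_bytes": avg_size,
            "error_rate_hz": self.error_count / interval_seconds,
        }
        # Start the next reporting interval from zero.
        self.event_count = 0
        self.total_bytes = 0
        self.error_count = 0
        return metrics
```

The returned dictionary would then be handed to whatever monitoring backend is in use, so that the DAQ-level values appear next to the CPU, memory, and network plots for the same hosts.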

Ganglia is only one such system, and others should be considered.

We should provide at least one such type of monitoring within the core artdaq product.

ARTDAQ Metric Plugins.pptx (38.3 KB): Slides presented by Eric on the initial DAQ monitoring metric design. Kurt Biery, 11/11/2014 11:55 AM

Related issues

Related to ds50daq - Feature #4032: Add DAQ monitoring to the ds50daq system (Closed; 06/07/2013)

Related to artdaq - Feature #7637: Provide a few methods in MonitoredQuantity to return a single statistic (Closed; 01/06/2015, 01/14/2015)

History

#1 Updated by Kurt Biery about 6 years ago

Eric has started looking into this, and he has a design for providing different monitoring providers. Documentation on the design is available in the attached document (which was presented at the 10-Nov-2014 artdaq discussion).

#2 Updated by Kurt Biery about 6 years ago

  • Assignee changed from Kurt Biery to Eric Flumerfelt

Eric, when you have a chance, please check with Lynn to see if the Ganglia build that you have created might be included in the official distribution area. If not, then we'll figure out a different way to make it available.

I still need to think about your question of whether StatisticsHelper is the right place to hook into the metrics reporting...

In the meantime, I'd like to start the discussion with LBNE folks about whether it would be OK for us to install Ganglia (including the web server piece) onto the 35t DAQ cluster. I'll send an email about that and cc: you and Ron.
Kurt

#3 Updated by Kurt Biery about 6 years ago

Copying my email from yesterday here (should have put my comments here originally).

Hi Eric,
In your presentation on metric plugins on Monday, you asked the question whether StatsHelper is the right place to include metrics reporting...
I wonder if there would be value in creating a dedicated class (maybe a singleton? or, thinking about it more, maybe not) that manages the metric plugins, rather than managing them in the StatsHelper class. In that way, we could send metric reports from code that is lower-level than StatsHelper in artdaq, and we and experimenters could use the metric reporting without using StatsHelper, if they/we want to do that. Depending on the design of the MetricManager (or whatever), an instance of it could be passed to (or fetched inside of) StatsHelper, so we could continue to have the nice feature that quantities that are managed by StatsHelper automatically get reported to the metrics system.
We could talk about making this separation now, or we could wait until we have some operational experience to see if it would be useful...
Kurt
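The separation proposed in this email can be sketched roughly as follows. This is a Python illustration of the idea only, not the artdaq implementation (artdaq itself is C++): `MetricManager` here is a plain object rather than a singleton, `InMemoryPlugin` is a hypothetical stand-in for a real backend such as Ganglia, and the `StatsHelper` class and all method names are assumptions made for the example.

```python
class InMemoryPlugin:
    """Hypothetical backend that records reports instead of forwarding them."""

    def __init__(self):
        self.reports = []

    def send(self, name, value, unit):
        self.reports.append((name, value, unit))


class MetricManager:
    """Owns the metric plugins so any code can report without StatsHelper."""

    def __init__(self):
        self._plugins = []

    def add_plugin(self, plugin):
        self._plugins.append(plugin)

    def send_metric(self, name, value, unit=""):
        # Fan the report out to every configured backend.
        for plugin in self._plugins:
            plugin.send(name, value, unit)


class StatsHelper:
    """Stand-in statistics helper that is handed a MetricManager instance."""

    def __init__(self, metric_manager):
        self._metric_manager = metric_manager
        self._samples = {}

    def add_sample(self, name, value):
        self._samples.setdefault(name, []).append(value)

    def report_all(self):
        # Quantities managed here are reported automatically as averages.
        for name, values in self._samples.items():
            average = sum(values) / len(values)
            self._metric_manager.send_metric(name + ".average", average)


manager = MetricManager()
backend = InMemoryPlugin()
manager.add_plugin(backend)

# Lower-level code can report directly, without going through StatsHelper.
manager.send_metric("fragment_count", 42.0, "fragments")

# StatsHelper keeps its automatic reporting by using the same manager.
helper = StatsHelper(manager)
helper.add_sample("event_size", 100.0)
helper.add_sample("event_size", 300.0)
helper.report_all()
```

Passing the manager into `StatsHelper` rather than fetching a global instance keeps the singleton question open: either choice would work with this layout.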

#4 Updated by Kurt Biery almost 6 years ago

  • Related to Feature #7637: Provide a few methods in MonitoredQuantity to return a single statistic added

#5 Updated by Kurt Biery almost 6 years ago

  • Assignee changed from Eric Flumerfelt to Kurt Biery

Over the last several weeks, I improved the generation of a standard set of DAQ metrics in the BoardReaderCore, EventBuilderCore, and AggregatorCore classes in artdaq. There was also a minor bug fix in the artdaq_ganglia_plugin package.

I tested these changes on the DS-50 WH14NE teststand using Ganglia. The custom metric values that were reported in the Ganglia plots were compared to the values reported in the BR, EB, and AG log files, and several running conditions were exercised. For example, the size of the data being written to disk was increased until backpressure was observed in the system, and the plots were checked to verify that the "output wait times" that were reported did, in fact, show that this was the limiting factor in the rate of events through the system.

#6 Updated by Eric Flumerfelt over 4 years ago

  • Status changed from Assigned to Closed

artdaq_ganglia_plugin was released and is now tracked separately.
