Provide DAQ monitoring similar to what is done with Ganglia for NOvA DAQ
In the NOvA DAQ system, Ganglia is used to provide DAQ monitoring. ("DAQ monitoring" refers to watching over the health of the DAQ system and ensuring that it is performing well. This is in contrast to "Data Quality Monitoring" which monitors the quality of the data that is being taken and the health of the detector.)
The Ganglia plots provide information on CPU usage, memory usage, network activity, etc. on the computers that are part of the DAQ system, and they also provide information on the performance of the DAQ system itself, such as event rates, the frequency of certain errors, and the sizes of events, buffers, and files. These latter quantities are provided by custom Ganglia metrics that are defined, calculated, and reported in the experiment software. Plots that allow users and experts to correlate DAQ performance with computer-system performance can be very helpful in tracking down problems.
Ganglia is only one such system, and others should be considered.
We should provide at least one such type of monitoring within the core artdaq product.
#1 Updated by Kurt Biery about 6 years ago
- File ARTDAQ Metric Plugins.pptx ARTDAQ Metric Plugins.pptx added
- Status changed from New to Assigned
- Assignee set to Kurt Biery
- Target version changed from 577 to v1_12_06
Eric has started looking into this, and he has a design for providing different monitoring providers. Documentation on the design is available in the attached document (which was presented at the 10-Nov-2014 artdaq discussion).
#2 Updated by Kurt Biery about 6 years ago
- Assignee changed from Kurt Biery to Eric Flumerfelt
Eric, when you have a chance, please check with Lynn to see whether the Ganglia build that you have created can be included in the official distribution area. If not, we'll figure out a different way to make it available.
I still need to think about your question of whether StatisticsHelper is the right place to hook into the metrics reporting...
In the meantime, I'd like to start the discussion with LBNE folks about whether it would be OK for us to install Ganglia (including the web server piece) onto the 35t DAQ cluster. I'll send an email about that and cc: you and Ron.
#3 Updated by Kurt Biery about 6 years ago
Copying my email from yesterday here (should have put my comments here originally).
In your presentation on metric plugins on Monday, you asked the question whether StatsHelper is the right place to include metrics reporting...
I wonder if there would be value in creating a dedicated class (maybe a singleton? or, thinking about it more, maybe not) that manages the metric plugins, rather than managing them in the StatsHelper class. That way, we could send metric reports from code in artdaq that is lower-level than StatsHelper, and we and experimenters could use metric reporting without going through StatsHelper if we wanted to. Depending on the design of the MetricManager (or whatever it ends up being called), an instance of it could be passed to (or fetched inside of) StatsHelper, so we would keep the nice feature that quantities managed by StatsHelper automatically get reported to the metrics system.
We could talk about making this separation now, or we could wait until we have some operational experience to see if it would be useful...
#5 Updated by Kurt Biery almost 6 years ago
- Assignee changed from Eric Flumerfelt to Kurt Biery
Over the last several weeks, I improved the generation of a standard set of DAQ metrics in the BoardReaderCore, EventBuilderCore, and AggregatorCore classes in artdaq. There was also a minor bug fix in the artdaq_ganglia_plugin package.
I tested these changes on the DS-50 WH14NE teststand using Ganglia. The custom metric values that were reported in the Ganglia plots were compared to the values reported in the BR, EB, and AG log files, and several running conditions were exercised. For example, the size of the data being written to disk was increased until backpressure was observed in the system, and the plots were checked to verify that the "output wait times" that were reported did, in fact, show that this was the limiting factor in the rate of events through the system.