Ganglia metric plugin handling of EventStore run_id.subrun_id
The 15-second publishing interval inside the ganglia_metric::sendMetric() code appears to interfere destructively with the run_id.subrun_id value sent by the EventStore class when we automatically switch from one disk file to another by pausing and resuming the run.
This behavior is seen with the latest code changes that I have made in EventBuilderCore.cc and AggregatorCore.cc (currently on the feature/inRunExit branch).
The observed behavior is that the run_id.subrun_id is correctly shown in the Ganglia plots for the EBs and the AGs only for the first subrun in a run. For subsequent subruns, the run_id.subrun_id is shown as zero.
Here is what I think is happening:
- the Aggregator sends Pause commands to all of the artdaq processes, in the appropriate order
- the processFragments threads in the AGs and EBs exit and call metricMan.do_stop()
- the ganglia_metric::stopMetrics() method sends zeroes to all metrics accumulators, including the ones for run_id.subrun_id
- since more than 15 seconds have passed since the latest updates to the run_id.subrun_id metrics, the zeroes get processed and published
- the Aggregator sends Resume commands to all of the artdaq processes, in the appropriate order
- the start() methods in AggregatorCore and EventBuilderCore call metricMan.do_start()
- they then call EventStore.startSubRun()
- EventStore.startSubRun() calls metricMan.sendMetric() for the new run_id.subrun_id combination
- inside ganglia_metric::sendMetric(), less than 15 seconds has passed since the zero was passed in, so nothing is sent to Ganglia
Possible fixes:
- circumvent the 15-second test when the zeroes are sent (since we want the setting of parameters to zero at stopMetrics time to be reliable)
- set the lastSendTime to zero (rather than now) at that point, so that the value that is sent next will always be accepted and published
Of course, other options are possible.
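To make the suspected sequence concrete, here is a minimal sketch of the throttling with a simulated clock. It is not the ganglia_metric code: the class ThrottledMetric, the helper runPauseResume, the apply_fix switch, and the values 1.0 and 2.0 standing in for two subrun labels are all hypothetical. It only assumes what is described above, namely a 15-second test against the last send time and a zero sent at stopMetrics time.

```cpp
#include <vector>

// Hypothetical stand-in for the plugin's rate limiting; not the artdaq code.
class ThrottledMetric {
public:
    explicit ThrottledMetric(double interval_s) : interval_s_(interval_s) {}

    // Publish only if at least interval_s_ seconds have passed since the
    // last publish. Returns true when the value was published.
    bool sendMetric(double value, double now_s) {
        if (now_s - last_send_s_ < interval_s_) return false;  // silently dropped
        published_.push_back(value);
        last_send_s_ = now_s;
        return true;
    }

    // Without the fix, the zero goes through the normal interval test and
    // the last send time becomes "now". With the fix (an assumption based on
    // the two proposals above), the zero bypasses the test and the last send
    // time is reset to zero so the next value always passes.
    void stopMetrics(double now_s, bool apply_fix) {
        if (apply_fix) {
            published_.push_back(0.0);
            last_send_s_ = 0.0;
        } else {
            sendMetric(0.0, now_s);
        }
    }

    const std::vector<double>& published() const { return published_; }

private:
    double interval_s_;
    double last_send_s_ = 0.0;
    std::vector<double> published_;
};

// Replays one pause/resume cycle and returns everything that was published.
std::vector<double> runPauseResume(bool apply_fix, double pause_gap_s) {
    ThrottledMetric subrun(15.0);
    double t = 1000.0;                 // simulated clock, far from zero like epoch time
    subrun.sendMetric(1.0, t);         // first subrun label
    t += 60.0;
    subrun.stopMetrics(t, apply_fix);  // automatic pause
    t += pause_gap_s;
    subrun.sendMetric(2.0, t);         // startSubRun() after the resume
    return subrun.published();
}
```

In this model, runPauseResume(false, 1.0) publishes only the first label and the zero, which matches the plots. Either applying the fix or stretching the pause to 16 seconds lets the second label through.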
#1 Updated by Kurt Biery over 5 years ago
I should have said that I tested my "15 second" theory by temporarily modifying the AggregatorCore code so that it slept 16 seconds between the automatic pause and the automatic resume. In that situation, the run_id.subrun_id was successfully reported in the Ganglia plots for all subruns.
#2 Updated by Eric Flumerfelt over 5 years ago
- Status changed from New to Assigned
This may necessitate changes to the MetricPlugin interface. It might be a good idea to move the "average over n seconds" code from the Ganglia metric to the MetricPlugin interface and then specify in the artdaq code whether the metric should be averaged over time or not (the time to average quantities over would become a configuration parameter for metrics, with 0 meaning "don't average"). This could be useful for high-rate metrics in Graphite, which uses fixed-size storage, and will only store a certain number of data points.
#3 Updated by Eric Flumerfelt over 5 years ago
- % Done changed from 0 to 100
Code is in the repository, under the "TimeAveragingMetrics" feature branch, and has been tested to work. Specific metrics can be flagged in code to be "non-accumulating" and are reported to the metric plugins immediately. Other metrics will accumulate for metrics:plugin:reporting_interval seconds, then the average value will be reported to the metric plugins.
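For readers who have not looked at the branch, the accumulating versus non-accumulating behavior can be sketched roughly as follows. This is an illustration under assumptions, not the code on the TimeAveragingMetrics branch: the class AveragingMetric, the helper replay, and the choice to start each averaging window at the first sample after a report are all hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of per-metric accumulation; names are not from artdaq.
class AveragingMetric {
public:
    AveragingMetric(double reporting_interval_s, bool accumulate)
        : interval_s_(reporting_interval_s), accumulate_(accumulate) {}

    void sendMetric(double value, double now_s) {
        if (!accumulate_) {            // "non-accumulating": report immediately
            reported_.push_back(value);
            return;
        }
        if (count_ == 0) window_start_s_ = now_s;  // first sample opens a window
        sum_ += value;
        ++count_;
        if (now_s - window_start_s_ >= interval_s_) {
            // Window is full: report the average and start over.
            reported_.push_back(sum_ / static_cast<double>(count_));
            sum_ = 0.0;
            count_ = 0;
        }
    }

    const std::vector<double>& reported() const { return reported_; }

private:
    double interval_s_;
    bool accumulate_;
    double window_start_s_ = 0.0;
    double sum_ = 0.0;
    std::size_t count_ = 0;
    std::vector<double> reported_;
};

// Feeds (time, value) pairs to one metric and returns what reached the plugin.
std::vector<double> replay(bool accumulate, double reporting_interval_s,
                           const std::vector<std::pair<double, double>>& timed_samples) {
    AveragingMetric metric(reporting_interval_s, accumulate);
    for (const auto& s : timed_samples) metric.sendMetric(s.second, s.first);
    return metric.reported();
}
```

With a 15-second interval, accumulating samples of 10, 20 and 30 at t = 0, 5 and 15 seconds produce a single report of 20, while a non-accumulating metric such as run_id.subrun_id would be passed through on every call.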