Using the Online Monitoring¶
This page provides useful information on the monitoring system so shifters can make the most of the setup.
The online monitoring runs during data taking and updates at regular intervals so shifters can keep an eye on the status of the run. The monitoring is viewable on the following web page: http://lbne-dqm.fnal.gov.
Things to look for¶
- High RMS RCE channels: These show up in the "RCE ADC Mean" and "RCE ADC RMS" plots. They are indicated by groups of 32 channels that have slightly lower than normal values of ADC mean and very high values of ADC RMS. Note any such groups you see in the e-log.
- Out of sync RCEs: In the "Ratio of Number of Tick to Max in Event" plot, all entries should be at 1. Note any other entries and how long into the run they occurred.
- FEMB synch problem: This shows up as groups of 32 RCE channels with very high ADC Mean (>1000) and RMS. Note any occurrences in the e-log.
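The two channel-group checks above can be sketched in a few lines of Python. This is only an illustration of the logic, not part of the monitoring code: the function name, the RMS threshold and the nominal mean are all assumptions made up for the example (only the group size of 32 and the ADC mean cut of 1000 come from the list above).

```python
from statistics import median

GROUP_SIZE = 32              # channels per group (from the list above)
FEMB_MEAN_THRESHOLD = 1000   # ADC mean above this points to a FEMB sync problem
HIGH_RMS_THRESHOLD = 100     # ASSUMPTION: placeholder for a "very high" ADC RMS


def flag_channel_groups(adc_mean, adc_rms, nominal_mean):
    """Return {group index: label} for suspicious 32-channel groups.

    adc_mean and adc_rms are per-channel values in channel order.
    nominal_mean is the ADC mean expected for a healthy channel
    (hypothetical input; use whatever is normal for the current run).
    """
    flags = {}
    n_groups = len(adc_mean) // GROUP_SIZE
    for group in range(n_groups):
        first = group * GROUP_SIZE
        last = first + GROUP_SIZE
        group_mean = median(adc_mean[first:last])
        group_rms = median(adc_rms[first:last])
        if group_rms <= HIGH_RMS_THRESHOLD:
            continue  # RMS looks normal, nothing to report
        if group_mean > FEMB_MEAN_THRESHOLD:
            flags[group] = "FEMB sync problem"
        elif group_mean < nominal_mean:
            flags[group] = "high RMS"
    return flags
```

For example, with three groups where the second has a slightly low mean and a large RMS and the third has a mean above 1000 and a large RMS, the function flags groups 1 and 2. Anything it flags should still be confirmed by eye in the plots before going in the e-log.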
Online Monitoring details¶
Details on the Online Monitoring system can be found on this page: Online_Monitoring. This contains more information than the average shifter will require.
There are three ways that the monitoring viewable on the web page is refreshed:
- an initial update shortly after the start of a subrun (set by the fhicl parameter InitialMonitoringUpdate)
- regular updates during the subrun (at the interval set by the fhicl parameter MonitoringRefreshRate)
- a final update at the end of a subrun (assuming the DAQ doesn't crash before we get a chance to save the plots)
The current parameters are set to 30s and 500s (i.e. the monitoring will appear 30s after the start of the run and then update every 500s, just over 8 minutes, after that). These numbers have worked well so far, but please let us know if you feel different parameters would work better. The updates shouldn't be too frequent: it takes at least 10s to save all the data, and the monitoring misses every event read by the DAQ during that time.
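To get a feel for these numbers, here is a minimal Python sketch of the update schedule and of the time lost to saving. The function names are made up for this example, and it assumes the refresh interval is counted from the initial update; the real module may count differently. Since saving takes at least 10s, the lost fraction it returns is a lower bound.

```python
def update_times(subrun_length_s, initial_update_s=30, refresh_rate_s=500):
    """Seconds after the start at which a monitoring update is expected.

    ASSUMPTION: the refresh interval is counted from the initial update.
    """
    if refresh_rate_s <= 0:
        raise ValueError("refresh_rate_s must be positive")
    times = []
    t = initial_update_s
    while t <= subrun_length_s:
        times.append(t)
        t += refresh_rate_s
    return times


def fraction_missed(refresh_rate_s=500, save_time_s=10):
    """Fraction of the time spent saving, during which events are missed."""
    return save_time_s / refresh_rate_s


print(update_times(1200))    # expected updates in a 20-minute subrun
print(fraction_missed())     # current settings
print(fraction_missed(60))   # what a 60s refresh interval would cost
```

With the current settings, about 2% of the time goes on saving; refreshing every 60s instead would push that to roughly 17%, which is why a very short interval is not a good idea.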
The first time the monitoring is saved on the web, a link will appear on the page http://lbne-dqm.fnal.gov/OnlineMonitoring. The top of each page shows the time of the run start and of the latest update, along with the number of events the DAQ has read in that time, so the histograms can be interpreted correctly.
Right now, the monitoring is updated multiple times a day as we start shifts and people suggest improvements. Please do! Send suggestions to email@example.com (I've been spending a lot of the time in the control room so normally you can just chat to me!).
The monitoring is stable and will remain so from this point onwards hopefully! The system won't be changed any more.
If something happens which seems to imply a problem, check the known issues section below. If the current problem isn't listed, please contact Mike Wallbank ASAP (firstname.lastname@example.org); it will be treated urgently! During a particular shift, if there are repeated problems then I'm happy to give the shifter my cell number so I can be contacted faster.
On occasion the monitoring module may throw an uncaught exception and so stop running. The DAQ may continue to run regardless, with no further monitoring appearing. This has only happened a couple of times, but it's worth checking (filter on ArtExceptions in the MessageViewer). If it happens, let me know the run number and I'll look into it!
The DAQ is now stable enough that one doesn't normally need to terminate and init between every run. If the next run uses the same configuration as the previous one, then after stopping it (lbnecmd stop daq) the start transition can normally be applied. However, if the previous run didn't stop cleanly, weird things will happen (e.g. monitoring not showing up for subsequent runs, or monitoring from previous runs showing up during the current run). If you see the error "Timeout receiving fragments after stop, but no endSubRun message is available to send to art." (it's an error from the Aggregator; filter on error to see it), you must terminate and then re-init the DAQ to ensure everything runs as expected on the next run. John Freeman is looking into getting art to throw an exception if this happens, but we may not have the power to make this happen, so be aware!
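The decision described above can be written down as a small Python sketch. It assumes the DAQ messages are available as plain text lines (for instance copied out of the MessageViewer); that export, the function names and the log line format in the usage example are all hypothetical. Only the error text itself is taken from this page.

```python
STOP_TIMEOUT_ERROR = (
    "Timeout receiving fragments after stop, but no endSubRun "
    "message is available to send to art."
)


def needs_reinit(log_lines):
    """True if the Aggregator stop-timeout error appears in any line."""
    return any(STOP_TIMEOUT_ERROR in line for line in log_lines)


def next_step(log_lines):
    """Describe what to do before the next run with the same configuration."""
    if needs_reinit(log_lines):
        return "terminate and re-init the DAQ, then start"
    return "start"
```

For example, `next_step(["Aggregator: " + STOP_TIMEOUT_ERROR])` gives "terminate and re-init the DAQ, then start", while a list of lines without the error gives "start". A clean result only covers this one known error, so it does not guarantee that the previous run stopped cleanly.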
This appears to be due to a problem with the DAQ which means that a new subRun isn't started until the millislices are sent from the board readers. There have been hardware issues which have resulted in 0 millislices being made. This has the effect of not even starting the monitoring, so it doesn't appear on the web page (there would be nothing to see even if it had started!). This issue can be identified by looking for messages in MessageViewer which say that 0 millislices have been sent/received.
This issue has been seen as a symptom of the Aggregator processes dropping. This tends to be an issue which manifests when a DAQ component drops out and so the Event Builders have to keep waiting in vain for more data, which doesn't appear. This means that the incomplete event count increases steadily from this point onwards. This is a serious issue with the DAQ but is being looked into with the new version of artdaq.
A bit vague, but this is possible if the web server disk fills up so the monitoring data can't be saved on it. This is of course possible but it shouldn't happen -- the disk is 10GB and as of 5 Nov we are using less than 1GB for all runs from May-time. If you suspect this problem has occurred, contact me ASAP! (email@example.com)
The APA numbering is different to the offline. See the figure below: