Grafana For Shifters
Grafana is currently used as a DAQ monitoring system to check the health and status of runs. It has many panels with expert-level information, so we made a page with information aimed specifically at shifters. This page very briefly describes that monitoring tool. How to load Grafana is described in the ROC Doc (https://sbn-docdb.fnal.gov/cgi-bin/private/ShowDocument?docid=15944, ICARUS doc-db login). Users at ROC-West can also use the button on CR01 below the VNC buttons. This button still needs some stability testing, but it seems to work and will do in a pinch.
The panel for shifters is called Shifter DAQ Status and looks like the following (see the image below for details!):
At the top right, you will see a refresh button and a drop-down menu. Make sure 5s is selected so that the page keeps updating. It should be set to 5s by default.
The top left will print out the last run number (if a run is going this should be the current run number). As of 6 March 2020, there is a known bug where the run number may be behind if a run is started from the READY state, but shifters should be starting from the STOPPED state.
To the right of this are four speedometer-like displays under the heading Total Event Rates. This panel describes the rates of data-taking. TPC Avg Fragment Rate tells you the rate at which we are taking data from the TPC mini-crates (typically 5 Hz for noise studies in the pre-commissioning phase). PMT Avg Fragment Rate does the same for the PMTs (these aren't typically being run by the shifters right now). EventBuilder Event Rate and DataLogger Event Rate describe the rates at which fragments are being built into events and handled by the data loggers, so these should match the expected data rate.
Since there are 2 event builders and 2 data loggers, it sometimes happens that when the page refreshes it misses one or both. Don't be too alarmed if any number on this page is half the expected value or nothing for a short while, though you should see it come back to the expected rates at times too! The above example is a case where this has happened. If it's never at the right value, then something may be up...
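The reasoning above can be sketched in a few lines of Python. This is only an illustration of how to read the displays, not part of the DAQ or Grafana: the function names, the 20% tolerance, and the idea of collecting a short list of readings are all assumptions made for the example.

```python
def classify_rate(observed_hz, expected_hz, rel_tol=0.2):
    """Label one displayed rate relative to the expected data rate.

    rel_tol is a hypothetical tolerance, not a value from the DAQ.
    """
    if expected_hz <= 0:
        raise ValueError("expected_hz must be positive")
    if observed_hz == 0:
        return "missing"    # the refresh caught neither process
    if abs(observed_hz - expected_hz) <= rel_tol * expected_hz:
        return "expected"
    half = expected_hz / 2
    if abs(observed_hz - half) <= rel_tol * half:
        return "half"       # the refresh likely caught only one of the two
    return "other"


def needs_attention(readings_hz, expected_hz, rel_tol=0.2):
    """True if no reading in the window ever reached the expected rate."""
    return not any(
        classify_rate(r, expected_hz, rel_tol) == "expected"
        for r in readings_hz
    )


# A 5 Hz run where some refreshes miss one or both processes.
print(needs_attention([2.5, 0.0, 5.0, 2.5], expected_hz=5.0))
```

A window that contains at least one reading at the expected rate is treated as healthy; only a window that never reaches it is flagged.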
Below the Total Event Rates panel is a set of event counters. On the left is the summed total of events released to art during the run, in three columns, one for each type of artdaq process handling events (Event Builder, Data Logger, and Dispatcher). To the right are the counts of incomplete events released to art during the run. An incomplete event is one that didn't contain the full set of information. For example, if there are 4 mini-crates and only 3 reported their data for the event, the event would be marked incomplete. There are two event builders, two data loggers, and one dispatcher (fed data from one or both data loggers, depending). These numbers are susceptible to the same possibility of being off by a factor of two or temporarily not reporting. If you see incomplete events, make a note of it in the ECL. If all the events are incomplete, then you should bring down the run and start over.
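As a rough sketch of the decision described above, the following Python maps the two counters to a shifter action. The function names and the returned labels are hypothetical and exist only for this example. Because the counters can temporarily misreport, treat the result as a prompt to look again rather than as an automatic verdict.

```python
def is_complete(fragments_received, fragments_expected):
    """An event is complete only if every expected source reported."""
    return fragments_received == fragments_expected


def shifter_action(total_events, incomplete_events):
    """Map the event counters to the action described for shifters."""
    if incomplete_events < 0 or incomplete_events > total_events:
        raise ValueError("counters are inconsistent")
    if incomplete_events == 0:
        return "none"
    if incomplete_events == total_events:
        return "restart_run"    # every event is incomplete
    return "note_in_ecl"        # some, but not all, are incomplete


# 4 mini-crates expected, only 3 reported: the event is incomplete.
print(is_complete(3, 4))
print(shifter_action(total_events=1200, incomplete_events=7))
```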
Below this is a set of "alarms". These are meant to turn red when there is an issue, and the current numbers should be added to the "every 2 hours" run stability ECL post.
The one on the left reports when boards are seen to be "busy" and are not passing data along as it comes in. A typical mini-crate has 9 boards, so the number 9 could indicate that an entire mini-crate has gone "busy." Anything over 0 should trip this field to turn red. If this stays above 0 for a short while (1 minute or so), then this is a reason to restart the run and make a note of this condition in the ECL. Feel free to add screenshots!
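The "busy" rule can be illustrated with a short Python sketch. It assumes, purely for the example, that the busy count is read once per 5s refresh and that "1 minute or so" means 12 consecutive non-zero readings. The function names and parameters are hypothetical.

```python
def busy_alarm_is_red(busy_boards):
    """The field turns red for any busy count above 0."""
    return busy_boards > 0


def sustained_busy(samples, refresh_s=5, hold_s=60):
    """True if the busy count stayed above 0 for about hold_s seconds.

    samples holds the busy counts read at successive refreshes.
    """
    needed = max(1, hold_s // refresh_s)    # 12 readings at the defaults
    streak = 0
    for count in samples:
        streak = streak + 1 if count > 0 else 0
        if streak >= needed:
            return True
    return False


# A whole mini-crate (9 boards) busy for twelve refreshes in a row.
print(sustained_busy([9] * 12))
```

A brief blip that clears within a few refreshes resets the streak, which matches the advice to restart only when the condition persists.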
The other alarm that is included at the moment monitors the circular buffer status of the mini-crates. It reports the highest value given for buffer occupancy. Each buffer has something like 250 MB, so this alarm is set to trip red at 150 MB. It is typically around 1.7 MB or so, nearly a factor of 100 below the threshold to trip red. If this number starts to climb, take note in the ECL. In any case, this is something we ask for on the "every 2 hours" run stability ECL entry.
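To make the numbers concrete, here is a small Python sketch of the buffer check. The 150 MB threshold and the 1.7 MB typical value come from the text above. The function names, the choice to trip at exactly 150 MB, and the definition of "climbing" as strictly increasing readings are assumptions made for the example.

```python
def buffer_alarm_is_red(max_occupancy_mb, threshold_mb=150.0):
    """Assumes the alarm trips at or above the threshold."""
    return max_occupancy_mb >= threshold_mb


def headroom_factor(max_occupancy_mb, threshold_mb=150.0):
    """How many times smaller the occupancy is than the threshold."""
    if max_occupancy_mb <= 0:
        raise ValueError("occupancy must be positive")
    return threshold_mb / max_occupancy_mb


def is_climbing(readings_mb):
    """Hypothetical definition: each reading is above the previous one."""
    if len(readings_mb) < 2:
        return False
    return all(b > a for a, b in zip(readings_mb, readings_mb[1:]))


print(round(headroom_factor(1.7), 1))    # 88.2
print(is_climbing([1.7, 2.5, 4.0]))
```

At the typical 1.7 MB, the occupancy is about 88 times below the threshold, consistent with "nearly a factor of 100".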
An ECL entry should be made each time a run is stopped (indicating why it was stopped) and each time a new run is started (giving the details of the run). Every 2 hours while a run is going, we ask that you make an ECL entry detailing info about the run. Starting 14 March 2020, we have the DAQ (Commissioning) Checklist, which has the necessary elements to fill out. See the post below for an example of the newer-style form.
In this form, Alarm TPC is the "busy" status box, and Alarm BUFFER is the circular buffer status.
Please also note on this "every 2 hours" post whether the Archiver is running (this is shown in the OnMon pages on the bottom right monitor of CR01).