DM - Shifters
Instructions for Data Management monitoring:
Data Management monitoring is now integrated into the Slow Controls, and a "DataMgmt" box is now present in the Alarm Panel. If this box is red, please identify the PUBS/DataMgmt error responsible, log the information related to it, and contact the on-call data management expert by phone.
What to monitor:
(1) Reset the counters at the beginning of your shift. At the end of the shift, if any project's "intermediate" or "error" status counter has accumulated above 100, make an elog entry and contact the DM experts.
(2) If you cannot open the GUI following the instructions below, contact the on-call expert.
(3) If a large, bright-red window pops up and flashes around the screen telling you to call an on-call expert because a daemon has stopped working, call the on-call expert.
(4) If one of the projects has red portions (basically, if you see red anywhere), contact the on-call expert.
(5) If there is a SlowMonCon alarm for a variable of the type: uB_DataMgmt_PUBS_XXX/YYYYYY_queued, contact the on-call expert.
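As an illustration of item (1), the end-of-shift counter check can be sketched in Python as follows. This is only a sketch: the dictionary layout, the example project names, and the function name `projects_needing_elog` are hypothetical and are not part of PUBS. Only the threshold of 100 and the "intermediate" and "error" status names come from the instructions above.

```python
# Hypothetical sketch: names and data layout are assumptions, not the PUBS API.
ELOG_THRESHOLD = 100  # from item (1): report counters that accumulate above 100


def projects_needing_elog(counters, threshold=ELOG_THRESHOLD):
    """Return the projects whose "intermediate" or "error" counter exceeds threshold.

    counters maps a project name to a dict of status name -> file count.
    """
    flagged = {}
    for project, status_counts in counters.items():
        over = {
            status: count
            for status, count in status_counts.items()
            if status in ("intermediate", "error") and count > threshold
        }
        if over:
            flagged[project] = over
    return flagged


if __name__ == "__main__":
    # Made-up end-of-shift readings for two made-up projects.
    readings = {
        "example_project_a": {"intermediate": 12, "error": 0},
        "example_project_b": {"intermediate": 150, "error": 3},
    }
    for project, over in projects_needing_elog(readings).items():
        print(f"elog entry and DM expert contact needed for {project}: {over}")
```

A counter sitting exactly at 100 is not flagged here, since the instruction says "above 100".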
All PUBS monitoring tools are up-to-date and should have shifters' attention :)
If you have not already done so, please read the DM - Overview page before taking a shift (updated April 8, 2018).
Further details are below.
There are two different online data management monitoring tools: a PUBS project monitor, which checks the status of the various PUBS projects that are running, and a resource usage monitor, which monitors resource usage on the relevant server machines as well as the network traffic that is critical to data management. The first is done through the PUBS GUI and the second is now done through SlowMonCon.
Instructions for both follow. Both should be checked as part of the regular MicroBooNE shift responsibilities under standard circumstances. At the beginning of each shift, make sure both of these monitoring tools are working and being displayed.
1) PUBS resource usage monitoring:
PUBS resource monitoring is now handled by the Slow Controls GUI. We monitor the disk occupancy and the filling/draining rate (i.e. the differential rate of disk occupancy) of the near1 and evb machines. This differential should be near 0 when averaged over time. The disk occupancy on near1:/datalocal and evb:/data should be less than 50% at all times. Minor alarms for these variables are set at 50% occupancy, with major alarms at 60%. These variables can be accessed via the Slow Controls "DataMgmt-PUBS-table.opi" page. The variables are:
Additionally, the disk fill/drain rates are monitored. The variables, for the same areas, are denoted as:
Refer to the DM - Overview page to see how to access monitoring for these variables.
If any of the above variables goes into alarm, please call the on-call DM expert.
(2) In a browser tab, go to http://mrtg.fnal.gov/weathermap/wx-uboone.html
This is a network weathermap webpage (see DM - Overview for details) from which you can read our data transfer rate (the arrow going into r-dist-fcc2-1).
What to monitor: Make sure you can access the uB_DataMgmt_PCXX_XXX/XXX variables from the Slow Controls. If these variables go into alarm, call the on-call expert.
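To make the thresholds in this section concrete, here is a minimal Python sketch of how a disk occupancy reading maps onto the alarm levels and how an averaged fill/drain rate can be computed from occupancy samples. It is not the Slow Controls implementation. The function names are hypothetical, and treating a reading exactly at 50% or 60% as being in alarm is an assumption.

```python
# Hypothetical sketch: function names are illustrative, not Slow Controls code.
MINOR_ALARM_PERCENT = 50.0  # minor alarm level quoted in the text
MAJOR_ALARM_PERCENT = 60.0  # major alarm level quoted in the text


def occupancy_alarm(percent_used):
    """Classify a disk occupancy percentage as "ok", "minor", or "major".

    Assumption: a reading exactly at a threshold counts as being in alarm.
    """
    if percent_used >= MAJOR_ALARM_PERCENT:
        return "major"
    if percent_used >= MINOR_ALARM_PERCENT:
        return "minor"
    return "ok"


def mean_fill_rate(samples):
    """Average fill/drain rate in percent per second over the sampled interval.

    samples is a time-ordered list of (time_in_seconds, percent_used) pairs.
    A positive value means the disk is filling; a negative one means draining.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_first, occ_first), (t_last, occ_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must span a positive time interval")
    return (occ_last - occ_first) / (t_last - t_first)


if __name__ == "__main__":
    print(occupancy_alarm(42.0))
    print(mean_fill_rate([(0, 40.0), (3600, 40.5)]))
```

A healthy system would give "ok" for the occupancy and an averaged rate close to 0, meaning data leaves the disk about as fast as it arrives.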
2) PUBS project status and restarting PUBS GUI: Follow these instructions to make sure that PUBS projects are actively running on ubdaq-prod-evb and ubdaq-prod-near1.
(1) open a new terminal window
(2) log into ubdaq-prod-ws01 as uboonedaq
ssh -X uboonedaq@ubdaq-prod-ws01
(3) log into ubdaq-prod-evb as uboonedaq
ssh -X uboonedaq@ubdaq-prod-evb
(4) go to the PUBS repository
(5) setup the environment
(6) setup pyqtgraph
(7) start the monitoring gui
Expand the window that shows up and ignore any messages printed to the terminal. The GUI is purely a monitoring tool, so don't be afraid to click/double-click on things, restart the GUI, resize the window, zoom in/out, etc.
You should now see a GUI that displays a tree of colored progress bars. The lines in the tree represent the flow of files, which start near the top (where they exit the DAQ system) and flow towards the bottom (where they are deleted from the DAQ machines after being successfully copied to tape storage). Each progress bar represents a project that processes the files in some way. When a project has successfully processed all of its files and has nothing in its queue, it shows up as a green bar (good!). When a project has pending files, those files show up as a blue portion of the progress bar (normal). When a project was unable to process some files, those files show up as red portions of the progress bar (bad!). Projects that have been disabled show up as a gray bar.
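The color scheme described above can be summarized by a small Python sketch. This is not the PUBS GUI code. The function `bar_status` and its inputs are hypothetical, and it simply reports the most significant color a shifter would see for a project, with red taking priority over blue.

```python
# Hypothetical sketch: not the PUBS GUI code; inputs are assumed for illustration.
def bar_status(enabled, pending, failed):
    """Return the most significant color visible in a project's progress bar.

    enabled: whether the project is enabled
    pending: number of files waiting to be processed
    failed:  number of files the project could not process
    """
    if not enabled:
        return "gray"   # disabled project
    if failed > 0:
        return "red"    # failed files: contact the on-call expert
    if pending > 0:
        return "blue"   # pending files: normal
    return "green"      # nothing queued, nothing failed: good


if __name__ == "__main__":
    print(bar_status(enabled=True, pending=4, failed=0))
```

The ordering of the checks reflects the shifter's priorities: any red is the condition that requires action, whatever else the bar shows.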
There is a button on the GUI labeled "Reset Counters". This is analogous to acknowledging an alarm on the Slowmon GUI. If you see something red and have notified an expert, he or she may instruct you to click the Reset button. This will reset all of the bars to green, and errors will no longer be displayed. Unchecking the "Use Relative Counters" checkbox will reveal that those errors are still present and need to be dealt with by a PUBS expert.
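One way to picture the relationship between "Reset Counters" and "Use Relative Counters" is a counter that remembers a baseline. The Python sketch below is a hypothetical model of the visible behavior described above, not the GUI's actual implementation, and the class and method names are invented for illustration.

```python
# Hypothetical model of the behavior described above; not the PUBS GUI code.
class RelativeCounter:
    """One error counter with a resettable baseline."""

    def __init__(self):
        self.absolute = 0  # total errors recorded; a reset never clears this
        self.baseline = 0  # value of `absolute` at the last reset

    def record_error(self, n=1):
        """Register n newly failed files."""
        self.absolute += n

    def reset(self):
        """Model of clicking "Reset Counters": hide the errors seen so far."""
        self.baseline = self.absolute

    def displayed(self, use_relative=True):
        """Value shown with "Use Relative Counters" checked or unchecked."""
        if use_relative:
            return self.absolute - self.baseline
        return self.absolute


if __name__ == "__main__":
    counter = RelativeCounter()
    counter.record_error(3)
    counter.reset()
    print(counter.displayed())                    # relative view after the reset
    print(counter.displayed(use_relative=False))  # absolute view still has the errors
```

In this model a reset only moves the baseline, which is why the errors reappear when the relative view is switched off and still need attention from a PUBS expert.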