DM - Overview¶
The objectives of online data management include:
- Timely transfer of data files written by DAQ into ENSTORE (permanent tape storage)
- Extraction of checksum value and binary file metadata to register in file catalog database (SAM)
- When appropriate, online swizzling of incoming binary data files to generate ROOT files
The above actions are modularized into the following PUBS projects:
- DAQ output file detection when data file is closed
- Binary file checksum calculation
- Binary file metadata extraction
- Registration into SAM
- Transfer to storage as soon as possible
- When permanent storage (dCache dropbox) or a transfer route (network) is down, drain files into a buffer server (ubdaq-prod-near1)
- Monitoring occupancy of SEB cache disks for SN Stream
- When appropriate, remove oldest SN Stream files
The actual number of projects is larger than what is listed above, as some actions need redundancy.
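As an illustration of one of the projects listed above, the checksum step can be sketched as a chunked read over the binary file. This is a minimal sketch, not the PUBS implementation: the Adler-32 algorithm, the function name `file_checksum`, and the chunk size are all assumptions, since this overview does not state which checksum is registered in SAM.

```python
import zlib

def file_checksum(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file (algorithm choice is an assumption).

    The file is read in chunks so that large binary files do not have to
    fit in memory.
    """
    value = 1  # Adler-32 starting value, same default as zlib.adler32
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in to continue the checksum.
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF
```

Reading in chunks gives the same result as checksumming the whole file at once, which matters for multi-gigabyte DAQ output files.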
All projects currently run on one of three online server machines: evb, near1, or ws02 (actual server names ubdaq-prod-evb, ubdaq-prod-near1, and ubdaq-prod-ws02 respectively).
The evb server is a DAQ event builder machine where a raw binary file is written on the disk by DAQ (ubdaq-prod-evb:/data/uboonedaq/rawdata/).
The near1 server was originally an online monitoring machine; it is now shared by various PUBS projects that need some CPU and memory resources.
near1 also acts as a disk buffer: evb has a total disk space of 33TB whereas near1 has 35TB of capacity.
Usually a file is transferred directly from evb to permanent storage. When dCache (/pnfs/uboone/scratch) is unavailable, the near1 server is used as a temporary drain. Once dCache service is restored, files staged on near1 are transferred to dCache for archiving on Enstore.
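The routing described above can be sketched as a small planning function. This is a hypothetical illustration only: the function name `plan_transfers`, the labels `"evb"`, `"near1"`, and `"dcache"`, and the choice to flush the near1 backlog before new evb files are assumptions, not the actual PUBS logic.

```python
def plan_transfers(evb_files, near1_files, storage_reachable):
    """Return a list of (file, source, destination) transfer steps.

    storage_reachable is True only when both the dCache dropbox and the
    network route to it are up.
    """
    plan = []
    if storage_reachable:
        # Assumption: files staged on near1 during an outage go out first.
        for name in near1_files:
            plan.append((name, "near1", "dcache"))
        # Normal route: directly from evb to permanent storage.
        for name in evb_files:
            plan.append((name, "evb", "dcache"))
    else:
        # Storage or network is down: drain evb into the near1 buffer.
        # Files already on near1 stay where they are.
        for name in evb_files:
            plan.append((name, "evb", "near1"))
    return plan
```

In either branch every file leaves evb, which is the point of the buffer: the evb disk keeps draining even while permanent storage is unreachable.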
The ws02 server was originally used as a gateway machine for accessing the DAQ internal network, but is now used for monitoring and cleaning up the SuperNova Stream files located on each of the SEB (1-10) servers in LArTF.
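The cleanup of SN Stream files can be pictured as removing the oldest files until the cache disk is back under a target occupancy. The sketch below is an assumption-laden illustration: the function name `files_to_remove`, the `(path, mtime, size)` tuple layout, and the threshold parameter `max_fraction` are hypothetical, and it treats the listed files as the only content of the disk.

```python
def files_to_remove(files, capacity_bytes, max_fraction):
    """Pick the oldest files to delete until occupancy is within the limit.

    files: list of (path, mtime_seconds, size_bytes) tuples.
    capacity_bytes: total size of the cache disk.
    max_fraction: hypothetical target occupancy, e.g. 0.5 for 50%.
    """
    used = sum(size for _, _, size in files)
    limit = capacity_bytes * max_fraction
    doomed = []
    # Walk the files from oldest to newest modification time.
    for path, _mtime, size in sorted(files, key=lambda entry: entry[1]):
        if used <= limit:
            break
        doomed.append(path)
        used -= size
    return doomed
```

The function only returns the list of paths; the actual deletion is left to the caller so the selection can be checked before anything is removed.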
It is essential to keep files draining smoothly from evb to permanent storage or near1, so that disk occupancy does not pile up on the evb machine.
When the disk is full on evb, the DAQ can no longer run to collect data.
Project Status Monitoring¶
PUBS provides a GUI that shows a cumulative summary of the status of running projects.
Steps to start the GUI can be found in DM - Shifters.
Once started, the GUI shows status bars for all projects relevant to online data management. When the DAQ defines a new (run,subrun), PUBS detects it automatically and assigns every project an initial status, which is classified as an "intermediate" state. The GUI displays a summary of three states: "Good", "Int.", and "Error", which represent completed, processing, and problematic states respectively. Each state is shown with a counter giving the number of (run,subrun) IDs in that state. A project may pass through several distinct intermediate states, but for clarity the GUI front panel does not show them separately. When all intermediate steps finish and reach the successful final state, the "Good" counter increases.
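The three-way summary can be sketched as follows. The numeric status convention used here (0 for completed, 100 and above for errors, anything else intermediate) is a hypothetical stand-in chosen for illustration; the real PUBS status codes are described in the expert documentation, not in this overview.

```python
def classify(status):
    """Map a project status code to a GUI state (codes are hypothetical)."""
    if status == 0:
        return "Good"
    if status >= 100:
        return "Error"
    # Every other code counts as some intermediate processing step.
    return "Int."

def summarize(statuses):
    """Count (run, subrun) IDs per GUI state for one project.

    statuses: dict mapping (run, subrun) -> status code.
    """
    counts = {"Good": 0, "Int.": 0, "Error": 0}
    for code in statuses.values():
        counts[classify(code)] += 1
    return counts
```

Note how two different intermediate codes collapse into the single "Int." counter, mirroring the way the front panel hides the individual intermediate states.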
To make shifts easier, the GUI has a "Use Relative Counters" option. When the box is checked, the GUI records its current state and shows each counter ("Good", "Int.", and "Error") only relative to the last time the counters were reset. This lets shifters see either all subrun files that are in an error state or just the new errors that have occurred since the start of the shift. You should reset the counters at the beginning of your shift so that you can focus on counters that increase during your shift rather than on backlogged error or intermediate states from the past. Here is what the GUI will look like immediately after clicking "Reset Counters".
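The relative-counter behaviour amounts to subtracting a stored baseline from the absolute counts. The class below is a minimal sketch of that idea; the name `RelativeCounters` and its methods are hypothetical and are not taken from the PUBS GUI code, and the numbers in the usage example are illustrative.

```python
class RelativeCounters:
    """Show counters relative to the values recorded at the last reset."""

    def __init__(self):
        self.baseline = {"Good": 0, "Int.": 0, "Error": 0}

    def reset(self, current):
        # Remember the absolute counts at the moment of the reset.
        self.baseline = dict(current)

    def display(self, current, relative=True):
        if not relative:
            # Box unchecked: show the full absolute counts.
            return dict(current)
        # Box checked: show only the change since the last reset.
        return {key: current[key] - self.baseline.get(key, 0) for key in current}


counters = RelativeCounters()
counters.reset({"Good": 100, "Int.": 5, "Error": 44})
print(counters.display({"Good": 110, "Int.": 5, "Error": 46}))
# {'Good': 10, 'Int.': 0, 'Error': 2}
```

Immediately after a reset every relative counter reads zero, so anything that appears afterwards happened during the current shift.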
If the "Use Relative Counters" box is unchecked, then all processed, queued, and error state files are listed. Here is the same GUI as in the first picture, but with the "Use Relative Counters" box unchecked. Note the 44 errors listed for Binary Transfer Near1.
Branched Process Chain¶
The diagram below complements the earlier ones by showing the full, somewhat complicated process chain, which involves branching and merging of the data stream.
The previous diagrams omit fine details of the data stream chain that are shown in this picture (and are also visible in the monitoring GUI).
Online Computing Resource Monitoring¶
The online computing resource is monitored in two ways:
- evb and near1 disk usage monitored by PUBS projects
- Fermilab MicroBooNE Network Weather Map
PUBS resource monitoring is now handled by the Slow Controls GUI. We monitor the disk occupancy and the filling/draining rate (i.e. the differential rate of disk occupancy) of the near1 and evb machines. This differential should be near 0 when averaged over time. The disk occupancy on /data/local should be less than 50% at all times, and the occupancy on /data/ should be less than 70%.
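The two monitored quantities can be sketched in a few lines. This is an illustrative sketch, not the Slow Controls implementation: the function names and the sample layout are assumptions, and only the 50% and 70% limits come from the text above.

```python
import shutil

# Occupancy limits quoted above, expressed as fractions of the disk.
LIMITS = {"/data/local": 0.50, "/data": 0.70}

def current_occupancy(path):
    """Return the used fraction of the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def fill_rate(samples):
    """Average change in occupancy per second over a time window.

    samples: time-ordered list of (time_seconds, occupancy_fraction).
    A positive value means the disk is filling, negative means draining.
    """
    (t_first, f_first), (t_last, f_last) = samples[0], samples[-1]
    return (f_last - f_first) / (t_last - t_first)

def over_limit(mount, fraction):
    """True when the occupancy is at or above the limit for that mount."""
    return fraction >= LIMITS[mount]
```

A fill rate that stays positive when averaged over hours means files arrive on the disk faster than they are transferred out, which is the early warning sign before an occupancy limit is reached.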
You can access the current plots by going to the slow-controls page "DataMgmt-PUBS-table.opi" (shown below) and clicking on the "evb data disk plot" and "near1 data disk plot" buttons in the bottom-left of the page.
Network Weather Map¶
Fermilab provides another very handy tool: MicroBooNE's network weather map. It shows the network I/O at each network switch and on each port of the switch. In particular, the network speed to "r-dist-fcc2-1" shows the file transfer rate out of LArTF to permanent storage, which is our figure of merit.
You can access the current weather map here: http://mrtg.fnal.gov/weathermap/wx-uboone.html
Feel free to contact either Kirby (firstname.lastname@example.org) or Wei Tang (email@example.com) with any questions or comments about this overview and/or PUBS.
For more technical details, expert documentation is maintained here: https://www.overleaf.com/3384459hxbyns#/9541217/