Project

General

Profile

Nearline Contacts

The Full Story of the Nearline

Gather round, children, to hear the tale of how data is processed through all of the steps that comprise the nearline system. This description tells the story of the flow of data, starting with the 35t detector and going all the way through to posting plots demonstrating detector performance on the web.

Stage 1: Data Collection

As data pours out from the detector, it passes through the DAQ and ends up being written to disk in the "online" format. Through various transfer scripts, these files make their way to the /data/lbnedaq/data/transferred_files/ area, which can be seen from the lbne35t-gateway02 machine (technically, the files in this directory are actually hard links to the real files.) From here, these data files can be seen and processed by the nearline scripts.

Stage 2: Nearline Processing

The nearline processing currently takes place on the lbne35t-gateway02 machine, where data is processed as it appears in the temporary storage directory /data/lbnedaq/data/transferred_files/. The main script is ProcessNewFiles.sh (described below), which is run as a cron job every 5 minutes. This script looks for new files (older than 10 minutes) that haven't already been processed and aren't currently being processed, and for each file it finds, it fires up another script, ProcessSingleFile.sh (described below), that handles running the nearline processing job over that file. ProcessNewFiles will stop adding ProcessSingleFile jobs when it either runs out of new unprocessed files or hits a self-imposed maximum number of jobs allowed to run on one machine at a time (depending on the conditions of the machine, this number is typically less than the number of available CPU cores; on the lbne35t-gateway02 machine it is currently set to 4.)

The ProcessSingleFile script is extremely simple. All it does is set up the appropriate LArSoft offline environment and run the nearline processing job(s) over the input file found by ProcessNewFiles. The output of this job is a single root file containing the histograms representing the computed nearline metrics from that input file. Note that each input file represents the data from one subrun, so one set of nearline metrics is generated per subrun.
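The file-selection logic described above can be sketched in shell. This is an illustrative sketch only: the marker-file naming (.DONE/.LOCK next to the output) and the directory defaults are assumptions, not the real ProcessNewFiles.sh.

```shell
#!/bin/bash
# Sketch of the ProcessNewFiles.sh selection logic (assumed layout).

DATA_DIR="${DATA_DIR:-/data/lbnedaq/data/transferred_files}"
OUT_DIR="${OUT_DIR:-/lbne/data2/users/lbnedaq/nearline}"

# Emit candidate files: older than 10 minutes (so they are no longer
# being written), newest first, skipping finished (.DONE) and
# in-progress (.LOCK) subruns.
list_unprocessed() {
  find "$DATA_DIR" -maxdepth 1 -type f -mmin +10 | sort -r |
  while read -r f; do
    base=$(basename "$f")
    [ -e "$OUT_DIR/$base.DONE" ] && continue
    [ -e "$OUT_DIR/$base.LOCK" ] && continue
    echo "$f"
  done
}

# The real cron job would hand each candidate to ProcessSingleFile.sh
# until the per-machine job cap (currently 4) is reached.
```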

Stage 3: Data Transfer and Storage

Data is stored in the directory /lbne/data2/users/lbnedaq/nearline for the regular nearline output files, and /lbne/data2/users/lbnedaq/nearline_evd for the event display output files. The structure within that directory is as follows:

~/{LArSoft version - ex: v04_32_01}/{big run number - ex: 006}/{small run number - ex: 006938}/

Each of these directories represents the output from one run's worth of processing, and these directories can be seen from the dunegpvms (at least some of them, anyway.) There are several types of files that you will find in these directories:

  • *.LOCK or *.DONE files: files with zero size used to indicate to any scripts running that processing for a particular subrun is either currently in progress or finished
  • *.log files: text output from running ProcessSingleFile.sh (currently suppressed since the text output is big, so you won't see these files.)
  • *.root files: root output from processing data
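Those zero-size marker files make it cheap for a script (or a person) to check the state of a subrun. A minimal sketch, assuming the markers sit alongside the output file:

```shell
#!/bin/bash
# Sketch of interpreting the zero-size marker files; the naming
# convention here is an assumption based on the list above.

status_of() {
  # Usage: status_of <output directory> <subrun file basename>
  local dir="$1" base="$2"
  if   [ -e "$dir/$base.DONE" ]; then echo done
  elif [ -e "$dir/$base.LOCK" ]; then echo in-progress
  else echo unprocessed
  fi
}
```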

Lastly, there is a script running on the gateway02 machine called CleanUpNearlineFiles.sh that deletes files from stalled or failed jobs.

Stage 4: Posting Final Plots to the Web

Once the nearline root files have made their way to storage in the directories described in Stage 3, they can be looped over to make nice-looking plots for the web. Currently, the web plot making is also handled on the lbne35t-gateway02 machine, because this machine has access to both the disk where the raw output files from the DAQ are kept AND the disk area where the nearline web site is hosted (/web/sites/lbne-dqm.fnal.gov/htdocs/NearlineMonitoring/). The main script run as a cron job is MakeNearlineWebPlots.sh. This script runs once every 10 minutes to make the plots showing data from the last 24 hours, once every hour to make the plots showing data from the last week, and once per day (at 5:30 AM) to make the plots showing data from the past month.

The first thing this script does is make a list (saved as a text file) of all of the nearline files on disk that are less than a {day, week, month} old. Then it runs NearlinePlotMaker.C, which simply opens each nearline file listed in the text file, checks that the data really is within the right time window (day, week, month), makes plots out of that data, and saves all of its canvases as png files in a directory pointed to by the nearline website. Note that this means the previous pictures in that directory are overwritten (archives of old plots are not kept, but the plots from any time period can always be regenerated with the NearlinePlotMaker.C root macro from the nearline files on disk, which are kept forever.)

The total time from when a subrun ends to when the data from that subrun first shows up on the nearline website (in the 24 hour plots) varies, but currently appears to be around 60-90 minutes. Most of this appears to be a lag in the amount of time it takes for files to show up in the /data/lbnedaq/data/transferred_files/ area. Once a file shows up in that area, it takes about 10-15 minutes to show up in the nearline web plots (a few minutes to process the file, combined with the fact that the 24 hour plots only update every 10 minutes.)
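The three schedules described above would correspond roughly to crontab entries like the following. The install path is elided, and the day counts (1, 7, 31) are illustrative assumptions; the argument order follows the MakeNearlineWebPlots.sh usage given later on this page.

```
# 24-hour plots: every 10 minutes
*/10 * * * * /path/to/MakeNearlineWebPlots.sh v04_32_01 e9 1
# 1-week plots: once an hour
0 * * * * /path/to/MakeNearlineWebPlots.sh v04_32_01 e9 7
# 1-month plots: once a day at 5:30 AM
30 5 * * * /path/to/MakeNearlineWebPlots.sh v04_32_01 e9 31
```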

So that's the story. It starts with raw data from one of our detectors and ends up being posted on the nearline website:

http://lbne-dqm.fnal.gov/NearlineMonitoring/

Nearline Scripts

Here is a detailed description of what each of the scripts running as part of the nearline does. They all reside in the NearlineMonitoring package in the dunetpc code. The most up-to-date versions of these scripts are kept in the feature branch jpdavies_nearlineMonitoring.

ProcessNewFiles.sh

  • It runs as a cron job once every 2 minutes on the lbne35t-gateway02 machine.
  • It runs a series of safety checks including checking to make sure it has been passed the right number of input arguments, checking that the local disk space is not already more than 95% full, and checking to make sure that it is not already running. If any of these checks fail, then the script quits and doesn't attempt to process any new files.
  • It will look for new files between 10 minutes and 60 days old and check to see if they have already been processed or are currently being processed. If both of those things are false, then it will pass the file to the script ProcessSingleFile.sh to be processed.
  • It will keep adding ProcessSingleFile jobs until it hits the maximum number of jobs it will allow to run at one time (currently set to 4), or it runs out of files to be processed.
  • The list of possible files to process is sorted in reverse order by run number so that the newest files are always given processing priority.
  • Files to be processed are looked for in the /data/lbnedaq/data/transferred_files directory.
  • Usage: ./ProcessNewFiles.sh {maximum number of jobs to run} {LArSoft version: ex - v04_32_01} {compiler: ex - e9}
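The safety checks can be sketched as follows. The lock-file name, the messages, and the use of /data for the disk check are assumptions; only the three checks themselves come from the description above.

```shell
#!/bin/bash
# Sketch of the ProcessNewFiles.sh safety checks (assumed details).

safety_checks() {
  # 1. Must be passed exactly three arguments (max jobs, version, compiler).
  if [ "$#" -ne 3 ]; then
    echo "usage: ProcessNewFiles.sh {max jobs} {LArSoft version} {compiler}" >&2
    return 1
  fi

  # 2. The local disk must not already be more than 95% full.
  local used
  used=$(df --output=pcent /data 2>/dev/null | tail -1 | tr -dc '0-9')
  if [ -n "$used" ] && [ "$used" -ge 95 ]; then
    echo "disk more than 95% full, refusing to run" >&2
    return 1
  fi

  # 3. Must not already be running (a stale lock here is the failure
  # mode mentioned in the troubleshooting section).
  local lock="${LOCK_FILE:-/tmp/ProcessNewFiles.lock}"
  if [ -e "$lock" ]; then
    echo "already running ($lock exists)" >&2
    return 1
  fi
  : > "$lock"
}
```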

ProcessNewFilesEVD.sh

  • It runs as a cron job once every 2 minutes on the lbne35t-gateway02 machine.
  • It does the same thing that ProcessNewFiles.sh does except that it handles the processing of raw files to create the event display output files.

ProcessSingleFile.sh

This script isn't run on its own; it is only called by ProcessNewFiles.sh.

  • It sets up a hard link to the file it is going to process in the /data/lbnedaq/data/nearline-monitoring-links/ directory (which prevents the file from being deleted while the nearline is accessing it.)
  • It sets up the appropriate LArSoft offline environment.
  • Then it runs lar -c test_stitcher_nearlineana.fcl over the input file and dumps all of its output to the appropriate directory. All of the screen output is currently NOT kept as a log file but instead is piped to /dev/null since this LArSoft job is VERY verbose and the log files are too big.
  • This script also handles the creation of files called *.LOCK and *.DONE that ProcessNewFiles.sh looks for to determine if the file has already been or is currently being processed.
  • Lastly the hard link set up at the beginning is removed.
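Put together, the sequence looks roughly like this. The environment-setup step is elided to a comment, the link-directory variable is introduced for illustration, and whether the .LOCK marker is removed at the end is an assumption; the order of operations follows the list above.

```shell
#!/bin/bash
# Sketch of the ProcessSingleFile.sh sequence (assumed details).

process_single_file() {
  local input="$1" outdir="$2"
  local base; base=$(basename "$input")
  local link="${LINK_DIR:-/data/lbnedaq/data/nearline-monitoring-links}/$base"

  ln "$input" "$link"       # hard link keeps the raw file alive while we read it
  : > "$outdir/$base.LOCK"  # tell ProcessNewFiles.sh this subrun is in progress

  # ... set up the appropriate LArSoft offline environment here ...
  # The job is very verbose, so screen output is discarded, not logged.
  lar -c test_stitcher_nearlineana.fcl "$input" > /dev/null 2>&1

  : > "$outdir/$base.DONE"  # mark the subrun as finished
  rm -f "$outdir/$base.LOCK" "$link"
}
```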

ProcessSingleFileEVD.sh

This script isn't run on its own; it is only called by ProcessNewFilesEVD.sh.

  • It does the same thing that ProcessSingleFile.sh does except that it runs the event display job ctreeraw35t_trigTPC.fcl.

MakeNearlineWebPlots.sh

This script is currently run on the lbne35t-gateway02 machine and does a few simple things followed by calling the root macros that make the OnMon plots for the nearline webpage. It is run in three separate cron jobs for the three separate time periods over which the nearline plots are made. It runs once every 10 minutes to make the 24 hour plots, once every hour to make the 1 week plots, and once per day to make the 1 month plots.

  • It makes three temporary text files that contain lists of the most recent files on disk within 24 hrs, one week, and one month.
  • Then it runs NearlinePlotMaker.C
  • Usage: ./MakeNearlineWebPlots.sh {dunetpc release: ex - v04_32_01} {compiler: ex - e9} {time period in days over which to make the plots}
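The list-building step amounts to a time-windowed find over the nearline output area. A sketch, assuming the cut is on file modification time (NearlinePlotMaker.C re-checks the real time window anyway):

```shell
#!/bin/bash
# Sketch of the list-building step in MakeNearlineWebPlots.sh
# (cutting on modification time is an assumption).

NEARLINE_DIR="${NEARLINE_DIR:-/lbne/data2/users/lbnedaq/nearline}"

# Print nearline root files modified within the last N days, one per line.
list_recent() {
  local days="$1"
  find "$NEARLINE_DIR" -name '*.root' -type f -mtime "-$days" | sort
}

# e.g. list_recent 1 > nearline_files_24hr.txt
```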

NearlinePlotMaker.C

This is the root macro that does the heavy lifting of making the plots for the nearline webpage. It can be run on its own but it is intended to be run by being called from the MakeNearlineWebPlots script.

  • It opens the text file created by MakeNearlineWebPlots and loops over the list of nearline output files.
  • Opening each file, it lifts out the histograms that contain the nearline metrics and makes the plots that will be posted to the webpage.
  • The final canvases are saved to disk in a directory that can be seen by the nearline website.
  • Usage: root -l "NearlinePlotMaker.C+($N)" (where N = the number of days)

CleanUpNearlineFiles.sh

  • It runs as a cron job every 6 hours.
  • It deletes old files and cleans up files from stale jobs.
  • It deletes EVD files older than 12 hours.
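The EVD retention rule is the only cut whose value is stated above; a sketch, with the directory default assumed from the Stage 3 description:

```shell
#!/bin/bash
# Sketch of the CleanUpNearlineFiles.sh EVD cut: delete event-display
# output older than 12 hours (directory default is an assumption).

EVD_DIR="${EVD_DIR:-/lbne/data2/users/lbnedaq/nearline_evd}"

cleanup_evd() {
  # 12 hours = 720 minutes
  find "$EVD_DIR" -type f -mmin +720 -delete
}
```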

Quick Troubleshooting

Here lies a list of common problems and how to solve them:

Problem: Nearline web plots aren't updating.
Diagnosis: The time stamps on the nearline web plots are older than {10 minutes, 1 hour, 1 day} for the {24 hour, 1 week, 1 month} plots.
Solution: Refresh your browser, then check that the scripts are running as part of the crontab on the gateway02 machine. Try running the scripts manually to see if there are any obvious errors printed to the screen.

Problem: New data isn't being processed by the nearline scripts.
Diagnosis: Runs have been going for a couple of hours and the nearline web plots are updating, but without any new information for those runs.
Solution: Check that the ProcessNewFiles script is being run on the gateway02 machine, and try running it manually. It may be that a stale lock file in the /tmp/ area on gateway02 is preventing this job from running (this can happen if the machine was improperly shut down.) Check the /data/lbnedaq/data/transferred_files/ area for new data files. If nothing new is there, then a problem may exist with the data transfer scripts (contact Tom Junk.)

Interpreting the Plots

Nearline Machine Maintenance