Production Tools and Procedures

Introduction

This page is intended to serve as an introduction to production procedures, chiefly aimed at new group members looking to learn the ropes.

Getting the permissions to run as production

Getting production permissions has proven to be not so straightforward in the past, so hopefully these steps will speed up the process. Provided you already have the usual computing privileges, these are the steps to follow (please remember that these are just some guidelines developed through past experience, by no means a full "how-to"):

  • From the website https://fermi.service-now.com, click on "Service Request Catalog", then go to "Accounts" and select "Affiliation\Experiment Computing Accounts". Choose "E-929 (NOvA)" as the experiment and "Production" as the role.
    The keyword here is "VOMS novapro accounting group", which should take you to the right people. Then try again, and if it still doesn't work:
  • Send an email to Arthur Kreymer describing the error you are seeing and any other information you can think of.

If none of that succeeds, you are essentially hopeless.

Production Overview

The NOvA Production Group handles the processing of files through many data tiers along a few processing paths.

Raw to ROOT conversion

NOvA data is written in a special raw format. The first step for production is to convert the files to the ART-ROOT format to make further processing possible. This conversion is applied to all ND and FD files, regardless of trigger stream. The input data-tier is raw, the output is artdaq. The job fcl for this processing is prod_artdaq_job.fcl, which lives in the Production/fcl package.

Simulation

To produce Production group files there are two steps. First one must make the fcls for simulation. We produce one fcl file for each simulation file that we are to generate. Instructions at the wiki page below:

Once you've produced the fcl files, you can submit jobs. Instructions for submission are at the wiki page below:

Simulated files will be of the artdaq data tier. These act as input to produce the output reco tier files. More information is in the next section.

Analysis Path (reco, PID, CAF)

NOvA analysis groups primarily depend on CAFs to produce physics results, but CAFs are built from the output of many ART modules. These modules are conceptually split into two categories: reconstruction (reco) and particle identification (PID). Reconstruction modules cluster hits into prongs and tracks. PID modules determine higher-level physics quantities, typically first by identifying the particle type for reconstructed tracks and prongs, then determining the energy and momentum assuming a particular particle hypothesis. From a logistical standpoint, it makes sense to split up reconstruction and PID processing. Reconstruction is slightly more stable and generally requires more computing resources than PID, so splitting off PID allows PID iterations to be based on the same reconstructed input.

Reconstruction takes as input either data or simulated (MC) files of the artdaq data tier to produce the output reco tier. After reconstruction, the pidpart job produces two output data tiers: pidpart and lemsum. The lemsum files (stripped of all but the essential information) are shipped off to Caltech for LEM processing, and the output lempart files are transferred back to FNAL. LEM results in the lempart files are then merged with the rest of the PID in the pidpart output to produce pid files. CAF processing takes the pid files as input to produce the output caf data tier.

| Job fcl | Purpose |
| prod_reco_cosmics_job.fcl | Reconstruction for cosmic trigger data or CRY simulation. Input tier: artdaq; output tier: reco |
| prod_reco_numi_job.fcl | Reconstruction for NuMI trigger data or GENIE simulation. Input tier: artdaq; output tier: reco |
| prod_pidpart_job.fcl | Primary PID processing job. Input tier: reco; output tiers: pidpart, lemsum |
| prod_pid_lempart_pidpart_mixer_job.fcl | LEM mixing job, combines pidpart and lempart files. This job runs over pidpart files and automatically fetches the corresponding lempart file, assuming it exists; otherwise it crashes. Input tier: pidpart; output tier: pid |
| prod_caf_job.fcl | Standard CAF job for all but FD NuMI trigger files. Input tier: pid; output tier: caf |
| prod_caf_blinded_and_unblinded_job.fcl | CAF job for FD NuMI trigger files. Produces two sets of CAFs: one (caf) has had analysis information removed; the other (restrictedcaf) retains all of the analysis information. Input tier: pid; output tiers: caf, restrictedcaf |
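
The tier-to-tier flow in the table above can also be written down as a small lookup structure. The following is only an illustrative sketch: the job names and tiers are copied from the table, but the JOBS dictionary and the tier_chain helper are hypothetical and not part of any NOvA tool. The lemsum to lempart step happens outside these jobs, so it does not appear in the map.

```python
from collections import deque

# Job fcl -> (input tier, output tiers), copied from the table above.
JOBS = {
    "prod_reco_cosmics_job.fcl": ("artdaq", ["reco"]),
    "prod_reco_numi_job.fcl": ("artdaq", ["reco"]),
    "prod_pidpart_job.fcl": ("reco", ["pidpart", "lemsum"]),
    "prod_pid_lempart_pidpart_mixer_job.fcl": ("pidpart", ["pid"]),
    "prod_caf_job.fcl": ("pid", ["caf"]),
    "prod_caf_blinded_and_unblinded_job.fcl": ("pid", ["caf", "restrictedcaf"]),
}


def tier_chain(start, goal):
    """Return the shortest list of tiers leading from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Follow every job whose input tier is the tier we are currently at.
        for in_tier, out_tiers in JOBS.values():
            if in_tier != path[-1]:
                continue
            for tier in out_tiers:
                if tier not in seen:
                    seen.add(tier)
                    queue.append(path + [tier])
    return None


print(tier_chain("artdaq", "caf"))
# ['artdaq', 'reco', 'pidpart', 'pid', 'caf']
```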

PID Library Access -- Must use CVMFS

The "library" files used by PID modules are large. The standard Fermilab setup scripts point to locations on bluearc with a limited file access rate. For any grid running, the libraries should be pulled in from CVMFS instead of bluearc. More details can be found in the CVMFS section of this page.

LEM Mixing Datasets

The LEM mixing job (prod_pid_lempart_pidpart_mixer_job.fcl) requires a lempart file to exist corresponding to the input pidpart file, otherwise it crashes. This means that input pidpart datasets must include a special check for the existing lempart file. It is possible to write SAM constraints that check metadata parameters of both parent and child files, thus allowing an arbitrary ancestry tree to be traversed. The prototype of a pidpart dataset which checks for a lempart cousin is as follows:

data_tier pidpart and ischildof:( data_tier reco and isparentof:(isparentof:(data_tier lemsum and isparentof:(data_tier lempart))))

That is of course a mouthful, but the defman utility can create these definitions (and a draining counterpart) when passed the -p, --prodCAF argument.
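
To make the nesting in the prototype easier to read, the string can be assembled piece by piece. This is a hedged sketch of plain string building only: the helper names has_child and lem_ready_pidpart are invented here, and this is not how defman itself constructs the definition.

```python
def has_child(child_dims):
    """Wrap constraints so they select parents of files matching child_dims."""
    return "isparentof:(" + child_dims + ")"


def lem_ready_pidpart():
    """Rebuild the prototype pidpart definition shown above, piece by piece."""
    # lemsum files that already have a lempart child
    lemsum_with_lempart = "data_tier lemsum and " + has_child("data_tier lempart")
    # The prototype wraps the lemsum constraint in two isparentof levels.
    reco = "data_tier reco and " + has_child(has_child(lemsum_with_lempart))
    # Finally select pidpart files descended from such reco files.
    return "data_tier pidpart and ischildof:( " + reco + ")"


print(lem_ready_pidpart())
```

The printed string should match the prototype character for character, which is a convenient way to check that the parentheses balance before handing the dimensions to SAM.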

Calibration Path (pclist, pcliststop, timecal)

The calibration processing path produces slimmed down files with minimal reconstruction. Three different data tiers are produced as output -- pclist, pcliststop and timecal -- all produced by a single job fcl. Drop statements are applied to the output to eliminate superfluous information and keep files small. Cosmic pulser trigger files are processed for the Far Detector; for the Near Detector, cosmic pulser, DDActivity1 and DDCalMu triggers are processed. There are two different jobs used for calibration processing, both found in the Production/fcl package.

| Job fcl | Purpose |
| prod_pclist_job.fcl | Basic calibration job, used for FD. |
| prod_pclist_removebeamspills_job.fcl | Calibration job used for ND, includes RemoveBeamSpills module. |

Keep-up Reconstruction Path

Keep-up reconstruction aims to provide access to limited reconstruction on recent data, primarily for data quality assessment. These jobs produce data tiers reco and caf, distinguished from the analysis path by including the string keepup in the nova.special metadata parameter. For the FD, an additional set of CAFs (data tier restrictedcaf) is produced which includes all information which was stripped from the standard set in compliance with the NOvA blinding policy. Separate jobs are used for ND and FD, both found in the Production/fcl package.

| Job fcl | Purpose |
| prod_reco_keepup_fd_numi.fcl | FD keep-up job |
| prod_reco_keepup_nd_numi.fcl | ND keep-up job |

Grid Processing and Submission

Jobs

ART reconstruction jobs (i.e. any job which runs producer modules on an existing data or MC file) are submitted using submit_nova_art.py. The jobs themselves use two tools: art_sam_wrap.sh, a wrapper which sets up the environment and fetches files, and runNovaSAM.py, which runs the nova ART executable and handles output. Those tools are documented on a dedicated page: Submitting NOvA ART Jobs.
Off-site users must find a way to pretend they are on-site in order to see these pages.

CVMFS

Tagged releases of novasoft (and externals) are available through the CERN Virtual Machine File System (CVMFS). In essence, this can be thought of as software in the cloud. The releases are compiled, then all source code and binaries are uploaded to CVMFS. The novasoft setup can then be directed to set up the software from the CVMFS location. As an added bonus, Open Science Grid (OSG) nodes (including FermiGrid) are smart enough to cache a copy of the software on a local disk in case it needs to be used again by another job.

For jobs running off-site, CVMFS is a necessity since the build stored on bluearc is inaccessible. Jobs running on FermiGrid don't necessarily need CVMFS, except for when large files would be read from bluearc. PID (pidpart) jobs are an example of those which use large files, since the PID event libraries typically total hundreds of MB.

Jobs submitted through submit_nova_art.py can simply supply the --cvmfs flag to set up novasoft from CVMFS.

SAM

SAM is documented in the SAM Web Cookbook and a few other places, but there are some production-specific details of which new production group members should be aware.

In no particular order, for now.

Convention for Dataset Definitions

The naming convention is as follows:

prod_DATATIER_RELEASE_DETECTOR_FLAVORSET_SPECIAL

For MC, FLAVORSET corresponds to GENIE/CRY. For data, it is replaced with the trigger stream. The SPECIAL part can be as long as required, and may even contain more underscores, but it should reflect any other constraints which give the definition meaning.
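
As a small illustration of the convention, the sketch below joins the fields in the stated order. The function name definition_name is hypothetical, and leaving out SPECIAL when it is empty is an assumption based on the shorter definition names that appear elsewhere on this page.

```python
def definition_name(tier, release, detector, flavorset, special=None):
    """Join the fields as prod_DATATIER_RELEASE_DETECTOR_FLAVORSET_SPECIAL."""
    parts = ["prod", tier, release, detector, flavorset]
    # Assumption: SPECIAL is simply omitted when there is nothing to add.
    if special:
        parts.append(special)
    return "_".join(parts)


print(definition_name("pid", "S15-05-22", "nd", "numi"))
# prod_pid_S15-05-22_nd_numi
```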

Draining Dataset Definitions

So-called "draining datasets" are frequently used for top-off submissions. The intention of a draining dataset is to isolate files which satisfy a primary set of constraints but do not have children which satisfy a secondary set of constraints. In other words, draining datasets isolate files which have not successfully been processed. The datasets are referred to as "draining" because they shrink as files are processed, but there is nothing to prevent them from growing if more files are produced which match the primary set of constraints.

The simplest example of a draining dataset would include minimal constraints:

data_tier artdaq minus isparentof:(data_tier reco)

That example is hardly useful, however, since there are so many files which match both sets of constraints. A more realistic example uses constraints on more metadata parameters to precisely select a restricted set of files. For example,

defname: prod_pid_S15-05-22_nd_numi minus isparentof:( defname: prod_caf_S15-05-22a_nd_numi )

Draining datasets are automatically created by defman if the --linear (-l) or --prodCAF (-p) flags are provided. The draining datasets are based on the children parameter in the configuration of each tier. If one or more children are specified, a draining dataset is made for each of them.
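
The realistic example above follows a fixed pattern: a parent definition minus the parents of an already-produced child definition. The sketch below reproduces that pattern as a string and models the "minus" logic on plain Python sets. Both helper names are made up for illustration, the toy file names are hypothetical, and this is not defman's implementation.

```python
def draining_dimensions(parent_defname, child_defname):
    """Build 'parent minus parents-of-child' dimensions from two definition names."""
    return "defname: {0} minus isparentof:( defname: {1} )".format(
        parent_defname, child_defname
    )


def unprocessed(inputs, parents_of_outputs):
    """Toy model of the same logic: inputs that have no output child yet."""
    return set(inputs) - set(parents_of_outputs)


print(draining_dimensions("prod_pid_S15-05-22_nd_numi", "prod_caf_S15-05-22a_nd_numi"))

# The set shrinks ("drains") as more inputs acquire children.
print(sorted(unprocessed(["f1", "f2", "f3"], ["f1"])))        # ['f2', 'f3']
print(sorted(unprocessed(["f1", "f2", "f3"], ["f1", "f3"])))  # ['f2']
```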

Definitions from defman

There is a definition management application called defman which handles dataset creation based on a simple JSON formatted input file. Since much of the effort in production comes from managing datasets, defman can greatly streamline the process.

Documentation for defman

File Transfer Service (FTS) and Dropboxes

This section should describe the motivation for all of our nova-specific use patterns for FTS and link to any generic (experiment agnostic) documentation. The former includes hash/hex directories, configuration patterns, VM names, monitoring links, etc. We'll probably have to do some digging for the latter, or contact Andrew and Robert to see what they have for us.

What is a dropbox?

In UNIX terms, a dropbox is just a directory. The difference is that there is an FTS instance monitoring that dropbox. The role of the FTS is to declare (check in) files to SAM and transfer them to their final locations. In other words, the dropboxes allow production to drop off the files and have FTS handle the rest. For ART-ROOT files, the final locations are on dCache and tape. CAFs go to dCache and tape, but also get locations on bluearc for convenient access.

Dropbox locations

The dropbox directories are located in the dCache scratch (non-volatile) area within the following directory:

/pnfs/nova/scratch/fts/

In that directory, there are a variety of dropbox directories, each with a particular purpose.

| Dropbox | Purpose |
| CAF_dropbox | Analysis path CAFs |
| Calibration_dropbox | Calibration path |
| FCL_dropbox | FCL files for simulation |
| General_dropbox | Unused? |
| Keepup_dropbox | Keep-up reco and caf files |
| MCdaq_dropbox | Base simulation artdaq files |
| Nearline_dropbox | Nearline files. Unused? |
| ParticleID_dropbox | Analysis-path PID files, including LEM. |
| Raw2Root_dropbox | ROOT, aka artdaq files converted from raw format |
| Reconstruction_dropbox | Analysis-path reconstructed files. |

Hash/hex directories

During large production runs, it is possible for the FTS dropboxes to contain several tens of thousands of files. If those files were all in a single directory, listing them would become a time-consuming process. To prevent this, production jobs sort files into a three-tiered directory structure, each level with 16 directories named with the characters [0-9, a-f]. The directory to which a particular file is sent is determined by taking the first three hex digits of the MD5 checksum of the file name, which spreads files roughly evenly over the 16^3 = 4096 leaf directories while remaining reproducible for any given name. Both Bash and Python versions have been implemented, and they give the same answer.

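The Python version (used in runNovaSAM.py) takes the first three characters of the MD5 hex digest of the file name. The original was written with the Python 2 md5 module (md5.new), which no longer exists in Python 3; an equivalent based on hashlib looks something like this:

import hashlib
import os

def hash_path(pathname):
    # Only the file name is hashed, never the directory part.
    head, tail = os.path.split(pathname)
    digest = hashlib.md5(tail.encode("utf-8")).hexdigest()
    # One subdirectory level per hex character: head/a/b/c/tail
    dirs = [head] + list(digest[:3]) + [tail]
    return os.path.join(*dirs)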

The Bash version relies on the md5sum command line utility. Keep it in your back pocket for when you need to find the subdirectory in which a particular file landed. The full implementation, of course, looks like a button mash (the name Bash actually comes from merging the beginning and end of the words "button" and "mash," respectively [citation-needed]).

 hashname() { local fname=$(basename "$1"); local newpath=$(dirname "$1")$(echo -n "$fname" | md5sum | cut -b1-3 | sed 's;.;/&;g')/$fname; echo "$newpath"; }

For a quick check of a file's MD5 checksum, visit this link and type the file's name in the white box. The first three digits of the output below "MD5 Hash" determine the directory where the file is going to land, or is already stored, in the dropbox.
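
The same three characters can also be computed locally rather than through a web page. The sketch below assumes only Python's standard hashlib module; the helper name hash_subdir is made up for illustration, and the file name in the usage line is hypothetical.

```python
import hashlib
import os


def hash_subdir(filename):
    """Return the three-level subdirectory (for example '9/0/0') for a file name."""
    # Only the base name is hashed, so a leading directory does not matter.
    name = os.path.basename(filename)
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Each of the first three hex characters becomes one directory level.
    return "/".join(digest[:3])


# Where a (hypothetical) CAF file would be expected under the CAF dropbox:
print(os.path.join("/pnfs/nova/scratch/fts/CAF_dropbox",
                   hash_subdir("example_file.caf.root")))
```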

FTS Configuration

FTS uses the Python ConfigParser mini-language, which is based on the Microsoft INI format. The standard documentation mostly revolves around writing code, but it provides a few details on the config syntax. Users can find additional details on the FTS redmine project wiki or see the dedicated page describing the configurable parameters. The NOvA-specific File Transfer Service (FTS) for Offline page provides documentation for interacting with the FTS, e.g. starting or stopping the processes, changing the configuration, etc.
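
For readers who have not met the INI-style syntax before, the sketch below parses a tiny configuration with Python's standard configparser module (named ConfigParser in Python 2). Every section and option name in it is invented purely to show the syntax; none of them is a real FTS parameter, so consult the FTS documentation for the actual names.

```python
import configparser

# NOTE: all section and option names here are hypothetical placeholders,
# chosen only to demonstrate sections, options and %(name)s interpolation.
EXAMPLE_CONFIG = """
[DEFAULT]
hypothetical_base = /pnfs/nova/scratch/fts

[example_caf_block]
hypothetical_dropbox = %(hypothetical_base)s/CAF_dropbox
hypothetical_pattern = *.caf.root
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONFIG)

# Values in [DEFAULT] are visible from every section and can be interpolated.
for section in parser.sections():
    print(section)
    print("  dropbox:", parser.get(section, "hypothetical_dropbox"))
    print("  pattern:", parser.get(section, "hypothetical_pattern"))
```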

Debugging FTS issues

  1. Pick a job. Look in the log file for that job to determine the name of the output file and the dropbox location. Assuming you know the output file name, you could actually do the hash procedure to determine the directory, but it's always nice to steal that from the log file if you can.
  2. Once you know the file name and target locations, check the dropbox. Is it there? If so, proceed. If not, the problem is upstream of FTS.
  3. Determine which instance of the FTS should have found the file. Go to the web monitor for that FTS instance and try to find your file in the error, new, or pending group. If it's there, try to digest the state/error. If you can't find it, proceed.
  4. Check the FTS configuration at the bottom of the monitor page. Is there a configuration block which matches your file? Does one of the scan patterns actually find your file? A good way to check this is some find command action. Copy/paste the dropbox and scan pattern into a find command and let 'er rip. Does find find it? If not, figure out what's wrong with the scan pattern and fix it in the config. If so, proceed.
  5. This is where it gets murkier. If the web monitor doesn't have any information, it might be buried in the FTS logs. In the novapro home directory, there is a "logdir" directory for each FTS instance. The logs are separated into a chunk for each day, so knowing which day your file landed in the dropbox is helpful. Open up the log with less or whatever and grep around for the file name. Often you can find the file in there with an error message. Your best bet at this point is to send an email to the sageliest of nova computing gurus, as well as the FTS experts. Robert Illingworth is a hero, and always super helpful.
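
The find check in step 4 can also be done from Python, which is handy inside a larger script. This is a rough sketch only: it assumes the scan pattern behaves like a shell-style glob applied to the file name, which may not match how FTS itself interprets its patterns, and the function name find_in_dropbox is hypothetical.

```python
import fnmatch
import os


def find_in_dropbox(dropbox, pattern):
    """Return paths under dropbox whose file name matches a shell-style pattern."""
    matches = []
    # os.walk descends through the hash/hex subdirectories as well.
    for dirpath, _dirnames, filenames in os.walk(dropbox):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling find_in_dropbox with a dropbox directory and a pattern copied from the monitor page behaves roughly like running find on that directory with -name and the same pattern; an empty list suggests the pattern (or the dropbox) is the problem.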

Restarting FTS

To restart the FTS, run the restart_all.sh script in the latest version of the NovaFTS package. You do not need to log in as novapro to run this script. The script logs in as novapro on the various novasamgpvm machines and restarts the instances of FTS running on each of them.

Production testing

Nightly jobsub_client-based tests are used to verify that the production processes described above work in the latest development and tagged builds of novasoft. The output of these tests is collated on the testing webpage and provides a comprehensive breakdown of the status of each production process on each of the important data streams and Monte Carlo flavour sets.

The results of these tests are summarised in terms of the identified successes and failures of each chain of production tiers, with detailed tier-by-tier and datastream-by-datastream results and benchmarks also provided.

Significantly more information on what is run and how it is run can be found in the overview and configuration pages.

Datasets

Files produced by production are provided to the collaboration through SAM definitions. The available "official" datasets are described in the official datasets page.

Crontabs

It is important to keep a record of what crontabs are installed and run on different machines as these often don't survive system maintenance at Fermilab. To this end we keep crontabs (named by machine they are run on) in our subversion repository.

These crontabs are kept at https://cdcvs.fnal.gov/redmine/projects/novaart/repository/entry/trunk/Production/cron

How to create and install crontab files

In order to create a file from the current crontab one can do:

crontab -l > <filename>

In order to install the contents of a file as the current crontab for a machine do:

crontab <filename>

For example, on novasamgpvm01 a directory containing the relevant svn-controlled package is located at ~/crontab/

[novapro@novasamgpvm01 ~]$ crontab -l > ~/crontab/cron/novasamgpvm01
[novapro@novasamgpvm01 ~]$ echo "* * * * * bash ~/user_dirs/jpdavies/do_something_useful.sh" >> ~/crontab/cron/novasamgpvm01
[novapro@novasamgpvm01 ~]$ cd ~/crontab/cron
[novapro@novasamgpvm01 cron]$ svn commit -m "Updated crontab on novasamgpvm01 to do_something_useful once a minute"
[novapro@novasamgpvm01 cron]$ crontab ~/crontab/cron/novasamgpvm01