How to Configure (and Run) Production Jobs

Largely for production, though some help for non-productioneers too!

Introduction

The configuration files for submit_nova_art.py are simply made up of arguments to the program. The key is that the .cfg files can include other configuration file fragments, labeled .inc, with

-f <somefile>.inc

which allows a configuration file to be built up from pre-existing fragments. The contents of the files are inserted in place in the order they appear, and later settings override earlier ones, so defaults can be overwritten by placing new values later in the file. The predefined fragments available are in novaart:source:trunk/NovaGridUtils/configs.
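For example, a minimal sketch of the override behavior (the value here is hypothetical):

-f production.inc   # fragment sets production defaults, possibly including memory
--memory 2500       # appears later in the file, so it overrides any default from the fragment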

Configure the Job

Template for a standard production job like full-chain reconstruction

This is a template for a simple production job, like running the full-chain reconstruction.

# Not production? Delete this line
-f production.inc

# Choose where the job will run
# Most jobs can and should run everywhere
-f everywhere.inc
#-f onsite.inc
#-f offsite.inc

# Properties of this job
--jobname <name of job>
--defname <name of input definition>
--tag <release to be used>
-c <fhicl file>

# Define output tiers. 
-f outputs_Prod3_PID.inc # This inc has the presets for Prod3 reconstruction. You likely don't want CAFs anymore.
-f outputs_repid.inc         # This inc has the presets for Prod4 reconstruction.

# Specify the number of jobs to submit
--njobs <N jobs requested> 

# Default in production.inc is 4 hours
# If your job will run for > 4 hours/file or you are not running as production, specify the time/file below
#--dynamic_lifetime <time in seconds for 1 file>

# If your job requires > 1950 MB of ram, specify it here
#--memory <memory in MB>

Templates for other common jobs

Template for respinning CAF and deCAF files

# Not production? Delete this line
-f production.inc

# This is an I/O-intensive job best run at FNAL
-f onsite.inc

# Include sets fhicl, output, and resource settings
-f caf_respin.inc

# Properties of this job
--jobname <name of job>
--defname <name of input definition>
--tag <release to be used>

# Specify the number of jobs to submit
--njobs <N jobs requested> 

MC Generation templates, along with instructions for making fhicl files, can be found here:
So You Want to Make Some Monte Carlo (A Qwik Start Guide for non-Productioners)
So You Want to Make Some Monte Carlo (A Qwik Start Guide for Production)

Prestaging your input dataset

Most of our data files aren't readily accessible: they are permanently archived in Fermilab's tape storage system (Enstore), but to be used in jobs they must first be retrieved. A caching system sits in front of Enstore so that recently requested files can be accessed quickly.

When files need to be manually staged

Any data files you need to use as input for jobs must be in the disk cache BEFORE you submit any jobs. This includes:
  • Data or MC artdaqs you plan to run reconstruction on
  • Files used in an overlay procedure (simulated ND rock singles GEANT4 files, FD cosmic data artdaqs, simulated ND neutrino singles GEANT4 files, etc.)

A good rule of thumb is that if you've written the name of a definition in your config above, you need to make sure it's staged. FD cosmics are a bit trickier; see the section below.

Checking if files are in the cache

The tool cache_state.py can be used to check the cache status of any files. To check a definition, try something like:

$ cache_state.py -d prod_pid_R17-03-01-prod3reco.l_fd_genie_fluxswap_rhc_nova_v08_full_v1_evaluate
Retrieving file list for SAM dataset definition name: 'prod_pid_R17-03-01-prod3reco.l_fd_genie_fluxswap_rhc_nova_v08_full_v1_evaluate'...  done.
Checking 2064 files:
 0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
Cached: 1269 (61%)    Tape only: 795 (39%)

(For other uses see its --help, which is reasonably 'help'ful.)

If fewer than 95% of the files in the definition are cached, you need to stage them manually as described in the next section.

Manually staging files

For the most part, prestaging files is straightforward. (Overlay files can be an exception; see the next section.) Run the following command in a screen or tmux session; it will likely take hours, or perhaps even days, depending on the size of your definition:

samweb prestage-dataset --defname=<your definition here> --parallel=5

Please note that the disk caching system's performance degrades quickly as the number of outstanding requests grows, so it's advisable to prestage only a single dataset at a time if at all possible. If several datasets need prestaging, coordinate on #production so they run serially.
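If you're not familiar with screen or tmux, a minimal sketch with tmux (the session and definition names are placeholders):

tmux new -s prestage
samweb prestage-dataset --defname=my_input_def --parallel=5
# detach with Ctrl-b d; reattach later with: tmux attach -t prestage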

Manually staging overlay files

In some cases, overlay files are drawn from a definition that is significantly larger than the number of files that will be used in the overlay procedure. (FD cosmic overlay onto FD GENIE MC is a good example: there are ~500K cosmic data artdaqs, while we typically only make ~5-6K FD GENIE MC files, each of which only requires a single data cosmic file.) You don't want to prestage the entire overlay dataset, since it's too large to fit in the cache at once and you don't need most of the files anyway. Instead, use the tool overlay_prestage_def.py to create a new definition that is the subset of the overlay definition corresponding to the same (run, subrun) pairs as the FCLs in your simulation definition:

$ overlay_prestage_def.py -o <my_output_defn> <overlay_defn> <fcl_defn>

Once this finishes (it may take a while depending on how large your <fcl_defn> is), you can use the instructions in "Manually staging files" above to prestage your output definition <my_output_defn>.
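Putting the two steps together, the full overlay prestage workflow looks like this (all definition names here are made up):

overlay_prestage_def.py -o my_overlay_subset_def my_overlay_def my_fcl_def
samweb prestage-dataset --defname=my_overlay_subset_def --parallel=5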

Testing your Configuration

Before submitting large-scale production, you should always run a test job. The ECL form will ask for a directory containing the output of a test job.

Running a test job is simple. Set up your configuration as you want for the final submission, but when you run do the following:

submit_nova_art.py -f myconfig.cfg --test
submit_nova_art.py -f myconfig.cfg --test_submission

The first command will just show you the jobsub_submit command, which you should always check first.

The second command will submit a short job that you can use to test your configuration. It will put output files in a directory like:

/pnfs/nova/scratch/users/<user>/test_jobs/<timestamp>

The job submission will print out the directory. This is what you should put in the ECL entry.
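To see what came back, list that directory once the job completes (using the path the submission printed):

ls -l /pnfs/nova/scratch/users/<user>/test_jobs/<timestamp>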

Once the test job is done (see Monitoring), you should check the following things in the output directory:

  • Are there any output files in your test job directory?
    • An empty directory doesn't necessarily indicate a failure on your part, but you NEED to check the job logs to make sure!
  • Are the correct output files there, and do they have sensible file names?
    • Make sure that if a special name or systematic was specified that it ends up in the file name.
  • Check that the file has the proper metadata
    • Use sam_metadata_dumper to confirm that the metadata is correct.
      sam_metadata_dumper `pnfs2xrootd /path/to/thefile.root`
      
    • Be careful to check that the skim, special, and systematic fields are set correctly. Also be sure it has all of the required metadata (if only 10-15 lines show up, something went wrong).
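    • For a quick look at those fields, you can filter the dumper output (a sketch; the pattern just matches the field names mentioned above):
      sam_metadata_dumper `pnfs2xrootd /path/to/thefile.root` | grep -iE 'skim|special|systematic'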

If the files produced by the test job fail any of these criteria, you will need to have a look at the job logs from your submission:

jobsub_fetchlog --jobid <jobid-of-your-job> --role=Production
tar -zxf <copied-back-tar-file>

You will then see files ending in .sh, .cmd, wrap.sh, .log, .out, and .err. The most important files for most purposes are the .out and .err files.

The full list of reasons why your jobs may have failed is too long to give here, so unfortunately you'll have to figure that out yourself (the error is normally near the bottom of the files, though).

However, if you are running a Reco/PID job and no files got copied back, it might be that no events passed the filter (this is reasonably likely in a test job of only 3 events). If this is the case, the job will exit with status 0, and near the end of the .out file you will see a line saying something like "No events passed filter, returning."
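A quick way to check for this, once you have extracted the logs as above:

grep -i "passed filter" *.out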

Real Submission

Submit the real job by removing --test_submission:

submit_nova_art.py -f myconfig.cfg

If you are submitting an official production job, make sure to put the full contents of your configuration file as well as the full output of the submit_nova_art.py command in the ECL form.

Common Additional Settings

Add additional jobs to an existing project

Because the jobsub servers can get overloaded if too many jobs are submitted at once, the maximum number of jobs we are allowed to submit at once is 5000. However, we often have datasets that are substantially larger than that (particularly when processing ND simulation). Fortunately, once the first set of 5000 jobs for a dataset is submitted, one can resubmit the same configuration to submit_nova_art.py with a single modification in order to add more jobs.

The process is as follows:

  1. Submit the configuration and make an ECL entry as usual.
  2. Note the name of the SAM project that is created. In the output from the submission, towards the top, you'll notice a line like:
    Station monitor: http://samweb.fnal.gov:8480/station_monitor/nova/stations/nova/projects/<the name you gave this submission>-<date>_<time>
    

    The project name is everything following the last slash character. (If you're unsure, you can also copy-paste this link into a web browser and open it; the SAM monitor page that comes up will have the project name at the very top of the page.)
  3. Modify the configuration for submit_nova_art.py to include the following line:
    --continue_project <project name>
    

    (where you substitute the project name you noted in the previous step for <project name>).
  4. Submit the configuration again as many more times as needed to add to the total number of jobs required. (The --njobs argument in the submit configuration can also be modified as needed.) Be sure to respect the time interval between submissions noted in the submit_nova_art.py output (e.g., if it asks you to wait 5 minutes before submitting any more jobs, please wait 5 minutes); otherwise jobsub can get overloaded and we get angry emails from the Computing Division.
  5. Add an ECL entry for each set of additional jobs. Instead of starting a new checklist entry, however, click "add related entry" at the bottom of the entry you made for the first submission, and choose the "Additional Jobs" form. (This form is much shorter and only requires tags and the submission output.)
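Schematically, the whole cycle looks like this (a sketch; the config name and project string are placeholders):

# first submission creates the SAM project; note the project name it prints
submit_nova_art.py -f myconfig.cfg

# add to myconfig.cfg, using the project name from the first submission:
#   --continue_project myjobname-20240101_120000

# submit again (after the wait interval requested in the output)
submit_nova_art.py -f myconfig.cfg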

Don't know the time per file?

If you are unsure of your job's run time, do the following with the test job produced above.

  • Copy the job's .log.bz2 file out of dCache and examine it with bzcat file.log.bz2 | less.
  • Scroll to the bottom of the file to the TimeTracker section, and look for the time/event:
    TimeReport ---------- Event  Summary ---[sec]----
    TimeReport CPU/event = 6.022724 Real/event = 3.552459
    
    TimeReport ---------- Path   Summary ---[sec]----
    TimeReport             per event          per path-run
    TimeReport        CPU       Real        CPU       Real Name
    TimeReport   3.422480  5.891995   3.422480  5.891995 pid
    TimeReport        CPU       Real        CPU       Real Name
    TimeReport             per event          per path-run
    
  • Note that the CPU and Real times are swapped in the Event Summary.
  • The final runtime you want to use is (Nevents/file) * (time/event) * (1.5); a worked example follows this list.
    • You can determine the typical number of events by checking the metadata of the files in the input dataset with samweb get-metadata
    • The factor of 1.5 is a safety margin since you can only measure typical run time but must specify maximum run time.
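As a worked example using the log above: the labels in the Event Summary are swapped, so the real time is 6.02 s/event. If samweb get-metadata on a typical input file reports, say, 500 events, the lifetime to request is 500 * 6.02 * 1.5 ≈ 4520 seconds, so you would round up and add something like this to your config:

# hypothetical: 500 events/file at 6.02 s/event real time, with the 1.5 safety factor
--dynamic_lifetime 4600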