Submitting NOvA ART Jobs¶
- Table of contents
- Submitting NOvA ART Jobs
- Generic Grid Executables
- Job Submission
Many analyses will require users to run a
nova ART job on the grid. A suite of tools has been developed to make this task simple and easy. This page describes those tools and their usage. While the tools make the task relatively simple, users should be aware that they are about to harness a sophisticated suite of technology. A solid understanding of this suite will help you and the experts to debug any problems; it may also adequately prepare you for a software engineer position at Google.
In an ideal (i.e. dream) world, users would be able to write a simple configuration and get their work done with little thought. The
submit_nova_art.py utility was created as an effort to realize that dream. In reality, however, that submission script relies on a host of other software which users should attempt to understand. Usage of
submit_nova_art.py is described in a dedicated section on this page. New users in a hurry are welcome, but not encouraged, to skip straight to that section; but those who do will be ill prepared to resolve errors if they occur.
Grid Jobs and SAM¶
Novice users have often demonstrated poor understanding of the interplay between grid jobs and SAM. The systems are in fact two distinct components which are used in conjunction. Later sections will assume an understanding of these concepts, so new users should read this section carefully.
A grid is a large cluster of worker nodes controlled by a submission (or head) node. Each worker node provides CPUs along with a local disk for temporary file storage. The submission node maintains a queue of jobs which need to be run and distributes those jobs to worker nodes based on a user priority system. Submitting jobs to the grid means adding jobs to the queue. Jobs must be configured to run a specific executable along with any required arguments. On Fermigrid, these configurations are transmitted to the submission node using the
jobsub_client system. More details on
jobsub_client can be obtained through the official FIFE documentation.
Sequential Access Metadata (SAM) is a data handling solution developed by Fermilab's Scientific Computing Division (SCD) to efficiently deliver tape-archived files. The tape archive is supplemented by the large dCache disk array which stores recently used files. Technically, SAM is just a database of file names, locations and metadata; in practice, it's the bit of machinery that ties everything together. One of the key features of SAM is that it insulates users from nitty-gritty file details like names and locations in favor of higher-level information cataloged by the file metadata. Metadata classify files based on their key features, like processing tier, run number, trigger stream, generator type, etc. Files can be grouped using constraints on the file metadata, for instance:
data_tier reco and online.detector fardet and online.stream 0 and online.runnumber 12942
The interactive SAM Web Cookbook provides a host of examples involving metadata constraints.
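Such a constraint string can be used directly with the samweb command-line tools. For example (assuming samweb is already set up in your environment):

```shell
# The dimension string below is the example constraint from the text.
DIMS="data_tier reco and online.detector fardet and online.stream 0 and online.runnumber 12942"

# Count the matching files without printing the full list:
samweb count-files "$DIMS"

# List the matching file names:
samweb list-files "$DIMS"
```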
SAM dataset definitions can be used to package up a set of constraints. Once a definition exists which suits a user's needs, a SAM project can be created. A project uses a snapshot of a dataset definition as a list of input files. Each job (established as a process in SAM language) communicates with a project to request file locations and provide status updates. Job status can be tracked using the SAM Station Monitor. The Station Monitor lists all recent projects with a link to a page which displays a myriad of information, including the status of each process. The wiki-based SAM Web Cookbook provides other examples involving project functionality.
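The definition-and-project workflow above can be performed by hand with samweb; in the sketch below the definition and project names are placeholders, and submit_nova_art.py normally does the project step for you:

```shell
# Freeze a set of metadata constraints into a named dataset definition:
samweb create-definition my_fardet_reco_r12942 \
  "data_tier reco and online.detector fardet and online.runnumber 12942"

# Start a project over a snapshot of the definition; each grid job
# then establishes a process under this project to request files:
samweb start-project --defname=my_fardet_reco_r12942 my_project_name
```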
Generic Grid Executables¶
Prior to addressing job submission, which is covered in a later section, we must discuss the executables which will run on the worker nodes. A simple example of such an executable would be a shell script which sets up the novasoft environment, copies a file to the local disk and runs a
nova ART job over it. In order to prevent scores of users from writing custom scripts which do the same thing, a pair of scripts have been developed to handle a wide variety of processing scenarios. The first of these scripts is called
art_sam_wrap.sh, which generally serves to set up the environment, fetch files from SAM and run a sub-executable over each file. The prescription described in the submission section uses
runNovaSAM.py as that sub-executable, since it carefully handles the naming of output files prior to running a
nova ART job. A more complete description for each of those scripts can be found below.
Initially developed by SCD,
art_sam_wrap.sh is a general purpose wrapper for retrieving files cataloged by SAM. A copy of the script was committed to the nova offline repository (currently stored in the Metadata/samUtils package in the novasoft repository) in early 2014 and modified slightly, effectively branching the software from the SCD-maintained version. (At some point we may move back to the SCD-supplied version. If you notice that we already have, please modify this text accordingly.)
This script is commonly submitted to the grid as the primary executable that will be run on the grid node. The script picks up a SAM project using the
$SAM_PROJECT_NAME environment variable, typically exported to the grid nodes using the condor
-e argument. A process is established with the SAM station so that file status can be reported and monitored using the Station Monitor web interface. File locations are obtained via SAM and the files are fetched to the local scratch space, then passed to an executable supplied through the -X argument. The original intention was for this executable to be an ART executable (e.g.
nova), so the -c option is used to pass along the path to a fcl job configuration file. The job fcl file is supplied to
art_sam_wrap.sh through the --config argument. Common usage for NOvA is to instead use
runNovaSAM.py (described below) as the executable, which picks up the
-c option and eventually passes it on to the
nova ART job.
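In outline, the wrapper's file-handling loop behaves something like the following sketch. The ifdh command names are real, but the argument lists are abbreviated and the details are illustrative only; this is not the actual art_sam_wrap.sh.

```shell
# Conceptual sketch only -- not the real art_sam_wrap.sh.
# PROJECT_URL is looked up from the station via $SAM_PROJECT_NAME.
CPID=$(ifdh establishProcess "$PROJECT_URL" nova "$NOVA_VERSION" \
       "$(hostname)" "$USER" art demo 1)
while URI=$(ifdh getNextFile "$PROJECT_URL" "$CPID") && [ -n "$URI" ]; do
  LOCAL=$(ifdh fetchInput "$URI")      # copy the file to local scratch
  nova -c "$FCL" "$LOCAL"              # or the -X sub-executable, e.g. runNovaSAM.py
  ifdh updateFileStatus "$PROJECT_URL" "$CPID" "$LOCAL" consumed
done
ifdh endProcess "$PROJECT_URL" "$CPID"
```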
Other convenient arguments include:
|--source SCRIPT[:ARG:ARG...]|Sources an arbitrary bash script. This can be used to set up the software. Arguments can be supplied, but spaces must be replaced with a colon (:).|
|--export VAR=VALUE|Exports an environment variable into the job environment.|
|-X EXECUTABLE|Run the executable (e.g. nova or runNovaSAM.py) over each fetched file.|
|--limit N|Maximum number of files to process in multifile mode.|
|--getconfig|Fetch fcl configuration files from SAM project, useful for MC generation.|
Even more arguments can be found using the script's built-in help.
Although it is possible to use
art_sam_wrap.sh to run a
nova ART job, there are frequent operations which must be performed before and after it is run. A python script called
runNovaSAM.py has been developed as a general platform for performing these operations. It features options for naming output files, copying them to a destination and sorting those files within that destination. Copy-out functionality is enabled with the --copyOut argument. File names are controlled with the --outTier, --cafTier and --histTier options.
|--copyOut|Enables copy-back to the output directory, determined from the DEST environment variable.|
|--outTier|Enables naming and copy-out for ART event ROOT files. Format is <name_in_fcl_outputs>:<data_tier>.|
|--cafTier|Enables naming and copy-out for CAF files. Format is <cafmaker_module_label>:<data_tier>.|
|--histTier|Enables naming and copy-out for ART "hist" files from the TFileService. Format is <id>, producing output_name.<id>.root.|
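As an illustration of how these options fit together, a hypothetical runNovaSAM.py invocation might look like the following. In practice art_sam_wrap.sh constructs this call and supplies the input file; the file and tier names here are made up, and the exact flag forms may differ from the real script.

```shell
# Hypothetical invocation, assembled by art_sam_wrap.sh in real use:
runNovaSAM.py -c recojob.fcl \
  --copyOut \
  --outTier out1:reco \
  --histTier hist
```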
Job Submission¶
Submitting jobs to run over files in a SAM project involves two steps. First, the project must be started using
samweb start-project. After that, jobs can be submitted which will establish themselves as processes under that project. There exists a script,
submit_nova_art.py which wraps up these two steps into one configurable script. Users are encouraged to use that script and work with experts on adding any functionality which may be missing. This section will cover the basic features of
submit_nova_art.py, then move on to describe the generic submission script and show some examples. An appendix breaks down the gory details of an example
jobsub_client submission block.
Please note: before you submit any jobs using a SAM definition, you must check whether the files in the definition are cached, and pre-stage them if they aren't. The following instructions explain why.
Most of NOvA's files are not readily available; they are stored on archive tapes in the SCD tape library. To use files in permanent storage like this, they need to be recovered and put into a fast disk cache ("staged") from which they can be accessed in real time. This happens automatically when the files are requested, but it's usually a slow process because a robot has to go and physically pull a tape out of a library of physical tapes and read it to obtain the file. If you submit a grid job over a definition with files that aren't cached, the grid job waits, doing nothing, until the robot can read the tape and transfer the file to the cache. This wastes a lot of grid resources that otherwise would be available for somebody else to use, and if you do this, you'll likely get an automated email from FIFE warning you that your job efficiency was too low. (More information on the tape and cache system is at Tape_and_Cache.) Instead, stage the files in advance, prior to job submission.
If you haven't used a particular definition within the last 30 days, do the following before submitting any jobs with it:
- Check the cache status of your definition using the cache_state.py script with the -d flag, as documented on the Tape_and_Cache page.
- If your definition isn't 100% cached, it will need to be prestaged. Instructions for prestaging datasets are at the bottom of the SAM_web_cookbook page.
If your definition contains more than 1000 files, please consult the Production conveners before starting a prestaging process for it; prestaging a large dataset without coordinating with Production can interfere with the Production schedule.
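Sketched as commands, the check-then-prestage procedure looks like the following. The cache_state.py usage follows the Tape_and_Cache page; the prestage command shown is one common samweb approach and may differ from the cookbook's exact recipe, and the definition name is a placeholder.

```shell
# 1. Check what fraction of the definition is already on disk cache:
cache_state.py -d my_dataset_definition

# 2. If it is not 100% cached (and is under 1000 files, or Production
#    has been consulted), prestage it:
samweb prestage-dataset --defname=my_dataset_definition
```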
The general purpose script
submit_nova_art.py can be used to submit nova art jobs that require SAM input. This script does extensive error checking to ensure that the arguments supplied are valid (for example that the specified output directory exists), starts up a SAM project and submits jobs to the grid. It uses the new
jobsub_client suite for job submission. It is possible to supply all required information through command line arguments or a configuration file. Required and optional arguments are described in an extensive help message (use
--help to view). In general, the help message is always the most up-to-date documentation.
Usage:
$ submit_nova_art.py <arguments>
OR
$ submit_nova_art.py -f CONFIG_FILE
Arguments can be passed to the submitter through the command line. Some users may opt to write these arguments in a shell script. Arguments can also be written in a plain, whitespace-insensitive text file, as will be shown in the example.
At a minimum, you must specify a job name, the input dataset definition, job fcl, novasoft tagged release (software version) and output destination. Note: submit the jobs from an environment which has the same tagged release set up as the jobs are configured to use, since the script will check for consistency. You specify this minimum information with the following options:
--jobname JOBNAME        Job name
--defname DEFNAME        SAM dataset definition to run over
--config CONFIG, -c CONFIG
                         FHiCL file to use as configuration for the nova executable. The path given should be relative to the $SRT_PRIVATE_CONTEXT of any test release you submit
--tag TAG                Tag of novasoft to use
--dest DEST              Destination for output files
You can use the
--print_jobsub option to print the jobsub command. The
--test option is used to run error checking and print the jobsub command, but does not actually start the SAM project or do the job submission.
--print_jobsub           Print jobsub command
--test                   Do not actually do anything, just run tests and print the jobsub command
--gdb                    Run nova executable under gdb, print full stack trace, then quit gdb
--test_submission        Override other arguments given to submit a test to the grid. It will run 1 job with 3 events and write the output to /pnfs/nova/scratch/users/<user>/test_jobs/<date>_<time>
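For example, to check a submission before committing to it (job.cfg here is a placeholder configuration file):

```shell
# Run the error checks and print the jobsub command, without starting
# a SAM project or submitting anything:
submit_nova_art.py -f job.cfg --test

# Or send a single 1-job, 3-event test submission to the grid:
submit_nova_art.py -f job.cfg --test_submission
```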
Job Control Options¶
For realistic cases, you will most likely want to split the processing into several jobs
--njobs NJOBS            Number of jobs to submit
--maxConcurrent MAXCONCURRENT
                         Run a maximum of MAXCONCURRENT jobs simultaneously
--files_per_job FILES_PER_JOB
                         Number of files per job - if zero, calculate from number of jobs
--nevts NEVTS            Number of events per file to process
--no_multifile           Do not use art_sam_wrap.sh multifile mode, which is on by default
--txtfiledef             Use if the input definition is made up of text files, each containing a list of file names
--opportunistic          Run opportunistically on the fermigrid
--offsite                Allow to run on offsite resources as well. Implies --opportunistic and --cvmfs.
--offsite_only           Allow to run solely on offsite resources. Implies --cvmfs.
--amazon                 Run at amazon. Implies --cvmfs.
--site SITE              Specify allowed offsite locations. Omit to allow running at any offsite location
--recommended_sites      Specify known working offsite locations.
--os OS                  Specify OS version of worker node
--disk DISK              Local disk space requirement for worker node in MB.
--memory MEMORY          Local memory requirement for worker node in MB.
--expected_lifetime EXPECTED_LIFETIME
                         Expected job lifetime (default is 10800s=3h). Valid values are an integer number of seconds or one of "short" (6h), "medium" (12h) or "long" (24h, jobsub default)
--dynamic_lifetime LIFETIME
                         Dynamically determine whether a new file should be started based on glidein lifetime. Specify the maximum time a single file is expected to take to process, in seconds.
--group GROUP, -G GROUP  Specify batch group GROUP -- mainly used to set job priority. At present, the only supported value is nova
--role ROLE              Specify role to run on the grid. Can be Analysis (default) or Production. This option is no longer supported
--continue_project CONTINUE_PROJECT
                         Don't start a new samweb project, instead continue this one.
--snapshot_id ID         Use this existing snapshot instead of creating a new one.
--mix MIX                Pass a mixing script to the job to pull in files for job mixing.
art_sam_wrap.sh multifile mode is turned on by default, but can be turned off using the
--no_multifile option if desired.
--no_multifile Do not use art_sam_wrap.sh multifile mode, which is on by default
The following options control nova software.
--maxopt                 Run in maxopt mode
--testrel TESTREL        Use a test release at location TESTREL. It will be tarred up and sent to the worker node.
--user_tarball USER_TARBALL
                         Use existing test release tarball in specified location rather than having jobsub make one for you (conflicts with --testrel)
--reuse_tarball          Reuse a tarball that is already in resilient space. If using this option, avoid a trailing slash in the --testrel option. (conflicts with --user_tarball)
--cvmfs                  Does nothing (always true), but retained for compatibility: pull software from CVMFS.
--novasoftups            Use the ups build of novasoft; must be used with --source to set up.
--ngu_test               Set up the test version of NovaGridUtils in the grid jobs.
--ngu_version NGU_VERSION
                         Set up a specific NovaGridUtils version in the grid jobs.
--lemBalance             Choose lem server based on (CLUSTER+PROCESS)%2 to balance load
--lemServer LEMSERVER    Specify lem server
File Output Options¶
Most use cases require a method to copy back output. You can either use the built-in copy-out method by supplying the --copyOut option, or use --copyOutScript COPYOUTSCRIPT to specify a script to copy your output back. If you use the built-in copyOut method, you must also specify at least one of --outTier, --cafTier or --histTier.
--copyOutScript COPYOUTSCRIPT
                         Use script COPYOUTSCRIPT to copy back your output
--copyOut                Use the built-in copy-out mechanism. If used, you must specify --outTier, --cafTier or --histTier
--logs                   Return .log files corresponding to every output
--zipLogs                Format logs as .bz2 files. Implies --logs
--outTier OUTTIER        Data tier of the output file, multiple allowed, formatted as <name_in_fcl_outputs>:<data_tier>
--cafTier CAFTIER        Module label for CAF output, multiple allowed. Format as <cafmaker_module_label>:<data_tier>
--histTier HISTTIER      File identifier string for TFileService output, only one allowed. Supply as --histTier <id> for output_name.<id>.root, where output_name is assembled based on the input file.
--outputNumuDeCAF        Make standard numu decafs for all CAF files produced during the job
--outputNueDeCAF         Make standard nue decafs for all CAF files produced during the job
--outputNumuOrNueDeCAF   Make standard nue or numu decafs for all CAF files produced during the job
--outputNusDeCAF         Make standard nus decafs for all CAF files produced during the job
--npass NPASS            To specify npass (aka nova.subversion)
--skim SKIM              To specify nova.skim
--systematic SYSTEMATIC  To specify nova.systematic
--specialName SPECIALNAME
                         To specify nova.special name
--hashDirs               Use hash directory structure in destination directory.
--runDirs                Use run directory structure in destination directory, 000XYZ/XYZUW for run number XYZUW.
--noCleanup              Pass --noCleanup argument to runNovaSAM.py. Necessary when using a postscript for copy-out.
--jsonMetadata           Create JSON files with metadata corresponding to each output file, and copy them to the same destinations
--declareFiles           Declare files with metadata on the worker node
--production             Submit production style jobs. Implies "--role=Production --hashDirs --jsonMetadata --zipLogs", and checks that other settings needed for production are specified
--calibration            Submit calibration style jobs. Implies "--role=Production", and checks that other settings needed for calibration are specified
--declareLocations       Declare the file output locations to SAM during the copy back of the files
Job Environment Options¶
There are a handful of methods for controlling the job environment.
--export EXPORT          Export variable EXPORT to art_sam_wrap.sh
--source SOURCE          Source script SOURCE
--prescript PRESCRIPT    Execute script PRESCRIPT before executing runNovaSAM.py
--postscript POSTSCRIPT  Execute script POSTSCRIPT after executing runNovaSAM.py
--inputfile INPUTFILE    Copy this extra input file into the job area before running the executable
To export any environment variables, make sure to export that variable in your environment before submitting the job. For instance, to set the version number (Nova.SubVersion metadata parameter), do export NPASS=2 in your terminal, and add --export NPASS to your job configuration.
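For example, the pass-number case from the text looks like this in the submitting shell:

```shell
# Set the value in the shell you are submitting from...
export NPASS=2
# ...and add "--export NPASS" to your submit_nova_art.py arguments
# so the variable is forwarded to the grid jobs.
echo "NPASS=$NPASS"   # prints NPASS=2
```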
-h, --help               Show this help message and exit
-f FILE, --file FILE     Text file containing any arguments to this utility. Multiple allowed. Arguments should look just like they would on the command line, but the parsing of this file is whitespace insensitive. Comments will be identified with the # character and removed.
Use a custom fhicl file¶
An alternative to the quoted text above (--config/-c) when using your very own fhicl file is to pass it to the job yourself. In this case, first ensure your fhicl file is copied into dCache somewhere (/pnfs/nova/scratch/users/<your username> is probably the best choice). Then, add these lines to your submit_nova_art.py configuration:
--inputfile /pnfs/path/to/fcl/<fclname>.fcl -c <fclname>.fcl
Example Submission Configuration¶
For this example, we will use the text file input method, where the file is passed to
submit_nova_art.py using the
--file option. The parsing of the file is whitespace insensitive and allows comments escaped with the # character.
# Example configuration for submit_nova_art.py
# Usage: submit_nova_art.py -f <this file>
# Use --test to run sanity checks without creating project and submitting

# Job and project options
--jobname davis_count_argon_atoms    # Name of your project/jobs, be creative
--defname prod_reco_S14-11-25_homestake_genie_nonswap    # SAM dataset definition, defines files to be processed
--njobs 1500                         # Number of jobs to run
--files_per_job 20                   # Maximum number of files to be processed by each job
--opportunistic                      # Run in opportunistic mode, i.e. steal non-NOvA nodes, optional
--print_jobsub                       # Print jobsub submission block, good for records

# novasoft options
-c argoncounterjob.fcl               # Job fcl for nova executable
--testrel /nova/app/users/davis/dev_2014-02-08_chlorine    # Path to test release, optional. Note lack of trailing slash
--reuse_tarball                      # Reuse the newest tarball for the above test release, stored in /pnfs/nova/resilient/...
--tag development                    # Tagged release of novasoft to use
--maxopt                             # Run in maxopt, optional

# Copy-back: options for built-in runNovaSAM.py
# Advanced usage can replace this block with the --copyOutScript option
--dest /nova/ana/users/davis/SolarAnomaly/    # Output directory
--copyOut                            # Copy back output to --dest location
--runDirs                            # Sort output by run number
--outTier out1:arcount               # Extension for ART-ROOT output stream out1: arcount.root
--histTier argon_hist                # Extension for hist (TFileService) output: argon_hist.root
--cafTier=cafmaker:caf               # Extension for CAFMaker with module label cafmaker: .caf.root
My job is submitted. Now what?¶
Information on monitoring jobs can be found here: Monitoring Grid Jobs
Anatomy of a jobsub_client Submission¶
This section does not serve as a replacement for the full jobsub_client documentation, but it does attempt to describe all of the components in an ART/SAM job using the
art_sam_wrap.sh and runNovaSAM.py scripts. The
jobsub_submit executable is used for submission. A fully configured submission is as follows:
jobsub_submit \
  -N 800 \
  --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC \
  -G nova \
  -e SAM_PROJECT_NAME -e SAM_STATION -e IFDH_BASE_URI -e IFDH_DEBUG -e EXPERIMENT \
  --role=Analysis \
  file:///grid/fermiapp/nova/novaart/novasvn/releases/FA14-11-25/Metadata/samUtils/art_sam_wrap.sh \
  --multifile \
  --export EXTERNALS='/nusoft/app/externals' \
  --export DEST=/pnfs/nova/scratch/fts/ParticleID_dropbox/ \
  --export CVMFS_DISTRO_BASE='/cvmfs/oasis.opensciencegrid.org/nova' \
  --config Production/fcl/prod_pidpart_job.fcl \
  --source /grid/fermiapp/nova/novaart/novasvn/setup/setup_nova.sh:-r:FA14-11-25:-b:maxopt \
  --limit 100 \
  -X runNovaSAM.py \
  --copyOut \
  --outTier out1:pid
That is a bit of a mouthful, so we can take a look at the arguments one by one.
Submit 800 jobs:
-N 800 \
Run on NOvA dedicated nodes as well as opportunistically on other nodes:
--resource-provides=usage_model=DEDICATED,OPPORTUNISTIC \
Specify the nova group for accounting purposes:
-G nova \
Export a few necessary environment variables:
-e SAM_PROJECT_NAME -e SAM_STATION -e IFDH_BASE_URI -e IFDH_DEBUG -e EXPERIMENT \
SAM_PROJECT_NAME tells art_sam_wrap.sh which project to talk to.
Specify role for grid proxy/authentication:
--role=Analysis \
(The default is Analysis, so this is pedantic.)
Tell jobsub_submit which executable to use, art_sam_wrap.sh in this case:
file:///grid/fermiapp/nova/novaart/novasvn/releases/FA14-11-25/Metadata/samUtils/art_sam_wrap.sh \
Note, the arguments which follow are no longer arguments to jobsub_submit; they are arguments for art_sam_wrap.sh.
Run over more than one file per job:
--multifile \
Export the location of the external software packages.
--export EXTERNALS='/nusoft/app/externals' \
Export the output destination, which is passed on to runNovaSAM.py for the copy-out step:
--export DEST=/pnfs/nova/scratch/fts/ParticleID_dropbox/ \
Specify the fcl configuration to run:
--config Production/fcl/prod_pidpart_job.fcl \
Tell art_sam_wrap.sh to source the novasoft setup script:
--source /grid/fermiapp/nova/novaart/novasvn/setup/setup_nova.sh:-r:FA14-11-25:-b:maxopt \
Set the limit for the number of files per job in multifile mode:
--limit 100 \
Tell art_sam_wrap.sh to use
runNovaSAM.py as the executable.
-X runNovaSAM.py \
Note, the remaining arguments are not arguments for
art_sam_wrap.sh, but instead
for runNovaSAM.py. They tell it to copy output files to $DEST and specify which sort of files should be copied out. Note, --histTier and --cafTier are also valid options.
--copyOut \
--outTier out1:pid
Running Offsite¶
The submit_nova_art.py script supports running offsite using the --offsite and --offsite_only options. Use
--offsite if you don't care where your jobs run. Use
--offsite_only if you want to force your jobs to run only on offsite grid nodes. You can target specific sites by using the
--site option. The following are the sites available:
- FZU (Prague).
- Harvard. (DO NOT use)
- MIT. (DO NOT use)
- MWT2 (Mid-West Tier 2). (DO NOT use)
- OSC (Ohio SuperComputing Center).
- UChicago. (DO NOT use)
- TTU. (DO NOT use)
--offsite --site Harvard (type the name of the site as listed above) forces your jobs only to run at Harvard. You can specify the
--site option multiple times. So:
--offsite --site Harvard --site FZU would force your jobs to run at Harvard or at FZU (Prague), but nowhere else.
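Putting these options together on the command line looks like the following sketch; job.cfg is a placeholder configuration file, and FZU and OSC are taken from the site list above (sites without a DO NOT use warning):

```shell
# Allow running offsite anywhere, in addition to Fermigrid:
submit_nova_art.py -f job.cfg --offsite

# Force running only offsite, restricted to two specific sites:
submit_nova_art.py -f job.cfg --offsite_only --site FZU --site OSC
```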
The performance plot (find it at the bottom of this page) helps the user decide which offsite locations are more likely to complete jobs successfully. The performance score takes into account: the fraction of jobs successfully completed, the total time used to complete the full set of submitted jobs, the idle time taken to start the first job, and the average time to process an individual file. The score runs continuously from 1 to 16, where the best possible score is 1. The Offsite bin on the vertical axis indicates the performance of jobs sent using the
--offsite_only option. The performance plot, average version, presents the average performance of the last week. The performance plot, latest version, presents the latest test. NOVA-doc-14304 has more detailed metrics of the latest test and the average of the last week. The performance plot is updated regularly.
The following link presents the configurations required to run at each of the non-Fermilab sites. Most sites allocate 2500 MB of memory, except for MWT2 and UChicago, which allocate 2000 MB, and UCSD and Omaha, which allocate 4096 MB and 4000 MB respectively. To meet the memory requirement for each site, use the --memory option, indicating the requested memory value. NOvA submission scripts have a default memory value of 4000 MB.
The BlueArc disks are not visible at offsite nodes. This means that test releases will not work offsite. It also means that if you want to use a custom fcl file, you will need some extra magic. Keep your fcl file in the directory you are submitting from, then also add the option
--inputfile /absolute/path/to/fcl. In the future, this should be made more user-friendly.
By default, the script only allows you to submit your jobs to a predefined list of sites. If you want to submit to a site not on the list, define the environment variable
EXTRA_ALLOWED_SITES as a colon delimited list of additional sites you want to be allowed to use. This is intended as an expert feature to allow testing of new sites without maintaining locally modified copies of
submit_nova_art.py. If there are additional sites you want added to the list, you should contact
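As a sketch, the expert override described above might look like this; the site names are placeholders:

```shell
# Colon-delimited list of extra sites to allow beyond the built-in list:
export EXTRA_ALLOWED_SITES="NewSiteA:NewSiteB"
submit_nova_art.py -f job.cfg --offsite --site NewSiteA
```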