Information from UsingJobSub (old ifront information)

All information on this page should be considered deprecated and/or obsolete. It is maintained for historical purposes only.

7/14/14 update: This information only applies to the gpsn01.fnal.gov submission node. When gpsn01 is replaced by the jobsub client/server package running on the fifebatch.fnal.gov cluster, this usage of jobsub_tools will be deprecated. Consequently, the following information has NOT been significantly updated since 2013.

Background

Intensity Frontier experiments use condor to manage batch jobs on pools of worker nodes. Two of these pools are on fermigrid, which is part of the Open Science Grid. A third pool, called the 'local' pool, is also available to run user analysis jobs. These three pools have different characteristics, which are discussed in more detail below.

jobsub_tools

A suite of tools named, simply enough, jobsub_tools has been developed for managing user applications and data I/O.
See the JOBSUB COMMAND REFERENCE page for details.

Important parts of jobsub_tools are:
  • jobsub, which is used to submit jobs to these condor pools; various input options steer jobs to the different pools and control input and output of data files for the user application running on the worker nodes.
  • ifront_q, a command line tool used to see how many jobs are queued and running
  • dagNabbit.py, a DAG generator to help with chaining and submitting dependent jobs
  • the suite of condor command line tools (condor_q, condor_hold, condor_rm, etc.), appropriately wrapped to keep them from mounting denial-of-service attacks on the grid infrastructure; a short usage sketch follows this list
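
As a quick sketch (assuming jobsub_tools has already been set up as described under "Jobsub setup" below; the condor options shown are standard ones and the output formats vary):

# check on your jobs from the submission environment
ifront_q                     # summary of queued and running jobs
condor_q -submitter $USER    # the wrapped condor_q, restricted to your own jobs
condor_rm <cluster_id>       # remove a job; the cluster id comes from the condor_q/ifront_q output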

Other useful tools that are not part of jobsub_tools but are worth knowing about are:

  • the batch system monitoring pages
    • for most IF experiments, gpsn01
    • for the MINOS experiment minos54
  • the IF Data Handling tool ifdh

Registering for access

  • Get grid access by following the instructions HERE

Jobsub setup

  1. log on to any GPCF machine (<experiment>gpvmNN), for example minervagpvm01, mu2egpvm02, etc.
  2. source /grid/fermiapp/products/common/etc/setups.sh
  3. setup jobsub_tools

for example, a minerva user:

[myuserid@anymachine ~]$ ssh myuserid@minervagpvm01
[myuserid@minervagpvm01 ~]$ 
[myuserid@minervagpvm01 ~]$ source /grid/fermiapp/products/common/etc/setups.sh
[myuserid@minervagpvm01 ~]$ setup jobsub_tools

This will make all the jobsub tools and many important condor monitoring tools available in your command path.
jobsub -h is very informative at this point.

The "probably will not work" error, how to make it work

If you get this error:

[dbox@gpsn01 ~]$ setup jobsub_tools
tried to set up with gid=gpcf. this product will probably not work
dirname: missing operand
Try `dirname --help' for more information.
dirname: missing operand
Try `dirname --help' for more information.

This means that jobsub_tools couldn't figure out which experiment you were trying to submit as; it discovers this from your unix GID or the $GROUP environment variable.
There are two ways to fix this problem:
  1. Easy way - submit from one of the gpcf virtual machines that you have access to as a member of your experiment (minervagpvm02, novagpvm03, mu2egpvm01, etc.)
  2. Harder way - set some environment variables after 'setup jobsub_tools'
  • In the above example, the feckless user was trying to submit from gpsn01. DO NOT SUBMIT DIRECTLY FROM GPSN01.
  • If you want to submit from, say, your linux desktop, this is possible. You need to set 3 environment variables after getting the 'probably will not work' message for your submissions to work:
    • X509_USER_PROXY,
    • CONDOR_TMP, and
    • CONDOR_EXEC.
  • X509_USER_PROXY needs to be set to /scratch/(user)/grid/(experiment)/(user).(experiment).proxy
    • so the above user, after acquiring some feck and realizing he wants to submit a mu2e job, will do the following
      export X509_USER_PROXY=/scratch/dbox/grid/dbox.mu2e.proxy
      
  • CONDOR_TMP and CONDOR_EXEC can be any directory that is group-readable and group-writable by the GID of your experiment (a complete sketch follows this list).
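
As a minimal sketch of the 'harder way' for the mu2e user in the example above (the proxy path repeats the example above; the CONDOR_TMP/CONDOR_EXEC directories are illustrative and just need the right group permissions):

# run these after 'setup jobsub_tools' has printed the "probably will not work" message
export X509_USER_PROXY=/scratch/dbox/grid/dbox.mu2e.proxy    # grid proxy, as in the example above
export CONDOR_TMP=/grid/fermiapp/mu2e/condor-tmp/dbox        # group-readable/writable by the mu2e GID
export CONDOR_EXEC=/grid/fermiapp/mu2e/condor-exec/dbox      # group-readable/writable by the mu2e GID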

Alternate jobsub setup for more than one PRODUCTS database

You may already have products set up for your experiment and not need to run the setups.sh above; in that case, append the common products database to $PRODUCTS:

source /path/to/your/other/ups/setups.sh
export PRODUCTS=$PRODUCTS:/grid/fermiapp/products/common/db
setup jobsub_tools

Mu2e

Mu2e users should run the following command, which sources /grid/fermiapp/products/mu2e/etc/setups.sh internally:

#set up art environment
 . /grid/fermiapp/products/mu2e/setupmu2e-art.sh
#then put jobsub in path
 setup jobsub_tools

The Intensity Frontier Grid Environment

As mentioned above, most jobsub users can submit jobs to three different condor pools:
  • the FNAL GP GRID (aka GPFarm) with about 5600 user slots
  • the CDF GRID with 5200 user slots
  • the local batch with 176 user slots

Condor maintains per user and per experiment quotas on these farms and allocates resources to jobs out of the queue based on its own priority calculations.

On the GPFarm, experiments are preferentially given access to the following numbers of the 5600 available slots (NB: it is actually a fair-share algorithm, so these numbers are approximate):

(these numbers correct as of 3/4/13)
Experiment   Preferential Slots
gm2          200
argoneut     200
uboone       200
lbne         500
mu2e         500
minerva      800
nova         800
minos        1200

An experiment's jobs cannot be evicted from their 'guaranteed' slots. If some other experiment is using one of these slots (which is called opportunistic usage), the other experiment's jobs are evicted.

To submit to the 'guaranteed' slots, use the -g option in jobsub. Omitting the -g sends your jobs to the local batch pool.

Users can request above-quota slots on both the GPFarm and CDF farms by including the --opportunistic flag along with the -g flag in their jobsub incantation. This can potentially bring you lots of extra resources, but your jobs run the risk of being evicted if CDF or other IF experiments request their quota. If your jobs are evicted, they will eventually restart on a different worker node.
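
For example, assuming jobsub_tools is set up and myjob.sh (an illustrative name) lives in an area visible to both the submission node and the workers:

# submit to your experiment's 'guaranteed' slots
jobsub -g /grid/fermiapp/mu2e/condor-exec/dbox/myjob.sh

# additionally request idle above-quota slots; these jobs may be evicted and restarted elsewhere
jobsub -g --opportunistic /grid/fermiapp/mu2e/condor-exec/dbox/myjob.sh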

Environment Variables - Testing that your setup, proxy etc are correct

An OSG compliant worker node sets many environment variables, some of which are important. To see the full list, and to test that you have set up your grid permissions correctly, use jobsub to run a probe job. Perform the following steps:

  1. source /grid/fermiapp/products/common/etc/setups.sh
  2. setup jobsub_tools
  3. sh $JOBSUB_TOOLS_DIR/test/Run_All_Tests.sh
  4. ifront_q to see your job's status. It will eventually move from idle to running, and will not show up in the ifront_q output at all when it has completed
  5. cd $CONDOR_TMP
  6. ls -lart; you are looking for a file with a name like test_grid_env.sh_(stringofnumbers).out
  7. look in this file for a list of environment variables and their values on that machine (see the sketch after the sample output below). It will look something like this:
    Grid Job pid(11797) 20130303 194555: -------------------------------------------
    Grid Job pid(11797) 20130303 194555: JOB STARTED: Sun Mar  3 19:45:55 CST 2013
    Grid Job pid(11797) 20130303 194555: Running as......... uid=44591(mu2eana) gid=9914(mu2e) groups=65013(glexec13)
    Grid Job pid(11797) 20130303 194555: Program............ /grid/fermiapp/mu2e/condor-exec/dbox/test_grid_env.sh
    Grid Job pid(11797) 20130303 194555: PID................ 11797
    Grid Job pid(11797) 20130303 194555: Hostname........... fnpc5029.fnal.gov
    Grid Job pid(11797) 20130303 194555: OSG_SITE_NAME...... FNAL_GPGRID_2
    Grid Job pid(11797) 20130303 194555: HOME............... /grid/home/mu2eana
    Grid Job pid(11797) 20130303 194555: OSG_WN_TMP......... /local/stage1/disk16/dir_4777
    Grid Job pid(11797) 20130303 194555: All environmental variables:
    AGroup=group_mu2e
    ANT_HOME=/usr/local/grid/ant
    CATALINA_OPTS=-Dorg.globus.wsrf.container.persistence.dir=/usr/local/grid/vdt-app-data/globus/persisted
    CLUSTER=8342648
    CONDOR_CONFIG=/local/stage1/disk16/dir_4777/glide_CO4858/condor_config
    CONDOR_DIR_INPUT=/local/stage1/disk16/dir_4777/glide_CO4858/execute/dir_11526/no_xfer/0/TRANSFERRED_INPUT_FILES
    CONDOR_EXEC=/grid/fermiapp/mu2e/condor-exec/dbox
    CONDOR_INHERIT=11526 <131.225.167.82:48442> 0 0
    CONDOR_PARENT_ID=fnpc5029:11526:1362361552
    CONDOR_PROCD_ADDRESS=/local/stage1/disk16/dir_4777/glide_CO4858/log/procd_address
    CONDOR_PROCD_ADDRESS_BASE=/local/stage1/disk16/dir_4777/glide_CO4858/log/procd_address
    CONDOR_TMP=/grid/fermiapp/mu2e/condor-tmp/dbox
    CPN_DIR=/grid/fermiapp/products/common/prd/cpn/v1_3/NULL
    CVS_RSH=ssh
    DAEMON_DEATHTIME=1362363355
    DYLD_LIBRARY_PATH=/usr/local/grid/globus/lib
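
Once the probe job has completed, a quick way to pick the more interesting variables out of the newest output file (file name pattern from step 6; the variable selection here is just an example):

cd $CONDOR_TMP
latest=$(ls -t test_grid_env.sh_*.out | head -1)     # newest probe job output
grep -E 'OSG_SITE_NAME|OSG_WN_TMP|CONDOR_TMP|CONDOR_EXEC|CONDOR_DIR' "$latest"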
    

Environment Variables (Known To Be Important)

$_CONDOR_SCRATCH_DIR - the default area on the worker node where your output should go. On a 'stock' condor installation, any files left here are transferred back to the user; this can cause problems as condor floods some unsuspecting file system with unwanted files. Jobsub resets this variable so that files written here do not come back automatically; they have to be explicitly transferred with CPN or gridftp.

$TMP, $TEMP, $TMPDIR - synonyms for $_CONDOR_SCRATCH_DIR. Until recently these values were not reset when $_CONDOR_SCRATCH_DIR was changed, which caused an output file flood into the wrong directory on gpsn01. This undesirable behavior was fixed as of jobsub_tools v1_1q.

$OSG_WN_TMP - yet another temp area, unique to each worker node. As of this writing (3/3/2013) it has the wrong permissions to be used as a temp area. A service desk ticket will be created.

$CONDOR_TMP - this must exist on the node you submit through (gpsn01, minos54). Your condor command file, condor log, stderr, and stdout all end up here.

$CONDOR_EXEC - this must exist on the node you submit through (gpsn01, minos54). A shell script wrapper is generated here that wraps the job you are submitting, as well as the input/output directives that bring files to and from the worker nodes.

$CONDOR_DIR_INPUT - this directory exists on the worker node if the -f option was used in jobsub. jobsub -f /my/bluearc/dir/data.root will cause 'data.root' to be safely copied to $CONDOR_DIR_INPUT/data.root on the worker node. You can transport as many files as you want in one jobsub incantation: jobsub -f /foo/bar -f /foo/baz -f /foo/bang will result in $CONDOR_DIR_INPUT containing bar, baz, and bang, copied over in a manner guaranteed to be safe for bluearc.

$CONDOR_DIR_(whatever) - this directory exists on the worker node if the -d (whatever) /path/to/where/you/want/output/ flag is specified in your jobsub incantation. When your job completes, the jobsub wrapper will safely copy any files in $CONDOR_DIR_(whatever) back to the /path/to/where/you/want/output/ directory using either CPN or gridftp. It is up to you to construct your grid job so that the output you care about is written to $CONDOR_DIR_(whatever).
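
Putting the -f and -d flags together, a minimal sketch (the script, file names, and nova paths are illustrative):

# on the submission node: ship data.root to the worker and, when the job ends,
# copy anything left in $CONDOR_DIR_OUT back to /nova/data/users/myuser/results
jobsub -g -f /nova/data/users/myuser/data.root \
       -d OUT /nova/data/users/myuser/results \
       /nova/app/users/myuser/process.sh

# inside process.sh, running on the worker node
cp $CONDOR_DIR_INPUT/data.root $_CONDOR_SCRATCH_DIR/       # input was already staged safely by jobsub
# ... do the actual processing in $_CONDOR_SCRATCH_DIR ...
cp $_CONDOR_SCRATCH_DIR/outfile.root $CONDOR_DIR_OUT/      # files here are copied back safely at job end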

Bluearc mounts

The bluearc mountpoints follow this general pattern. They exist on all the worker nodes in GPFarm, CDFFarm, and local batch. <exp> stands for any of minerva, nova, gm2, etc.

Mount Point           Permissions on Worker   Permissions on <exp>vm01   Purpose
/<exp>/data           rw                      rw                         data
/<exp>/app            rx                      rwx                        executables, .so files
/grid/fermiapp/<exp>  rx                      rwx                        executables, .so files
/grid/data/<exp>      rw                      rw                         data

Note that the 'app' areas are writable on the vm nodes where users do development, but not on worker nodes. The 'data' areas are writable from worker nodes but, and this is a big but, DO NOT WRITE DIRECTLY TO THESE AREAS. Use CPN or ifdh cp commands to copy data files between your $_CONDOR_SCRATCH_DIR and the bluearc data areas. The -f and -d flags in jobsub, discussed elsewhere in this document, do this automatically for you.
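
If you manage the copies yourself instead of relying on -f/-d, the safe pattern inside a job script looks roughly like this (paths and file names are illustrative):

# stage input from the bluearc data area to the worker's scratch area
ifdh cp /nova/data/users/myuser/data.root $_CONDOR_SCRATCH_DIR/data.root

# ... produce $_CONDOR_SCRATCH_DIR/outfile.root ...

# stage output back to bluearc; never write to /nova/data directly from the worker
ifdh cp $_CONDOR_SCRATCH_DIR/outfile.root /nova/data/users/myuser/outfile.root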

Using jobsub

For a quick startup example, go to: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Getting_Started_on_GPCF#Interactive-Example-Running-Hello-World

The jobsub command takes many optional arguments. A complete list of jobsub options can be found by running jobsub with the "-h" option.
Usage: jobsub [args] executable [exec_args]
where args are jobsub options, executable is the full path of your job executable, and exec_args are options to your job.
The executable must be in an area of Bluearc space to which both the local submission node (gpsn01.fnal.gov) and the worker nodes have access.
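
For example, a minimal submission to the local pool might look like this (the script path and its two arguments are illustrative):

# myjob.sh is on Bluearc, visible to both gpsn01 and the worker nodes;
# 'run42.root 1000' are passed through to myjob.sh as its own arguments
jobsub /grid/fermiapp/mu2e/condor-exec/dbox/myjob.sh run42.root 1000

# check its progress
ifront_q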

To verify the environment is correctly set up, execute $JOBSUB_TOOLS_DIR/test/Run_All_Tests.sh

Using jobsub to execute DAGs

Condor supports the use of DAGs (Directed Acyclic Graphs) for submitting dependent jobs and for flow control. dagNabbit.py takes a formatted input file and generates Condor command files and a Condor DAG control file. dagNabbit.py uses the jobsub "-n" option, which generates a Condor job submission file but does not execute the job; that file is then submitted to Condor by the DAG control file.

dagNabbit.py instructions
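
Conceptually, what dagNabbit.py automates looks roughly like the hand-written sketch below, using the jobsub "-n" option and a standard Condor DAG file (the stage scripts and the generated .cmd file names are hypothetical; real names follow the pattern shown in the tarball example below):

# generate, but do not submit, a Condor command file for each stage
jobsub -n -g /nova/app/users/myuser/stage1.sh     # writes e.g. $CONDOR_TMP/stage1.sh_<timestamp>.cmd
jobsub -n -g /nova/app/users/myuser/stage2.sh

# my.dag: make stage2 run only after stage1 succeeds
#   JOB    A  /nova/data/condor-tmp/myuser/stage1.sh_<timestamp>.cmd
#   JOB    B  /nova/data/condor-tmp/myuser/stage2.sh_<timestamp>.cmd
#   PARENT A CHILD B

condor_submit_dag my.dag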

tarball support

jobsub has input flags to help create and submit tarballs:


    --input_tar_dir=INPUT_TAR_DIR
                        create self extracting tarball from contents of
                        INPUT_TAR_DIR.  This tarball will be run on the worker
                        node with arguments you give to your_script
    --tar_file_name=TAR_FILE_NAME
                        name of tarball to submit, created from
                        --input_tar_dir if specified
    --overwrite_tar_file
                        overwrite TAR_FILE_NAME when creating tarfile using
                        --input_tar_dir
  • here is a usage example
    
    [dbox@novagpvm01 tartest]$ ls
    infile.root  process.sh
    [dbox@novagpvm01 tartest]$ ./process.sh 
    usage: ./process.sh infile outfile
    copy in file 'infile', process it and write it to 'outfile'
    [dbox@novagpvm01 tartest]$ ./process.sh infile.root outfile.root
    processing done.
    infile:infile.root
    outfile:outfile.root
    [dbox@novagpvm01 tartest]$ ls
    infile.root  outfile.root  process.sh
    
  • to run this same example, along with its input file infile.root, as a tarball:
[dbox@novagpvm01 tartest]$ rm outfile.root 
[dbox@novagpvm01 tartest]$ cd ..
[dbox@novagpvm01 dbox]$ jobsub -g --nowrapfile --overwrite_tar_file --input_tar_dir /nova/app/users/dbox/tartest --tar_file_name=/nova/app/users/dbox/mystuff.tar process.sh infile.root outfile.root
/nova/data/condor-tmp/dbox/process.sh_20130422_163600_21022_0_1.cmd
submitting....
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 9290617.

Using jobsub to submit to remote sites

  • show_entrypoints command
    -sh-3.2$ show_entrypoints | tail
    FNAL_seaquest_opportunistic    Up      All:All
    FNAL_uboone                    Up      All:All
    FNAL_uboone_opportunistic      Up      All:All
    HARVARD_nova                   Up      All:All
    KISTI                          Up      All:All
    MIT                            Up      All:All
    SMU_nova                       Up      All:All
    factory                        Up      All:All
    fermigrid                      Down    All:All
    fermigrid_SL5                  Down    All:All
    
  • --site=(some entry point from show_entrypoints command)
 jobsub --nowrapfile --site SMU_nova -g $CONDOR_EXEC/test_grid_env.sh

Condor commands included in jobsub tools

build_links            condor_dump_history  condor_q               condor_run                  condor_userprio
condor                 condor_findhost      condor_qedit           condor_stats                condor_vacate
condor_checkpoint      condor_glidein       condor_release         condor_status               condor_vacate_job
condor_check_userlogs  condor_history       condor_reschedule      condor_submit               condor_version
condor_cod             condor_hold          condor_rm              condor_submit_dag           condor_wait
condor_compile         condor_load_history  condor_router_history  condor_transfer_data        CVS
condor_config_val      condor_power         condor_router_q        condor_userlog              genCondorWrappers.pl
condor_dagman          condor_prio          condor_router_rm       condor_userlog_job_counter  generic_wrapper