Grid Jobs

Official jobsub support

Project Home
https://cdcvs.fnal.gov/redmine/projects/jobsub

Project Wiki (Admins & Users)
https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki

Project Technical/Design Documentation
https://cdcvs.fnal.gov/redmine/projects/jobsub/documents

Mailing list for discussions and guidance from developers
jobsub

Monitoring is provided by FIFEMON, at https://fifemon.fnal.gov/monitor/
This uses single sign-on with your Services password.

ANNIE Usage

ANNIE will probably run entirely on local GPGrid resources.

If we use only software within /cvmfs/... we can run anywhere on the Open Science Grid.
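As a hedged illustration (the usage_model values below are the standard jobsub settings for offsite running, not an ANNIE-specific policy, and the script path is a hypothetical placeholder), an OSG-capable submission might look like:

# allow the job to run on OSG sites as well as on GPGrid; assumes all software comes from /cvmfs
jobsub_submit -G annie --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE \
    file:///path/to/your_wrapper.sh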

Sample usage:

#    Set up the jobsub client

. /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups.sh
setup jobsub_client
export JOBSUB_GROUP=annie

#    Look at the commands available

jobsub

#    submit

PROBE=/grid/fermiapp/minos/scripts/probe
SELF=`whoami`
jobsub_submit -g file://${PROBE}
...

Use job id 14069968.0@fifebatch1.fnal.gov to retrieve output

#   Check the queue

jobsub_q --user=${SELF}

#   See what logs are available to fetch

jobsub_fetchlog --list

#    Fetch the logs

JID=14069968.0@fifebatch1.fnal.gov
TLD=/tmp/`whoami`/${JID}

jobsub_fetchlog --job=${JID} --unzip=${TLD};

#  *.log has the condorlog
#  *.err has stderr
#  *.out has stdout

An example

  • ssh to one of the ANNIE gpvms:
    ssh user@anniegpvm01.fnal.gov
  • source the common products setup script
    source /grid/fermiapp/products/common/etc/setup
  • setup the grid submission programs
    setup fife_utils
  • You can optionally also export some grid submission variables here:
    export JOBSUB_GROUP=annie
    export JOBSUB_SERVER=https://fifebatch-dev.fnal.gov:8443   # don't specify unless you know you need something specific
  • You next need a grid script that will copy in your code to run on the grid, along with any files to be processed (this may change with SAM implementation).
    As an example, I'll go through the submission script used for wcsim processing, describing each section. The complete grid.sh script is attached.

1. First you need to set any required environment variables and call the setup commands for the required software. If all of your required software is also available under /cvmfs, you can use that as an alternative code base and run jobs on the OSG. Note that you also need to set up fife_utils during this step.

echo "setting up software base" 
export CODE_BASE=/grid/fermiapp/products                             # <<< if running on the Fermi grid
#export CODE_BASE=/cvmfs/fermilab.opensciencegrid.org/products       # <<< if running on the OSG
source ${CODE_BASE}/common/etc/setup                                 # <<< sourcing 'ups setup' is required
export PRODUCTS=${PRODUCTS}:${CODE_BASE}/larsoft                     # <<< other products locations may be required depending on your products

setup geant4 v4_10_1_p02a -q e9:debug:qt                             # <<< setup Geant4 v10.01.p02
setup genie v2_12_0a -q debug:e10:r6                                 # <<< setup Genie  v2.12
setup genie_xsec v2_12_0 -q DefaultPlusMECWithNC                     # <<< and associated software
setup genie_phyopt v2_12_0 -q dkcharmtau
setup -q debug:e10 xerces_c v3_1_3
setup -q debug:e10:nu root v6_06_08                                  # <<< setup ROOT 6.06.08
source ${CODE_BASE}/larsoft/root/v6_06_08/Linux64bit+2.6-2.12-e10-nu-debug/bin/thisroot.sh
setup -q debug:e9 clhep v2_2_0_8                                     # ^^^ it may be necessary to source setup scripts for CMake to find everything:
setup cmake                                                          # test your compilation on the anniegpvm; if it works there, it should be good on the grid

setup fife_utils                                                     # you will also need fife_utils to copy files in and out 

# export any necessary environmental variables for your build
export XERCESROOT=${CODE_BASE}/larsoft/xerces_c/v3_1_3/Linux64bit+2.6-2.12-e10-debug
export G4SYSTEM=Linux-g++
export ROOT_PATH=${CODE_BASE}/larsoft/root/v6_06_08/Linux64bit+2.6-2.12-e10-nu-debug/cmake
export GEANT4_PATH=${GEANT4_FQ_DIR}/lib64/Geant4-10.1.2
export GEANT4_MAKEFULL_PATH=${GEANT4_DIR}/${GEANT4_VERSION}/source/geant4.10.01.p02
export ROOT_INCLUDE_PATH=${ROOT_INCLUDE_PATH}:${GENIE}/../include/GENIE
export ROOT_LIBRARY_PATH=${ROOT_LIBRARY_PATH}:${GENIE}/../lib

2. Your code needs to have been compiled for the correct architecture corresponding to the node on which it is to run. This may vary between jobs. In order to run on whichever architecture your node happens to have, there are a few options:
  • you can pre-compile for all suitable architectures, then set them up as a product, so that the appropriate version is automatically used on the node
  • you can compile versions for all suitable architectures and copy the correct one in
  • you can request nodes of a particular architecture (see the sketch after this list)
  • or you can compile on the node
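As a hedged sketch of the third option, jobsub_client provides an --OS option to restrict which operating systems your job may land on (check jobsub_submit --help on your client version; the script path below is a placeholder):

# only run on SL6 worker nodes, so a binary built on an SL6 gpvm will match
jobsub_submit -G annie --OS=SL6 --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC \
    file:///path/to/grid.sh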

This example uses the last option: compiling on the grid node. We'll first need to copy in the source files. Since /pnfs/annie is not mounted on the grid nodes, you need to use ifdh cp rather than plain cp to copy files to and from the grid.
(I borrowed some of Robert's code to make several copy attempts from backup locations in case one doesn't work, but this shouldn't really be necessary.)

# copy the source files
SOURCEFILEDIR=/pnfs/annie/scratch/users/moflaher
SOURCEFILEZIP=wcsim.tar.gz
echo "searching for source files in ${SOURCEFILEDIR}/${SOURCEFILEZIP}" 
echo "ifdh ls ${SOURCEFILEDIR}"       
ifdh ls ${SOURCEFILEDIR}                                              # it can be useful for debugging to show what files were available in case the copy fails.
ifdh ls ${SOURCEFILEDIR}/${SOURCEFILEZIP} 1>/dev/null 2>&1            # this will have exit status 0 if the file exists, or something else if it doesn't
if [ $? -eq 0 ]; then                                                 # check the exit status and proceed with the copy if it exists
  echo "copying source files" 
  ifdh cp -D ${SOURCEFILEDIR}/${SOURCEFILEZIP} .                      # by using the -D flag for ifdh cp, the originating filename will be used
else                                                                  # if only an output directory is specified
  echo "source file zip not found in ${SOURCEFILEDIR}!" 
fi
if [ ! -f ${SOURCEFILEZIP} ]; then                                    # check if the file exists, and if not try alternative locations. (possibly overkill).
  echo "could not copy source file zip from ${SOURCEFILEDIR}, trying alternatives" 
  for TARLOC in /pnfs/annie/persistent/users/moflaher \
                /annie/app/users/moflaher/wcsim
  do
    echo "try copy from ${TARLOC}" 
    ifdh cp -D ${TARLOC}/${SOURCEFILEZIP} .
    if [ -f ${SOURCEFILEZIP} ]; then break; fi                        # stop trying further locations once a copy succeeds
  done
  if [ ! -f ${SOURCEFILEZIP} ]; then
    echo "source file zip not found in any accessible locations!!!" 
    exit 11
  fi
fi

3. Having copied in the source files, extract them from the archive and execute the usual build commands.

# extract and compile the application
echo "unzipping source files" 
tar zxvf ${SOURCEFILEZIP}

echo "compiling application" 
mkdir build
cd wcsim
make rootcint
make 
cp src/WCSimRootDict_rdict.pcm ./
cd ../build
cmake ../wcsim
make
rm libWCSimRootDict.rootmap
if [ ! -x ./WCSim ]; then                                             # check the application file exists and is executable
    if [ -a ./WCSim ]; then                                           # if the file isn't executable...
        chmod +x ./WCSim                                              # ...try to give it executable permissions
        hash -r                                                       # and update the $PATH of executables
    fi
fi
if [ ! -x ./WCSim ]; then
    echo "something failed in compilation?! WCSim not found!"         # bail if the compilation failed.
    exit 12
fi

4. Now that we've built the executable, copy in the files to process. The process below should become redundant once SAM projects are implemented. For the moment, the $PROCESS variable (which is automatically populated with the job number) is used to extract a unique file number from a list of files to be processed.

# copy the list of input files
INPUTFILELISTDIR=/pnfs/annie/scratch/users/moflaher
INPUTFILELISTNAME=filenums.txt
echo "searching for input file list in ${INPUTFILELISTDIR}/${INPUTFILELISTNAME}" 
echo "ifdh ls ${INPUTFILELISTDIR}" 
ifdh ls ${INPUTFILELISTDIR}
ifdh ls ${INPUTFILELISTDIR}/${INPUTFILELISTNAME} 1>/dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "copying input file list" 
    ifdh cp -D ${INPUTFILELISTDIR}/${INPUTFILELISTNAME} .   # list of input files
else
    echo "${INPUTFILELISTNAME} not found in ${INPUTFILELISTDIR}! Trying alternatives..." 
    for INLISTLOC in /pnfs/annie/persistent/users/moflaher \
                     /annie/app/users/moflaher/wcsim
    do
      echo "try copy from ${INLISTLOC}" 
      ifdh cp -D ${INLISTLOC}/${INPUTFILELISTNAME} .
      if [ -f ${INPUTFILELISTNAME} ]; then break; fi                  # stop trying further locations once a copy succeeds
    done
    if [ ! -f ${INPUTFILELISTNAME} ]; then
      echo "input file list not found in any accessible locations!!!" 
      exit 13
    fi
fi

# calculate the input file to use
let THECOUNTER=${PROCESS}+${PROCESSOFFSET}+1
THENUM=`sed -n "${THECOUNTER}p" ${INPUTFILELISTNAME}`
echo "this job has process ${PROCESS}, and will use file num ${THENUM}" 

# copy the input files
#DIRTDIR=/pnfs/annie/persistent/users/rhatcher/g4dirt
DIRTDIR=/pnfs/annie/persistent/users/moflaher/g4dirt
DIRTFILE=annie_tank_flux.${THENUM}.root
GENIEDIR=/pnfs/annie/persistent/users/rhatcher/genie
GENIEFILE=gntp.${THENUM}.ghep.root

echo "copying the input files ${DIRTDIR}/${DIRTFILE} and ${GENIEDIR}/${GENIEFILE}" 
ifdh cp -D ${DIRTDIR}/${DIRTFILE} .
ifdh cp -D ${GENIEDIR}/${GENIEFILE} .
if [ ! -f ${DIRTFILE} ]; then echo "dirt file not found!!!"; exit 14; fi
if [ ! -f ${GENIEFILE} ]; then echo "genie file not found!!!"; exit 15; fi

5. Finally, generate any files needed on the fly. For WCSim, the path to the directory containing the primary parent files is read from the primaries_directory.mac macro, so we populate this file with $PWD.

echo "writing primaries_directory.mac" 
echo "/mygen/neutrinosdirectory ${PWD}/gntp.*.ghep.root" >  primaries_directory.mac
echo "/mygen/primariesdirectory ${PWD}/annie_tank_flux.*.root" >>  primaries_directory.mac
echo "/run/beamOn 10000" >> WCSim.mac   # will end the run as rqd if there are fewer events in the input file

6. Now we're ready to run the executable! For reference, we print the start and end times to stdout so that they get included in the grid logs.

# run executable here
NOWS=`date "+%s"`
DATES=`date "+%Y-%m-%d %H:%M:%S"`
echo "checkpoint start @ ${DATES} s=${NOWS}" 
echo " " 

./WCSim WCSim.mac                                                     # execution of the actual program!

echo " " 
NOWF=`date "+%s"`
DATEF=`date "+%Y-%m-%d %H:%M:%S"`
let DS=${NOWF}-${NOWS}
echo "checkpoint finish @ ${DATEF} s=${NOWF}  ds=${DS}" 
echo " " 

7. The last step is to copy any output files to a suitable destination directory, then clean up the working directory.

OUTDIR=/pnfs/annie/persistent/users/moflaher/wcsim
echo "copying the output files to ${OUTDIR}" 

# copy back the output files
DATESTRING=$(date)      # contains a bunch of spaces, don't use in filenames
for file in wcsim_*; do
        tmp=${file%.*}  # strip .root extension
        OUTFILE=${tmp}.${THENUM}.root
        echo "moving ${file} to ${OUTFILE}" 
        mv ${file} ${OUTFILE}
        echo "copying ${OUTFILE} to ${OUTDIR}" 
        ifdh cp -D ${OUTFILE} ${OUTDIR}
        if [ $? -ne 0 ]; then echo "something went wrong with the copy?!"; fi
done

# clean things up
cd ..
rm -rf wcsim
rm -rf build
rm -rf ${SOURCEFILEZIP}

8. And that's the full grid wrapper! You can test this on an anniegpvm just by creating a test directory and running the script in there. Make sure to test it from a fresh bash shell, so that any missing environment variables will show up. Only the $PROCESS variable, if you use it, will need to be manually defined. It should work exactly as it will on the grid.
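For instance, a local test might look like the following minimal sketch (the test directory is an arbitrary choice; PROCESS is given a dummy value since the batch system normally provides it, and PROCESSOFFSET is set too because this particular grid.sh references it):

mkdir -p ~/test_grid && cd ~/test_grid                                 # any scratch directory will do
export PROCESS=0                                                       # normally injected by the batch system
export PROCESSOFFSET=0                                                 # also referenced by this grid.sh
bash /annie/app/users/moflaher/test_grid/grid.sh 2>&1 | tee test.log   # keep a copy of the output for inspection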

9. To submit this to run on the grid, you'll need to have set up fife_utils and obtained a valid Kerberos ticket using kinit. Then execute the appropriate submission command:

jobsub_submit -N ${NUMJOBS} --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC -M -G ${GROUP} file:///annie/app/users/moflaher/test_grid/grid.sh

where of course $NUMJOBS and $GROUP should be exported or given explicitly, and the file is your grid wrapper script. (Note the three slashes at the start of the path: file:// followed by the absolute path.)
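For example (the number of jobs is an illustrative value only):

export GROUP=annie
export NUMJOBS=10                                                     # e.g. one job per entry in filenums.txt
jobsub_submit -N ${NUMJOBS} --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC -M -G ${GROUP} \
    file:///annie/app/users/moflaher/test_grid/grid.sh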

  • To cancel jobs, if you decide you made an error, use jobsub_rm with the appropriate job id; e.g.
    jobsub_rm --jobid 2516.0@fife-jobsub-dev01.fnal.gov
  • If you find your jobs get held, it may be that you ran out of memory or your job took too long. See the list of jobsub_submit options for increasing memory or time requests, requesting specific node architectures, etc.; a sketch is given below.
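As a hedged sketch (flag names taken from the standard jobsub_submit options; confirm against jobsub_submit --help for your client version), a resubmission requesting more memory and a longer run time might look like:

# request 4 GB of memory and an 8 hour expected lifetime for each job
jobsub_submit -N ${NUMJOBS} -G ${GROUP} --memory=4000MB --expected-lifetime=8h \
    --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC \
    file:///annie/app/users/moflaher/test_grid/grid.sh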