Getting Started on GPCF

GPCF batch overview

The GPCF batch system is shared by various experiments at Fermilab. Many scripts depend on the environment variable $GROUP, which should be set to the experiment you are working on when doing batch submissions, assuming you are working from an experiment-specific VM (mu2egpvm02, minervagpvm01, etc.). The jobsub_client package can also use the JOBSUB_GROUP environment variable if it is set. Currently supported values for the $GROUP (and JOBSUB_GROUP) variable are: argoneut, cdf, cdms, chips, coupp, des, darkside, dune, dzero, ebd, egp, gm2, icarus, lariat, lsst, minerva, miniboone, minos, mu2e, nova, numix, patriot, sbnd, seaquest, uboone (use uboone for MicroBooNE).
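As a rough sketch of the paragraph above, a login script could check the experiment name against the supported list before exporting it. The helper name is_supported_group and the variable SUPPORTED_GROUPS are hypothetical, made up for this illustration; they are not part of jobsub_client.

```shell
# Supported values, copied from the list above.
SUPPORTED_GROUPS="argoneut cdf cdms chips coupp des darkside dune dzero ebd egp gm2 icarus lariat lsst minerva miniboone minos mu2e nova numix patriot sbnd seaquest uboone"

# is_supported_group (hypothetical helper): succeeds if $1 is in the list.
is_supported_group() {
    case " $SUPPORTED_GROUPS " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example: choose an experiment, check it, then export both variables.
group="mu2e"
if is_supported_group "$group"; then
    export GROUP="$group"
    export JOBSUB_GROUP="$group"
else
    echo "unsupported group: $group" >&2
fi
```

Setting both variables is a convenience here, not a requirement; the text above only says that jobsub_client can use JOBSUB_GROUP if it is set.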

Grid computing guidelines.

  • Understand these guidelines before you register for access to the Grid.
  • You will be unable to read or write NFS-mounted disks (e.g. /grid/data, /experiment/data, /experiment/app, etc.) from grid worker nodes, even with ifdh cp.
  • You also cannot directly read from or write to /pnfs areas. Use ifdh cp for data transfer between these areas and the worker node.
    • Write output files to local disk, then copy back to central storage.
    • Use jobsub_submit -f ... -d ... to handle data file movement, or call "ifdh cp" within your job script.
      • The -f and -d options invoke "ifdh cp",
        • which then invokes a direct or gridftp copy as appropriate.
  • Fermigrid is designed to serve compute-bound jobs.
    • The typical job reads about a GByte of data, produces little output, and takes a few hours to run.
    • Jobs that run for under 15 minutes, or that are I/O limited, will not run efficiently.
    • Jobs that run more than a day may have trouble completing due to scheduled maintenance or preemption from opportunistic resources.
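The "write locally, then copy back" guideline above might look like the following inside a job script. This is only a sketch: stage_out, run_job, and COPY_CMD are hypothetical names, and it assumes the worker node sets _CONDOR_SCRATCH_DIR (the HTCondor per-job scratch directory). The only call taken from the text above is "ifdh cp" with a source and a destination.

```shell
# stage_out (hypothetical helper): copy one file to central storage.
# COPY_CMD defaults to "ifdh cp"; setting COPY_CMD=cp allows a dry run
# on a machine where ifdh is not installed.
stage_out() {
    # Unquoted on purpose so "ifdh cp" splits into command plus argument.
    ${COPY_CMD:-ifdh cp} "$1" "$2"
}

# run_job (hypothetical): do the work on local disk, then copy the result back.
run_job() {
    dest="$1"
    # Assumption: _CONDOR_SCRATCH_DIR is set on the worker node.
    # Fall back to TMPDIR or /tmp when testing elsewhere.
    workdir="${_CONDOR_SCRATCH_DIR:-${TMPDIR:-/tmp}}"
    out="$workdir/result.$$.txt"
    echo "processed on $(uname -n)" > "$out" || return 1
    stage_out "$out" "$dest" || return 1
    rm -f "$out"
}
```

In a real job the destination would be a central storage path such as a /pnfs area, for example run_job /pnfs/(experiment)/scratch/users/(username)/result.txt, where the path layout is an illustration rather than a documented convention.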

Set up Grid permissions and proxies before you submit a job

  1. Register with your Grid Virtual Organization

You are registered in VOMS automatically when your GPCF interactive account is created. If you don't have a GPCF account, or are requesting access to a new experiment, please fill out the Affiliation\Experiment Computing Account Request form in the Service Catalog.
If you are not registered in VOMS, please open a service desk ticket at https://fermi.service-now.com/ asking to add your Fermilab User ID to the experiment's VO. (If you are a member of DUNE, follow the separate instructions above.)

Using the Grid resources at Fermilab

We use the jobsub command and related tools. See a more complete description here:
https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki/Using_the_Client

Setting up your environment :

  • IF you do not have any UPS products set up:
    • export GROUP=<experiment> (e.g. mu2e, minerva, nova, dune, uboone, or some other experiment. See above for the full list.)
    • source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup
    • setup jobsub_client
  • IF you DO have other UPS products set up:
    • export PRODUCTS=$PRODUCTS:/cvmfs/fermilab.opensciencegrid.org/products/common/etc/db
    • setup jobsub_client
  • HOW DO I KNOW IF I ALREADY HAVE A UPS PRODUCT SET UP?
    • echo $PRODUCTS (from the unix prompt)
    • If it's anything other than a null string, you already have a UPS products database set up.
  • The JobSub Script : After sourcing the above file and setup jobsub_client, the jobsub_submit command can be used to submit to either the local condor cluster or the grid. jobsub_submit -G (your experiment) -h will list all the possible command options.
  • Condor Logs/Output : The jobsub script and condor create at least 3 files: a cmd file that condor uses as input, an err file containing anything the job wrote to stderr, and an out file containing the job's output to stdout. These log files reside on the jobsub server and can be copied to the node you're working on via the jobsub_fetchlog command. jobsub_fetchlog -h shows the options and usage.
  • Killing Condor Jobs : To terminate a condor job, first use the jobsub_fetchlog --list-sandboxes command to find the Condor ID of the jobs you wish to terminate. Then use jobsub_rm to remove them. If you already know the job ID, you can skip the --list-sandboxes step and run jobsub_rm directly. Both of these commands are placed in your path when you run "setup jobsub_client".
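The first two cases in the list above (no UPS products set up, or some already set up) differ only in the first command. A small sketch of that decision follows; jobsub_setup_commands and CVMFS_COMMON are hypothetical names. The function only prints the commands from the list above, so it can be inspected on any machine.

```shell
CVMFS_COMMON=/cvmfs/fermilab.opensciencegrid.org/products/common

# jobsub_setup_commands (hypothetical helper): print the setup commands that
# apply, depending on whether a UPS products database is already configured.
jobsub_setup_commands() {
    if [ -z "${PRODUCTS:-}" ]; then
        # No UPS products set up yet: source the common setup file.
        echo "source $CVMFS_COMMON/etc/setup"
    else
        # UPS already set up: append the common products database.
        echo "export PRODUCTS=\$PRODUCTS:$CVMFS_COMMON/etc/db"
    fi
    echo "setup jobsub_client"
}

jobsub_setup_commands
```

On a GPCF node the printed commands can be typed at the prompt, or applied in one step from bash with eval "$(jobsub_setup_commands)".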

To remove a particular job use

[myuserid@experimentgpvmnn]$ jobsub_rm -G (your experiment) --jobid=<Job_ID>

For an example of a job ID, see the "JobsubJobId of first job" line in the Hello World example below.

To kill all of a user's jobs, use
[myuserid@experimentgpvmnn]$ jobsub_rm -G (your experiment) --user=<username>
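Because a mistyped job ID is easy to miss, a script may want to check the ID before building the jobsub_rm command. The sketch below assumes job IDs have the form cluster.process@server, as in the Hello World output below; is_jobsub_jobid and rm_command are hypothetical helpers, and the function prints the command rather than running it.

```shell
# is_jobsub_jobid (hypothetical helper): succeed if $1 looks like
# <cluster>.<process>@<server>, e.g. 679411.0@fifebatch1.fnal.gov.
is_jobsub_jobid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+@[A-Za-z0-9.-]+$'
}

# rm_command (hypothetical helper): print the jobsub_rm command for one job,
# or report an error if the ID does not have the expected form.
rm_command() {
    group="$1"
    jobid="$2"
    if is_jobsub_jobid "$jobid"; then
        echo "jobsub_rm -G $group --jobid=$jobid"
    else
        echo "not a job ID: $jobid" >&2
        return 1
    fi
}

rm_command dune 679411.0@fifebatch1.fnal.gov
```

The pattern is deliberately strict and is an assumption about the ID format, not a rule taken from the jobsub documentation.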

Interactive Example: Running 'Hello World'

  • Prerequisites:
    • You have set up Grid permissions and proxies as documented above
    • You are able to log in to one of your experiment's development nodes. These nodes have names like (EXPERIMENT)gpvm(NUMBER), such as novagpvm01 or minervagpvm03
  • Set up jobsub_client:
    • source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup
    • setup jobsub_client
Create a simple Hello World script in your directory and make it executable. An example follows:
[myuserid@lbnegpvm01]$ cat hello_world.sh 
#!/bin/sh
echo running on host `uname -a`
echo running as user `whoami`
echo OS version `cat /etc/redhat-release`
echo sleeping for $1 seconds
sleep $1
[myuserid@lbnegpvm01]$ chmod +x hello_world.sh
  • Run hello_world.sh on a local batch node:
    [myuserid@lbnegpvm01]$ jobsub_submit -G dune --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --OS=SL6,SL7 --role=Analysis file://hello_world.sh 120
    /fife/local/scratch/uploads/dune/kherner/2015-02-17_175758.355795_238
    
    /fife/local/scratch/uploads/dune/kherner/2015-02-17_175758.355795_238/hello_world.sh_20150217_175758_23893_0_1.cmd
    
    submitting....
    
    Submitting job(s).
    
    1 job(s) submitted to cluster 679411.
    
    JobsubJobId of first job: 679411.0@fifebatch1.fnal.gov
    
    Use job id 679411.0@fifebatch1.fnal.gov to retrieve output
    
    JOBSUB SERVER CONTACTED     : https://fifebatch.fnal.gov:8443
    JOBSUB SERVER RESPONDED     : UNKNOWN
    JOBSUB SERVER RESPONSE CODE : 200 (Success)
    JOBSUB SERVER SERVICED IN   : 0.712878227234 sec
    JOBSUB CLIENT FQDN          : minosgpvm01.fnal.gov
    JOBSUB CLIENT SERVICED TIME : 17/Feb/2015 17:57:58
    
    
  • Things to note about this example so far:
    • jobsub created a condor command file named hello_world.sh_(timestamp).cmd
    • jobsub wrapped hello_world.sh along with relevant condor information in a file named hello_world.sh_(timestamp)_wrap.sh.
    • The .err file contains whatever the job sent to stderr. This is the next place to check if something went wrong.
    • The .log file contains a condor log and is sometimes useful for experts.
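Later commands need the job ID that jobsub_submit prints, so it can be handy to capture it. The sketch below pulls the ID out of the "JobsubJobId of first job:" line; extract_jobid is a hypothetical helper, and the sample text is copied from the submission output above.

```shell
# extract_jobid (hypothetical helper): read jobsub_submit output on stdin
# and print the ID from the "JobsubJobId of first job:" line.
extract_jobid() {
    sed -n 's/^[[:space:]]*JobsubJobId of first job:[[:space:]]*//p' | head -n 1
}

# Sample lines copied from the submission output above.
sample_output='Submitting job(s).
1 job(s) submitted to cluster 679411.
JobsubJobId of first job: 679411.0@fifebatch1.fnal.gov
Use job id 679411.0@fifebatch1.fnal.gov to retrieve output'

jobid=$(printf '%s\n' "$sample_output" | extract_jobid)
echo "$jobid"
```

In a real session the same function could read a saved copy of the submission output, for example extract_jobid < submit.log, where submit.log is an illustrative file name.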

More advanced examples

You can find some examples of running more complicated scripts that behave like a typical analysis workflow in this talk: https://fermipoint.fnal.gov/project/FIFE/Shared%20Documents/FIFE_Jobsub_tutorial.pdf

Switching Experiments

Simply change your experiment with the -G option to jobsub_submit (e.g. jobsub_submit -G nova instead of jobsub_submit -G dune).

Switching Roles, Running Jobs, Keeping Everything Straight.

  • Sometimes it is desirable to run with the Production role (experiment)pro or the Calibration role (experiment)cal. To do this:
    • Create a Service Desk ticket requesting that the new role be added to your cert.
    • Add "--role=Production" to your jobsub_submit command, for example:
source  /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups
setup jobsub_client
jobsub_submit -G minerva --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC,OFFSITE --OS=SL7 --role=Production file:///grid/fermiapp/minerva/users/dbox/my_minerva_job.sh 

Note that when fetching logs for, or removing, jobs submitted with a non-standard role, you must add the same role option (e.g. "--role=Production") to those commands as well.
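One way to keep the role consistent across commands is to store it once and add it to every jobsub command line. The following is a minimal sketch; with_role and JOB_ROLE are hypothetical names, and the function only prints the resulting command. The role option is placed directly after the command name so that it can never land after a file:// executable and be passed to the job script as an argument.

```shell
# with_role (hypothetical helper): print a jobsub command line with the
# role option inserted right after the command name.
# Usage: with_role ROLE COMMAND [ARGS...]
with_role() {
    role="$1"
    cmd="$2"
    shift 2
    echo "$cmd --role=$role $*"
}

JOB_ROLE=Production
with_role "$JOB_ROLE" jobsub_rm -G minerva --jobid=679411.0@fifebatch1.fnal.gov
with_role "$JOB_ROLE" jobsub_fetchlog -G minerva --list-sandboxes
```

The job ID here is reused from the Hello World example purely as a placeholder.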

Shared Accounts

See SHAREDJOBSUBSETUP