Getting Started on GPCF

GPCF batch overview

The GPCF batch system is shared by various experiments at Fermilab. Many scripts depend on an environment variable, $GROUP, which should be set to the experiment you are working on when doing batch submissions, assuming you are working from an experiment-specific VM (mu2egpvm02, minervagpvm01, etc.). The jobsub_client package can also use the JOBSUB_GROUP environment variable if it is set. Currently supported values for the $GROUP (and JOBSUB_GROUP) variable are argoneut, cdf, cdms, chips, coupp, des, darkside, dzero, gm2, lariat, lar1nd, lbne, lsst, marsgm2, marslbne, marsmu2e, minerva, miniboone, minos, mu2e, nova, numix, patriot, seaquest, uboone (use uboone for MicroBooNE).
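
For example, a NOvA user might set (substitute your own experiment from the list above):

export GROUP=nova
export JOBSUB_GROUP=$GROUP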

Grid computing guidelines

  • Understand these guidelines before you register for access to the Grid.
  • Never read or write NFS-mounted data files (/grid/data, /nova/data, etc.) directly from grid worker nodes.
    • This will overload the file servers, denying service to everyone.
      • When this happens your jobs will be stopped and your grid access removed.
  • Follow these practices instead for reading/writing NFS-mounted data:
    • Copy input data files to local disk on the worker node.
    • Write output files to local disk, then copy back to central storage.
    • Use jobsub_submit -f ... -d ... to handle data file movement, or use "ifdh cp" within your job script (see the sketch after this list).
      • The -f and -d options will invoke "ifdh cp" for you,
        • which then invokes a direct or gridftp copy as appropriate.
  • Fermigrid is designed to serve compute-bound jobs.
    • The typical job reads about a GByte of data, produces little output, and takes a few hours to run.
    • Jobs that run for under 15 minutes, or that are I/O limited, will not run efficiently.
    • Jobs that run more than a day may have trouble completing due to scheduled maintenance or preemption from opportunistic resources.
  • Grid jobs run on the workers under a group account such as novaana.
    • Output files created on the worker will be owned by that group account. In order to manage these files later from your own account, include the following command at the beginning of your submitted script (before you create output files):
      umask 0002
      
    • Or you can let jobsub copy the files back to your own account:
      jobsub_submit -d ... --use_gftp
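
A minimal sketch of a job script that follows these guidelines is shown below; the input and output paths are hypothetical placeholders and should be replaced with your experiment's storage areas:

#!/bin/sh
# make output files manageable from your own account later
umask 0002
# stage the input file from central storage to the local worker disk (hypothetical path)
ifdh cp /nova/data/users/myuserid/input.root ./input.root
# ... run your processing here, reading ./input.root and writing ./output.root locally ...
# copy the result back to central storage (hypothetical path)
ifdh cp ./output.root /nova/data/users/myuserid/output.root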
      

Set up Grid permissions and proxies before you submit a job

  1. Register with your Grid Virtual Organization

Using the Grid resources at Fermilab

We use the jobsub command and related tools. A more complete description is available here:
https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki/Using_the_Client

Setting up your environment:

  • IF you do not have any UPS products set up:
    • export GROUP=(one of mu2e, minerva, nova, lbne, uboone, or some other experiment. See above for the full list.)
    • source /grid/fermiapp/products/common/etc/setups.sh
    • setup jobsub_client
  • IF you DO have other UPS products set up:
    • export PRODUCTS=$PRODUCTS:/grid/fermiapp/products/common/db
    • setup jobsub_client
  • HOW DO I KNOW IF I ALREADY HAVE A UPS PRODUCT SET UP?
    • echo $PRODUCTS (from the unix prompt)
    • If it prints anything other than an empty string, you already have a UPS products database set up.
  • The JobSub Script: After sourcing the setup script and running setup jobsub_client as above, the jobsub_submit command can be used to submit to either the local condor cluster or the grid. jobsub_submit -G (your experiment) -h will list all the possible command options (see the sketch after this list).
  • BlueArc Shared Disk: A large disk pool is available from the local condor worker nodes and the grid worker nodes. The disk is mounted differently on the grid worker nodes than on the local nodes for security reasons.
    It is very important not to have hundreds of your grid jobs all accessing the BlueArc disk at the same time. Use the 'ifdh cp' commands (just like the unix cp command, except they queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data on to and off of the BlueArc disks.
  • Condor Logs/Output: The jobsub script and condor create at least 3 files: a .cmd file that condor uses as input, a .err file containing anything the job wrote to stderr, and a .out file containing the job's output to stdout. These log files reside on the jobsub server and can be copied to the node you're working on via the jobsub_fetchlog command. jobsub_fetchlog -h shows the options and usage.
  • Killing Condor Jobs: To terminate a condor job, first use the jobsub_fetchlog --list-sandboxes command to find the Condor ID of the jobs you wish to terminate. Then use jobsub_rm to remove them. Both of these commands are placed in your path when you run "setup jobsub_client".
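
A minimal end-to-end sketch, assuming a NOvA user on a gpvm node with no UPS products already set up (substitute your own experiment):

export GROUP=nova
source /grid/fermiapp/products/common/etc/setups.sh
setup jobsub_client
jobsub_submit -G nova -h    # list all jobsub_submit options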

To remove a particular job use

[myuserid@experimentgpvmnn]$ jobsub_rm -G (your experiment) <Job_ID>

An example Job_ID is 679411.0@fifebatch1.fnal.gov, which you can see in the Hello World example below.

To remove all of a user's jobs, use
[myuserid@experimentgpvmnn]$ jobsub_rm -G (your experiment) <User_ID>

Interactive Example: Running 'Hello World'

  • Prerequisites:
    • You have set up Grid permissions and proxies as documented above
    • You are able to log in to one of your experiment's development nodes. These nodes have names like (EXPERIMENT)gpvm(NUMBER), such as novagpvm01 or minervagpvm03.
  • Set up jobsub_client:
    • source /grid/fermiapp/products/common/etc/setups.sh
    • setup jobsub_client
You can create a simple Hello World script in your directory and make sure it is executable. An example follows:
[myuserid@lbnegpvm01]$ cat hello_world.sh 
#!/bin/sh
echo running on host `uname -a`
echo running as user `whoami`
echo OS version `cat /etc/redhat-release`
echo sleeping for $1 seconds
sleep $1
[myuserid@lbnegpvm01]$ chmod +x hello_world.sh
  • run hello_world.sh on a local batch node:
    [myuserid@lbnegpvm01]$ jobsub_submit -G lbne --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC --OS=SL6 --role=Analysis file://hello_world.sh 120
    /fife/local/scratch/uploads/lbne/kherner/2015-02-17_175758.355795_238
    
    /fife/local/scratch/uploads/lbne/kherner/2015-02-17_175758.355795_238/hello_world.sh_20150217_175758_23893_0_1.cmd
    
    submitting....
    
    Submitting job(s).
    
    1 job(s) submitted to cluster 679411.
    
    JobsubJobId of first job: 679411.0@fifebatch1.fnal.gov
    
    Use job id 679411.0@fifebatch1.fnal.gov to retrieve output
    
    JOBSUB SERVER CONTACTED     : https://fifebatch.fnal.gov:8443
    JOBSUB SERVER RESPONDED     : UNKNOWN
    JOBSUB SERVER RESPONSE CODE : 200 (Success)
    JOBSUB SERVER SERVICED IN   : 0.712878227234 sec
    JOBSUB CLIENT FQDN          : minosgpvm01.fnal.gov
    JOBSUB CLIENT SERVICED TIME : 17/Feb/2015 17:57:58
    
    
  • Things to note about this example so far:
    • jobsub created a condor command file named hello_world.sh_(timestamp).cmd
    • jobsub wrapped hello_world.sh along with relevant condor information in a file named hello_world.sh_(timestamp)_wrap.sh.
    • The .err file contains whatever the job sent to stderr. This is the next place to check if something went wrong.
    • The .log file contains a condor log and is sometimes useful for experts. All of these files reside on the jobsub server and can be retrieved with jobsub_fetchlog, as in the sketch below.
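
A sketch of retrieving those files for the job above, assuming the --jobid option of jobsub_fetchlog (jobsub_fetchlog -h lists the exact options):

[myuserid@lbnegpvm01]$ jobsub_fetchlog -G lbne --jobid=679411.0@fifebatch1.fnal.gov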

More advanced examples

You can find some examples of running more complicated scripts that behave like a typical analysis workflow in this talk: https://fermipoint.fnal.gov/project/FIFE/Shared%20Documents/FIFE_Jobsub_tutorial.pdf

Switching Experiments

Simply change your experiment with the -G option to jobsub_submit (e.g. jobsub_submit -G nova instead of jobsub_submit -G lbne).

Switching Roles, Running Jobs, Keeping Everything Straight

  • By default, on fermigrid your grid jobs run under the uid (Experiment)ana; 'ana' is short for the 'Analysis' role. For example, minerva users run as minervaana and nova users run as novaana.
  • Sometimes it is desirable to run with the Production role (experiment)pro or the Calibration role (experiment)cal. The way this is done is:
    • Create a servicedesk ticket requesting that the new role be added to your certificate. This request goes to the Fermigrid department.
    • Add "--role=Production" to your jobsub_submit command, for example:
source /grid/fermiapp/products/common/etc/setups.sh
setup jobsub_client
jobsub_submit -G minerva --resource-provides=usage_model=DEDICATED,OPPORTUNISTIC --OS=SL5,SL6 --role=Production file:///grid/fermiapp/minerva/users/dbox/my_minerva_job.sh 

Note that when doing any log fetching or removal of jobs submitted with a non-standard role it is necessary to add the "--role=Production" option to those commands also.
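
For example (a sketch; substitute your experiment and the Job_ID, and check jobsub_rm -h and jobsub_fetchlog -h for the exact options):

jobsub_rm -G minerva --role=Production <Job_ID>
jobsub_fetchlog -G minerva --role=Production --jobid=<Job_ID>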

Shared Accounts

See SHAREDJOBSUBSETUP