Note: This information is obsolete (as of 05/27/14). Please refer to: https://cdcvs.fnal.gov/redmine/projects/fife/wiki/Getting_Started_on_GPCF

Getting Started on FermiGrid

GPCF batch overview

The GPCF batch system is shared by the various experiments at Fermilab. Many of the following scripts depend on the environment variable $GROUP, which should be set automatically to your experiment when you submit batch jobs from an experiment-specific VM (mu2egpvm02, minervagpvm01, etc.). Currently supported values for $GROUP are mu2e, minerva, nova, lbne, uboone, gm2, lbnemars, and minos.
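
For example, you can check $GROUP from an interactive node before submitting, and set it by hand if it is missing or wrong (the value shown here is only an illustration; use your own experiment name):

  [myuserid@experimentgpvmnn]$ echo $GROUP
  nova
  [myuserid@experimentgpvmnn]$ export GROUP=nova    # only needed if the variable is not already set correctly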

GPCF local pool

There is a limited local Condor pool attached to GPCF. Jobs there run under your own account. Otherwise, all the FermiGrid guidelines apply. Submit to the local pool by omitting the '-g' from jobsub.

Grid computing guidelines

  • Understand these guidelines before you register for access to the Grid.
  • Never read or write NFS-mounted data files (/grid/data, /nova/data, etc.) directly from grid worker nodes.
    • This will overload the file servers, denying service to everyone.
      • When this happens your jobs will be stopped and your grid access removed.
    • Copy input data files to local disk.
    • Write output files to local disk, then copy them back to central storage (a minimal job-script sketch illustrating this pattern follows this list).
    • Use jobsub -g -f ... -d ... to handle data file movement
      • This will invoke ifdh cp
        • which then invokes a direct or gridftp copy as appropriate.
  • Fermigrid is designed to serve compute-bound jobs.
    • The typical job reads about a GByte of data, produces little output, and takes a few hours to run.
    • Jobs that run for under 15 minutes, or that are I/O-limited, will not run efficiently.
    • Jobs that run more than a day may have trouble completing due to scheduled maintenance or preemption from opportunistic resources.
  • Grid jobs run on the workers under a group account such as novaana.
    • To be able to manage the output files later from your own account, include the following in your submitted script:
      umask 0002
      
    • Or let jobsub copy the files back to your own account:
      jobsub -g -d ... --use_gftp
      
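As a concrete illustration of these guidelines, here is a minimal sketch of a job script that sets the umask and stages data with ifdh cp. The experiment, directory, and file names below (myexperiment, myuserid, myinput.root, myoutput.root, myapp) are placeholders, not real paths or programs:

  #!/bin/sh
  # minimal grid job sketch; all paths below are placeholders
  umask 0002                    # so the output files can be managed from your own account later

  # copy the input file from central storage to the local worker disk
  ifdh cp /grid/data/myexperiment/myinput.root ./myinput.root

  # run your application against the local copy, writing its output locally
  # myapp ./myinput.root ./myoutput.root

  # copy the result back to central storage when the job is done
  ifdh cp ./myoutput.root /grid/data/myexperiment/users/myuserid/myoutput.root

Alternatively, submitting with jobsub -g -f ... -d ... (as described above) lets jobsub handle this staging for you.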

Set up Grid permissions and proxies before you submit a job

You will log in to the Grid submission node gpsn01 only once, to set up the cron jobs that will keep your Grid proxies alive.
The proxies are based on a special 'kcron' version of your Kerberos principal, kept alive by a cron job on gpsn01.
Once this is set up, you should not log into gpsn01 again.

  1. Register with your Grid Virtual Organization
    • LBNE ONLY
      • Go to https://voms.fnal.gov:8443/voms/lbne/home/login.action and follow the instructions there
        • You will fill out a form, then receive an email to which you must respond.
        • You must have a Fermilab KCA cert loaded into your browser before you go to the above web site.
          • If you do not, the site will behave strangely and may appear to be broken.
          • More information about Fermilab KCA certs can be found at Fermilab Security KCA
      • Wait for notification by email that this has been completed, which usually takes one working day.
        Once this step is complete, proceed with step 2 below.
    • EVERYONE BUT LBNE
      You are registered in VOMS automatically when your GPCF interactive account is created.
  2. Log into gpsn01
    • ssh gpsn01
  3. Run kcroninit so that kcron will work
    • kcroninit
  4. Verify that kcron will work for you
    • kcron
      GPSN01 > kcron klist
      Ticket cache: FILE:/tmp/krb5cc_1060_jfbxR24461
      Default principal: kreymer/cron/gpsn01.fnal.gov@FNAL.GOV
      
      Valid starting     Expires            Service principal
      10/03/13 12:02:23  10/03/13 22:02:23  krbtgt/FNAL.GOV@FNAL.GOV
          renew until 10/06/13 00:02:23
      ...
      
  5. Create a crontab entry to run kproxy
    GPSN01 > CRONTAB="07 1-23/2 * * *  /scratch/grid/kproxy nova" 
    GPSN01 > printf "${CRONTAB}\n" | crontab
    GPSN01 > crontab -l
    
    • It is possible to create and use proxies for more than one experiment, since
      people often work on more than one at any given time. For example,
      GPSN01 > CRONTAB="07 1-23/2 * * *  /scratch/grid/kproxy lbne
      07 1-23/2 * * *  /scratch/grid/kproxy nova
      07 1-23/2 * * *  /scratch/grid/kproxy uboone
      07 1-23/2 * * *  /scratch/grid/kproxy gm2
      07 1-23/2 * * *  /scratch/grid/kproxy argoneut
      07 1-23/2 * * *  /scratch/grid/kproxy mu2e" 
      GPSN01 > printf "${CRONTAB}\n" | crontab
      GPSN01 > crontab -l
      
  6. Test kproxy manually on gpsn01, to be sure your registration is working.
    # create a nova proxy
    GPSN01 > /scratch/grid/kproxy nova
    
    # verify the nova proxy
    GPSN01 > /scratch/grid/kproxy -i
    lrwxrwxrwx 1 kreymer gpcf   51 Oct  3 12:23 /scratch/kreymer/grid/kreymer.nova.proxy -> /scratch/kreymer/grid/kreymer.nova.proxy.2013100312
    -rw------- 1 kreymer gpcf 7000 Sep 30 14:49 /scratch/kreymer/grid/kreymer.nova.proxy.2013093014
    ...
    /scratch/kreymer/grid/kreymer.nova.proxy
    attribute : /fermilab/nova/Role=Analysis/Capability=NULL
    Valid proxy expires in 39501 seconds (10 hours)
    Valid proxy expires at Fri Oct  4 00:23:03 CDT 2013
    

    If you are not registered in VOMS, follow these instructions.

Using the Grid resources at Fermilab

We use the jobsub command and related tools. A more complete description can be found at UsingJobSub.

Setting up your environment:

  • If you do not have any UPS products set up:
    • export GROUP=(one of mu2e, minerva, nova, lbne, uboone, or other experiment)
    • source /grid/fermiapp/products/common/etc/setups.sh
    • setup jobsub_tools
  • If you DO have other UPS products set up:
    • export PRODUCTS=$PRODUCTS:/grid/fermiapp/products/common/db
    • setup jobsub_tools
  • How do I know whether I already have a UPS product set up?
    • echo $PRODUCTS (from the unix prompt)
    • If it prints anything other than a null string, you already have a UPS products database set up. (A combined sketch of the two setup cases appears after this list.)
  • The JobSub Script: After sourcing the setup file above, the jobsub command can be used to submit to either the local Condor cluster or the grid. jobsub -h lists all the available command options.
  • BlueArc Shared Disk: A large disk pool is available from gpsn01, the local Condor worker nodes, and the grid worker nodes. For security reasons, the disk is mounted differently on the grid worker nodes than on the local nodes.
    It is very important that your hundreds of grid jobs do not all access the BlueArc disk at the same time. Use the 'ifdh cp' command (it works just like the unix cp command, except that transfers queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data onto and off of the BlueArc disks.
  • Condor Logs/Output: The jobsub script and Condor create at least 3 files: a .cmd script that Condor uses as input, a .err file where any errors go, and a .out file containing the job's output to stdout. The output directory is set by $CONDOR_TMP; its location is experiment-specific, so check the value of $CONDOR_TMP for your location.
  • Killing Condor Jobs: To terminate a Condor job, first use the condor_q command to find the Condor ID of the jobs you wish to terminate, then use condor_rm to remove them. Both of these commands are placed in your path when you run "setup jobsub_tools".
To remove a particular job use
[myuserid@experimentgpvmnn]$ condor_rm <Job_ID>

To kill all of a user's jobs, use
[myuserid@experimentgpvmnn]$ condor_rm <User_ID>
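
To recap the two environment-setup cases above, a minimal combined sketch might look like the following (the experiment name lbne is only an example; substitute your own):

if [ -z "$PRODUCTS" ]; then
    # no UPS products database set up yet
    export GROUP=lbne
    source /grid/fermiapp/products/common/etc/setups.sh
else
    # a UPS products database is already set up; just add the common one
    export PRODUCTS=$PRODUCTS:/grid/fermiapp/products/common/db
fi
setup jobsub_tools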

Interactive Example: Running 'Hello World' as an LBNE User

  • Prerequisites:
    • You have set up Grid permissions and proxies as documented above
  • ssh to lbnegpvm01 and set up the jobsub tools
    [myuserid@anymachine]$ ssh myuserid@lbnegpvm01
    [myuserid@lbnegpvm01]$ source /grid/fermiapp/products/common/etc/setups.sh
    [myuserid@lbnegpvm01]$ setup jobsub_tools
    
  • make a working directory to run your jobs from:
    [myuserid@lbnegpvm01]$ mkdir /grid/fermiapp/lbne/users/myuserid
    [myuserid@lbnegpvm01]$ cd /grid/fermiapp/lbne/users/myuserid
    
  • create a simple hello_world.sh script in that directory and make sure it is executable.
    [myuserid@lbnegpvm01]$ cat hello_world.sh 
    #!/bin/sh
    echo running on host `uname -a`
    echo running as user `whoami`
    echo OS version `cat /etc/redhat-release`
    echo sleeping for $1 seconds
    sleep $1
    [myuserid@lbnegpvm01]$ chmod +x hello_world.sh
    
  • run hello_world.sh on a local batch node:
    [myuserid@lbnegpvm01]$ jobsub hello_world.sh 120
    /grid/fermiapp/lbne/condor-tmp/myuserid/hello_world.sh_20101122_113521_1.cmd
    /grid/fermiapp/lbne/condor-exec/myuserid/hello_world.sh_20101122_113521_1_wrap.sh
    submitting....
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 1828.
    [myuserid@lbnegpvm01]$ 
    
  • Things to note about this example so far:
    • jobsub created a condor command file named hello_world.sh_(timestamp).cmd in the $CONDOR_TMP directory
    • jobsub wrapped hello_world.sh along with relevant condor information in a file named hello_world.sh_(timestamp)_wrap.sh in the $CONDOR_EXEC directory.
    • output from this job will go to $CONDOR_TMP by default. Let's look:
      [myuserid@lbnegpvm01]$ ls -la $CONDOR_TMP/hello_world.sh_20101122_113521_1*
      -rw-r--r-- 1 myuserid lbne 675 Nov 22 11:35 $CONDOR_TMP/hello_world.sh_20101122_113521_1.cmd
      -rw-r--r-- 1 myuserid gpcf   0 Nov 22 11:35 $CONDOR_TMP/hello_world.sh_20101122_113521_1.err
      -rw-r--r-- 1 myuserid gpcf 680 Nov 22 11:37 $CONDOR_TMP/hello_world.sh_20101122_113521_1.log
      -rw-r--r-- 1 myuserid gpcf 226 Nov 22 11:36 $CONDOR_TMP/hello_world.sh_20101122_113521_1.out
      
    • The .out file contains whatever the job sent to stdout. Obviously this is the first place to check to make sure your job ran as intended.
      [myuserid@lbnegpvm01]$ cat $CONDOR_TMP/hello_world.sh_20101122_113521_1.out
      running on host Linux gpwn002.fnal.gov 2.6.18-194.26.1.el5 #1 SMP 
      Tue Nov 9 12:46:16 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
      running as user myuserid
      OS version Scientific Linux SLF release 5.4 (Lederman)
      sleeping for 120 seconds
      
    • The .err file contains whatever the job sent to stderr. This is the next place to check if something went wrong.
    • The .log file contains a condor log and is sometimes useful for experts.
  • Now run this job on the grid instead of the local batch. Since you set up your proxy on gpsn01 earlier, it is a simple matter of adding one more parameter (-g for grid) to jobsub:
    [myuserid@lbnegpvm01]$ jobsub -g hello_world.sh 120
    
    /grid/fermiapp/lbne/condor-tmp/myuserid/hello_world.sh_20101122_115337_1.cmd
    /grid/fermiapp/lbne/condor-exec/myuserid/hello_world.sh_20101122_115337_1_wrap.sh
    submitting....
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 1829.
    
  • Check on the status of your submitted job
    [myuserid@lbnegpvm01]$ condor_q myuserid
    
    -- Submitter: gpsn01.fnal.gov : <131.225.67.70:60205> : gpsn01.fnal.gov
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    1829.0   myuserid           11/22 11:53   0+00:00:00 I  0   0.0  hello_world.sh_201
    
    1 jobs; 1 idle, 0 running, 0 held
    
    • I asked for only my jobs by feeding condor_q my user name as a parameter
    • The 'I' in the ST column means my job is idle in the queue. When it is matched with an available worker node, the state will change to 'R'.
  • Check again a minute or so later:
    [myuserid@lbnegpvm01]$ condor_q
    
    -- Submitter: gpsn01.fnal.gov : <131.225.67.70:60205> : gpsn01.fnal.gov
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    1829.0   myuserid       11/22 11:53   0+00:00:04 R  0   0.0  hello_world.sh_201
    1830.0   gfactory       11/22 11:54   0+00:00:00 I  0   0.0  glidein_startup.sh
    1831.0   gfactory       11/22 11:54   0+00:00:00 I  0   0.0  glidein_startup.sh
    1832.0   gfactory       11/22 11:54   0+00:00:00 I  0   0.0  glidein_startup.sh
    
    4 jobs; 3 idle, 1 running, 0 held
    [myuserid@lbnegpvm01]$ 
    
    • It's running now.

Switching Experiments, Running Jobs, Keeping Everything Straight.

  • Recall that it is possible to create and use proxies for more than one experiment, since people often work on more than one at any given time. Here is an excerpt from a crontab file that does this:
    07 1-23/2 * * *  /scratch/grid/kproxy lbne
    07 1-23/2 * * *  /scratch/grid/kproxy mu2e
    07 1-23/2 * * *  /scratch/grid/kproxy uboone
    07 1-23/2 * * *  /scratch/grid/kproxy gm2
    
  • the kproxy script creates a proxy in the /scratch/{user}/grid/ directory, with name {user}.{experiment}.proxy.{timestamp}
    • a soft link named {user}.{experiment}.proxy is created, pointing to that timestamped file
    • a soft link named {user}.proxy is created, pointing to the previous link
  • by default, the jobsub script uses the proxy at /scratch/{user}/grid/{user}.{$GROUP}.proxy. If your $GROUP environment variable is set to the wrong experiment, you could mess up the grid accounting that charges the various experiments for grid time, and conceivably lose your data when it comes back to a directory with incorrect gid permissions. Some users prefer to state explicitly which proxy they are using, via the -X509_USER_PROXY option, to keep this straight. For example, user dbox, working for the mu2e experiment, would do the following:
    source /grid/fermiapp/products/common/etc/setups.sh
    jobsub  -g -X509_USER_PROXY /scratch/dbox/grid/dbox.mu2e.proxy  /grid/fermiapp/mu2e/users/dbox/mu2ejob.sh 
    

Switching Roles, Running Jobs, Keeping Everything Straight.

  • By default, on FermiGrid your grid jobs run under the uid (experiment)ana, where 'ana' is short for the Analysis role. For example, minerva users run as minervaana and nova users run as novaana.
  • Sometimes it is desirable to run with the Production role, (experiment)pro, or the Calibration role, (experiment)cal. To do this:
    • Create a Service Desk ticket requesting that the new role be added to your cert. This request goes to the FermiGrid department.
    • Edit your crontab to create a proxy with the new role. The non-default role is controlled by the argument that follows the experiment name in the kproxy invocation. So, if my username is (for example) dbox and I am in the minerva experiment, I already have an entry in my crontab like this, which creates a proxy file named /scratch/dbox/grid/dbox.minerva.proxy:
      41 1-23/2 * * * /scratch/grid/kproxy minerva
      
    • If I want to run with the production role as user minervapro, I add this entry, which creates a file named /scratch/dbox/grid/dbox.minerva.Production.proxy:
      51 1-23/2 * * * /scratch/grid/kproxy minerva Production
      
    • Once this proxy file is created, use the --X509_USER_PROXY option with jobsub to run on the grid as user 'minervapro':
source /grid/fermiapp/products/common/etc/setups.sh
jobsub  -g -X509_USER_PROXY /scratch/dbox/grid/dbox.minerva.Production.proxy  /grid/fermiapp/minerva/users/dbox/my_minerva_job.sh 

Shared Accounts

See SHAREDJOBSUBSETUP