Getting started on GPCF

To request an interactive account

Request an account with the Service Now form Request CMS/Minos/Other Account

If your experiment or project is not listed, use a Service Now - Scientific Computing Request to have it added.

Grid computing guidelines

  • Understand these guidelines before you register for access to the Grid.
  • Never read or write data files directly from grid worker nodes.
    • This will overload the file servers, denying service to everyone.
      • When this happens your jobs will be stopped and your grid access removed.
    • Copy input data files to local disk.
    • Write output files to local disk, then copy back to central storage.
    • Use jobsub -f ... -d ... to handle data file movement
      • This will invoke ifdh cp
        • which then invokes cpn or gridftp as appropriate.
  • Fermigrid is designed to serve compute-bound jobs.
    • The typical job reads about a GByte of data, produces little output, and takes a few hours to run.
    • Jobs that run under 15 minutes, or that are I/O-limited, will not run efficiently.
    • Jobs that run more than a day may have trouble completing due to scheduled maintenance or preemption from opportunistic resources.
  • Grid jobs run under a group account such as novaana.
    • To manage these files offline, you should first have run
      umask 0002
      
    • Or you should let jobsub copy the files back to your own account,
      jobsub -d ... --use_gftp
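
The local-disk I/O pattern above can be sketched as a minimal job script. All paths and the processing step are stand-ins so the sketch can run anywhere: mktemp directories replace central storage and the worker's scratch area, and plain cp replaces ifdh cp. It also sets umask 0002 as recommended for group accounts.

```shell
#!/bin/bash
set -e
# Sketch of the recommended worker-node I/O pattern: stage input to local
# disk, process locally, stage results back.  A real job would use
# `ifdh cp` (or `jobsub -f`/`-d`) instead of plain cp.
umask 0002                                  # keep new files group-writable

central=$(mktemp -d)                        # stand-in for /grid/data/<experiment>
echo "raw event data" > "$central/input.dat"

work=$(mktemp -d)                           # stand-in for worker-node local disk
cd "$work"

cp "$central/input.dat" input.dat           # real job: ifdh cp <src> <dst>
tr 'a-z' 'A-Z' < input.dat > output.dat     # stand-in for the real processing
cp output.dat "$central/output.dat"         # stage results back at the end

stat -c '%a' "$central/output.dat"          # GNU stat; prints 664 (group-writable)
cat "$central/output.dat"                   # prints RAW EVENT DATA
```

The point is that all reads and writes during the job hit local disk; central storage is touched exactly twice, once per direction.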
      

Batch : Getting Grid permissions set up (first time)

  • Log on to gpsn01 once to set up your grid proxies and registration.
    • This needs to be done only once; you should not need to log into gpsn01 again.
    • Do not use gpsn01 for interactive work.
  • Run kcroninit
    • This will prompt for your Kerberos password, then initialize files so that kcron will work in your cron job.
    • This needs to be done only once.
    • Verify by typing kcron, then klist -5 | grep Default
  • NOTE: request_robot_cert DOES NOT WORK since the last VOMS upgrade. Open a Service Desk ticket to FermiGrid requesting a robot cert for gpsn01 and VOMS membership in your experiment.
  • Execute the script /grid/fermiapp/common/tools/request_robot_cert
    • When prompted 'please enter [p/c/q]:', enter 'p' to proceed with registration.
    • Two manual steps must be done by two different departments to fully enable the robot certificate. This can take anywhere from under an hour to several days.
    • You should get email when the steps are complete.
    • If this takes too long, open a Service Desk ticket, selecting 'I'm having a problem with Scientific Computing'.
      • Select your experiment and the 'Batch Submission' category.
  • THE STEPS BELOW ARE STILL VALID SINCE THE VOMS UPGRADE
  • Establish a crontab on gpsn01 to keep your proxy alive.
    • Use your VO group name, such as minos, minerva, nova, mu2e, etc.
      GROUP=<your VO group name >
      echo "07          1-23/2 * * * /usr/krb5/bin/kcron  /scratch/grid/kproxy ${GROUP}"  | crontab
      
  • Once the registration is done, verify that your cron job is making a valid proxy, with
    /scratch/grid/kproxynew -i
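
For reference, the schedule in the cron entry above reads field by field as follows (a commented copy of the same entry, not a new command):

```shell
# Field-by-field reading of the kproxy cron entry (GROUP as set above):
#
#   07        minute 07
#   1-23/2    every second hour from 01 through 23 (01, 03, ..., 23)
#   *  *  *   every day of month, every month, every day of week
#
# i.e. the proxy is refreshed at 01:07, 03:07, ..., 23:07 each day:
# 07 1-23/2 * * * /usr/krb5/bin/kcron /scratch/grid/kproxy ${GROUP}
```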
    

Using the Grid resources at Fermilab

  • Setting up your environment:
  • source /grid/fermiapp/products/common/etc/setups.sh
    • Your experiment's framework setup may already have done this
    • If you already have UPS set up, you can get the common tools by doing
      • PRODUCTS=$PRODUCTS:/grid/fermiapp/products/common/db
  • setup jobsub_tools
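
The PRODUCTS step above is ordinary colon-separated path appending; a small sketch (the starting value /my/experiment/db is hypothetical, the common db path is from the text):

```shell
# Append the common-tools UPS database to an existing PRODUCTS path.
PRODUCTS=/my/experiment/db                        # hypothetical existing value
PRODUCTS=$PRODUCTS:/grid/fermiapp/products/common/db
export PRODUCTS
echo "$PRODUCTS"   # /my/experiment/db:/grid/fermiapp/products/common/db
```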

  • The JobSub Script : After sourcing the above file, the command jobsub can be used to submit to either the local condor cluster or the grid.
    • The '-h' option on this script lists the input flags and options.
  • Condor Logs/Output : The jobsub script and condor create at least 3 files: a .cmd file that Condor uses as input, a .err file capturing anything the job writes to stderr, and a .out file containing the job's output to stdout. The output directory is $CONDOR_TMP, which translates to /grid/data/(experiment)/condor-tmp/(username).
  • Killing Condor Jobs : To terminate a condor job, first use the condor_q command to find the Condor ID of the jobs you wish to terminate, then use condor_rm to remove them. Both commands are placed in your path when you run setup jobsub_tools.
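
The $CONDOR_TMP convention above can be spelled out with shell variables (experiment and username here are hypothetical placeholders):

```shell
# Where jobsub's condor files land for a given experiment and user.
EXPERIMENT=nova                 # hypothetical experiment name
ME=jdoe                         # hypothetical username
CONDOR_TMP=/grid/data/${EXPERIMENT}/condor-tmp/${ME}
echo "$CONDOR_TMP"              # /grid/data/nova/condor-tmp/jdoe
# the .cmd, .err, and .out files for each submitted job appear under here
```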

To remove a particular job use

% condor_rm <Job_ID>

To remove all of a user's jobs, use

% condor_rm <User_ID>