Getting started on gpsn01

*MINERVA USERS: TO GET AN ACCOUNT ON THE MINERVA CLUSTER YOU NEED TO
  • SEND AN EMAIL TO REQUESTING ACCESS,
  • HAVE AN FNALU ACCOUNT FIRST, AND
  • INCLUDE YOUR KERBEROS NAME SO THEY KNOW WHAT ACCOUNT TO ADD.*

Getting Grid permissions set up (first time)

NOTE - if you have already done this on if01, you need to do it again on gpsn01. Everyone needs to do this.

  • Log on to gpsn01 if you are planning to submit MINERVA or NOvA grid jobs. DO NOT USE THESE NODES FOR INTERACTIVE WORK
  • Execute the script /grid/fermiapp/common/tools/request_robot_cert . When prompted 'please enter [p/c/q]:', enter 'p' to proceed with registration. At this point, two manual steps have to be done by two different departments to fully enable the robot certificate you are requesting; this can take anywhere from less than an hour to several days. (A sketch of the full first-time sequence appears at the end of this section.)

.....

You should get email when the steps are complete. If it is taking too long, send email to either or inquiring as to what is going on.
  • A way to verify that all parties above have done their job correctly is to execute the script
    /scratch/grid/kproxyv minerva
    

    which is the verbose sibling of /scratch/grid/kproxy. Both kproxy and kproxyv create a grid proxy, which is a file granting you access to the grid until it expires. kproxyv spits out a lot of output; somewhere in the middle will be these two warnings:
    
    Warning: your certificate and proxy will expire (some date here)
    which is within the requested lifetime of the proxy
    WARNING: Unable to verify signature! Server certificate possibly not installed.
    Error: Cannot verify AC signature!
    

The first warning tells you what you are trying to find out: your certificate and proxy are good until (some date) and you can submit to the grid until they expire. The second warning can be ignored. If you get some other result, send email to the above mentioned addresses.
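Putting the pieces together, a sketch of the full first-time sequence might look like the following (the '...' stands for output not reproduced here, and the exact prompts and dates will differ):

$ /grid/fermiapp/common/tools/request_robot_cert
  ...
  please enter [p/c/q]: p              (answer 'p' to proceed with registration)

# wait for the confirmation email (less than an hour up to several days), then verify:
$ /scratch/grid/kproxyv minerva
  ...
  Warning: your certificate and proxy will expire (some date here)
  which is within the requested lifetime of the proxy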

Getting Grid permissions (every other time)

  • You could remember to run kproxy by hand before submitting to the grid, but if your jobs are still running when the proxy expires they will almost certainly fail. To set up a cron job that keeps your proxy unexpired, do the following (a sketch of the whole sequence is shown below):
  • Make sure you have a valid Kerberos ticket (klist will tell you)
  • Run the kcroninit script (usually at /usr/krb5/bin/kcroninit) and follow the instructions
  • Now run crontab with the -e option and add the following line to your cron list:
07          1-23/2 * * * /usr/krb5/bin/kcron  /scratch/grid/kproxy minerva

Note again the use of kproxy; if you put kproxyv here you will get annoying mail every time the cron job runs.
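Put together, the whole sequence might look like this (the crontab entry is the one shown above; it refreshes the proxy at minute 07 of every other hour, i.e. 1:07, 3:07, ..., 23:07):

$ klist                       # confirm you have a valid Kerberos ticket
$ /usr/krb5/bin/kcroninit     # follow the prompts to enable kcron for your account
$ crontab -e                  # add the kproxy line shown above, then save and exit
$ crontab -l                  # verify that the entry was saved
07 1-23/2 * * * /usr/krb5/bin/kcron  /scratch/grid/kproxy minerva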

Using the Grid resources at Fermilab

(the following instructions are under construction; please ask for now)

  • Setting up your environment : source the file /grid/fermiapp/(experiment)/condor-scripts/setup_(experiment)_condor.[c]sh

For example:

gpsn01(NOvA): (bash) : $ . /grid/fermiapp/nova/condor-scripts/setup_nova_condor.sh
            (csh)  : $ source /grid/fermiapp/nova/condor-scripts/setup_nova_condor.csh
gpsn01(minerva): (bash) : $ . /grid/fermiapp/minerva/condor-scripts/setup_minerva_condor.sh
               (csh)  : $ source /grid/fermiapp/minerva/condor-scripts/setup_minerva_condor.csh

  • The JobSub Script : After sourcing the above file, the command (experiment)_jobsub can be used to submit to either the local condor cluster or the grid. This is modeled after MINOS' example; see 'Condor at MINOS' http://www-numi.fnal.gov/condor/index.html for the general flavor. The 'minos_jobsub' script becomes 'nova_jobsub' (NOvA) or 'minerva_jobsub' (MINERVA) on gpsn01. The '-h' option on this script lists all the possible input flags and options. (An end-to-end sketch appears at the end of this section.)
  • BlueArc Shared Disk : A large disk pool is available on the interactive machines, gpsn01, the local condor worker nodes and the grid worker nodes. The disk is mounted differently on the grid worker nodes than on the local nodes for security reasons.
    It is very important not to have hundreds of grid jobs all accessing the BlueArc disk at the same time. Use the MVN and CPN commands (just like the unix mv and cp commands, except that they queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data onto and off of the BlueArc disks.
  • Condor Logs/Output : The jobsub script and condor create at least 3 files: a cmd file that condor uses as input, an err file where any errors go, and an out file containing the job's output to stdout. The output directory is $CONDOR_TMP, which translates to /grid/data/(experiment)/condor-tmp/(username).
  • Killing Condor Jobs : To terminate a condor job, first use the condor_q command to find the Condor ID of the jobs you wish to terminate, then use condor_rm to remove them. Both of these commands are placed in your path when you run setup_nova_condor.sh (or setup_minerva_condor.sh).

To remove a particular job use

% condor_rm <Job_ID>

To remove all of a user's jobs, use

% condor_rm <User_ID>
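Putting it all together, a hypothetical MINERVA session might look like the following ('my_job.sh' and the Condor ID 12345 are made-up examples; run (experiment)_jobsub -h to see the real flags and options):

$ . /grid/fermiapp/minerva/condor-scripts/setup_minerva_condor.sh
$ minerva_jobsub -h                 # list all of the supported flags and options
$ minerva_jobsub my_job.sh          # hypothetical submission of a user script
$ condor_q                          # monitor the queue and note the Condor IDs
$ ls $CONDOR_TMP                    # the cmd, err and out files land here
$ condor_rm 12345                   # remove one job by its (made-up) Condor ID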