Getting started on gpsn01¶
- The General Physics Computing Facility (GPCF), which includes gpsn01, is replacing the if01 cluster.
- MINERVA uses gpsn01 for batch work and minervagpvm01-5 for interactive work; NOvA uses gpsn01.
- Send an email to MINERVA-COMPUTING@FNAL.GOV requesting access.
- You have to have an FNALU account first, and
- you need to include your kerberos name so they know what account to add.
- Send mail including your FNALU user name to either email@example.com or firstname.lastname@example.org, as appropriate for your experiment. For LBNE, MU2E, and all other experiments, open a Service Desk ticket at https://fermi.service-now.com/com.glideapp.servicecatalog_cat_item_view.do?sysparm_id=bf82e793a9b3dc008638daab111d544d
- MINERVA users should also send mail to email@example.com to request CVS access.
Getting Grid permissions set up (first time)¶
NOTE - if you have already done this on if01, you need to do it again on gpsn01. Everyone needs to do this.
- Log on to gpsn01 if you are planning to submit MINERVA or NOvA grid jobs. Do not use these nodes for interactive work.
- Execute the script /grid/fermiapp/common/tools/request_robot_cert. When prompted 'please enter [p/c/q]:', enter 'p' to proceed with registration. At this point, two manual steps have to be done by two different departments to fully enable the robot certificate you are requesting; this can take anywhere from less than an hour to several days.
You should get email when the steps are complete. If it is taking too long, send email to either firstname.lastname@example.org or email@example.com asking what is going on.
- A way to verify that all parties above have done their job correctly is to execute the script /scratch/grid/kproxyv, which is the verbose sibling of /scratch/grid/kproxy. Both kproxy and kproxyv create a grid proxy, which is a file granting you access to the grid until it expires. kproxyv spits out a lot of output; somewhere in the middle will be these two warnings:
Warning: your certificate and proxy will expire (some date here) which is within the requested lifetime of the proxy
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot verify AC signature!
The first warning tells you what you are trying to find out: your certificate and proxy are good until (some date), and you can submit to the grid until they expire. The second warning (and the error that follows it) can be ignored. If you get some other result, send email to the addresses mentioned above.
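Since only the first warning matters, the date can be pulled out of the kproxyv output with grep. The snippet below runs against a stand-in sample (assumed to match the warnings quoted above) so it is self-contained; on gpsn01 you would pipe the real script instead.

```shell
# Stand-in for the output of /scratch/grid/kproxyv (assumed format,
# matching the warnings quoted above).
sample_output='Warning: your certificate and proxy will expire Tue Jun  5 09:00:00 2012
WARNING: Unable to verify signature! Server certificate possibly not installed.
Error: Cannot verify AC signature!'

# The first warning carries the expiry date you care about; the rest is noise.
expiry_line=$(printf '%s\n' "$sample_output" | grep '^Warning: your certificate')
echo "$expiry_line"
```

On the cluster the equivalent would be `/scratch/grid/kproxyv 2>&1 | grep '^Warning: your certificate'`.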
Getting Grid permissions (every other time)¶
- You could remember to run kproxy before submitting to the grid, but if your jobs are still running when the proxy expires, they will almost certainly fail. To set up a cron job that keeps your proxy unexpired, do the following:
- Make sure you have a valid kerberos ticket (klist will tell you)
- Run the kcroninit script (usually at /usr/krb5/bin/kcroninit) and follow the instructions
- Now run crontab with the -e option and add the following line to your cron list:
07 1-23/2 * * * /usr/krb5/bin/kcron /scratch/grid/kproxy minerva
Note again the use of kproxy; if you put kproxyv here, you will get annoying mail every time the cron job runs.
Using the Grid resources at Fermilab¶
(the following instructions are under construction; please ask for now)
- Setting up your environment : source the file /grid/fermiapp/(experiment)/condor-scripts/setup_(experiment)_condor.[c]sh
gpsn01 (NOvA):
(bash): $ . /grid/fermiapp/nova/condor-scripts/setup_nova_condor.sh
(csh): $ source /grid/fermiapp/nova/condor-scripts/setup_nova_condor.csh
gpsn01 (MINERVA):
(bash): $ . /grid/fermiapp/minerva/condor-scripts/setup_minerva_condor.sh
(csh): $ source /grid/fermiapp/minerva/condor-scripts/setup_minerva_condor.csh
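A small guard like the following can go in a login script so the setup file is only sourced where it actually exists. The paths are the ones listed above, but the wrapper itself is just a sketch:

```shell
# Sketch: source the experiment's condor setup only if it is present.
EXPERIMENT=minerva            # or: nova
SETUP_SCRIPT=/grid/fermiapp/$EXPERIMENT/condor-scripts/setup_${EXPERIMENT}_condor.sh

if [ -r "$SETUP_SCRIPT" ]; then
    . "$SETUP_SCRIPT"
    status="condor environment configured for $EXPERIMENT"
else
    status="setup script not found for $EXPERIMENT: $SETUP_SCRIPT"
fi
echo "$status"
```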
- The JobSub Script : After sourcing the above file, the command (experiment)_jobsub can be used to submit to either the local condor cluster or the grid. This is modeled after Minos' example; see 'Condor at MINOS' http://www-numi.fnal.gov/condor/index.html for the general flavor. The 'minos_jobsub' script becomes 'nova_jobsub' or 'minerva_jobsub' on gpsn01. The '-h' option on this script lists all the possible input flags and options.
- BlueArc Shared Disk : A large disk pool is available on the interactive machines, gpsn01, the local condor worker nodes and the grid worker nodes. The disk is mounted differently on the grid worker nodes than the local nodes for security reasons.
It is important to not have all of your hundreds of grid jobs accessing the BlueArc disk at the same time. Use the throttled copy commands provided for this purpose (just like the unix mv and cp commands, except they queue up to spare BlueArc the trauma of too many concurrent accesses) to copy data on to and off of the BlueArc disks.
- Condor Logs/Output : The jobsub script and condor have created at least 3 files: a cmd file that condor uses as input, an err file where any errors have gone, and an out file containing the job's output to stdout. The output directory is $CONDOR_TMP, which translates to /grid/data/(experiment)/condor-tmp/(username).
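The three files can be inspected like this. Since $CONDOR_TMP only exists on the cluster, the example uses a temporary stand-in directory and a hypothetical job name:

```shell
# Stand-in for $CONDOR_TMP (/grid/data/(experiment)/condor-tmp/(username)
# on the real cluster), with a hypothetical job called myjob.
CONDOR_TMP=$(mktemp -d)
touch "$CONDOR_TMP/myjob.cmd" "$CONDOR_TMP/myjob.err" "$CONDOR_TMP/myjob.out"

# List the three files jobsub and condor leave behind for each job.
ls "$CONDOR_TMP" | sort
```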
- Killing Condor Jobs : To terminate a condor job, first use the condor_q command to find the Condor ID of the jobs you wish to terminate, then use condor_rm to remove them. Both of these commands are placed in your path when you run setup_nova_condor.sh (or setup_minerva_condor.sh).
To remove a particular job use
% condor_rm <Job_ID>
To kill all of a user's jobs, use
% condor_rm <User_ID>
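When many jobs need removing at once, the IDs can be harvested from condor_q output and fed to condor_rm. The column layout below is an assumption about what condor_q prints (a stand-in sample is used so the snippet is self-contained); check it against the real output before relying on field positions:

```shell
# Stand-in condor_q output (assumed column layout; verify against the
# real condor_q before relying on field positions).
sample_q='ID     OWNER  SUBMITTED    RUN_TIME   ST PRI SIZE CMD
123.0  alice  5/1  10:00  0+00:01:00 R  0   9.8  myjob.sh
124.0  alice  5/1  10:05  0+00:00:10 I  0   9.8  myjob.sh'

# Skip the header row and keep the first column (the Condor job IDs).
job_ids=$(printf '%s\n' "$sample_q" | awk 'NR>1 {print $1}')
echo "$job_ids"
# On the cluster you would then run:  condor_rm $job_ids
```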
- Luke Corwin has written a guide for NOvA job submission on gpsn01 at 'Using Condor on IF Cluster Machines' https://cdcvs.fnal.gov/redmine/wiki/nova-cvs/Using_Condor_on_IF_Cluster_Machines#Running-on-the-Grid . I tested substituting 'minerva' everywhere Luke's document says 'nova', and the recipe appears to work for registering on gpsn01 and submitting minerva jobs from it.
- A monitoring page for the minerva condor pool is at http://gpsn01.fnal.gov:8080/condor_monitoring/index_day_all.html .
another useful link is http://gpsn01.fnal.gov:8080/condor_view/ . The links on the right side, the user jobs, are the most interesting and useful.
- A monitoring page for the gpsn01 NOvA condor pool is at http://gpsn01.fnal.gov/condor_monitor/index_day_all.html .