Admin Questions

Administrator Questions (General)

How do I add a user account?

  • to add cvs access for minerva:
    1. ssh -l minervacvs cdcvs
    2. adduser "username" . (you must already be in the .k5login to do this.)

How can I test the SAM database repositories, after an upgrade for example?

  • See documentation here

How does an admin set up ups/upd for a new experiment?

  • See instructions here.

Is there a list of afs commands somewhere?

Group accounts and authentication

Service principals allow group accounts to do things

  • SharePoint doc on service principals here

For more information, see Shared Accounts

Administrator Questions (GRID)

Getting approval for adding people to the mars groups.

You have to email Nikolai Mokhov and get his approval before you add anyone to a "mars" VO group. Adding people to one of these groups gives them access to the MARS software, which is controlled.

User (someuser) requested robot cert. What do I do?

  • User (someuser) just ran the request_robot_cert script, which generated a jira ticket requesting a robot cert. You need to add them to the proper group in vomrs and wait for the entry to propagate to the voms database. Perform these steps:
    • kill your browser, -i to get a kx509 certificate, restart your browser
    • go to . You will get a popup asking for your KCA cert, click OK
    • click the plus on [+]Members in the left-hand menu column
    • click the plus on [+]Certificates in the left-hand menu column
    • click on Manage Groups and Group Roles
    • In the form that appears, enter %(someuser)% in the first line; in the second line, select /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM from the list of over 100 possible selections. The '%' is an SQL wildcard; surround (someuser)'s name with it to make the form work properly. Hit the 'search' button.
    • you need to know which experiment (someuser) is requesting a cert for. If the request went to minerva-computing or nova-computing it's obvious; otherwise you might need to ask. A grey table with checkboxes on the right-hand side should appear, filled with a list of Experiments and Roles. Check both the box by the /fermilab/experiment row and the box by the Analysis row. Hit the 'submit' button.
    • you should get a green status bar across the top saying 'You have successfully assigned member(s) to group/role!'
    • go to the jira ticket created when (someuser) ran the script, add (someuser) as a watcher
      • you may have to add (someuser) as a jira user first
    • comment in the ticket that you are waiting for your changes to propagate to voms; this supposedly takes about an hour
  • How to check that the ticket has propagated to voms
    • go to . You will be asked for your KCA cert, as with vomrs
    • click 'Administer the VO' from the Left Hand Menu
    • click 'Search for Users' from the Left Hand Menu
    • enter (someuser) in the form without the % wildcards. Hit search.
    • a list of entries will appear. You are looking to see that the robot cert requested earlier has propagated. If you see it, close the ticket and inform the user that their cert has been created and they can submit to the grid.

Where is condor, glideinWMS, VDT on machine X?

  • the version of condor being used is always soft-linked to /opt/condor/
    • set up condor by sourcing /opt/condor/condor.(c)sh
  • GlideinWMS information
    • glideinWMS 1.6.2 is split into two parts, running under id gfactory and gfrontend
      • the 'Glidein Factory' of glideinWMS lives at /home/gfactory
        • /home/gfactory/.globus contains usercert.pem and userkey.pem, the gfactory cert
        • /home/gfactory/scripts/ contains all the scripts needed to start/stop the factory, refresh the cert, etc.
      • the 'VO Frontend' lives at /home/gfrontend
        • /home/gfrontend/scripts again shows and documents how to do most common tasks
    • glideinWMS 2.5.1 was significantly reorganized.
      • everything runs as user 'gfactory' (gfrontend not used)
      • see ~gfactory/scripts directory for most common tasks
        [gfactory@gpsn01 scripts]$ ls
  • VDT can live in various places, multiple versions can be on the machine due to upgrades
    • /usr/local/vdt is often a soft link to /usr/local/current-vdt-version, if VDT was installed as root
    • /home/grid/vdt is often a soft link to /home/grid/current-vdt-version if it was installed as user 'grid'
      • source $VDT_LOCATION/setup.(c)sh to put all the vdt tools in your path
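Since the active condor and VDT versions are found by following soft links, a small helper can show where each link currently points. This is only a sketch: the function name show_link_target is made up for illustration and is not an existing script on these machines, and the paths are the ones listed above, which may not exist on every node.

```shell
#!/bin/sh
# Hypothetical helper (not an existing script on these machines):
# print where a soft link points, or a note if the path is missing or not a link.
show_link_target() {
    path=${1%/}    # drop one trailing slash so -L tests the link itself
    if [ -L "$path" ]; then
        printf '%s -> %s\n' "$path" "$(readlink "$path")"
    elif [ -e "$path" ]; then
        printf '%s is not a soft link\n' "$path"
    else
        printf '%s does not exist\n' "$path"
    fi
}

# Paths taken from the list above; adjust for the machine at hand.
for p in /opt/condor /usr/local/vdt /home/grid/vdt; do
    show_link_target "$p"
done
```

The helper only reads link targets, so it is safe to run as any user.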

What are the basics I need to know about condor on these machines?

  • to start condor: sudo /etc/init.d/condor start ; to stop: sudo /etc/init.d/condor stop
  • source /opt/condor/ to set up environment so condor tools work
  • condor_config_val is your friend. For example, condor_config_val LOG shows where log directories are, condor_config_val -config shows where config files are, condor_config_val -dump shows the settings for everything
  • to turn on debugging for a certain part of condor (for example, the collector), edit a config file and add the following line:
  • another example: if01 is heavy on the debug logging right now, which means its log files turn over very quickly. It should be dialed back a bit.
    [dbox@if01 ~]$ condor_config_val -dump | grep D_FULLDEBUG
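The grep above prints whole lines from the dump. If only the names of the settings are wanted, a filter along the following lines might help. It is a sketch that assumes the dump prints one "NAME = value" pair per line; the function name full_debug_settings is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "NAME = value" lines on stdin (the format assumed
# for condor_config_val -dump output) and print the names of settings whose
# value contains D_FULLDEBUG.
full_debug_settings() {
    awk '{
        i = index($0, "=")                      # position of the first "="
        if (i && index(substr($0, i + 1), "D_FULLDEBUG")) {
            name = substr($0, 1, i - 1)
            sub(/ +$/, "", name)                # trim spaces before the "="
            print name
        }
    }'
}

# Only query condor when its tools are actually on the PATH.
if command -v condor_config_val >/dev/null 2>&1; then
    condor_config_val -dump | full_debug_settings
fi
```

Each name it prints can then be passed back to condor_config_val to see the full value.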

User complains that submitted grid jobs are just sitting idle. What should I check?

  • Log on as the gfactory account. ssh gfactory@gpsn01
  • Did the factory die? /home/gfactory/scripts/ will tell you.
  • Did the frontend die? /home/gfactory/scripts/ will tell you.
  • Did any of the certs expire? /home/gfactory/scripts/ will tell you
  • NB ALL of the above problems will usually be fixed by /home/gfactory/scripts/
  • an old problem, fixed as of v2_5_0, kept as historical info on the sorts of weird things you might need to check
    • Sometimes (well, twice as of this writing) the factory will not restart due to the presence of a 0-length file of the form 'condor_activity_(date)_(entry_pt_name)(factory_version)(node_name)@(more_crap).log'. I found and removed one named condor_activity_20100706_gp-general@v1_0@gpcf026@fnal_gpcf026.log on gpcf026 in directory /home/gfactory/glideinsubmit/glidein_v1_0_1/entry_fermigrid/log. Use the find command to find it, then REMOVE THIS FILE and start the factory using the script. Yes, glideinWMS-support is aware of this issue; the errors in the log files are less than helpful.
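The find step for the 0-length activity log could look roughly like this. It is a sketch: the function name find_empty_activity_logs is made up, and the directory is the one from the incident described above, so it may differ for other factory versions or entry points. It only lists candidates; removing them is left to the admin.

```shell
#!/bin/sh
# Hypothetical helper: list zero-length condor_activity_*.log files under a
# directory. It prints candidates only and deletes nothing.
find_empty_activity_logs() {
    find "$1" -type f -name 'condor_activity_*.log' -size 0
}

# Example directory from the incident described above; it may not exist elsewhere.
logdir=/home/gfactory/glideinsubmit/glidein_v1_0_1/entry_fermigrid/log
if [ -d "$logdir" ]; then
    find_empty_activity_logs "$logdir"
fi
```

Listing before deleting makes it easy to confirm that the file really is empty and really is an activity log.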

How do I check that stuff is running, and stop/start it?

General health of gpcf machines can be probed at

General load and health of gpsn01 submitter node can be seen at

GlideinWMS has several important processes.

For glideinWMS 1.6: the factory runs as user gfactory, the frontend runs as user gfrontend, and the web server runs as user apache.

For glideinWMS 2.5.1: the factory and the frontend both run as user gfactory.

To check if the frontend is running: ssh gfactory@gpsn01 ; scripts/
To stop the frontend: ssh gfactory@gpsn01 ; scripts/
To start the frontend: ssh gfactory@gpsn01 ; scripts/
Quick shortcut: ssh gfactory@gpsn01; scripts/

To check if the factory is running: ssh gfactory@gpsn01 ; source scripts/
To stop the factory: ssh gfactory@gpsn01 ; scripts/
To start the factory: ssh gfactory@gpsn01 ; scripts/
Quick shortcut: ssh gfactory@gpsn01 ; ./

To check if the web server is running: ps auxww | grep httpd
To stop the web server: sudo /etc/init.d/httpd stop
To start the web server: sudo /etc/init.d/httpd start

To check if condor is running: ps auxww | grep condor
To stop condor: sudo /etc/init.d/condor stop
To start condor: sudo /etc/init.d/condor start
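One wrinkle with the ps | grep checks above is that the grep process itself can show up in the output, which makes it look as if something is running when it is not. A small filter such as the following avoids that. It is a sketch, and the function name ps_filter is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ps output on stdin and print the lines that mention
# the given name, dropping any line that belongs to a grep command.
ps_filter() {
    grep -- "$1" | grep -v 'grep'
}

# Typical use on a live machine (commented out so the sketch has no side effects):
#   ps auxww | ps_filter httpd
#   ps auxww | ps_filter condor
```

Empty output then means no matching process was found.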

Remote admins

Amit Kumar

John Brunelle