Administrator Questions (General)

How do I add a user account?

  • to add cvs access for minerva:
    1. ssh -l minervacvs cdcvs
    2. adduser "username" (you must already be in the .k5login to do this)

How can I test the SAM database repositories, after an upgrade for example?

  • See documentation here

How does an admin setup ups/upd for a new experiment?

  • See instructions here.

Is there a list of afs commands somewhere?

Group accounts and authentication

Service principals allow group accounts to authenticate with Kerberos non-interactively, using a keytab file.

  • how to get a service principal here.

You and the sys-admins need to agree on a location for the keytab file, and it must be readable only by the group account.

Then to kinit you need to do

kinit -k -t <location of the keytab file> <groupaccount>/<nodename>.fnal.gov@FNAL.GOV

  • sharepoint doc on service principals here
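
The kinit step above can be wrapped in a small helper. This is only a sketch: the function names service_principal and group_kinit are hypothetical, the account, node and keytab path are placeholders, and it assumes the usual Kerberos primary/instance@REALM form for the principal and GNU stat (Linux) for the permission check.

```shell
#!/bin/sh
# Hypothetical helpers; account, node, and keytab path are placeholders.

# Build the service principal name in the usual Kerberos
# primary/instance@REALM form.
service_principal() {
    account=$1
    node=$2
    printf '%s/%s.fnal.gov@FNAL.GOV\n' "$account" "$node"
}

# Refuse to use a keytab that anyone other than its owner can read,
# then run kinit with it. Uses GNU stat (assumption: Linux host).
group_kinit() {
    keytab=$1
    account=$2
    node=$3
    perms=$(stat -c '%a' "$keytab") || return 1
    case $perms in
        400|600) ;;
        *) echo "keytab $keytab has mode $perms, expected 400 or 600" >&2
           return 1 ;;
    esac
    kinit -k -t "$keytab" "$(service_principal "$account" "$node")"
}
```

Example call, run as the group account: group_kinit <location of the keytab file> <groupaccount> <nodename>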

Administrator Questions (GRID)

Getting approval for adding people to the mars groups.

You have to email Nikolai Mokhov and get his approval before you add anyone to a "mars" VO group. Adding people to one of these groups gives them access to the MARS software, which is access-controlled.

User (someuser)@fnal.gov requested robot cert. What do I do?

  • User (someuser) just ran the request_robot_cert script, which generated a jira ticket requesting a robot cert. You need to add them to the proper group in vomrs and wait for the entry to propagate to the voms database. Perform these steps:
    • kill your browser, getcert.sh -i to get a kx509 certificate, restart your browser
    • go to https://vomrs.fnal.gov:8443/vomrs/vo-fermilab/vomrs . You will get a popup asking for your KCA cert, click OK
    • click on the plus on the [+]Members on left hand menu column
    • click on the plus on the [+]Certificates on the left hand menu column
    • click on Manage Groups and Group Roles
    • In the form that appears, enter %(someuser)% in the first line; in the second line select /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM from the list of over 100 possible selections. The '%' is an SQL wildcard; surround (someuser)'s name with it to make the form work properly. Hit the 'search' button.
    • you need to know which experiment (someuser) is requesting a cert for. If the request went to minerva-computing or nova-computing it's obvious; otherwise you might need to ask. A grey table with checkboxes on the right hand side should appear, filled with a list of Experiments and Roles. Check both the box by the /fermilab/experiment row and the Analysis row. Hit the 'submit' button.
    • you should get a green status bar across the top saying 'You have successfully assigned member(s) to group/role!'
    • go to the jira ticket created when (someuser) ran the script, add (someuser) as a watcher
      • you may have to add (someuser) as a jira user first
    • comment in the ticket that you are waiting for your changes to propagate to voms; it supposedly takes about an hour
  • How to check that the change has propagated to voms
    • go to https://voms.fnal.gov:8443/voms/fermilab/ . You will be asked for your KCA cert, as with vomrs
    • click 'Administer the VO' from the Left Hand Menu
    • click 'Search for Users' from the Left Hand Menu
    • enter (someuser) in the form without the % wildcards. Hit search.
    • a list of entries will appear. You are looking to see that the robot cert requested earlier has propagated. If you see it, close the ticket and inform the user that their cert has been created and they can submit to the grid.

Where is condor, glideinWMS, VDT on machine X?

  • the version of condor being used is always soft-linked to /opt/condor/
    • set up condor by sourcing /opt/condor/condor.(c)sh
  • GlideinWMS information
    • glideinWMS 1.6.2 is split into two parts, running under id gfactory and gfrontend
      • the 'Glidein Factory' of glideinWMS lives at /home/gfactory
        • /home/gfactory/.globus contains usercert.pem and userkey.pem, the gfactory cert
        • /home/gfactory/scripts/ contains all the scripts needed to start/stop the factory, refresh the cert, etc.
      • the 'VO Frontend' lives at /home/gfrontend
        • /home/gfrontend/scripts again shows and documents how to do most common tasks
    • glideinWMS 2.5.1 was significantly reorganized.
      • everything runs as user 'gfactory' (gfrontend not used)
      • see ~gfactory/scripts directory for most common tasks
        [gfactory@gpsn01 scripts]$ ls
        factory_status.sh         restart_factory.sh   stop_factory.sh
        frontend_status.sh        restart_frontend.sh  stop_glideinWMS.sh
        glideinWMS_status_all.sh  setup.sh
        refresh_proxy.sh          start_glideinWMS.sh
        
        
  • VDT can live in various places, multiple versions can be on the machine due to upgrades
    • /usr/local/vdt is often a soft link to /usr/local/current-vdt-version, if VDT was installed as root
    • /home/grid/vdt is often a soft link to /home/grid/current-vdt-version if it was installed as user 'grid'
      • source $VDT_LOCATION/setup.(c)sh to put all the vdt tools in your path

What are the basics I need to know about condor on these machines?

  • to start condor: sudo /etc/init.d/condor start ; to stop: sudo /etc/init.d/condor stop
  • source /opt/condor/condor.sh to set up environment so condor tools work
  • condor_config_val is your friend. For example, condor_config_val LOG shows where log directories are, condor_config_val -config shows where config files are, condor_config_val -dump shows the settings for everything
  • to turn on debugging for a certain part of condor (for example, the collector), edit a config file and add the following line:
    COLLECTOR_DEBUG =  D_FULLDEBUG D_SECURITY
    
  • another example: if01 is heavy on the debug logging right now, which means log files turn over very quickly. It should be dialed back a bit.
    [dbox@if01 ~]$ condor_config_val -dump | grep D_FULLDEBUG
    COLLECTOR_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY D_PROTOCOL
    CREDD_DEBUG = D_FULLDEBUG
    LEASEMANAGER_DEBUG = D_FULLDEBUG
    NEGOTIATOR_DEBUG = D_MATCH D_FULLDEBUG D_COMMAND D_SECURITY D_PROTOCOL
    SCHEDD_DEBUG = D_COMMAND D_PID D_SECURITY D_PROTOCOL D_FULLDEBUG
    SHADOW_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY D_PROTOCOL
    STARTD_DEBUG = D_COMMAND D_FULLDEBUG D_SECURITY D_PROTOCOL
    STARTER_DEBUG = D_NODATE D_FULLDEBUG D_SECURITY D_PROTOCOL
    STORK_DEBUG = D_FULLDEBUG
    TOOL_DEBUG = D_FULLDEBUG
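
To see at a glance which daemons have full debugging turned on, the dump above can be filtered. A minimal sketch, assuming the "NAME_DEBUG = flags" line format shown in the listing; the function name fulldebug_daemons is hypothetical.

```shell
#!/bin/sh
# Read "NAME_DEBUG = flags" lines on stdin and print the name of each
# daemon whose debug setting includes D_FULLDEBUG.
fulldebug_daemons() {
    awk -F' *= *' '/_DEBUG *=/ && /D_FULLDEBUG/ {
        name = $1
        gsub(/^[ \t]+/, "", name)   # drop any leading indentation
        sub(/_DEBUG$/, "", name)    # COLLECTOR_DEBUG -> COLLECTOR
        print name
    }'
}
```

On a live node this would be used as: condor_config_val -dump | fulldebug_daemons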
    

User complains that submitted grid jobs are just sitting idle. What should I check?

  • Log on as the gfactory account. ssh gfactory@gpsn01
  • Did the factory die? /home/gfactory/scripts/factory_status.sh will tell you.
  • Did the frontend die? /home/gfactory/scripts/frontend_status.sh will tell you.
  • Did any of the certs expire? /home/gfactory/scripts/check_proxy.sh will tell you
  • NB ALL of the above problems will usually be fixed by /home/gfactory/scripts/restart_glideinWMS.sh
  • An old problem, fixed as of v2_5_0, kept here as an example of the sort of odd thing you might need to check:
    • Sometimes (well twice as of this writing) the factory will not restart due to the presence of a 0-length file of the form 'condor_activity_(date)_(entry_pt_name)(factory_version)(node_name)@(more_crap).log '. I found and removed one named condor_activity_20100706_gp-general@v1_0@gpcf026@fnal_gpcf026.log on gpcf026 in directory /home/gfactory/glideinsubmit/glidein_v1_0_1/entry_fermigrid/log. Use the find command to find it, then REMOVE THIS FILE and start the factory using the restart_factory.sh script. Yes, glideinWMS-support is aware of this issue, the errors in the log files are less than helpful.
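
The search for the 0-length log file described above can be sketched with find. The function name find_empty_activity_logs is hypothetical; the directory you pass in is whatever glideinsubmit log directory applies on your node. The helper only lists candidates, so removing them stays a deliberate manual step.

```shell
#!/bin/sh
# List zero-length condor_activity_*.log files under the given directory.
# "-size 0" matches only empty files; nothing is deleted here.
find_empty_activity_logs() {
    find "$1" -type f -name 'condor_activity_*.log' -size 0
}
```

For example: find_empty_activity_logs /home/gfactory/glideinsubmit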

How do I check that stuff is running, and stop/start it?

General health of gpcf machines can be probed at http://fefganglia.fnal.gov/?c=GPCF

General load and health of gpsn01 submitter node can be seen at http://fefganglia.fnal.gov/?m=load_one&r=day&s=descending&c=GPCF&h=gpsn01.fnal.gov&sh=1&hc=4&z=small

GlideinWMS has several important processes.

For glideinWMS 1.6: the factory runs as user gfactory, the frontend runs as user gfrontend, and the web server runs as user apache.

For glideinWMS 2.5.1: the factory and the frontend both run as user gfactory.

To check if the frontend is running: ssh gfactory@gpsn01 ; scripts/frontend_status.sh
To stop the frontend: ssh gfactory@gpsn01 ; scripts/stop_frontend.sh
To start the frontend: ssh gfactory@gpsn01 ; scripts/start_frontend.sh
Quick shortcut: ssh gfactory@gpsn01; scripts/restart_frontend.sh

To check if the factory is running: ssh gfactory@gpsn01 ; source scripts/factory_status.sh
To stop the factory: ssh gfactory@gpsn01 ; scripts/stop_factory.sh
To start the factory: ssh gfactory@gpsn01 ; scripts/start_factory.sh
Quick shortcut: ssh gfactory@gpsn01 ; scripts/restart_factory.sh

To check if the web server is running: ps auxww | grep httpd
To stop the web server: sudo /etc/init.d/httpd stop
To start the web server: sudo /etc/init.d/httpd start

To check if condor is running: ps auxww | grep condor
To stop condor: sudo /etc/init.d/condor stop
To start condor: sudo /etc/init.d/condor start
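
Note that "ps auxww | grep condor" can also match the grep command itself. A small sketch that avoids this by comparing exact process names; the helper names is_running and report are hypothetical, and the process names httpd and condor_master are assumptions about what the web server and the condor master daemon are called on these nodes.

```shell
#!/bin/sh
# Succeed if a process with exactly this command name is running.
# Matching whole names means the grep in the pipeline never matches itself.
is_running() {
    ps -e -o comm= | grep -Fqx "$1"
}

# Print a one-line status for the given process name.
report() {
    if is_running "$1"; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}
```

For example: report httpd ; report condor_master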

Remote admins

SMU
Amit Kumar

Harvard
John Brunelle