Frequently Asked Questions

NOTE! This page is obsolete. Please go to Intensity Frontier infrastructure page for current info.

User Questions

Help! Something doesn't work like I think it should!

  • Send email to , or if you are part of one of these experiments. Otherwise send mail to . Doing this opens a jira ticket, which is much better than opening a servicedesk ticket: (a) it gets to the experts faster, and (b) it leaves behind a searchable archive for the experts when they see the same problem six months later and have forgotten what they did last time.

I submitted a bunch of jobs, they exited immediately. What went wrong?

  • When you set up your condor environment using setup_condor.sh, setup_minerva_condor.sh, or setup_(someExperiment)_condor.sh, you created the environment variables $CONDOR_TMP and $CONDOR_EXEC. Change directories to $CONDOR_TMP and do an ls -lart. There should be a bunch of files ending in .log, .out, and .err. The .err files are the most useful for debugging this kind of problem: everything that went to stderr during your job's execution is in these files. The .out files contain everything that went to stdout and are also a good place to look for clues.
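
The check described above can be sketched as a small shell helper. This is only an illustration: the function name list_job_errors is made up for this example, and it assumes $CONDOR_TMP was set by one of the setup scripts. It prints the .err files that actually contain something, since empty ones carry no clues.

```shell
# Hypothetical helper (not part of any condor setup script):
# print every non-empty .err file in a job output directory.
# Defaults to $CONDOR_TMP, falling back to the current directory.
list_job_errors() {
    local dir="${1:-${CONDOR_TMP:-.}}"
    local f
    for f in "$dir"/*.err; do
        # -s is true only if the file exists and is larger than zero bytes
        if [ -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_job_errors with no argument looks in $CONDOR_TMP; the files it prints are the ones worth opening first.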

I am having trouble with ups/upd. How do I set it up for my experiment?

  • Each experiment has its own customized ups area. An example of how to set this up in your .profile or .bash file is shown here: set up ups. You can quickly check whether you are using the correct setup by typing "which ups". If the path has your experiment's name in it, it is OK. For example, on a minerva login:
<if02> which ups
/grid/fermiapp/products/minerva/prd/ups/v4_7_4a/Linux-2/bin/ups
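
The same check can be scripted. The sketch below is an assumption-laden illustration: the function name ups_path_matches is invented here, and it simply tests whether the experiment name appears as a directory component of the ups path, as it does in the minerva example above.

```shell
# Hypothetical helper: succeed if the ups executable path contains the
# experiment name as one of its directory components.
# Usage: ups_path_matches EXPERIMENT [PATH]
# If PATH is omitted, the ups found on $PATH is used.
ups_path_matches() {
    local experiment="$1"
    local ups_path="${2:-$(command -v ups 2>/dev/null)}"
    case "$ups_path" in
        */"$experiment"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, ups_path_matches minerva && echo OK prints OK on a correctly configured minerva login.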

I need a version of a package that is not in the ups setup. How can I get this installed?

  • Open a problem ticket by sending mail to [minerva|nova|ifront]-. Specify the package and version you need, and the experiment you need it for.

How can I log into a Fermilab kerberized machine from my non-kerberized laptop?

Administrator Questions (General)

How do I add a user account?

  • to add cvs access for minerva:
    1. ssh -l minervacvs cdcvs
    2. adduser "username" . (you must already be in the .admin file to do this. If you are not, open a service desk ticket and request to be added.)

How does an admin setup ups/upd for a new experiment?

  • See instructions here .

Is there a list of afs commands somewhere?

Administrator Questions (GRID)

User (someuser)@fnal.gov requested robot cert. What do I do?

  • User (someuser) just ran the request_robot_cert script, which generated a jira ticket requesting a robot cert. You need to add them to the proper group in vomrs and wait for the entry to propagate to the voms database. Perform these steps:
    • kill your browser, run getcert.sh -i to get a kx509 certificate, then restart your browser
    • IF the request is for the lbne group (NOT lbne/mars however):
    • ELSE request for any other group including lbne/mars:
    • click on the plus on the [+]Members on left hand menu column
    • click on the plus on the [+]Certificates on the left hand menu column
    • click on Manage Groups and Group Roles
    • In the form that appears, enter %(someuser)% in the first line; in the second line select /DC=gov/DC=fnal/O=Fermilab/OU=Certificate Authorities/CN=Kerberized CA HSM from the list of over 100 possible selections. The '%' is an SQL wildcard that you surround (someuser)'s name with to make the form work properly. Hit the 'search' button.
    • you need to know which experiment (someuser) is requesting a cert for. If the request went to minerva-computing or nova-computing it's obvious; otherwise you might need to ask. A grey table with checkboxes on the right-hand side should appear, filled with a list of Experiments and Roles. Check both the box in the /fermilab/experiment row and the box in the Analysis row. Hit the 'submit' button.
    • you should get a green status bar across the top saying 'You have successfully assigned member(s) to group/role!'
    • go to the jira ticket created when (someuser) ran the script, add (someuser) as a watcher
      • you may have to add (someuser) as a jira user first
    • comment in the ticket that you are waiting for your changes to propagate to voms; this supposedly takes about an hour
  • How to check that the change has propagated to voms
    • go to https://voms.fnal.gov:8443/voms/fermilab/ (you will be asked for your KCA ticket, as with vomrs)
    • click 'Administer the VO' from the Left Hand Menu
    • click 'Search for Users' from the Left Hand Menu
    • enter (someuser) in the form without the % wildcards. Hit search.
    • a list of entries will appear. You are looking to see that the robot cert requested earlier has propagated. If you see it, close the ticket and inform the user that their cert has been created and they can submit to the grid.

Where is condor, glideinWMS, VDT on machine X?

  • the version of condor being used is always soft-linked to /opt/condor/
    • set up condor by sourcing /opt/condor/condor.(c)sh
  • GlideinWMS information
    • glideinWMS 1.6.2 is split into two parts, running under id gfactory and gfrontend
      • the 'Glidein Factory' of glideinWMS lives at /home/gfactory
        • /home/gfactory/.globus contains usercert.pem and userkey.pem, the gfactory cert
        • /home/gfactory/scripts/ contains all the scripts needed to start/stop the factory, refresh the cert, etc.
      • the 'VO Frontend' lives at /home/gfrontend
      • /home/gfrontend/scripts again shows and documents how to do most common tasks
    • glideinWMS 2.5.1 was significantly reorganized.
      • everything runs as user 'gfactory' (gfrontend not used)
      • see ~gfactory/scripts directory for most common tasks
        [gfactory@gpsn01 scripts]$ ls
        factory_status.sh         restart_factory.sh   stop_factory.sh
        frontend_status.sh        restart_frontend.sh  stop_glideinWMS.sh
        glideinWMS_status_all.sh  setup.sh
        refresh_proxy.sh          start_glideinWMS.sh
        
        
  • VDT can live in various places, multiple versions can be on the machine due to upgrades
    • /usr/local/vdt is often a soft link to /usr/local/current-vdt-version, if VDT was installed as root
    • /home/grid/vdt is often a soft link to /home/grid/current-vdt-version if it was installed as user 'grid'
      • source $VDT_LOCATION/setup.(c)sh to put all the vdt tools in your path

What are the basics I need to know about condor on these machines?

  • to start condor: sudo /etc/init.d/condor start ; to stop: sudo /etc/init.d/condor stop
  • source /opt/condor/condor.sh to set up environment so condor tools work
  • condor_config_val is your friend. For example, condor_config_val LOG shows where log directories are, condor_config_val -config shows where config files are, condor_config_val -dump shows the settings for everything
  • to turn on debugging for a certain part of condor (for example, the collector), edit a config file and add the following line:
    COLLECTOR_DEBUG =  D_FULLDEBUG D_SECURITY
    
  • another example, if01 is heavy on the debug logging right now, this means log files turn over very quickly. It should be dialed back a bit.
    [dbox@if01 ~]$ condor_config_val -dump | grep D_FULLDEBUG
    COLLECTOR_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY D_PROTOCOL
    CREDD_DEBUG = D_FULLDEBUG
    LEASEMANAGER_DEBUG = D_FULLDEBUG
    NEGOTIATOR_DEBUG = D_MATCH D_FULLDEBUG D_COMMAND D_SECURITY D_PROTOCOL
    SCHEDD_DEBUG = D_COMMAND D_PID D_SECURITY D_PROTOCOL D_FULLDEBUG
    SHADOW_DEBUG = D_FULLDEBUG D_COMMAND D_SECURITY D_PROTOCOL
    STARTD_DEBUG = D_COMMAND D_FULLDEBUG D_SECURITY D_PROTOCOL
    STARTER_DEBUG = D_NODATE D_FULLDEBUG D_SECURITY D_PROTOCOL
    STORK_DEBUG = D_FULLDEBUG
    TOOL_DEBUG = D_FULLDEBUG
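
To see at a glance which daemons have full debugging turned on, the dump can be filtered a little further. This is a rough sketch: the function name fulldebug_daemons is invented for this example, and it assumes the dump lines have the NAME_DEBUG = flags form shown above.

```shell
# Hypothetical filter: read `condor_config_val -dump` output on stdin and
# print the daemon name for every *_DEBUG setting that includes D_FULLDEBUG.
fulldebug_daemons() {
    awk -F' *= *' '
        /_DEBUG *=/ && /D_FULLDEBUG/ {
            gsub(/^[ \t]+/, "", $1)   # drop any leading indentation
            sub(/_DEBUG$/, "", $1)    # COLLECTOR_DEBUG -> COLLECTOR
            print $1
        }'
}
```

Typical use would be condor_config_val -dump | fulldebug_daemons, which gives a short list of the daemons to consider dialing back.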
    

User complains that submitted grid jobs are just sitting idle. What should I check?

  • Did the gfactory die? /home/gfactory/glideinsubmit/glidein_v1_0/factory_startup status will tell you.
  • Did the cert expire? source $VDT_LOCATION/setup.sh ; voms-proxy-info -all will tell you
  • NB BOTH of the above problems will be fixed by /home/gfactory/scripts/restart_factory.sh which is on a kcron job from dbox's account on the same machine.
  • Sometimes (well, twice as of this writing) the factory will not restart due to the presence of a 0-length file of the form 'condor_activity_(date)_(entry_pt_name)@(factory_version)@(node_name)@(more_crap).log'. I found and removed one named condor_activity_20100706_gp-general@v1_0@gpcf026@fnal_gpcf026.log on gpcf026 in directory /home/gfactory/glideinsubmit/glidein_v1_0_1/entry_gpgeneral/log. Use the find command to find it, then REMOVE THIS FILE and start the factory using the restart_factory.sh script. Yes, glideinWMS-support is aware of this issue; the errors in the log files are less than helpful.
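
The find step for the 0-length file can be sketched as follows. The function name find_empty_activity_logs is made up for this example, and the directory you pass it (for instance a glidein_v1_0 tree under /home/gfactory/glideinsubmit) depends on your installation. The sketch only lists candidates; removing them is left as a deliberate manual step.

```shell
# Hypothetical helper: list zero-length condor_activity_*.log files
# anywhere below the given directory. It does not delete anything.
find_empty_activity_logs() {
    local top="$1"
    # -size 0 matches only files that are completely empty
    find "$top" -type f -name 'condor_activity_*.log' -size 0
}
```

After checking the output, remove the listed file by hand and run restart_factory.sh.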

How do I check that stuff is running, and stop/start it?

General health of gpcf machines can be probed at http://fefganglia.fnal.gov/?c=GPCF

General load and health of gpsn01 submitter node can be seen at http://fefganglia.fnal.gov/?m=load_one&r=day&s=descending&c=GPCF&h=gpsn01.fnal.gov&sh=1&hc=4&z=small

GlideinWMS has several important processes.
For glideinWMS 1.6: the factory runs as user gfactory, the frontend runs as user gfrontend, and the web server runs as user apache.

For glideinWMS 2.5.1: the factory and the frontend both run as user gfactory.

To check if the frontend is running: ssh gfactory@gpsn01 ; scripts/frontend_status.sh
To stop the frontend: ssh gfactory@gpsn01 ; scripts/stop_frontend.sh
To start the frontend: ssh gfrontend@gpsn01 ; scripts/start_frontend.sh
Quick shortcut: ssh gfactory@gpsn01; scripts/restart_frontend.sh

To check if the factory is running: ssh gfactory@gpsn01 ; source scripts/factory_status.sh
To stop the factory: ssh gfactory@gpsn01 ; scripts/stop_factory.sh
To start the factory: ssh gfactory@gpsn01 ; scripts/start_factory.sh
Quick shortcut: ssh gfactory@gpsn01 ; scripts/restart_factory.sh

To check if the web server is running: ps auxww | grep httpd
To stop the web server: sudo /etc/init.d/httpd stop
To start the web server: sudo /etc/init.d/httpd start

To check if condor is running: ps auxww | grep condor
To stop condor: sudo /etc/init.d/condor stop
To start condor: sudo /etc/init.d/condor start
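
One wrinkle with the ps | grep checks above is that the grep process itself shows up in the output, so a match does not always mean the service is up. A minimal sketch of a counter that ignores it is shown below; the function name count_procs is invented for this example, and dropping every line that mentions grep is a crude assumption that is good enough for httpd and condor.

```shell
# Hypothetical helper: read `ps auxww` output on stdin and print how many
# lines mention NAME, not counting the grep commands themselves.
count_procs() {
    local name="$1"
    # grep -c prints the count (0 if none) but exits non-zero on 0 matches,
    # so "|| true" keeps the function's exit status clean.
    grep -F -- "$name" | grep -v -c 'grep' || true
}
```

For example, ps auxww | count_procs httpd prints 0 when the web server is down, even though a plain grep would still show its own command line.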