Gpsn01 common tasks

Overview

  • Gpsn01 is a general IF batch submission node servicing
    • a local condor pool (gpwn001-007)
    • an OSG condor pool via glideinWMS 2.5.1 running as uid 'gfactory'

Check current condor version

[dbox@gpsn01 ~]$ . /opt/condor/condor.sh
[dbox@gpsn01 ~]$ condor_version
$CondorVersion: 7.4.2 Mar 29 2010 BuildID: 227044 $
$CondorPlatform: X86_64-LINUX_RHEL5 $

Find current condor config files

[dbox@gpsn01 ~]$ . /opt/condor/condor.sh
[dbox@gpsn01 ~]$ condor_config_val -config
Configuration source:
        /opt/condor-7.4.2/etc/condor_config
Local configuration source:
        /opt/condor-7.4.2/etc/condor_config.local.worker.node
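If you want to inspect or search the active configuration, the paths can be pulled out of that output. This is only a minimal sketch: the function name config_paths is hypothetical, and it assumes each path appears on its own indented line, as in the output above.

```shell
# Hypothetical helper: read `condor_config_val -config` output on stdin
# and print only the file paths (lines that start with "/" once the
# leading whitespace is stripped).
config_paths() {
    sed -n 's|^[[:space:]]*\(/.*\)$|\1|p'
}

# Example use:
#   condor_config_val -config | config_paths | xargs ls -l
```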

Stop new submissions without disturbing running jobs

  1. Run condor_off -peaceful to stop submission of new jobs
         [dbox@gpsn01 ~]$ sudo /opt/condor/sbin/condor_off -peaceful
         
  2. Stop the glidein factory
  • ssh gfactory@gpsn01 ; cd scripts
    [gfactory@gpsn01 scripts]$ pwd
    /home/gfactory/scripts
    [gfactory@gpsn01 scripts]$ ls
    factory_status.sh         refresh_proxy.sh~    start_glideinWMS.sh
    frontend_status.sh        restart_factory.sh   stop_factory.sh
    glideinWMS_status_all.sh  restart_factory.sh~  stop_glideinWMS.sh
    refresh_proxy2.sh         restart_frontend.sh
    refresh_proxy.sh          setup.sh
    
    [gfactory@gpsn01 scripts]$ ./stop_factory.sh
    Shutting down glideinWMS factory v2_5_1@factory:           [OK]
    

    If you use this method, make sure that no enabled cron job will restart the factory. Here is a listing of the crontab with the relevant entry commented out:

    
    [gfactory@gpsn01 ~]$ crontab -l
    @reboot /home/gfactory/scripts/start_glideinWMS.sh
    #08 0-23/3 * * *  /home/gfactory/scripts/restart_factory.sh
    19 0-23/3 * * *  /home/gfactory/restart_frontend.sh
    0,10,20,30,40,50 * * * * /home/gfactory/monitor_gviz/condor_q
    23 * * * * . /home/gfactory/scripts/refresh_proxy.sh
    */3 * * * * /home/gfactory/monitor/monitor_collectdata.sh
    */5 * * * * /home/gfactory/monitor/monitor_makegraphs.sh
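
One way to check for such entries before stopping the factory is to filter the crontab for uncommented lines that mention the restart script. This is a minimal sketch: the function name active_restart_entries is hypothetical, and it assumes the script name restart_factory.sh shown in the listing above.

```shell
# Hypothetical helper: read crontab text on stdin and print the lines
# that are not commented out and that mention restart_factory.sh.
active_restart_entries() {
    grep -v '^[[:space:]]*#' | grep 'restart_factory\.sh'
}

# Example use as gfactory (prints nothing when the entry is disabled):
#   crontab -l | active_restart_entries
```

No output means no enabled entry was found; grep then returns a non-zero status, so the helper can also be used in an if test.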
    

Resume Submission

The safest way to do this is to shut down all still-running condor/glideinWMS processes and then restart them.

Shutdown condor/glideinWMS

Shutdown order: glideinWMS, then condor

  • ssh gfactory@gpsn01 ; cd scripts
    [gfactory@gpsn01 scripts]$ pwd
    /home/gfactory/scripts
    [gfactory@gpsn01 scripts]$ ls
    factory_status.sh         refresh_proxy.sh~    start_glideinWMS.sh
    frontend_status.sh        restart_factory.sh   stop_factory.sh
    glideinWMS_status_all.sh  restart_factory.sh~  stop_glideinWMS.sh
    refresh_proxy2.sh         restart_frontend.sh
    refresh_proxy.sh          setup.sh
    
    [gfactory@gpsn01 scripts]$ ./stop_glideinWMS.sh
    (lots of messages)
    
  • ssh (someone on sudo list) gpsn01
    [dbox@gpsn01 ~]$ sudo /etc/init.d/condor stop     
    
  • now restart everything as documented in the section that immediately follows:

Startup condor/glideinWMS

Order is unimportant

  • ssh gfactory@gpsn01 ; cd scripts ; ./start_glideinWMS.sh
  • ssh (someone on sudo list) gpsn01; sudo /etc/init.d/condor start

If the schedd on gpsn01 gets hung

SSS is giving us root on gpsn01 so we can debug schedd problems. If you log in as yourself you should be able to type "ksu" and become root.

If the schedd on gpsn01 is unresponsive, look at it with ps and check which user owns it:

  • ps -ef | grep condor_schedd | egrep -v 'gfactory|grep'
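
The same ownership check can be scripted so that it is easy to repeat. This is a minimal sketch under stated assumptions: the function name schedd_wrong_owner is hypothetical, the input is expected as "user pid command" columns, and the gfactory processes are excluded just as in the grep above.

```shell
# Hypothetical helper: read "user pid command" lines on stdin and print
# "pid user" for every condor_schedd that is owned by neither "condor"
# nor "gfactory".
schedd_wrong_owner() {
    awk '$3 == "condor_schedd" && $1 != "condor" && $1 != "gfactory" { print $2, $1 }'
}

# Example use (separate -o options keep the header suppression portable):
#   ps -e -o user= -o pid= -o comm= | schedd_wrong_owner
```

Run it several times in a row; a single hit is less interesting than the same owner showing up in every sample.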

If it is healthy, it should be running as the user "condor". It is valid for it to switch to some other user for a few milliseconds to write to a user log file, but not for long enough that you see that user constantly. You can also tail -f the SchedLog file to confirm that it is not making progress. At this point, you can run:

  • strace -f -F -p <pid of schedd found above>

and you should see it hung waiting for a lock. If ANY lines are scrolling by with that command, then the schedd isn't stuck and you'll have to figure out what it's busy doing. That may not be possible from this output alone.

If it's stuck on a lock you can see which file it's trying to lock with:

  • /usr/sbin/lsof -p <pid of schedd found above>

One of the last lines will be some user file. Keep track of that. If it's stuck, at this point you can try a "kill -9" on it and it should exit. The condor_master will start it back up within a minute or so. You can try this a few times and see if it keeps getting stuck on the same user/file. If you can't get it to start then I'd recommend running:

  • su <username of person who owns the file it is getting hung on>
  • mv <filename> <filename>.bak

And then kill -9 the schedd again. This time it won't get hung on that file, since the file no longer exists, and hopefully it won't find a new one to get stuck on. I'm not sure what happens to that job with the job.log missing. You should notify the user that you had to do this.
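
The rename step can be wrapped so that it never overwrites an earlier backup. This is a minimal sketch, not an established procedure: the function name move_aside is hypothetical, and it should be run as the user who owns the file.

```shell
# Hypothetical helper: rename FILE to FILE.bak, refusing to run if FILE
# is missing or if FILE.bak already exists.
move_aside() {
    file=$1
    if [ ! -e "$file" ]; then
        echo "move_aside: $file does not exist" >&2
        return 1
    fi
    if [ -e "$file.bak" ]; then
        echo "move_aside: $file.bak already exists" >&2
        return 1
    fi
    mv -- "$file" "$file.bak"
}

# Example use:
#   move_aside <filename>
```

Refusing to clobber an existing .bak matters if the schedd gets stuck on the same file more than once, because the first backup is the one the user will want back.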

Another very good way to figure out what the schedd is busy doing, if it's not waiting on a lock, is:

  • watch -n 1 'gstack <pid of schedd>'

You'll get a stack trace of the schedd process updated every second. You're watching to see if it's constantly doing the same thing. Maybe it's constantly spending time in authentication code or something. Stare at this for a while and look for trends.
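
Instead of watching by eye, you can collect a number of samples and count which function is at the top of the stack most often. This is a minimal sketch: the function name top_frames is hypothetical, and it assumes the usual gdb-style frame lines that begin with "#0".

```shell
# Hypothetical helper: read concatenated gstack output on stdin and print
# how often each function appears as the innermost frame (#0), most
# frequent first. Frame lines look like either
#   "#0  0x... in func () ..."   or   "#0  func (...) at file:line".
top_frames() {
    awk '$1 == "#0" { print ($3 == "in" ? $4 : $2) }' | sort | uniq -c | sort -rn
}

# Example use: take 30 samples one second apart, then summarize.
#   for i in $(seq 30); do gstack <pid of schedd>; sleep 1; done | top_frames
```

A function that dominates the counts is a reasonable first place to look, though the deeper frames of the full trace are still needed to see how the schedd got there.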