Installing GlideinWMS 2.5.1 on SL5 GPCF Production using the ini file installer

Get GlideinWMS v2_5_1

  • tar xzvf glideinWMS_v2_5_1.tgz
  • cd glideinWMS/install ; export src=`pwd`

Edit Config File
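
The installer is driven by a single ini file, referenced as $src/glideinWMS.ini in the commands below. Its full contents are site-specific and not reproduced here; for orientation, here is the shape of one service section, taken from the (corrected) [VOFrontend] fragment that appears later under "Gotchas during install". The other services installed below (wmscollector, factory, usercollector, submit) each get an analogous section, and %(home_dir)s is interpolated from a variable defined elsewhere in the ini file.

;--------------------------------------------------
;  VOFrontend
;--------------------------------------------------
[VOFrontend]
install_location = %(home_dir)s/v2_5_1/frontend
logs_dir     = %(home_dir)s/v2_5_1/frontend_logs
instance_name = v2_5_1
condor_location = %(home_dir)s/v2_5_1/frontendcondor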

Run Installation Software

  • the installs have to be done in this order; if it's all on the same machine you can run them back to back like this:
          $src/manage-glideins --ini $src/glideinWMS.ini --install wmscollector
          $src/manage-glideins --ini $src/glideinWMS.ini --install factory
          $src/manage-glideins --ini $src/glideinWMS.ini --install usercollector
          $src/manage-glideins --ini $src/glideinWMS.ini --install submit
          $src/manage-glideins --ini $src/glideinWMS.ini --install vofrontend
  • A screen capture for GPCF shows how I answered the installer's questions. When it asks whether you want entry points from RESS, I say yes, then follow up saying I only want the entry point ress_ITB_INSTALL_TEST_3. After testing that this works, I edit glideinWMS.xml to add additional entry points. If you say you want them all, you end up with a zillion entry points you will never use, which just makes site maintenance more complex.

Test Installation

  • now test your setup to see if you can submit to ress_ITB_INSTALL_TEST_3
    • you have installed 4 condors. Much confusion results from using the wrong one for a given task. The submitcondor is used for submission, oddly enough.
    • . /home/gfactory/v2_5_1/submitcondor/condor.sh
    • my test files live in /home/gfactory/v2_5_0/test_dir ; in this example they are test.cmd (a condor submit file) and test.sh (the script I want to run on the worker node). A sketch of both files appears after this list.
                  [gfactory@gpsn01 test_dir]$ condor_submit test.cmd
                  Submitting job(s).
                  Logging submit event(s).
                  1 job(s) submitted to cluster 2.
                  [gfactory@gpsn01 test_dir]$ condor_q
      
                  -- Submitter: gpsn01.fnal.gov : <131.225.67.18:46314> : gpsn01.fnal.gov
                   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
                     2.0   gfactory        3/15 10:30   0+00:00:00 I  0   0.0  test.sh           
      
                  1 jobs; 1 idle, 0 running, 0 held
      
    • check that the job ran
                  [gfactory@gpsn01 test_dir]$ condor_q
      
                  -- Submitter: gpsn01.fnal.gov : <131.225.67.18:46314> : gpsn01.fnal.gov
                   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
      
                  0 jobs; 0 idle, 0 running, 0 held
      
                  [gfactory@gpsn01 test_dir]$ ls -lart test.sh.2*
                  -rw-r--r-- 1 gfactory gpcf 15959 Mar 15 10:33 test.sh.2.0.output
                  -rw-r--r-- 1 gfactory gpcf  1298 Mar 15 10:33 test.sh.2.0.log
      
  • edit original glideinWMS.xml file to add fermigrid entry
          [gfactory@gpsn01 glidein_v2_5_1.cfg]$ pwd
          /home/gfactory/v2_5_1/factory/glidein_v2_5_1.cfg
          [gfactory@gpsn01 glidein_v2_5_1.cfg]$ ls
          glideinWMS.xml  glideinWMS.xml~
          [gfactory@gpsn01 glidein_v2_5_1.cfg]$ cp glideinWMS.xml glideinWMS.xml.orig
          [gfactory@gpsn01 glidein_v2_5_1.cfg]$ vi glideinWMS.xml
    
  • copy the xml section that starts with <entry name="ress_ITB_INSTALL_TEST_3" enabled="True" gatekeeper="cms-xen9.fnal.gov/jobmanager-condor" all the way through its closing '</entry>' tag
  • paste the copy back in and change it to name="fermigrid" enabled="True" gatekeeper="fnpcfg1.fnal.gov/jobmanager-condor". Make sure the <entry> and </entry> tags still pair up so the XML remains well-formed (an example of the resulting entry appears after this list).
  • now enable the entry point you just created in the edited glideinWMS.xml
          [gfactory@gpsn01 glidein_v2_5_1.cfg]$ cd ..
          [gfactory@gpsn01 factory]$ ls
          client_files  factory_logs  factory.sh  glidein_v2_5_1  glidein_v2_5_1.cfg
          [gfactory@gpsn01 factory]$ cd glidein_v2_5_1
          [gfactory@gpsn01 glidein_v2_5_1]$ ls
          attributes.cfg                 glidein_startup.sh     monitor
          client_log                     glideinWMS.b3ehJu.xml  params.cfg
          client_proxies                 glideinWMS.xml         rsa.key
          entry_ress_ITB_INSTALL_TEST_3  job_submit.sh          signatures.sha1
          factory_startup                local_start.sh         update_proxy.py
          frontend.descript              lock
          glidein.descript               log
          [gfactory@gpsn01 glidein_v2_5_1]$ . ../factory.sh
          [gfactory@gpsn01 glidein_v2_5_1]$ ./factory_startup reconfig ../glidein_v2_5_1.cfg/glideinWMS.xml
          Shutting down glideinWMS factory v2_5_1@factory:             [OK]
          Reconfigured glidein 'v2_5_1'
          Active entries are:
            ress_ITB_INSTALL_TEST_3
            fermigrid
          Submit files are in /home/gfactory/v2_5_1/factory/glidein_v2_5_1
          Reconfiguring the factory                                  [OK]
          Starting glideinWMS factory v2_5_1@factory:                  [OK]
    
  • turn off the old entry point so you can test submission to fermigrid:
      ./factory_startup down -entry ress_ITB_INSTALL_TEST_3 -delay 0
      Setting downtime...                                        [OK]
      [gfactory@gpsn01 glidein_v2_5_1]$ 
  • submit a test job
          [gfactory@gpsn01 ~]$ cd test_dir/
          [gfactory@gpsn01 test_dir]$ which condor_submit
          /scratch/gfactory/wmscollectorcondor/bin/condor_submit
          [gfactory@gpsn01 test_dir]$ # remember about all those condor installations? change back to submitcondor
          [gfactory@gpsn01 test_dir]$ . $HOME/v2_5_1/submitcondor/condor.sh
    
          [gfactory@gpsn01 test_dir]$ condor_submit test.cmd
          Submitting job(s).
          Logging submit event(s).
          1 job(s) submitted to cluster 3.
    
  • when the job has run, verify that it ran on a fermigrid node
      [gfactory@gpsn01 test_dir]$ grep Host test.sh.3.0.output 
      Grid Job pid(22942) 20110315 115155: Hostname........... fnpc2072.fnal.gov
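
For reference, since the actual test files are not reproduced on this page, here is a minimal sketch of what test.cmd and test.sh might look like. The output and log names are chosen to match the test.sh.2.0.output and test.sh.2.0.log files listed above; everything else, including any grid or proxy settings your site may need, is illustrative only.

    # test.cmd -- minimal condor submit description (illustrative sketch)
    universe   = vanilla
    executable = test.sh
    output     = test.sh.$(Cluster).$(Process).output
    log        = test.sh.$(Cluster).$(Process).log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

    #!/bin/sh
    # test.sh -- runs on the worker node and reports where it landed (illustrative sketch)
    echo "Hostname........... `hostname -f`"
    date
    env | sort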
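
Concretely, the pasted-and-edited fermigrid entry ends up looking something like the sketch below. The name, enabled, and gatekeeper values are the ones given above; gridtype and work_dir follow the RESS entries and the work_dir note at the end of this page; every other attribute and all the child elements should simply be whatever was copied from the ress_ITB_INSTALL_TEST_3 entry, so do not type the placeholder comment in literally.

    <entry name="fermigrid" enabled="True" gatekeeper="fnpcfg1.fnal.gov/jobmanager-condor"
           gridtype="gt2" work_dir="Condor">
       <!-- remaining attributes and child elements copied unchanged from the
            ress_ITB_INSTALL_TEST_3 entry -->
    </entry>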

At this point, you are done installing glideinWMS 2.5.1 as a single-node installation.

Modifying GlideinWMS to incorporate local pool

  • What follows is how to connect your glideinWMS installation to an existing local condor pool that was installed via rpms and runs as root. The idea is this: the rpm condor originally runs these daemons: COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
  • However, glideinWMS installed as a non-privileged user is running all these daemons as well.
  • To get these two installations to play together nicely, the rpm condor must run the MASTER, SCHEDD, and STARTD, while glideinWMS runs the COLLECTOR and NEGOTIATOR; they all have to agree on which ports to use and must authenticate with each other (glideinWMS is picky about authentication). A condor_config sketch of this split appears right after this list.
    • the glideinWMS GSI DN is DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov"
    • the rpm condor GSI DN is DN="/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov"
      Knowing these two DNs is important for understanding the config file edits I made below
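
As a sketch, the rpm condor edits described in step 2 below amount to something like the following on the schedd machine and the local worker nodes. Only the two settings explicitly discussed on this page are shown; the GSI_ and SEC_ authentication settings are deliberately left out because, as noted below, I am still sorting out which of those are actually required.

    # rpm condor: point at the glideinWMS user collector, which listens on port 9640
    COLLECTOR_HOST = $(CONDOR_HOST):9640

    # rpm condor no longer runs its own COLLECTOR/NEGOTIATOR;
    # those now come from the glideinWMS installation
    DAEMON_LIST = MASTER, SCHEDD, STARTD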

Here's how I got it all to work.

  1. stop glideinWMS
          $src/manage-glideins --stop all --ini $src/glideinWMS.ini
    
  2. edit the (rpm) condor config files
    • COLLECTOR_HOST = $(CONDOR_HOST) becomes COLLECTOR_HOST = $(CONDOR_HOST):9640 on the schedd machine and all local worker nodes
    • all local worker nodes need hostcert.pem, hostkey.pem, a condor-mapfile, and a certificates directory in /etc/grid-security so they can talk to the glideinWMS user collector running on port 9640. To get a certificates directory you have to install VDT as user grid and have FEF soft-link the TRUSTED_CA directory back to /etc/grid-security/certificates
    • DAEMON_LIST = MASTER, SCHEDD, STARTD
    • I ended up putting a ton of GSI_- and SEC_-related settings in the condor config files; I am still trying to figure out how much of it is actually necessary
  3. edit the glideinWMS usercollector condor-mapfile (a sketch of the added lines appears after this numbered list)
          [gfactory@gpsn01 v2_5_1]$ cd usercollectorcondor/
          [gfactory@gpsn01 usercollectorcondor]$ . condor.sh
          [gfactory@gpsn01 usercollectorcondor]$ condor_config_val -dump | grep MAP
          CERTIFICATE_MAPFILE = /home/gfactory/v2_5_1/usercollectorcondor/certs/condor_mapfile
          [gfactory@gpsn01 usercollectorcondor]$ #need to add schedd and local worker nodes to mapfile so find them first
          [gfactory@gpsn01 usercollectorcondor]$ openssl x509 -in /etc/grid-security/hostcert.pem -subject -noout
          subject= /DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov
          [gfactory@gpsn01 usercollectorcondor]$ ssh sngpvm02 openssl x509 -in /etc/grid-security/hostcert.pem -subject -noout
          subject= /DC=org/DC=doegrids/OU=Services/CN=sngpvm02.fnal.gov
    
          [gfactory@gpsn01 usercollectorcondor]$ vi /home/gfactory/v2_5_1/usercollectorcondor/certs/condor_mapfile
    
  4. edit the generated frontend.xml file so it knows about the rpm schedd; the resultant frontend.xml changes are shown below
          [gfactory@gpsn01 usercollectorcondor]$ cd
          [gfactory@gpsn01 ~]$ cd v2_5_1/frontend
          [gfactory@gpsn01 frontend]$ ls
          frontend_frontend-v2_5_1  frontend.sh  instance_v2_5_1.cfg
          [gfactory@gpsn01 frontend]$ cd instance_v2_5_1.cfg/
          [gfactory@gpsn01 instance_v2_5_1.cfg]$ ls
          frontend.xml  frontend.xml~
          [gfactory@gpsn01 instance_v2_5_1.cfg]$ vi frontend.xml
    
    • change all instances of <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov"
      to <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov"
  5. now reconfigure the frontend with your edited frontend.xml
          [gfactory@gpsn01 instance_v2_5_1.cfg]$ cd ../frontend_frontend-v2_5_1/
          [gfactory@gpsn01 frontend_frontend-v2_5_1]$ . ../frontend.sh
          [gfactory@gpsn01 frontend_frontend-v2_5_1]$ ls
          frontend.b3ehVC.xml     frontend.mapfile  group_main  monitor
          frontend.condor_config  frontend_startup  lock        params.cfg
          frontend.descript       frontend.xml      log         signatures.sha1
          [gfactory@gpsn01 frontend_frontend-v2_5_1]$ ./frontend_startup reconfig 
          Usage: frontend_startup reconfig <fname>
          [gfactory@gpsn01 frontend_frontend-v2_5_1]$ ./frontend_startup reconfig  ../instance_v2_5_1.cfg/frontend.xml
          Reconfigured frontend 'frontend-v2_5_1'
          Active entries are:
            main
          Work files are in /home/gfactory/v2_5_1/frontend/frontend_frontend-v2_5_1
          Reconfiguring the frontend                                 [OK]
          [gfactory@gpsn01 frontend_frontend-v2_5_1]$ 
    
  6. Start everything up!
    • as user gfactory
                  export src=/home/gfactory/v2_5_1/glideinWMS/install
                  $src/manage-glideins --start wmscollector --ini $src/glideinWMS.ini
                  $src/manage-glideins --start factory --ini $src/glideinWMS.ini
                  $src/manage-glideins --start usercollector --ini $src/glideinWMS.ini
      
    • as a user with sudo rights to run condor commands
                  sudo /etc/init.d/condor start
      
    • as user gfactory again:
                  $src/manage-glideins --start vofrontend --ini $src/glideinWMS.ini
      
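
For step 3 above, the lines added to the usercollector condor_mapfile look roughly like the sketch below. The two DNs are the ones found with openssl in that step (the rpm schedd host and a local worker node); the mapped name on the right is a placeholder, since what you map trusted hosts to depends on how the rest of your condor authorization is set up. The lines already present in the generated mapfile, including its catch-all entries, stay as they are.

GSI "/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov" condor
GSI "/DC=org/DC=doegrids/OU=Services/CN=sngpvm02.fnal.gov" condor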

Gotchas during install

log directory overwrite

If the log directories for the factory or frontend are specified to be inside the directory where the factory or frontend itself is to be installed, overwrites and file deletions can happen, e.g.:

;--------------------------------------------------
;  VOFrontend
;--------------------------------------------------
[VOFrontend]
install_location = %(home_dir)s/v2_5_1/frontend
logs_dir     = %(home_dir)s/v2_5_1/frontend/logs
instance_name = v2_5_1
condor_location = %(home_dir)s/v2_5_1/frontendcondor

The installer will complain that directories are not empty and then blow away parts of your install unpredictably. The initial symptom during the install looks like this:

... validating install_location: /home/gfactory/v2_5_1/frontend
... directory (/home/gfactory/v2_5_1/frontend) already exists and must be empty.
... can the contents be removed (y/n>? (y/n): y

and it will ultimately fail like so:
======== VOFrontend install complete ==========

Do you want to create the frontend now? (y/n) [n]: y
... running: source /home/gfactory/v2_5_1/frontend/frontend.sh;/home/gfactory/v2_5_1/glideinWMS/creation/create_frontend /home/gfactory/v2_5_1/frontend/instance_v2_5_1.cfg/frontend.xml
Usage: create_frontend [-writeback yes|no] [-debug] cfg_fname|-help

Failed to create log dir: [Errno 2] No such file or directory: '/home/gfactory/v2_5_1/frontend/logs/frontend_frontend-v2_5_1'
ERROR: Script failed with non-zero return code


The fix is to move your log directories in the ini file and reinstall:
;--------------------------------------------------
;  VOFrontend
;--------------------------------------------------
[VOFrontend]
install_location = %(home_dir)s/v2_5_1/frontend
logs_dir     = %(home_dir)s/v2_5_1/frontend_logs
instance_name = v2_5_1
condor_location = %(home_dir)s/v2_5_1/frontendcondor

Problems with grid-mapfile shipped to worker node

After installation, my 'local' jobs ran, but when I submitted grid jobs, they sat idle. Glideins were being sent, but no matching was taking place. I used the cat_<daemon_name>.py tools to investigate the logs of the glideins that came back:

export td=/home/gfactory/v2_5_1/glideinWMS/factory/tools/
export ld=/home/gfactory/v2_5_1/client_logs/user_gfactory/glidein_v2_5_1/entry_fermigrid
$td/cat_MasterLog.py $ld/job.217.6.err
$td/cat_StarterLog.py $ld/job.217.6.err
$td/cat_StartdLog.py $ld/job.217.6.err

Eventually this turned up in the startd logs:


03/18 13:06:42 (pid:13094) ZKM: successful mapping to anonymous
03/18 13:06:42 (pid:13094) PERMISSION DENIED to anonymous@fnpc3061 from host 131.225.67.70 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy denies IP address 131.225.67.70
03/18 13:07:43 (pid:13094) PERMISSION DENIED to anonymous@fnpc3061 from host 131.225.67.70 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason

Email from Parag:
If the error message is from startd, the gridmapfile used by startd does
not contain DN for user collector and/or user submit node

This is supposed to be taken care of in the step above where I edit frontend.xml and reconfig. Recall that
  • (USER COLLECTOR) the glideinWMS GSI DN is DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov"
  • (SCHEDD) the rpm condor GSI DN is DN="/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov"
    And if I understand what's going on, this step should be taken care of by changing this part of frontend.xml
    from
            <schedds>
                <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" fullname="gpsn01.fnal.gov"/>
                <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" fullname="schedd_jobs1@gpsn01.fnal.gov"/>
                <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" fullname="schedd_jobs2@gpsn01.fnal.gov"/>
                <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" fullname="schedd_jobs3@gpsn01.fnal.gov"/>
                <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" fullname="schedd_jobs4@gpsn01.fnal.gov"/>
             </schedds>
    
    

    to
    <schedds>
                <schedd DN="/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov" fullname="gpsn01.fnal.gov"/>
    
    </schedds>
    
    

    However, the jobs still weren't matching and the errors described above kept appearing. On Parag's advice I also changed this part of frontend.xml
    from
      <collectors>
          <collector DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" node="gpsn01.fnal.gov:9640" secondary="False"/>
          <collector DN="/DC=org/DC=doegrids/OU=Services/CN=gfactory/gpsn01.fnal.gov" node="gpsn01.fnal.gov:9641-9645" secondary="True"/>
       </collectors>
    
    

    to
       <collectors>
          <collector DN="/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov" node="gpsn01.fnal.gov:9640" secondary="False"/>
          <collector DN="/DC=org/DC=doegrids/OU=Services/CN=gpsn01.fnal.gov" node="gpsn01.fnal.gov:9641-9645" secondary="True"/>
       </collectors>
    
    

    This is not the DN of the user collector, but doing a frontend_startup reconfig after these changes caused a working grid-mapfile to be shipped with the glideins, which then started matching the user jobs.

Default work dir from RESS for fermilab worker nodes

If you fetch entries from RESS to populate your glideinWMS.xml, they look like so:

    <entry name="ress_FNAL_FERMIGRID_1" gridtype="gt2" gatekeeper="d0cabosg2.fnal.gov/jobmanager-pbs" rsl="(queue=osg)(jobtype=single)" work_dir="OSG">

Unfortunately 'work_dir="OSG"' is wrong for fermigrid nodes; glideins unspooling in this directory fill up the disks pretty quickly. Change your configuration to 'work_dir="Condor"' like so:

    <entry name="ress_FNAL_FERMIGRID_1" gridtype="gt2" gatekeeper="d0cabosg2.fnal.gov/jobmanager-pbs" rsl="(queue=osg)(jobtype=single)" work_dir="Condor">

I don't know if this is a case of RESS giving out bad information or the installer not reading it.