Support #18992

Install GlideinWMS framework

Added by Marco Mambelli almost 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Start date:
02/12/2018
Due date:
% Done:

0%

Estimated time:
Stakeholders:
Duration:

Description

Install GlideinWMS framework on fermicloud using OSG RPMs.

1. You will need one host for the Factory (including the condor Factory collector and schedd) and one for the Frontend (including the condor User collector/negotiator and a schedd).
Take note of the two hostnames; you will need them for the configuration.
Also take note of the certificate DNs. Certificates are in /etc/grid-security/host*

2. The Frontend installation instructions are in the OSG documentation. Follow them to install the Frontend:
https://opensciencegrid.github.io/docs/other/install-gwms-frontend/

3. The Factory installation instructions are no longer available online. This is the latest version, in TWiki markup. Start with the same generic setup as for the Frontend (RPM repos setup), then follow these instructions to install the Factory:
https://github.com/opensciencegrid/docs/blob/master/archive/InstallGlideinWMSFactory

4a. Register for a GitHub account (if you don't have one already). Propose a new document using a pull request:
Clone the OSG doc repo, add the new document and open a pull request once it is ready (maybe let me review it first):
docs/other/install-gwms-factory.md
It will be:
https://opensciencegrid.github.io/docs/other/install-gwms-factory/

4b. Feel free to provide feedback and suggest changes to the Frontend install document with a pull request.

Here are some generic guidelines for the OSG documentation:
https://opensciencegrid.github.io/technology/documentation/writing-documentation/
https://opensciencegrid.github.io/technology/documentation/style-guide/
https://opensciencegrid.github.io/technology/documentation/markdown-migration/

As far as computing resources for this test go, you can use fermicloud025, a Globus CE. You can use something like the following as the entry:

      <entry auth_method="grid_proxy" enabled="True" gatekeeper="fermicloud025.fnal.gov/jobmanager-condor" gridtype="gt2" name="ITB_FC_CE2" rsl="(queue=default)(jobtype=single)" trust_domain="grid" verbosity="std" work_dir="OSG">
         <config>
            <max_jobs>
               <default_per_frontend glideins="5000" held="50" idle="100" />
               <per_entry glideins="10000" held="1000" idle="2000" />
               <per_frontends>
               </per_frontends>
            </max_jobs>
            <release max_per_cycle="20" sleep="0.2" />
            <remove max_per_cycle="5" sleep="0.2" />
            <restrictions require_glidein_glexec_use="False" require_voms_proxy="False" />
            <submit cluster_size="10" max_per_cycle="100" sleep="0.2" slots_layout="fixed">
               <submit_attrs>
               </submit_attrs>
            </submit>
         </config>
         <allow_frontends>
         </allow_frontends>
         <attrs>
            <attr const="False" glidein_publish="False" job_publish="False" name="CONDOR_ARCH" parameter="True" publish="True" type="string" value="default" />
            <attr const="False" glidein_publish="False" job_publish="False" name="CONDOR_OS" parameter="True" publish="True" type="string" value="rhel6" />
            <attr const="True" glidein_publish="False" job_publish="False" name="GLEXEC_JOB" parameter="True" publish="True" type="string" value="False" />
            <attr const="True" glidein_publish="True" job_publish="True" name="GLIDEIN_Site" parameter="True" publish="True" type="string" value="ITB_FC_CE2" />
            <attr const="True" glidein_publish="True" job_publish="False" name="USE_CCB" parameter="True" publish="True" type="string" value="True" />
         </attrs>
         <files>
         </files>
         <infosys_refs>
         </infosys_refs>
         <monitorgroups>
         </monitorgroups>
      </entry>
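
Before pasting an entry like this into glideinWMS.xml, it can help to confirm that the fragment is well-formed XML and carries the expected settings. The following is only a minimal sketch using Python's standard library; the function name `summarize_entry` and the trimmed `SAMPLE` fragment are illustrative and not part of GlideinWMS.

```python
import xml.etree.ElementTree as ET


def summarize_entry(xml_text):
    """Parse one <entry> element and return its key settings as a dict."""
    # fromstring raises xml.etree.ElementTree.ParseError if the text is not well-formed
    entry = ET.fromstring(xml_text.strip())
    if entry.tag != "entry":
        raise ValueError("expected an <entry> root element, got <%s>" % entry.tag)
    # Collect the name/value pairs of every <attr> below the entry
    attrs = {a.get("name"): a.get("value") for a in entry.iter("attr")}
    return {
        "name": entry.get("name"),
        "gatekeeper": entry.get("gatekeeper"),
        "gridtype": entry.get("gridtype"),
        "enabled": entry.get("enabled") == "True",
        "attrs": attrs,
    }


# Trimmed, illustrative copy of the entry shown in this ticket
SAMPLE = """
<entry enabled="True" gatekeeper="fermicloud025.fnal.gov/jobmanager-condor"
       gridtype="gt2" name="ITB_FC_CE2">
   <attrs>
      <attr name="CONDOR_OS" type="string" value="rhel6" />
      <attr name="GLIDEIN_Site" type="string" value="ITB_FC_CE2" />
   </attrs>
</entry>
"""

if __name__ == "__main__":
    print(summarize_entry(SAMPLE))
```

A fragment pasted with a leftover closing tag from a neighbouring entry fails at `ET.fromstring`, which makes this a quick way to catch copy-and-paste mistakes.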

History

#1 Updated by Lorena Lobato Pardavila almost 2 years ago

  • Status changed from New to Work in progress

#2 Updated by Lorena Lobato Pardavila almost 2 years ago

Two hosts have been created (one for the Factory and one for the Frontend):

-bash-4.1$ onetemplate instantiate --name LLPFactorytest SLF7V_DynIP_Home
VM ID: 42885
-bash-4.1$ onetemplate instantiate --name LLPFrontEndtest SLF7V_DynIP_Home
VM ID: 42886

-bash-4.1$ ./checkNetwork.sh
72.154.225.131.in-addr.arpa domain name pointer fermicloud137.fnal.gov.
101.155.225.131.in-addr.arpa domain name pointer fermicloud364.fnal.gov.

42885 131.225.154.72 72.154.225.131.in-addr.arpa domain name pointer fermicloud137.fnal.gov.
42886 131.225.155.101 101.155.225.131.in-addr.arpa domain name pointer fermicloud364.fnal.gov.

Both hosts are running:

-bash-4.1$ onevm list
    ID USER     GROUP    NAME            STAT UCPU    UMEM HOST             TIME
 42878 llobato  users    llobatoGWMStest runn    0    1.9G fcl009       1d 05h55
 42885 llobato  users    LLPFactorytest  runn   93    1.9G fcl116       0d 00h02
 42886 llobato  users    LLPFrontEndtest runn  103    1.9G fcl411       0d 00h01

I also wrote down both certificate DNs.
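
One way to read a host certificate's DN is to ask `openssl` for the certificate subject. The sketch below is a small Python wrapper around that command. The default path `/etc/grid-security/hostcert.pem` is an assumption (the ticket only says `host*`), and the helper names are illustrative. The exact formatting of the subject line depends on the OpenSSL version, so it is returned unmodified.

```python
import subprocess

# Assumed default: the usual host certificate file name under /etc/grid-security
DEFAULT_CERT = "/etc/grid-security/hostcert.pem"


def subject_command(cert_path=DEFAULT_CERT):
    """Build the openssl command that prints a certificate's subject."""
    return ["openssl", "x509", "-noout", "-subject", "-in", cert_path]


def read_subject(cert_path=DEFAULT_CERT):
    """Run openssl and return the raw subject line for the certificate."""
    result = subprocess.run(
        subject_command(cert_path),
        check=True,            # raise CalledProcessError if openssl fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Calling `read_subject()` on each of the two hosts would print the subject that the other host needs in its configuration.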

#3 Updated by Lorena Lobato Pardavila almost 2 years ago

For the documentation:

  1. Forked the OSG doc repo into my GitHub account
  2. Created a new branch named after this ticket (18992)
  3. Created a new document in docs/other named install-gwms-factory.md
  4. Used the current content of https://github.com/opensciencegrid/docs/blob/master/archive/InstallGlideinWMSFactory and started to adapt it to Markdown
    and to correct unclear descriptions.

The idea is to correct the things that are unclear to me while I go through the GWMS Factory installation documentation.

#4 Updated by Lorena Lobato Pardavila almost 2 years ago

FrontEnd part working:

[root@fermicloud364 ~]# /usr/sbin/gwms-frontend reconfig
Using default Frontend config file: /etc/gwms-frontend/frontend.xml
...Saved the current config file into the working dir
...Saved the backup config file into the working dir
...Reconfigured frontend 'fermicloud364-fnal-gov_OSG_gWMSFrontend'
...Active groups are:
     main
...Verifying rrd schema
...Work files are in /var/lib/gwms-frontend/vofrontend
Reconfiguring the frontend                                 [  OK  ]
[root@fermicloud364 ~]# /usr/sbin/gwms-frontend upgrade
Using default frontend config file: /etc/gwms-frontend/frontend.xml
...Updated the frontend_startup script
...Saved the current config file into the working dir
...Saved the backup config file into the working dir
...Reconfigured frontend 'fermicloud364-fnal-gov_OSG_gWMSFrontend'
...Active groups are:
     main
...Verifying rrd schema
...Work files are in /var/lib/gwms-frontend/vofrontend
...Overriding the frontend config file in /etc/gwms-frontend/frontend.xml to the current configuration
Upgrading the frontend                                     [  OK  ]
[root@fermicloud364 ~]# /usr/sbin/gwms-frontend start
Starting glideinWMS frontend fermicloud364-fnal-gov_OSG_gWM[  OK  ]d:

#5 Updated by Lorena Lobato Pardavila almost 2 years ago

Factory part working:

[root@fermicloud137 gwms-factory]# /usr/sbin/gwms-factory reconfig
Using default factory config file: /etc/gwms-factory/glideinWMS.xml
Reconfiguring the factory............................................................................................................+++
....................+++
...Reconfigured glidein 'gfactory_instance' is complete
...Active entries are:
     ITB_FC_CE2
...Verifying rrd schema
...Submit files are in /var/lib/gwms-factory/work-dir
                                                           [  OK  ]
[root@fermicloud137 gwms-factory]# /usr/sbin/gwms-factory upgrade
Using default factory config file: /etc/gwms-factory/glideinWMS.xml
...Updated the glidein_startup.sh and local_start.sh scripts
...Updated the glidein_startup.sh file in the staging area
...Updated the factory_startup script
...Reconfigured glidein 'gfactory_instance' is complete
...Active entries are:
     ITB_FC_CE2
...Verifying rrd schema
...Submit files are in /var/lib/gwms-factory/work-dir
Upgrading the factory                                      [  OK  ]
[root@fermicloud137 gwms-factory]# /usr/sbin/gwms-factory start
Starting GlideinWMS Factory gfactory_instance@gfactory_serv[  OK  ]

#6 Updated by Lorena Lobato Pardavila almost 2 years ago

I struggled with the condor_mapfile and entry configuration because the Factory documentation is somewhat obsolete. I am updating it based on my experience as a new user.

In addition, a side error turned up while I was editing the entry configuration in the Factory's glideinWMS.xml file. When an entry's attributes do not match any condor_tarball and any service is being activated, the error is not reported cleanly: the code fails with a string concatenation error instead.

[root@fermicloud137 gwms-factory]# /usr/sbin/gwms-factory reconfig
Using default factory config file: /etc/gwms-factory/glideinWMS.xml
Reconfiguring the factoryCondor (version=default, os=rhel6, arch=default) for entry ITB_FC_CE2 could not be resolved from <glidein><condor_tarballs>...</condor_tarballs></glidein> configuration.
Traceback (most recent call last):
  File "/sbin/reconfig_glidein", line 251, in <module>
    print2(re)
  File "/sbin/reconfig_glidein", line 41, in print2
    journal.send( message )
  File "/usr/lib64/python2.7/site-packages/systemd/journal.py", line 391, in send
    args = ['MESSAGE=' + MESSAGE]
TypeError: cannot concatenate 'str' and 'ReconfigError' objects
                                                           [FAILED]
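
The traceback shows an exception object being concatenated to a string. The sketch below reproduces that failure mode in isolation and shows one possible way around it, converting the message to text first. It is not the actual GlideinWMS fix: `ReconfigError` here is a stand-in class, and the two helper functions only mimic the failing line from `journal.py`.

```python
class ReconfigError(Exception):
    """Stand-in for the exception type named in the traceback."""


def build_journal_args(message):
    """Mimic the failing line: concatenate 'MESSAGE=' with the message as given."""
    return ["MESSAGE=" + message]


def build_journal_args_safe(message):
    """Convert the message to text first, so exception objects are accepted too."""
    return ["MESSAGE=" + str(message)]


if __name__ == "__main__":
    err = ReconfigError("Condor tarball could not be resolved")
    try:
        build_journal_args(err)
    except TypeError as exc:
        # The wording of the TypeError differs between Python 2 and Python 3
        print("unsafe call failed:", exc)
    print(build_journal_args_safe(err))
```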

A ticket was opened (Issue #19325)
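
The underlying message says that the (version, os, arch) triple requested by the entry could not be resolved from the `<condor_tarballs>` section. The sketch below shows how such a mismatch could be checked before running reconfig. It assumes that each `<condor_tarball>` element carries comma-separated `version`, `os` and `arch` attributes; the function names and the trimmed `SAMPLE` configuration are illustrative only.

```python
import xml.etree.ElementTree as ET


def _values(element, key):
    """Split a comma-separated attribute into a set of values."""
    return {v.strip() for v in element.get(key, "").split(",") if v.strip()}


def tarball_matches(config_xml, version="default", os_name="default", arch="default"):
    """Return True if any <condor_tarball> lists the requested triple."""
    root = ET.fromstring(config_xml.strip())
    for tarball in root.iter("condor_tarball"):
        if (version in _values(tarball, "version")
                and os_name in _values(tarball, "os")
                and arch in _values(tarball, "arch")):
            return True
    return False


# Trimmed, hypothetical configuration: only the attributes used for matching
SAMPLE = """
<glidein>
   <condor_tarballs>
      <condor_tarball arch="default" os="default,rhel7" version="default" />
   </condor_tarballs>
</glidein>
"""

if __name__ == "__main__":
    # The entry in this ticket asks for os=rhel6, which this sample does not list
    print(tarball_matches(SAMPLE, os_name="rhel6"))
```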

#7 Updated by Lorena Lobato Pardavila almost 2 years ago

Due to a misconfiguration in frontend.xml (white space...), I was not able to get the system fully working.

Now it is! :) And I have learnt a lot about GlideinWMS troubleshooting thanks to Marco.

FrontEnd Activity

[root@fermicloud364 llobato]# condor_status
Name                                           OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

glidein_23238_232105354@fermicloud025.fnal.gov LINUX      X86_64 Claimed   Busy      6.660 1858  0+00:00:02
glidein_23239_67780100@fermicloud025.fnal.gov  LINUX      X86_64 Claimed   Busy      6.660 1858  0+00:00:03
glidein_23240_660892020@fermicloud025.fnal.gov LINUX      X86_64 Claimed   Busy      6.790 1858  0+00:00:02
glidein_23680_177720320@fermicloud025.fnal.gov LINUX      X86_64 Claimed   Busy      6.650 1858  0+00:00:02
glidein_23703_553260162@fermicloud025.fnal.gov LINUX      X86_64 Claimed   Busy      6.650 1858  0+00:00:02
glidein_23723_556519572@fermicloud025.fnal.gov LINUX      X86_64 Claimed   Busy      6.650 1858  0+00:00:03

                     Machines Owner Claimed Unclaimed Matched Preempting  Drain

        X86_64/LINUX        6     0       6         0       0          0      0

               Total        6     0       6         0       0          0      0

Factory Activity

[root@fermicloud137 llobato]# condor_q -g

-- Schedd: schedd_glideins2@fermicloud137.fnal.gov : <131.225.154.72:9615?... @ 03/13/18 18:40:33
OWNER    BATCH_NAME                 SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
frontend CMD: glidein_startup.sh   3/13 17:57      _      6    100    106 15.0 ... 25.5

106 jobs; 0 completed, 0 removed, 100 idle, 6 running, 0 held, 0 suspended
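
When watching the Factory queue over time, the `condor_q` summary line is the part worth tracking. The following sketch parses a summary line of the form shown above into integer counts. The regular expression is written against the output in this ticket only; other HTCondor versions may format the line differently, and the function name is illustrative.

```python
import re

# Matches summary lines such as:
# "106 jobs; 0 completed, 0 removed, 100 idle, 6 running, 0 held, 0 suspended"
SUMMARY_RE = re.compile(
    r"(?P<total>\d+) jobs; (?P<completed>\d+) completed, (?P<removed>\d+) removed, "
    r"(?P<idle>\d+) idle, (?P<running>\d+) running, (?P<held>\d+) held, "
    r"(?P<suspended>\d+) suspended"
)


def parse_queue_summary(text):
    """Return the job counts from a condor_q summary line, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

A growing `held` count, for example, would be a hint to look at the glidein submission logs on the Factory.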

Next stage: test with an official Factory. I have already sent an email to request access to the OSG Glidein Factory at UCSD.

I keep updating and completing the Factory documentation and correcting the FrontEnd documentation.

#8 Updated by Lorena Lobato Pardavila almost 2 years ago

After several investigations into where my pilots were, I had issues with:

  • Proxies and/or certificates
  • Firewalls
  • Factory connections
  • Misconfigurations
  • Glideins

I finally have my FE connected with my own Factory as indicated above, and also with the OSG Factory.

- User Pool (before, it was empty, meaning that glideins were not reporting back)

[root@fermicloud364 ~]# condor_status
Name                                                          OpSys      Arch   State     Activity

glidein_977_173498444@fermicloud025.fnal.gov                  LINUX      X86_64 Claimed   Busy
glidein_2383_148379400@fermicloud025.fnal.gov                 LINUX      X86_64 Claimed   Busy
glidein_9108_839757324@fermicloud025.fnal.gov                 LINUX      X86_64 Claimed   Busy
glidein_15013_233920960@fermicloud025.fnal.gov                LINUX      X86_64 Claimed   Busy
glidein_21252_245361060@fermicloud025.fnal.gov                LINUX      X86_64 Claimed   Busy
glidein_27226_59802428@fermicloud025.fnal.gov                 LINUX      X86_64 Claimed   Busy
glidein_1314498_71965726@mwt2-c055.campuscluster.illinois.edu LINUX      X86_64 Unclaimed Idle

                     Machines Owner Claimed Unclaimed Matched Preempting  Drain

        X86_64/LINUX        7     0       6         1       0          0      0

               Total        7     0       6         1       0          0      0
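
To see at a glance where the glideins in the user pool come from, the slot names in the `condor_status` output can be grouped by host. This is a small sketch that works on the text output pasted above; it assumes slot names of the form `glidein_...@host`, as seen in this ticket, and the function name is illustrative.

```python
from collections import Counter


def glideins_per_host(status_output):
    """Count slots named glidein_*@host in condor_status output, grouped by host."""
    counts = Counter()
    for line in status_output.splitlines():
        fields = line.split()
        # The slot name is the first column; header and summary lines are skipped
        if fields and fields[0].startswith("glidein_") and "@" in fields[0]:
            host = fields[0].split("@", 1)[1]
            counts[host] += 1
    return counts
```

For the listing above, this separates the slots on fermicloud025.fnal.gov from the one on the campuscluster.illinois.edu worker node.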

- Queues on the FE

-bash-4.2$ condor_q

-- Schedd: fermicloud364.fnal.gov : <131.225.155.101:9615?... @ 03/28/18 14:45:18
OWNER   BATCH_NAME            SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
llobato CMD: jobpi_test.py   3/28 14:39     33      2     65    100 43.33-99

67 jobs; 0 completed, 0 removed, 65 idle, 2 running, 0 held, 0 suspended

- Group Main log

...................
[2018-03-28 14:54:08,921] INFO:     1(   26     1     1     0)     3(    0 10000) |     0     0     0     0 |     0     0     0 |     1     2 | Up   OSG_US_WORKSHOP_8@gfactory_instance@ITBGOC@glidein-itb.grid.iu.edu
[2018-03-28 14:54:08,922] INFO:     1(   26     1     1     0)     3(    0 10000) |     0     0     0     0 |     0     0     0 |     1     2 | Up   OSG_US_WORKSHOP_9@gfactory_instance@ITBGOC@glidein-itb.grid.iu.edu
[2018-03-28 14:54:08,923] INFO:     1(   26     1     1     0)     3(    0 10000) |     0     0     0     0 |     0     0     0 |     1     2 | Up   OSG_US_WSU_GRID_ce2@gfactory_instance@ITBGOC@glidein-itb.grid.iu.edu
[2018-03-28 14:54:08,924] INFO:             Jobs in schedd queues                 |           Slots         |       Cores       | Glidein Req | Factory/Entry Information
[2018-03-28 14:54:08,925] INFO: Idle (match  eff   old  uniq )  Run ( here  max ) | Total  Idle   Run  Fail | Total  Idle   Run | Idle MaxRun | State Factory
[2018-03-28 14:54:08,925] INFO:    62( 1612    61    62     0)   186(    3  620k) |     7     1     6     0 |     7     1     6 |    61   126 | Up   Sum of useful factories
[2018-03-28 14:54:08,925] INFO:     2(   52     2     2     0)     6(    0 20000) |     0     0     0     0 |     0     0     0 |     0     0 | Down Sum of down factories
[2018-03-28 14:54:08,926] INFO:     0(    0     0     0     0)     0(    0     0) |     0     0     0     0 |     0     0     0 |     0     0 | Down Unmatched
[2018-03-28 14:54:08,956] INFO: Advertising global and singular requests for factory fermicloud137.fnal.gov
[2018-03-28 14:54:08,962] INFO: Advertising global and singular requests for factory glidein-itb.grid.iu.edu
[2018-03-28 14:54:09,089] INFO: Advertising 64 glideresource classads to the user pool
[2018-03-28 14:54:09,113] INFO: There are 64 classads to advertise
[2018-03-28 14:54:09,490] INFO: Done advertising
[2018-03-28 14:54:09,493] INFO: iterate_one status: None
[2018-03-28 14:54:09,493] INFO: Writing stats

After this configuration, I learnt how to:

  • Configure GlideinWMS services
  • Debug the different services
  • Check the monitoring
  • Create and configure certificates and proxies
  • Apply best practices for configuring the FE when connecting to an OSG Factory
  • Use query_expressions and attributes

Note: There is still a lot to learn, but this will improve in the future :)

#9 Updated by Lorena Lobato Pardavila almost 2 years ago

  • Status changed from Work in progress to Closed

#10 Updated by Marco Mambelli almost 2 years ago

  • Status changed from Closed to Feedback
  • Assignee changed from Lorena Lobato Pardavila to Marco Mambelli

#11 Updated by Marco Mambelli almost 2 years ago

Hi Marco,

My local repository is: https://github.com/llobato/docs.git
The branch is: 18992
The documents to review are:
Factory: https://github.com/llobato/docs/blob/18992/docs/other/install-gwms-factory.md
FrontEnd: https://github.com/llobato/docs/blob/18992/docs/other/install-gwms-frontend.md
Simple_diagram, which had already been uploaded; I have just referenced it from factory.md as well.

Don’t hesitate to tell me if there is anything that you don’t like or would change.

Thank you for reviewing it 😊

#12 Updated by Marco Mambelli almost 2 years ago

  • Assignee changed from Marco Mambelli to Lorena Lobato Pardavila
  • Target version set to v3_4_0

#13 Updated by Lorena Lobato Pardavila almost 2 years ago

Checked the pull request towards my branch with the review of the changes to the FE and Factory installation in the OSG documentation. Opened a PR to OSG with an update noting that you don’t have to escape characters in the DN for the condor_mapfile.
Waiting for review from Brian Lin and merge (https://github.com/opensciencegrid/docs/pull/349)

#14 Updated by Lorena Lobato Pardavila over 1 year ago

The Factory part has already been merged into the OSG documentation. It will be under "services" (docs/services/gwms-factory.md)

The FrontEnd documentation suggestions will be reviewed by Brian during this week (hopefully). I'll follow up on it.

As I have already completed (several times) both installations of glideinWMS from scratch and upgrades, and have become familiar with it, I am closing the ticket.

#15 Updated by Lorena Lobato Pardavila over 1 year ago

  • Status changed from Feedback to Resolved

#16 Updated by Marco Mambelli over 1 year ago

  • Status changed from Resolved to Closed

