GlideinwmsTestInfrastructure

Fermicloud hosts the following infrastructure that can help GlideinWMS testing:

  • An ITB Frontend (currently SL7 with HTCondor 8.6.13-1.4.osg34.el7), which is a copy of the production ITB Frontend. This machine (fermicloud308.fnal.gov) is connected to a local Factory and to the ITB Factory (managed by the Factory operators) for FIFE and GlideinWMS testing.
    • The proxies used by this machine are located in /etc/gwms-frontend/proxies/, where:
      • dune_proxy was provided by the production Frontend operators for fermicloud308.
      • pilot_proxy is Lorena's proxy. Contact her if you need to use it. (A quick way to check these proxies is sketched after this list.)
    • Since this machine is shared between the FIFE and GlideinWMS teams, it is important to keep track of all changes to the machine configuration. Please be aware of the following files in /root/ and update them as needed:
      • glideinwms_notify: a list of email addresses of the people who should be notified each time a change affecting others is made
      • glideinwms_eventlog: a file where changes to the machine and other activities should be recorded. Check this file before applying disruptive changes, to make sure no tests or other activities in progress would be disrupted.
    • A reminder was added to the MOTD (message of the day) that is shown every time a user logs into the machine.
  • Two OSG 3.4 HTCondor Compute Elements
    • fermicloud378: SL7
    • fermicloud105: SL6
  • Two OSG 3.3 (deprecated) Compute Elements
    • fermicloud127: SL7
    • fermicloud046: SL6
  • An all-in-one Compute Element (Globus and HTCondor CE with a local grid-mapfile, plus HTCondor slots) on fermicloud025.fnal.gov: great for small tests, and it makes it easy to manage new DNs
    • This is an OSG 3.2 installation. It still works, but uses unsupported software.
  • A small computing cluster (HTCondor CE using GUMS and 4 worker nodes): the worker nodes are bigger (2 cores and 8GB RAM), ideal for testing partitionable slots and glidein policies
    • CE is fermicloud121.fnal.gov
    • Worker nodes are fermicloud111, fermicloud081, fermicloud313, fermicloud314
    • These nodes are all OSG 3.2; they still work, but the software versions are no longer supported
  • A reverse web proxy for the Factory and Frontend, so that you can send glideins outside Fermilab (e.g. to Amazon AWS or Google Compute Engine)
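
As a quick sanity check of the Frontend proxies listed above, something like the following can be run on fermicloud308 (a minimal sketch; the paths are the ones mentioned above):

   # Show lifetime and VO attributes of the proxies used by the ITB Frontend
   voms-proxy-info -all -file /etc/gwms-frontend/proxies/dune_proxy
   voms-proxy-info -all -file /etc/gwms-frontend/proxies/pilot_proxy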

Below are notes on each of the above.

Compute Element on fermicloud025.fnal.gov

It is an OSG CE with both Globus and HTCondor CE enabled (https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallComputeElement).
It uses edg-mkgridmap with the following VOs enabled (/etc/edg-mkgridmap.conf): osg, mis
It also has a local grid-mapfile (/etc/grid-security/local-grid-mapfile) where DNs can be added.
To refresh the grid-mapfile run: /usr/sbin/edg-mkgridmap (a short example is sketched below)
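
For example, to authorize a new DN on this CE (a sketch; the DN and the osg account mapping below are hypothetical, and it assumes local-grid-mapfile is included by the edg-mkgridmap configuration as usual):

   # Append the (hypothetical) DN and the local account it should be mapped to
   echo '"/DC=org/DC=opensciencegrid/C=US/O=OSG/CN=Jane Doe" osg' >> /etc/grid-security/local-grid-mapfile
   # Regenerate the grid-mapfile from the enabled VOs plus the local entries
   /usr/sbin/edg-mkgridmap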

The job manager is HTCondor, configured to run jobs on the same node.
The machine is overprovisioned: the CE, HTCondor, and multiple job slots (more than the available cores) all run on the same SL6 VM with 1 core and 2GB RAM.

Computing Cluster fermicloud121.fnal.gov

It is an OSG CE on fermicloud121 with both Globus and HTCondor CE enabled (https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallHTCondorCE). It has an HTCondor job manager (the collector and negotiator are on fermicloud121 as well) and 4 worker nodes: fermicloud111, fermicloud081, fermicloud313, fermicloud314.
Worker nodes are bigger (2 cores and 8GB RAM), ideal for testing partitionable slots and glidein policies.

This cluster uses shared home directories exported from the CE and GUMS authentication: https://gums2.fnal.gov:8443/gums/services/GUMSXACMLAuthorizationServicePort

The enabled VOs are: cms (*), fermilab, mis, osg, samgrid
(*) CMS uses a generic user; no pool of users has been created. If you are mapped to a specific user, e.g. uscms3423, it needs to be added to the cluster. (A quick access test is sketched below.)
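
A quick end-to-end check of authentication and submission to this CE can be done with condor_ce_trace (a sketch; it assumes the htcondor-ce-client tools and a valid VOMS proxy for one of the enabled VOs):

   # Get a proxy for one of the enabled VOs, e.g. fermilab
   voms-proxy-init -voms fermilab
   # Submit a trivial test job through the CE and report the mapping and result
   condor_ce_trace fermicloud121.fnal.gov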

Compute Element fermicloud378.fnal.gov

An OSG HTCondor CE version 3.2.1 running OSG 3.4 software on Scientific Linux 7.5. GSI authentication via mapfiles. No external worker nodes; 5 local partitionable condor slots.
Allowed_VOs = ["osg", "fermilab"] -> this also allows sub-groups of fermilab (e.g. NOvA, DUNE) to run. Has home directories and accounts for fermicloud users.

Important: the pool can now be queried from Landscape: https://landscape.fnal.gov/kibana/goto/4308d8fa921a61d7edaa94b50c98853f , meaning that Machine ClassAd attributes can be checked (see also the condor_status sketch below).
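
Besides Landscape, the Machine ClassAds can also be checked directly with condor_status (a sketch, assuming the CE's local collector is reachable from where you run it):

   # List the slots and a few Machine ClassAd attributes
   condor_status -pool fermicloud378.fnal.gov -af Name State Cpus Memory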

Compute Element fermicloud105.fnal.gov

An OSG HTCondor CE version 3.2.2 running OSG 3.4 software on Scientific Linux 6.10. GSI authentication via mapfiles. No external worker nodes; 5 local partitionable condor slots.
Allowed_VOs = ["osg", "fermilab"] -> this also allows sub-groups of fermilab (e.g. NOvA, DUNE) to run. NO local home directories or accounts for fermicloud users.

Compute Element fermicloud046.fnal.gov

An OSG HTCondor CE version 2.2.4 running OSG 3.3 software on Scientific Linux 6.10. NO local home directories or accounts for fermicloud users. GSI authentication via mapfiles. No external worker nodes; 5 local partitionable condor slots.
Allowed_VOs = ["osg", "fermilab"] -> this also allows sub-groups of fermilab (e.g. NOvA, DUNE) to run.

Compute Element fermicloud127.fnal.gov

An OSG HTCondor CE version 2.2.4 running OSG 3.3 software on Scientific Linux 7.5. GSI authentication via mapfiles. No external worker nodes; 5 local partitionable condor slots.
Allowed_VOs = ["osg", "fermilab"] -> this also allows sub-groups of fermilab (e.g. NOvA, DUNE) to run. Has home directories and accounts for fermicloud users.

Frontend on gwms-dev-frontend.fnal.gov

This is a GlideinWMS Frontend (with Schedd and User Collector) installed using OSG RPMs and modified to get all of its files from the Git repository cloned in /opt/gwms-git.
A checkout followed by an upgrade and reconfig will make the Frontend use the new version of the software (see the sketch below).
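
A typical update cycle looks like the following (a sketch; the branch name is just an example and the service commands follow the EL6 init-script convention, so adapt them on systemd hosts):

   cd /opt/gwms-git
   git fetch
   git checkout branch_v3_4      # example: the branch or tag under test
   # Pick up the checked-out code, then regenerate the Frontend configuration
   service gwms-frontend upgrade
   service gwms-frontend reconfig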

Factory on gwms-dev-factory.fnal.gov

This is a GlideinWMS Factory (with Factory Collector) installed using OSG RPMs and modified to get all of its files from the Git repository cloned in /opt/gwms-git.
A checkout followed by an upgrade and reconfig will make the Factory use the new version of the software.
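
The same update cycle sketched above for the Frontend applies here, substituting the gwms-factory service (and the /etc/gwms-factory configuration) for gwms-frontend.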

Reverse Web Proxy for Factory and Frontend on gwms-web.fnal.gov

Fermilab restricts the visibility of web servers outside of the lab. Specifically, on Fermicloud by default: if you run a web server on port 80, it will be visible only to other Fermicloud nodes; if you run a web server on another port, that port will be blocked by Fermilab security unless you have an approved exemption.
If you have resources (worker nodes) outside of Fermilab, you need to make the stage directories of the Frontend and Factory visible to them; you can use the reverse proxy to do that (unless you want to ask for a static IP and a web exemption).

To use gwms-web.fnal.gov:
1. Edit the Apache configuration in /etc/httpd/conf.d/gwms.conf,
adding the ProxyPass and ProxyPassReverse directives for each host that needs to be visible outside, e.g. (here fermicloud320.fnal.gov is the Factory and fermicloud319.fnal.gov is the Frontend to make visible):

<Location /hepcloud/factory>
 ProxyPass http://fermicloud320.fnal.gov/factory
 ProxyPassReverse http://fermicloud320.fnal.gov/factory
</Location>

<Location /hepcloud/vofrontend>
 ProxyPass http://fermicloud319.fnal.gov/vofrontend
 ProxyPassReverse http://fermicloud319.fnal.gov/vofrontend
</Location>

2. Restart Apache on gwms-web.fnal.gov
3. Edit the Factory (/etc/gwms-factory/glideinWMS.xml) and Frontend (/etc/gwms-frontend/frontend.xml) configurations to let the glideins know the exact URL of the staging areas, which are the Locations you just defined (the host is gwms-web.fnal.gov; the path starts with the defined Location, followed by "stage"), e.g.
   <stage base_dir="/var/lib/gwms-factory/web-area/stage" use_symlink="True" web_base_url="http://gwms-web.fnal.gov/hepcloud/factory/stage" />

4. Reconfig your Factory and Frontend

You can test the reverse proxy by accessing the monitoring pages of the Factory and Frontend, e.g. http://gwms-web.fnal.gov/hepcloud/factory/monitor/factoryStatus.html
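
For example, from a machine outside Fermilab (a minimal check using the monitoring URL above):

   # Should return 200 OK for the Factory status page served through the reverse proxy
   curl -sI http://gwms-web.fnal.gov/hepcloud/factory/monitor/factoryStatus.html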