Support #24511

Install GlideinWMS framework on Fermicloud

Added by Marco Mambelli about 1 month ago. Updated 21 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 06/03/2020
Due date:
% Done: 100%
Estimated time:
Stakeholders:
Duration:

Description

Install GlideinWMS framework on Fermicloud using OSG RPMs.

1. You will need one host for the Factory (including the condor Factory collector and schedd) and one for the Frontend (including the condor User collector/negotiator and a schedd).
Take note of the 2 hostnames; you will need them for the configuration.
Take note also of the certificate DNs. Certificates are in /etc/grid-security/host*
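
The DNs can be read from the host certificate with openssl. This is a minimal sketch, assuming the certificate is the conventional /etc/grid-security/hostcert.pem (the ticket only says host*); cert_dn is a hypothetical helper name.

```shell
# Print the subject DN of a certificate.
# Assumption: the host certificate is /etc/grid-security/hostcert.pem.
cert_dn() {
    openssl x509 -in "${1:-/etc/grid-security/hostcert.pem}" -noout -subject
}

# Usage, as root on each VM:
# cert_dn
```

The exact output format depends on the installed openssl version, so compare it with the DN format the configuration expects before pasting it in.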

To make new VMs and administer them use fcluigpvm01:
ssh fcluigpvm01.fnal.gov

There are many templates for your VMs; to start, I recommend SLF7V_DynIP_Home.
This script will list your hosts (ID and hostname needed to ssh):
~marcom/bin/myhosts

Then login as root on your VMs:
ssh
And follow the install instructions.

I'd like you to install the development version of GlideinWMS, the one distributed in the osg-upcoming yum repository. You'll see the option in the instructions below.

2. The factory installation instructions are in the OSG documentation. Follow them to install it:
https://opensciencegrid.org/operations/services/install-gwms-factory/

3. The frontend installation instructions are in the OSG documentation. Follow them to install it :
https://opensciencegrid.github.io/docs/other/install-gwms-frontend/

Continue as far as you can, and also provide feedback about the instructions.

History

#1 Updated by Marco Mambelli about 1 month ago

There is a known issue with v3.7, the version installed from osg-upcoming

After installing or upgrading the GWMS Factory to v3.7, the following file must be present; otherwise the Factory will not start.
Even if you don't use the new logging feature, you must still touch the file (an empty file is OK):
touch /var/lib/gwms-factory/server-credentials/jwt_secret.key

In short:
After any RPM install or RPM upgrade to the GlideinWMS v3.7 Factory, please run the following:
touch /var/lib/gwms-factory/server-credentials/jwt_secret.key
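
The workaround can also be wrapped in a small check that reports whether the file was already there. This is a sketch only; ensure_jwt_key is a hypothetical helper name, and the default path is the one given in this note.

```shell
# Create the (possibly empty) key file if it is missing and report what was done.
# An existing file is left untouched.
ensure_jwt_key() {
    key_file="${1:-/var/lib/gwms-factory/server-credentials/jwt_secret.key}"
    if [ -e "$key_file" ]; then
        echo "already present: $key_file"
    else
        touch "$key_file" || return 1
        echo "created $key_file"
    fi
}

# Usage, as root on the Factory host after the RPM install or upgrade:
# ensure_jwt_key
```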

#2 Updated by Namratha Urs about 1 month ago

  • % Done changed from 0 to 30
  • Status changed from New to Work in progress

Tasks completed:

1. Created two VMs for the installation of Factory and Frontend components of the GWMS framework
2. Completed pre-requisites for GWMS Factory installation as outlined in OSG documentation
3. Completed GWMS Factory installation from osg-upcoming repository (v3.7-1)
4. Started GWMS Factory configuration

#3 Updated by Namratha Urs 28 days ago

  • % Done changed from 30 to 80
GlideinWMS Factory:
  • Completed all required configurations.
  • Encountered an issue with building of Condor tarballs. Marco assisted me with the resolution and released a patch for the same (#24527).
  • Started all GWMS Factory services and verified the same.
GlideinWMS Frontend:
  • Completed Frontend (v3.7-1) instance installation
  • Completed all required configurations.
  • Encountered an issue with the creation of the pilot proxy via the gwms-renew-proxies.timer service. journalctl shows the following error message:
ERROR: Failed to renew proxy /etc/gwms-frontend/pilot_proxy: Could not find entry in /etc/vomses for /etc/gwms-frontend/mycert.pem. Please verify your VO data installation.

#4 Updated by Namratha Urs 21 days ago

  • % Done changed from 80 to 100
  • Status changed from Work in progress to Closed
GlideinWMS Frontend:
  • Completed all frontend configurations
  • Started GWMS Frontend services and verified the same
  • Submitted user jobs from the frontend and verified that the jobs are run and completed on the grid CE.

Summary:
  • GlideinWMS framework v3.7-1 (factory and frontend instances) verified and all services are running!
  • Encountered several issues while configuring the framework; Dennis, Bruno and Marco helped with troubleshooting and understanding them. Most of these issues are described below, for my own future reference and as pointers for anyone joining the project or the team.

ISSUES ENCOUNTERED:
  • Error while building the condor tarball:
    Using default factory config file: /etc/gwms-factory/glideinWMS.xml
    Error creating condor tgz: Error copying /usr/lib64/libSciTokens.so.0.0.2 in lib/libSciTokens.so.0.0.2: [Errno 2] No such file or directory: 'lib/libSciTokens.so.0.0.2'
    Upgrading the factory                                      [FAILED]

    • Refer to #24527, which has been updated with the bug fix.
  • The glideresource entry was missing from the frontend's condor_status. Each glidefactory entry in the factory's condor_status should have a corresponding glideresource entry on the frontend side, but condor_status did not list it after the frontend was started.
    • Checked the frontend proxies to ensure they were not expired. The factory log (/var/log/gwms-factory/server/factory/factory.all.log) reported the following error message:
      CredentialError: Client provided invalid ReqEncIdentity(vofrontend_service@gfermicloud395.fnal.gov!=vofrontend_service@fermicloud395.fnal.gov). Skipping for security reasons.
      [2020-06-11 11:17:28,392] ERROR: glideFactory:629: Error occurred processing the globals classads:
      
    • The my_identity attribute of the <collector> tag in frontend.xml contained a typo. After correcting the typo, a frontend reconfig followed by a restart of the factory and frontend services resolved the credential error, and the glideresource entry was listed in the condor_status output (using the command condor_status -any -wide).
  • Jobs submitted via condor_submit never started and remained idle indefinitely (the job status stayed at I).
    • The following line in frontend.xml was misconfigured. The line read:
      <factory query_expr='((stringListMember("VO", GLIDEIN_Supported_VOs)))'>
      
      and the attribute GLIDEIN_Supported_VOs was not defined within the XML file, which was causing the resource configuration to fail. Instead of defining the attribute within the <attrs> tag, the line was changed to:
      <factory query_expr="True">
      
      After a frontend reconfig followed by a restart of services (both frontend and factory), the jobs were submitted and ran on the CE as expected.
  • Upon submitting new jobs on the frontend, the jobs remained in idle status without being run on the CE. The factory log (/var/log/gwms-factory/client/user_frontend/glidein_gfactory_instance/entry_FC395_TEST_ENTRY/condor_activity_xxx.log) reports the following message (fermicloud025 is the grid CE entry configured in the factory):
    012 (172.007.000) 2020-06-17 15:48:58 Job was held.
        Error connecting to schedd fermicloud025.fnal.gov: SECMAN:2007:Failed to received post-auth ClassAd|AUTHENTICATE:1004:Failed to authenticate using FS
        Code 0 Subcode 0
    
    • Again, checked whether the proxies had expired, which was not the case. Marco helped test whether the CE was working and accepting job submissions, using condor_ce_trace. condor_ce_trace submits a condor job to the CE and prints diagnostic messages that help determine whether the CE is working. The CE turned out not to be running; it was restarted and then checked with condor_ce_trace.
  • While the frontend proxy was renewed automatically, the pilot proxy was not generated or renewed by gwms-renew-proxies.service. The service was in a failed state (as shown by systemctl status gwms-renew-proxies.service) with the following message:
    ERROR: Failed to renew proxy /etc/gwms-frontend/pilot_proxy: Could not find entry in /etc/vomses for /etc/gwms-frontend/mycert.pem. Please verify your VO data installation.
    
    • The proxies.ini file in /etc/gwms-frontend contains the configuration for automatic generation and renewal of both proxies (frontend and pilot), including renewal of the VO membership. The use_voms_server setting controls how the proxy's VO attributes are signed. It defaults to False and needs to be set to True so that the VO attributes are signed by the virtual organization's VOMS server.
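
Before editing proxies.ini it helps to see where use_voms_server is currently set. This is a minimal sketch, assuming the file uses plain key = value lines; show_voms_setting is a hypothetical helper name, and the default path is the one mentioned above.

```shell
# Print every use_voms_server line of a proxies.ini-style file,
# prefixed with its line number.
show_voms_setting() {
    ini_file="${1:-/etc/gwms-frontend/proxies.ini}"
    grep -n -i '^[[:space:]]*use_voms_server[[:space:]]*=' "$ini_file"
}

# Usage, as root on the Frontend host:
# show_voms_setting
```

After changing the value, the state of the renewal can be checked again with systemctl status gwms-renew-proxies.service, as in the issue described above.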

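The ReqEncIdentity mismatch reported in the factory log is easier to spot when the two identities are printed separately. This is a minimal sketch, assuming the log line has the format quoted in this ticket; identity_mismatch is a hypothetical helper name. Labelling the first value as the one the client provided and the second as the one the Factory expected is an interpretation of the message text, not something confirmed by the GlideinWMS documentation.

```shell
# Read Factory log lines on stdin and print both identities from each
# "invalid ReqEncIdentity(A!=B)" message; other lines are ignored.
identity_mismatch() {
    sed -n 's/.*ReqEncIdentity(\([^!]*\)!=\([^)]*\)).*/provided=\1 expected=\2/p'
}

# Usage (log path taken from this ticket):
# identity_mismatch < /var/log/gwms-factory/server/factory/factory.all.log
```

With the two values side by side, a stray character in the frontend's my_identity attribute is much easier to see than in the raw log line.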
