Support #22928

Fermilab Frontend not communicating w/ upgraded Factory

Added by Marco Mambelli 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Category: -
Target version:
Start date: 07/12/2019
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

After the Factory was upgraded to 3.4.5, the Fermilab Frontend gpfrontend02 (3.4.2) stopped requesting jobs: it was seeing the schedd jobs and the Factory entries (up), but it was not requesting glideins from the Factory, and almost all stats were 0 (except a few glideins running in one entry). We don't know whether there were actually no glideins or whether the Frontend was seeing none; either way, it should have been requesting some.
After upgrading the Frontend to 3.4.5 the requests restarted, but glideins in one group were not reporting back.
Later we found out that there was a broken script, Exp_CVMFS_check.sh (missing line continuations). The other groups probably had queued glideins that were submitted while a previous version of the script was current (glideins download the script version that was current at submission time, not at run time).

The problem seems solved now for Factory 3.4.5 and Frontend 3.4.5.
It remains to test and reproduce the possible problem w/ Factory 3.4.5 and Frontend 3.4.2. Is there an incompatibility that was missed in testing?


Related issues

Related to GlideinWMS - Support #22999: Make Factory compatible w/ older 3.4 Frontends - Revert back to send REQUIRE_VOMS/REQUIRE_GLEXEC_USE attributes as strings (Closed, 07/27/2019)

Related to GlideinWMS - Support #23046: Revert the changes in #22999, at least any workaround to be compatible w/ buggy behavior in 3.4 (Accepted, 07/28/2019)

History

#1 Updated by Marco Mambelli 2 months ago

There are some changes in branch v35/22928. They do not solve #22928, which is due to a bug that was in 3.4.2; they improve code clarity (log_and_sum_factory_line invocation) and PEP 8 compliance.

To solve the compatibility problem, the 2 values should be sent as strings again, as they were in 3.4.2 and earlier, so as not to trigger the bug in 3.4.2 and earlier Frontends.
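
A minimal sketch of what sending the two values as strings again could look like. The helper name `as_legacy_string` and the exact strings "True"/"False" are assumptions made for illustration; they are not taken from the GlideinWMS code base.

```python
def as_legacy_string(value):
    """Render a bool, or a boolean-like string, as 'True' or 'False'.

    Hypothetical helper: the string spelling expected by older Frontends
    is assumed here, not confirmed by this ticket.
    """
    if isinstance(value, bool):
        return str(value)
    text = str(value).strip().lower()
    if text in ("y", "yes", "t", "true", "on", "1"):
        return "True"
    if text in ("n", "no", "f", "false", "off", "0"):
        return "False"
    raise ValueError("invalid truth value %r" % (value,))


# Both a real boolean and a configuration string end up as a string value.
legacy_voms = as_legacy_string(True)
legacy_glexec = as_legacy_string("false")
```

With a helper of this kind the attribute would be published as a string regardless of how the configuration value was stored.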

#2 Updated by Marco Mambelli about 2 months ago

Things observed:

3.4.2 Frontend at Fermilab
3.4.5 Factory
All groups except dune showed 0, but the dune jobs were not running either

CMS Factory 3.4.5 and Frontend 3.4.2 were OK

Explanation:

The problem is with 3.4.2 Frontends and Factories 3.4.3 or later (in practice 3.4.5, because 3.4.3 and 3.4.4 did not go into production).
It happens when the REQUIRE_VOMS/REQUIRE_GLEXEC_USE attributes are used (these are not used by CMS).

This is related to the REQUIRE_VOMS/REQUIRE_GLEXEC_USE str to boolean (int) tickets (09df208a7ebbd74a5416e80598e42126d1db986a).
There have been several subsequent changes to both frontend and factory, to the point

Initially, there was a bug in the interpretation of booleans in the Frontend, and some parameters that should have been boolean were strings (both in the configuration and in the classads sent from the Factory).

Ticket [#21325] (branch 21325_1) changes the attribute from string to boolean in the Factory, with some adjustment in the following releases (i.e. the function provided by Marco)

    
    -        self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_VOMS", "boolean", restrictions[u'require_voms_proxy'], None, False, True, True)
    -        self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_GLEXEC_USE", "boolean", restrictions[u'require_glidein_glexec_use'], None, False, True, True)
    +        self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_VOMS", "boolean", bool(strtobool(restrictions[u'require_voms_proxy'])), None, False, True, True)
    +        self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_GLEXEC_USE", "boolean", bool(strtobool(restrictions[u'require_glidein_glexec_use'])), None, False, True, True)

The final result is correct but creates an incompatibility with Frontends older than 3.4.4.
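
To illustrate why the added lines in the diff wrap the value in `bool(strtobool(...))`, here is a small sketch. The local `strtobool` is a stand-in modeled on `distutils.util.strtobool` (assumed to be the function used in the diff; distutils was removed in Python 3.12), and the `restrictions` values are made up for the example.

```python
def strtobool(value):
    """Local stand-in modeled on distutils.util.strtobool: returns 1 or 0."""
    text = str(value).strip().lower()
    if text in ("y", "yes", "t", "true", "on", "1"):
        return 1
    if text in ("n", "no", "f", "false", "off", "0"):
        return 0
    raise ValueError("invalid truth value %r" % (value,))


# Hypothetical configuration values, stored as strings.
restrictions = {
    "require_voms_proxy": "False",
    "require_glidein_glexec_use": "True",
}

# Any non-empty string is truthy, so a plain bool() misreads "False".
naive = bool(restrictions["require_voms_proxy"])

# Parsing the text first gives the intended boolean.
parsed = bool(strtobool(restrictions["require_voms_proxy"]))
```

Here `naive` is `True` while `parsed` is `False`, which is the difference between treating the configuration string as a truthy object and parsing its content.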

Not much we can do now.

The problem was not caught in testing because those attributes were not present in the compatibility testing setup.
They should have been, considering that they were the subject of important tickets for the release.

We could go back to sending strings, and that would be backward compatible.
But Factory 3.4.5 is out in production and is the one causing the incompatibility (triggering the Frontend bug).

#3 Updated by Lorena Lobato Pardavila about 2 months ago

  • Priority changed from Normal to High
  • Assignee changed from Marco Mambelli to Lorena Lobato Pardavila

#4 Updated by Marco Mambelli about 2 months ago

  • Related to Support #22999: Make Factory compatible w/ older 3.4 Frontends - Revert back to send REQUIRE_VOMS/REQUIRE_GLEXEC_USE attributes as strings added

#5 Updated by Marco Mambelli about 2 months ago

  • Target version changed from v3_4_6 to v3_5_1
  • Assignee changed from Lorena Lobato Pardavila to Marco Mambelli
  • Status changed from New to Resolved

v35/22928 has been merged into master.
An initial discussion w/ Brian Lin concluded that no other changes were needed (no reverting to sending strings).
A later request by B. Lin (OSG) was to make the Factory compatible w/ older buggy Frontends.
Opened a separate ticket to handle that [#22999].
Opened a ticket to revert those changes in the 3.5 series (going back to sending the parameters in the correct way, no need to be compatible w/ the buggy Frontends) [#23046].

#6 Updated by Marco Mambelli about 2 months ago

  • Related to Support #23046: Revert the changes in #22999, at least any workaround to be compatible w/ buggy behavior in 3.4 added

