Fermilab Frontend not communicating w/ upgraded Factory
After the Factory was upgraded to 3.4.5, the Fermilab Frontend gpfrontend02 (3.4.2) stopped requesting glideins. It was seeing the schedd jobs and the Factory entries (up), but it was not requesting glideins from the Factory, and almost all stats were 0 (except for a few glideins running in one entry). We don't know whether there were actually no glideins or whether the Frontend was seeing none; either way, it should have been requesting some.
After upgrading the Frontend to 3.4.5 the requests restarted but glideins in one group were not reporting back.
Later we found out that one script, Exp_CVMFS_check.sh, was broken (missing line continuations). The other groups probably had glideins that were queued or submitted while a previous version of the script was current (glideins download the script version that was current at submission time, not at run time).
Problem seems solved now for Factory 3.4.5 and Frontend 3.4.5
It remains to test and reproduce the possible problem w/ Factory 3.4.5 and Frontend 3.4.2. Is there an incompatibility that was missed in testing?
#1 Updated by Marco Mambelli over 1 year ago
There are some changes in branch v35/22928. This is not solving 22928 which is due to a bug that was in 3.4.2. Improved code clarity (log_and_sum_factory_line invocation) and PEP 8 compliance.
To solve the compatibility problem, the 2 values should be sent as strings again, as they were in 3.4.2 and earlier, so as not to trigger the bug in 3.4.2 and earlier Frontends.
#2 Updated by Marco Mambelli over 1 year ago
3.4.2 Frontend at Fermilab
All groups except dune were at 0, but the dune jobs were not running either
CMS Factory 3.4.5 and Frontend 3.4.2 were OK
The problem is with 3.4.2 Frontends and Factories at 3.4.3 or later (in practice 3.4.5, because 3.4.3 and 3.4.4 did not go into production)
It happens when the REQUIRE_VOMS/REQUIRE_GLEXEC_USE attributes are used (these are not used by CMS).
This is related to the REQUIRE_VOMS/REQUIRE_GLEXEC_USE str to boolean (int) tickets (09df208a7ebbd74a5416e80598e42126d1db986a).
There have been several subsequent changes to both frontend and factory, to the point
Initially, there was a bug in the interpretation of booleans in the Frontend, and some parameters that should have been boolean were strings (both in the configuration and in the classads sent from the Factory).
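As an illustration of how a string-valued flag can be misread as a boolean, here is a minimal Python sketch. It is not the actual Frontend code; the function name and the sample dictionary are hypothetical and only show the general pitfall.

```python
# Illustration only: NOT the actual Frontend code.
# A flag that arrives as the string "False" is still a non-empty string,
# so a plain truthiness check treats it as True.

def naive_is_required(value):
    """Hypothetical check that relies on Python truthiness."""
    return bool(value)


# Hypothetical attribute values, as they might look when sent as strings.
sample = {"GLIDEIN_REQUIRE_VOMS": "False", "GLIDEIN_REQUIRE_GLEXEC_USE": "True"}

for name, raw in sample.items():
    print(name, repr(raw), "->", naive_is_required(raw))
```

Both attributes come out as required, although the first one was meant to be off. Code written around such a check can then behave differently once the other side starts sending real booleans.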
Ticket [#21325] (branch 21325_1) changes the attribute from string to boolean in the Factory, with some adjustment in the following releases (i.e. the function provided by Marco)
- self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_VOMS", "boolean", restrictions[u'require_voms_proxy'], None, False, True, True)
- self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_GLEXEC_USE", "boolean", restrictions[u'require_glidein_glexec_use'], None, False, True, True)
+ self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_VOMS", "boolean", bool(strtobool(restrictions[u'require_voms_proxy'])), None, False, True, True)
+ self.dicts['vars'].add_extended("GLIDEIN_REQUIRE_GLEXEC_USE", "boolean", bool(strtobool(restrictions[u'require_glidein_glexec_use'])), None, False, True, True)
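The conversion in the diff can be sketched in isolation as follows. This assumes strtobool is the one from distutils.util, which returns 1 or 0 (hence the bool() wrapper in the diff). Since distutils was removed in Python 3.12, the sketch uses a local stand-in, str_to_bool, which is a hypothetical helper and not part of GlideinWMS. The restrictions values are made-up sample data.

```python
# Sketch under stated assumptions; str_to_bool is a hypothetical stand-in
# for distutils.util.strtobool (removed from the standard library in 3.12).

_TRUE = ("y", "yes", "t", "true", "on", "1")
_FALSE = ("n", "no", "f", "false", "off", "0")


def str_to_bool(text):
    """Convert a configuration string to a real bool, or raise ValueError."""
    value = text.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError("invalid truth value %r" % (text,))


# Made-up sample of the configuration values used in the diff.
restrictions = {
    "require_voms_proxy": "False",
    "require_glidein_glexec_use": "True",
}

require_voms = str_to_bool(restrictions["require_voms_proxy"])
require_glexec = str_to_bool(restrictions["require_glidein_glexec_use"])
print(require_voms, require_glexec)
```

Raising on unknown input, rather than silently defaulting, makes a malformed configuration value visible at reconfiguration time.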
The final result is correct but creates an incompatibility with FE < 3.4.4.
Not much we can do now.
The problem was not caught in testing because those attributes were not present in the compatibility testing setup.
They should have been, considering that they were the subject of important tickets for the release.
We could go back to sending strings, and that would be backward compatible.
But Factory 3.4.5 is out in production and is the one causing the incompatibility (triggering the Frontend bug).
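A rough sketch of what going back to sending strings could look like is shown below. The function name flag_to_wire and the legacy_strings switch are hypothetical, and the sketch assumes that older Frontends expect the text "True"/"False", which this ticket does not confirm.

```python
# Hypothetical sketch: NOT the Factory implementation.
# Assumption: pre-3.4.4 Frontends expect the text "True" / "False".

def flag_to_wire(value, legacy_strings=True):
    """Serialize a boolean flag, optionally as a string for older Frontends."""
    if legacy_strings:
        return str(bool(value))  # "True" or "False"
    return bool(value)           # real boolean for fixed Frontends


print(repr(flag_to_wire(False)))                        # string form
print(repr(flag_to_wire(False, legacy_strings=False)))  # boolean form
```

A switch like this would let the string form be dropped once no buggy Frontends remain in production.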
#5 Updated by Marco Mambelli over 1 year ago
- Target version changed from v3_4_6 to v3_5_1
- Assignee changed from Lorena Lobato Pardavila to Marco Mambelli
- Status changed from New to Resolved
v35/22928 has been merged in master
An initial discussion w/ Brian Lin concluded that no other changes were needed (no reverting to sending strings).
A later request by B. Lin (OSG) was to make the Factory compatible w/ older buggy Frontends.
Opened a separate ticket to handle that [#22999]
Opened a ticket to revert the changes in the 3.5 series (sending back the parameter in the correct way, no need to be compatible w/ the buggy frontends) [#23046]