Factory counts forgotten Held jobs against limit
We just had an "incident" in which CMS did not get any glideins submitted for one entry (on one factory) for about a month.
This happened after CMS changed the FE name.
The culprit seems to be the fact that glidein jobs submitted on behalf of the old FE went held,
hitting the Held limit (for all CMS FEs).
However, since the factory forgot about the old FE, there was nobody to release them!
To add to the problem, the monitoring was showing 0 held, even though the factory logs were complaining about hitting the limit (and thus not submitting new glideins).
I think the factory should not count those glideins, since it is not managing them anymore.
#3 Updated by Burt Holzman about 7 years ago
- Assignee changed from Parag Mhashilkar to Burt Holzman
I'm not sure that disregarding held glideins for FEs that have dropped out is the right thing to do. Flaky transient FEs could cause strange dynamic behavior, and we would have to add special code for the case where the factory starts up before any FEs report in. I'm thinking we should just get the monitoring right (and maybe keep trying to release/remove the held glideins).
#6 Updated by Burt Holzman about 7 years ago
Ok, finally I think I'm closing in on this. It's specific to hitting the FE security class limit.
The problem is that the sanitize inside of keepIdleGlideins uses a limited condor_q (getQProxSecClass). When the other FE disappears, the limits are
still enforced, but nothing shows up in the condor_q since it belonged to the other FE. It also returns '1' after calling sanitizeGlideinsSimple, so
that the main loop thinks work was actually done in the entry.
There is an entry-wide sanitize, but that only gets called when the loop thinks nothing was done in the entry.
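To make the interaction concrete, here is a toy model of the behavior described above. It is not the factory code: the Glidein class, the queue list, and every function name are hypothetical stand-ins, and "sanitize" is reduced to removing held glideins. It only shows how held glideins from a vanished FE stay in the queue when the FE-level step always reports work done.

```python
from dataclasses import dataclass

@dataclass
class Glidein:
    frontend: str  # FE that requested the glidein (hypothetical field)
    held: bool

def query_for_frontend(queue, frontend):
    """Stand-in for the FE-limited condor_q: sees only that FE's jobs."""
    return [g for g in queue if g.frontend == frontend]

def fe_level_sanitize(queue, frontend):
    """Remove held glideins visible to this FE, then always report work done."""
    for g in query_for_frontend(queue, frontend):
        if g.held:
            queue.remove(g)
    return 1  # unconditional, mirroring the '1' described above

def entry_wide_sanitize(queue):
    """Remove every held glidein in the entry, whichever FE owned it."""
    queue[:] = [g for g in queue if not g.held]

def entry_loop_iteration(queue, active_frontends):
    work_done = 0
    for fe in active_frontends:
        work_done += fe_level_sanitize(queue, fe)
    if not work_done:  # never true while at least one FE is active
        entry_wide_sanitize(queue)
    return work_done

# Three held glideins left behind by the renamed FE; only the new FE reports in.
queue = [Glidein("cms_old", True) for _ in range(3)]
entry_loop_iteration(queue, ["cms_new"])
held_left = sum(g.held for g in queue)  # orphans are untouched
```

In this model the orphaned glideins are only cleaned up on an iteration where no FE is active at all, which matches the symptom of the limit staying hit for as long as the new FE keeps requesting.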
To preserve the old behavior, I could write a patch to only return '1' if the FE-level sanitize actually removes glideins. Although maybe we should
just call sanitize every entry loop and remove the FE-level cleaning entirely.
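A rough sketch of the two options, under the same simplifying assumptions as the description above: the queue is a plain list of dicts, "sanitize" just drops held glideins, and all function names are hypothetical rather than taken from the factory code.

```python
def fe_level_sanitize(queue, frontend):
    """Return 1 only if a held glidein owned by `frontend` was removed."""
    before = len(queue)
    queue[:] = [g for g in queue
                if not (g["fe"] == frontend and g["held"])]
    return 1 if len(queue) < before else 0

def entry_wide_sanitize(queue):
    """Drop every held glidein in the entry; return how many were removed."""
    before = len(queue)
    queue[:] = [g for g in queue if not g["held"]]
    return before - len(queue)

def loop_option_a(queue, active_frontends):
    """Option A: keep FE-level cleaning, but report work only when it removed something."""
    work_done = sum(fe_level_sanitize(queue, fe) for fe in active_frontends)
    if not work_done:
        entry_wide_sanitize(queue)  # now reachable even with active FEs
    return work_done

def loop_option_b(queue, active_frontends):
    """Option B: no FE-level cleaning; sanitize the whole entry every loop."""
    return entry_wide_sanitize(queue)
```

With option A, a queue holding only held glideins from the old FE yields zero FE-level work for the new FE, so the entry-wide pass runs and clears them. Option B gets the same result with less code, at the cost of an entry-wide pass on every iteration.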