Bug #4612

Factory counts forgotten Held jobs against limit

Added by Igor Sfiligoi over 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: High
Assignee:
Category: Factory
Target version:
Start date: 08/30/2013
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

We just had an "incident" when CMS did not get any glideins submitted for about a month for one entry (on one factory).
This happened after CMS changed the FE name.

The culprit seems to be the fact that glidein jobs submitted on behalf of the old FE went held,
hitting the Held limit (for all CMS FEs).
However, since the factory forgot about the old FE, there was nobody to release them!
To add to the problem, the monitoring was showing 0 held, even though the factory logs were complaining about hitting the limit (and thus not submitting new glideins).

I think the factory should not count those glideins, since it is not otherwise managing them.
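
The mismatch described above can be illustrated with a minimal Python sketch. This is not glideinWMS code: the job records, the frontend names, the limit value, and the helpers `held_for_limit` and `held_for_monitoring` are all hypothetical. It only assumes that the limit check counts every held glidein while monitoring counts only those belonging to frontends the factory still knows about.

```python
# Hypothetical job records: (frontend_name, status). Not real factory data.
JOBS = [
    ("cms_old_fe", "held"),
    ("cms_old_fe", "held"),
    ("cms_old_fe", "held"),
    ("cms_new_fe", "idle"),
]
KNOWN_FRONTENDS = {"cms_new_fe"}  # the old FE name has been forgotten
HELD_LIMIT = 3                    # illustrative value only


def held_for_limit(jobs):
    """Count every held glidein, regardless of which FE it was submitted for."""
    return sum(1 for _, status in jobs if status == "held")


def held_for_monitoring(jobs, known):
    """Count held glideins only for frontends the factory still tracks."""
    return sum(1 for fe, status in jobs if status == "held" and fe in known)


def can_submit(jobs, limit):
    """Submission is blocked once the held count reaches the limit."""
    return held_for_limit(jobs) < limit


limit_count = held_for_limit(JOBS)
monitor_count = held_for_monitoring(JOBS, KNOWN_FRONTENDS)
submit_allowed = can_submit(JOBS, HELD_LIMIT)
```

With these inputs the limit check sees three held glideins and blocks submission, while the monitoring count stays at zero. Using the same counting function for both would at least make the two views agree.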

History

#1 Updated by Burt Holzman over 7 years ago

  • Target version changed from v2_7_x to v3_2_x

#2 Updated by Burt Holzman about 7 years ago

  • Priority changed from Normal to High

This affected XSEDE just recently -- bumping the priority on this.

#3 Updated by Burt Holzman about 7 years ago

  • Assignee changed from Parag Mhashilkar to Burt Holzman

I'm not sure that disregarding held glideins for FEs that have dropped out is the right thing to do. Flaky transient FEs could cause strange dynamic behavior, and we have to add special code when the factory starts up before any FEs report in. I'm thinking we should just get the monitoring right (and maybe keep trying to release/remove the held glideins).

#4 Updated by Igor Sfiligoi about 7 years ago

Never forgetting a job submitted on behalf of any FE would of course be optimal
(I was sure we had a ticket on this, but could not find it).

But until we have that, having consistent counting would be way better than what we have now.

#5 Updated by Burt Holzman about 7 years ago

I talked to Jeff a bit over IM. The issue is that the factory was actually stuck in a hold/release loop for these glideins (error 131).
However, I thought we had protection against infinite hold/release loops with max_release_count (10 by default).
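
As a rough sketch of what such protection could look like, the Python below splits held glideins into those still eligible for release and those that have exhausted their releases. Only the name max_release_count and its default of 10 come from the comment above. The `release_count` field, the `split_held` helper, and the choice to remove glideins once the cap is reached are assumptions made for illustration, not the factory's actual code.

```python
DEFAULT_MAX_RELEASE_COUNT = 10  # default value quoted in the comment above


def split_held(held_jobs, max_release_count=DEFAULT_MAX_RELEASE_COUNT):
    """Split held jobs into (ids to release, ids to remove).

    Each job is a dict with a hypothetical 'release_count' field that
    records how many times it has already been released.
    """
    to_release, to_remove = [], []
    for job in held_jobs:
        if job["release_count"] < max_release_count:
            to_release.append(job["id"])
        else:
            # Cap reached: stop cycling this job through hold/release.
            to_remove.append(job["id"])
    return to_release, to_remove


held = [
    {"id": "101.0", "release_count": 2},
    {"id": "102.0", "release_count": 10},
]
release_ids, remove_ids = split_held(held)
```

A guard like this only helps if the query that feeds it actually returns the held glideins in question.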

#6 Updated by Burt Holzman about 7 years ago

Ok, finally I think I'm closing in on this. It's specific to hitting the FE security class limit.

The problem is that the sanitize inside of keepIdleGlideins uses a limited condor_q (getQProxSecClass). When the other FE disappears, the limits are
still enforced, but nothing shows up in the condor_q since it belonged to the other FE. It also returns '1' after calling sanitizeGlideinsSimple, so
that the main loop thinks work was actually done in the entry.

There is an entry-wide sanitize, but that only gets called when the loop thinks nothing was done in the entry.

To preserve the old behavior, I could write a patch to only return '1' if the FE-level sanitize actually removes glideins. Although maybe we should
just call sanitize every entry loop and remove the FE-level cleaning entirely.
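
The first option can be sketched in Python as follows. This is a simplified model of the control flow described in this comment, not the real factory code: `fe_sanitize`, `keep_idle_old`, `keep_idle_patched`, and `entry_loop` are hypothetical stand-ins, and the list of visible held glideins represents whatever the limited condor_q returns.

```python
def fe_sanitize(visible_held):
    """Stand-in for the FE-level sanitize: returns how many glideins it removed."""
    return len(visible_held)


def keep_idle_old(visible_held):
    """Old behavior: always report that work was done after sanitizing."""
    fe_sanitize(visible_held)
    return 1


def keep_idle_patched(visible_held):
    """Patched behavior: report work only if something was actually removed."""
    removed = fe_sanitize(visible_held)
    return 1 if removed > 0 else 0


def entry_loop(keep_idle, visible_held):
    """Return True if the entry-wide sanitize would run in this iteration."""
    work_done = keep_idle(visible_held)
    # The entry-wide sanitize only runs when nothing was done in the entry.
    return work_done == 0


# The limited query sees nothing, because the held jobs belong to another FE.
old_runs_entry_sanitize = entry_loop(keep_idle_old, [])
patched_runs_entry_sanitize = entry_loop(keep_idle_patched, [])
```

Under the old behavior the entry-wide sanitize never gets a chance to run when the limited query comes back empty; with the patched return value it does. Calling the entry-wide sanitize on every loop, the second option, would make the return value irrelevant for this purpose.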

#7 Updated by Burt Holzman about 7 years ago

  • Target version changed from v3_2_x to v3_2_1

#8 Updated by Burt Holzman about 7 years ago

  • Status changed from New to Feedback
  • Assignee changed from Burt Holzman to Parag Mhashilkar

#9 Updated by Parag Mhashilkar about 7 years ago

This change has been reviewed. We are just waiting on the v3_2 release before merging it back to branch_v3_2 and master.

#10 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Feedback to Closed
  • Assignee changed from Parag Mhashilkar to Burt Holzman
