Feature #2897

Shut down glidein if in Claimed/Idle for too long

Added by Igor Sfiligoi about 7 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: Factory
Target version:
Start date: 08/21/2012
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

I have observed glideins staying in Claimed/Idle for a long time, wasting CPU.

This was due to a known shadow bug:
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2719

but the startd should still protect itself.

I propose adding a 10-minute timeout, so that the startd gets itself out of that state.

History

#1 Updated by Igor Sfiligoi about 7 years ago

The expression to use is

(State=="Claimed") && (Activity=="Idle") && ((time()-EnteredCurrentActivity)>600)

#2 Updated by Igor Sfiligoi about 7 years ago

The question is:
Where should we put it?

Should we just add it (as an or) to the

STARTD.DAEMON_SHUTDOWN

expression?

Not the optimal way.
But since this should not happen too often, maybe it is an appropriate action to take.
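
As a rough illustration of the "add it as an or" idea, the fragment below is a minimal sketch and not the actual glideinWMS configuration. EXISTING_SHUTDOWN_EXPR, CLAIMED_IDLE_TIMEOUT and CLAIMED_IDLE_TOO_LONG are hypothetical macro names introduced only for this example; the first one stands in for whatever shutdown condition the glidein already sets. Since EnteredCurrentActivity is a timestamp in seconds since the epoch, the sketch compares the elapsed time (time() minus that timestamp) against the 600-second limit rather than the raw attribute value.

```
# Hypothetical placeholder for the shutdown condition the glidein already uses.
EXISTING_SHUTDOWN_EXPR = False

# Hypothetical helper macros: 10-minute limit on Claimed/Idle.
CLAIMED_IDLE_TIMEOUT = 600
CLAIMED_IDLE_TOO_LONG = (State == "Claimed") && (Activity == "Idle") && ((time() - EnteredCurrentActivity) > $(CLAIMED_IDLE_TIMEOUT))

# Shut the startd down if either condition holds.
STARTD.DAEMON_SHUTDOWN = ($(EXISTING_SHUTDOWN_EXPR)) || ($(CLAIMED_IDLE_TOO_LONG))
```

Keeping the timeout in its own macro would make the 10-minute value easy to tune, but whether the expression evaluates as intended inside the glidein's startd would need to be checked.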

#3 Updated by Parag Mhashilkar about 7 years ago

If we want the startd to shut down by itself because of this, DAEMON_SHUTDOWN seems to be the appropriate knob to use.

Some related questions: Will it help to trigger a startd restart in this case, to get the shadow out of the buggy state? If so, is it worth doing, provided we can preserve the fact that the startd has already lived 10 minutes or so of its intended lifetime?

#4 Updated by Igor Sfiligoi about 7 years ago

Technically, we just want to relinquish the claim... so DAEMON_SHUTDOWN is a bit of overkill... but it will still do the job.

As for "restarting", no, DAEMON_SHUTDOWN will kill the whole glidein...
no restart after this event.

If anyone has a better proposal, I am all ears.

#5 Updated by Parag Mhashilkar about 7 years ago

I wasn't clear in my comment. When I said restarting the startd, what I meant was to use a scheme that makes the master restart the startd, without using DAEMON_SHUTDOWN at all. That way it won't kill the glidein at all, as it won't get control at all. A preliminary search through the Condor manual doesn't seem to show any useful parameter/macro we can use.

#6 Updated by Igor Sfiligoi about 7 years ago

Ah...

well... I did not think too much about it...
just shutting down the glidein works... anything else would need to be investigated.

Given that we don't expect this to happen too often, I would go for the easy solution (i.e. just kill the whole glidein).

#7 Updated by Igor Sfiligoi over 5 years ago

  • Assignee changed from Igor Sfiligoi to Burt Holzman

Reassigning to Burt to evaluate if this is still relevant,
since he is looking at the DAEMON_SHUTDOWN expression right now.

#8 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v2_7_x to v3_2_x

#9 Updated by Parag Mhashilkar about 3 years ago

  • Assignee changed from Burt Holzman to Dennis Box

#10 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_2_x to v3_4_x

#11 Updated by Marco Mambelli about 1 year ago

  • Target version changed from v3_4_x to v3_5_x

#12 Updated by Marco Mambelli about 2 months ago

  • Target version changed from v3_5_x to v3_7_x

