Shut down glidein if in Claimed/Idle for too long
I have observed glideins staying in Claimed/Idle for a long time, wasting CPU.
This was due to a known shadow bug,
but the startd should still protect itself.
I propose adding a 10-minute timeout, after which the startd gets itself out of that state.
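The proposed check can be sketched as a small predicate. This is only an illustration in Python, not the glidein code: `SlotStatus` and `claimed_idle_too_long` are hypothetical names, and the fields are assumed to mirror the startd slot attributes `State`, `Activity` and `EnteredCurrentActivity`. In a real deployment the same logic would presumably be written as a ClassAd expression in the startd configuration.

```python
from dataclasses import dataclass

# The 10-minute timeout proposed in this ticket, in seconds.
CLAIMED_IDLE_TIMEOUT_S = 10 * 60


@dataclass
class SlotStatus:
    """Hypothetical snapshot of a slot, mirroring startd attributes (assumption)."""
    state: str                     # e.g. "Claimed", "Unclaimed"
    activity: str                  # e.g. "Idle", "Busy"
    entered_current_activity: int  # Unix time when the current activity began


def claimed_idle_too_long(slot: SlotStatus, now: int,
                          timeout_s: int = CLAIMED_IDLE_TIMEOUT_S) -> bool:
    """Return True if the slot has sat in Claimed/Idle for longer than timeout_s."""
    if slot.state != "Claimed" or slot.activity != "Idle":
        return False
    return (now - slot.entered_current_activity) > timeout_s
```

The timer is tied to the time the current activity was entered, so a slot that goes Busy and later returns to Idle starts a fresh 10-minute window.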
#3 Updated by Parag Mhashilkar over 7 years ago
If we want the startd to shut itself down because of this, DAEMON_SHUTDOWN seems to be the appropriate knob to use.
A somewhat related question: would it help to trigger a startd restart in this case, to get the shadow out of the buggy state? If so, is it worth doing, provided we can preserve the fact that the startd has already used 10 minutes or so of its intended lifetime?
#4 Updated by Igor Sfiligoi over 7 years ago
Technically, we just want to relinquish the claim... so DAEMON_SHUTDOWN is a bit of an overkill... but it will still do the job.
As for "restarting", no, DAEMON_SHUTDOWN will kill the whole glidein...
there is no restart after this event.
If anyone has a better proposal, I am all ears.
#5 Updated by Parag Mhashilkar over 7 years ago
I wasn't clear in my comment. When I said restarting the startd, I meant using a scheme that makes the master restart the startd, without using DAEMON_SHUTDOWN at all. That way the glidein would not be killed, since it would never get control back. A preliminary search through the Condor manual doesn't show any useful parameter/macro we could use.
#6 Updated by Igor Sfiligoi over 7 years ago
well... I did not think too much about it...
just shutting down the glidein works... anything else would need to be investigated.
Given that we don't expect this to happen too often, I would go for the easy solution (i.e. just kill the whole glidein).