Bug #5359

Modify DAEMON_SHUTDOWN to use idle timers that are relative to change in state

Added by Burt Holzman over 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Category: Factory
Target version:
Start date: 02/06/2014
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

The current logic checks TotalTimeUnclaimedIdle < GLIDEIN_Max_Idle (and TotalTimeUnclaimedBusy < GLIDEIN_Max_Tail).

This works fine for non-partitionable slots, but for partitionable slots the parent slot doesn't change state when subslots are partitioned off. (It helps that we don't enforce those shutdowns unless all the subslots have been returned to the parent, but the behavior still differs.)

The HTCondor team provided us with a way to do what we want in 8.1.3:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3481
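
To illustrate the difference described above, here is a minimal Python sketch of the cumulative check applied to a partitionable parent slot. It is a toy model, not HTCondor or GlideinWMS code: `ParentSlot`, `tick` and `should_shut_down` are hypothetical names, and the model assumes (per the description) that the parent stays Unclaimed/Idle while its subslots run jobs.

```python
from dataclasses import dataclass


@dataclass
class ParentSlot:
    """Toy model of a partitionable parent slot (hypothetical, not HTCondor code)."""

    total_time_unclaimed_idle: int = 0  # stands in for TotalTimeUnclaimedIdle (seconds)
    busy_subslots: int = 0              # subslots currently carved out of the parent

    def tick(self, seconds: int) -> None:
        # Assumption from the description: the parent stays Unclaimed/Idle while
        # its subslots run jobs, so the cumulative counter keeps growing.
        self.total_time_unclaimed_idle += seconds


def should_shut_down(slot: ParentSlot, glidein_max_idle: int) -> bool:
    """Cumulative check, enforced only once all subslots are back in the parent."""
    over_limit = slot.total_time_unclaimed_idle >= glidein_max_idle
    return slot.busy_subslots == 0 and over_limit


slot = ParentSlot(busy_subslots=2)
slot.tick(1500)                                  # subslots busy for 1500 s
busy_verdict = should_shut_down(slot, 1200)      # still protected by busy subslots
slot.busy_subslots = 0                           # jobs finish, subslots are returned
returned_verdict = should_shut_down(slot, 1200)  # limit already exceeded
```

In this model the glidein is told to shut down the moment the last subslot is returned, even though it has not yet sat idle for any time since then. A non-partitionable slot in the same scenario would have spent those 1500 s outside Unclaimed/Idle.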


Related issues

Related to GlideinWMS - Feature #4680: Quantize glidein shutdown times in the cloud (Assigned, 09/24/2013)

History

#1 Updated by Parag Mhashilkar about 6 years ago

  • Subject changed from Modify DAEMON_SHUTDOWN to use idle timers that work with partitionable slots to Modify DAEMON_SHUTDOWN to use idle timers that are relative to change in state

We need to fix the idle time calculation to be relative to change in state. This affects both partitionable and non-partitionable slots.

On 3/14/2014 11:33 AM, Igor Sfiligoi wrote:
The slot goes from
Unclaimed/Idle -> Claimed/Idle
after a match.
Only after the shadow has started (and possibly the files transferred,
not sure about this part),
will the slot finally go into Claimed/Busy.

Notice that only 5s passed since it went into Claimed, so it is not
unreasonable for the schedd to take a bit to be ready to get the job going.

I think the problem is due to the fact that Slot1_TotalTimeClaimedBusy
is set to 0 the moment it enters Claimed State,
thus triggering the "Tail" expression.
But cannot be 100% sure, as I don't have that information available.

So changing State=="Busy" would fix this one particular case.
At the last gwms meeting the entire team agreed that changing the behavior to use relative rather than cumulative time was reasonable (in other words: we agree with you and will fix it).
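
The distinction between cumulative and relative time discussed in this comment can be sketched in Python as follows. This is only an illustration under stated assumptions: the `(state, seconds)` history format and the function names are made up for the example, consecutive entries are assumed to be distinct states, and nothing here is the actual shutdown expression that was committed.

```python
from typing import List, Tuple

# Each entry is (state/activity, seconds spent there); the last entry is the
# state the slot is in right now. Hypothetical format, for illustration only.
History = List[Tuple[str, int]]

IDLE = "Unclaimed/Idle"


def cumulative_idle(history: History) -> int:
    """Idle seconds summed over the slot's whole lifetime."""
    return sum(seconds for state, seconds in history if state == IDLE)


def relative_idle(history: History) -> int:
    """Idle seconds since the last change of state (0 if not idle now)."""
    state, seconds = history[-1]
    return seconds if state == IDLE else 0


history: History = [
    (IDLE, 600),            # waiting for the first match
    ("Claimed/Busy", 3600), # running a job
    (IDLE, 700),            # waiting again
]
max_idle = 1200  # stands in for GLIDEIN_Max_Idle

cumulative_trips = cumulative_idle(history) >= max_idle  # 1300 s in total
relative_trips = relative_idle(history) >= max_idle      # 700 s in this stretch
```

With a cumulative counter the slot exceeds the limit because of idle time accrued before the job ran; with a timer relative to the last state change it does not.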

#2 Updated by Parag Mhashilkar about 6 years ago

  • Assignee changed from Burt Holzman to Marco Mambelli

#3 Updated by Parag Mhashilkar about 6 years ago

  • Target version changed from v3_2_5 to v3_2_6

#4 Updated by Parag Mhashilkar almost 6 years ago

  • Target version changed from v3_2_6 to v3_2_7

#5 Updated by Igor Sfiligoi almost 6 years ago

The fix should play well with
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4266

I.e. the shutdown expression must be relative to the internal Condor time attributes.

#6 Updated by Marco Mambelli almost 6 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Hi Parag,
I committed the changes. I tested w/ condor 8.0 and 8.2.
Works as expected.
Marco

#7 Updated by Marco Mambelli almost 6 years ago

I changed the names of the variables in the expression and separated some parts to be more expressive and clear.

I opened 2 tickets asking for documentation about 2 attributes that are not in the manual index:
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktedit?tn=4623
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4624

I'm investigating how expressions are evaluated to see if I should add controls in the expressions.

#8 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Looks ok. Merged.

#9 Updated by Marco Mambelli over 5 years ago

  • Status changed from Resolved to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Corrected documentation as suggested by Burt.

#10 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

looks ok ... merged

#11 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Resolved to Closed

