Bug #3399

DAEMON_SHUTDOWN broken for partitionable slots

Added by Burt Holzman almost 8 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Factory
Target version:
Start date: 02/01/2013
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Partitionable slots always have a single idle slot, so the current DAEMON_SHUTDOWN expression evaluates to TRUE when we don't want it to.

From Derek Weitzel:

I watched a glidein run, and I saw this in the log:
01/31/13 17:54:53 (pid:40630) The DaemonShutdown expression "( ((Activity=="Idle") && (TotalTimeUnclaimedIdle =!= UNDEFINED) && (GLIDEIN_Max_Idle =!= UNDEFINED)&& (TotalTimeUnclaimedIdle > GLIDEIN_Max_Idle))||((Activity=="Idle") && (GLIDEIN_ToRetire =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToRetire )) || ((GLIDEIN_ToDie =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToDie )) || ((Activity=="Idle") && (TotalTimeUnclaimedIdle=!= UNDEFINED) && (TotalTimeClaimedBusy=!= UNDEFINED) && (GLIDEIN_Max_Tail=!= UNDEFINED) && (TotalTimeUnclaimedIdle > GLIDEIN_Max_Tail)) )" evaluated to TRUE: starting graceful shutdown

And it made me realize that with partitionable slots, there is always one slot that is idle. So this expression has to change in order for partitionable slots to work. How should it change? I'm not sure.

I believe we should change this expression to:
( ((Activity=="Idle") && (TotalTimeUnclaimedIdle =!= UNDEFINED) && (GLIDEIN_Max_Idle =!= UNDEFINED)&& (TotalTimeUnclaimedIdle > GLIDEIN_Max_Idle))||((Activity=="Idle") && (GLIDEIN_ToRetire =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToRetire )) || ((GLIDEIN_ToDie =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToDie )) || ((Activity=="Idle") && (TotalTimeUnclaimedIdle=!= UNDEFINED) && (TotalTimeClaimedBusy=!= UNDEFINED) && (GLIDEIN_Max_Tail=!= UNDEFINED) && (TotalTimeUnclaimedIdle > GLIDEIN_Max_Tail)) ) && ( (isUndefined(PartitionableSlot) =!= False) || (TotalSlots =?= 1) )

This won't keep the glidein around for 20 minutes after the last job. But I think the logic is correct to keep the glidein from evicting.

History

#1 Updated by Burt Holzman almost 8 years ago

  • Assignee changed from Brian Bockelman to Burt Holzman

I'll look at this.

#2 Updated by Burt Holzman almost 8 years ago

I think we need some help from the Condor team here to add a few ClassAd attributes we need to the partitionable slot. With this prescription, if the partitionable slot reclaims all the resources, DAEMON_SHUTDOWN never gets triggered. If "EnteredCurrentState" were reset whenever a subslot was created or reclaimed, that might be sufficient.

#3 Updated by Burt Holzman almost 8 years ago

Actually, let me correct that: with this prescription, if the partitionable slot reclaims all the resources and has been RUNNING for longer than GLIDEIN_Max_Idle, it will trigger. That's probably "close enough". Here's a slightly cleaner version:

DS_TODIE = ((GLIDEIN_ToDie =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToDie))
DS_IDLE_MAX = ((TotalTimeUnclaimedIdle =!= UNDEFINED) && (GLIDEIN_Max_Idle =!= UNDEFINED) && \
               (TotalTimeUnclaimedIdle > GLIDEIN_Max_Idle))
DS_IDLE_RETIRE = ((GLIDEIN_ToRetire =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToRetire ))
DS_IDLE_TAIL = ((TotalTimeUnclaimedIdle=!= UNDEFINED) && (TotalTimeClaimedBusy=!= UNDEFINED) && \
                (GLIDEIN_Max_Tail=!= UNDEFINED) && (TotalTimeUnclaimedIdle > GLIDEIN_Max_Tail))
DS_IDLE = ( (Activity == "Idle") &&  ($(DS_IDLE_MAX) || $(DS_IDLE_RETIRE) || $(DS_IDLE_TAIL)) )

DAEMON_SHUTDOWN = DS_TO_DIE || ($(DS_IDLE) && ((PartitionableSlot =!= True) || (TotalSlots =?=1)))

#4 Updated by Derek Weitzel almost 8 years ago

You are inconsistent with the naming of DS_TO_DIE vs DS_TODIE, but I believe the logic is good.
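
For reference, a sketch of the configuration from #3 with the macro name made consistent: DS_TODIE is used throughout and referenced through $(DS_TODIE). This is an assumption about the intended form, not necessarily the text that was merged; the comments are added for illustration.

```
# Sketch only: same logic as #3, with DS_TODIE spelled consistently
# and expanded via $(...) like the other macros.
DS_TODIE = ((GLIDEIN_ToDie =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToDie))
DS_IDLE_MAX = ((TotalTimeUnclaimedIdle =!= UNDEFINED) && (GLIDEIN_Max_Idle =!= UNDEFINED) && \
               (TotalTimeUnclaimedIdle > GLIDEIN_Max_Idle))
DS_IDLE_RETIRE = ((GLIDEIN_ToRetire =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToRetire))
DS_IDLE_TAIL = ((TotalTimeUnclaimedIdle =!= UNDEFINED) && (TotalTimeClaimedBusy =!= UNDEFINED) && \
                (GLIDEIN_Max_Tail =!= UNDEFINED) && (TotalTimeUnclaimedIdle > GLIDEIN_Max_Tail))
DS_IDLE = ((Activity == "Idle") && ($(DS_IDLE_MAX) || $(DS_IDLE_RETIRE) || $(DS_IDLE_TAIL)))

# Shut down at the hard deadline, or when one of the idle conditions holds
# and the slot is either not partitionable or is the only slot present.
DAEMON_SHUTDOWN = $(DS_TODIE) || ($(DS_IDLE) && ((PartitionableSlot =!= True) || (TotalSlots =?= 1)))
```

The final guard follows the reasoning in the thread: a partitionable slot always reports as idle, so the idle conditions are only honored when TotalSlots is 1, which the thread treats as meaning all resources have been reclaimed.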

#5 Updated by Burt Holzman almost 8 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Burt Holzman to Parag Mhashilkar

#6 Updated by Parag Mhashilkar almost 8 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Burt Holzman

I tested with regular jobs with the changes applied to 2.7 alpha. Derek tested the partitionable slot on ITB. No issues identified in either case so far.

Merged into respective branches
branch_v2plus_3399 -> branch_v2plus
master_3399 -> master

#7 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Resolved to Closed
