Project

General

Profile

Feature #12797

Bug #11491: Factory should not release certain AWS glideins and use forcex

Factory sometimes need to use forcex option for mostly cloud resources

Added by HyunWoo Kim over 3 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
05/26/2016
Due date:
% Done:

0%

Estimated time:
Stakeholders:
Duration:

Description

We decided to split the original 11491 into 2 tasks
Original 11491 takes care of "should not release certain AWS glideins"
and the new task takes care of "use forcex"

History

#1 Updated by HyunWoo Kim over 3 years ago

  • Target version set to v3_2_15

#2 Updated by HyunWoo Kim over 3 years ago

I have reviewed glideFactoryLib.py again and came up with the following possible solutions:
1.
If EnteredCurrentStatus time is greater than 20 minutes,
and current status is 5(held)
and isGlideinUnrecoverable() returns True,
condor_rm --forcex
This will cover unrecoverable category immediately.

2.A
If we really want to condor_rm --forcex against any jobs(running, idle or held) that the first condor_rm failed to kill,
we can do
If EnteredCurrentStatus time is greater than 60 minutes,
and current status is 5(held)
This will kill any held jobs that are held more than one hours..

2.B
Or obviously, if I can find the magic HoldReasonCode that corresponds to "some number that indicates Failed to condor_rm", I can use that
as follows
If current status is 5(held)
and HoldReasonCode is "some number that indicates Failed to condor_rm",

I think I can easily implement this idea in my v3/12797 branch

#3 Updated by HyunWoo Kim over 3 years ago

I am about to modify the branch v3/12797 to implement the solution
and before I actually do that, I am trying to summarize the idea here one more time:


1. The most important point here is, if we use condor_rm for any (running, idle or held) jobs and the killing attempt fails,
    this result should be handled by isGlideinUnrecoverable().
   Once again, we need to find out how we are going to identify those jobs that condor failed to kill using condor_rm command.
   And put this new condition in isGlideinUnrecoverable
   Until I can find the hidden magic HRCode, I will just have to assume this failed-to-kill jobs will be in 3 or 5 state indefinitely..
   my temporary suggestion for one more condition to add in def isGlideinUnrecoverable() is would    be:

   if (JobStatus == 3) or (JobStatus == 5):

2. Next point is, we have to (or can) add a new code in sanitizeGlideins() and possibly clean_glide_queue() as well
   We already have the follign code in def sanitizeGlideins()
   unrecoverable_held_list = extractUnrecoverableHeld(condorq)
   removeGlideins(condorq.schedd_name, unrecoverable_held_forcex_list, force=False,

   We just have to add a new code in def sanitizeGlideins()
   unrecoverable_held_forcex_list = extractUnrecoverableHeldForceX(condorq)
   removeGlideins(condorq.schedd_name, unrecoverable_held_forcex_list, force=True,

Assuming this will cover our new ideas of trying to start using forcex, I can easily modify glideFactoryLib.py.

#4 Updated by HyunWoo Kim over 3 years ago

I have modified one file glideFactoryLib.py

I had to also modify various places in the same file, in order to simulate a real situation.

The test looks promising except for one mysterious thing.

In isGlideinUnrecoverable(), JobInfo("HoldReason") is None
when condor_q -g -long | grep HoldReason apparently shows

HoldReason = "\n<Response><Errors><Error><Code>UnauthorizedOperation</Code><Message>You are not authorized to perform this operation.</Message></Error></Errors><RequestID>52fd87e3-e0a6-4934-84d8-c0f23f590939</RequestID></Response>"

I don't know, it took me one hour to discover this anomaly.

I need to debug this on Monday..

#5 Updated by HyunWoo Kim over 3 years ago

  • Status changed from New to Feedback
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

Hi Parag,

As I explained this morning, I have been modifying glideFactoryLib.py.
I have cleaned up the code so that you can have an initial review.
I want this code to be reviewed now because of another change in removeGlideins() method.

As you will be able to see, I had to change the order of if statements otherwise, force=True was never executed.

#6 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim
  • Make sure that code formatting and indentation is correct. I noticed left over comment that is not indented correctly
  • I do not necessarily agree with the need to reorder the code as you say and would like to talk to you to understand it better.

#7 Updated by HyunWoo Kim over 3 years ago

The basic idea is same:
- existing code in sanitizeGlideins() will use condor_rm WITHOUT forcex against a unrecoverable held glidein in each iteration (every one minute)
This will increase NumSystemHolds.
This also updates EnteredCurrentStatus each iteration, which is why we can not use EnteredCurrentStatus to see if an unrecoverable held glidein is 20 minutes old.
For this purpose, we now use NumSystemHolds.

- new code in sanitizeGlideins() uses condor_rm with forcex againt list of unrecoverable held glideins.
But important finding was, this new code should run before the existing code (described above),
otherwise, the existing code will put a glidein on X state and thus the new code's condor_rm will fail.

I am putting this ticket under feedback assigned to Parag.

#8 Updated by HyunWoo Kim over 3 years ago

  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

I forgot to change "Assignee" to Parag

#9 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Unless I am missing something, I see an issue with your change. unrecoverable_held_list will have all the unrecoverable held glideins, including those that are held for more than 20 mins. Second call to the removeGlideins() should be called for jobs that are in (unrecoverable_held_list - unrecoverable_held_forcex_list)

    # Check if there are held glideins that are not recoverable AND held for more than 20 minutes
    unrecoverable_held_forcex_list = extractUnrecoverableHeldForceX(condorq)
    if len(unrecoverable_held_forcex_list) > 0:
        glideins_sanitized = 1
        log.warning("Found %i unrecoverable held glideins that have been held for over 20 minutes" 
                    % len(unrecoverable_held_forcex_list))
        removeGlideins(condorq.schedd_name, unrecoverable_held_forcex_list,
                       force=True, log=log, factoryConfig=factoryConfig)

    # Check if there are held glideins that are not recoverable
    unrecoverable_held_list = extractUnrecoverableHeldSimple(condorq)
    if len(unrecoverable_held_list) > 0:
        glideins_sanitized = 1
        log.warning("Found %i unrecoverable held glideins" % len(unrecoverable_held_list))
        removeGlideins(condorq.schedd_name, unrecoverable_held_list,
                       force=False, log=log, factoryConfig=factoryConfig)

Also you are not really looking at glideins held for more than 20 min but instead, glideins with numsysholds > 20. So you should change the variable/function names to reflect it to avoid confusion.

#10 Updated by HyunWoo Kim over 3 years ago

  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

1. I inserted the following in def extractUnrecoverableHeldSimple()

and not isGlideinHeldTwentyIterations(el, factoryConfig=factoryConfig)

to meet this requirement from you:
Second call to the removeGlideins() should be called for jobs that are in (unrecoverable_held_list - unrecoverable_held_forcex_list)

2. and changed the names that indicate twenty minutes to new ones to indicate twenty iterations.

3. And did a test.

Full diff results:

diff --git a/factory/glideFactoryLib.py b/factory/glideFactoryLib.py
index ec02bd1..2518407 100644
--- a/factory/glideFactoryLib.py
+++ b/factory/glideFactoryLib.py
@@ -989,14 +989,15 @@ def extractStaleSimple(q, factoryConfig=None):
 def extractUnrecoverableHeldSimple(q, factoryConfig=None):
     #  Held==5 and glideins are not recoverable
-    qheld = q.fetchStored(lambda el:(el["JobStatus"] == 5 and isGlideinUnrecoverable(el, factoryConfig=factoryConfig)))
+    qheld = q.fetchStored(lambda el:(el["JobStatus"] == 5 and isGlideinUnrecoverable(el, factoryConfig=factoryConfig)
+                                     and not isGlideinHeldTwentyIterations(el, factoryConfig=factoryConfig)))
     qheld_list = qheld.keys()
     return qheld_list

 def extractUnrecoverableHeldForceX(q, factoryConfig=None):
     #  Held==5 and glideins are not recoverable AND been held for more than 20 minutes
     qheld = q.fetchStored(lambda el:(el["JobStatus"] == 5 and isGlideinUnrecoverable(el, factoryConfig=factoryConfig)
-                                     and isGlideinHeldTwentyMins(el, factoryConfig=factoryConfig)))
+                                     and isGlideinHeldTwentyIterations(el, factoryConfig=factoryConfig)))
     qheld_list = qheld.keys()
     return qheld_list
@@ -1624,10 +1625,10 @@ def isGlideinUnrecoverable(jobInfo, factoryConfig=None):
     return unrecoverable

-def isGlideinHeldTwentyMins(jobInfo, factoryConfig=None):
+def isGlideinHeldTwentyIterations(jobInfo, factoryConfig=None):
     """ 
     This function looks at the glidein job's information and returns if the
-    CondorG job is held for more than 20 minutes
+    CondorG job is held for more than 20 iterations

     This is useful to remove Unrecoverable glidein (CondorG job) with forcex option.

@@ -1640,13 +1641,12 @@ def isGlideinHeldTwentyMins(jobInfo, factoryConfig=None):
     if factoryConfig is None:
         factoryConfig = globals()['factoryConfig']

-    greater_than_twenty_minutes = False
+    greater_than_twenty_iterations = False
     nsysholds  = jobInfo.get('NumSystemHolds')
-#    log.info("Unrecoverble Held for NumSystemHolds = %d minutes" % nsysholds )
     if nsysholds > 19:
-        greater_than_twenty_minutes = True
+        greater_than_twenty_iterations = True

-    return greater_than_twenty_minutes
+    return greater_than_twenty_iterations

#11 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

I do not see a reason to change the function extractUnrecoverableHeldSimple(). This function as the name suggests should return all unrecoverable held glideins and should be reverted back. As mentioned in my feedback earlier you should call removeGlideins() on (unrecoverable_held_list - unrecoverable_held_forcex_list)

Also, more I look at the code, you may want to rename isGlideinHeldTwentyIterations(jobInfo, factoryConfig=None) to isGlideinHeldNTimes(jobInfo, factoryConfig=None, n=20) and make it more generic. This way you can remove the hard coding in "if nsysholds > 19"

Also please fix the comments/code docs referring to 20 min

#12 Updated by HyunWoo Kim over 3 years ago

  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

1. In def sanitizeGlideins

    unrecoverable_held_list = extractUnrecoverableHeldSimple(condorq)
    if len(unrecoverable_held_list) > 0:
        glideins_sanitized = 1
        log.warning("Found %i unrecoverable held glideins" % len(unrecoverable_held_list))
        unrecoverable_held_list_minus_forcex = list( set(unrecoverable_held_list) - set(unrecoverable_held_forcex_list) )
    log.warning("But removing only %i (unrecoverable held - unrecoverable held forcex)" 
                    % len(unrecoverable_held_list_minus_forcex))
        removeGlideins(condorq.schedd_name, unrecoverable_held_list_minus_forcex,
               force=False, log=log, factoryConfig=factoryConfig)

2. reverted the change in def extractUnrecoverableHeldSimple(q, factoryConfig=None):

3. clean up relevant comments to replace 20 minuts by 20 interations

Tested in my local instance.

#13 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Looks ok to merge.

#14 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Feedback to Closed

Merged.



Also available in: Atom PDF