Bug #11491

Factory should not release certain AWS glideins and use forcex

Added by Parag Mhashilkar over 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 05/26/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
First Occurred:
Occurs In:
Stakeholders: HEPCloud
Duration:

Description

Based on the lessons learned from HEPCloud, AWS glideins can end up in an infinite remove-hold-release-idle-hold spiral. We should collect the possible hold reasons/codes and use them in the factory.


Subtasks

Feature #12797: Factory sometimes needs to use forcex option for mostly cloud resources (Closed, HyunWoo Kim)

History

#1 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_2_13 to v3_2_14

#2 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim
  • Stakeholders updated (diff)

#3 Updated by HyunWoo Kim about 4 years ago

I reported this to an HTCondor developer (Todd Miller),
and he opened a ticket on their tracker for this issue; I am on the watch list:

New ticket #5628 created by user tlmiller.

Title: Set HoldReasonCode and HoldReasonSubCode to EC2 job holds.
URL: http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5628
Owner: tlmiller
Assigned to: tlmiller
Type: enhance
Status: new
Subsystem: Grid
Priority: 4
Broken Version:
Fixed Version:
Visibility: public
Due Date:
Rust:
Notify: , ,

Description:
Fermi requests that we set HoldReasonCode and HoldReasonSubCode for EC2 jobs when they go on hold, instead of just setting the HoldReason string.

#4 Updated by HyunWoo Kim about 4 years ago

The following is the latest message from Todd Miller:

<My reply:>
On 4/27/16, 2:42 PM, "Hyun Woo Kim" <hyunwoo@fnal.gov> wrote:
Thanks Todd, I believe now everything is clear.

Let me just list important points:
1. I will wait until you get a new build (still in 8.5.5) with the value changes (globally unique HoldReasonCodes) through.
2. When that is done, I will test the new binaries that you will put in your personal URL.
3. Even if we want more changes, they will have to wait until 8.5.6.

Here is what I am going to do for now:
Recently, Parag modified GWMS Factory code to be able to look at HoldReason strings when HoldReasonCodes and HoldReasonSubCodes are all zeros.
I think I can apply the same change to the AWS case too until HTCondor 8.5.5 (or 8.5.6, if the new HRCodes/HRSubCodes in 8.5.5 do not fully cover our situation) is available,
at which time we can rely on HRCodes and HRSubCodes instead of the HoldReason strings.
Thanks again,
HyunWoo
</My reply:>

<Email from ToddM:>
On 4/27/16, 2:20 PM, "Todd L Miller" <tlmiller@cs.wisc.edu> wrote:
>The best way to clarify this confusion would be for me to see what HR
>codes/subcodes are actually returned when we submit some test AWS jobs.

I've been told that what I did in the binaries I distributed earlier can't be released, because HoldReasonCodes must be globally unique.
My current plan is to increment the codes I sent previously, to make them globally unique.

> Could you please give me access to the new 8.5.5 binaries as well?

 Access isn't restricted, but I'd appreciate it if you didn't spread the URL around.
The HoldReasonCodes in these binaries will all be different numbers (37 higher, if I recall correctly) in the release:
I'll make new binaries available there once I get a build with the value changes through.

> If I discover (before the code freezes) that more HR subcodes should be
> added to the new list of subcodes (to meet our needs), can your team add more subcodes?

Regrettably, because of problems with releasing 8.5.4, the 8.5.5 code freeze was moved up to the end of this week.
I'll have some room to fiddle with the codes for maybe a week after that, but larger or additional changes would have to wait until 8.5.6.

> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=MagicNumbers

FYI, that's out of date. An up-to-date list of hold reason codes is available in the HTCondor manual:
http://research.cs.wisc.edu/htcondor/manual/v8.5/12_Appendix_A.html#100889
</Email from ToddM:>

#5 Updated by HyunWoo Kim about 4 years ago

  • % Done changed from 0 to 50

I upgraded condor on hepcloud-devfac with the private condor RPM files provided by Todd Miller.

When I did
yum localupdate condor-8.5.5-0.365495.el6.x86_64.rpm condor-classads-8.5.5-0.365495.el6.x86_64.rpm condor-procd-8.5.5-0.365495.el6.x86_64.rpm
I got
Error: Package: condor-8.5.5-0.365495.el6.x86_64 (/condor-8.5.5-0.365495.el6.x86_64)
Requires: condor-external-libs(x86-64) = 8.5.5-0.365495.el6

So, I downloaded condor-external-libs-8.5.5-0.365495.el6.x86_64.rpm as well,
but, according to Steve, I cannot localupdate all four of these at the same time, so I ran:

yum localinstall condor-external-libs-8.5.5-0.365495.el6.x86_64.rpm
yum localupdate condor-8.5.5-0.365495.el6.x86_64.rpm condor-classads-8.5.5-0.365495.el6.x86_64.rpm condor-procd-8.5.5-0.365495.el6.x86_64.rpm

Now, the question is how to reproduce these 3 or 4 errors:

According to Steve, these errors were observed only from spot instances.
Also per Steve, t1.micro is the only free instance type supported as a spot instance,
with the condition that I need to select a paravirtualized image.
All our Fermi images are exclusively HVM,
so I need to use a paravirtualized image provided by Amazon.
I will then submit around 500 t1.micro instances in the RnD account and see whether
this new condor provides the proper ReasonCode and ReasonSubCode for EC2-related errors.

#6 Updated by HyunWoo Kim about 4 years ago

After I upgraded condor to 8.5.5 in the Factory, I discovered that the Factory does not receive classads from the Frontend properly.
I upgraded condor in the Frontend too.
In the Frontend, the new condor (8.5.5) seems OK, but the Frontend would not start properly.

The error comes from glideFrontendLib.py when the Frontend tries to contact the schedd.
The schedd returns the query result in XML,
and the Frontend fails to parse the XML from the schedd.

The XML is

<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<131.225.154.98:9615?addrs=131.225.154.98-9615&noUDP&sock=23761_f653_6>
/var/lib/condor/spoolhepcloud-devfe.fnal.gov
<131.225.154.98:9615?addrs=131.225.154.98-9615&noUDP&sock=23761_f653_6>
/var/lib/condor/spoolhepcloud-devfe.fnal.gov
</classads>

Our conclusion is that a new bug seems to have been introduced in condor 8.5.5:
the XML data from the schedd contains unescaped strings that an XML parser reads as unclosed tags.
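
A minimal standalone reproduction of the failure with the fragment above (plain Python standard library, not the actual GWMS/Frontend parsing code):

import xml.etree.ElementTree as ET

# The schedd address line in the output above is not XML-escaped, so the
# unescaped '&' and the '<131.225...>' pseudo-tag make the document ill-formed.
snippet = (
    '<?xml version="1.0"?>\n'
    '<classads>\n'
    '<131.225.154.98:9615?addrs=131.225.154.98-9615&noUDP&sock=23761_f653_6>\n'
    '/var/lib/condor/spoolhepcloud-devfe.fnal.gov\n'
    '</classads>\n'
)

try:
    ET.fromstring(snippet)
except ET.ParseError as err:
    print("XML parse error, as the Frontend sees it:", err)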

According to Marco, not many people use the XML output from the schedd.
Also, no one has tried condor 8.5.5 with GWMS yet.

Tomorrow, I will send an email to Todd Miller and let him know about this error
and I will also ask Parag if he knows anything about this.

When/if this bug is fixed and I can go ahead with my testing plan, I will proceed to modify my Factory configuration (as I described in the note I wrote this morning)
to use spot instances and launch on the order of 500 jobs in AWS,
and see whether the new condor properly sets HoldReasonCodes and HoldReasonSubCodes for the errors that Steve/Burt
observed, such as "spot instance request ID does not exist", "request limit exceeded", and "job cancel did not succeed after 3 tries".

In the meantime, namely before we have confirmed that the new condor version handles code/subcode properly,
I can adopt the recipe that Parag devised for issue 12052.
This modification will be pretty easy: I will just have to add the error strings shown in the previous paragraph to the list of strings
that Parag added to the isGlideinUnrecoverable method in glideFactoryLib.py.
This way, when the Factory sees held jobs whose code and subcode are both zero and whose HoldReason string is found in the list,
the Factory can decide not to try to recover these held jobs and instead issue the condor_rm command with the forcex option.
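
As a rough sketch of that check (a hypothetical standalone helper, for illustration only; in the real code the test lives inside isGlideinUnrecoverable in glideFactoryLib.py, and jobInfo is the job classad dictionary the Factory already has):

unrecoverable_reason_str = [
    'Failed to authenticate with any method',
    'Job cancel did not succeed after 3 tries',
    'The spot instance request ID does not exist',
    'Request limit exceeded',
]

def holdReasonLooksUnrecoverable(jobInfo):
    # Hypothetical helper: fall back to HoldReason string matching
    # only when both numeric codes are 0.
    code = jobInfo.get('HoldReasonCode', 0)
    subcode = jobInfo.get('HoldReasonSubCode', 0)
    reason = jobInfo.get('HoldReason', '')
    if code == 0 and subcode == 0:
        return any(s in reason for s in unrecoverable_reason_str)
    return False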

More later.

#7 Updated by HyunWoo Kim about 4 years ago

I talked with Parag, and his assessment of the current situation is as follows:

<131.225.154.98:9615?addrs=131.225.154.98-9615+[--1]-9615&noUDP&sock=79034_a48e_6>/var/lib/condor/spoolhepcloud-devfe.fnal.gov<131.225.154.98:9615?addrs=131.225.154.98-9615+[--1]-9615&noUDP&sock=79034_a48e_6>/var/lib/condor/spoolhepcloud-devfe.fnal.gov</classads>

suggests that shared port is being used.
He suggested looking at [root@hepcloud-devfe config.d]# less /etc/condor/config.d/01_gwms_collectors.config
to see whether COLLECTOR_USES_SHARED_PORT is set.
I found that COLLECTOR_USES_SHARED_PORT=False,
so it is not a shared port issue.

But the following evidence suggests that the install of the new condor 8.5.5 must be wrong:

[root@hepcloud-devfe ~]# condor_status -schedd -xml -constraint 'Name=?="hepcloud-devfe.fnal.gov"'
shows only the following 3 attributes:

<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
<c>
    <a n="Name"><s>hepcloud-devfe.fnal.gov</s></a>
    <a n="MyType"><s>Scheduler</s></a>
    <a n="Machine"><s>hepcloud-devfe.fnal.gov</s></a>
</c>
</classads>

I installed my original condor from OSG,
but Todd Miller's condor RPMs are not OSG builds.

Parag suggested installing all of the RPMs in http://pages.cs.wisc.edu/~tlmiller/gt-5561/,
which I did, but it does not solve the problem yet.

So, my conclusion at this point is that
I should reinstall GWMS on hepcloud-devfe.fnal.gov from the GWMS tarball in order to remove the osg-condor dependency from GWMS,
and then install http://pages.cs.wisc.edu/~tlmiller/gt-5561/ again.

Also, the above is the long-term solution,
but I should also work on the short-term solution, which is adopting Parag's recipe from ticket 12052
and adding more HoldReason strings to the new list of strings,
so that the errors Steve/Burt observed whose HR codes/subcodes are zero can be used properly by the Factory
to determine unrecoverable held jobs and not try to release them.

#8 Updated by HyunWoo Kim about 4 years ago

1. Testing Todd Miller's 8.5.5 binaries: since Marco at CERN and I observe a similar phenomenon (regarding condor_status -xml), I reported this to Todd Miller.

2. In the meantime, I created a v3/11491 branch off the v3/12052 branch and just added 3 strings to

unrecoverable_reason_str = ['Failed to authenticate with any method', 'Job cancel did not succeed after 3 tries', 'The spot instance request ID does not exist', 'Request limit exceeded']

in glideFactoryLib.py

Maybe Parag can review this?
Before I put this under feedback, I will think it over a bit more.

One thing to note: I should wait for Parag to merge 12052 into branch_v3_2 before I try.

#9 Updated by HyunWoo Kim about 4 years ago

The current task for this ticket is modifying the Factory logic so that it:
- executes a condor_rm command against an "unrecoverable" job
- waits 20 minutes
- then finally executes condor_rm, this time with the --forcex option, against the same job.

Tonight I reviewed the relevant Factory logic in order to determine
where I should put the above new logic.

The basic structure of the code is as follows:
glideFactoryEntry.check_and_perform_work() calls
glideFactoryEntry.unit_work_v3(), which calls
glideFactoryEntry.perform_work_v3(), which eventually calls glideFactoryLib.keepIdleGlideins().

glideFactoryLib.keepIdleGlideins() then either
calls sanitizeGlideins() if there are too many held glideins,
or calls clean_glidein_queue() under some conditions,
or otherwise calls submitGlideins() to run the condor_submit command.

So, my current assessment is that I should modify
sanitizeGlideins() such that,
in the case of unrecoverable_held_list, it calls a new removeGlideins_new() method,
and I should define this new removeGlideins_new() method
such that the logic first checks whether a given jid is found in a new list:
if not, this jid is put in the new list with a timestamp and condor_rm is issued against this jid;
if this jid is found in the list, the logic should compute the time difference
and see whether it is greater than our threshold, which is 20 minutes.
If the time difference is greater than 20 minutes, we issue condor_rm --forcex against this jid
and remove this jid from the special list.

I need to elaborate on this logic more carefully.

#10 Updated by HyunWoo Kim about 4 years ago

My current thought is that I should clone the removeGlideins function and create
a new function that will be called only for "unrecoverable" held jobs.

The algorithm in this new function can be as simple as this:
- There has to be a global list of removed jobs, hkremove_jids, a list of (jid, timestamp) tuples.
- When I try to remove a jid, first check whether this jid is found in hkremove_jids.
- If this jid is found in hkremove_jids, compare the current timestamp with the timestamp in the jid's (jid, timestamp) entry.
  If the time difference is greater than 20 minutes, run
  condorManager.condorRemoveOne("%li.%li" % (jid0, jid1), schedd_name, do_forcex=True)
  If the time difference is less than 20 minutes, skip this iteration; we will condor_rm --forcex this jid in a later cycle.
- If this jid is NOT found in hkremove_jids,
  run condorManager.condorRemoveOne("%li.%li" % (jid0, jid1), schedd_name) without --forcex,
  put this jid in hkremove_jids as (jid, timestamp),
  and then go to the next iteration.

The question is whether we can have a global list.
I need to talk with Parag tomorrow.
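
A minimal sketch of this idea, assuming a module-level dict (keyed by jid, which is simpler to look up than a list of tuples) is acceptable; removeGlideins_new, hkremove_jids, the 20-minute threshold, and the condorRemoveOne calls come from the notes above, while the import line, the missing logging, and the missing error handling are simplifications:

import time

# Assuming the usual GWMS layout; in glideFactoryLib.py condorManager is already imported.
from glideinwms.lib import condorManager

# Module-level bookkeeping of glideins we already tried to condor_rm: jid -> timestamp
hkremove_jids = {}

FORCEX_DELAY = 20 * 60  # 20 minutes, in seconds

def removeGlideins_new(schedd_name, jid_list):
    # Sketch only: remove unrecoverable held glideins, escalating to
    # condor_rm --forcex if a plain condor_rm did not get rid of the job
    # within FORCEX_DELAY seconds.
    now = time.time()
    for (jid0, jid1) in jid_list:
        jid = (jid0, jid1)
        if jid in hkremove_jids:
            if now - hkremove_jids[jid] > FORCEX_DELAY:
                # Plain condor_rm was issued more than 20 minutes ago: force the removal
                condorManager.condorRemoveOne("%li.%li" % (jid0, jid1), schedd_name, do_forcex=True)
                del hkremove_jids[jid]
            # else: do nothing now; this jid will be forcex-ed in a later iteration
        else:
            # First time we see this jid: plain condor_rm, and remember when we asked
            condorManager.condorRemoveOne("%li.%li" % (jid0, jid1), schedd_name)
            hkremove_jids[jid] = now

Note that this keeps the state only in the Factory process memory, which is part of the "can we have a global list" question for Parag.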

#11 Updated by Parag Mhashilkar about 4 years ago

  • Target version changed from v3_2_14 to v3_2_15

#12 Updated by HyunWoo Kim about 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

I discussed this issue further with Parag
and suggested splitting it into 2 tasks:
the original 11491 takes care of "should not release certain AWS glideins"
and the new task takes care of "use forcex".

In the v3/11491 branch, I modified factory/glideFactoryLib.py to add just 3 more strings, as shown below:

[hyunwoo@hepcloud-devfe glideinwms]$ git diff v3/11491 branch_v3_2
diff --git a/factory/glideFactoryLib.py b/factory/glideFactoryLib.py
index 91c94c3..748f149 100644
--- a/factory/glideFactoryLib.py
+++ b/factory/glideFactoryLib.py

@@ -1572,8 +1572,7 @@ def isGlideinUnrecoverable(jobInfo, factoryConfig=None):
                                22, 27, 28, 31, 37, 47, 48,
                                72, 76, 81, 86, 87,
                                121, 122 ]}
-    # adding 3 more reasons that were observed that have zeros for both HoldReasonCode/SubCode
-    unrecoverable_reason_str = ['Failed to authenticate with any method', 'Job cancel did not succeed after 3 tries',
-    'The spot instance request ID does not exist', 'Request limit exceeded']

+    unrecoverable_reason_str = ['Failed to authenticate with any method']

I am assigning this to Parag for feedback because this ticket is almost identical to 12052, which Parag recently worked on.

#13 Updated by Parag Mhashilkar about 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim
  • Target version changed from v3_2_15 to v3_2_14
Changes look OK. Before you merge and resolve this one:
  • Open a new ticket for forcex to be addressed in v3_2_15
  • Update this ticket with the ticket number

#14 Updated by HyunWoo Kim about 4 years ago

I already created a subtask (12797) off this one (11491) before I tossed this one to you for feedback.
I set the target version of 12797 to v3_2_15 just now.

#15 Updated by HyunWoo Kim about 4 years ago

  • Status changed from Feedback to Resolved

I merged this into branch_v3_2 and pushed.
Setting the status to resolved.

#16 Updated by Parag Mhashilkar about 4 years ago

  • Status changed from Resolved to Closed

