Bug #12052

Factory should not release glideins sent to Condor CE when they are held because of expired proxies.

Added by Parag Mhashilkar almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 03/24/2016
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

On Mar 23, 2016, at 11:04 PM, Brian Bockelman wrote:

Hi,

When a frontend’s proxy expires, the condor-ce will automatically hold the job with an appropriate message indicating the issue.

The factory doesn’t appear to understand the problem and will just automatically condor_release the pilot — although it’s almost guaranteed this will not help (I can’t think of any reason why we would want the factory to condor_release a job for the condor-ce?). Once enough pilots are in this mode, the entry point will always be over the limit of held pilots until manual action is taken.

It’s almost certainly a factory bug. Can someone file an issue for me?

The Syracuse endpoint was in this precise error mode. It has been cleaned up on gfactory-1. Can someone fix it for the GOC factory?

Brian

glidein_classad (5.55 KB) glidein_classad Parag Mhashilkar, 04/19/2016 01:31 PM

History

#1 Updated by Parag Mhashilkar almost 4 years ago

  • Description updated

#2 Updated by Parag Mhashilkar almost 4 years ago

This is a shortcoming on the CondorCE/Schedd side. Currently, a job that is held at the schedd because it is held at the CondorCE does not have a valid hold code and subcode. I opened a ticket with the HTCondor team:

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5640

#3 Updated by Parag Mhashilkar almost 4 years ago

#4 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Changes are in v3/12052 for review.

#5 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

I have reviewed the (two) changes in glideFactoryLib.py:
1. Inserting a new tuple into the list q_glidein_format_list is obvious.

2. Modifying the code (def isGlideinUnrecoverable) so that HoldReason is also considered when determining whether a glidein is unrecoverable is also obvious.
There is nothing that I need to improve or modify.
This ticket is ready for release.
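
For readers without access to the branch, the idea behind the second change can be sketched roughly as follows. This is an illustration only, not the code in glideFactoryLib.py: the function name, the constant names, the dictionary-shaped job record, and the empty code table are all assumptions made for the example. The only value taken from this ticket is the reason string 'Failed to authenticate with any method'. The sketch also assumes that HoldReason has already been fetched along with the other job attributes.

```python
# Hypothetical sketch; NOT the actual glideFactoryLib.py implementation.

# Substrings of HoldReason treated as unrecoverable. The single entry is the
# string quoted in this ticket; more could be appended as they are observed.
UNRECOVERABLE_REASON_STRS = ['Failed to authenticate with any method']

# Assumed placeholder mapping: hold code -> list of unrecoverable subcodes.
# Left empty because the real values are not given in this ticket.
UNRECOVERABLE_CODES = {}


def is_glidein_unrecoverable(job_info):
    """Return True if a held glidein should not be released again.

    job_info is assumed to be a dict of job attributes that may contain
    'HoldReasonCode', 'HoldReasonSubCode' and 'HoldReason'.
    """
    code = job_info.get('HoldReasonCode')
    subcode = job_info.get('HoldReasonSubCode')

    # Existing style of check: decide from the numeric code and subcode.
    if code in UNRECOVERABLE_CODES and subcode in UNRECOVERABLE_CODES[code]:
        return True

    # Additional check: fall back to the hold reason text, which helps when
    # the hold code and subcode carry no useful information.
    reason = job_info.get('HoldReason') or ''
    return any(s in reason for s in UNRECOVERABLE_REASON_STRS)
```

With a check of this shape, a held job whose hold code and subcode are both 0 but whose HoldReason contains the authentication failure text would be reported as unrecoverable, so the factory would have a basis for not releasing it.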

I also think that the solution in this ticket (12052) can be applied to ticket 11491.
I can simply add to unrecoverable_reason_str = ['Failed to authenticate with any method']
more error strings that AWS/CondorG returns to the Factory, as observed by Steve/Burt.
Of course, this would be a temporary solution until Todd Miller makes CondorG return non-zero codes and subcodes
for those errors that occur in AWS,
or until we confirm that the new code/subcode that Todd already implemented in 8.5.5 is sufficient for our purpose.
I will have to read his email of Thursday, April 21, 2016 at 4:15 PM more carefully.

#6 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Assigned to Resolved

#7 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

