Bug #12052

Factory should not release glideins sent to Condor CE when they are held because of expired proxies.

Added by Parag Mhashilkar about 4 years ago. Updated almost 4 years ago.

Parag Mhashilkar

On Mar 23, 2016, at 11:04 PM, Brian Bockelman wrote:


When a frontend’s proxy expires, the condor-ce will automatically hold the job with an appropriate message indicating the issue.

The factory doesn’t appear to understand the problem and will just automatically condor_release the pilot — although it’s almost guaranteed this will not help (I can’t think of any reason why we would want the factory to condor_release a job for the condor-ce?). Once enough pilots are in this mode, the entry point will always be over the limit of held pilots until manual action is taken.

It’s almost certainly a factory bug. Can someone file an issue for me?

The Syracuse endpoint was in this precise error mode. It has been cleaned up on gfactory-1. Can someone fix it for the GOC factory?


glidein_classad (5.55 KB) glidein_classad Parag Mhashilkar, 04/19/2016 01:31 PM


#1 Updated by Parag Mhashilkar about 4 years ago

  • Description updated (diff)

#2 Updated by Parag Mhashilkar about 4 years ago

This is a shortcoming on the CondorCE/Schedd side. Currently, any job that is held at the schedd because it is held at CondorCE does not have a valid hold code and subcode. I opened a ticket with the HTCondor team.
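
To illustrate the shortcoming, here is a minimal Python sketch; it is not GlideinWMS code. HoldReasonCode, HoldReasonSubCode and HoldReason are standard HTCondor job attributes, but the lookup table, the function name classify_by_codes, and the assumption that the missing codes show up as 0 are all hypothetical.

```python
# Hypothetical table keyed by (HoldReasonCode, HoldReasonSubCode).
# The pair (99, 1) is a made-up placeholder, not a real HTCondor code.
KNOWN_HOLD_CODES = {(99, 1): "unrecoverable"}


def classify_by_codes(job_ad, known=KNOWN_HOLD_CODES):
    """Classify a held job using only its numeric hold codes.

    Returns None when the codes match no known entry.
    """
    key = (job_ad.get("HoldReasonCode", 0), job_ad.get("HoldReasonSubCode", 0))
    return known.get(key)


# Assumption: a job held at the schedd on behalf of the CE carries no
# meaningful codes (modelled here as 0/0), only a HoldReason text.
ce_held_job = {
    "HoldReasonCode": 0,
    "HoldReasonSubCode": 0,
    "HoldReason": "hypothetical text: proxy expired",
}
print(classify_by_codes(ce_held_job))
```

Under these assumptions the lookup finds nothing for the CE-held job, so any logic that relies on the codes alone cannot tell it apart from a hold that is worth releasing.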

#3 Updated by Parag Mhashilkar about 4 years ago

#4 Updated by Parag Mhashilkar about 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Changes are in v3/12052 for review.

#5 Updated by HyunWoo Kim about 4 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

I have reviewed the (two) changes in
1. Inserting a new tuple in the list q_glidein_format_list is obvious.

2. Modifying the code (def isGlideinUnrecoverable) so that HoldReason is also considered in determining if a glidein is unrecoverable is also obvious.
There is nothing that needs to be improved or modified by me.
This ticket is ready for release.
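
For readers without access to the branch, the two reviewed changes can be pictured roughly as follows. This is a hedged Python sketch, not the code in v3/12052: the contents of the format list, the assumption that the new tuple is ("HoldReason", "s"), the function name is_glidein_unrecoverable_sketch and its parameters are all hypothetical. Only the names q_glidein_format_list, HoldReason and unrecoverable_reason_str, and the string 'Failed to authenticate with any method', come from this ticket.

```python
HELD = 5  # HTCondor JobStatus value for a held job

# Change 1 (sketch): (attribute, type) pairs requested for each glidein.
# The first three entries are assumptions; the last one stands in for the
# newly inserted tuple.
q_glidein_format_list = [
    ("JobStatus", "i"),
    ("HoldReasonCode", "i"),
    ("HoldReasonSubCode", "i"),
    ("HoldReason", "s"),
]

# Substrings of HoldReason that mark a hold as not worth releasing.
unrecoverable_reason_str = ['Failed to authenticate with any method']


def is_glidein_unrecoverable_sketch(job_ad, unrecoverable_codes=None):
    """Change 2 (sketch): decide whether a held glidein should be released.

    unrecoverable_codes is a hypothetical dict mapping a HoldReasonCode to
    the set of HoldReasonSubCode values considered unrecoverable.
    """
    unrecoverable_codes = unrecoverable_codes or {}
    if job_ad.get("JobStatus") != HELD:
        return False
    code = job_ad.get("HoldReasonCode")
    subcode = job_ad.get("HoldReasonSubCode")
    if code in unrecoverable_codes and subcode in unrecoverable_codes[code]:
        return True
    # Fall back to the hold reason text when the codes are not informative.
    reason = job_ad.get("HoldReason") or ""
    return any(s in reason for s in unrecoverable_reason_str)
```

A factory loop would then skip condor_release for any glidein where this check returns True, and remove or report it instead.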

I just think that the solution in this ticket (12052) can also be applied to ticket 11491.
I can simply add to unrecoverable_reason_str = ['Failed to authenticate with any method']
more error strings that AWS/CondorG returns to the Factory, as observed by Steve/Burt.
Of course, this would be a temporary solution until Todd Miller makes CondorG return non-zero codes and subcodes
for those errors that occur in AWS.
Alternatively, the new code/subcode that Todd already implemented in 8.5.5 may be sufficient for our purpose.
I will have to read his email of Thursday, April 21, 2016 at 4:15 PM more carefully.

#6 Updated by Parag Mhashilkar about 4 years ago

  • Status changed from Assigned to Resolved

#7 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from Resolved to Closed
