Factory should not release glideins sent to Condor CE when they are held because of expired proxies.
On Mar 23, 2016, at 11:04 PM, Brian Bockelman wrote:
When a frontend’s proxy expires, the condor-ce will automatically hold the job with an appropriate message indicating the issue.
The factory doesn’t appear to understand the problem and will just automatically condor_release the pilot — although it’s almost guaranteed this will not help (I can’t think of any reason why we would want the factory to condor_release a job for the condor-ce?). Once enough pilots are in this mode, the entry point will always be over the limit of held pilots until manual action is taken.
It’s almost certainly a factory bug. Can someone file an issue for me?
The Syracuse endpoint was in this precise error mode. It has been cleaned up on gfactory-1. Can someone fix it for the GOC factory?
#2 Updated by Parag Mhashilkar about 4 years ago
This is a shortcoming on the CondorCE/Schedd side. Currently, any job that is held at the schedd because it is held at the CondorCE does not have a valid hold code and subcode. I opened a ticket with the HTCondor team.
#5 Updated by HyunWoo Kim about 4 years ago
- Status changed from Feedback to Assigned
- Assignee changed from HyunWoo Kim to Parag Mhashilkar
I have reviewed the two changes in glideFactoryLib.py:
1. Inserting a new tuple in the list q_glidein_format_list is straightforward.
2. Modifying the function isGlideinUnrecoverable so that HoldReason is also considered when determining whether a glidein is unrecoverable is also straightforward.
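The two changes above can be sketched roughly as follows. This is a minimal illustration, not the actual glideFactoryLib.py code: the function name is_glidein_unrecoverable, the (attribute, type tag) shape of the tuples, the unrecoverable_codes parameter, and the sample hold reasons are all assumptions made for the example. Only the list names and the 'Failed to authenticate with any method' string come from this ticket.

```python
# Hypothetical sketch; names other than q_glidein_format_list and
# unrecoverable_reason_str are assumptions, not the real Factory code.

# Assumed shape: (ClassAd attribute, type tag) pairs requested from the schedd.
q_glidein_format_list = [
    ("JobStatus", "i"),
    ("HoldReasonCode", "i"),
    ("HoldReasonSubCode", "i"),
    ("HoldReason", "s"),  # change 1: also fetch the textual hold reason
]

# Substrings of HoldReason that mark a held glidein as unrecoverable.
unrecoverable_reason_str = ["Failed to authenticate with any method"]


def is_glidein_unrecoverable(job_info, unrecoverable_codes=None):
    """Return True if a held glidein should be treated as unrecoverable.

    job_info: dict of ClassAd attributes for one held glidein.
    unrecoverable_codes: optional {hold code: [subcodes]} mapping
        (hypothetical parameter; empty by default).
    """
    codes = unrecoverable_codes or {}
    code = job_info.get("HoldReasonCode")
    sub_code = job_info.get("HoldReasonSubCode")
    if code in codes and sub_code in codes[code]:
        return True
    # Change 2: fall back to the hold reason text, since jobs held at the
    # CondorCE may not carry a meaningful code and subcode.
    hold_reason = job_info.get("HoldReason") or ""
    return any(s in hold_reason for s in unrecoverable_reason_str)


# Illustrative input only; the hold reason text is made up for the example.
held = {
    "HoldReasonCode": 0,
    "HoldReasonSubCode": 0,
    "HoldReason": "Error from CE: Failed to authenticate with any method",
}
print(is_glidein_unrecoverable(held))
```

A glidein flagged this way would be left held (or removed) rather than passed to condor_release, which avoids the release loop described in the original report.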
There is nothing which needs to be improved or modified by me.
This ticket is ready for release.
I also think that the solution in this ticket (12052) can be applied to ticket 11491.
I can simply extend unrecoverable_reason_str = ['Failed to authenticate with any method']
with the additional error strings that AWS/CondorG returns to the Factory, which Steve/Burt observed.
Of course, this may only be a temporary solution until Todd Miller makes CondorG return non-zero codes and subcodes
for those errors that occur in AWS,
or the new code/subcode that Todd already implemented in 8.5.5 may be sufficient for our purpose.
I will have to read his email of Thursday, April 21, 2016 at 4:15 PM more carefully.