Bug #7893

authentication problems between HA servers

Added by Dennis Box over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 02/17/2015
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Found today during testing: jobsub_hold, jobsub_release, and jobsub_remove can each land on any of the servers in the DNS round-robin group, where they run condor_hold, condor_release, or condor_rm with the following style of constraint:
condor_hold -l -name fermicloud383.fnal.gov -constraint '(Owner =?= "dbox") && (ClusterId == 1026) && (ProcId == 0)'
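A minimal sketch of how a constraint of this style might be assembled. The function names `build_constraint` and `build_command` are hypothetical and are not taken from the jobsub source; the sketch also assumes the ClusterId and ProcId comparisons use `==`.

```python
def build_constraint(owner, cluster_id=None, proc_id=None):
    """Return a ClassAd constraint selecting one owner's job(s).

    Hypothetical helper for illustration; assumes '==' for the numeric tests.
    """
    clauses = ['(Owner =?= "%s")' % owner]
    if cluster_id is not None:
        clauses.append("(ClusterId == %d)" % cluster_id)
    if proc_id is not None:
        clauses.append("(ProcId == %d)" % proc_id)
    return " && ".join(clauses)


def build_command(action, schedd_name, constraint):
    """Return the argument list for condor_hold, condor_release, or condor_rm."""
    # Flags mirror the command line shown in this report.
    return [action, "-l", "-name", schedd_name, "-constraint", constraint]


if __name__ == "__main__":
    constraint = build_constraint("dbox", 1026, 0)
    print(" ".join(build_command("condor_hold", "fermicloud383.fnal.gov", constraint)))
```

Passing the arguments as a list avoids shell quoting problems with the double quotes and `&&` inside the constraint.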
Unfortunately, there is currently an authentication problem between servers. If the (hold, release, remove) command lands on the server that has the schedd that owns the job, the command works. If it lands on a different server, we get errors like this:
[dbox@fermicloud391 ~]$ echo $X509_USER_PROXY
/var/lib/jobsub/creds/proxies/nova/x509cc_dbox_Analysis
[dbox@fermicloud391 ~]$ condor_release -l -name fermicloud383.fnal.gov -constraint '(Owner =?= "dbox") && (ClusterId == 1026) && (ProcId == 0)'
IsQueueSuperUser = false
job_1026_0 = 5
JobAction = 2
ServerTime = 1424210595
TotalJobAds = 225
ActionResultType = 1
CurrentTime = time()
ActionResult = 0

AUTHENTICATE:1004:Failed to authenticate using FS
Couldn't find/release all jobs matching constraint ((Owner =?= "dbox") && (ClusterId == 1026) && (ProcId == 0))

As a temporary workaround, we are going to force these commands to land on the correct server.
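One way the workaround could look, as a sketch only: before running the condor command, check whether the schedd named in the request lives on the local host, and if not, report which server the request should be sent to instead. The helper names `schedd_host`, `is_local_schedd`, and `target_server` are hypothetical, and how the request is actually forwarded is left out because this report does not describe it.

```python
import socket


def schedd_host(schedd_name):
    """Return the host part of a schedd name ('name@host' or plain 'host')."""
    return schedd_name.rsplit("@", 1)[-1].lower()


def is_local_schedd(schedd_name, local_host=None):
    """True if the schedd runs on this machine (compared by host name)."""
    local = (local_host or socket.getfqdn()).lower()
    return schedd_host(schedd_name) == local


def target_server(schedd_name, local_host=None):
    """Return None to run the command locally, else the host to forward to."""
    if is_local_schedd(schedd_name, local_host):
        return None
    return schedd_host(schedd_name)


if __name__ == "__main__":
    # A request for a schedd on fermicloud383 arriving at fermicloud391.
    print(target_server("fermicloud383.fnal.gov", "fermicloud391.fnal.gov"))
```

Comparing host names is the simplest possible check; a real deployment might need to handle aliases or short host names as well.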

History

#1 Updated by Dennis Box over 4 years ago

  • Status changed from New to Resolved

Merged back into master, tagged v1.1.0.1rc1, and released to the Redmine file area and /grid/fermiapp/products, where it is 'current'.

#2 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

