authentication problems between HA servers
Found today during testing: jobsub_hold, jobsub_release, and jobsub_remove can each land on any of the servers in the DNS round-robin group, where they run condor_hold, condor_release, or condor_rm with the following style of constraint:
condor_hold -l -name fermicloud383.fnal.gov -constraint '(Owner =?= "dbox") && (ClusterId == 1026) && (ProcId == 0)'
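For illustration, here is a minimal sketch of how a command line of this shape could be assembled. It is not the actual jobsub server code: the function names are hypothetical, and it assumes the intended comparison on ClusterId and ProcId is `==`. The `-l`, `-name` and `-constraint` options are copied from the command shown above.

```python
# Hypothetical helpers; names are illustrative, not taken from jobsub.
CONDOR_TOOLS = {
    "hold": "condor_hold",
    "release": "condor_release",
    "remove": "condor_rm",
}


def build_constraint(owner, cluster_id, proc_id):
    """Return a constraint that selects one job belonging to one owner.

    Assumption: ClusterId and ProcId are compared with '=='.
    """
    return '(Owner =?= "{0}") && (ClusterId == {1}) && (ProcId == {2})'.format(
        owner, int(cluster_id), int(proc_id)
    )


def build_command(action, schedd_name, owner, cluster_id, proc_id):
    """Return the argument list for the condor tool matching the action."""
    return [
        CONDOR_TOOLS[action],
        "-l",
        "-name", schedd_name,
        "-constraint", build_constraint(owner, cluster_id, proc_id),
    ]


if __name__ == "__main__":
    print(" ".join(build_command(
        "release", "fermicloud383.fnal.gov", "dbox", 1026, 0)))
```

Building the command as an argument list rather than one string means the constraint does not need shell quoting if it is later passed to something like `subprocess`.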
Unfortunately there is currently an authentication problem between servers. If the (hold, release, remove) command lands on the server that hosts the schedd that owns the job, the command works. If it lands on a different server, we get errors like this:
[dbox@fermicloud391 ~]$ echo $X509_USER_PROXY
[dbox@fermicloud391 ~]$ condor_release -l -name fermicloud383.fnal.gov -constraint '(Owner =?= "dbox") && (ClusterId == 1026) && (ProcId == 0)'
IsQueueSuperUser = false
job_1026_0 = 5
JobAction = 2
ServerTime = 1424210595
TotalJobAds = 225
ActionResultType = 1
CurrentTime = time()
ActionResult = 0
AUTHENTICATE:1004:Failed to authenticate using FS
Couldn't find/release all jobs matching constraint ((Owner =?= "dbox") && (ClusterId == 1026) && (ProcId == 0))
As a temporary workaround, we are going to force these commands to land on the correct server, i.e. the one that hosts the schedd that owns the job.
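The check behind that workaround could look roughly like the sketch below. This is only an illustration under stated assumptions, not the implemented fix: `schedd_host` and `must_forward` are hypothetical names, the schedd name is assumed to be either a bare hostname or of the form `name@host`, and how the request actually gets forwarded to the other server is left out.

```python
import socket


def schedd_host(schedd_name):
    """Return the host part of a schedd name.

    Assumption: the name is either a plain hostname or 'name@host'.
    """
    return schedd_name.rsplit("@", 1)[-1].lower()


def must_forward(schedd_name, local_host=None):
    """Return True when the job's schedd lives on a different server.

    local_host defaults to this machine's fully qualified name.
    """
    if local_host is None:
        local_host = socket.getfqdn()
    return schedd_host(schedd_name) != local_host.lower()


if __name__ == "__main__":
    # The failing case from this ticket: the command landed on
    # fermicloud391 but the job belongs to the schedd on fermicloud383.
    print(must_forward("fermicloud383.fnal.gov", "fermicloud391.fnal.gov"))
```

When `must_forward` returns True, the server that received the request would hand it to the host named by `schedd_host` instead of running the condor command itself.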