Bug #16891

jobsub client commands should not fail when one of several servers is down

Added by Kevin Retzke almost 2 years ago. Updated over 1 year ago.

JobSub Client

This weekend fifebatch2 went down, causing many jobsub commands to fail whenever the client tried to contact that server (because the DNS alias resolved to it).

Generally the commands failed with a credentials error like that shown below, since the first command to fail was cigetcert when it tried to obtain the options file from
Why does jobsub run cigetcert, even when the user already has a valid proxy?
This error is especially confusing and unhelpful when the user already has a valid proxy.

Traceback (most recent call last):
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_3_2/NULL/jobsub_q", line 249, in <module>
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_3_2/NULL/jobsub_q", line 230, in main
    options.acctGroup, None, [], extra_opts=optDict)
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/", line 108, in __init__
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/", line 757, in serverAuthMethods
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/", line 304, in cigetcert_to_x509
    raise CredentialsNotFoundError(err)
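
As an illustration of the behavior requested in the title, here is a minimal Python 3 sketch of client-side failover: resolve every address behind the DNS alias and use the first one that accepts a connection. This is not the jobsub_client implementation. The function name `connect_first_live`, the `resolve` and `connect` parameters, and the host and port in the usage note are all hypothetical.

```python
import socket


def connect_first_live(host, port, timeout=5.0,
                       resolve=socket.getaddrinfo,
                       connect=socket.create_connection):
    """Return (connection, address) for the first reachable address behind host.

    Hypothetical helper: resolve and connect are injectable so the logic
    can be exercised without a network.
    """
    errors = []
    seen = set()
    for _family, _type, _proto, _canon, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        addr = sockaddr[0]
        if addr in seen:
            # getaddrinfo may return the same address more than once
            continue
        seen.add(addr)
        try:
            return connect((addr, port), timeout), addr
        except OSError as exc:
            # This server is down or unreachable; remember why and move on
            errors.append("%s: %s" % (addr, exc))
    # Every address failed: report a connectivity problem, not a credentials one
    raise ConnectionError("no server behind %s:%s is reachable (%s)"
                          % (host, port, "; ".join(errors)))
```

Called as `connect_first_live("jobsub.example.org", 8443)` (placeholder host and port), it would skip a dead server as long as another address behind the alias is up, and when all are down it raises an error that names the unreachable servers rather than the user's credentials.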


#1 Updated by Dennis Box almost 2 years ago

  • Assignee changed from Parag Mhashilkar to Dennis Box

#2 Updated by Dennis Box almost 2 years ago

  • Target version set to v1.2.4

One thing that should be done as soon as possible is to get rid of round-robin DNS and put an HAProxy server in front of the jobsub servers, one that detects when machines are down and doesn't route traffic to a dead one. Nick and I have been discussing this; perhaps we need to deploy it sooner rather than later.

I had the 'contact the server for cigetcertopts.txt and then use those to contact myproxy server' step forced on me during the DCAFI project for reasons I don't really understand. The error message should be better.

#3 Updated by Kevin Retzke almost 2 years ago

Also, note that even if a user tried to go to fifebatch1 with the --jobsub-server option, the submit would still fail if cigetcert was directed to fifebatch2. Can the cigetcert call be updated to also use the specified --jobsub-server for cigetcertopts.txt?
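
A rough Python 3 sketch of what is being asked for: build the cigetcertopts.txt URL from the server given with --jobsub-server, and fall back to the default alias only when none was given. The function name `cigetcert_opts_url`, the default server URL, and the `/cigetcertopts.txt` path layout are assumptions for illustration, not the actual jobsub_client code or server layout.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical default alias; the real value is not given in this report
DEFAULT_SERVER = "https://jobsub.example.org:8443"


def cigetcert_opts_url(jobsub_server=None, default_server=DEFAULT_SERVER,
                       path="/cigetcertopts.txt"):
    """Return the options-file URL on the server the user asked for.

    The path is an assumed layout; only the file name comes from the report.
    """
    base = jobsub_server or default_server
    if "://" not in base:
        # Accept a bare host:port as well as a full URL
        base = "https://" + base
    parts = urlsplit(base)
    # Keep scheme and host:port, replace whatever path the user supplied
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

With this shape, a submit directed at fifebatch1 would also fetch its cigetcert options from fifebatch1, so a dead fifebatch2 behind the alias would not be contacted at all.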

#4 Updated by Dennis Box almost 2 years ago

The right way to fix this is with HAProxy or another load balancer in front of the servers. This will not be ready for 1.2.4, so I am moving it to 1.2.5.

#5 Updated by Dennis Box almost 2 years ago

  • Target version changed from v1.2.4 to v1.2.5

#6 Updated by Dennis Box almost 2 years ago

  • Status changed from New to Resolved
  • Target version changed from v1.2.5 to v1.2.4.1

#7 Updated by Dennis Box over 1 year ago

  • Status changed from Resolved to Closed
