Bug #16891

jobsub client commands should not fail when one of several servers is down

Added by Kevin Retzke almost 2 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
High
Assignee:
Category:
JobSub Client
Target version:
Start date:
06/19/2017
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:

fife-group@fnal.gov

Duration:

Description

This weekend fifebatch2 went down, causing many jobsub commands to fail whenever the client tried to contact that server (because the fifebatch.fnal.gov DNS alias resolved to it).

The commands generally failed with a credentials error like the one shown below, because the first step to fail was cigetcert, which tries to fetch its options file from https://fifebatch.fnal.gov.
Why does jobsub run cigetcert even when the user already has a valid proxy?
In that case the error is especially confusing and unhelpful.

Traceback (most recent call last):
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_3_2/NULL/jobsub_q", line 249, in <module>
    sys.exit(main(sys.argv))
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_3_2/NULL/jobsub_q", line 230, in main
    options.acctGroup, None, [], extra_opts=optDict)
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/jobsubClient.py", line 108, in __init__
    self.serverAuthMethods()
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/jobsubClient.py", line 757, in serverAuthMethods
    self.verbose)
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/jobsubClientCredentials.py", line 304, in cigetcert_to_x509
    raise CredentialsNotFoundError(err)
jobsubClientCredentials.CredentialsNotFoundError

History

#1 Updated by Dennis Box almost 2 years ago

  • Assignee changed from Parag Mhashilkar to Dennis Box

#2 Updated by Dennis Box almost 2 years ago

  • Target version set to v1.2.4

One thing that should be done as soon as possible is to get rid of round-robin DNS and put an HAProxy server in front of the jobsub servers, so that it detects when machines are down and doesn't route traffic to a dead one. Nick and I have been discussing this; perhaps we need to deploy it sooner rather than later.
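The health-check idea can be illustrated with a minimal Python sketch. This is not HAProxy configuration and not existing jobsub code: `is_alive` and `pick_live_server` are hypothetical names, and a plain TCP connect is assumed to be a good enough liveness probe.

```python
import socket


def is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def pick_live_server(servers, timeout=2.0):
    """Return the first (host, port) pair that accepts a connection, else None.

    `servers` is a list of (host, port) tuples; the list itself is an
    assumption, since the real server list is not given in this ticket.
    """
    for host, port in servers:
        if is_alive(host, port, timeout):
            return (host, port)
    return None
```

A load balancer runs this kind of probe continuously on the server side; doing it in the client would only be a stopgap, since every command would pay the probe cost.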

The step 'contact the server for cigetcertopts.txt, then use those options to contact the myproxy server' was forced on me during the DCAFI project, for reasons I don't really understand. The error message should be better.
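One way the error message could be improved is to separate "the server could not be reached" from "no credentials were found". The sketch below is only an illustration under assumptions: `ServerUnreachableError`, `fetch_cigetcert_opts`, and the injected `fetch` callable are hypothetical and do not reflect the actual jobsubClientCredentials code.

```python
class ServerUnreachableError(Exception):
    """Hypothetical error: the options file could not be downloaded."""


def fetch_cigetcert_opts(url, fetch):
    """Fetch the cigetcert options text from `url` using `fetch(url)`.

    `fetch` is any callable that returns the file contents and raises
    OSError (or a subclass) on network failure; it is injected here so the
    sketch does not assume a particular HTTP library.
    """
    try:
        return fetch(url)
    except OSError as err:
        # Report a network problem as a network problem, not as a
        # missing-credentials error.
        raise ServerUnreachableError(
            "could not fetch cigetcert options from %s (%s); "
            "the server may be down, and an existing proxy may still be valid"
            % (url, err)
        ) from err
```

With this split, a user holding a valid proxy would see that the failure is on the server side rather than a bare CredentialsNotFoundError.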

#3 Updated by Kevin Retzke almost 2 years ago

Also, note that even if a user tried to go to fifebatch1 with the --jobsub-server option, the submit would still fail if cigetcert was directed to fifebatch2. Can the cigetcert call be updated to also use the specified --jobsub-server for cigetcertopts.txt?
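A rough sketch of the requested behavior follows: build the cigetcertopts.txt URL from the server given with --jobsub-server, and fall back to the default alias only when none is given. The function name `cigetcert_opts_url` and the path `/cigetcertopts.txt` are assumptions made for illustration; the real location of the file on the server is not stated in this ticket.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_SERVER = "https://fifebatch.fnal.gov"
OPTS_PATH = "/cigetcertopts.txt"  # assumed path, not confirmed by the ticket


def cigetcert_opts_url(jobsub_server=None, default_server=DEFAULT_SERVER,
                       opts_path=OPTS_PATH):
    """Return the options-file URL on the user-specified server if given."""
    base = jobsub_server or default_server
    if "://" not in base:
        # Accept a bare host name and assume HTTPS.
        base = "https://" + base
    parts = urlsplit(base)
    # Keep scheme, host, and port; replace any path with the options path.
    return urlunsplit((parts.scheme, parts.netloc, opts_path, "", ""))
```

Called with no argument, the function returns the URL on the default alias, so existing behavior would be unchanged for users who do not pass --jobsub-server.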

#4 Updated by Dennis Box over 1 year ago

The right way to fix this is with HAProxy or another load balancer in front of the servers. This will not be ready for 1.2.4, so I am moving it to 1.2.5.

#5 Updated by Dennis Box over 1 year ago

  • Target version changed from v1.2.4 to v1.2.5

#6 Updated by Dennis Box over 1 year ago

  • Status changed from New to Resolved
  • Target version changed from v1.2.5 to v1.2.4.1

#7 Updated by Dennis Box over 1 year ago

  • Status changed from Resolved to Closed
