jobsub client commands should not fail when one of several servers is down
This weekend fifebatch2 went down, causing many jobsub commands to fail whenever the client tried to contact that server (because the fifebatch.fnal.gov DNS alias resolved to it).
Generally the commands failed with a credentials error like that shown below, since the first command to fail was cigetcert when it tried to obtain the options file from https://fifebatch.fnal.gov.
Why does jobsub run cigetcert, even when the user already has a valid proxy?
The error is especially confusing and unhelpful when the user already has a valid proxy.
Traceback (most recent call last):
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_3_2/NULL/jobsub_q", line 249, in <module>
    sys.exit(main(sys.argv))
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_3_2/NULL/jobsub_q", line 230, in main
    options.acctGroup, None, , extra_opts=optDict)
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/jobsubClient.py", line 108, in __init__
    self.serverAuthMethods()
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/jobsubClient.py", line 757, in serverAuthMethods
    self.verbose)
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_3_2/NULL/jobsubClientCredentials.py", line 304, in cigetcert_to_x509
    raise CredentialsNotFoundError(err)
jobsubClientCredentials.CredentialsNotFoundError
#2 Updated by Dennis Box almost 2 years ago
- Target version set to v1.2.4
One thing that should be done as soon as possible is to get rid of round-robin DNS and put an HAProxy server in front of the jobsub servers, so that it detects when machines are down and doesn't route traffic to a dead one. Nick and I have been discussing this; perhaps we need to deploy it sooner rather than later.
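Until a proxy with health checks is in place, the same "skip the dead machine" idea could be approximated on the client side. The following is only a rough sketch, not jobsub code: the helper names (resolve_all, tcp_probe, first_live_server) are hypothetical, and it assumes that a successful TCP connection is a good enough sign that a server is up.

```python
import socket


def resolve_all(hostname, port):
    """Return every distinct IP address behind a DNS name, in resolver order."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    seen = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        if sockaddr[0] not in seen:
            seen.append(sockaddr[0])
    return seen


def tcp_probe(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) succeeds in time."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def first_live_server(addresses, port, probe=tcp_probe):
    """Return the first address that passes the probe, or None if all fail."""
    for address in addresses:
        if probe(address, port):
            return address
    return None
```

A client would call first_live_server(resolve_all(alias, port), port) once and then talk to the returned address instead of the alias. A TCP probe only shows that the port is open, not that the service behind it is healthy, so this is a stopgap rather than a replacement for proper health checks in a load balancer.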
I had the 'contact the server for cigetcertopts.txt and then use those to contact myproxy server' step forced on me during the DCAFI project for reasons I don't really understand. The error message should be better.
#3 Updated by Kevin Retzke almost 2 years ago
Also, note that even if a user tried to go to fifebatch1 with the --jobsub-server option, the submit would still fail if cigetcert was directed to fifebatch2. Can the cigetcert call be updated to also use the specified --jobsub-server for cigetcertopts.txt?
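One possible shape for that change, as a sketch only: build the cigetcertopts.txt URL from whatever server the user specified and fall back to the alias otherwise. The function name cigetcert_opts_url and the opts_path default are hypothetical; the real location of the options file on the server is not given in this ticket, so the path is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit

# Alias taken from this ticket; used only when no server is specified.
DEFAULT_SERVER = "https://fifebatch.fnal.gov"


def cigetcert_opts_url(jobsub_server=None, opts_path="/cigetcertopts.txt"):
    """Build the options-file URL from the server the user asked for.

    opts_path is a placeholder: the actual path on the server is an assumption.
    """
    server = jobsub_server or DEFAULT_SERVER
    if "://" not in server:
        # Accept a bare host name as well as a full URL.
        server = "https://" + server
    parts = urlsplit(server)
    # Keep scheme, host, and port; replace any path with the options path.
    return urlunsplit((parts.scheme, parts.netloc, opts_path, "", ""))
```

With this, a request made with --jobsub-server pointing at fifebatch1 would fetch the options from fifebatch1 as well, so a dead fifebatch2 behind the alias would no longer break the credential step.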