jobsub_client retry logic
When a job submission fails, it would be nice if jobsub_client retried X times (perhaps trying every machine in the round robin at least once). Especially with our HA setup, the first machine tried could be in a bad state, and you would then want to submit the job to the next one.
#3 Updated by Parag Mhashilkar almost 4 years ago
- Assignee changed from Parag Mhashilkar to Dennis Box
- Stakeholders updated (diff)
NOTE: We should brainstorm before implementing this.
We want to be very careful not to resubmit in case the jobs were submitted successfully but for some reason the client did not get the info back from the server.
Here is my proposal; we should look for better alternatives, if any:
- Submit retry logic
  - Figure out the server to use by our usual method
  - Generate a unique token that is added to the jobs' classad when you do the submit
  - If the client gets a failure code from the server, then for every jobsub server/schedd in the system:
    - Wait for a few seconds
    - Check whether any jobs were submitted with that unique token in the classad
    - If no jobs were submitted, move to the next schedd in the list and submit the jobs
    - Add a one-liner to the client's output clearly mentioning the server/schedd that failed and the server/schedd that will be tried next
- Fetchlog retry logic
  - Wait and retry the same server after some time. Exit with failure on the second failed attempt.
- Other commands retry logic
  - Unless I am missing something, other commands can be executed from either server
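
To make the proposal above concrete, here is a minimal Python sketch of the submit and fetchlog retry steps. It is not jobsub_client code: `submit`, `count_jobs_with_token` and `fetch` are hypothetical callables standing in for the real server calls, and the wait times are placeholders. It also assumes the token check is made against the server that just returned the failure code.

```python
import time
import uuid


def submit_with_retry(servers, submit, count_jobs_with_token,
                      wait_seconds=5, sleep=time.sleep, log=print):
    """Try each server at most once and return (server, token) on success.

    Hypothetical callables supplied by the caller:
      submit(server, token) -> bool               True if the server returned success
      count_jobs_with_token(server, token) -> int jobs whose classad carries the token
    """
    token = uuid.uuid4().hex  # unique token to add to the jobs' classad
    for i, server in enumerate(servers):
        if submit(server, token):
            return server, token
        # Failure code: the jobs may still have been queued, so wait and check
        # before resubmitting anywhere else.
        sleep(wait_seconds)
        if count_jobs_with_token(server, token) > 0:
            log(f"{server} reported a failure but jobs with token {token} "
                "are queued; not resubmitting")
            return server, token
        if i + 1 < len(servers):
            log(f"Submission to {server} failed; trying {servers[i + 1]} next")
    raise RuntimeError("submission failed on all servers: " + ", ".join(servers))


def fetchlog_with_retry(server, fetch, wait_seconds=30, sleep=time.sleep):
    """Retry the same server once; a second failure propagates to the caller.

    fetch(server) is a hypothetical callable that raises on failure.
    """
    try:
        return fetch(server)
    except Exception:
        sleep(wait_seconds)
        return fetch(server)
```

The token is generated once and reused for every attempt, so a later check can tell whether an earlier attempt that looked like a failure actually queued the jobs.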
#4 Updated by Joe Boyd almost 4 years ago
I'm running into a failure case right now that I thought I'd add here as food for thought.
We have three schedds on pre-prod. fermicloud393 is currently broken in some way that I haven't investigated, but my submissions there fail every time with
User authorization has failed: Error authorizing DN='/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Joe B. Boyd/CN=UID:boyd' for AcctGroup='mu2e'
even though I just sent the exact same job to fermicloud391.fnal.gov. Anyway, broken schedd.
The problem is that since I already submitted one job each to fermicloud391 and fermicloud383, every submit now goes to fermicloud393, I assume because of the load balancing that was added. So we need retry logic when there's a failure, but it can't just keep trying the same schedd, because the load balancing may keep picking a schedd that is broken. In this case, since something is wrong with auth, the job will never submit even with retries.
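
One way to keep a retry from landing on the same broken schedd is to remove each failed schedd from the candidate list before asking the load balancer again. The Python sketch below is only an illustration: `choose` and `submit` are hypothetical callables, not existing jobsub functions, and it leaves out the duplicate-submission check discussed earlier in this ticket.

```python
def submit_excluding_failed(servers, choose, submit, log=print):
    """Retry submission, never offering a failed schedd to the balancer again.

    Hypothetical callables supplied by the caller:
      choose(candidates) -> server   the load-balancing pick among candidates
      submit(server) -> bool         True if the submission succeeded
    """
    failed = []
    while True:
        # Only schedds that have not failed during this submit are eligible.
        candidates = [s for s in servers if s not in failed]
        if not candidates:
            raise RuntimeError("submission failed on: " + ", ".join(failed))
        server = choose(candidates)
        if submit(server):
            return server
        failed.append(server)
        log(f"Submission to {server} failed; excluding it from further attempts")
```

With a balancer that always prefers the least-loaded schedd, the broken one is picked first, fails, and is then dropped, so the next pick has to be a different machine.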