jobsub connections should query RecentDaemonCoreDutyCycle
The schedd occasionally gets very busy. One of the reasons it can get really busy is if there is a large job submission. We'd like jobsub to try to throttle things a bit.
We're seeing users submitting jobs in loops. We currently limit a single cluster to 10k procs, but if users submit clusters in a loop we still get an overload.
jobsub could query the parameter RecentDaemonCoreDutyCycle from the schedd when it's looking at the schedds to see which one to submit a job to. If the schedd it's going to submit to has a high duty cycle then the jobsub server should pause for a little while and give it time to recover. If the user is submitting jobs in a loop, and they're submitting a large number of procs, this should slow the submission down and give the schedd a chance to keep up.
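A minimal sketch of how the duty cycle could be read, assuming the jobsub server can run condor_status and that the value comes back as a fraction between 0 and 1. The function names parse_duty_cycles and query_duty_cycles are hypothetical and are not part of jobsub.

```python
import subprocess


def parse_duty_cycles(output):
    """Parse 'Name RecentDaemonCoreDutyCycle' lines into {name: duty_cycle}.

    Lines that do not hold exactly a name and a number are skipped, for
    example when the attribute prints as "undefined".
    """
    cycles = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            cycles[name] = float(value)
        except ValueError:
            continue
    return cycles


def query_duty_cycles():
    """Ask the collector for every schedd's recent duty cycle.

    Assumption: condor_status is on PATH and can reach the pool's collector.
    """
    result = subprocess.run(
        ["condor_status", "-schedd", "-af", "Name", "RecentDaemonCoreDutyCycle"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_duty_cycles(result.stdout)
```

Keeping the parsing separate from the condor_status call means the parsing can be exercised without a running pool.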
#4 Updated by Joe Boyd over 3 years ago
When we run into problems, the schedd is usually hitting 100% busy. If there is a schedd that isn't busy, I think you can just submit to it. Otherwise maybe wait 60 seconds and try again, printing a message for the user.
Looking at the recent schedd numbers we haven't been seeing this so much lately.
Maybe if the schedd is over 85% busy don't submit more jobs to it until it comes down? 90% might be acceptable.
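The policy suggested in this comment could be sketched roughly as follows. This is an illustration only: the names pick_schedd and wait_for_schedd, the max_attempts cap, and the choice to prefer the least busy schedd are assumptions, and the 85% threshold and 60 second pause are just the numbers floated above.

```python
import time

BUSY_THRESHOLD = 0.85   # suggested above; 0.90 might also be acceptable
RETRY_SECONDS = 60      # pause suggested above


def pick_schedd(duty_cycles, threshold=BUSY_THRESHOLD):
    """Return the least busy schedd at or below the threshold, or None."""
    candidates = {n: d for n, d in duty_cycles.items() if d <= threshold}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def wait_for_schedd(get_duty_cycles, threshold=BUSY_THRESHOLD,
                    retry_seconds=RETRY_SECONDS, max_attempts=5,
                    sleep=time.sleep, notify=print):
    """Retry until some schedd is below the threshold or attempts run out.

    get_duty_cycles is a caller-supplied function returning
    {schedd_name: duty_cycle}; sleep and notify are injectable for testing.
    """
    for attempt in range(1, max_attempts + 1):
        name = pick_schedd(get_duty_cycles(), threshold)
        if name is not None:
            return name
        if attempt < max_attempts:
            notify("All schedds are busy; retrying in %d seconds "
                   "(attempt %d of %d)" % (retry_seconds, attempt, max_attempts))
            sleep(retry_seconds)
    return None
```

A caller would submit to the returned schedd, or report an error to the user if None comes back after the last attempt.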