Provide mechanism to set and enforce limits on job submissions, query rate, etc.
There have been several ongoing and intermittent issues with users intentionally or inadvertently abusing Jobsub commands, causing excessive load on the condor infrastructure. This has been discussed at length at FIFE meetings, but no action has been taken [1]. I'm opening this issue so we can start tracking and formally discussing the problem.
- Users submit many jobs in a short period. This was partially addressed by limiting cluster size to 10K, but there's nothing stopping users from submitting many clusters (until the schedd stops responding).
- Users submit jobs at a reasonable rate, but submit more than could run in a reasonable amount of time (1-2 weeks)
- Users make many scripted jobsub_q queries (e.g. once every 10 seconds)
- Users fetch many very large sandboxes
Some technical controls should be implemented to limit these commands.
[1] A workaround has been proposed/is in development to separate experiments into their own Jobsub servers/schedds, which will limit the area of effect, but is not really a solution.
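One possible shape for a technical control on query rate is a per-user token bucket. This is a minimal sketch only: the issue does not prescribe this technique, and the class names `TokenBucket` and `PerUserLimiter` are hypothetical and not part of Jobsub or condor.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerUserLimiter:
    """Keep one independent bucket per user (hypothetical helper)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, user):
        bucket = self._buckets.get(user)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[user] = bucket
        return bucket.allow()
```

With a small `capacity`, a script polling every 10 seconds would be refused once its burst allowance is spent, while occasional interactive queries would still go through.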
#1 Updated by Dennis Box over 2 years ago
I think all of these issues are addressed, except possibly the large sandbox issue. The cache duration parameters still need testing and tweaking to find the sweet spot.
* Users submit many jobs in a short period. This was partially addressed by limiting cluster size to 10K, but there's nothing stopping users from submitting many clusters (until the schedd stops responding).
* Users submit jobs at a reasonable rate, but submit more than could run in a reasonable amount of time (1-2 weeks)
I have been testing the condor configuration parameter MAX_JOBS_PER_OWNER. It seems to work, and it counts processes as well as clusters. This is what you get out of jobsub_submit when a submission would bump you over the limit:
ERROR: Failed to create proc
Number of submitted jobs would exceed MAX_JOBS_PER_OWNER
if you need help with the above errors
please open a service desk ticket
NB there is a parameter MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER which I haven't been able to make work the way I think it should.
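To illustrate what "counts processes as well as clusters" means for a per-owner cap, here is a rough sketch of the bookkeeping. It is not condor's implementation: the `OwnerJobCounter` class and its methods are hypothetical, and rejecting the whole submission when it would cross the limit is a simplifying assumption.

```python
class OwnerLimitExceeded(Exception):
    """Raised when a submission would push an owner past the cap."""


class OwnerJobCounter:
    """Track queued processes per owner against a single cap (hypothetical)."""

    def __init__(self, max_jobs_per_owner):
        self.max_jobs_per_owner = max_jobs_per_owner
        self._queued = {}  # owner -> number of queued processes

    def submit(self, owner, n_procs):
        # Every process in the cluster counts, not just the cluster itself.
        current = self._queued.get(owner, 0)
        if current + n_procs > self.max_jobs_per_owner:
            raise OwnerLimitExceeded(
                "Number of submitted jobs would exceed MAX_JOBS_PER_OWNER"
            )
        self._queued[owner] = current + n_procs

    def remove(self, owner, n_procs):
        # Jobs leaving the queue free up room for new submissions.
        current = self._queued.get(owner, 0)
        self._queued[owner] = max(0, current - n_procs)

    def queued(self, owner):
        return self._queued.get(owner, 0)
```

Under this model, many small clusters and one large cluster are treated the same way, which is what closes the "submit many clusters" loophole left by the 10K cluster size limit.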
* Users make many scripted jobsub_q queries (e.g. once every 10 seconds)
The jobsub_q queries resolve to cacheable URLs, and we should cache them using this newish feature (from the release notes):
- Feature 17736 Enable caching of http GETS through jobsub.ini settings
NEW jobsub.ini settings:
enable_http_cache = (True || False) default is False
if True, cache http GETS to server
for 'http_cache_duration' seconds
http_cache_duration = (integer)
cache GETs from server for this many
seconds. Default is 120
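The effect of these two settings can be sketched as a small time-based cache keyed by URL. This is an illustration under stated assumptions, not the Jobsub client code: the `GetCache` class and the `fetch` callable are hypothetical, and only the setting names and defaults are taken from the release notes above.

```python
import time


class GetCache:
    """Cache GET responses per URL for a fixed number of seconds (hypothetical)."""

    def __init__(self, fetch, enable_http_cache=False,
                 http_cache_duration=120, clock=time.monotonic):
        self.fetch = fetch                    # callable: url -> response body
        self.enabled = enable_http_cache      # default False, as in the notes
        self.duration = http_cache_duration   # seconds, default 120
        self.clock = clock
        self._entries = {}                    # url -> (stored_at, body)

    def get(self, url):
        if not self.enabled:
            return self.fetch(url)
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self.duration:
            return entry[1]  # still fresh: the server is not contacted
        body = self.fetch(url)
        self._entries[url] = (now, body)
        return body
```

With caching enabled and the default duration, a script that repeats the same query every 10 seconds would reach the server about once every 120 seconds instead of 12 times.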
* Users fetch many very large sandboxes
Even with the above feature turned off, the sandboxes are now cached for a tunable length of time (default 10 minutes), to prevent sandbox denial of service attacks like we experienced in the past.
Other suggestions for what to do about this problem are welcome.