Feature #17606

Provide mechanism to set and enforce limits on job submissions, query rate, etc

Added by Kevin Retzke over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 08/29/2017
Due date: -
% Done: 0%
Estimated time: -
Stakeholders: -
Duration: -

Description

There have been several ongoing and intermittent issues with users, intentionally or inadvertently, abusing Jobsub commands and causing excessive load on the condor infrastructure. This has been discussed at length at FIFE meetings, but no action has been taken [1]. I'm opening this issue so we can start tracking and formally discussing it.

Some Issues:

  • Users submit many jobs in a short period. This was partially addressed by limiting cluster size to 10K, but there's nothing stopping users from submitting many clusters (until the schedd stops responding).
  • Users submit jobs at a reasonable rate, but submit more than could run in a reasonable amount of time (1-2 weeks).
  • Users make many scripted jobsub_q queries (e.g. once every 10 seconds)
  • Users fetch many very large sandboxes

Some technical controls should be implemented to limit these commands; one possible starting point is sketched below.
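
As a starting point, HTCondor already exposes schedd-side knobs that cap queue growth. A minimal sketch, with illustrative values that would need site-specific tuning:

    # condor_config on the schedd -- values are illustrative, not recommendations
    MAX_JOBS_PER_SUBMISSION = 10000   # cap jobs per submit transaction (the existing 10K cluster limit)
    MAX_JOBS_SUBMITTED = 200000       # cap total jobs in the schedd queue

Note these only bound queue size; they do nothing about query rate or sandbox traffic, so the last two issues above would need separate controls.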

[1] A workaround has been proposed and is in development: separating experiments onto their own Jobsub servers/schedds. This will limit the area of effect, but it is not really a solution.

History

#1 Updated by Dennis Box over 1 year ago

I think all of these issues are addressed except possibly the large-sandbox issue. Testing and tweaking of the cache duration parameters is needed to find the sweet spot.

* Users submit many jobs in a short period. This was partially addressed by limiting cluster size to 10K, but there's nothing stopping users from submitting many clusters (until the schedd stops responding).
* Users submit jobs at a reasonable rate, but submit more than could run in a reasonable amount of time (1-2 weeks).

I have been testing a condor configuration parameter, MAX_JOBS_PER_OWNER. It seems to work, and it counts processes as well as clusters. This is what you get out of jobsub_submit when a submission would bump you over the limit:

ERROR: Failed to create proc
Number of submitted jobs would exceed MAX_JOBS_PER_OWNER
if you need help with the above errors
please open a service desk ticket

NB: there is a parameter, MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER, which I haven't been able to make work the way I think it should.
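
For reference, a minimal sketch of what the schedd configuration under test might look like; the limit values here are illustrative, not what is actually deployed:

    # condor_config on the schedd
    MAX_JOBS_PER_OWNER = 10000                # per-user cap; counts procs as well as clusters
    # Mentioned above; behavior still under investigation:
    MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER = 200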

* Users make many scripted jobsub_q queries (e.g. once every 10 seconds)

The jobsub_q queries resolve to cacheable URLs, and we should cache them using this newish feature (from the release notes); a sample configuration follows the quoted note below.

  • Feature 17736 Enable caching of http GETS through jobsub.ini settings
    NEW jobsub.ini settings:
    enable_http_cache = (True || False) default is False
    if True, cache http GETS to server
    for 'http_cache_duration' seconds

    http_cache_duration = (integer)
    cache GETs from server for this many
    seconds. Default is 120
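
Under those settings, enabling the cache would presumably look like this in jobsub.ini (a sketch only; where exactly these lines go depends on the server's jobsub.ini layout):

    # jobsub.ini -- values per the release note above
    enable_http_cache = True       # cache http GETs to the server
    http_cache_duration = 120      # seconds to serve a cached response (120 is the default)

With a 120-second cache, a script polling jobsub_q every 10 seconds would reach the schedd roughly once every two minutes instead of twelve times.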

* Users fetch many very large sandboxes

Even with the above feature turned off, sandboxes are now cached for a tunable length of time (default 10 minutes) to prevent sandbox denial-of-service attacks like those we experienced in the past.

Other suggestions for what to do about this problem are welcome.


