
Feature #14183

Add the ability to request specific amount of disk space for special resource slots

Added by Marco Mambelli about 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 10/19/2016
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

Currently, special resource slots can specify only how much memory to reserve; the disk is split evenly among slots, so a static special slot often ends up with more disk space than it needs and there is not enough disk left for the jobs.
HTCondor does not enforce disk quotas, but since jobs have disk requirements they fail to match and CPUs remain unused.

Here is the email from Farrukh:

Hi Marco,

This is more or less the same high io slot use case we had in mind previously. Now we are thinking of deploying this into the global pool itself.

The main production workflows that CMS runs are now mostly multi core. When a workflow completes the processing part, it starts small single core clean up jobs, which are termed high io jobs. These small jobs are the highest priority jobs in the pool, and they break the multicore slots into single core slots, causing priority inversion (high priority multicore jobs stay idle while low priority single core jobs run).

In the tier0 pool, we only have tier0 workflows so it was safe to overcommit cpus, and have the io slot be a part of the main slot. In the global pool though, we also have CRAB3 single core jobs. We cannot overcommit cpu and add the ioslot to the main slot because there is nothing preventing CRAB3 from landing there if sufficient memory is available.

I was thinking of making a static io slot in addition to the partitionable 8 core slot, but with condor equally dividing the disk into two halves, we run out of disk on the main slot, causing cpus to go idle. The idea is that the static slot would then match only the high io jobs and reject the CRAB3 jobs, which do not define the Request* attribute at all.

I think there is a 'WithinResourceLimits' expression that must evaluate to true for the job to match. This expression checks cpu, disk and memory to make sure there is enough available to satisfy the job request. We hit this issue in the tier0 pool with one of the initial implementations of the resource slots. Basically the 8 core slot got 80 GB of disk and was only able to run one four core job (CMS has a 20 GB/core disk requirement, so four cores already need the full 80 GB).

I think it would be handy to be able to specify disk as well, so we can carve out just 20 GB for the io slot.
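The carve-out described above could look roughly like the following HTCondor startd configuration. This is only a sketch: the slot sizes, percentages, and the RequestIOSlot attribute are illustrative assumptions, not part of the ticket.

```
# Sketch only: a static single-core io slot with a small fixed disk
# share, plus a partitionable slot that takes the rest.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=1, memory=2500, disk=5%
SLOT_TYPE_1_PARTITIONABLE = False

NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = cpus=8, memory=auto, disk=95%
SLOT_TYPE_2_PARTITIONABLE = True

# Let the static slot (type 1) match only jobs that explicitly opt in;
# the RequestIOSlot job attribute is hypothetical.
START = (SlotTypeID != 1) || (TARGET.RequestIOSlot =?= True)
```

With a START expression like this, CRAB3 jobs that never set the opt-in attribute would not match the io slot even if enough memory were available, which is the behavior the email asks for.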

Do you have any other suggestions to tackle this problem with the current version (3.2.15) that we have?

Best regards,
Farrukh

History

#1 Updated by Marco Mambelli about 4 years ago

  • Tracker changed from Bug to Feature
  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Code is in branch v3/14183.
Disk amounts have to be expressed as a percentage/fraction, since that is what HTCondor requires (slot sizes are dynamic and there is no reservation/guarantee mechanism for disk).
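As an illustration of the constraint above, HTCondor slot-type definitions accept a disk share as a percentage or a fraction of the total (values here are made up):

```
SLOT_TYPE_1 = cpus=1, disk=5%     # percentage of total disk
SLOT_TYPE_2 = cpus=8, disk=1/4    # equivalently, a fraction
```

An absolute byte amount cannot be guaranteed, so the feature exposes the disk request in these relative terms.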

#2 Updated by Parag Mhashilkar about 4 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Requested minor changes over messenger. Else looks ok to merge.

#3 Updated by Marco Mambelli about 4 years ago

  • Status changed from Feedback to Resolved

merged to branch_v3_2

#4 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from Resolved to Closed
