Feature #11877

More flexible mechanisms for a job to request resources (cpus, memory, ...)

Added by Marco Mambelli almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 03/03/2016
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS
Duration:

Description

GlideinWMS uses Condor.
Condor lets a job specify the exact amount of resources it needs:
1 cpu and 2GB, or 4 cpus and 5GB.
It is also possible to ask for the whole node, whatever its size.

It would be nice to be able to ask for a range:
between 3 and 7 cpus, or
all the cpus you have left.

This may require new mechanisms in Condor.


Related issues

Blocked by GlideinWMS - Bug #11453: Frontend policies do not work with RequestCpus in classad is a condor expression (Closed, 01/20/2016)

History

#1 Updated by Parag Mhashilkar almost 4 years ago

  • Blocked by Bug #11453: Frontend policies do not work with RequestCpus in classad is a condor expression added

#2 Updated by Parag Mhashilkar almost 4 years ago

  • Target version set to v3_2_x
  • Stakeholders updated

#3 Updated by Parag Mhashilkar almost 4 years ago

Email from Brian requesting the change

On Apr 19, 2016, at 10:42 AM, Brian Bockelman wrote:

Hi Parag, Marco^2,

(Also some HTCondor folks who might be amused by the setup)

CMS is starting to commission “resizable jobs” - jobs whose RequestCpus attribute is an expression.  The relevant portion of the ClassAd looks approximately like this:

MinCores = 1
MaxCores = 4
OriginalCpus = 4
JOB_GLIDEIN_Cpus = "$$(Cpus:0)"
RequestResizedCpus = (Cpus>MaxCores) ? MaxCores : ((Cpus < MinCores) ? MinCores : Cpus)
JobCpus = ((JobStatus =!= 1) && (JobStatus =!= 5) && !isUndefined(MATCH_EXP_JOB_GLIDEIN_Cpus) && (int(MATCH_EXP_JOB_GLIDEIN_Cpus) isnt error)) ? int(MATCH_EXP_JOB_GLIDEIN_Cpus) : OriginalCpus
RequestCpus = WMCore_ResizeJob ? (!isUndefined(Cpus) ? RequestResizedCpus : JobCpus) : OriginalCpus

(with RequestMemory and MaxWallTimeMins adjusted accordingly).

In English, this should read:
If WMCore_ResizeJob is true,
- When matchmaking,  RequestCpus should evaluate to the number of cores in the candidate slot (adjusted to the MinCores / MaxCores range)
- If the job is running, RequestCpus should be the number of cores in the matched slot
- Otherwise, evaluate to the original CPU request.
 - “Original CPU” is the number of cores used to benchmark the workflow (useful point of reference for the memory and estimated wall time).
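
The three cases above can be sketched in plain Python to make the evaluation order explicit. This is only an illustration of the logic, not how HTCondor evaluates ClassAds: the function and parameter names are hypothetical, `slot_cpus=None` stands in for `Cpus` being undefined (no candidate slot in scope), and `WMCore_ResizeJob` is simplified to a plain boolean.

```python
from typing import Optional


def request_resized_cpus(slot_cpus: int, min_cores: int, max_cores: int) -> int:
    """Mirror RequestResizedCpus: clamp the slot's cores to [min_cores, max_cores]."""
    if slot_cpus > max_cores:
        return max_cores
    if slot_cpus < min_cores:
        return min_cores
    return slot_cpus


def job_cpus(job_status: int, matched_cpus: Optional[str], original_cpus: int) -> int:
    """Mirror JobCpus: use the matched slot's cores unless the job is idle (1) or held (5)."""
    if job_status not in (1, 5) and matched_cpus is not None:
        try:
            # MATCH_EXP_JOB_GLIDEIN_Cpus is a string attribute in the job ad.
            return int(matched_cpus)
        except ValueError:
            pass  # corresponds to int(...) evaluating to error in the ClassAd
    return original_cpus


def request_cpus(
    resize_job: bool,
    slot_cpus: Optional[int],
    job_status: int,
    matched_cpus: Optional[str],
    min_cores: int = 1,
    max_cores: int = 4,
    original_cpus: int = 4,
) -> int:
    """Mirror RequestCpus; defaults follow the MinCores/MaxCores/OriginalCpus values above."""
    if not resize_job:
        return original_cpus
    if slot_cpus is not None:  # matchmaking: a candidate slot's Cpus is visible
        return request_resized_cpus(slot_cpus, min_cores, max_cores)
    return job_cpus(job_status, matched_cpus, original_cpus)


if __name__ == "__main__":
    print(request_cpus(True, 8, 1, None))    # matchmaking against an 8-core slot
    print(request_cpus(True, None, 2, "3"))  # running in a matched 3-core slot
    print(request_cpus(True, None, 1, "3"))  # idle job, no slot in scope
```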

Right now, everything appears to work with the frontend because, when RequestCpus isn’t an integer, the frontend defaults it to 1.  The monitoring code appears to be mostly based on the slots, so it should also count running jobs correctly (very important for setting max jobs running).
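
The fallback described here can be pictured with a small Python sketch. It is not the actual frontend code: `cpus_for_counting` is a hypothetical helper, and the assumption is that an unevaluated expression reaches the frontend as a string rather than an integer.

```python
def cpus_for_counting(request_cpus_value) -> int:
    """Return the value if it is a plain integer, otherwise fall back to 1."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(request_cpus_value, int) and not isinstance(request_cpus_value, bool):
        return request_cpus_value
    return 1


if __name__ == "__main__":
    # A fixed request, then an unevaluated expression (abbreviated, hypothetical text).
    for value in (4, "WMCore_ResizeJob ? RequestResizedCpus : OriginalCpus"):
        print(repr(value), "->", cpus_for_counting(value))
```

Under this assumption, a resizable job that could use up to MaxCores cores is counted as a single-core request while idle, which is why native support for evaluating the expression would be more useful.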

*However*, it’d be far more useful if glideinWMS could understand the setup natively:
0) Evaluate expressions in job ads.  This implies a switch from using “condor_q -xml” to the python bindings, I think.
1) Understand  MinCore

#4 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_x to v3_2_14

#5 Updated by Parag Mhashilkar over 3 years ago

  • Assignee set to Parag Mhashilkar

#6 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_14 to v3_2_15

#7 Updated by Parag Mhashilkar over 3 years ago

According to Farrukh, the changes in #11453 are sufficient to write the RequestCpus expressions required. I am keeping this ticket active in case we need additional semantics added to the frontend.

#8 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_15 to v3_2_16

#9 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from New to Resolved

From the discussions with CMS, they have everything they need from GlideinWMS to achieve what they want to do. There were a few bugs, but they were fixed and released in v3.2.15. Closing this issue.

#10 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

