Feature #25076

Add knobs to control the Disk amount and Memory amount at the slot level in the Glideins

Added by Marco Mambelli about 1 month ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Category: -
Target version:
Start date: 10/14/2020
Due date:
% Done: 0%
Estimated time:
Stakeholders: FactoryOps
Duration:

Description

Add something similar to GLIDEIN_CPUS and GLIDEIN_ESTIMATED_CPUS for memory and disk.
There are currently GLIDEIN_MaxMemMBs and GLIDEIN_MaxMemMBs_Estimate for memory,
and nothing for disk.

Hi Jeff,
if I understood the problem correctly, you want to enforce a limit on the disk used by the jobs (or jobs+glidein). The startd started by the Glidein should enforce that.

This is needed because at this specific site the glideins are submitted to the worker nodes as many single-slot jobs (many glideins, no partitionable slots, not one glidein with static 1-core slots).
Each glidein sees the whole space available on the disk in the $(execute) directory, so the glideins overuse the disk space (or use more than the local admin would like).

There is currently no mechanism in GlideinWMS to set the limit.

But you can set arbitrary attributes in the glidein request (<submit_attrs>) or in the startd configuration (<attrs>), both in the Factory configuration.
This may provide a workaround, but I'm not sure, so I'm adding some condor people for ideas.

Setting RequestDisk (or JOB_DEFAULT_REQUESTDISK) in the job is not a solution, since it is a minimum for the job, not a limit.
Disk, TotalSlotDisk and TotalDisk are set by condor, so adding them to the condor_config of the startd would have no effect.
RESERVED_DISK will not work because the worker nodes have variable disk amounts.

Not sure about MODIFY_REQUEST_EXPR_REQUESTDISK.

The main slot in both partitionable and static glideins is SLOT_TYPE_1, so you could try something like this in the attrs:
SLOT_TYPE_1 = $(SLOT_TYPE_1) disk=YOURVALUE

For the CPUs there is NUM_CPUS.

For the memory there is MEMORY.

For the disk there seems to be no equivalent knob in HTCondor.

The SLOT_TYPE_# attributes, though, can achieve the purpose.

In condor_startup.sh

        # Set number of CPUs (otherwise the physical number is used)
        echo "NUM_CPUS = \$(GLIDEIN_CPUS)" >> "$CONDOR_CONFIG" 
        # set up the slots based on the slots_layout entry parameter
        slots_layout=`grep -i "^SLOTS_LAYOUT " "$config_file" | cut -d ' ' -f 2-`
        if [ "X$slots_layout" = "Xpartitionable" ]; then
            echo "NUM_SLOTS = 1" >> "$CONDOR_CONFIG" 
            echo "SLOT_TYPE_1 = cpus=\$(GLIDEIN_CPUS)" >> "$CONDOR_CONFIG" 
            echo "NUM_SLOTS_TYPE_1 = 1" >> "$CONDOR_CONFIG" 
            echo "SLOT_TYPE_1_PARTITIONABLE = True" >> "$CONDOR_CONFIG" 
            num_slots_for_shutdown_expr=1
        else
            # fixed slot
            echo "SLOT_TYPE_1 = cpus=1" >> "$CONDOR_CONFIG" 
            echo "NUM_SLOTS_TYPE_1 = \$(GLIDEIN_CPUS)" >> "$CONDOR_CONFIG" 
            num_slots_for_shutdown_expr=$GLIDEIN_CPUS
        fi
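
The excerpt above could be extended so that an optional disk value ends up in the slot type definition. The following is only a minimal sketch of that idea, not the code in the attached condor_startup.sh: the helper name build_slot_type is hypothetical, and it assumes the disk value arrives in a shell variable named GLIDEIN_DISK that is empty when the knob is not set. It prints the configuration line instead of appending it to "$CONDOR_CONFIG".

```shell
# Hypothetical helper (not part of GlideinWMS): build a SLOT_TYPE_1 value.
# $1 = value for cpus=
# $2 = optional value for disk= (empty means: leave disk handling to HTCondor)
build_slot_type() {
    if [ -n "$2" ]; then
        # HTCondor slot type definitions are comma-separated key=value pairs
        echo "cpus=$1, disk=$2"
    else
        echo "cpus=$1"
    fi
}

# Partitionable layout: one slot holding all the CPUs of the glidein.
# The single quotes keep $(GLIDEIN_CPUS) literal, so HTCondor expands it later.
echo "SLOT_TYPE_1 = $(build_slot_type '$(GLIDEIN_CPUS)' "${GLIDEIN_DISK:-}")"

# Fixed layout: one CPU per slot, same optional disk setting
echo "SLOT_TYPE_1 = $(build_slot_type 1 "${GLIDEIN_DISK:-}")"
```

In the real script each echo would be redirected with >> "$CONDOR_CONFIG", as in the excerpt. Leaving disk= out when the variable is empty keeps the current behavior for existing entries.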

Attachment: condor_startup.sh (46.5 KB), Marco Mambelli, 10/15/2020 11:15 AM

Related issues

Related to GlideinWMS - Bug #23339: Make wholenode knobs uniform (New, 09/26/2019)

History

#1 Updated by Marco Mambelli about 1 month ago

  • Related to Bug #23339: Make wholenode knobs uniform added

#2 Updated by Marco Mambelli about 1 month ago

  • Description updated

#3 Updated by Marco Mambelli about 1 month ago

  • Stakeholders updated
  • Assignee set to Marco Mambelli

#4 Updated by Marco Mambelli about 1 month ago

Changes are in v36/25076

Added a knob to set disk= in the slot type definition.
It seems that condor does not allow setting an absolute value (an advertised quota) for the disk used by the jobs.

        <td> <b>GLIDEIN_DISK</b> </td>
        <TD> Str </td>

        <TD> EMPTY </TD>
        <TD>
            <P>Disk amount that the jobs running in the Glidein should use. This is used to configure the HTCondor SLOT_TYPE_ definitions that affect the value of Disk/TotalSlotDisk.
            Valid values are the ones accepted by HTCondor for the "disk=" keyword in the SLOT_TYPE_ definition. The default is an empty string (undefined) which is equivalent to "auto".
            It seems that for disk, valid values are the "auto" string (that lets HTCondor discover and handle the space), fractions or percentages, not absolute values.
            See the <a href="https://htcondor.readthedocs.io/en/latest/admin-manual/policy-configuration.html#dividing-system-resources-in-multi-core-machines">HTCondor manual</a>.
            </P>
            <p>NOTE: this setting does not assure that a certain amount of disk space will be available. To do so you'll have to use request_disk in the submit_attr section
            </p>
        </TD>
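
Since the documentation above says that only "auto", fractions, or percentages seem to be accepted, a Factory operator might want to check a value before putting it in an entry. The following is a rough sketch under that assumption; the function is_valid_glidein_disk is hypothetical, is not part of GlideinWMS or HTCondor, and accepts only integer percentages and integer fractions. HTCondor itself may accept more forms than this check does.

```shell
# Hypothetical pre-check for a GLIDEIN_DISK value (not part of GlideinWMS).
# Returns 0 for: empty string, "auto", "N%" or "N/M" (N, M integers, M > 0).
# Returns 1 for anything else, e.g. absolute amounts such as "5000" or "10GB".
is_valid_glidein_disk() {
    case "$1" in
        ""|auto) return 0 ;;
    esac
    # Percentage such as 25%
    if printf '%s\n' "$1" | grep -Eq '^[0-9]+%$'; then
        return 0
    fi
    # Fraction such as 1/4 (denominator must not start with 0)
    if printf '%s\n' "$1" | grep -Eq '^[0-9]+/[1-9][0-9]*$'; then
        return 0
    fi
    return 1
}

# Usage example: warn about a value that would probably be rejected
candidate="10GB"
if ! is_valid_glidein_disk "$candidate"; then
    echo "GLIDEIN_DISK value '$candidate' is not auto, a fraction, or a percentage"
fi
```

The check is deliberately strict: it is meant to catch absolute values early, not to mirror every syntax that HTCondor can parse.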

To patch, the Factory operator needs to replace web-base/condor_startup.sh with the attached one and run an upgrade command.
It is safe to upgrade v3_6_x versions for v >= 3.6.2 (no other changes).
It is OK to upgrade v3_6_x versions older than 3.6.2 (minor changes).

#5 Updated by Marco Mambelli about 1 month ago

  • Assignee changed from Jeffrey Dost to Marco Mambelli
  • Status changed from Feedback to Resolved
