Feature #25076
Add knobs to control the Disk amount and Memory amount at the slot level in the Glideins
0%
FactoryOps
Description
Add something similar to GLIDEIN_CPUS and GLIDEIN_ESTIMATED_CPUS for memory and disk.
For memory there are currently GLIDEIN_MaxMemMBs and GLIDEIN_MaxMemMBs_Estimate.
There is nothing for disk.
Hi Jeff, if I understood the problem correctly, you want to enforce a limit on the disk used by the jobs (or by the jobs plus the glidein). The startd started by the Glidein should enforce that. The problem arises because at this specific site the glideins are submitted to the worker nodes as many 1-slot jobs (many glideins, no partitionable slots, no single glidein with static 1-core slots). Each glidein sees the whole space available on the disk in the $(execute) directory, so the glideins overuse the disk space (or use more than the local admin would like).
There is currently no mechanism in GlideinWMS to set the limit. However, you can set arbitrary attributes in the glidein request (<submit_attrs>) or in the startd configuration (<attrs>), both in the Factory configuration. This may provide a workaround, but I'm not sure, and I'm adding some condor people for ideas.
- Setting RequestDisk (or JOB_DEFAULT_REQUESTDISK) in the job is not a solution, since it is a minimum for the job, not a limit.
- Disk, TotalSlotDisk and TotalDisk are set by condor, so adding them to the condor_config of the startd would have no effect.
- RESERVED_DISK will not work because the WNs have variable disk amounts.
- I'm not sure about MODIFY_REQUEST_EXPR_REQUESTDISK.
- The main slot in both partitionable and static glideins is SLOT_TYPE_1, so you could try something like this in the attrs: SLOT_TYPE_1 = $(SLOT_TYPE_1) disk=YOURVALUE
For the CPUs there is NUM_CPUS
For the memory there is MEMORY
For the disk there seems to be no equivalent knob in HTCondor
The SLOT_TYPE_# attributes, though, can achieve the purpose
In condor_startup.sh
# Set number of CPUs (otherwise the physical number is used)
echo "NUM_CPUS = \$(GLIDEIN_CPUS)" >> "$CONDOR_CONFIG"

# set up the slots based on the slots_layout entry parameter
slots_layout=`grep -i "^SLOTS_LAYOUT " "$config_file" | cut -d ' ' -f 2-`
if [ "X$slots_layout" = "Xpartitionable" ]; then
    echo "NUM_SLOTS = 1" >> "$CONDOR_CONFIG"
    echo "SLOT_TYPE_1 = cpus=\$(GLIDEIN_CPUS)" >> "$CONDOR_CONFIG"
    echo "NUM_SLOTS_TYPE_1 = 1" >> "$CONDOR_CONFIG"
    echo "SLOT_TYPE_1_PARTITIONABLE = True" >> "$CONDOR_CONFIG"
    num_slots_for_shutdown_expr=1
else
    # fixed slot
    echo "SLOT_TYPE_1 = cpus=1" >> "$CONDOR_CONFIG"
    echo "NUM_SLOTS_TYPE_1 = \$(GLIDEIN_CPUS)" >> "$CONDOR_CONFIG"
    num_slots_for_shutdown_expr=$GLIDEIN_CPUS
fi
Related issues
History
#1 Updated by Marco Mambelli 6 months ago
- Related to Bug #23339: Make wholenode knobs uniform added
#2 Updated by Marco Mambelli 6 months ago
- Description updated (diff)
#3 Updated by Marco Mambelli 6 months ago
- Stakeholders updated (diff)
- Assignee set to Marco Mambelli
#4 Updated by Marco Mambelli 6 months ago
- Assignee changed from Marco Mambelli to Jeffrey Dost
- Status changed from New to Feedback
- File condor_startup.sh condor_startup.sh added
Changes are in v36/25076
Added a knob to set disk= in the slot type definition.
It seems that condor does not allow setting an absolute value (an advertised quota) for the disk used by the jobs.
<td> <b>GLIDEIN_DISK</b> </td> <TD> Str </td> <TD> EMPTY </TD> <TD> <P>Disk amount that the jobs running in the Glidein should use. This is used to configure the HTCondor SLOT_TYPE_ definitions that affect the value of Disk/TotalSlotDisk. Valid values are the ones accepted by HTCondor for the "disk=" keyword in the SLOT_TYPE_ definition. The default is an empty string (undefined), which is equivalent to "auto". It seems that for disk, valid values are the "auto" string (which lets HTCondor discover and handle the space), fractions or percentages, not absolute values. See the <a href="https://htcondor.readthedocs.io/en/latest/admin-manual/policy-configuration.html#dividing-system-resources-in-multi-core-machines">HTCondor manual</a>. </P> <p>NOTE: this setting does not ensure that a certain amount of disk space will be available. To do so you'll have to use request_disk in the submit_attr section </p> </TD>
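As an illustration of how a disk= value could be appended to the slot type definition, here is a minimal sketch modeled on the slot setup in condor_startup.sh. It is not the code in v36/25076: the function name slot_type_config and its two positional arguments (slots layout, GLIDEIN_DISK value) are hypothetical, and it prints the configuration lines instead of writing to "$CONDOR_CONFIG" directly. How a fraction or percentage should be shared among several fixed slots is left open here.

```shell
#!/bin/sh
# Hypothetical helper, not the actual patch.
# $1: slots layout ("partitionable", anything else means fixed slots)
# $2: GLIDEIN_DISK value; empty means leave the HTCondor default ("auto")
slot_type_config() {
    layout="$1"
    disk="$2"
    disk_spec=""
    # Only add the disk= keyword when a value was provided
    if [ -n "$disk" ]; then
        disk_spec=" disk=$disk"
    fi
    if [ "X$layout" = "Xpartitionable" ]; then
        echo "NUM_SLOTS = 1"
        echo "SLOT_TYPE_1 = cpus=\$(GLIDEIN_CPUS)$disk_spec"
        echo "NUM_SLOTS_TYPE_1 = 1"
        echo "SLOT_TYPE_1_PARTITIONABLE = True"
    else
        # fixed slots: one core each, GLIDEIN_CPUS slots
        echo "SLOT_TYPE_1 = cpus=1$disk_spec"
        echo "NUM_SLOTS_TYPE_1 = \$(GLIDEIN_CPUS)"
    fi
}
```

In the startup script this could be used as: slot_type_config "$slots_layout" "$GLIDEIN_DISK" >> "$CONDOR_CONFIG"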
To patch, the Factory operator needs to replace web-base/condor_startup.sh with the attached one and run an upgrade command.
It is safe to upgrade v3_6_x versions for v >= 3.6.2 (no other changes)
It is OK to upgrade v3_6_x versions older than 3.6.2 (minor changes)
#5 Updated by Marco Mambelli 6 months ago
- Assignee changed from Jeffrey Dost to Marco Mambelli
- Status changed from Feedback to Resolved
#6 Updated by Marco Mambelli 4 months ago
- Status changed from Resolved to Closed