Bug #11407

Glidein is not setting the number of CPUs according to GLIDEIN_CPUS when using partitionable slots

Added by Marco Mambelli over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 01/12/2016
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

The GWMS documentation states that the number of CPUs seen by condor is set to the value of GLIDEIN_CPUS, if specified.
This was not happening with partitionable slots: the generated config was missing the line:

NUM_CPUS = \$(GLIDEIN_CPUS)
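
As an illustration of the missing step, here is a minimal sketch of how a startup script could append that line to the generated config whenever GLIDEIN_CPUS holds a positive integer. The function name append_num_cpus and its arguments are hypothetical and not taken from the glideinWMS code. The sketch also writes GLIDEIN_CPUS itself into the config, on the assumption that the macro is not already defined there.

```shell
#!/bin/sh
# Hypothetical helper (not the actual glideinWMS code): append NUM_CPUS to a
# generated condor config when GLIDEIN_CPUS is a positive integer.
#   $1 = path of the generated config file (assumed name)
#   $2 = value of GLIDEIN_CPUS
append_num_cpus() {
    config_file="$1"
    glidein_cpus="$2"

    # Reject unset values and anything that is not a plain integer
    # (for example "auto", which needs CPU detection first).
    case "$glidein_cpus" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$glidein_cpus" -gt 0 ] || return 1

    # Define the macro, then make NUM_CPUS follow it. The single quotes keep
    # the shell from expanding $(GLIDEIN_CPUS), which condor resolves later.
    printf 'GLIDEIN_CPUS = %s\n' "$glidein_cpus" >> "$config_file"
    printf 'NUM_CPUS = $(GLIDEIN_CPUS)\n' >> "$config_file"
}
```

The key point is that the line is written regardless of the slot layout, so partitionable slots get the same NUM_CPUS setting as static ones.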

This problem is more evident when using nodes with an artificially increased number of CPUs.
E.g. the testing VM has 2 physical cores but is configured with NUM_CPUS=16 and NUM_SLOTS=2, so it should have 2 slots with 8 artificial cores each.
When NUM_CPUS was not set in the glidein, condor detected the 2 physical cores rather than any of the configured values: the 4 cores in the job request (request_cpus=4), the 8 cores of the glidein, the 4 cores of GLIDEIN_CPUS, or the 16 cores of the receiving startd.
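
The per-slot figure in the example is just the configured CPU count divided evenly among the slots. A small sketch of that arithmetic, assuming an even split (the helper name cpus_per_slot is made up for illustration):

```shell
#!/bin/sh
# Illustrative arithmetic only: CPUs per slot when NUM_CPUS is divided
# evenly among NUM_SLOTS slots, as in the testing VM described above.
#   $1 = NUM_CPUS, $2 = NUM_SLOTS
cpus_per_slot() {
    [ "$2" -gt 0 ] || return 1      # avoid division by zero
    echo $(( $1 / $2 ))
}

cpus_per_slot 16 2   # the testing VM: 16 artificial cores, 2 slots
```

For the testing VM this gives 8 cores per slot, while the glidein was seeing only the 2 physical cores.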

There may also be a problem in HTCondor here: CPU detection should possibly see the CPUs offered by the slot it is starting in. Will check w/ condor developers.

History

#1 Updated by Marco Mambelli over 4 years ago

GLIDEIN_CPUS can also be set to 'auto'/0, which currently detects the physical CPUs (same behavior as condor when NUM_CPUS is not set):

elif [ "${GLIDEIN_CPUS}" = "0" ]; then
    # detect the number of cores
    core_proc=`awk -F: '/^physical/ && !ID[$2] { P++; ID[$2]=1 }; /^physical/ { N++ };  END { print N, P }' /proc/cpuinfo`
    cores=`echo "$core_proc" | awk -F' ' '{print $1}'`
    if [ "$cores" = "" ]; then
        # Old style, no multiple cores or hyperthreading
        cores=`grep processor /proc/cpuinfo  | wc -l`
    fi
    GLIDEIN_CPUS="$cores" 
fi
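
For experimenting with this detection outside the glidein, the same idea can be wrapped in a standalone function. This is a sketch, not the glideinWMS code: the name detect_cpus and the optional file argument are assumptions added so it can be run against a saved copy of /proc/cpuinfo. Like the fragment above, it counts the lines starting with "physical" (one "physical id" line per processor entry) and falls back to counting "processor" lines when there are none.

```shell
#!/bin/sh
# Sketch of the detection above as a reusable function (hypothetical name).
#   $1 = cpuinfo file to read, defaults to /proc/cpuinfo
detect_cpus() {
    cpuinfo="${1:-/proc/cpuinfo}"

    # One "physical id" line is expected per processor entry; "n + 0"
    # makes awk print 0 instead of an empty string when none are found.
    cores=$(awk '/^physical/ { n++ } END { print n + 0 }' "$cpuinfo")

    if [ "$cores" -eq 0 ]; then
        # Old style, no multiple cores or hyperthreading: count the
        # "processor" entries instead (anchored to the start of the line).
        cores=$(grep -c '^processor' "$cpuinfo" || true)
    fi

    echo "$cores"
}
```

Calling detect_cpus with no argument reads /proc/cpuinfo on the local node. Passing a file path lets the result be compared across nodes without logging in to each one.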

It should preferably detect the CPUs made available via HTCondor (or the underlying job manager, which would be more in tune w/ dynamic provisioning), but I leave this for another ticket. Must check w/ condor how to detect that (condor_config_val NUM_CPUS?).

#2 Updated by Marco Mambelli over 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Bug fixed in branch v3/11407.
Opened [#11408] to address the new feature (auto-discovery based on job manager offerings).

#3 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Looks good to merge.

#4 Updated by Marco Mambelli over 4 years ago

  • Status changed from Feedback to Resolved

#5 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

