
Bug #14869

Hardening available cores auto detection

Added by Marco Mambelli over 2 years ago. Updated 12 months ago.

Status: New
Priority: Normal
Category: -
Target version: -
Start date: 12/20/2016
Due date: -
% Done: 0%
Estimated time: -
First Occurred: -
Occurs In: -
Stakeholders: -
Duration: -

Description

Auto-detection of available cores did not work correctly.
Brian suggested troubleshooting and verifying the detection by submitting test glideins to Tusker, Crane, and other sites.

Here is the email exchange with Brian after the default auto-detection caused problems:

There were a few problematic EU hosts.  If you want to debug, try submitting pilots to Tusker and Crane at Nebraska - both were affected by incorrect auto-detection.  I believe you should be able to submit pilots there?

I think condor is already auto detecting all the memory it has available.

Do you mean, when running inside an HTCondor batch system, the pilot is detecting the memory allocated to it by the HTCondor batch system (good!)?

Or do you mean it is literally detecting all available host memory (bad!)?
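
For contrast, a minimal sketch of the "literally all available host memory" detection (the bad case), assuming a Linux host where /proc/meminfo is readable; inside a batch slot this over-reports what the pilot is actually allowed to use. The function name host_memory_mb is illustrative only.

    def host_memory_mb():
        """Total physical memory of the host in MB, read from /proc/meminfo.
        Inside a batch slot this over-reports what the glidein may use."""
        with open("/proc/meminfo") as fd:
            for line in fd:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) // 1024  # MemTotal is in kB
        raise RuntimeError("MemTotal not found in /proc/meminfo")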

I do not quite understand the second point.
Auto-detection does not affect the request to the host system; that is controlled through the RSL or condor attributes.
In any case, the glidein receives many cores that would not be used.
Is your point that, since the slots requested (in the factory RSL/condor attributes) have too many cores compared to the available memory and the requirements of the jobs, the default equal splitting for condor static slots creates only unusable slots (instead of at least a few usable ones)?
We recommend partitionable slots and would like to know if there are any drawbacks to using them.

Well, the p-slots should also know how much memory they should utilize.

Should we change something in how static slots are split?
We could add a minimum memory requirement and split using the minimum of the available cores and the number of slots the available memory can support.
Any suggestion?
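
A minimal sketch of the splitting heuristic proposed above, assuming a hypothetical min_memory_per_slot_mb requirement: the number of static slots is the minimum of the detected cores and the number of slots the available memory can support.

    def static_slot_count(cores, total_memory_mb, min_memory_per_slot_mb):
        """Number of static slots: the min of the available cores and the
        number of slots that satisfy the minimum memory requirement."""
        slots_by_memory = total_memory_mb // min_memory_per_slot_mb
        return max(1, min(cores, slots_by_memory))

    # Example: 24 detected cores but only 48 GB of RAM with a 4 GB/slot minimum.
    # Equal splitting over 24 cores gives 2 GB slots, all unusable for 4 GB jobs;
    # the heuristic yields min(24, 12) = 12 usable slots.
    print(static_slot_count(24, 48 * 1024, 4 * 1024))  # -> 12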

I would suggest looking at $_CONDOR_MACHINE_AD and, when in "auto-detect" mode, setting the memory used by the glidein equal to the Memory allocated in $_CONDOR_MACHINE_AD.

Long-term, I would _love_ to allocate the entire host at Nebraska to CMS, as we have 2.0-5.0 GB RAM / core.  As workflows diversify, I believe providing access to the large-memory-per-core hosts will be a significant service to CMS.  However, we currently have to set up a new entry point per hardware type - not scalable at all!
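
A minimal sketch of the $_CONDOR_MACHINE_AD suggestion, assuming the pilot runs under an HTCondor batch system that exports that variable (the path to a file containing the slot's machine ClassAd): read the Memory attribute from the file and use it as the glidein's memory, falling back to host detection when the variable is not set. The function name allocated_memory_mb is illustrative only.

    import os
    import re

    def allocated_memory_mb():
        """Memory (MB) allocated to this job by the parent HTCondor batch
        system, read from the machine ClassAd file pointed to by
        $_CONDOR_MACHINE_AD; None when not running under HTCondor."""
        ad_file = os.environ.get("_CONDOR_MACHINE_AD")
        if not ad_file or not os.path.isfile(ad_file):
            return None
        with open(ad_file) as fd:
            for line in fd:
                # ClassAd attributes look like:  Memory = 2500
                match = re.match(r"\s*Memory\s*=\s*(\d+)\s*$", line)
                if match:
                    return int(match.group(1))
        return None

    mem = allocated_memory_mb()
    print(mem if mem is not None else "not running under an HTCondor slot")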

History

#1 Updated by Parag Mhashilkar over 2 years ago

  • Assignee set to Marco Mambelli
  • Target version set to v3_3_x

#2 Updated by Marco Mambelli about 2 years ago

A couple of tickets in 3.2.19 fixed a bug in cores configuration and improved cores auto detection: [#16151], [#16147]

#3 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_3_x to v3_4_x

#4 Updated by Marco Mambelli 12 months ago

  • Target version changed from v3_4_x to v3_5_x
