Bug #10552

Partitionable glideins not accounted for correctly - not accounted for at all

Added by Marco Mambelli over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Start date:
10/16/2015
Due date:
% Done:

0%

Estimated time:
First Occurred:
Stakeholders:
Duration:

Description

This is a follow-up of issue #6897 that did not solve the problem.

Mats pointed out that the problem was not resolved and helped to identify it better:

1. the names in condor_status/Name include the partition id (e.g. slot1_2@...)
slot1@glidein_31618_384461100@uct2-c161.mwt2.org
slot1_1@glidein_31618_384461100@uct2-c161.mwt2.org
slot1_3@glidein_31618_384461100@uct2-c161.mwt2.org
slot1_4@glidein_31618_384461100@uct2-c161.mwt2.org

2. the names in condor_q/RemoteHost have only the slot (e.g. slot1@…)
3 slot1@glidein_31618_384461100@uct2-c161.mwt2.org

3. appendRealRunning in glideinFrontendLib.py is looking for a match (the dictionary key is the name from condor_status):
condor_status = status_dict[collector_name].fetchStored()
if remote_host in condor_status:
….

and is called with:
glideinFrontendLib.appendRealRunning(self.condorq_dict_running,
self.status_dict_types['Running']['dict'])

The parent slot (slot1@…) is not running any job itself, so it is not in the 'Running' list (otherwise the number of running glideins would be incorrect). The RemoteHost name therefore finds no match in appendRealRunning, and the job is not counted.
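
A minimal sketch of the mismatch, using the names quoted above. The dictionary here is illustrative only (the real one holds full ClassAds, and the "State" value is a placeholder); it just shows why the membership test fails:

```python
# Illustrative data only: keys mimic the condor_status Name values of the
# dynamic slots that are running jobs. The parent slot1@... is absent.
running_status = {
    "slot1_1@glidein_31618_384461100@uct2-c161.mwt2.org": {"State": "Claimed"},
    "slot1_3@glidein_31618_384461100@uct2-c161.mwt2.org": {"State": "Claimed"},
    "slot1_4@glidein_31618_384461100@uct2-c161.mwt2.org": {"State": "Claimed"},
}

# RemoteHost as reported by condor_q: only the parent slot name.
remote_host = "slot1@glidein_31618_384461100@uct2-c161.mwt2.org"

# Same kind of test as "if remote_host in condor_status:".
matched = remote_host in running_status
print(matched)  # False: the job is never counted
```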

I don’t know which is the correct way to solve this problem:
1. the job should report the exact slot in which it runs, and this is an HTCondor bug
2. the job is reporting only the parent slot, and GWMS should parse the collector entry to match it with the parent slot name

My preference would be for solution 1 if possible.
Solution 2 can be done within the GWMS code, but it works only if the jobs in the sub-slots are equivalent, because there is no easy way to match a job with the correct sub-slot. (I looked inside the ClassAds and I think the only way to match is via PublicClaimId, which I don't think is saved in the dictionaries; those would need to be changed as well.)

History

#1 Updated by Marco Mambelli over 4 years ago

Since GLIDEIN_Schedd, GLIDEIN_Entry_Name, GLIDEIN_Name and GLIDEIN_Factory depend on the submission and are all the same for the sub-slots of a partitionable glidein, solution 2 is possible.

It is in branch v3/10552 and attached as a patch.

#2 Updated by Parag Mhashilkar over 4 years ago

  • Target version set to v3_2_12

#3 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_2_12 to v3_2_13

#4 Updated by Burt Holzman over 4 years ago

Hi Marco - this is what I just e-mailed about; I didn't know you had already created a bug for it.
I don't think #1 is the answer - HTCondor has always reported the parent slot as RemoteHost.
Isn't it trivial to match RemoteHost with the machine Name since we know the form is slotX_Y?
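
A rough sketch of that kind of name matching, assuming dynamic slot names always have the form slotX_Y@... and the parent is slotX@... as in the examples in the description. The helper names (parent_slot_name, index_by_parent) are hypothetical and not part of glideinFrontendLib:

```python
import re

# Assumption: a dynamic slot is named "slot<X>_<Y>@<rest>" and its parent
# partitionable slot is "slot<X>@<rest>".
_DYNAMIC_SLOT_RE = re.compile(r"^(slot\d+)_\d+@")


def parent_slot_name(name):
    """Map a dynamic slot name to its parent; other names pass through unchanged."""
    return _DYNAMIC_SLOT_RE.sub(r"\1@", name, count=1)


def index_by_parent(status_names):
    """Group condor_status names under their parent slot name."""
    by_parent = {}
    for name in status_names:
        by_parent.setdefault(parent_slot_name(name), []).append(name)
    return by_parent


status_names = [
    "slot1_1@glidein_31618_384461100@uct2-c161.mwt2.org",
    "slot1_3@glidein_31618_384461100@uct2-c161.mwt2.org",
    "slot1_4@glidein_31618_384461100@uct2-c161.mwt2.org",
]
remote_host = "slot1@glidein_31618_384461100@uct2-c161.mwt2.org"

# RemoteHost now finds the dynamic slots that belong to its parent.
matches = index_by_parent(status_names).get(remote_host, [])
print(len(matches))  # 3
```

This identifies the set of sub-slots under a parent, not the specific sub-slot a given job runs in.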

#5 Updated by Marco Mambelli over 4 years ago

In the meeting with the condor team on 12/11, Zach and Todd explained that dynamic slots are created only for the match of the job (they exist only while claimed), so it is preferable to use the parent partitionable slot for RemoteHost. In the future a new attribute may be added to the job to track the actual slot.
This means that option #1 is not viable.

#6 Updated by Marco Mambelli over 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Burt Holzman
  • Target version changed from v3_2_13 to v3_2_12

New changes are in v3/10552_v2; ignore the changes in v3/10552.

In this version, each partitionable glidein is counted as 1 for total, as 1 running glidein if it has at least one dynamic slot, and as 1 idle glidein if it has enough CPU and memory (cpu>0 and memory>2500MB; these limits have been imposed by CMS).
Note that sometimes idle+running != total (a partitionable glidein may be counted as both running and idle).
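
A small sketch of this counting rule, not the code in the branch. The record layout (keys 'cpus', 'memory_mb', 'dynamic_slots') is hypothetical, and it assumes that cpu and memory refer to the resources still unclaimed on the parent slot:

```python
# Thresholds from the rule above: idle requires cpu > 0 and memory > 2500 MB.
MIN_IDLE_CPUS = 0
MIN_IDLE_MEMORY_MB = 2500


def count_partitionable(glideins):
    """Count total/running/idle over hypothetical partitionable-glidein records."""
    counts = {"total": 0, "running": 0, "idle": 0}
    for g in glideins:
        counts["total"] += 1
        # Running: at least one dynamic slot has been carved out.
        if g["dynamic_slots"] > 0:
            counts["running"] += 1
        # Idle: enough leftover resources to start another job.
        if g["cpus"] > MIN_IDLE_CPUS and g["memory_mb"] > MIN_IDLE_MEMORY_MB:
            counts["idle"] += 1
    return counts


glideins = [
    {"cpus": 2, "memory_mb": 4000, "dynamic_slots": 2},   # running and idle
    {"cpus": 8, "memory_mb": 16000, "dynamic_slots": 0},  # idle only
    {"cpus": 0, "memory_mb": 0, "dynamic_slots": 4},      # running only
]
counts = count_partitionable(glideins)
print(counts)  # {'total': 3, 'running': 2, 'idle': 2}
```

Here idle+running is 4 while total is 3, because the first glidein is counted in both categories.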

Are the conditions in the selection and in the count OK, and do they not slow things down too much?

#7 Updated by Marco Mambelli over 4 years ago

  • Assignee changed from Burt Holzman to HyunWoo Kim

#8 Updated by HyunWoo Kim over 4 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from HyunWoo Kim to Marco Mambelli

I reviewed the 2 files that have changed
and have two comments; see below.

1. frontend/glideinFrontendElement.py
only the changes needed because countCoresCondorStatus has changed its signature.

2. frontend/glideinFrontendLib.py
def getIdleCondorStatus
one change in the use of the dictionary get method
Suggestion> at line # 585 there is a comment line, "# None != True, no need to set default".
Shouldn't this go above line # 585?

def getRunningCondorStatus
improvement in the logic

def getFailedCondorStatus
just line and indentation changes

def getIdleCoresCondorStatus
removed the redundant method body and redirected to getIdleCondorStatus
Suggestion> the comments for this method may be somewhat obsolete now.
Why don't we explain in more detail why and how these two methods, getIdleCoresCondorStatus and getIdleCondorStatus,
have the same logic, and therefore that the redundant part has been removed and this method now redirects to getIdleCondorStatus?

def countCoresCondorStatus
added second argument to cover TotalCores, IdleCores, RunningCores

#9 Updated by Marco Mambelli over 4 years ago

  • Status changed from Assigned to Resolved

#10 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

