Partitionable glideins not accounted for correctly - not accounted for at all
This is a follow-up to issue #6897, which did not solve the problem.
Mats pointed out that the problem was not resolved and helped to identify it better:
1. the names in condor_status/Name include the partition id (e.g. slot1_2@...)
2. the names in condor_q/RemoteHost have only the slot (e.g. slot1@…)
3. appendRealRunning in glideinFrontendLib.py is looking for a match (the dictionary key is the name from condor_status):
condor_status = status_dict[collector_name].fetchStored()
if remote_host in condor_status:
and is called with:
The parent slot (slot1@…) is not running any job, so it is not in the list (otherwise the number of running glideins would be incorrect); therefore it is not matched in appendRealRunning and is not counted.
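The mismatch can be reproduced with plain dictionaries. This is a minimal sketch, not the GWMS code: the host name and the dictionary contents are made up for illustration, and only the `in` lookup mirrors the snippet quoted above.

```python
# Hypothetical data: the keys mimic condor_status Name values of dynamic slots.
condor_status = {
    "slot1_1@glidein_1234@node.example.org": {"State": "Claimed"},
    "slot1_2@glidein_1234@node.example.org": {"State": "Claimed"},
}

# condor_q RemoteHost carries only the parent slot name.
remote_host = "slot1@glidein_1234@node.example.org"

# Same kind of membership test as in the snippet above: it fails for every
# job that runs in a dynamic slot, so the glidein is never counted.
matched = remote_host in condor_status
print(matched)
```

With these inputs the lookup fails, even though two jobs are running on the glidein.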
Now I don’t know which is the correct path to solve this problem:
1. the job should report the exact slot in which it runs and this is a HTCondor bug
2. the job is reporting only the parent slot and GWMS should parse the collector entry to match with the parent slot name
My preference would be for solution 1 if possible.
Solution 2 can be done within the GWMS code, but it works only if the jobs in the sub-slots are equivalent, because there is no easy way to match a job with the correct sub-slot. (I looked inside the ClassAds and I think the only way to match is via PublicClaimId, which I don't think is saved in the dictionaries; those would need to be changed as well.)
#1 Updated by Marco Mambelli over 4 years ago
- File 0001-matching-the-main-slot-for-partitionable-slots.patch added
Since GLIDEIN_Schedd, GLIDEIN_Entry_Name, GLIDEIN_Name and GLIDEIN_Factory depend on the submission and are all the same for the sub-slots of a partitionable glidein, solution 2 is possible.
It is in branch v3/10552 and attached as a patch.
#4 Updated by Burt Holzman over 4 years ago
Hi Marco - this is what I just e-mailed about; I didn't know you had already created a bug for it.
I don't think #1 is the answer - HTCondor has always reported the parent slot as RemoteHost.
Isn't it trivial to match RemoteHost with the machine Name since we know the form is slotX_Y?
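One way to do the matching suggested here is to strip the partition id from the machine Name before comparing it with RemoteHost. The sketch below assumes that names have the form slotX_Y@... for dynamic slots and slotX@... for the parent; `parent_slot_name` is a hypothetical helper, not a function from glideinFrontendLib.py.

```python
import re

# Assumed name form: slotX_Y@rest for a dynamic slot, slotX@rest for the parent.
# The pattern is anchored at the start, so underscores later in the name
# (for example in the glidein or host part) are left untouched.
_DYNAMIC_SLOT_RE = re.compile(r"^(slot\d+)_\d+@")


def parent_slot_name(name):
    """Return the parent slot name for a dynamic slot name (hypothetical helper).

    Names that do not look like dynamic slots are returned unchanged.
    """
    return _DYNAMIC_SLOT_RE.sub(r"\1@", name, count=1)


# Example with made-up host names.
print(parent_slot_name("slot1_2@glidein_1_2@node.example.org"))
print(parent_slot_name("slot1@glidein_1_2@node.example.org"))
```

Both calls give the same parent name, which can then be compared directly with RemoteHost.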
#5 Updated by Marco Mambelli over 4 years ago
In the meeting with the condor team on 12/11, Zach and Todd explained that dynamic slots are created only for the match of the job (they exist only while claimed), so it is preferable to use the parent partitionable slot for RemoteHost. A new attribute may be added to the job in the future to track the actual slot.
This means that option #1 is not viable.
#6 Updated by Marco Mambelli over 4 years ago
- Status changed from New to Feedback
- Assignee changed from Marco Mambelli to Burt Holzman
- Target version changed from v3_2_13 to v3_2_12
New changes are in v3/10552_v2; ignore the changes in v3/10552.
In this version partitionable glideins are counted as 1 for total, as 1 running glidein if there is at least one dynamic slot, and as 1 idle glidein if they have enough cpu and memory (cpu>0 and memory>2500MB; these limits have been imposed by CMS).
Note that sometimes idle+running != total (partitionable glideins may be counted as both running and idle).
Are the conditions in the selection and in the count OK, and are they not slowing things down too much?
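The counting rules described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in v3/10552_v2: the record fields (`dynamic_slots`, `cpus`, `memory_mb`) and the function name are hypothetical, and `cpus`/`memory_mb` are assumed to be the resources still available on the partitionable slot.

```python
MIN_IDLE_MEMORY_MB = 2500  # memory limit quoted above (memory > 2500 MB)


def count_partitionable(glideins):
    """Count partitionable glideins as total/running/idle (hypothetical helper).

    Each glidein is a dict with the assumed keys 'dynamic_slots', 'cpus'
    and 'memory_mb'.
    """
    counts = {"total": 0, "running": 0, "idle": 0}
    for glidein in glideins:
        counts["total"] += 1  # every partitionable glidein counts once
        if glidein["dynamic_slots"] >= 1:
            counts["running"] += 1  # at least one dynamic slot
        if glidein["cpus"] > 0 and glidein["memory_mb"] > MIN_IDLE_MEMORY_MB:
            counts["idle"] += 1  # enough cpu and memory left
    return counts


# Made-up example: the second glidein is both running and idle.
example = [
    {"dynamic_slots": 2, "cpus": 0, "memory_mb": 1000},
    {"dynamic_slots": 1, "cpus": 4, "memory_mb": 8000},
    {"dynamic_slots": 0, "cpus": 8, "memory_mb": 16000},
]
result = count_partitionable(example)
print(result)
```

In this example a glidein with one dynamic slot and spare resources is counted in both categories, which is why idle+running can exceed total.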
#8 Updated by HyunWoo Kim over 4 years ago
- Status changed from Feedback to Assigned
- Assignee changed from HyunWoo Kim to Marco Mambelli
I reviewed the 2 files that have changed
and I have two comments, see below.
only the changes as countCoresCondorStatus has changed its signature.
one change in the use of dictionary get method
Suggestion> one comment is, in line # 585, there is a comment line, # None != True, no need to set default
Shouldn't this go above line # 585?
improvement in the logic
just lines and indentations
removed the redundant method body and redirected to getIdleCondorStatus
Suggestion> the comments for this method might be a bit obsolete now.
Why don't we explain in more detail why and how these two methods, getIdleCoresCondorStatus and getIdleCondorStatus,
have the same logic, and therefore that the redundant part has been removed and this method is redirected to getIdleCondorStatus?
added second argument to cover TotalCores, IdleCores, RunningCores