Glideins are not submitted because of errors in the counting of total, idle and running jobs when partitionable slots are involved.
No new glidein was submitted to hepclpud when there were 4 running jobs, 4 idle jobs (4 cores each) and only 2 glideins running (8 cores each, running 2 jobs each).
The frontend request was setting max_running to 16 cores (the ones requested by the idle jobs), without adding the 16 cores already in use by the running jobs.
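The arithmetic of the miscount can be sketched as follows (illustrative variable names, not the actual frontend code):

```python
# Sketch of the max_running miscount reported above (illustrative, not frontend code).
# Scenario: 2 running glideins (8 cores each, 2 jobs each) and 4 idle jobs
# requesting 4 cores each.
idle_jobs = 4
cores_per_idle_job = 4
running_cores = 2 * 8  # 2 glideins x 8 cores

requested_cores = idle_jobs * cores_per_idle_job  # 16 cores for the idle jobs

# Buggy behavior: max_running covers only the cores requested by idle jobs...
buggy_max_running = requested_cores  # 16, already satisfied by the running cores

# ...so no new glideins are requested. The expected value also counts the
# cores already busy with running jobs:
expected_max_running = requested_cores + running_cores  # 32
```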
The issue is complex and includes multiple problems; I'm opening this umbrella ticket to track them together:
- partitionable slots are not counted correctly [#10552]
- status dictionaries refer to the idle glideins instead of the total [#11144]
The accounting of cores is clearer (each core is either used to run a job or not).
The accounting of glideins is less clear for partitionable glideins:
- when a partitionable glidein still has some free cores (the parent partitionable slot is idle), should it be counted as idle?
- when some cores of the partitionable glidein become dynamic slots and are running jobs, should this be counted as one running glidein, or as multiple running glideins (one per dynamic slot)?
#1 Updated by Marco Mambelli over 4 years ago
Recent tickets that modified the way glideins and cores are counted are:
- [#7903] Removed the partitionable slot (parent of dyn MC) from the running list
- [#9884] Subtract idle glideins from prop_mc (the number of idle jobs, divided by the number of credentials)
glidein_max_run = int((max(prop_jobs['Idle'] - idle_glideins, 0) + real) * self.fraction_running + 1)
Should idle_glideins be divided by the number of credentials as well?
E.g. 3 credentials, 12 idle jobs, 3 idle glideins, 0 running (fraction_running=1 for ease), for each credential:
- prop_jobs['Idle'] = 12/3 = 4
- glidein_max_run = ((4-3)+0)*1+1 = 2
- this would request 6 glideins (thanks to the +1 increment; otherwise 3) instead of 9 or more.
It gets worse for bigger numbers, where the +1 makes less of a difference.
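The worked example above can be checked with a small sketch; the function mirrors the formula quoted earlier, and the division of idle_glideins by the number of credentials is the proposal, not merged code:

```python
# Sketch of the glidein_max_run formula from the snippet above.
def glidein_max_run(prop_idle, idle_glideins, real, fraction_running=1.0):
    return int((max(prop_idle - idle_glideins, 0) + real) * fraction_running + 1)

credentials = 3
idle_jobs = 12
idle_glideins = 3
real = 0  # running glideins; fraction_running=1 for ease

prop_idle = idle_jobs // credentials  # prop_jobs['Idle'] = 12/3 = 4 per credential

# Current behavior: the full idle_glideins count is subtracted for each credential.
current = glidein_max_run(prop_idle, idle_glideins, real)  # ((4-3)+0)*1+1 = 2
total_current = current * credentials                      # only 6 glideins requested

# Proposed fix: divide idle_glideins among the credentials as well.
proposed = glidein_max_run(prop_idle, idle_glideins // credentials, real)  # ((4-1)+0)*1+1 = 4
total_proposed = proposed * credentials                    # 12, matching the 12 idle jobs
```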
#2 Updated by Marco Mambelli over 4 years ago
- Assignee set to Marco Mambelli
- Target version set to v3_2_13
[#10552] and [#11144] have been solved.
Counters should now be correct; test that heavily and check with stakeholders that the current values are OK.
Specifically in the case of partitionable slots:
- the slot is counted as IDLE as long as it has at least one (idle) core and enough RAM (>2.5GB)
- the slot is counted as RUNNING as long as at least one core is busy running a job (= it has at least 1 dynamic sub-slot = not all its cores are idle)
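The two rules above can be sketched as predicates (field names and the exact memory threshold are illustrative, not actual HTCondor ClassAd handling):

```python
# Sketch of the stated classification rules for a partitionable slot.
MIN_MEMORY_MB = 2560  # the ~2.5 GB RAM threshold mentioned above (assumed value)

def is_idle(free_cpus, free_memory_mb):
    # Counted as IDLE while it still has at least one free core and enough RAM.
    return free_cpus >= 1 and free_memory_mb > MIN_MEMORY_MB

def is_running(num_dynamic_slots):
    # Counted as RUNNING as soon as at least one dynamic sub-slot exists,
    # i.e. at least one core is busy running a job.
    return num_dynamic_slots >= 1
```

Note that an 8-core pslot running 2 jobs (2 dynamic slots, 6 free cores) satisfies both predicates, so it appears in both lists; this is exactly why the counts fluctuate without glideins starting or ending.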
The number of glideins can increase even if no new glideins are submitted, and decrease without glideins actually ending. This is because a partitionable-slot glidein (with more than one core) is idle when in the queue, is both idle and running (counted in both lists) when running >=1 jobs, and is idle again when the jobs complete. I.e., the glidein counted in the running list comes and goes.
Dynamic slots are not counted.
The number of running glideins is not the number of running slots, and is not the number of running user jobs.
Counting the dynamic slots as running and not counting the partitionable slot would be a more accurate count of the running jobs.
In that regard, HTCondor is changing the names in condor_q/RemoteHost to have the full slot name (e.g. slot1_2@...) instead of the current parent slot name (e.g. slot1@...).
This is available in current HTCondor 8.5.
This will make it possible to compare the RemoteHost of dynamic slots without having to modify the string (which is slow).
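The string surgery being avoided looks roughly like this (a sketch; the helper name is illustrative):

```python
# Sketch: recovering the parent pslot name from a dynamic slot name,
# which is what the old RemoteHost naming forced the frontend to do.
def parent_slot(remote_host):
    # "slot1_2@glidein123.example.com" -> "slot1@glidein123.example.com"
    name, host = remote_host.split('@', 1)
    return name.split('_', 1)[0] + '@' + host

old_style = 'slot1@glidein123.example.com'    # parent pslot name (pre-8.5 RemoteHost)
new_style = 'slot1_2@glidein123.example.com'  # full dynamic slot name (HTCondor 8.5)

# With the 8.5 naming the dynamic slot name can be compared directly;
# the parent can still be derived when needed.
parent = parent_slot(new_style)
```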
The questions open are:
- is this working as expected?
- is it OK not to count the dynamic slots?
- code and documentation should be reviewed to use correctly the terms cpu/core, glidein, slot
#6 Updated by Marco Mambelli over 4 years ago
After the counter review, the jobs are handled differently.
There are 2 structures for the running jobs: one with static running slots and dynamic slots; one also including the pslots that have some dynamic slot (i.e., have cores used for running jobs).
The second one is used when jobs have to identify which entry they run on (the jobs have the parent pslot name as remote_host, not the name of the dynamic slot).
The counters count all the slots, including all the dynamic slots.
So dynamic slots count against the limits as well.
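The two structures can be sketched as follows (illustrative names and slot strings; not the actual GlideinWMS data model):

```python
# Sketch of the two running-job structures described above.
# Structure 1: static running slots plus dynamic slots.
static_and_dynamic = {'slot2@host', 'slot1_1@host', 'slot1_2@host'}

# Structure 2: also includes pslots that have dynamic children
# (slot1@host is the parent of slot1_1 and slot1_2).
with_parent_pslots = static_and_dynamic | {'slot1@host'}

def entry_has_slot(remote_host, slots=with_parent_pslots):
    # Jobs report the parent pslot name as remote_host, so matching a job
    # to its entry needs the second structure.
    return remote_host in slots
```

A job with remote_host `slot1@host` matches only in the second structure, which is why that one is used for job-to-entry matching, while the first (slots only) drives the counters and limits.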