Improve glideins scale down
In CMS, glideins ramp up very quickly but do not ramp down as well.
There is a "hysteresis" in ramp-down cycles that leaves cores unused.
This could be caused by idle glideins left in the queue at the factory or at the remote site, by fragmentation (only some of a glidein's cores are used by CMS jobs), or by something else we have not considered.
Ramp-down inefficiencies account for roughly 5% on average; on some days they can reach 30% or more and are very visible.
They vary from site to site. CERN is very efficient in ramp-down as well.
#4 Updated by Marco Mascheroni over 3 years ago
- Assignee changed from Marco Mascheroni to Marco Mambelli
As per the meeting discussion, I checked that:
1) As we thought, there is some code to remove held glideins (see sanitizeGlideins). AFAIU it is called if the number of held glideins for a frontend group is greater than the limit configured in the frontend itself. The method is fairly intelligent and detects whether a held glidein is recoverable or unrecoverable (see isGlideinUnrecoverable). It should also remove idle and running glideins under some conditions, but I have not investigated that much, since it is only triggered if the number of held glideins is high.
2) I have also checked that we actually release held glideins (in this sanitizeGlideins method) if they are recoverable and within the limits.
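To make the two points above concrete, here is a minimal sketch of the triage as I understand it. It is not the factory code: HeldGlidein, triage_held and the three result lists are hypothetical names, the recoverable flag stands in for whatever isGlideinUnrecoverable decides, and the limit of 10 is only an assumed default. What actually happens to recoverable glideins that are over the limit has not been verified.

```python
from dataclasses import dataclass


@dataclass
class HeldGlidein:
    """Hypothetical stand-in for a held glidein job as seen in the queue."""
    job_id: str
    num_holds: int
    recoverable: bool  # stands in for the result of the recoverability check


def triage_held(held, max_release_count=10):
    """Split held glideins into three groups (illustrative only).

    Returns (release, over_limit, unrecoverable) as lists of job IDs:
    - release: recoverable and held no more than max_release_count times
    - over_limit: recoverable but held too many times (fate not verified)
    - unrecoverable: candidates for removal
    """
    release, over_limit, unrecoverable = [], [], []
    for glidein in held:
        if not glidein.recoverable:
            unrecoverable.append(glidein.job_id)
        elif glidein.num_holds > max_release_count:
            over_limit.append(glidein.job_id)
        else:
            release.append(glidein.job_id)
    return release, over_limit, unrecoverable


held = [
    HeldGlidein("1.0", num_holds=2, recoverable=True),
    HeldGlidein("2.0", num_holds=11, recoverable=True),
    HeldGlidein("3.0", num_holds=1, recoverable=False),
]
release, over_limit, unrecoverable = triage_held(held)
```

With the sample data, only glidein 1.0 ends up in the release group; 2.0 exceeds the assumed hold limit and 3.0 is flagged as unrecoverable.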
#5 Updated by Marco Mambelli over 3 years ago
I'm elaborating a bit more on Marco's comment.
There is no periodic release expression.
Glideins are released explicitly in sanitizeGlideins (the same function that also removes glideins),
which calls releaseGlideins(condorq.schedd_name, limited_held_list, log=log, factoryConfig=factoryConfig)
where held_list = extractRecoverableHeldSimple(condorq, factoryConfig=factoryConfig) and
limited_held_list = extractRecoverableHeldSimpleWithinLimits(condorq, factoryConfig=factoryConfig)
This uses isGlideinWithinHeldLimits(),
which checks num_holds > factoryConfig.max_release_count, but max_release_count seems to be a hardcoded limit (10),
and applies heuristics based on the type of error.
isGlideinHeldNTimes checks how many times the job was held.
Setting max_per_cycle to 0 may stop releases for the entry: <release max_per_cycle="0" sleep="0.2"/>
since factoryConfig.max_releases in releaseGlideins seems to come from max_per_cycle.
This should be tested.
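A rough sketch of why a per-cycle cap of 0 would be expected to stop releases. This is not releaseGlideins itself: select_for_release is a hypothetical helper, and it assumes the cap is applied as a plain upper bound on the number of glideins released per cycle. Whether the real code treats 0 as "release none" rather than, say, "no limit" is exactly what needs to be tested.

```python
def select_for_release(held_ids, max_per_cycle):
    """Return the job IDs that would be released in one cycle.

    Hypothetical helper: assumes max_per_cycle is a simple upper bound,
    so a value of 0 (or a negative value) selects nothing.
    """
    cap = max(0, max_per_cycle)
    return list(held_ids)[:cap]


held_ids = ["1.0", "2.0", "3.0"]
released_capped = select_for_release(held_ids, max_per_cycle=2)
released_none = select_for_release(held_ids, max_per_cycle=0)
```

Under that assumption, a cap of 2 releases the first two held glideins and a cap of 0 releases none, which is the behavior hoped for when setting max_per_cycle="0" on the entry.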