Feature #16414

Improve glideins scale down

Added by Marco Mambelli over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: High
Category: -
Target version:
Start date: 05/04/2017
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS
Duration:

Description

In CMS, glideins ramp up very quickly but do not ramp down as well.
There is a "hysteresis" in ramp-down cycles that leaves cores unused.
This could be caused by idle glideins left in the queue at the factory or at the remote site, by fragmentation (only part of the glidein cores are used by CMS jobs), or by something else we have not considered.

Ramp-down inefficiencies account for roughly 5% on average; on some days they can reach 30% or more and are very visible.

They vary from site to site. CERN is very efficient in ramp-down as well.


Related issues

Related to GlideinWMS - Bug #24806: Missing documentation for GLIDEIN_IDLE_LIFETIME (New, 08/17/2020)

History

#1 Updated by Marco Mambelli over 3 years ago

  • Assignee set to Marco Mascheroni
  • Target version set to v3_2_19

#2 Updated by Marco Mascheroni over 3 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mascheroni to Marco Mambelli

#3 Updated by Marco Mambelli over 3 years ago

  • Assignee changed from Marco Mambelli to Marco Mascheroni

#4 Updated by Marco Mascheroni over 3 years ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli

As per the meeting discussion, I checked that:

1) As we thought, there is some code to remove held glideins (see sanitizeGlideins). AFAIU it is called if the number of held glideins for a frontend group is greater than the limit configured in the frontend itself. The method is fairly intelligent and detects whether a held glidein is recoverable or unrecoverable (see isGlideinUnrecoverable). It should also remove idle and running glideins under some conditions, but I have not investigated that much, since it is only triggered if the number of held glideins is high.

2) I have also checked that we actually release held glideins (in this sanitizeGlideins method) if they are recoverable and within the limits.
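
To make the two points above concrete, here is a minimal sketch of the triage as described in this comment. It is not the GlideinWMS code: the Glidein class, the triage_held function, the hold_reason field and the reason strings are all hypothetical names chosen for illustration, and the real heuristics live in isGlideinUnrecoverable.

```python
from dataclasses import dataclass


@dataclass
class Glidein:
    job_id: str
    held: bool
    hold_reason: str = ""  # hypothetical free-form reason, for illustration only


# Hypothetical reasons treated as unrecoverable in this sketch.
UNRECOVERABLE_REASONS = frozenset({"bad_credentials", "missing_executable"})


def triage_held(glideins, max_held):
    """Return (to_release, to_remove) lists of job ids.

    Nothing is done unless the number of held glideins exceeds max_held,
    mirroring the trigger condition described in the comment above.
    """
    held = [g for g in glideins if g.held]
    if len(held) <= max_held:
        return [], []
    to_release = [g.job_id for g in held if g.hold_reason not in UNRECOVERABLE_REASONS]
    to_remove = [g.job_id for g in held if g.hold_reason in UNRECOVERABLE_REASONS]
    return to_release, to_remove
```

With three glideins of which two are held, triage_held(glideins, 5) returns two empty lists, while triage_held(glideins, 1) splits the held ones into a release list and a removal list.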

#5 Updated by Marco Mambelli over 3 years ago

I'm elaborating a bit more on Marco's comment.

There is no periodic release expression.

Glideins are released explicitly in sanitizeGlideins (the same function that also removes glideins),
which calls releaseGlideins(condorq.schedd_name, limited_held_list, log=log, factoryConfig=factoryConfig)
where held_list = extractRecoverableHeldSimple(condorq, factoryConfig=factoryConfig) and
limited_held_list = extractRecoverableHeldSimpleWithinLimits(condorq, factoryConfig=factoryConfig)

This uses isGlideinWithinHeldLimits(),
which checks num_holds>factoryConfig.max_release_count, but max_release_count seems to be a hardcoded limit (10),

and isGlideinUnrecoverable(),
which applies heuristics on the type of error.

isGlideinHeldNTimes checks how many times the job was held.

Setting max_per_cycle to 0 may stop releases for the entry: <release max_per_cycle="0" sleep="0.2"/>
since factoryConfig.max_releases in releaseGlideins seems to come from max_per_cycle.
This should be tested.
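
The interaction between the two limits discussed above can be sketched as follows. This is an illustration under assumptions, not the factory implementation: select_for_release is a hypothetical helper, the input is assumed to contain only recoverable held glideins, the default of 10 is the value that appears to be hardcoded, and whether max_per_cycle="0" really disables releases still needs the test mentioned above.

```python
def select_for_release(held, max_per_cycle, max_release_count=10):
    """Pick which held glideins would be released in one cycle.

    held: list of (job_id, num_holds) tuples, assumed already recoverable.
    A glidein is skipped when num_holds > max_release_count, and at most
    max_per_cycle glideins are returned.
    """
    within_limits = [
        job_id for job_id, num_holds in held if num_holds <= max_release_count
    ]
    # A cap of 0 yields an empty list, i.e. no releases in this cycle.
    return within_limits[:max_per_cycle]
```

In this sketch a cap of 0 simply truncates the candidate list to nothing, which is the behavior the comment hopes to get from the entry configuration.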

#6 Updated by Marco Mambelli over 3 years ago

  • Status changed from Feedback to Resolved

The removal also works on jobs that were held. If they go back to idle instead of running, they will be killed right away, since the allowed time from submission has already elapsed.
Merging
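
As a rough illustration of why a previously held glidein is removed as soon as it returns to idle, assuming the removal rule compares the time since submission with an idle lifetime limit: should_remove_idle, the status strings and the idle_lifetime parameter are hypothetical names, not the merged code.

```python
def should_remove_idle(status, submit_time, now, idle_lifetime):
    """Return True if an idle glidein has outlived its allowed idle time.

    All times are in seconds. The age is measured from submission, so time
    spent held counts too: a glidein that was held for a long time and then
    goes back to idle is already past the limit.
    """
    return status == "idle" and (now - submit_time) > idle_lifetime
```

For example, a glidein submitted two hours ago that has just left the held state and is idle exceeds a one-hour limit immediately, while a running glidein of the same age is left alone.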

#7 Updated by Marco Mambelli over 3 years ago

  • Assignee changed from Marco Mambelli to Marco Mascheroni

#8 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

#9 Updated by Parag Mhashilkar over 3 years ago

  • Tracker changed from Bug to Feature

#10 Updated by Marco Mambelli 3 months ago

  • Related to Bug #24806: Missing documentation for GLIDEIN_IDLE_LIFETIME added
