Feature #21298

Add the possibility to disable completely Glidein removal

Added by Marco Mambelli about 1 year ago. Updated 12 months ago.


Steve has a 3.3.3 Frontend (version #17221 of the mechanism) working against a 3.4.1 Factory (full mechanism as from #19293).
He has the mechanism disabled (<glideins_removal margin="0" requests_tracking="False" type="NO" wait="0"/>)
but still observed glideins being removed by the Factory:

[2018-11-02 10:13:35,198] INFO: Client cmssrv266-fnal-gov_OSG_gWMSFrontend.cms_all (secid: hepcloudFEsrv266_cms) schedd status {1: 0, 2: 3}
[2018-11-02 10:13:35,202] INFO: Using v3+ protocol and credential 693452
[2018-11-02 10:13:35,202] INFO: Have enough glideins: idle=0 req_idle=0, not submitting
[2018-11-02 10:13:35,202] INFO: Too many glideins: idle=0, running=3, margin=0, max_running=0
[2018-11-02 10:13:35,203] INFO: Removing 3 running glideins
[2018-11-02 10:13:35,493] INFO: Removed 3 glideins on [(519L, 0L), (511L, 0L), (517L, 0L)]

The mechanism should also work when requests_tracking and margin are ignored.

Attachments: (94.3 KB) do not use, Marco Mambelli, 11/05/2018 05:08 AM; (94.3 KB) version 2, fixed, Marco Mambelli, 11/05/2018 05:27 AM


#1 Updated by Marco Mambelli about 1 year ago

  • Tracker changed from Bug to Feature
  • Subject changed from Glidein removal triggered also when disabled to Add the possibility to disable completely Glidein removal
  • Occurs In deleted (v3_3_3)

Glidein removal can be triggered 2 ways:
- explicitly, as soon as job requests drop to 0 or below the available job slots
- automatically, after several cycles w/o queued or running jobs (about 30 min to remove waiting and idle glideins, longer to remove running ones as well)
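For illustration, the two triggers above could be sketched as follows. This is a hedged sketch, not the actual GlideinWMS code: all function names, parameters, and cycle thresholds are hypothetical, chosen only to match the timings described in this ticket (5-minute cycles, ~30 min before idle removal, ~1 h before running removal).

```python
# Hypothetical sketch of the two removal triggers described above.
# Thresholds assume a 5-minute Frontend cycle (an assumption, not fact).
IDLE_CYCLES_BEFORE_IDLE_REMOVAL = 6      # ~30 min without jobs
IDLE_CYCLES_BEFORE_RUNNING_REMOVAL = 12  # ~1 h without jobs

def removal_action(req_idle, running, max_running, margin, cycles_without_jobs):
    """Return which removal, if any, this cycle would trigger."""
    # Explicit trigger: requests dropped to 0 and more glideins are
    # running than the allowed maximum plus the configured margin
    # (cf. the log line "idle=0, running=3, margin=0, max_running=0").
    if req_idle == 0 and running > max_running + margin:
        return "remove_excess_running"
    # Automatic trigger: no queued or running jobs for several cycles;
    # idle glideins go first, running ones only after a longer wait.
    if cycles_without_jobs >= IDLE_CYCLES_BEFORE_RUNNING_REMOVAL:
        return "remove_idle_and_running"
    if cycles_without_jobs >= IDLE_CYCLES_BEFORE_IDLE_REMOVAL:
        return "remove_idle"
    return "none"
```

With the values from Steve's log (req_idle=0, running=3, max_running=0, margin=0) the explicit branch fires, which is what the ticket initially suspected; the automatic branch is the one a long-idle Frontend actually hits.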

From Steve's email it seemed that the explicit mechanism was triggered when it should not have been, but in my tests it behaves correctly.
I tested 3.4.2 and it behaves correctly. Steve sees in 3.4.2 the same behavior he saw in 3.3.3 (I may still check whether 3.3.3 behaves differently).

Initially I thought it was a problem because the Frontend was requesting the removal, but it had been running for more than a day w/o jobs, so the removal was correct.
I added extra logs and tested through the cycle: it works correctly.
The automatic removal is the only one that kicks in, and it does so only after several cycles. That part of the code and that behavior date from 2014 or earlier.

It takes several cycles w/o idle or running jobs before it starts removing running glideins, longer than the 20 min shutdown time of the glidein.

If it happens again or still happens, the history file in the group directory (e.g. /var/lib/gwms-frontend/vofrontend/group_main/) will contain the number of cycles since jobs were last seen.

How long had it been since the last job was in the queue or finished running?
After 30 min or so the Frontend will ask to remove idle glideins; after one hour, I think, also the running ones.

We should consider a mechanism to completely disable the automatic removal as well (i.e. delay it forever, to ease debugging), but at the moment there is none.

#2 Updated by Marco Mambelli about 1 year ago

  • Status changed from New to Feedback
  • Assignee set to Dennis Box
  • Priority changed from Normal to High

To ease HEPCloud troubleshooting it would be good to have this feature.
This case also highlighted that the 'NO' option could be seen as misleading, since the automatic removal was still in place.

Added DISABLE option to disable also automatic removal.
Changes are in v35/21298
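For illustration, the Frontend configuration with the new option might look like the fragment below. This is a sketch based on the description in this ticket: the DISABLE value is the one named above, the other attributes are copied from Steve's original configuration, and the exact accepted values should be checked against the branch.

```xml
<!-- type="NO" disabled only the explicit removal; the new type="DISABLE"
     is meant to also suppress the automatic removal after idle cycles -->
<glideins_removal margin="0" requests_tracking="False" type="DISABLE" wait="0"/>
```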

Changes are all in frontend/
A running installation can be patched by replacing the file with the attached one in both:
/usr/lib/python2.7/site-packages/glideinwms/frontend/ (SL7)
/usr/lib/python2.6/site-packages/glideinwms/frontend/ (SL6)

#4 Updated by Marco Mambelli about 1 year ago

Added a fixed file (the previous one had a bug).

#5 Updated by Dennis Box about 1 year ago

  • Assignee changed from Dennis Box to Marco Mambelli

feedback sent

#6 Updated by Marco Mambelli about 1 year ago

  • Status changed from Feedback to Resolved

#7 Updated by Marco Mambelli about 1 year ago

  • Target version changed from v3_5 to v3_4_2

#8 Updated by Marco Mambelli about 1 year ago

  • Target version changed from v3_4_2 to v3_4_3

#9 Updated by Marco Mambelli 12 months ago

  • Status changed from Resolved to Closed
