Feature #21298

Add the possibility to disable completely Glidein removal

Added by Marco Mambelli over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: High
Category: -
Target version:
Start date: 11/02/2018
Due date:
% Done: 0%
Estimated time:
Stakeholders: HEPCloud
Duration:

Description

Steve has a 3.3.3 Frontend (version #17221 of the mechanism) working against a 3.4.1 Factory (full mechanism as of #19293).
He has the mechanism disabled (<glideins_removal margin="0" requests_tracking="False" type="NO" wait="0"/>)
and still observed glidein removal in the Factory:

[2018-11-02 10:13:35,198] INFO: Client cmssrv266-fnal-gov_OSG_gWMSFrontend.cms_all (secid: hepcloudFEsrv266_cms) schedd status {1: 0, 2: 3}
[2018-11-02 10:13:35,202] INFO: Using v3+ protocol and credential 693452
[2018-11-02 10:13:35,202] INFO: Have enough glideins: idle=0 req_idle=0, not submitting
[2018-11-02 10:13:35,202] INFO: Too many glideins: idle=0, running=3, margin=0, max_running=0
[2018-11-02 10:13:35,203] INFO: Removing 3 running glideins
[2018-11-02 10:13:35,493] INFO: Removed 3 glideins on schedd_glideins4@cmssrv280.fnal.gov: [(519L, 0L), (511L, 0L), (517L, 0L)]

The mechanism should also work if requests_tracking and margin are ignored.
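
To check quickly whether and when removals happened, the Factory log excerpt above can be scanned for the "Removing N ... glideins" lines. The following is a minimal sketch, not GlideinWMS code: the names REMOVING_RE, find_removals and SAMPLE are hypothetical, and the pattern assumes the log format shown in this ticket.

```python
import re

# Matches the announcement line seen in this ticket, e.g.
# "[2018-11-02 10:13:35,203] INFO: Removing 3 running glideins"
# (assumption: other removal messages follow the same wording)
REMOVING_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] INFO: Removing (?P<count>\d+) (?P<state>\w+) glideins"
)


def find_removals(lines):
    """Return a (timestamp, count, state) tuple for every removal announcement."""
    found = []
    for line in lines:
        match = REMOVING_RE.match(line.strip())
        if match:
            found.append(
                (match.group("timestamp"), int(match.group("count")), match.group("state"))
            )
    return found


# Two lines copied from the excerpt above
SAMPLE = [
    "[2018-11-02 10:13:35,202] INFO: Too many glideins: idle=0, running=3, margin=0, max_running=0",
    "[2018-11-02 10:13:35,203] INFO: Removing 3 running glideins",
]

print(find_removals(SAMPLE))
```

In practice the lines would come from the Factory log file instead of SAMPLE.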

glideinFrontendElement.py (94.3 KB) glideinFrontendElement.py do not use Marco Mambelli, 11/05/2018 05:08 AM
glideinFrontendElement.py (94.3 KB) glideinFrontendElement.py version 2, fixed Marco Mambelli, 11/05/2018 05:27 AM

History

#1 Updated by Marco Mambelli over 1 year ago

  • Tracker changed from Bug to Feature
  • Subject changed from Glidein removal triggered also when disabled to Add the possibility to disable completely Glidein removal
  • Occurs In deleted (v3_3_3)

Glidein removal can be triggered in two ways:
- explicitly, as soon as job requests drop to 0 or below the number of available job slots
- automatically, after several cycles without queued or running jobs (about 30 min to remove waiting and idle glideins, longer to also remove running ones)

From Steve's email it seemed that the explicit mechanism was triggered when it should not have been, but in my tests it behaves correctly.
I tested 3.4.2 and it behaves correctly. Steve sees in 3.4.2 the same behavior he saw in 3.3.3 (I may still check whether 3.3.3 behaves differently).

Initially I thought it was a problem because the Frontend was requesting the removal, but it had been running for more than a day without jobs, so the request was expected.
I added extra logging, tested through the cycle, and it works correctly.
The automatic removal is the only one that kicks in, and it does so after several cycles. That part of the code and that behavior date from 2014 or earlier.

It takes several cycles without idle or running jobs before it starts removing running glideins, longer than the 20 min shutdown time of the glidein.

If it happens again, or is still happening, the history file in the group directory contains information about the number of cycles since jobs were last seen:
e.g. /var/lib/gwms-frontend/vofrontend/group_main/history.pk
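
Since the history file is a Python pickle, it can be inspected from an interactive session. This is a minimal sketch: load_history and describe are hypothetical helper names, and because the internal layout of the history object is not described in this ticket, the code only lists the top-level keys and their types.

```python
import pickle


def load_history(path):
    """Load the pickled history object. Only use this on files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(history):
    """Return sorted (key, type name) pairs for a dict, or one entry for anything else."""
    if isinstance(history, dict):
        return sorted((str(key), type(value).__name__) for key, value in history.items())
    return [("<root>", type(history).__name__)]
```

For example, describe(load_history("/var/lib/gwms-frontend/vofrontend/group_main/history.pk")) lists what the Frontend group has recorded. Running it with the same Python version the Frontend uses is the safest choice, since pickles written by one major version do not always load cleanly in another.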

How long had it been since the last job was in the queue or finished running?
After 30 min or so the Frontend asks to remove idle glideins; after one hour, I think, also the running ones.

We should consider a mechanism to also completely disable the automatic removal (delaying it forever, to ease debugging), but at the moment there is none.
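
The timing described above, together with the proposed switch, can be sketched as follows. This is an illustration only and not the code in glideinFrontendElement.py: the function name, the return labels and the thresholds are assumptions, with the 30 and 60 minute values taken from the rough figures quoted in this note (the real code counts cycles, not minutes).

```python
# Assumed thresholds, based on the approximate figures in this ticket
IDLE_REMOVAL_AFTER_MIN = 30      # "after 30 min or so" idle glideins are removed
RUNNING_REMOVAL_AFTER_MIN = 60   # "after one hour I think" also the running ones


def automatic_removal(minutes_without_jobs, disabled=False):
    """Return which glideins the automatic mechanism would ask to remove.

    The labels are illustrative, not the values used by the Frontend.
    """
    if disabled:
        # The proposed switch: never remove automatically
        return "NONE"
    if minutes_without_jobs >= RUNNING_REMOVAL_AFTER_MIN:
        return "IDLE_AND_RUNNING"
    if minutes_without_jobs >= IDLE_REMOVAL_AFTER_MIN:
        return "IDLE"
    return "NONE"


# A Frontend that has been without jobs for more than a day
print(automatic_removal(24 * 60))
```

Under these assumptions, a Frontend that has seen no jobs for more than a day asks to remove running glideins even when the explicit mechanism is off, which is consistent with the removal observed here.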

#2 Updated by Marco Mambelli over 1 year ago

  • Status changed from New to Feedback
  • Assignee set to Dennis Box
  • Priority changed from Normal to High

This feature would ease HEPCloud troubleshooting.
This case also highlighted that the 'NO' option could be seen as misleading, since the automatic removal was still in place.

Added a DISABLE option that disables the automatic removal as well.
Changes are in v35/21298

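All changes are in frontend/glideinFrontendElement.py
An installation can be patched by replacing both copies with the attached file, the one in /usr/sbin and the one in site-packages (the path depends on the OS):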
/usr/sbin/glideinFrontendElement.py
/usr/lib/python2.7/site-packages/glideinwms/frontend/glideinFrontendElement.py (SL7)
/usr/lib/python2.6/site-packages/glideinwms/frontend/glideinFrontendElement.py (SL6)
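
The replacement can be scripted so that the original files are kept. A minimal sketch, assuming it runs as a user allowed to write to those paths: apply_patch, TARGETS and the ".orig" backup suffix are hypothetical choices, not part of GlideinWMS.

```python
import os
import shutil

# Paths listed in this note; only the ones present on the host are touched
TARGETS = [
    "/usr/sbin/glideinFrontendElement.py",
    "/usr/lib/python2.7/site-packages/glideinwms/frontend/glideinFrontendElement.py",  # SL7
    "/usr/lib/python2.6/site-packages/glideinwms/frontend/glideinFrontendElement.py",  # SL6
]


def apply_patch(patched_file, targets=TARGETS, backup_suffix=".orig"):
    """Back up and replace every existing target; return the list of replaced paths."""
    replaced = []
    for target in targets:
        if not os.path.isfile(target):
            continue  # e.g. the SL6 path on an SL7 host
        backup = target + backup_suffix
        if not os.path.exists(backup):
            # Keep only the first original, so re-running does not overwrite it
            shutil.copy2(target, backup)
        # copyfile replaces the contents and leaves the target's permissions unchanged
        shutil.copyfile(patched_file, target)
        replaced.append(target)
    return replaced
```

Calling apply_patch("glideinFrontendElement.py") with the downloaded attachment returns the paths that were replaced; copying each ".orig" file back restores the original version.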

#4 Updated by Marco Mambelli over 1 year ago

Added the fixed file (the previous one had a bug).

#5 Updated by Dennis Box over 1 year ago

  • Assignee changed from Dennis Box to Marco Mambelli

feedback sent

#6 Updated by Marco Mambelli over 1 year ago

  • Status changed from Feedback to Resolved

#7 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_5 to v3_4_2

#8 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_4_2 to v3_4_3

#9 Updated by Marco Mambelli over 1 year ago

  • Status changed from Resolved to Closed

