Feature #3217

Better stale glidein cleanup or factory

Added by Parag Mhashilkar almost 7 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 01/03/2013
Due date:
% Done: 0%
Estimated time:
Stakeholders: Factory Operations
Duration:
Description

Hello GlideinWMS team,

We regularly see very old glideins in our queues and have to clean them up manually, but I think the factory could do more of this for us. Here are a few areas where I think the factory can improve:

1. Held glideins that condor refuses to remove should eventually be removed with -forcex
We've observed that for some globus hold reasons, condor will not remove the glidein no matter what unless -forcex is given. A few in particular:
31 the job manager failed to cancel the job as requested
121 the job state file doesn't exist
79 connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...
  • Actually, 79 almost always goes from X back to H with globus error 121 the first time you condor_rm without -forcex, so in this case it's indirect.
    I don't think the factory has to be aware of all these cases, but it could keep track of the held glideins it has already tried to remove and, if they are still around after some amount of time, do a -forcex on them.
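
As an illustration of the "remember what was already tried" idea, here is a minimal Python sketch. It is not GlideinWMS code: the `RemovalTracker` class, its method names, and the 6-hour grace period are all hypothetical. The only real interface assumed is the `condor_rm -forcex` option mentioned above, and the sketch only builds the command rather than running it.

```python
import time

# Hypothetical grace period; not an existing GlideinWMS setting.
FORCEX_GRACE_SECONDS = 6 * 3600


class RemovalTracker:
    """Remembers when a plain condor_rm was first tried on each glidein."""

    def __init__(self, grace_seconds=FORCEX_GRACE_SECONDS):
        self.grace_seconds = grace_seconds
        self.first_attempt = {}  # job id ("cluster.proc") -> epoch seconds

    def record_attempt(self, job_id, now=None):
        now = time.time() if now is None else now
        # Keep the earliest attempt so retries do not reset the clock.
        self.first_attempt.setdefault(job_id, now)

    def needs_forcex(self, still_queued, now=None):
        """Return ids that were removed once, are still queued, and are past the grace period."""
        now = time.time() if now is None else now
        # Forget glideins that have since left the queue.
        self.first_attempt = {
            j: t for j, t in self.first_attempt.items() if j in still_queued
        }
        return sorted(
            j for j, t in self.first_attempt.items()
            if now - t >= self.grace_seconds
        )


def forcex_command(job_id):
    """Build (but do not run) the escalated removal command."""
    return ["condor_rm", "-forcex", job_id]
```

With this shape the factory would not need to know the individual globus error codes: any glidein that survives a normal removal for longer than the grace period is escalated.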

2. Glideins that stay in X for too long should be removed with -forcex
It seems glideins can stay in X indefinitely. I don't have any data on what conditions cause this, but it is common.
- Again, the details of why this happens may not be important, but if the glideins are still stubbornly not removed some time after a first removal attempt without -forcex, the factory should try a -forcex.
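
One possible way to express "in X for too long" is a ClassAd constraint passed to `condor_rm`. The Python sketch below only builds the command line. It assumes that `JobStatus == 3` corresponds to the X (removed) state and that the ClassAd `time()` function is available in the installed HTCondor version; the function names and the threshold are hypothetical.

```python
REMOVED = 3  # HTCondor JobStatus code for the X (removed) state


def stuck_in_x_constraint(max_seconds_in_x):
    """Constraint matching glideins that have sat in X longer than the threshold."""
    if max_seconds_in_x <= 0:
        raise ValueError("threshold must be positive")
    return "JobStatus == %d && (time() - EnteredCurrentStatus) > %d" % (
        REMOVED,
        max_seconds_in_x,
    )


def forcex_stuck_command(max_seconds_in_x):
    """Build (but do not run) a condor_rm -forcex call for the stuck glideins."""
    return [
        "condor_rm",
        "-forcex",
        "-constraint",
        stuck_in_x_constraint(max_seconds_in_x),
    ]
```

Because a glidein enters X only after a removal request, time spent in X doubles as "time since the first removal attempt", so no extra bookkeeping is needed for this case.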

3. "Rundiff" glideins that seem to be "running" indefinitely
- "Rundiff" is the term we factory operators use for glideins that appear to be running from the factory side, but that no user collector has any knowledge of. In many cases condor-g just loses track: the glideins have long since terminated at the CE, but condor-g didn't get the message. When this happens our queues have glideins that seem to be running for 10+ days and continue to "run" until we clear them out.
- Rundiff can be seen on the factoryStatusNow.html page when clicking "Troubleshoot"; it's the 'Diff(Status: Running, Client: Registered)' column:
http://glidein.grid.iu.edu/osg_gfactory/factoryStatusNow.html
- I propose that the factory periodically check how long the glideins have been running (this can be calculated from EnteredCurrentStatus) and, if the running time exceeds some reasonable limit, just remove them. "Some reasonable limit" could be, for example, 2 x GLIDEIN_Max_Walltime for that entry.
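
The proposed check can be sketched in Python as follows. This is only an illustration under stated assumptions: glidein ads are represented as plain dicts, the `id` and `entry` keys and the function name are hypothetical, and `JobStatus == 2` is assumed to mean running. `EnteredCurrentStatus` and the 2 x GLIDEIN_Max_Walltime limit come from the proposal above.

```python
RUNNING = 2  # HTCondor JobStatus code for a running job


def overdue_running_glideins(ads, max_walltime_by_entry, now, factor=2):
    """Return ids of glideins running longer than factor x the entry's max walltime.

    ads: iterable of dicts with "id", "entry", "JobStatus", "EnteredCurrentStatus"
    max_walltime_by_entry: entry name -> GLIDEIN_Max_Walltime in seconds
    now: current time in epoch seconds
    """
    overdue = []
    for ad in ads:
        if ad.get("JobStatus") != RUNNING:
            continue
        limit = max_walltime_by_entry.get(ad.get("entry"))
        if limit is None:
            # No walltime known for this entry; leave the glidein alone.
            continue
        if now - ad["EnteredCurrentStatus"] > factor * limit:
            overdue.append(ad["id"])
    return overdue


# Example with made-up numbers: a 2-day walltime gives a 4-day limit.
ads = [
    {"id": "101.0", "entry": "entry_a", "JobStatus": 2, "EnteredCurrentStatus": 0},
    {"id": "102.0", "entry": "entry_a", "JobStatus": 2, "EnteredCurrentStatus": 800000},
]
stale = overdue_running_glideins(ads, {"entry_a": 172800}, now=864000)
```

In this example only the first glidein, which has "run" for 10 days, exceeds the limit; the second entered the running state recently and is left alone.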

I also have one final note. The auto removal policies that are already in place (especially for known unrecoverable globus held jobs) work quite well; however, it is frustrating that they only appear to work as long as Frontends are currently requesting glideins at that particular entry. As soon as the Frontend stops requesting, all glideins are forgotten. Besides no longer showing up in our monitoring, they no longer get auto-cleaned, and stale glideins just sit in our queues. I am aware this is a long-known issue, but can it please be addressed?

Please take these recommendations into serious consideration because, in my opinion, our production factories have reached a scale (number of VOs / entries served) where manual cleanup is just far too time-consuming in our daily operations routine.

Thanks,
Jeff Dost
OSG Glidein Factory Operations


Related issues

Related to GlideinWMS - Bug #2448: Held glideins not being removed on gt5 sites (Closed, 02/03/2012)

Related to GlideinWMS - Feature #5309: Need more prompt fake running glidein detection (New, 01/31/2014)

History

#1 Updated by Burt Holzman over 6 years ago

  • Target version changed from v3_1 to v3_x

#2 Updated by Parag Mhashilkar almost 5 years ago

  • Stakeholders updated

