Better stale glidein cleanup for factory
Hello GlideinWMS team,
We regularly see very old glideins in our queues and have to clean them up manually, but I think the factory could do more of this for us. Here are a few places where I think the factory can improve:
1. Glideins that go held and that condor refuses to remove should eventually be removed with -forcex. Examples of the globus errors involved:
31 the job manager failed to cancel the job as requested
121 the job state file doesn't exist
79 connecting to the job manager failed. Possible reasons: job terminated, invalid job contact, network problems, ...
- Actually, 79 almost always goes from X back to H with globus error 121 the first time you condor_rm without -forcex, so in this case it's indirect.
I don't think the factory has to be aware of all these cases, but maybe it should keep track of the held glideins it has already tried removing and, if they are still around after some amount of time, do a -forcex on them.
2. Glideins that stay in X for too long should be removed with -forcex
- It seems glideins can stay in X indefinitely. I don't have any data on what conditions cause this, but it is common.
- Again, maybe the details of why this happens aren't important, but if the glideins are still stubbornly not removed some time after a first removal attempt without -forcex, the factory should try a -forcex.
3. "Rundiff" glideins that seem to be "running" indefinitely
- "Rundiff" is the term we factory operators use for glideins that appear to be running from the factory side, but that no user collector has any knowledge of. In many cases condor-g just loses track: the glideins terminated at the CE long ago, but condor-g didn't get the message. When this happens, our queues have glideins that seem to be running for 10+ days and continue to "run" until we clear them out.
- Rundiff can be seen on the factoryStatusNow.html page by clicking "Troubleshoot"; it's the 'Diff(Status: Running, Client: Registered)' column.
- I propose that the factory periodically check how long the glideins have been running (this can be calculated from EnteredCurrentStatus) and, if the running time exceeds some reasonable limit, just remove them. A reasonable limit could be, for example, 2 x GLIDEIN_Max_Walltime for that entry.
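To make the policy in items 1-3 a bit more concrete, here is a rough Python sketch of the decision logic. It is not factory code and not a claim about how the factory is structured. The Glidein record, the first_rm_time bookkeeping field, the cleanup_action and to_command helpers, and the one-day forcex_grace default are all hypothetical names and values made up for illustration. It also assumes GLIDEIN_Max_Walltime is expressed in seconds. Only the HTCondor JobStatus codes (2 = running, 3 = removed/X, 5 = held), EnteredCurrentStatus, and condor_rm -forcex are taken as given.

```python
from dataclasses import dataclass
from typing import List, Optional

# HTCondor JobStatus codes relevant here
RUNNING, REMOVED, HELD = 2, 3, 5


@dataclass
class Glidein:
    job_id: str                  # "cluster.proc"
    job_status: int              # JobStatus from the job ClassAd
    entered_current_status: int  # EnteredCurrentStatus (epoch seconds)
    max_walltime: int            # GLIDEIN_Max_Walltime of the entry (assumed seconds)
    # Hypothetical factory bookkeeping: when a plain condor_rm was first tried.
    first_rm_time: Optional[int] = None


def cleanup_action(g: Glidein, now: int,
                   forcex_grace: int = 86400,
                   walltime_factor: int = 2) -> Optional[str]:
    """Return "rm", "forcex", or None for a single glidein."""
    if g.job_status == RUNNING:
        # Item 3: "running" far longer than the entry allows.
        if now - g.entered_current_status > walltime_factor * g.max_walltime:
            return "rm"
        return None
    if g.job_status in (REMOVED, HELD) and g.first_rm_time is not None:
        # Items 1 and 2: a removal was already tried and the glidein is still here.
        if now - g.first_rm_time > forcex_grace:
            return "forcex"
    return None


def to_command(g: Glidein, action: str) -> List[str]:
    """Translate an action into the condor_rm command line an operator would run."""
    if action == "forcex":
        return ["condor_rm", "-forcex", g.job_id]
    return ["condor_rm", g.job_id]
```

The key point is that the factory would not need to understand why a glidein is stuck. It would only need to remember when it first tried a plain removal and escalate once the grace period has passed.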
I also have one final note. The auto removal policies that are already in place (especially for known unrecoverable globus held jobs) work quite well. However, it is frustrating that they only appear to work as long as Frontends are currently requesting glideins at that particular entry. As soon as the Frontend stops requesting, all glideins are forgotten. Besides no longer showing up in our monitoring, they no longer get auto-cleaned, so stale glideins just sit on our queues. I am aware this is a long-known issue, but can this please be addressed?
Please take these recommendations into serious consideration. In my opinion, our production factories have reached a scale (number of VOs / entries served) where manual cleanup is far too time-consuming for our daily operations routine.
OSG Glidein Factory Operations