Factory Operations suggestions summary
This ticket is based on the Factory Operation Development Suggestions document presented on 4/13/2018 by Marco Mascheroni.
Below are a few items that the glideinWMS factory operators identified as potential areas of improvement:
- Setting glideinCPUs="auto" messes up requests from the FE (1 core is assumed). It also messes up the monitoring. Proposal: introduce GLIDEIN_Estimated_CPUS [should be there]
- Use PeriodicRemove to remove glideins whose runtime exceeds maxwalltime + delta
- GWMS should stop looking at hold reasons to decide what to release. Just release a few times and remove the glidein if necessary, using condor_rm -forcex if a normal remove does not work. Maybe even stop submitting when glideins go held? Marco Mambelli said there is a chance that held glideins reconnect to user jobs when released (?)
- Jeff comments that he has never seen a grid job recover on condor_release; it always requeues a fresh glidein
- Periodic removal of held glideins (maybe as an alternative to a factory initiated removal)
- Condor devs are following up
- Monitoring: for the new Kibana monitoring we should write everything in JSON (in parallel to XML) and expose the JSON files so that they can be fed to InfluxDB/Grafana/Kibana/whatever. Then we can remove the XML/RRD-based stuff
- Speed up stop/reconf/start (maybe we wait for RH7 or for the removal of the RRD monitoring)
- Remove really old files from reconfig.
- Clean up configuration files from disabled entries
- Provide a command that takes an entry name and removes it from the configuration. Optionally provide a --deleteAllDisabled flag
- Do not restart condor on factory upgrade
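
The "release a few times, then remove" proposal for held glideins could be sketched roughly as follows. This is only an illustration of the decision logic, not factory code: the `HeldGlidein` record, the `next_action` and `condor_command` helpers, and the limit of 3 releases are all hypothetical. Only the `condor_release` and `condor_rm -forcex` commands are real HTCondor tools.

```python
from dataclasses import dataclass

# Hypothetical limit: how many times to release before giving up.
MAX_RELEASES = 3


@dataclass
class HeldGlidein:
    """Hypothetical bookkeeping record for one held glidein."""
    job_id: str
    release_count: int = 0
    remove_attempted: bool = False


def next_action(glidein, max_releases=MAX_RELEASES):
    """Pick the next step without inspecting the hold reason."""
    if glidein.release_count < max_releases:
        return "release"
    if not glidein.remove_attempted:
        return "remove"
    # A normal remove was already tried and the job is still around.
    return "force_remove"


def condor_command(action, job_id):
    """Map an action to the HTCondor command line that would carry it out."""
    commands = {
        "release": ["condor_release", job_id],
        "remove": ["condor_rm", job_id],
        "force_remove": ["condor_rm", "-forcex", job_id],
    }
    return commands[action]
```

For example, `condor_command(next_action(HeldGlidein("12.0", release_count=3)), "12.0")` yields `["condor_rm", "12.0"]`. Keeping the policy in a pure function makes the release limit easy to tune and to test without a running schedd.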
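
The entry clean-up command suggested above might look something like the sketch below. It assumes a simplified configuration layout in which each entry is an `<entry name="..." enabled="True|False">` element under an `<entries>` element; the `remove_entries` function and its `delete_all_disabled` parameter (mirroring the proposed --deleteAllDisabled flag) are hypothetical names, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET


def remove_entries(root, names=(), delete_all_disabled=False):
    """Remove entries by name and, optionally, every disabled entry.

    Assumes <entries> elements containing <entry name=... enabled=...>
    children. Returns the names of the removed entries in document order.
    """
    removed = []
    # Materialize the parents first so the tree is not modified mid-iteration.
    for parent in list(root.iter("entries")):
        for entry in list(parent.findall("entry")):
            name = entry.get("name")
            disabled = entry.get("enabled", "True").lower() == "false"
            if name in names or (delete_all_disabled and disabled):
                parent.remove(entry)
                removed.append(name)
    return removed


# Made-up sample configuration, for illustration only.
SAMPLE = """<glidein>
  <entries>
    <entry name="site_a" enabled="True"/>
    <entry name="site_b" enabled="False"/>
    <entry name="site_c" enabled="False"/>
  </entries>
</glidein>"""

root = ET.fromstring(SAMPLE)
removed = remove_entries(root, delete_all_disabled=True)
remaining = [e.get("name") for e in root.iter("entry")]
```

A real command would also have to write the file back and keep a backup of the previous configuration, which this sketch leaves out.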