Feature #19946

Factory Operations suggestions summary

Added by Marco Mambelli over 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: Factory
Target version:
Start date: 05/14/2018
Due date:
% Done: 0%
Estimated time:
Stakeholders: Factory Ops
Duration:

Description

This ticket is based on the Factory Operation Development Suggestions document presented on 4/13/2018 by Marco Mascheroni:
https://docs.google.com/document/d/1ANP80spS9so58OGPPt3JmlZKiw7fRFH4TSJLT1RvOAs/edit?usp=sharing

Below are a few items that the glideinWMS factory operators identified as potential areas of improvement.

  • Setting glideinCPUs="auto" messes up requests from the FE (it assumes 1 core). It also messes up the monitoring. Proposal: introduce GLIDEIN_Estimated_CPUS [should be there]
  • Use PeriodicRemove to remove glideins when runtime > maxwalltime + delta
  • GWMS should stop looking at hold reasons to decide what to release. Just release a few times and remove the glidein if necessary, using condor_rm -forcex if the normal remove does not work. Maybe even stop submitting when glideins go held? Marco Mambelli said there is a chance that held glideins reconnect to user jobs when released (?)
    • Jeff comments that he has never seen a grid job recover on condor_release; it always requeues a fresh glidein
    • Periodic removal of held glideins (maybe as an alternative to a factory initiated removal)
    • Condor devs are following up
  • Monitoring. For the new Kibana monitoring we should write everything in JSON (in parallel to XML) and expose the JSON files so that they can be fed to InfluxDB/Grafana/Kibana/whatever. Then we can remove the XML/RRD-based stuff
  • Speed up stop/reconf/start (maybe we wait for RH7 or for the removal of the RRD monitoring)
    • Remove really old files from reconfig.
  • Clean up configuration files from disabled entries
    • Provide a command that takes an entry name and removes it from the configuration. Optionally provide a --deleteAllDisabled flag
  • Do not restart condor on factory upgrade
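
The held-glidein and walltime items above can be read as a single decision rule. The following is a minimal sketch of that rule, not the GWMS implementation: the `Glidein` record, the `next_action` function, and the `MAX_RELEASES` and `WALLTIME_DELTA` values are all hypothetical, since the ticket does not specify field names or thresholds. Note that the hold reason is deliberately never consulted.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the ticket only says "a few times" and "delta".
MAX_RELEASES = 3        # releases to try before giving up on a held glidein
WALLTIME_DELTA = 3600   # slack in seconds added to the entry's max walltime


@dataclass
class Glidein:
    """Illustrative stand-in for a glidein job record (assumed fields)."""
    held: bool = False
    release_count: int = 0          # times the glidein was already released
    remove_attempted: bool = False  # a normal remove was already tried
    runtime: int = 0                # seconds the glidein has been running
    max_walltime: int = 0           # seconds allowed by the entry


def next_action(g: Glidein) -> str:
    """Return 'none', 'release', 'remove' or 'forcex' for one glidein."""
    over_walltime = g.runtime > g.max_walltime + WALLTIME_DELTA
    out_of_releases = g.held and g.release_count >= MAX_RELEASES
    if over_walltime or out_of_releases:
        # Escalate to a forced removal only after a normal remove failed.
        return "forcex" if g.remove_attempted else "remove"
    if g.held:
        # The hold reason is ignored on purpose: just try a release.
        return "release"
    return "none"
```

In this sketch "remove" would map to a plain condor_rm and "forcex" to condor_rm -forcex; whether the walltime check belongs in the factory or in a PeriodicRemove expression on the job is left open, as in the ticket.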

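The cleanup command proposed above for disabled entries could look roughly like the sketch below. It assumes the factory configuration is an XML file with `<entry name="..." enabled="True|False">` elements nested under an `<entries>` element; that layout, the function names, and the positional arguments are assumptions made for illustration. Only the --deleteAllDisabled flag name comes from the ticket.

```python
import argparse
import xml.etree.ElementTree as ET


def remove_entries(root, name=None, delete_all_disabled=False):
    """Remove matching <entry> elements in place; return the removed names."""
    removed = []
    # Snapshot the parents first so the tree is not modified while iterating.
    for parent in list(root.iter("entries")):
        for entry in list(parent.findall("entry")):
            by_name = name is not None and entry.get("name") == name
            disabled = entry.get("enabled", "True").lower() == "false"
            if by_name or (delete_all_disabled and disabled):
                parent.remove(entry)
                removed.append(entry.get("name"))
    return removed


def main(argv=None):
    """Hypothetical command line: CONFIG [ENTRY_NAME] [--deleteAllDisabled]."""
    parser = argparse.ArgumentParser(
        description="Remove entries from a factory configuration file")
    parser.add_argument("config", help="path to the XML configuration")
    parser.add_argument("entry_name", nargs="?", help="entry to remove")
    parser.add_argument("--deleteAllDisabled", action="store_true",
                        help="also remove every entry with enabled=False")
    args = parser.parse_args(argv)
    tree = ET.parse(args.config)
    removed = remove_entries(tree.getroot(), args.entry_name,
                             args.deleteAllDisabled)
    # Caveat: ElementTree drops XML comments when the file is rewritten.
    tree.write(args.config)
    return removed
```

Calling `main(["factory.xml", "--deleteAllDisabled"])` (the file name is a placeholder) would rewrite the file without its disabled entries. A real tool would probably back up the file first and preserve comments, which this sketch does not do.
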
Related issues

Related to glideinWMS - Support #18869: Review Factory and Frontend tools, especially glidien_off and manual_glidein_submit.py (Closed, 2018-02-01)

Related to glideinWMS - Feature #20201: Add the possibility to skip idle removal per entry (Closed, 2018-06-20)

Related to glideinWMS - Feature #19877: Add a scaling factor for all glideins limits in the entries (Closed, 2018-05-03)

Related to glideinWMS - Feature #19160: Add entry monitoring breakdown for metasites (Closed, 2018-02-27)

Blocked by glideinWMS - Feature #16161: Estimate the cores provided to glideins running on an entry (Closed, 2017-04-11)

Blocked by glideinWMS - Feature #19948: Clean up configuration files from disabled entries (New, 2018-05-14)

Blocked by glideinWMS - Feature #19947: Do not restart condor on factory upgrade (New, 2018-05-14)

Blocked by glideinWMS - Feature #19949: Remove really old files from reconfig. (Assigned, 2018-05-14)

Blocked by glideinWMS - Feature #19950: Speed up stop/reconf/start (New, 2018-05-14)

Blocked by glideinWMS - Support #20301: Automatically remove glideins after walltime (Closed, 2018-07-10)

Blocked by glideinWMS - Support #20295: restore the old color scheme in factoryStatus.html (Closed, 2018-07-09)

History

#1 Updated by Marco Mambelli over 1 year ago

  • Blocked by Feature #16161: Estimate the cores provided to glideins running on an entry added

#2 Updated by Marco Mambelli over 1 year ago

  • Blocked by Feature #19948: Clean up configuration files from disabled entries added

#3 Updated by Marco Mambelli over 1 year ago

  • Blocked by Feature #19947: Do not restart condor on factory upgrade added

#4 Updated by Marco Mambelli over 1 year ago

  • Blocked by Feature #19949: Remove really old files from reconfig. added

#5 Updated by Marco Mambelli over 1 year ago

#6 Updated by Marco Mambelli about 1 year ago

  • Target version changed from v3_4_x to v_collections

#7 Updated by Marco Mascheroni about 1 year ago

  • Related to Support #18869: Review Factory and Frontend tools, especially glidien_off and manual_glidein_submit.py added

#8 Updated by Marco Mascheroni about 1 year ago

  • Related to Feature #20201: Add the possibility to skip idle removal per entry added

#9 Updated by Marco Mascheroni about 1 year ago

  • Related to Feature #19877: Add a scaling factor for all glideins limits in the entries added

#10 Updated by Marco Mascheroni about 1 year ago

  • Related to Feature #19160: Add entry monitoring breakdown for metasites added

#11 Updated by Marco Mascheroni about 1 year ago

  • Blocked by Support #20301: Automatically remove glideins after walltime added

#12 Updated by Marco Mascheroni about 1 year ago

  • Blocked by Support #20295: restore the old color scheme in factoryStatus.html added
