Add held job count to factoryStatusNow page
Request from Jeff Dost:
A few weeks back it came to our attention that our factories were not filling up queues fast enough on some of the really large sites (specifically for ATLAS). This was easily solved by tweaking our submit limits in the factory (cluster_size, max_per_cycle, sleep).
However, we didn't notice until the site complained. The question is how we can quickly detect this in our daily routine. One way to spot it is on our factoryStatusNow page:
If we click troubleshoot and click the Idle Diff column so it sorts low to high, the sites we aren't submitting fast enough to will likely have a large negative "Idle Diff". However, this isn't the only reason for a large negative "Idle Diff". It can also happen if a site is experiencing problems and a large number of glideins are going Held.
I propose a simple addition to quickly differentiate between "negative idle diff because we aren't submitting fast enough" and "negative idle diff due to held jobs".
Simply add the "Held" column to the factoryStatusNow troubleshoot view. That way, when we sort by Idle Diff to bring the large negative values to the top, we can focus on the entries with little or no "Held" glideins as the likely candidates for sites we aren't submitting fast enough to.
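The triage this column would enable can be sketched roughly as follows. This is only an illustration of the reasoning, not factoryStatusNow code: the `EntryStatus` record, the `triage` function, and both thresholds are hypothetical, and the Idle Diff and Held values are taken as whatever the page already displays.

```python
from dataclasses import dataclass


@dataclass
class EntryStatus:
    """One row of the troubleshoot view (hypothetical structure)."""
    name: str
    idle_diff: int  # value shown in the "Idle Diff" column
    held: int       # value shown in the proposed "Held" column


def triage(entries, idle_diff_threshold=-100, held_threshold=10):
    """Split entries with a large negative Idle Diff by their Held count.

    Both thresholds are made-up defaults for illustration only.
    Returns (slow_submit, site_problem), each sorted with the most
    negative Idle Diff first.
    """
    flagged = sorted(
        (e for e in entries if e.idle_diff <= idle_diff_threshold),
        key=lambda e: e.idle_diff,
    )
    # Few or no held glideins: likely we aren't submitting fast enough.
    slow_submit = [e for e in flagged if e.held <= held_threshold]
    # Many held glideins: more likely a problem at the site.
    site_problem = [e for e in flagged if e.held > held_threshold]
    return slow_submit, site_problem
```

Entries in the first list are the ones whose submit limits (cluster_size, max_per_cycle, sleep) would be worth reviewing; entries in the second list point to held glideins as the more likely cause.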
OSG Glidein Factory Operations