Factory monitoring issues caused by changes in the semantics (glideins vs. slots) on the frontend side
OSG, CMS, Factory Ops
Since the recent changes in the semantics of glideins vs. slots in the frontend, factory monitoring has become ambiguous and its plots have a scaling issue. Marco thinks it is a matter of selecting appropriate options to plot.
Things to do
- Verify that labels/options to check make sense
- Check/update the default plot options so that they make sense in the current situation
- Check if we need a second y-axis for the plot
- We already count unique glideins running in the frontend. We should be able to log this in the frontend log as a new column and track it through the ClassAd
- Check with Factory Operations to see if there is anything else they would like to see
Feel free to open sub-tickets as appropriate if the changes get too big to address in one release.
#6 Updated by Marco Mambelli over 2 years ago
Reporting original email from Jeff:
Hello glideinWMS developers,
Here is the scaling issue I reported on the stakeholder call. Please see
the screenshot of the factoryStatus page for one of the GPGrid entries:
In the single core days, the black line "Glideins at Collector" and the
purple line, "Glideins claimed by user jobs" for a healthy site should
be at the same level as the green area, "Running glidein jobs".
Now that we are running multicore glideins, we can't visually easily
detect a good site from a bad site, because these 2 lines are a factor
of 8 too high. So it's more of a picture of "cores at collector", and
"cores claimed by user jobs". In order to make this page useful again
for debugging, I propose those two data fields are divided by the core
count, so it can be scaled down to the real number of glideins (green area).
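The rescaling proposed in this paragraph can be sketched as follows. This is only an illustration, not factory monitoring code: the function name and the sample numbers are hypothetical, and it assumes every glidein of the entry has the same core count (8 in the example from the email).

```python
def cores_to_glideins(core_count, cores_per_glidein):
    """Scale a per-core (slot) count down to an equivalent glidein count.

    Hypothetical helper: assumes all glideins of the entry have the
    same number of cores.
    """
    if cores_per_glidein <= 0:
        raise ValueError("cores_per_glidein must be positive")
    return core_count / cores_per_glidein


# Hypothetical sample values for an entry running 8-core glideins.
running_glideins = 100   # green area, "Running glidein jobs"
registered_cores = 800   # black line, "Glideins at Collector"
claimed_cores = 760      # purple line, "Glideins claimed by user jobs"

registered_glideins = cores_to_glideins(registered_cores, 8)
claimed_glideins = cores_to_glideins(claimed_cores, 8)

# After scaling, both lines are comparable to the running glideins again.
print(running_glideins, registered_glideins, claimed_glideins)
```

With these sample numbers, the scaled lines land at or just below the green area, which is the visual check described for a healthy site.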
I assume a similar adjustment will need to be made on the factory status
now page for the same columns (Registered, Claimed):
#7 Updated by Marco Mambelli over 2 years ago
- Status changed from New to Feedback
- Assignee changed from Marco Mambelli to Dennis Box
Changes are in branch_v3_2
I added ReqIdleCores and ReqMaxCores, which use GLIDEIN_CPUS to estimate the requested cores per entry.
A next step would be to try to get the actual numbers, but this may be difficult since counting at the factory schedd is not doable as it is (grid universe jobs do not report the CPUs they received).
This will require some discussion and investigation.
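As an illustration of the kind of estimate described above, here is a minimal sketch. It is not the code in branch_v3_2: the function and parameter names are hypothetical, and it assumes the estimate is the requested glidein count multiplied by an integer GLIDEIN_CPUS value for the entry.

```python
def estimate_requested_cores(req_idle_glideins, req_max_glideins, glidein_cpus=1):
    """Estimate requested cores for an entry from requested glidein counts.

    Assumption: each requested glidein provides `glidein_cpus` cores
    (an integer GLIDEIN_CPUS value). The real number of cores received
    may differ, since grid universe jobs do not report it.
    """
    if glidein_cpus < 1:
        raise ValueError("glidein_cpus must be at least 1")
    return {
        "ReqIdleCores": req_idle_glideins * glidein_cpus,
        "ReqMaxCores": req_max_glideins * glidein_cpus,
    }


# Hypothetical request: 10 idle and at most 50 glideins on an 8-core entry.
estimate = estimate_requested_cores(10, 50, glidein_cpus=8)
print(estimate)
```

The result is only an upper-level estimate of what was asked for, which is why getting the actual numbers remains an open item.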