Feature #14559

Factory monitoring issues because of changes in the semantics (glideins v/s slots) on the frontend side

Added by Parag Mhashilkar over 3 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
High
Category:
-
Target version:
Start date:
11/17/2016
Due date:
% Done:

0%

Estimated time:
Stakeholders:

OSG, CMS, Factory Ops

Duration:

Description

Since the recent changes in the semantics of glideins v/s slots in the frontend, factory monitoring has become ambiguous and has a scaling issue. Marco thinks it is a matter of selecting appropriate options to plot.
Things to do

  • Verify that labels/options to check make sense
  • Check/update default options to plot that make sense in the current situation
  • Check if we need another set of y axis for the plot
  • We already count unique glideins running in the frontend. We should be able to log this in the frontend log as a new column and track it through the ClassAd
  • Check with the factory operations to see if there is anything else they would like to see.

Feel free to open sub-tickets as appropriate if the changes get too big to address in one release
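
As a rough illustration of the "count unique glideins" item above, here is a minimal sketch. It assumes slot names of the form `slotN@<glidein id>@<host>`, so that all slots of one glidein share everything after the first `@`. The function name and the sample names are hypothetical and are not taken from the glideinWMS code.

```python
def count_unique_glideins(slot_names):
    """Count distinct glideins given a list of slot names.

    Assumption: each name looks like 'slot1_2@glidein_123_456@host',
    so the part after the first '@' identifies the glidein.
    """
    glideins = set()
    for name in slot_names:
        # Split off the slot prefix; keep the whole name if there is no '@'
        _, sep, rest = name.partition("@")
        glideins.add(rest if sep else name)
    return len(glideins)


# Hypothetical sample: three slots belonging to two glideins
sample = [
    "slot1@glidein_1_2@node1.example.com",
    "slot1_1@glidein_1_2@node1.example.com",
    "slot1@glidein_3_4@node2.example.com",
]
unique_count = count_unique_glideins(sample)
```

The resulting number could then be written as an extra column in the frontend log, as proposed in the list.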

factorystatus_glidein_scale.png (83.4 KB) Marco Mambelli, 07/26/2017 10:18 AM

History

#1 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_17 to v3_2_18

#2 Updated by Parag Mhashilkar over 3 years ago

  • Priority changed from Normal to High

#3 Updated by Marco Mambelli over 3 years ago

  • Target version changed from v3_2_18 to v3_2_19

#4 Updated by Marco Mambelli about 3 years ago

  • Target version changed from v3_2_19 to v3_2_20

#5 Updated by Marco Mambelli almost 3 years ago

  • Stakeholders updated (diff)

#6 Updated by Marco Mambelli almost 3 years ago

Reporting original email from Jeff:

Hello glideinWMS developers,

Here is the scaling issue I reported on the stakeholder call. Please see
the screenshot of the factoryStatus page for one of the GPGrid entries:
http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=FNAL_GPGrid_ce01_mcore

In the single core days, the black line "Glideins at Collector" and the
purple line, "Glideins claimed by user jobs" for a healthy site should
be at the same level as the green area, "Running glidein jobs".

Now that we are running multicore glideins, we can't visually easily
detect a good site from a bad site, because these 2 lines are a factor
of 8 too high. So it's more of a picture of "cores at collector", and
"cores claimed by user jobs". In order to make this page useful again
for debugging, I propose those two data fields are divided by the core
count, so it can be scaled down to the real number of glideins (green area).

I assume a similar adjustment will need to be made on the factory status
now page for the same columns (Registered, Claimed):
http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatusNow.html

Thanks,
Jeff
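
The scaling proposed in the email can be sketched as follows. This is only an illustration of the arithmetic, assuming every glidein of an entry has the same, known core count. The function name and the sample numbers are hypothetical.

```python
def cores_to_glideins(core_count, cores_per_glidein):
    """Scale a core-based counter down to an equivalent number of glideins.

    Assumption: all glideins of the entry provide the same number of cores.
    """
    if cores_per_glidein <= 0:
        raise ValueError("cores_per_glidein must be positive")
    return core_count / cores_per_glidein


# Hypothetical 8-core entry: 800 cores registered at the collector
registered_glideins = cores_to_glideins(800, 8)
# 640 cores claimed by user jobs
claimed_glideins = cores_to_glideins(640, 8)
```

With this scaling, the "at Collector" and "claimed" lines for a healthy site would again sit near the "Running glidein jobs" area.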

#7 Updated by Marco Mambelli almost 3 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Dennis Box

Changes are in branch_v3_2
I added ReqIdleCores and ReqMaxCores, which use GLIDEIN_CPUS to estimate the requested cores per entry.

A next step would be to try to get the actual numbers, but this may be difficult: counting at the factory schedd is not doable as it is, because grid universe jobs do not report the CPUs they received.
This will require some discussion and investigation.
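
For illustration, an estimate of this kind could look like the sketch below. It is not the code in branch_v3_2. The function and parameter names are hypothetical, and treating a non-numeric GLIDEIN_CPUS value as 1 core is an assumption made only for this example.

```python
def estimate_requested_cores(req_idle, req_max_run, glidein_cpus):
    """Estimate requested cores for an entry from glidein counts.

    req_idle / req_max_run: requested idle and max running glideins.
    glidein_cpus: value of GLIDEIN_CPUS for the entry (may be a string).
    Assumption: a non-numeric or missing value is counted as 1 core.
    """
    try:
        cpus = int(glidein_cpus)
    except (TypeError, ValueError):
        cpus = 1
    cpus = max(cpus, 1)
    return {
        "ReqIdleCores": req_idle * cpus,
        "ReqMaxCores": req_max_run * cpus,
    }


# Hypothetical 8-core entry asking for 10 idle and at most 50 running glideins
estimate = estimate_requested_cores(10, 50, "8")
```

Because this multiplies requested glideins by a configured value, it gives the requested cores rather than the cores actually received, which is the limitation described above.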

#8 Updated by Marco Mambelli almost 3 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from Dennis Box to Marco Mambelli

#9 Updated by Marco Mambelli almost 3 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Marco Mambelli to Dennis Box

#10 Updated by Dennis Box almost 3 years ago

  • Assignee changed from Dennis Box to Marco Mambelli

OK to merge.

#11 Updated by Marco Mambelli almost 3 years ago

  • Status changed from Feedback to Resolved

#12 Updated by Marco Mambelli over 2 years ago

  • Status changed from Resolved to Closed

#13 Updated by Marco Mambelli over 2 years ago

  • Stakeholders updated (diff)

