Bug #7780

Inaccurate running pilot jobs number in glideresource classads

Added by Parag Mhashilkar almost 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: High
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 02/04/2015
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: CMS, OSG
Duration:

Description

From: Brian Bockelman
Subject: GlideFactoryMonitor* entries in glideresource ads
Date: February 4, 2015 at 8:57:49 PM CST
To: Parag Mhashilkar
Cc: Jeff Dost, JAMES LETTS

Hi all,

I’ve been trying to make sense of the GlideFactoryMonitor* keys in the glideresource ads. Basically, for each CMS entry I look at in the OSG, the number of running jobs is a huge over-estimate.

After looking at the factory monitoring, the local CEs, and the frontend monitoring, I’ve figured it out - the GlideFactoryMonitor* values are the totals for all frontends, even though they sit in a per-group ad. So, for example,

condor_status -any -pool vocms097.cern.ch -l CMS_T2_US_Nebraska_Red_gw2@gfactory_instance@SDSC@CMSG-v1_0.main | sort

will have:

GlideClientMonitorGlideinsRunning = 516
GlideClientMonitorJobsRunningHere = 509
GlideFactoryMonitorStatusRunning = 748

The GlideClient* numbers are roughly correct - that’s the number of pilot and payload jobs running from the SDSC factory in the ‘main’ group of the CMSG frontend. However, GlideFactoryMonitor* is the number of all running pilots in the SDSC factory for this entry across all VOs. Hence, using the glideresource ads, it’s impossible to reconcile the three views for CMS (running pilots, running payloads, and the running htcondor-g jobs).
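As a rough illustration of the mismatch, the sketch below parses the three attributes quoted above and computes the gap between the factory-wide and per-group pilot counts. The helper name `parse_classad_lines` is hypothetical, and treating the difference as "pilots belonging to other frontends" is only an approximation, since the two counters come from different monitoring sources.

```python
def parse_classad_lines(text):
    """Parse 'Key = value' lines (as printed by condor_status -l) into a dict."""
    attrs = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


# Values copied from the example output in this report.
sample = """\
GlideClientMonitorGlideinsRunning = 516
GlideClientMonitorJobsRunningHere = 509
GlideFactoryMonitorStatusRunning = 748
"""

attrs = parse_classad_lines(sample)
factory_running = int(attrs["GlideFactoryMonitorStatusRunning"])
client_running = int(attrs["GlideClientMonitorGlideinsRunning"])

# Running pilots reported by the factory that this frontend group does not
# account for (approximate: the counters are not sampled at the same time).
excess = factory_running - client_running
print(excess)
```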

I’m pretty sure this is a bug (and an annoying one, as the user collector has no way of knowing the number of running jobs according to HTCondor-G). I propose a patch along the lines below.

Brian

--- a/frontend/glideinFrontendInterface.py
+++ b/frontend/glideinFrontendInterface.py
@@ -1284,15 +1284,9 @@ class ResourceClassad(classadSupport.Classad):
         @param info: Useful information from the glidefactoryclient classad
         """ 

-        # Required keys do not start with TotalClientMonitor but only
-        # start with Total. Substitute Total with GlideFactoryMonitor
-        # and put it in the classad
-        
         for key in info.keys():
-            if not key.startswith('TotalClientMonitor'):
-                if key.startswith('Total'):
-                    ad_key = key.replace('Total', 'GlideFactoryMonitor', 1)
-                    self.adParams[ad_key] = info[key]
+            if key.startswith('Status') or key.startswith('Requested'):
+                self.adParams['GlideFactoryMonitor' + key] = info[key]

 class ResourceClassadAdvertiser(classadSupport.ClassadAdvertiser):

History

#1 Updated by Parag Mhashilkar almost 5 years ago

  • Subject changed from Inaccurate information in glideresource classads to Inaccurate running pilot jobs number in glideresource classads
  • Stakeholders updated (diff)

#2 Updated by Brian Bockelman almost 5 years ago

  • Priority changed from Normal to High

#3 Updated by Parag Mhashilkar almost 5 years ago

The proposed patch doesn't really work. All it does is make the GlideFactoryMonitor* keys disappear from the classad. Let me see if there is a way to figure out whether what you need is available.
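
To illustrate why the keys vanish, here is a minimal standalone sketch (not the actual glideinWMS code) comparing the original key mapping with the proposed one. The function names and the `info` dictionary are assumptions for illustration; `TotalStatusRunning` is inferred from the `GlideFactoryMonitorStatusRunning` attribute quoted in the description, and the other keys are hypothetical. If every key in `info` starts with `Total`, the proposed filter matches nothing:

```python
def original_mapping(info):
    """Mapping before the patch: Total* -> GlideFactoryMonitor*, skipping TotalClientMonitor*."""
    ad_params = {}
    for key in info:
        if not key.startswith('TotalClientMonitor'):
            if key.startswith('Total'):
                ad_key = key.replace('Total', 'GlideFactoryMonitor', 1)
                ad_params[ad_key] = info[key]
    return ad_params


def proposed_mapping(info):
    """Mapping from the proposed patch: only Status* and Requested* keys are kept."""
    ad_params = {}
    for key in info:
        if key.startswith('Status') or key.startswith('Requested'):
            ad_params['GlideFactoryMonitor' + key] = info[key]
    return ad_params


# Hypothetical input; the key names are assumptions, not taken from a real classad.
info = {
    'TotalStatusRunning': 748,
    'TotalStatusIdle': 12,
    'TotalClientMonitorGlideinsRunning': 516,
}

print(original_mapping(info))  # GlideFactoryMonitor* keys are present
print(proposed_mapping(info))  # empty: no key starts with 'Status' or 'Requested'
```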

#4 Updated by Parag Mhashilkar almost 5 years ago

Ok, so after more poking around, I found out that the Factory-side Total monitoring info does not come from the glidefactoryclient classads at all, but from the glidefactory (i.e. entry) classads. I vaguely remember a discussion that this info was enough, as the other relevant info came from the frontend via the glideclient classad.

We can get the info you need from the glidefactoryclient classads instead.

#5 Updated by Parag Mhashilkar almost 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Changes are in v3/7780. Please review.

#6 Updated by Marco Mambelli almost 5 years ago

  • Assignee changed from Marco Mambelli to Parag Mhashilkar

#7 Updated by Parag Mhashilkar almost 5 years ago

  • Status changed from Feedback to Resolved

#8 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

