Bug #11645

GlideinWMS not submitting enough pilots to multicore sites

Added by Parag Mhashilkar over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
High
Assignee:
Parag Mhashilkar
Category:
-
Target version:
Start date:
02/05/2016
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

A new ELOG entry has been submitted:

Logbook : GlideInWMS
Author : Brian Bockelman
Email : mailto:
Category : Problem
Shifter Role : CompOps Expert
Subject : GlideinWMS not submitting enough pilots to multicore sites

Logbook URL : https://cms-logbook.cern.ch/elog/GlideInWMS/3336

=================================

Hi,

Just like we found ~10 months ago, the glideinWMS frontend isn't requesting enough pilots at multicore sites. It appears that the patch
we submitted to the project worked, but was subsequently broken again two weeks later
(https://github.com/holzman/glideinWMS/commit/6c3246442430e3c3a4f9c4abff4b317390c7a971).

Specifically, the frontend believes there are no running partitionable slots. This is because the payloads appear to be running in the
p-slot (if you look at the job's RemoteHost attribute) but p-slots are excluded from the collector queries here:

https://github.com/holzman/glideinWMS/blob/master/frontend/glideinFrontendElement.py#L1401

Since the p-slots aren't returned by the collector, the fallback logic here:

https://github.com/holzman/glideinWMS/blob/master/frontend/glideinFrontendLib.py#L130

causes all jobs running on a p-slot to be labelled as RunningOn="UNKNOWN".
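
The failure chain described above can be sketched in a few lines of plain Python (hypothetical data and simplified logic, not the actual GlideinWMS code): because the collector query drops partitionable slots, their names never reach the status dictionary, so the RemoteHost lookup for a payload running in a p-slot falls through to "UNKNOWN".

```python
# Simulated collector response: one partitionable slot and one static slot.
slot_ads = [
    {"Name": "slot1@worker.example.org", "PartitionableSlot": True},
    {"Name": "slot2@worker.example.org", "PartitionableSlot": False},
]

# The frontend's query excludes p-slots (mirroring, in simplified form, the
# constraint at glideinFrontendElement.py#L1401), so only static slots
# end up in the status dictionary.
status_dict = {ad["Name"]: ad for ad in slot_ads
               if not ad.get("PartitionableSlot")}

# Payload jobs report the slot they run in via RemoteHost.
jobs = [
    {"RemoteHost": "slot1@worker.example.org"},  # running in the p-slot
    {"RemoteHost": "slot2@worker.example.org"},  # running in a static slot
]

# Fallback analogous to glideinFrontendLib.py#L130: a job whose slot is
# missing from the status dict gets labelled as running on "UNKNOWN".
for job in jobs:
    job["RunningOn"] = ("entry_point" if job["RemoteHost"] in status_dict
                        else "UNKNOWN")

print([job["RunningOn"] for job in jobs])
```

The p-slot payload comes out as "UNKNOWN" even though it is genuinely running at the site, which is exactly the mislabelling the report describes.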

Since no jobs are labelled as running on the entry point, the max running calculation is based solely on the number of "effectively
idle" jobs that match the entry point and does not include the number of currently running pilots. This causes max running to be
way too low (often below the number of currently running pilots; the algorithm requires the max running to be at least the number of
currently running plus a few more based on the idle payloads in queue).
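
The knock-on effect on the pilot-request calculation can be illustrated with made-up numbers (the real frontend algorithm is more involved; this hypothetical simplification only shows the inequality the report describes):

```python
def max_running_estimate(running_here, effectively_idle):
    # Hypothetical simplification of the frontend's calculation: request at
    # least as many pilots as are already running on the entry, plus room
    # for the idle payloads waiting in the queue.
    return running_here + effectively_idle

actually_running = 1000   # pilots really running at the site
effectively_idle = 200    # idle payloads matching the entry

# Correct accounting: the running payloads are attributed to the entry.
healthy = max_running_estimate(actually_running, effectively_idle)

# Buggy accounting: every running payload was labelled RunningOn="UNKNOWN",
# so the entry appears to have zero running jobs.
buggy = max_running_estimate(0, effectively_idle)

# The buggy estimate falls far below the pilots already running, so the
# frontend requests no new pilots despite thousands of idle payloads.
assert buggy < actually_running < healthy
```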

Now, if the (# of entries) * (# of groups) * (# of factories) for a site is sufficiently high, the sheer number of entry points makes even the
too-low per-entry estimates add up to enough pilots to mask the issue. This is what I think is happening at the T1s: they get a factor-of-2
overestimate because of the t1prod group*.

However, the issue is much more apparent at T2s: for example, T2_US_Nebraska has been a few thousand cores under-utilized over
the past week or so. There were no pending pilots despite thousands of pending payloads. Thanks to Jean-Roch for pointing this
out!

Below is a patch that ought to work around the issue; I suspect the gWMS devs will want to do a more thorough investigation. I
additionally rebuilt the RPM for the current version using OSG's build infrastructure:

http://koji-hub.batlab.org/koji/buildinfo?buildID=8981

Hope this helps,

Brian

  • Which is something we should probably dig into in the first place: the fact that our T1s are being utilized right now suggests there
    are multiple bugs / misconfigurations that are cancelling each other out.

--- a/frontend/glideinFrontendElement.py
+++ b/frontend/glideinFrontendElement.py
@@ -371,7 +371,7 @@ class glideinFrontendElement:

        self.populate_status_dict_types()
        glideinFrontendLib.appendRealRunning(self.condorq_dict_running,
-                                             self.status_dict_types['Running']['dict'])
+                                             self.status_dict_types['Total']['dict'])

        # TODO: should IdleCores/RunningCores be commented here?
        self.stats['group'].logGlideins({
@@ -1414,7 +1414,7 @@ class glideinFrontendElement:
                status_format_list = list(status_format_list) + list(self.x509_proxy_plugin.get_required_classad_attributes())

            # Consider multicore slots with free cpus/memory only
-            constraint = '(GLIDECLIENT_Name=?="%s.%s") && (%s)' % (self.frontend_name, self.group_name, mc_idle_constraint)
+            constraint = '(GLIDECLIENT_Name=?="%s.%s") && (%s)' % (self.frontend_name, self.group_name, "True")
            # use the main collector... all adds must go there
            status_dict = glideinFrontendLib.getCondorStatus(
                              [None],

Related issues

Related to GlideinWMS - Bug #11521: Negative running Core count (Closed, 01/27/2016)

Related to GlideinWMS - Bug #11145: Glideins are not submitted because of errors in the counting of total, idle and running jobs when partitionable slots are involved (Closed, 12/14/2015)

History

#1 Updated by Parag Mhashilkar over 4 years ago

Burt also submitted an alternative patch to the branch burt_p-slot-fix.

#2 Updated by Parag Mhashilkar over 4 years ago

  • Description updated (diff)

#3 Updated by Parag Mhashilkar over 4 years ago

  • Related to Bug #11521: Negative running Core count added

#4 Updated by Parag Mhashilkar over 4 years ago

Some of the accounting issues have been addressed as part of #11521. This issue is also addressed by the code changes made to fix the p-slot accounting issues.

#5 Updated by Parag Mhashilkar over 4 years ago

  • Related to Bug #11145: Glideins are not submitted because of errors in the counting of total, idle and running jobs when partitionable slots are involved added

#6 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from New to Resolved

Issues #11145, #11521, #11580, and #11645 are addressed in the branch v3/pslot-accounting-review.
Changes have been merged to branch_v3_2.

#7 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

