Log all curbs & limits hit to the user collector ads
The curbs and limits are among the most opaque behaviors in glideinWMS. Many times (>10) we've tried to debug a mysterious lack of glideins only to find the issue was not site-related -- rather, it had to do with an unrelated global limit and a problem at another site.
We should be able to get the following into the glideresource ads:
- Per entry point, any factory-side curbs or limits applied (max running / idle jobs per frontend, for example)
- Per entry point, any frontend-side curbs or limits applied (max running / idle jobs per group, frontend, or globally, for example).
- Any schedd exclusions applied.
Since some limits are frontend-wide, it may make sense to have a "frontendresource" ad in the user collector. Not necessary though.
#3 Updated by HyunWoo Kim over 5 years ago
I learned that the Frontend composes the glideresource classad from the contents of the glideclient and glidefactoryclient classads.
So I will just have to add new entries to the Frontend's glideclient classad and the Factory's glidefactoryclient classad,
and these new entries can carry the curbs and limits that the user wants to know about.
I have identified the sections of the relevant code where I can add new classad entries,
and I will find out how to gather the relevant information on useful curbs and limits.
#4 Updated by HyunWoo Kim over 5 years ago
- % Done changed from 0 to 50
In Frontend code:
def createAdvertizeWorkFile should be modified to add new attributes for curbs and limits to glideclient classad
In Factory code:
def writeClassadsToFile and gfi.MultiAdvertizeGlideinClientMonitoring should be modified to add new attributes for curbs and limits to glidefactoryclient classad
class ResourceClassad: has two methods
def setGlideClientMonitorInfo(self, monitorInfo): for glideclient classad
def setGlideFactoryMonitorInfo(self, info): for glidefactoryclient classad
The second method loops over all the attributes in glidefactoryclient classad
which means our new attributes will be picked up by glideresource
but the first method copies attributes using their names
which means we will have to modify this method to explicitly specify new attributes to be copied.
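The difference between the two copy strategies can be sketched as follows. This is a simplified, hypothetical stand-in and not the actual ResourceClassad code: the class name, the `CLIENT_ATTRS` list, the `Client_` prefix, and the sample attribute names are all assumptions made for illustration.

```python
class ResourceClassadSketch:
    """Hypothetical, simplified stand-in for ResourceClassad."""

    # Assumed list of glideclient attribute names that are copied by name.
    CLIENT_ATTRS = ("Idle", "Running")

    def __init__(self):
        self.ad_params = {}

    def set_glide_client_monitor_info(self, monitor_info):
        # Copy by name: attributes missing from CLIENT_ATTRS are dropped,
        # so every new curb/limit attribute must be added to the list.
        for name in self.CLIENT_ATTRS:
            if name in monitor_info:
                self.ad_params["Client_" + name] = monitor_info[name]

    def set_glide_factory_monitor_info(self, info):
        # Loop over everything: new attributes are picked up automatically.
        for name, value in info.items():
            self.ad_params[name] = value


ad = ResourceClassadSketch()
ad.set_glide_client_monitor_info({"Idle": 0, "Running": 2, "NewLimitAttr": "x"})
ad.set_glide_factory_monitor_info({"NewFactoryLimitAttr": "y"})
print(sorted(ad.ad_params))
```

In this sketch `NewFactoryLimitAttr` reaches the resulting ad while `NewLimitAttr` does not, which is why only the name-based method needs to be changed when new attributes are introduced.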
#5 Updated by HyunWoo Kim over 5 years ago
- % Done changed from 50 to 60
I confirmed that when I added new test attributes in glidefactoryclient classad, those were picked up automatically by
I also understand how the glideclient classad is translated to the glideresource classad.
Now, I just need to find actual information to be advertized.
Parag just showed me the compute_glidein_min_idle method of class glideinFrontendElement,
which computes the limits that were hit.
I will have to figure out how to pass that information to the code that produces the glideclient and glidefactoryclient classads.
#7 Updated by HyunWoo Kim about 5 years ago
I am currently finalizing what kind of information should be provided and how to display it.
Basic variables to be displayed are
- number of user jobs (running or idling),
- number of glideins (or startds) that are running user jobs or waiting (idling) for user jobs
- and limits (and curbs) that affect the number of glideins in each category (per entry, per group, per FE, or globally)
The following is a candidate solution that shows the Frontend side.
I am still working on the Factory-side information.
- Information on Jobs
Total Idlling Jobs in the Queue (matched or unmatched) = 0
Total Idlling Jobs in the Queue matched for this Entry = 0
Total Running Jobs in the Queue matched for this entry = 2
- Information on Entry-centric Glideins or Startds
Total(Idlling+Running) Glideins/Startds per Entry(and its limit) = 2(10000)
All Idlling---------------Glideins/Startds per Entry(and its limit) = 0(100)
- Information on Frontend-centric Glideins or Startds
Total(Idlling+Running) Glideins/Startds per Group(and its limit) = 4(100000)
All Idlling---------------Glideins/Startds per Group(and its limit) = 0(1000)
Total(Idlling+Running) Glideins/Startds per Frone(and its limit) = 4(100000)
All Idlling---------------Glideins/Startds per Frone(and its limit) = 0(1000)
Total(Idlling+Running) Glideins/Startds per Globa(and its limit) = 4(100000)
All Idlling---------------Glideins/Startds per Globa(and its limit) = 0(1000)
#9 Updated by HyunWoo Kim about 5 years ago
I did some more testing to see how the new code works.
In /etc/gwms-frontend/frontend.xml, I modified the following three max values:
<running_glideins_total curb="90000" max="2"/> per frontend
<running_glideins_total_global curb="90000" max="3"/> per global
<group name="main" enabled="True">
<running_glideins_total curb="90000" max="4"/> per group
My internal log shows
Total(Idlling+Running) Glideins/Startds per Group(and its limit) = 4(4)
Total(Idlling+Running) Glideins/Startds per Frone(and its limit) = 4(2)
Total(Idlling+Running) Glideins/Startds per Globa(and its limit) = 4(3)
[root@fermicloud159 frontend]# condor_status -any -l -constraint 'MyType=="glideresource"' | grep Dummy
TotalGlideinsPerGroup_HasBeenTriggered_NotRequestingMoreGlideins = "Dummy"
TotalGlideinsPerFrontend_HasBeenTriggered_NotRequestingMoreGlideins = "Dummy"
TotalGlideinsGlobal_HasBeenTriggered_NotRequestingMoreGlideins = "Dummy"
At this point, there were 4 running glideins and no more glideins were requested because the above
3 limits had been reached or exceeded.
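As a minimal sketch of why all three limits fire in this test, the check below applies a `>=` comparison to the count of 4 running glideins and the three max values set in frontend.xml. The function name `limits_reached` and the dictionary layout are hypothetical and not part of the Frontend code.

```python
def limits_reached(count, limits):
    """Return the names of limits that count has reached or exceeded."""
    return [name for name, limit in limits.items() if count >= limit]


# Max values from the modified frontend.xml (group=4, frontend=2, global=3).
limits = {
    "TotalGlideinsPerGroup": 4,
    "TotalGlideinsPerFrontend": 2,
    "TotalGlideinsGlobal": 3,
}

reached = limits_reached(4, limits)
print(reached)
```

The group limit is only reached (4 against 4), while the frontend and global limits are exceeded, but with `>=` all three stop further requests.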
So, we can get some useful information from the glideresource classad attributes.
I will talk with Parag tomorrow and proceed to "Feedback" status if this is satisfactory.
I will create a sub-ticket for similar code for the Factory.
#11 Updated by Parag Mhashilkar about 5 years ago
- Assignee changed from Parag Mhashilkar to HyunWoo Kim
Setting the attribute value to "Dummy" is not useful and not good practice. You can easily make this classad attribute more useful by assigning a meaningful value.
Here is an example,
if (count_status['Total'] >= self.max_running): which_limits_triggered['TotalGlideinsPerEntry'] = 1
Change this to
if (count_status['Total'] >= self.max_running): limits_triggered['TotalGlideinsPerEntry'] = 'count=%i, limit=%i' % (count_status['Total'], self.max_running)
- Shorten classad attribute name
- Example: foo*_HasBeenTriggered_NotRequestingMoreGlideins -> GlideResource_Limit_foo*
- Example: Curb*_HasBeenTriggered_GlideinsRequestsReduced -> GlideResource_Curb_*. For values of curb attributes you can have 'count=%i, limit=%i, curbed='
- Remove extra spaces from the comparisons in the if statements
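The renaming and value format suggested above could look roughly like this. It is a sketch under assumptions: the helper names `short_name` and `limit_value` are hypothetical, and `CurbIdleGlideinsPerEntry` is a made-up attribute name used only to exercise the Curb* pattern. The curb value format is left out because its last field is not specified here.

```python
OLD_LIMIT_SUFFIX = "_HasBeenTriggered_NotRequestingMoreGlideins"
OLD_CURB_SUFFIX = "_HasBeenTriggered_GlideinsRequestsReduced"


def short_name(old_name):
    """Map an old, long attribute name to the shorter GlideResource_* form."""
    if old_name.endswith(OLD_LIMIT_SUFFIX):
        return "GlideResource_Limit_" + old_name[:-len(OLD_LIMIT_SUFFIX)]
    if old_name.startswith("Curb") and old_name.endswith(OLD_CURB_SUFFIX):
        return "GlideResource_Curb_" + old_name[len("Curb"):-len(OLD_CURB_SUFFIX)]
    return old_name  # leave unrelated names untouched


def limit_value(count, limit):
    """Build the informative value that replaces "Dummy"."""
    return "count=%i, limit=%i" % (count, limit)


old = "TotalGlideinsPerEntry" + OLD_LIMIT_SUFFIX
limits_triggered = {short_name(old): limit_value(4, 2)}
print(limits_triggered)
```

With these helpers the ad would carry `GlideResource_Limit_TotalGlideinsPerEntry = "count=4, limit=2"` instead of a long name with a "Dummy" value.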