Project

General

Profile

Feature #11418

Feature #7920: Log all curbs & limits hit to the user collector ads

Curbs and Limits for Factory

Added by HyunWoo Kim about 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Factory Monitoring
Target version:
Start date:
01/13/2016
Due date:
% Done:

90%

Estimated time:
Stakeholders:

OSG

Duration:

Description

The original ticket 7920 is complete, and we decided to open a subtask to add the similar functionality to Factory.
This new feature will pick up various limits, compare them with corresponding values and add new attributes to the glidefactoryclient classad
so that these information can be eventually advertized in glideresource classad for users to be able to know which limits have been used for determining the current number of glideins.


Related issues

Has duplicate GlideinWMS - Feature #7826: Need additional debugging information in the glideresource classadClosed02/10/2015

History

#1 Updated by HyunWoo Kim about 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

Ready for Feedback.

#2 Updated by Parag Mhashilkar almost 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Sent feedback separately

#3 Updated by HyunWoo Kim almost 4 years ago

I have contemplated over Parag's first feedback,
he said,

" 2. writeClassadsToFile()
All the new code does not below here. Main objective of the function is to write classad to file and not do any processing
Calling queryQueuedGlideins() is expensive in case of busy system. All this info is already computed elsewhere and you should be able to use it when accessed first time"

Here is the result of my analysis of his comment and this issue:

The reason why in my first version, I call glideFactoryLib.getCondorQData() in def writeClassadsToFile()
is because the first call-out of the same function inside iterate_one is in an forked process.
Although normally what is changed inside a function call affects the original object (pass-by-object-reference of Python)
the situation is different when the function call is made in an forked process.
(I have confirmed this by writing a simple Python that has a same basic structure as glideFactoryEntry.py)

In order for def writeClassadsToFile() to have necessary information, there can be two solutions to this issue:

1. keep the current structure and modify the code such that the first part of iterate_one passes necessary information to def writeClassadsToFile().
2. modify the current structure such that the init method of class Entry calls glideFactoryLib.getCondorQData()
and saves the result in an instance variable.
This way this instance variable can be kept until def writeClassadsToFile() is used..

I prefer option 2, because option 1 might require we modify too much things in other codes.
In order for option 2 to be accepted, we need to carefully examine how this change would affect the overall Factory operation..

#4 Updated by HyunWoo Kim almost 4 years ago

Here are my responses to Parag's comments

glideFactoryEntry.py

1. glideinsWithinLimits()
Why add total_max_glideins defination if you are not going to use it?
HK> it was just a bug (I forgot to remove the line)

2. writeClassadsToFile()

All the new code does not below here. Main objective of the function is to write classad to file and not do any processing
Calling queryQueuedGlideins() is expensive in case of busy system. All this info is already computed elsewhere and you should be able to use it when accessed first time
HK> As I updated yesterday, this issue can be resolved by modifying the code in a different way, I need to talk with Parag in person.

Why are the values of triggered limits different that what we agreed and implemented for frontend?
For example:
limits_triggered['TotalGlideinsPerEntry'] = 'count=%i, limit=%i' % (count_status['Total'], self.max_running)
HK> I modified the Factory for it to be consistent with the frontend format

glideFactoryInterface
createGlideinClientMonitoringFile()
You should be able to much better in terms of computing the values before hand and writing to classad file
If I remember correctly, the glidefactoryclient classads are uniq to sec class
or sec class + credid and you need not write the limits triggered for all sec classes in all the classads

HK> I reviewed the code and configurations and my current conclusion is factory can still advertizes different limits for each security class.
HK> My understanding is, security name is unique to one frontend but each glideclient can have different security class
HK> so, each factory entry can have different limits for each of possible security classes.

I am updating my local branch, and I will test this new changes this afternoon..

#5 Updated by HyunWoo Kim almost 4 years ago

  • Status changed from Feedback to Assigned

Parag showed me the current glideFactoryEntryGroup.py already has a mechanism to convey a certain set of information from a porked process to the main process.
It turns out this dictionary already contains self.glideinTotals.
This means I don't need to call condorq in writeClassadsToFile() of glideFactoryEntry.py,
I just need to make sure self.glideinTotals is not None and use it.
I have updated my local branch.
I will have to do one more test before I merge this into branch_v3_2

#6 Updated by HyunWoo Kim almost 4 years ago

In glideFactoryEntryGroup.py,
def find_and_perform_work(factory_in_downtime, glideinDescript, frontendDescript, group_name, my_entries):

for ent in work:
entry = my_entries[ent]
forkm_obj.add_fork(entry.name, forked_check_and_perform_work, factory_in_downtime, entry, work)
post_work_info = forkm_obj.bounded_fork_and_collect(parallel_workers)
for entry in my_entries:
if ( (entry in post_work_info) and (len(post_work_info[entry]) > 0) ):
groupwork_done[entry] = {'work_done': post_work_info[entry]['work_done']}
(my_entries[entry]).setState(post_work_info[entry])

namely, entry.glideinTotals has a value only when if ( (entry in post_work_info) and (len(post_work_info[entry]) > 0) ):
this is why I check first if self.glideinTotals is None
glideinTotals = self.glideinTotals
if glideinTotals != None:
in
def writeClassadsToFile(self, downtime_flag, gf_filename,
gfc_filename, append=True):

namely, limits_triggered will only be filled when setState was invoked in glideFactoryEntryGroup.py
Even if limits_triggered is empty, glideFactoryInterface.py will not raise an error because it will first check
if desired keys are available in limits_triggered.

I tested the new code in my local GWMS instance.
I will merge into branch_v3_2 after I have talked with Parag again.

#7 Updated by HyunWoo Kim almost 4 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar
  • % Done changed from 80 to 90

On Friday, Parag suggested that we should not repeat in the parent process the calculations that are conducted in the forked processes.
So, I modified glideFactoryEntry.py to move the part that sets limits_triggered to glideinsWithinLimits()
and also modified getState() and setState() in the same file to add a new entry(limits_triggered).
I confirmed that limits_triggered is available in the parent process for advertizing..
I am passing this to Parag again for feedback.

#8 Updated by Parag Mhashilkar almost 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Sent feedback separately. Also found a bug in the frontend's limits/curbs advertising.

#9 Updated by HyunWoo Kim almost 4 years ago

  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

These are feedback comments from Parag:

1. Please do not add extra spaces before and after function arg list. It makes lines unneccessarily long and hard to read. Also it violates pep8 guidance
2. Do not leave around unwanted markers like #### in the code after debugging is done.

3. Why the classad attribute naming format different that what we use in case frontends?

Frontend naming convention is (as found in glideinFrontendElement.py and glideinFrontendInterface.py)
limits_triggered['TotalGlideinsPerEntry'] = 'count=%i, limit=%i' % (count_status['Total'], self.max_running)

for k,v in limits_triggered.iteritems():
if k.startswith('Curb'):
classadmessage = "GlideResource_Curb_"+k
else:
classadmessage = "GlideResource_Limit_"+k
self.adParams[classadmessage] = v

Frontend code picks up some attributes of glidefactoryclient classad and add them to glideresource classad
only if these attributes start with a pre-defined prefix, factoryConfig.glidein_monitor_prefix
which is set to a value as shown below:

class FactoryConfig:
def init(self):
self.glidein_monitor_prefix = "GlideinMonitor"

In addition to this prefix, there is one more string that is required.
This additional string should be one of pre-defined values
and I chose to use "Status"
(GlideinMonitorStatus is converted in Frontend to GlideFactoryMonitorStatus)

so, that is why the factory code is

ent_key = 'IdleGlideinsPerEntry'
if limits_triggered.get(ent_key) is not None:
fd.write('%sStatus_%s = "%s"\n' % (factoryConfig.glidein_monitor_prefix, ent_key, limits_triggered[ent_key]) )
in glideFactoryInterface.py

But now I realized that the Frontend code has this string "GlideResource_Limit",
I modified Factory code accordingly..

4. How to use dictionary in glideFactoryInterface, createGlideinClientMonitoringFile()
I modified the code in question according to Parag's suggestion.

I also corrected the bug in glideinFrontendInterface.py def setCurbsAndLimits(self, limits_triggered):
self.adParams[classadmessage] = v

#10 Updated by Parag Mhashilkar almost 4 years ago

  • Target version changed from v3_2_13 to v3_2_14

#11 Updated by Parag Mhashilkar almost 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

As we talked after the meeting yesterday, please look at #11853 on how the changes are made to add a new class of attributes in different glide* classads. You should use same approach for consistency. Also, document this new class of attributes in http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/factory/design_data_exchange.html

#12 Updated by HyunWoo Kim almost 4 years ago

new solution for 11418 and 7920 after I reviewed the naming convention in 11853

The basic naming rule must be:
In Fronted, GlideinConfig in glideclient => GlideClientConfig in glideresource
In Factory, GlideinConfig in glidefactory => GlideFactoryConfg in glideresource

So, if I follow this rule, I can come up with the following
In Fronted, GlideinLimit in glideclient => GlideClientLimit in glideresource
In Factory, GlideinLimit in glidefactory => GlideFactoryLimit in glideresource

The followings are how I am going to change the names of classad attributes for limits triggered:

1. Frontend : (I will modify glideinFrontendInterface.py def setCurbsAndLimits(self, limits_triggered): )
OLD in glideresource: GlideResource_Limit_TotalGlideinsPerEntry = "count=0, limit=10000"
NEW in glideresource: GlideClientLimitTotalGlideinsPerEntry = "count=0, limit=10000"

2. Factory : (I will modify glideFactoryInterface.py def createGlideinClientMonitoringFile(fname,): )
OLD in glidefactoryclient: GlideinMonitorStatus_GlideResource_Limit_TotalGlideinsPerEntry = "count=0, limit=10000"
New in glidefactoryclient: GlideinMonitorStatus_GlideFactoryLimitTotalGlideinsPerEntry = "count=0, limit=10000"

(Here, new format can use GlideinMonitorStatus_GlideinLimitTotalGlideinsPerEntry = "count=0, limit=10000"
But, we must use GlideClientConfig because we want these to be picked by Frontend code to be part of glideresource
where only GlideinMonitorStatus will be searched for by Frontend code and will be replaced by GlideFactoryMonitorStatus,
Namely, GlideinMonitorStatus => GlideFactoryMonitorStatus)

So, in glidesource the above attributes wlll be automatically changed by Frontend code as follows:
OLD in glideresource : GlideFactoryMonitorStatus_GlideResource_Limit_TotalGlideinsPerEntry = "count=0, limit=10000"
New in glideresource : GlideFactoryMonitorStatus_GlideFactoryLimitTotalGlideinsPerEntry = "count=0, limit=10000"

I will modify my local branches (7920 for Frontend and 11418 for Factory) tomorrow, test and push.

#13 Updated by HyunWoo Kim almost 4 years ago

  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

#14 Updated by Parag Mhashilkar almost 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Can you please please update the docs as well? Once done please merge it to branch_v3_2.

#15 Updated by HyunWoo Kim almost 4 years ago

OK, I will work on the documentation and then proceed to merging.

#16 Updated by HyunWoo Kim almost 4 years ago

The following will be added to 11418/glideinwms/doc/factory/design_data_exchange.html

1. To advertize Frontend limits/curbs, Frontend adds the following attributes in glideresource classad about Frontend limits/curbs.
These new attributes in glideresource all start with the prefix "GlideClientLimit" to indicate that this attribute provides information about limit triggered.
examples>
If TotalGlideinsPerEntry on Frontend side is triggered,    you will see in    glideresource classad
GlideClientLimitTotalGlideinsPerEntry = "count=11, limit=10" 
It basically says that "the limit is 10 and the actual count was 11, so the code had to do something about this" 

2. To advertize Factory limits, Factory adds the following attributes in glidefactoryclient classad
These new attributes in glidefactoryclient all start with 2 prefixes
GlideinMonitorStatus_GlideFactoryLimit
GlideinMonitorStatus is    prepended by Factory code for this attribute to    be picked by Frontend code to become attribute of glideresource    classad.
The second prefix GlideFactoryLimit indicates that this attribute provides information about limit triggered.

examples found in glidefactoryclient>
If TotalGlideinsPerEntry limit is triggered on Factory side, you will see
GlideinMonitorStatus_GlideFactoryLimitTotalGlideinsPerEntry = "count=11, limit=10" 
It basically says that "the limit is 10 and the actual count was 11, so the code had to do something about this" 

Frontend code looks at glidefactoryclient and picks up attributes that start with
GlideinMonitorStatus
and convert them to start with
GlideFactoryMonitorStatus

examples found in glideresource>
Again, If TotalGlideinsPerEntry limit is triggered on Factory side, you will see
GlideFactoryMonitorStatus_GlideFactoryLimitTotalGlideinsPerEntry = "count=11, limit=10" 

I will put this in my local branch and merge my local branch into branch_v3_2 now
and then push to the origin.

#17 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Feedback to Resolved

I pushed this on April 28 Thursday.

#18 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

#19 Updated by Parag Mhashilkar over 3 years ago

  • Related to Feature #7826: Need additional debugging information in the glideresource classad added

#20 Updated by Parag Mhashilkar over 3 years ago

  • Related to deleted (Feature #7826: Need additional debugging information in the glideresource classad)

#21 Updated by Parag Mhashilkar over 3 years ago

  • Has duplicate Feature #7826: Need additional debugging information in the glideresource classad added


Also available in: Atom PDF