Bug #3110

An entry in downtime does not show the glidein status

Added by Krista Larson about 7 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
11/06/2012
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

I have entries CMS_T1_DE_KIT_cream-4 and CMS_T1_DE_KIT_cream-5 with running glideins and non-zero status counts. If I put these entries in downtime, all the glidein and status numbers go to zero. If I take them out of downtime, I can see all the numbers again. I've attached screenshots.

This affects factoryStatusNow and factoryEntryStatusNow. I'm running gwms v2.6.1.

before_downtime.png (199 KB), Krista Larson, 11/06/2012 11:53 AM
after_downtime.png (198 KB), Krista Larson, 11/06/2012 11:53 AM

History

#1 Updated by Douglas Strain about 7 years ago

  • Assignee set to Douglas Strain

#2 Updated by Douglas Strain about 7 years ago

  • Target version set to v2_7_x

#3 Updated by Parag Mhashilkar almost 7 years ago

  • Priority changed from Normal to Low

#4 Updated by Burt Holzman almost 7 years ago

  • Assignee changed from Douglas Strain to Parag Mhashilkar

#5 Updated by Parag Mhashilkar almost 4 years ago

  • Target version changed from v2_7_x to v3_x

#6 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim
  • Target version changed from v3_x to v3_2_15

#7 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_15 to v3_2_16

#8 Updated by HyunWoo Kim over 3 years ago

I started from creation/web_base/frontendStatusNow.html
and investigated how this HTML page shows the results of the condor_q command
(and why it shows nothing when an entry is in downtime).
My current understanding is:
- this HTML file gathers information from /var/lib/gwms-factory/web-area/monitor/schedd_status.xml
- and this XML file gathers and combines the per-entry information from /var/lib/gwms-factory/web-area/monitor/entry/schedd_status.xml

When an entry is in downtime, glideFactoryMonitoring.data is empty and thus entry/schedd_status.xml is also empty.
It took me most of today to reach this conclusion.
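
The data flow described above (per-entry status files combined into one top-level file) can be sketched as follows. This is only an illustration: the XML layout, the attribute names Running and Idle, and the function name aggregate are assumptions made for the example, not the real schedd_status.xml schema or the real aggregator code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified per-entry documents. The real schedd_status.xml
# schema is richer and is not reproduced here.
ENTRY_UP = '<entry name="A"><Status Running="12" Idle="3"/></entry>'
ENTRY_DOWN = '<entry name="B"/>'  # nothing was recorded while in downtime


def aggregate(entry_docs):
    """Sum the per-entry counters into one set of totals."""
    totals = {"Running": 0, "Idle": 0}
    for doc in entry_docs:
        status = ET.fromstring(doc).find("Status")
        if status is None:
            # An empty per-entry file contributes nothing, so the
            # entry shows up as all zeros in the combined view.
            continue
        for key in totals:
            totals[key] += int(status.get(key, 0))
    return totals
```

With these inputs, an entry whose file is empty simply disappears from the totals, which matches the zeros seen on the status page.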

The question then is which part of the Factory code populates glideFactoryMonitoring.data, or skips doing so,
depending on whether an entry is in downtime.
I spent a couple of hours looking for the answer but have not found it yet.

I will need one or two more days for this.

#9 Updated by HyunWoo Kim over 3 years ago

I think I found a candidate solution.

My findings are as follows:

1. At the entry level: first, the following code writes /var/lib/gwms-factory/web-area/monitor/entry_Amazon_HKTest/schedd_status.xml

   
glideFactoryEntryGroup.py
def iterate(parent_pid, sleep_time, advertize_rate, glideinDescript,
    while 1:
        try:
            done_something = iterate_one(count==0, factory_in_downtime,glideinDescript, frontendDescript,group_name, my_entries)
            for entry in entrylists[cpu]:
                 entry.writeStats()
class Entry:
    def writeStats(self):
        self.gflFactoryConfig.qc_stats.finalizeClientMonitor()
        self.gflFactoryConfig.qc_stats.write_file(monitoringConfig=self.monitoringConfig)

2. At the Factory level: a loop inside glideFactory.py reads /var/lib/gwms-factory/web-area/monitor/entry_Amazon_HKTest/schedd_status.xml
and writes /var/lib/gwms-factory/web-area/monitor/schedd_status.xml.
creation/web_base/frontendStatusNow.html then reads from this /var/lib/gwms-factory/web-area/monitor/schedd_status.xml

   
glideFactory.py
def aggregate_stats(in_downtime):
    try:
        _ = glideFactoryMonitorAggregator.aggregateStatus(in_downtime)

def spawn(sleep_time, advertize_rate, startup_dir, glideinDescript, frontendDescript, entries, restart_attempts, restart_interval):
    for group in range(len(entry_groups)):
            entry_names = string.join(entry_groups[group], ':')
            logSupport.log.info("Starting EntryGroup %s: %s" % (group, entry_groups[group]))
            command_list = [sys.executable, os.path.join(STARTUP_DIR,  "glideFactoryEntryGroup.py"),entry_names, str(group)]
            childs[group] = subprocess.Popen(command_list, preexec_fn=_set_rlimit)
    while 1:
        logSupport.log.info("Aggregate monitoring data")
        aggregate_stats(factory_downtimes.checkDowntime())

3. Now the question is:
where does /var/lib/gwms-factory/web-area/monitor/entry_Amazon_HKTest/schedd_status.xml
get the necessary data?

My solution is in glideFactoryEntry.py:

   
def unit_work_v3(entry,,,):
    entry.log.info("Checking downtime for frontend %s security class: %s (entry %s)." % (client_security_name, credential_security_class, entry.name))
    if entry.isSecurityClassInDowntime(client_security_name, credential_security_class):
        entry.log.warning("Security class %s is currently in a downtime window for entry: %s. Ignoring request." 
#        return return_dict
        in_downtime = True

My investigation revealed that returning inside this if statement
left the data empty, which produced the empty values in the table in frontendStatusNow.html.
So I decided to comment out return return_dict.

Then, the next problem was,

 
def unit_work_v3(entry,): has the following code
    entry.setDowntime(in_downtime)
    entry.gflFactoryConfig.qc_stats.set_downtime(in_downtime)

class condorQStats: has 
    def set_downtime(self, in_downtime):
        self.downtime = str(in_downtime)
        return

and this is used in
   
    def get_xml_downtime(self, leading_tab=xmlFormat.DEFAULT_TAB):
        xml_downtime = xmlFormat.dict2string({}, dict_name='downtime', el_name='', params={'status':self.downtime}, leading_tab=leading_tab)
#        xml_downtime = xmlFormat.dict2string({}, dict_name='downtime', el_name='', params={'status':'True'}, leading_tab=leading_tab)
        return xml_downtime

Somehow, in_downtime was not set to True even when my entry was down,
so I had to insert in_downtime = True in the upper part of def unit_work_v3(entry,,,):
   
def unit_work_v3(entry,,,):
    if entry.isSecurityClassInDowntime(client_security_name, credential_security_class):
#        return return_dict
        in_downtime = True
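
The way the flag reaches the monitoring XML, as described above, can be sketched like this. It is a stand-in that uses the standard library's xml.etree.ElementTree instead of the Factory's xmlFormat module; the class name DowntimeStats, the constructor default and the exact element layout are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


class DowntimeStats:
    """Toy stand-in for the downtime part of condorQStats (hypothetical)."""

    def __init__(self, in_downtime=False):
        # Assumed default; the real initial value is not shown in this ticket.
        self.downtime = str(in_downtime)

    def set_downtime(self, in_downtime):
        # The flag is stored as a string, as in the excerpt above.
        self.downtime = str(in_downtime)

    def get_xml_downtime(self):
        # Emit an empty <downtime> element carrying the flag as an attribute.
        element = ET.Element("downtime", {"status": self.downtime})
        return ET.tostring(element, encoding="unicode")
```

If set_downtime() is never reached with True (for example because the function returned earlier), the element keeps whatever value it had before, which is consistent with the behaviour observed above.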

I tested this (tentative) solution in my factory:
- first, submit some jobs so that a couple of glideins are submitted
- run service gwms-factory down entry Amazon_HKTest
- in http://hepcloud-devfac.fnal.gov/factory/monitor/factoryStatusNow.html,
  check that a red downward arrow is shown
  and that the correct numbers appear in the table
- eventually these glideins are drained

So this solution appears to work,
but I need to double-check the Factory code to make sure it does not break other working parts.
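
The difference between the old and the proposed behaviour can be illustrated with a toy model. The function names collect_old and collect_new and the dictionary keys are hypothetical; the real unit_work_v3 does much more than this.

```python
def collect_old(entry_in_downtime, running):
    """Old behaviour: return before any monitoring data is recorded."""
    stats = {}
    if entry_in_downtime:
        return stats  # nothing recorded, so the status page shows zeros
    stats["Running"] = running
    stats["downtime"] = "False"
    return stats


def collect_new(entry_in_downtime, running):
    """Proposed behaviour: remember the downtime and keep collecting."""
    stats = {}
    in_downtime = False
    if entry_in_downtime:
        in_downtime = True  # replaces the early return
    stats["Running"] = running
    stats["downtime"] = str(in_downtime)
    return stats
```

For an entry that is not in downtime both functions record the same data; they differ only when the entry is in downtime.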

#10 Updated by HyunWoo Kim over 3 years ago

  • Status changed from New to Feedback
  • Assignee changed from HyunWoo Kim to Marco Mambelli

The changes that I am proposing are

In glideFactoryEntry.py
def unit_work_v3():

    if entry.isSecurityClassInDowntime(client_security_name,  credential_security_class):
#HK> comment out this return 
#       return return_dict
#HK> insert a new line
        in_downtime = True

Below, I explain how these changes work.
First we have to know how the Factory code treats factory_in_downtime:

glideFactoryEntryGroup.py
def iterate(): has
    factory_in_downtime = factory_downtimes.checkDowntime(entry="factory")
    done_something = iterate_one( factory_in_downtime )

def iterate_one( factory_in_downtime ):
    groupwork_done = find_and_perform_work(factory_in_downtime)

def find_and_perform_work(factory_in_downtime):
    for ent in work:
        entry = my_entries[ent]
        forkm_obj.add_fork( forked_check_and_perform_work, factory_in_downtime, entry, work )

def forked_check_and_perform_work(factory_in_downtime, entry, work):
    work_done = glideFactoryEntry.check_and_perform_work( factory_in_downtime, entry, work[entry.name])

glideFactoryEntry.py
def check_and_perform_work(factory_in_downtime, entry, work):
    in_downtime = factory_in_downtime
    work_performed = unit_work_v3( in_downtime )

def unit_work_v3( in_downtime ):

#HK> up to this point, in_downtime indicates if the entire Factory is in downtime

    if entry.isSecurityClassInDowntime(client_security_name,  credential_security_class):
#       return return_dict
        in_downtime = True

#HK> At this point, in_downtime can indicate if either the entire Factory is in downtime or this specific Entry is in downtime.
#HK> The following code reveals that when the entire Factory is in downtime,
#HK> the Factory goes all the way to keepIdleGlideins() of glideFactoryLib.py where no glideins are submitted if idle_glideins = 0
#HK> So, my solution to set in_downtime = True above is guaranteed to work based on the argument that the entire Factory downtime has worked fine with idle_glideins=0.
    if in_downtime or not can_submit_glideins:
        idle_glideins=0

#HK> So, the net effect is:
- let the entry code (unit_work_v3 of glideFactoryEntry.py) run all the way to the end, including the information collection needed for monitoring
- and make sure no new glideins get submitted, in the same way as during a downtime of the entire Factory, where
keepIdleGlideins() might still call sanitizeGlideins() for some cleanup
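
The two points above can be condensed into a small sketch. The helper names effective_downtime and idle_glideins_to_request are hypothetical and only mirror the logic quoted in this comment; they are not functions of glideFactoryEntry.py or glideFactoryLib.py.

```python
def effective_downtime(factory_in_downtime, entry_in_downtime):
    """The flag is True if either the whole Factory or this entry is down."""
    in_downtime = factory_in_downtime
    if entry_in_downtime:
        in_downtime = True
    return in_downtime


def idle_glideins_to_request(requested_idle, in_downtime, can_submit_glideins=True):
    """Mirror of the quoted check: no new idle glideins while in downtime."""
    idle_glideins = requested_idle
    if in_downtime or not can_submit_glideins:
        idle_glideins = 0
    return idle_glideins
```

Because an entry downtime now goes through the same idle_glideins = 0 path as a Factory downtime, the rest of the function (including the monitoring updates) still runs, while no new submissions are requested.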

I am placing this ticket under feedback and assigning it to Marco Mambelli.

#11 Updated by Marco Mambelli over 3 years ago

  • Assignee changed from Marco Mambelli to HyunWoo Kim

Please indent the comment with the block. Although the current code is allowed by Python (not an error), it would be clearer if the comments kept the block-level indentation (as other comments do). Then you can merge.

About the logic: the explanation makes sense but I did not test the code.

#12 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Feedback to Closed

Merged into branch_v3_2;
closing this ticket.


