Bug #14501

Factory entries submitting glideins even after hitting the limit

Added by Parag Mhashilkar about 4 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
11/14/2016
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

This was observed during the HEPCloud demo. The entry was configured with the following limits:

            <max_jobs>
               <default_per_frontend glideins="1000" held="200" idle="100"/>
               <per_entry glideins="1000" held="200" idle="100"/>
               <per_frontends>
               </per_frontends>
            </max_jobs>

And yet we saw over 1000 glideins on three of the Google entries:

[root@cmssrv280 ~]# condor_q -g -af GlideinEntryName | sort | uniq -c
      1 CMS_T1_US_FNAL_condce
      1 CMS_T1_US_FNAL_condce2
      2 CMS_T1_US_FNAL_condce4
   1220 Google_us_central1-a
   1170 Google_us_central1-b
   1212 Google_us_central1-c
    995 Google_us_central1-f
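
To make the discrepancy explicit, here is a minimal sketch that compares the counts above against the configured per_entry limit. It assumes the same max_jobs block applies to every Google entry; the helper names (per_entry_limit, parse_counts, entries_over_limit) are hypothetical and not part of the factory code.

```python
import xml.etree.ElementTree as ET

# Configuration fragment copied from the description above.
MAX_JOBS_XML = """
<max_jobs>
   <default_per_frontend glideins="1000" held="200" idle="100"/>
   <per_entry glideins="1000" held="200" idle="100"/>
   <per_frontends>
   </per_frontends>
</max_jobs>
"""

# Output of `condor_q ... | sort | uniq -c` copied from the description above.
CONDOR_Q_COUNTS = """
      1 CMS_T1_US_FNAL_condce
      1 CMS_T1_US_FNAL_condce2
      2 CMS_T1_US_FNAL_condce4
   1220 Google_us_central1-a
   1170 Google_us_central1-b
   1212 Google_us_central1-c
    995 Google_us_central1-f
"""


def per_entry_limit(xml_text):
    """Return the per_entry glideins limit from a max_jobs fragment."""
    root = ET.fromstring(xml_text.strip())
    return int(root.find("per_entry").get("glideins"))


def parse_counts(text):
    """Parse `uniq -c` style lines into {entry_name: count}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counts[parts[1]] = int(parts[0])
    return counts


def entries_over_limit(counts, limit):
    """Return {entry_name: excess} for entries whose count exceeds the limit."""
    return {name: n - limit for name, n in counts.items() if n > limit}


limit = per_entry_limit(MAX_JOBS_XML)
over = entries_over_limit(parse_counts(CONDOR_Q_COUNTS), limit)
for name in sorted(over):
    print("%s exceeds the limit of %d by %d" % (name, limit, over[name]))
```

With the numbers in this ticket, the three entries us_central1-a, -b and -c are over the limit, while us_central1-f (995) is still below it.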

Log from the iteration before the entry stopped submitting glideins:

[2016-11-14 10:15:45,724] INFO: glideFactoryEntry:346: Iteration initialized
[2016-11-14 10:15:45,959] DEBUG: glideFactoryEntry:1006: Checking security credentials for client cmssrv279-fnal-gov_OSG_gWMSFrontend.cms_google
[2016-11-14 10:15:45,962] INFO: glideFactoryEntry:1124: Checking downtime for frontend hepcloudFE security class: cms (entry Google_us_central1-a).
[2016-11-14 10:15:45,963] INFO: glideFactoryLib:873: Client cmssrv279-fnal-gov_OSG_gWMSFrontend.cms_google (secid: hepcloudFE_cms) requesting 1352 glideins, max running 3482, remove excess 'NO'
[2016-11-14 10:15:45,963] INFO: glideFactoryLib:874:   Params: {'CONDOR_VERSION': 'default', 'GLIDEIN_Job_Max_Time': 34800, 'GLIDECLIENT_ReqNode': 'cmssrv280.fnal.gov', 'VM_MAX_LIFETIME': '2600000', 'GLIDECLIENT_Rank': '1', 'GLIDEIN_Report_Failed': 'NEVER', 'MIN_DISK_GBS': 1, 'GLIDEIN_CCB': 'hepcsvc02.fnal.gov:9620-9730;hepcsvc03.fnal.gov:9620-9730', 'GLIDEIN_Max_Walltime': 2600000, 'VM_DISABLE_SHUTDOWN': 'False', 'CONDOR_ARCH': 'default', 'UPDATE_COLLECTOR_WITH_TCP': 'True', 'USE_CCB': 'True', 'USE_MATCH_AUTH': 'True', 'CONDOR_OS': 'default', 'GLIDEIN_Collector': 'cmssrv274.fnal.gov:9620-9730;cmssrv276.fnal.gov:9620-9730'}
[2016-11-14 10:15:45,963] INFO: glideFactoryLib:877:   Decrypted Param Names: ['AuthFile', 'SecurityName', 'VMId', 'GlideinProxy', 'SecurityClass']
[2016-11-14 10:15:46,028] INFO: glideFactoryLib:847: Client cmssrv279-fnal-gov_OSG_gWMSFrontend.cms_google (secid: hepcloudFE_cms) schedd status {1: 0, 2: 80}
[2016-11-14 10:15:46,129] INFO: glideFactoryEntry:1756: Using v3+ protocol and credential 359869
[2016-11-14 10:15:46,129] DEBUG: glideFactoryLib:616: Additional idle glideins exceeded entry max submit limits 20, adjusted add_glideins to entry max submit rate
[2016-11-14 10:15:46,129] DEBUG: glideFactoryLib:619: Submitting 20 glideins
[2016-11-14 10:15:46,134] DEBUG: glideFactoryLib:1429: params: {'CONDOR_VERSION': 'default', 'GLIDEIN_Job_Max_Time': 34800, 'GLIDECLIENT_ReqNode': 'cmssrv280.fnal.gov', 'VM_MAX_LIFETIME': '2600000', 'GLIDECLIENT_Rank': '1', 'GLIDEIN_Report_Failed': 'NEVER', 'MIN_DISK_GBS': 1, 'GLIDEIN_CCB': 'hepcsvc02.fnal.gov:9620-9730;hepcsvc03.fnal.gov:9620-9730', 'GLIDEIN_Max_Walltime': 2600000, 'VM_DISABLE_SHUTDOWN': 'False', 'CONDOR_ARCH': 'default', 'UPDATE_COLLECTOR_WITH_TCP': 'True', 'USE_CCB': 'True', 'USE_MATCH_AUTH': 'True', 'CONDOR_OS': 'default', 'GLIDEIN_Collector': 'cmssrv274.fnal.gov:9620-9730;cmssrv276.fnal.gov:9620-9730'}
[2016-11-14 10:15:46,135] DEBUG: glideFactoryLib:1430: submit_credentials.security_credentials: {'AuthFile': u'/var/lib/gwms-factory/client-proxies/user_cms_1/glidein_hepcloud_instance/credential_cmssrv279-fnal-gov_OSG_gWMSFrontend.cms_google_359869', 'GlideinProxy': u'/var/lib/gwms-factory/client-proxies/user_cms_1/glidein_hepcloud_instance/credential_cmssrv279-fnal-gov_OSG_gWMSFrontend.cms_google_463512_compressed'}
[2016-11-14 10:15:46,135] DEBUG: glideFactoryLib:1431: submit_credentials.identity_credentials: {'RemoteUsername': None, 'VMType': 'projects/fermilab-poc/zones/us-central1-a/machineTypes/custom-16-32768', 'VMId': 'projects/fermilab-poc/global/images/worker-6gb-2'}
[2016-11-14 10:15:46,135] DEBUG: glideFactoryLib:1479: Userdata ini file:
[glidein_startup]
[vm_properties]

[2016-11-14 10:15:46,135] DEBUG: glideFactoryLib:1481: Userdata ini file has been base64 encoded
[2016-11-14 10:15:46,851] DEBUG: glideFactoryLib:1195: ['Submitting job(s)..........', '10 job(s) submitted to cluster 111596.']
[2016-11-14 10:15:47,270] DEBUG: glideFactoryLib:1195: ['Submitting job(s)..........', '10 job(s) submitted to cluster 111444.']
[2016-11-14 10:15:47,271] INFO: glideFactoryLib:1230: Submitted 20 glideins to cmssrv280.fnal.gov: [(111596L, 0), (111596L, 1), (111596L, 2), (111596L, 3), (111596L, 4), (111596L, 5), (111596L, 6), (111596L, 7), (111596L, 8), (111596L, 9), (111444L, 0), (111444L, 1), (111444L, 2), (111444L, 3), (111444L, 4), (111444L, 5), (111444L, 6), (111444L, 7), (111444L, 8), (111444L, 9)]
[2016-11-14 10:15:47,271] INFO: glideFactoryEntry:1764: Submitted 20 glideins
[2016-11-14 10:15:48,194] INFO: glideFactoryEntry:679: Computing log_stats diff for Google_us_central1-a
[2016-11-14 10:15:48,200] INFO: glideFactoryEntry:681: log_stats diff computed
[2016-11-14 10:15:48,201] INFO: glideFactoryEntry:683: Writing log_stats for Google_us_central1-a
[2016-11-14 10:15:48,204] INFO: glideFactoryEntry:685: log_stats written
[2016-11-14 10:15:48,204] INFO: glideFactoryEntry:688: Writing qc_stats for Google_us_central1-a
[2016-11-14 10:15:48,206] INFO: glideFactoryEntry:690: qc_stats written
[2016-11-14 10:15:48,206] INFO: glideFactoryEntry:692: Writing rrd_stats for Google_us_central1-a
[2016-11-14 10:15:48,371] INFO: glideFactoryEntry:694: rrd_stats written
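
For anyone comparing several iterations, the relevant numbers can be pulled out of such a log with a short script. This is only a sketch: the regular expressions are written against the two message formats visible in the excerpt above, and the function name summarize is hypothetical.

```python
import re

# Patterns written against the log lines shown in this ticket; other factory
# versions may format these messages differently.
REQUEST_RE = re.compile(
    r"requesting (?P<requested>\d+) glideins, max running (?P<max_running>\d+)"
)
SUBMITTED_RE = re.compile(
    r"glideFactoryEntry:\d+: Submitted (?P<submitted>\d+) glideins"
)


def summarize(lines):
    """Collect requested, max running and submitted counts from log lines."""
    summary = {}
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            summary["requested"] = int(match.group("requested"))
            summary["max_running"] = int(match.group("max_running"))
        match = SUBMITTED_RE.search(line)
        if match:
            summary["submitted"] = int(match.group("submitted"))
    return summary


# Two lines copied from the log excerpt above.
LOG_LINES = [
    "[2016-11-14 10:15:45,963] INFO: glideFactoryLib:873: Client "
    "cmssrv279-fnal-gov_OSG_gWMSFrontend.cms_google (secid: hepcloudFE_cms) "
    "requesting 1352 glideins, max running 3482, remove excess 'NO'",
    "[2016-11-14 10:15:47,271] INFO: glideFactoryEntry:1764: Submitted 20 glideins",
]

summary = summarize(LOG_LINES)
print(summary)
```

In this iteration the frontend asked for 1352 idle glideins with a max running of 3482, and the entry submitted 20 more even though condor_q already showed over 1000 glideins for it.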

History

#1 Updated by Parag Mhashilkar almost 4 years ago

  • Target version changed from v3_2_17 to v3_2_18

#2 Updated by Parag Mhashilkar almost 4 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

#3 Updated by Marco Mambelli almost 4 years ago

  • Target version changed from v3_2_18 to v3_2_19

#4 Updated by HyunWoo Kim over 3 years ago

The GlideinTotals class obtains the entry_held, entry_idle and entry_running values and compares them against the limits from the configuration file.
entry_idle is determined by the following code:

sum_idle_count(qc_status)
if qc_status.has_key(1):  # Idle==Jobstatus(1)
    self.entry_idle = qc_status[1]

where sum_idle_count is defined as:

def sum_idle_count(qc_status):
    #   Idle==Jobstatus(1)
    #   Have to integrate all the variants
    qc_status[1] = 0
    for k in qc_status.keys():
        if (k >= 1000) and (k <= 1100):
            qc_status[1] += qc_status[k]
    return

The keys of qc_status are determined by the hash_status() function:

# Split idle depending on GridJobStatus
#   1001 : Unsubmitted
#   1002 : Submitted/Pending
#   1010 : Staging in
#   1100 : Other
#   4010 : Staging out
# All others just return the JobStatus
def hash_status(el):
    job_status = el["JobStatus"]
    if job_status == 1: # idle jobs, look at GridJobStatus
....skip
    elif job_status == 2:  # count only real running, all others become Other
        if el.has_key("GridJobStatus"):
            grid_status = str(el["GridJobStatus"]).upper()
            if   grid_status in ("ACTIVE", "REALLY-RUNNING", "INLRMS: R", "RUNNING", "INLRMS:R"):
                return 2
            elif grid_status in ("STAGE_OUT", "INLRMS: E", "EXECUTED", "FINISHING", "FINISHED", "DONE", "INLRMS:E"):
                return 4010
...skip

hash_status puts the "staging out" states under key 4010,
but sum_idle_count does not appear to count them.
My suggestion is to modify sum_idle_count() as follows:

def sum_idle_count(qc_status):
    qc_status[1] = 0
    for k in qc_status.keys():
        if (  (k >= 1000) and (k <= 1100)  ) or (k == 4010):
            qc_status[1] += qc_status[k]
    return

My guess is that this bug has always been there but did not manifest itself until now, simply because we did not see these "idle from staging out" glideins before the recent use of cloud resources.
Questions at this point are:
1. Was the GCE demo the first time where we observed this problem?
2. What will be the best way to test my solution?

#5 Updated by HyunWoo Kim over 3 years ago

  • Status changed from New to Resolved

I talked with Parag, who remembered that this ticket was created before the problem was solved by Doug Strain's debugging of pagination.
My finding above also turns out to be irrelevant to this problem.
We can close this ticket.

#6 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Resolved to Remission

#7 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Remission to Accepted

#8 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Accepted to Work in progress

#9 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Work in progress to Closed

#10 Updated by Marco Mambelli over 3 years ago

HyunWoo and Parag said that this has already been solved.
