Bug #1279

running_glideins_per_entry not doing what it sounds like

Added by Derek Weitzel over 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee: Douglas Strain
Category: -
Start date: 05/02/2011
% Done: 100%

Description

The element running_glideins_per_entry doesn't seem to be doing what it sounds like. It has an attribute, max, that sounds like it should limit the number of glideins running per entry.

But what it is doing is limiting the number of running jobs in the group before it stops asking for more glideins.

For example we have:

<running_glideins_per_entry max="1000" relative_to_queue="1.15"/>

And in our logs:

[2011-04-29T12:13:13-05:00 4071742] Jobs found total 3121 idle 2086 (old 2075) running 1035
[2011-04-29T12:13:13-05:00 4071742] Glideins found total 928 idle 0 running 928
[2011-04-29T12:13:13-05:00 4071742] Using 1 proxies
[2011-04-29T12:13:13-05:00 4071742] Match
[2011-04-29T12:13:13-05:00 4071742] Total matching idle 2086 (old 2075) running 1035 limit 1000
[2011-04-29T12:13:13-05:00 4071742]             Jobs in schedd queues                 |      Glideins     |   Request   
[2011-04-29T12:13:13-05:00 4071742] Idle (match  eff   old  uniq )  Run ( here  max ) | Total Idle   Run  | Idle MaxRun Down Factory
[2011-04-29T12:13:13-05:00 4071742]  2086( 2086  2086  2075  2086)  1035(  780  1000) |   780     0   780 |     0  3590 Up   CMS_T2_US_Purdue_rossmann@Production_v4_0@UCSD@glidein-1.t2.ucsd.edu
[2011-04-29T12:13:13-05:00 4071742]             Jobs in schedd queues                 |      Glideins     |   Request   
[2011-04-29T12:13:13-05:00 4071742] Idle (match  eff   old  uniq )  Run ( here  max ) | Total Idle   Run  | Idle MaxRun Down Factory
[2011-04-29T12:13:13-05:00 4071742]  2086( 2086  2086  2075  2086)  1035(  780  1000) |   780     0   780 |     0  3590 Up   Sum of useful factories
[2011-04-29T12:13:13-05:00 4071742]     0(    0     0     0     0)     0(    0     0) |     0     0     0 |     0     0 Down Sum of down factories
[2011-04-29T12:13:13-05:00 4071742]     0(    0     0     0     0)     0(    0     0) |     0     0     0 |     0     0 Down Unmatched
[2011-04-29T12:13:13-05:00 4071742] Advertizing 1 requests

This seems to indicate that because more than 1000 jobs matching the query expression are running (1035 here), the frontend won't request any idle glideins, even though only 780 of those jobs are actually running on glideinwms glideins.
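To make the numbers concrete, here is a minimal sketch that plugs the values from the log above into the two possible readings of max. The function and variable names are hypothetical and are not taken from the frontend code; only the numbers come from the log.

```python
def limit_reached(running_count, max_running):
    """Return True when the cap is hit and no more glideins would be requested."""
    return running_count >= max_running


MAX_RUNNING = 1000            # max attribute of running_glideins_per_entry
matching_running_jobs = 1035  # "running 1035" on the "Total matching" log line
running_on_glideins = 780     # "here" column / glideins "Run" column in the log

# Reading 1 (observed behavior): cap on all running jobs that match.
capped_by_jobs = limit_reached(matching_running_jobs, MAX_RUNNING)

# Reading 2 (what the name suggests): cap on running glideins for the entry.
capped_by_glideins = limit_reached(running_on_glideins, MAX_RUNNING)

print("capped when counting matching jobs:", capped_by_jobs)
print("capped when counting glideins:", capped_by_glideins)
```

Under the first reading the cap is already reached, which matches the request of 0 idle glideins in the log; under the second there would still be room for 220 more glideins.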

-Derek

glideinFrontendElement.py.correctrunning.patch (716 Bytes) Proposed patch. Derek Weitzel, 05/20/2011 01:40 PM

History

#1 Updated by Derek Weitzel over 8 years ago

Note: the bug is somewhere in this logic, at line 290 of glideinFrontendElement.py.

        count_jobs={}     # straight match
        prop_jobs={}      # proportional subset for this entry
        hereonly_jobs={}  # can only run on this site
        for dt in condorq_dict_types.keys():
            count_jobs[dt]=condorq_dict_types[dt]['count'][glideid]
            prop_jobs[dt]=condorq_dict_types[dt]['prop'][glideid]
            hereonly_jobs[dt]=condorq_dict_types[dt]['hereonly'][glideid]

        count_status={}
        for dt in status_dict_types.keys():
            status_dict_types[dt]['client_dict']=glideinFrontendLib.getClientCondorStatus(status_dict_types[dt]['dict'],frontend_name,group_name,request_name)
            count_status[dt]=glideinFrontendLib.countCondorStatus(status_dict_types[dt]['client_dict'])

        # effective idle is how much more we need
        # if there are idle slots, subtract them, they should match soon
        effective_idle=prop_jobs['Idle']-count_status['Idle']
        if effective_idle<0:
            effective_idle=0

        if total_running>=max_running:
            # have all the running jobs I wanted
            glidein_min_idle=0

#2 Updated by Derek Weitzel over 8 years ago

Proposing the patch. It simply takes the value the frontend already logs and uses that for total_running. I believe this means jobs running at a site according to condor_status, rather than condor_q.

#3 Updated by Dennis Box over 8 years ago

  • Assignee changed from Derek Weitzel to Douglas Strain

#4 Updated by Douglas Strain over 8 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 50 to 100

Committing Derek's patch. The max running per entry was using the number of running jobs to cap glidein submission; it should use the number of running glideins that match the expression.

This bug should only affect pools that have non-glidein resources.
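The intended behavior can be sketched as follows. This is not the committed patch, which is not reproduced in this ticket; compute_min_idle and its parameters are hypothetical names, and any other limits the frontend may apply are omitted. The sketch assumes the cap is compared against running glideins rather than running jobs.

```python
def compute_min_idle(prop_idle_jobs, idle_glideins, running_glideins, max_running):
    """Hypothetical helper: how many idle glideins to request for one entry."""
    # Effective idle is how much more we need; idle glideins should match soon,
    # so subtract them and clamp at zero.
    effective_idle = max(prop_idle_jobs - idle_glideins, 0)

    # Assumption: the per-entry cap counts running glideins, not running jobs.
    if running_glideins >= max_running:
        return 0
    return effective_idle


# Values from the log in the description: 2086 idle jobs, 0 idle glideins.
# Counting the 780 running glideins leaves the entry below the cap of 1000.
with_glidein_count = compute_min_idle(2086, 0, 780, 1000)

# Counting all 1035 matching running jobs (the old behavior) hits the cap.
with_job_count = compute_min_idle(2086, 0, 1035, 1000)

print(with_glidein_count, with_job_count)
```

In a pool where every running job sits on a glidein, the two counts coincide and both calls give the same answer, which is consistent with the bug only showing up when non-glidein resources are present.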

#5 Updated by Douglas Strain about 8 years ago

  • Target version set to v2_5_3

