Bug #8849

Do not count dagman/schedd univ jobs for max jobs running

Added by Parag Mhashilkar over 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 05/18/2015
Due date:
% Done: 90%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: FIFE
Duration:

Description

When the frontend does its accounting, GlideinWMS should do what HTCondor does and exclude certain types of jobs when counting max, running, held, and idle jobs.

Currently, for example, when dagman jobs are running or held, the frontend counts them against the respective limits, which prevents it from requesting more glideins.

History

#1 Updated by HyunWoo Kim over 4 years ago

  • Status changed from New to Assigned
  • % Done changed from 0 to 70

I spent some time trying to understand the frontend's behavior based on the original description of this issue,
but the information I could gather from the frontend log files did not show the same result.
So I contacted the person who originally observed the issue (Joe Boyd) and exchanged a couple of emails about it.
My conclusion from reading the emails from Joe and Kevin is that
what Joe observed was essentially a discrepancy in the total number of running jobs between the scheduler (which sets MAX_JOBS_RUNNING) and the frontend.
When there were no dagman jobs, he did not observe any discrepancy,
but when there were dagman jobs, he observed some discrepancy: for example, the frontend thought it had reached the limit on running jobs, so it did not even consider asking for glideins, while the scheduler was not really full.
There are two possible options at this point:
1. I can continue the investigation by focusing on the frontend's behavior with respect to the MAX_JOBS_RUNNING parameter and the presence of dagman jobs.
2. As Kevin told us, we can simply upgrade condor from 8.2.x to 8.3.2. Version 8.2.x counts jobs from all universes (including the scheduler universe, where the dagman jobs run) against the limit,
whereas in 8.3.2 the scheduler universe is not considered when adding up the running jobs.

#2 Updated by HyunWoo Kim over 4 years ago

  • % Done changed from 70 to 90

I now fully understand this ticket.

What Joe Boyd observed is as follows:
1. fifebatchgpvmhead1:/var/log/gwms-frontend/group_FNAL_nova/FNAL_nova.err.log-20150613.gz showed
"Schedd fifebatch2.fnal.gov hit maxrun limit, blacklisting: has 17461 running with max 17100"
2. But "condor_status -schedd -autof MaxJobsRunning TotalSchedulerJobsRunning TotalRunningJobs" showed that
the maximum (the limit) is reached only when TotalSchedulerJobsRunning and TotalRunningJobs are added together.
This made him/us think that the FE stops requesting glideins, even when there are new jobs in the schedd, only because this combined count has reached the maximum.
But I now think we were wrong.

My understanding of the relevant FE code (def identify_bad_schedds) is that the FE simply
adds TotalSchedulerJobsRunning and TotalRunningJobs, compares the sum with MaxJobsRunning,
and stops querying a schedd when the sum is near its configured maximum (MaxJobsRunning).
This is actually the correct policy for HTCondor version 8.2.x, where MaxJobsRunning is the limit for all types of jobs.
The meaning of MaxJobsRunning changed in HTCondor version 8.3.2, where MaxJobsRunning no longer includes jobs from the scheduler universe.

So, my conclusion is:
- The current code of def identify_bad_schedds in glideinFrontendElement.py does not need to be modified when used with HTCondor 8.2.x,
because it reacts appropriately to the behavior of the schedd.
- When GWMS is used with HTCondor version 8.3.2 or later, we need to modify def identify_bad_schedds
to compare only TotalRunningJobs with MaxJobsRunning.
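
The version-dependent rule in this conclusion can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names parse_condor_version and running_jobs_to_compare are hypothetical and are not part of glideinFrontendElement.py, and the 8.3.2 cutoff is simply the one quoted in this ticket.

```python
import re

# Cutoff taken from this ticket: from 8.3.2 on, MaxJobsRunning no longer
# covers scheduler universe jobs.
SCHEDULER_EXCLUDED_SINCE = (8, 3, 2)


def parse_condor_version(version_string):
    """Extract (major, minor, patch) from a '$CondorVersion: 8.4.7 ... $' string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("no version number in %r" % version_string)
    return tuple(int(part) for part in match.groups())


def running_jobs_to_compare(el, version_string):
    """Hypothetical helper: job count to compare against MaxJobsRunning."""
    current_run = el["TotalRunningJobs"]
    if parse_condor_version(version_string) < SCHEDULER_EXCLUDED_SINCE:
        # Older schedds count scheduler universe jobs against the limit.
        # Default to 0 when the attribute is absent, as the existing code does.
        current_run += el.get("TotalSchedulerJobsRunning", 0)
    return current_run
```

Comparing version tuples rather than strings matters here, since "8.2.10" would sort before "8.2.8" as text.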

#3 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_2_11 to v3_2_12

#4 Updated by HyunWoo Kim over 4 years ago

I was advised to search the standard Condor classad attributes for a possible candidate that can be used in our code to replace MaxJobsRunning,
but I could not find any other parameter or Scheduler classad attribute that can replace MaxJobsRunning.

We also need to know the following:
- Did Joe Boyd mean to use MaxJobsRunning to limit only the actual running jobs?
- Does the Condor Scheduler actually use MaxJobsRunning to limit Condor Scheduler performance?
- Another important factor here is that we usually use StartSchedulerUniverse = TotalSchedulerJobsRunning < 200
to limit TotalSchedulerJobsRunning, but in fact, on fifebatch1, Joe does not set a limit on it.

The Condor 8.2.7 manual recommends using both MaxJobsRunning and StartSchedulerUniverse (boolean = TotalSchedulerJobsRunning < 200)
to limit both the sum (of TotalRunningJobs and TotalSchedulerJobsRunning) and TotalSchedulerJobsRunning individually.

We first define a new parameter called, for example, MaxVanillaJobsRunning.
Outside of the code, the admins (or the Frontend install script) should define a new classad attribute MaxVanillaJobsRunning and advertise it to the Collector.
The value of MaxVanillaJobsRunning can be
MaxVanillaJobsRunning = MaxJobsRunning - ( 200 from StartSchedulerUniverse = TotalSchedulerJobsRunning < 200 )

Based on these, the Frontend code can be modified as follows:

  def identify_bad_schedds(self):
    max_run = int( el['MaxVanillaJobsRunning'] * 0.95 + 0.5 )
    current_run = el['TotalRunningJobs']
    # current_run += el.get('TotalSchedulerJobsRunning',0)

Note that we cannot use the MaxJobsRunning parameter because its definition differs across condor versions.
Note also that we cannot define the new parameter MaxVanillaJobsRunning inside the Frontend code because of this same condor version dependency.
Instead we need to take care of the version dependency outside the code, at configuration time.

But this can be adopted only after we are sure that we cannot find a standard Condor attribute that provides this information.

#5 Updated by HyunWoo Kim over 4 years ago

I now have a new idea to resolve this issue.
I will discuss it with Parag and proceed.

#6 Updated by HyunWoo Kim about 4 years ago

I talked with Parag. His idea is as follows.

First we need to survey how many sites have already upgraded to HTCondor 8.3 or later.
I did the survey using the condor_status command.
The result is that many sites are still using HTCondor 8.2 or earlier, although CMS appears to have upgraded.

At this point, Parag wants to wait a bit longer for people to upgrade their HTCondor
(he will change the Target version as we approach the deadline of v3_2_12).

For the moment, we have to consider 2 implications.
First, in the case of the current code with HTCondor 8.3, consider the following scenario:
- MaxJobsRunning (MJR) is set to 5,000
- The admin does not use StartSchedulerUniverse = TotalSchedulerJobsRunning < 200 to control the number of scheduler universe jobs
- Then there can be 4,999 scheduler universe jobs, and this scheduler can be blacklisted by GWMS before HTCondor suspends it based on MJR, which applies only to vanilla jobs in 8.3

Second, in the case of the modified code (in the future) with HTCondor 8.2:
more jobs can be running before a scheduler is blacklisted by GWMS.

Anyway, we are suspending this ticket for a while.
I will pick it up again when we have decided that enough sites have upgraded to HTCondor 8.3 or later and it is time to change the identify_bad_schedds method
in glideinFrontendElement.py.
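
The first scenario can be checked with a few lines of arithmetic. The sketch below reuses the 95% threshold from identify_bad_schedds and assumes that a schedd is blacklisted once the counted jobs reach that threshold. The function is_blacklisted and its count_scheduler_universe flag are hypothetical names used only for this illustration.

```python
def is_blacklisted(el, count_scheduler_universe):
    """Hypothetical stand-in for the check done in identify_bad_schedds."""
    # Stop a bit early, at 95% of the configured limit (rounded).
    max_run = int(el["MaxJobsRunning"] * 0.95 + 0.5)
    current_run = el["TotalRunningJobs"]
    if count_scheduler_universe:
        current_run += el.get("TotalSchedulerJobsRunning", 0)
    # Assumption: reaching the threshold is enough to blacklist.
    return current_run >= max_run


# Scenario from the text: MJR = 5,000, no vanilla jobs running,
# 4,999 scheduler universe jobs.
schedd = {
    "MaxJobsRunning": 5000,
    "TotalRunningJobs": 0,
    "TotalSchedulerJobsRunning": 4999,
}

print(is_blacklisted(schedd, count_scheduler_universe=True))   # current code: True
print(is_blacklisted(schedd, count_scheduler_universe=False))  # modified code: False
```

With the scheduler universe counted, the threshold of 4,750 is exceeded without a single vanilla job running, which is exactly the false blacklisting described in the first implication.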

#7 Updated by Parag Mhashilkar about 4 years ago

  • Target version changed from v3_2_12 to v3_2_13

#8 Updated by Parag Mhashilkar almost 4 years ago

  • Target version changed from v3_2_13 to v3_2_14

#9 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_14 to v3_2_15

#10 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_15 to v3_2_16

#11 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from HyunWoo Kim to Parag Mhashilkar

Current situation:

Most sites and users have already upgraded to condor version 8.4, or at least 8.3.
There, condor compares only TotalRunningJobs against MaxJobsRunning,
but the gwms code still compares MaxJobsRunning (that we set) against the sum of TotalRunningJobs and TotalSchedulerJobsRunning.
The implication of this situation is that
the frontend can blacklist a schedd if this schedd runs a lot of scheduler universe jobs
and the sum of scheduler universe jobs and regular vanilla jobs is close to MaxJobsRunning.

The change in the code is to comment out one line, "current_run += el.get('TotalSchedulerJobsRunning',0)", as shown below:

  def identify_bad_schedds(self):
    max_run = int( el['MaxJobsRunning'] * 0.95 + 0.5 )
    current_run = el['TotalRunningJobs']
    # current_run += el.get('TotalSchedulerJobsRunning',0)

We have to know how this change in the gwms code will affect those sites that are still using an old htcondor version (earlier than 8.3.2).

In HTCondor 8.2, running jobs from both the vanilla and scheduler universes count against the limit:
condor combines TotalRunningJobs and TotalSchedulerJobsRunning
and compares the sum against MaxJobsRunning (that we set),

but the new code compares MaxJobsRunning (that we set) against TotalRunningJobs only.

So it is possible that condor will stop using a specific scheduler
before the frontend blacklists it;
in other words, the frontend can keep using a scheduler even after condor has stopped using it.
Is this going to be a problem? Maybe not.

Here is the result of a survey that probes the htcondor version of the USER scheduler nodes
(because that is where condor compares MaxJobsRunning against either TotalRunningJobs alone or the sum of TotalRunningJobs and TotalSchedulerJobsRunning
in order to stop using a schedd).

command =  condor_status -schedd  -pool osg-ligo-1.t2.ucsd.edu -af CondorVersion
$CondorVersion: 8.4.7 May 27 2016 $
$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
$CondorVersion: 8.2.8 Apr 28 2015 $
$CondorVersion: 8.4.4 Feb 04 2016 $
$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $

command =  condor_status -schedd  -pool fifebatchhead3.fnal.gov -af CondorVersion
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.4.3 Dec 15 2015 $

command =  condor_status -schedd  -pool scatter.nanohub.org -af CondorVersion
$CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $

command =  condor_status -schedd  -pool glidein.unl.edu -af CondorVersion
$CondorVersion: 8.5.5 Jun 03 2016 $
$CondorVersion: 8.5.3 Apr 27 2016 $
$CondorVersion: 8.5.5 Jun 03 2016 $
$CondorVersion: 8.5.5 Jun 03 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.5.4 May 02 2016 $

command =  condor_status -schedd  -pool uclhc-fe-1.t2.ucsd.edu -af CondorVersion
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.4.3 Dec 15 2015 $

command =  condor_status -schedd  -pool glidein-collector.t2.ucsd.edu -af CondorVersion
$CondorVersion: 8.4.4 Feb 04 2016 $
$CondorVersion: 8.4.3 Dec 15 2015 $

command =  condor_status -schedd  -pool vocms032.cern.ch -af CondorVersion
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.4.6 Apr 21 2016 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $

command =  condor_status -schedd  -pool 134.174.140.230 -af CondorVersion
$CondorVersion: 8.2.10 Oct 27 2015 $

command =  condor_status -schedd  -pool osg-flock.grid.iu.edu -af CondorVersion
$CondorVersion: 8.4.7 Jun 03 2016 $
$CondorVersion: 8.4.3 Jan 22 2016 BuildID: RH-8.4.3-1.el6 $
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.2.10 Oct 27 2015 $
$CondorVersion: 8.2.10 Oct 27 2015 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.2.10 Oct 27 2015 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.2.10 Oct 27 2015 $
$CondorVersion: 8.4.6 Apr 21 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.6 Apr 21 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.2.10 Oct 27 2015 $
$CondorVersion: 8.4.7 Jun 03 2016 $

command =  condor_status -schedd  -pool cmssrv221.fnal.gov -af CondorVersion
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.4.6 Apr 21 2016 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $
$CondorVersion: 8.5.4 May 02 2016 BuildID: 365871 $

command =  condor_status -schedd  -pool gremlin.phys.uconn.edu -af CondorVersion
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $
$CondorVersion: 8.4.8 Jul 01 2016 $

command =  condor_status -schedd  -pool glidein2.chtc.wisc.edu -af CondorVersion
$CondorVersion: 8.5.6 Jul 15 2016 BuildID: 375046 $
$CondorVersion: 8.5.1 Dec 14 2015 BuildID: 352064 $
$CondorVersion: 8.2.8 Apr 08 2015 BuildID: 313322 $
$CondorVersion: 8.2.10 Oct 21 2015 BuildID: 345812 $
$CondorVersion: 8.5.5 Jun 03 2016 BuildID: 369308 $
$CondorVersion: 8.5.6 Jul 15 2016 BuildID: 375046 $
$CondorVersion: 8.2.8 Apr 08 2015 BuildID: 313322 $
$CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $
$CondorVersion: 8.5.6 Jul 15 2016 BuildID: 375046 $
$CondorVersion: 8.5.5 May 03 2016 BuildID: 366162 $
$CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $
$CondorVersion: 8.5.6 Jul 15 2016 BuildID: 375046 $

command =  condor_status -schedd  -pool fifebatchhead4.fnal.gov -af CondorVersion
$CondorVersion: 8.4.3 Dec 15 2015 $
$CondorVersion: 8.4.3 Dec 15 2015 $

In summary, these 5 pool collector nodes are associated with scheduler nodes that are still using htcondor version 8.2.x or earlier
(each entry is: pool, number of schedds, number of schedds on 8.2.x or earlier):
 (u'osg-ligo-1.t2.ucsd.edu', 5, 1),
 (u'134.174.140.230', 1, 1),
 (u'scatter.nanohub.org', 1, 1),
 (u'osg-flock.grid.iu.edu', 19, 5),
 (u'glidein2.chtc.wisc.edu', 12, 3),

and in the rest, all schedds have upgraded to 8.3 or later (mostly 8.4 or 8.5):
 (u'fifebatchhead3.fnal.gov', 2, 0),
 (u'glidein.unl.edu', 6, 0),
 (u'uclhc-fe-1.t2.ucsd.edu', 6, 0),
 (u'glidein-collector.t2.ucsd.edu', 2, 0),
 (u'vocms032.cern.ch', 36, 0),
 (u'cmssrv221.fnal.gov', 36, 0),
 (u'gremlin.phys.uconn.edu', 12, 0),
 (u'fifebatchhead4.fnal.gov', 2, 0),
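
Tallies of the form (pool, total schedds, old schedds) could be produced from the condor_status output with a short script along these lines. This is a sketch, not the script actually used for the survey: the function name count_old_schedds is hypothetical, and the cutoff of 8.3.2 is the version quoted earlier in this ticket.

```python
import re


def count_old_schedds(version_lines, cutoff=(8, 3, 2)):
    """Return (total, old) for a list of '$CondorVersion: ... $' lines."""
    total = 0
    old = 0
    for line in version_lines:
        match = re.search(r"\$CondorVersion: (\d+)\.(\d+)\.(\d+)", line)
        if match is None:
            continue  # skip anything that is not a version line
        total += 1
        # Compare numerically, so that 8.2.10 is not mistaken for < 8.2.8.
        if tuple(int(part) for part in match.groups()) < cutoff:
            old += 1
    return total, old


# Output copied from the osg-ligo-1.t2.ucsd.edu query above.
osg_ligo = [
    "$CondorVersion: 8.4.7 May 27 2016 $",
    "$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $",
    "$CondorVersion: 8.2.8 Apr 28 2015 $",
    "$CondorVersion: 8.4.4 Feb 04 2016 $",
    "$CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $",
]

print(("osg-ligo-1.t2.ucsd.edu",) + count_old_schedds(osg_ligo))
# prints ('osg-ligo-1.t2.ucsd.edu', 5, 1)
```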

#12 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

Looks ok to merge.

#13 Updated by HyunWoo Kim over 3 years ago

  • Status changed from Feedback to Closed

Merged into branch_v3_2.
Closing.


