Bug #11521

Negative running Core count

Added by Parag Mhashilkar over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: High
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 01/27/2016
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Burt noticed a negative count (-6) for number of running cores in the fifebatch frontend.

[2016-01-27 14:03:25,244] INFO: Iteration at Wed Jan 27 14:03:25 2016
[2016-01-27 14:03:25,245] INFO: Querying schedd, entry, and glidein status using child processes.
[2016-01-27 14:03:30,420] INFO: All children terminated
[2016-01-27 14:03:30,421] INFO: Jobs found total 53 idle 0 (good 0, old 0, grid 0, voms 0) running 53
[2016-01-27 14:03:30,422] INFO: Group glideins found total 12 limit 1000 curb 900; of these idle 3 limit 500 curb 200 running 12
[2016-01-27 14:03:30,422] INFO: Frontend glideins found total 6812 limit 100000 curb 90000; of these idle 335 limit 1000 curb 200
[2016-01-27 14:03:30,423] INFO: Overall slots found total 13515 limit 100000 curb 90000; of these idle 675 limit 1000 curb 200
[2016-01-27 14:03:30,423] INFO: Updating usermap
[2016-01-27 14:03:30,423] INFO: Match
[2016-01-27 14:03:30,427] INFO: Active forks = 3, Forks to finish = 8
[2016-01-27 14:03:30,441] INFO: Active forks = 3, Forks to finish = 6
[2016-01-27 14:03:30,458] INFO: Active forks = 3, Forks to finish = 3
[2016-01-27 14:03:30,469] INFO: Active forks = 0, Forks to finish = 0
[2016-01-27 14:03:30,470] INFO: All children terminated
[2016-01-27 14:03:30,470] INFO: Total matching idle 0 (old 0) running 53 limit 1000
[2016-01-27 14:03:30,471] INFO:             Jobs in schedd queues                 |         Glideins        |       Cores       |    Request
[2016-01-27 14:03:30,471] INFO: Idle (match  eff   old  uniq )  Run ( here  max ) | Total  Idle   Run  Fail | Total  Idle   Run | Idle MaxRun Down Factory
[2016-01-27 14:03:30,471] INFO:     0(    0     0     0     0)    53(    3  1000) |    12     3    12     0 |     9    15    -6 |     0     3 Up   FNAL_HEPCloud_1@gfactory_instance@gfactory_service@cmsgwms-factory.fnal.gov
[2016-01-27 14:03:30,501] INFO:             Jobs in schedd queues                 |         Glideins        |       Cores       |    Request
[2016-01-27 14:03:30,502] INFO: Idle (match  eff   old  uniq )  Run ( here  max ) | Total  Idle   Run  Fail | Total  Idle   Run | Idle MaxRun Down Factory
[2016-01-27 14:03:30,502] INFO:     0(    0     0     0     0)    53(    3  1000) |    12     3    12     0 |     9    15    -6 |     0     3 Up   Sum of useful factories
[2016-01-27 14:03:30,502] INFO:     0(    0     0     0     0)     0(    0     0) |     0     0     0     0 |     0     0     0 |     0     0 Down Sum of down factories
[2016-01-27 14:03:30,502] INFO:     0(    0     0     0     0)     0(    0     0) |     0     0     0     0 |     0     0     0 |     0     0 Down Unmatched
[2016-01-27 14:03:30,533] INFO: Advertising global and singular requests for factory cmsgwms-factory.fnal.gov
[2016-01-27 14:03:30,600] INFO: Advertising 1 glideresource classads to the user pool
[2016-01-27 14:03:30,603] INFO: There are 1 classads to advertise
[2016-01-27 14:03:30,694] INFO: Done advertising
[2016-01-27 14:03:30,695] INFO: iterate_one status: None
[2016-01-27 14:03:30,695] INFO: Writing stats

fifebatchgpvmhead1.fnal.gov (330 KB) Parag Mhashilkar, 01/28/2016 03:56 PM
fifebatchgpvmhead2.fnal.gov (363 KB) Parag Mhashilkar, 01/28/2016 03:56 PM

Related issues

Related to GlideinWMS - Bug #11645: GlideinWMS not submitting enough pilots to multicore sites (Closed, 02/05/2016)

Related to GlideinWMS - Bug #11145: Glideins are not submitted because of errors in the counting of total, idle and running jobs when partitionalble slots are involved (Closed, 12/14/2015)

History

#1 Updated by Parag Mhashilkar over 4 years ago

  • Priority changed from Normal to High

#2 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

I suspect this is coming from the changes made to accounting in glideinFrontendLib.py in December.

#3 Updated by Parag Mhashilkar over 4 years ago

Attaching frontend logs for both frontends used by fife.

#4 Updated by Marco Mambelli over 4 years ago

Work is in branch v3/11521.
The problem was that some of the attributes used (such as TotalCpus) were not in the projection from condor_status, so they were evaluated as 0.
Running CPUs for partitionable slots were computed as TotalCpus-Cpus, which therefore came out negative.
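
A minimal sketch of how a missing projection attribute can turn the count negative. It assumes slot ClassAds are plain dictionaries and that an absent attribute is read as 0; `running_cores` is a hypothetical helper for illustration, not the frontend code, and the numbers are made up:

```python
def running_cores(slot_ad):
    """Running cores of a partitionable slot, computed as TotalCpus - Cpus.

    Attributes missing from the ad are read as 0 (assumption), mimicking an
    attribute that was left out of the condor_status projection.
    """
    return slot_ad.get("TotalCpus", 0) - slot_ad.get("Cpus", 0)


# Ad with both attributes: 8 cores in the slot, 4 still unclaimed.
full_ad = {"TotalCpus": 8, "Cpus": 4}
# Same slot, but TotalCpus was not part of the projection.
projected_ad = {"Cpus": 4}

print(running_cores(full_ad))       # 4
print(running_cores(projected_ad))  # -4
```

Requiring the attribute (for example `slot_ad["TotalCpus"]`) instead of defaulting to 0 would surface the missing projection as an error rather than as a wrong count.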

This also highlighted some other problems:
1. TotalCpus counts all the CPUs in the node (NUM_CPUS), which is incorrect if the node has more than one slot; TotalSlotCpus is used instead.
2. SlotType was also missing from the projection, so Dynamic slots were sometimes counted.
3. "All slots" also includes the Dynamic ones, which should be skipped so as not to over-count the cores; getAllCondorStatus filters them out.
4. The counters of all glideins exclude dynamic slots (which are created by condor and are not glideins).
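
The points above can be sketched as a core counter that uses TotalSlotCpus for partitionable slots and skips Dynamic slots. This is an illustration under stated assumptions, not the code in the branch: `count_cores` is a hypothetical name, the ads are plain dictionaries, and a static slot is treated as idle only when its State is "Unclaimed".

```python
def count_cores(slot_ads):
    """Count total, idle and running cores over a list of slot ads."""
    total = idle = 0
    for ad in slot_ads:
        # Indexing (not .get) so that an attribute missing from the
        # projection raises KeyError instead of silently miscounting.
        slot_type = ad["SlotType"]
        if slot_type == "Dynamic":
            # Dynamic slots are carved out of a partitionable slot; their
            # cores are already included in the parent's TotalSlotCpus.
            continue
        if slot_type == "Partitionable":
            total += ad["TotalSlotCpus"]  # cores owned by this slot
            idle += ad["Cpus"]            # cores not yet handed out
        else:  # static slot
            total += ad["Cpus"]
            if ad["State"] == "Unclaimed":
                idle += ad["Cpus"]
    return {"total": total, "idle": idle, "running": total - idle}


slots = [
    {"SlotType": "Partitionable", "TotalSlotCpus": 8, "Cpus": 2},
    {"SlotType": "Dynamic", "Cpus": 4, "State": "Claimed"},
    {"SlotType": "Dynamic", "Cpus": 2, "State": "Claimed"},
    {"SlotType": "Static", "Cpus": 1, "State": "Claimed"},
]
print(count_cores(slots))  # {'total': 9, 'idle': 2, 'running': 7}
```

Counting the two Dynamic slots as well would add their 6 cores a second time, which is the over-count described in item 3.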

getAllCondorStatus() counts all static and partitionable glideins (one count each), including the ones in benchmarking state (so total may be > idle+running).
Should benchmarking glideins be skipped from the counting?
A single partitionable glidein can be idle (if not all CPUs are used) and running (if some CPUs are used) at the same time, so total may be < idle+running.
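
A small sketch of why the glidein totals need not add up. `count_glideins` is a hypothetical helper, and treating a benchmarking glidein as neither idle nor running is an assumption made only to match the description above:

```python
def count_glideins(slot_ads):
    """Count glideins (one per static or partitionable slot)."""
    total = idle = running = 0
    for ad in slot_ads:
        if ad["SlotType"] == "Dynamic":
            continue  # dynamic slots are not glideins
        total += 1
        if ad["SlotType"] == "Partitionable":
            free = ad["Cpus"]
            used = ad["TotalSlotCpus"] - free
            # A partially used glidein is both idle and running.
            if free > 0:
                idle += 1
            if used > 0:
                running += 1
        elif ad.get("Activity") == "Benchmarking":
            pass  # counted in total only (assumption)
        elif ad["State"] == "Claimed":
            running += 1
        elif ad["State"] == "Unclaimed":
            idle += 1
    return {"total": total, "idle": idle, "running": running}


partial = [{"SlotType": "Partitionable", "TotalSlotCpus": 8, "Cpus": 2}]
benchmarking = [{"SlotType": "Static", "Cpus": 1,
                 "State": "Unclaimed", "Activity": "Benchmarking"}]

print(count_glideins(partial))       # total 1 < idle 1 + running 1
print(count_glideins(benchmarking))  # total 1 > idle 0 + running 0
```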

getRunningCondorStatus() - returns all glideins that are running one or more jobs (static slots, and partitionable slots that have given up some CPUs to dynamic slots)
getRunningJobsCondorStatus() - returns all slots in running status (static and dynamic)

At the moment getRunningJobsCondorStatus() is never used.

There seems to be some duplication in the counting functions:
- getIdleCoresCondorStatus() and getIdleCondorStatus() are the same
- getRunningCoresCondorStatus() and getRunningCondorStatus() are the same
- in glideinFrontendElement:
  - get_condor_status() fetches a condor_status and does some counting
  - subprocess_count_glidein() has req_dict_types, which is set to the counters
  - populate_status_dict_types() does the same
Hopefully the last two use the data stored in the query.

#5 Updated by Marco Mambelli over 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar
  • Occurs In v3_2_12, v3_2_12_1 added

#6 Updated by Parag Mhashilkar over 4 years ago

  • Related to Bug #11645: GlideinWMS not submitting enough pilots to multicore sites added

#7 Updated by Parag Mhashilkar over 4 years ago

  • Related to Bug #11145: Glideins are not submitted because of errors in the counting of total, idle and running jobs when partitionalble slots are involved added

#8 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Feedback to Resolved

Issues #11145, #11521, #11580 and #11645 are addressed in the branch v3/pslot-accounting-review.
Changes have been merged to branch_v3_2.

#9 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed
