Frontend group limits based on running cores rather than running glideins (condorg) jobs
From the logs below, the number of running glideins is much smaller than the limit of 1000. Either we are looking at the wrong information when requesting glideins, or the log line is misleading. Based on Steve Timm's observation, more glideins went through when the limit of 1000 was raised to a higher number.
# Frontend group logs
[2016-02-01 16:09:41,940] INFO: Total matching idle 642 (old 642) running 1358 limit 1000
[2016-02-01 20:29:20,214] INFO: Jobs in schedd queues | Glideins | Cores | Request
[2016-02-01 20:29:20,214] INFO: Idle (match eff old uniq ) Run ( here max ) | Total Idle Run Fail | Total Idle Run | Idle MaxRun Down Factory
[2016-02-01 20:29:20,214] INFO: 589( 589 516 589 589) 1372( 73 1000) | 687 73 687 0 | 614 74 540 | 1 86 Up FNAL_HEPCloud_1@gfactory_instance@firstname.lastname@example.org

# From the frontend config, <group ...><config> section
<running_glideins_per_entry max="1000" relative_to_queue="1.15"/>
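For reference, a minimal sketch of how the two attributes above can interact, assuming the frontend caps the per-entry running glideins at the smaller of `max` and `relative_to_queue` times the matching jobs in the queue (the function name and exact semantics are illustrative, not the actual GlideinWMS internals):

```python
def effective_limit(max_glideins, relative_to_queue, jobs_in_queue):
    """Illustrative per-entry limit: min of the absolute cap and the
    queue-relative cap. Not the real GlideinWMS implementation."""
    return min(max_glideins, int(relative_to_queue * jobs_in_queue))

# With the config above and the 589 matching idle jobs from the log:
print(effective_limit(1000, 1.15, 589))   # 677 -> queue-relative cap wins
print(effective_limit(1000, 1.15, 2000))  # 1000 -> absolute max wins
```

Under this reading, the limit actually in force can be well below 1000 when the queue is short, which is one more reason the logged "limit 1000" line can be confusing.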
#1 Updated by Marco Mambelli over 4 years ago
The counters have been reviewed and behave correctly now.
In the tests the max was respected.
Still, there are cases where the number of running jobs can bump above the maximum:
1. GLIDEIN_CPUS is not known (auto, slot) and the glideins end up on worker nodes statically partitioned into many 1-core slots.
2. pslot glideins are requested for 8-cpu (or 4-cpu) jobs on an entry. When those jobs end, the same slots are reused by 1-cpu jobs: each pslot splits into several dynamic slots and the number of running jobs bumps up.
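The second scenario can be sketched with a toy count (illustrative only, not GlideinWMS code): when a partitionable-slot glidein switches from one 8-cpu job to eight 1-cpu jobs, the glidein count is unchanged but the running-job count jumps 8x.

```python
def running_counts(glideins):
    """glideins: list of glideins, each given as the list of CPU widths
    of the jobs currently running on it. Returns (glideins, jobs)."""
    n_glideins = len(glideins)
    n_jobs = sum(len(jobs) for jobs in glideins)
    return n_glideins, n_jobs

# Phase 1: 100 pslot glideins (8 cores each), each running one 8-cpu job.
phase1 = [[8]] * 100
# Phase 2: the 8-cpu jobs end; each pslot splits into eight 1-core
# dynamic slots, each running a 1-cpu job.
phase2 = [[1] * 8] * 100

print(running_counts(phase1))  # (100, 100)
print(running_counts(phase2))  # (100, 800)
```

A limit enforced on running jobs would see 800 and report a bump above the maximum, while the number of running glideins (and cores) never changed.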