Feature #3884

Frontend not considering MAX_JOBS_RUNNING - requesting too many glideins

Added by Igor Sfiligoi over 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Urgent
Assignee: Parag Mhashilkar
Category: Frontend
Target version:
Start date: 05/15/2013
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS, OSG
Duration:
Description

The Frontend is not considering MAX_JOBS_RUNNING when deciding how many glideins to request.

So if one or more schedds hit this limit, the Frontend will continue requesting glideins for their idle jobs, even though those glideins are unlikely to be matched.

History

#1 Updated by Burt Holzman over 6 years ago

  • Target version set to v3_x

The schedd publishes both MaxJobsRunning and TotalRunningJobs in its classad. We could just add an extra condor_status call to frontendElement and filter the schedds that have reached the limit out of elementDescript.merged_data['JobSchedds'] when building the condorq_dict?

#2 Updated by Igor Sfiligoi over 6 years ago

Almost... the problem is doing it without ruining the monitoring.

#3 Updated by Igor Sfiligoi almost 6 years ago

I have recently been hit by the schedd hitting the TransferQueueMaxUploading limit, too.
While more transient, the result is the same... wasted CPU cycles in the glideins.

I will add checking for that as well.

#4 Updated by Igor Sfiligoi almost 6 years ago

  • Target version changed from v3_x to v3_2_x

I think I have found a way to get this working;
basically, I do not count the idle jobs from the schedds that have hit the limit for the purposes of ReqIdle.
Running jobs are not touched.
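
To illustrate the idea, here is a minimal sketch, not the actual Frontend code. The function name and the two dictionaries are hypothetical; only the attribute names MaxJobsRunning and TotalRunningJobs come from the schedd classad mentioned above. It assumes every published limit is a positive number.

```python
def count_idle_for_request(schedd_ads, idle_by_schedd):
    """Sum idle jobs, skipping schedds that already run MaxJobsRunning jobs.

    schedd_ads:     hypothetical dict, schedd name -> dict of classad attributes
    idle_by_schedd: hypothetical dict, schedd name -> number of idle jobs
    """
    total = 0
    for name, idle in idle_by_schedd.items():
        ad = schedd_ads.get(name, {})
        max_running = ad.get("MaxJobsRunning")
        running = ad.get("TotalRunningJobs", 0)
        # A schedd at its limit cannot start more jobs, so its idle
        # jobs should not drive new glidein requests.
        if max_running is not None and running >= max_running:
            continue
        total += idle
    return total


ads = {
    "schedd_a": {"MaxJobsRunning": 10, "TotalRunningJobs": 10},  # at the limit
    "schedd_b": {"MaxJobsRunning": 10, "TotalRunningJobs": 3},
}
idle = {"schedd_a": 5, "schedd_b": 7, "schedd_c": 2}  # schedd_c has no ad
print(count_idle_for_request(ads, idle))
```

Only the idle count is filtered; the running-job numbers used for monitoring stay as reported.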

#5 Updated by Igor Sfiligoi almost 6 years ago

  • Status changed from New to Feedback
  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar
  • Target version changed from v3_2_x to v3_2_5
  • Stakeholders updated (diff)

I have implemented the changes. Not much code changed, with only a couple of local additions.
Committed to v3/3884.

Unfortunately, it touches code that was also changed by #5579 and #5691, so I also created the branch
v3/5579_v2_5691_3884
which resolves the merge conflict there.

Please review.

#6 Updated by Parag Mhashilkar almost 6 years ago

  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

Sent feedback separately. Assigning back to Igor.

#7 Updated by Igor Sfiligoi almost 6 years ago

  • Status changed from Feedback to Resolved

After applying the changes requested by Parag,
I merged into both branch_v3_2 and master.

#8 Updated by Igor Sfiligoi over 5 years ago

  • Status changed from Resolved to Assigned

The code does not properly handle a max setting of 0, which means unlimited.

Re-opening the ticket.
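
A minimal sketch of the check in question, assuming a hypothetical helper (this is not the code that was committed): a max of 0, or a missing max, is treated as "no limit", so it can never count as reached.

```python
def limit_reached(current, maximum):
    """Return True only if a real (non-zero) limit has been reached.

    Hypothetical helper: a maximum of 0 or None is treated as unlimited.
    """
    if maximum is None or maximum == 0:
        return False
    return current >= maximum


print(limit_reached(5, 0))   # 0 means unlimited
print(limit_reached(5, 5))   # limit reached
print(limit_reached(4, 5))   # still below the limit
```

Without the explicit check for 0, the comparison current >= 0 is always true, so an unlimited schedd would wrongly be treated as saturated.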

#9 Updated by Igor Sfiligoi over 5 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar
  • Priority changed from Normal to Urgent

I have made the necessary fix in
v3/3884_v2 (branched from v3_2_5_rc1).

Please check and confirm I can merge it back.

#10 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Closed

I took care of this and released rc2. Please assign issues to someone else for the next 2 weeks while I am on vacation.


