Milestone #4990

Frontend scalability

Added by Burt Holzman almost 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: Igor Sfiligoi
Category: Frontend
Target version:
Start date: 11/21/2013
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS, OSG
Duration:
Description

This is for tracking the milestones from the Oct-2013 stakeholders' meeting.

The number of queries to the condor schedulers scales (at least linearly) with the number of frontend groups; for busy schedulers these queries are very expensive.

History

#1 Updated by Burt Holzman almost 6 years ago

  • Target version changed from v3_2_x to v3_2_4

#2 Updated by Burt Holzman almost 6 years ago

  • Tracker changed from Feature to Milestone

#3 Updated by Igor Sfiligoi almost 6 years ago

  • Assignee changed from Burt Holzman to Igor Sfiligoi

I plan to start working on this in mid-January.

#4 Updated by Igor Sfiligoi almost 6 years ago

The proposed course of action is conceptually similar to that of the factory;
we semi-serialize the access to the condor_q's, with at most N running in parallel.
This would be achieved by forking "a frontendElementOne" at each cycle from the main process, and keeping at most N alive at any point in time.

This will semi-serialize also all the rest of the processing, but that's a good thing, since the other resources (memory and CPU) on the FE are finite as well.
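
A minimal sketch of the bounded-fork idea described above, assuming one child process per frontend group. The names `run_groups_bounded`, `work_fn`, and `max_parallel` are hypothetical and are not taken from the glideinWMS code; `work_fn` stands in for whatever a per-group element would do (condor_q queries and the rest of the processing).

```python
import os


def run_groups_bounded(group_names, work_fn, max_parallel):
    """Fork one child per group, keeping at most max_parallel alive.

    Returns a dict mapping each group name to its child's exit code.
    All names here are illustrative, not the real frontend API.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")

    running = {}     # pid -> group name, for children still alive
    exit_codes = {}  # group name -> exit code, for finished children

    def reap_one():
        # Block until one of our children finishes and record its result.
        while True:
            pid, status = os.wait()
            name = running.pop(pid, None)
            if name is not None:
                break  # otherwise it was an unrelated child; keep waiting
        if os.WIFEXITED(status):
            exit_codes[name] = os.WEXITSTATUS(status)
        else:
            exit_codes[name] = -os.WTERMSIG(status)

    for name in group_names:
        # Throttle: wait for a free slot before forking the next child.
        while len(running) >= max_parallel:
            reap_one()
        pid = os.fork()
        if pid == 0:
            # Child: process one group, then exit immediately so that
            # none of the parent's cleanup code runs in the child.
            code = 1
            try:
                work_fn(name)
                code = 0
            finally:
                os._exit(code)
        running[pid] = name

    # Drain the children that are still running.
    while running:
        reap_one()
    return exit_codes
```

With `max_parallel` set to N, no more than N children (and hence no more than N sets of scheduler queries) are in flight at once, and the same limit bounds the memory and CPU used by the rest of the per-group processing.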

The only trick needed is to preserve the history object. I plan to do this by committing the changes to disk. This will also have the advantage of surviving restarts, and it should not add significant load to the system.
BTW: I did not have the time to carefully read through all the code, but there does not seem to be anything config-specific in it, so a reconfig should not affect its semantics.
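
One way to keep a history object across forked children and restarts is to load it at the start of each cycle and write it back at the end. The sketch below assumes the history is a plain picklable dict and uses a hypothetical file path; `load_history` and `save_history` are illustrative names, not the functions in the actual patch.

```python
import os
import pickle
import tempfile


def load_history(path):
    """Return the saved history dict, or an empty one if no file exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {}


def save_history(path, history):
    """Write the history to a temp file, then rename it over the target.

    The rename is atomic on POSIX, so a crash mid-write leaves the
    previously saved file in place instead of a truncated one.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(history, f)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if dumping or renaming failed.
        os.unlink(tmp_path)
        raise
```

Each forked child would call `load_history` before doing its work and `save_history` afterwards; the cost is one small file read and write per group per cycle.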

Let me know if you object and/or have a better idea.

#5 Updated by Igor Sfiligoi over 5 years ago

  • Category changed from Factory to Frontend
  • Status changed from Assigned to Feedback
  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar

I have implemented the above idea and committed it to branch v3/4990.

Please review and let me know if I can merge it back to master and v3.

#6 Updated by Igor Sfiligoi over 5 years ago

I have created a new branch
v3/4990_v2
that is branched from the latest branch_v3_2 as of today.

Had to fix a bunch of merge conflicts due to #3967.
Still need to properly test it.

#7 Updated by Parag Mhashilkar over 5 years ago

Let me know when you are done with the testing so I can review/merge it.

#8 Updated by Igor Sfiligoi over 5 years ago

The merge looks OK, please review.

#9 Updated by Parag Mhashilkar over 5 years ago

Sent feedback to Igor separately.

#10 Updated by Parag Mhashilkar over 5 years ago

  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

Reviewed. Igor made the changes as requested. OK to merge once the last commit is moved to v3/4990_v2.

#11 Updated by Igor Sfiligoi over 5 years ago

  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar

Parag noticed that I had disabled/obsoleted the "crashing too often" semantics in my original patch.
While I do not find that feature too useful, it was really not appropriate to just deprecate it as part of this patch.

So I added back the functionality.

The updated code can be found in v3/4990_v3.
Disclaimer: I think it should work, but have not had the chance to test it on a live system.

Please review.

#12 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Closed
  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

