Feature #2975

Feature #2905: Refactor: multiple entry points per factory process

Refactor: multiple entry points per factory process (v2plus)

Added by Parag Mhashilkar about 8 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee: Parag Mhashilkar
Category: Factory
Target version:
Start date: 09/28/2012
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

  • Implement a scaled-down version of the approach listed in #2905 for v2plus. This will reduce the refactoring required.
  • Use a message-passing mechanism between processes, similar to the one used in the frontend.
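A minimal sketch of the kind of parent/child message passing described above, assuming a fork-and-pipe design in which each child pickles its result back to the parent. The helper names (`run_in_child`, `collect`, `entry_work`) are hypothetical and are not taken from the glideinWMS code base.

```python
import os
import pickle


def run_in_child(work, *args):
    """Fork a child, run work(*args) there, and return (pid, read_fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute the result and send it back pickled over the pipe.
        os.close(read_fd)
        try:
            payload = pickle.dumps(work(*args))
            with os.fdopen(write_fd, "wb") as pipe_out:
                pipe_out.write(payload)
        finally:
            os._exit(0)  # never fall back into the parent's code path
    # Parent: keep only the read end of the pipe.
    os.close(write_fd)
    return pid, read_fd


def collect(pid, read_fd):
    """Read the child's pickled result until EOF, then reap the child."""
    with os.fdopen(read_fd, "rb") as pipe_in:
        data = pipe_in.read()
    os.waitpid(pid, 0)
    return pickle.loads(data)


def entry_work(entry_name):
    # Hypothetical stand-in for the work done on behalf of one entry.
    return {"entry": entry_name, "pid": os.getpid()}
```

For example, `collect(*run_in_child(entry_work, "entry_a"))` returns the dictionary built in the child. If the child fails before writing, the parent sees an empty payload and `pickle.loads` raises, so real code would need explicit error handling.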

History

#1 Updated by Parag Mhashilkar almost 8 years ago

Completed first pass on the work as of yesterday.
Changes in branch_v2plus_2975

The code still needs a few minor changes and some polishing.

TODO:

  • Implement a wait threshold so that the group does not wait indefinitely for its entries. Once the threshold is exceeded, the entry child process should be killed. This raises the following concerns:
    • What should happen if the process is in the middle of doing work and is genuinely busy?
    • What happens to monitoring and new classad submission during that time?
  • Address some of the TODOs and comments in the code
  • Do more testing on the monitoring
  • Have the entrygroup read the entries info itself rather than receiving it on the command line, to avoid exceeding the command-line argument limit
  • Test with multiple frontends
  • When loading pickled entry info for log_stats we have to create and reload some of the objects. Can we speed this up further?
  • Clean up unused/old code from the glideinFactoryEntry and glideinFactoryEntryGroup
  • Gracefully handle the case when no frontends have any glideclient classad for a particular entry. Factory can be smarter in managing fewer child processes but still publish glidefactory classads.
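The wait threshold from the first TODO item could be sketched as follows, assuming each entry child reports back over a pipe and the parent uses `select` with a single overall deadline. The function names and the use of SIGKILL are assumptions for illustration. The sketch does not answer the open question of what to do with a child that is genuinely busy; it simply kills whatever has not finished in time.

```python
import os
import select
import signal
import time


def spawn(work):
    """Fork a child that writes the bytes returned by work() to a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        try:
            os.write(write_fd, work())
        finally:
            os._exit(0)
    os.close(write_fd)
    return pid, read_fd


def wait_for_children(children, timeout_s):
    """children maps pid -> read_fd. Returns (finished, killed)."""
    deadline = time.monotonic() + timeout_s
    pending = dict(children)
    buffers = {pid: b"" for pid in pending}
    finished, killed = {}, []
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # threshold reached; stop waiting
        fd_to_pid = {fd: pid for pid, fd in pending.items()}
        ready, _, _ = select.select(list(fd_to_pid), [], [], remaining)
        for fd in ready:
            pid = fd_to_pid[fd]
            chunk = os.read(fd, 65536)
            if chunk:
                buffers[pid] += chunk
            else:
                # EOF: the child closed its end, so it is done.
                os.close(fd)
                os.waitpid(pid, 0)
                finished[pid] = buffers[pid]
                del pending[pid]
    # Anything still pending exceeded the threshold: kill and reap it.
    for pid, fd in pending.items():
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
        os.close(fd)
        killed.append(pid)
    return finished, killed
```

A gentler variant might send SIGTERM first and escalate to SIGKILL only after a grace period, which would give a busy child a chance to finish or clean up.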

#2 Updated by Igor Sfiligoi almost 8 years ago

Re forking rrd processing:
Doing multiple RRD updates in parallel does not seem to be a brilliant idea.

Without a ramdisk, you will have heavy disk thrashing!

Plus, if rrd processing is really CPU bound, it is likely to overload the OS when one has O(500) entries.

Am I wrong?

#3 Updated by Burt Holzman almost 8 years ago

It needs to be tested without a ramdisk (which I'll do next).

Keep in mind that we already do the RRD updates in parallel in 2.6.3 and below, since each entry updates its own RRDs. I can add a sleep to better simulate that behavior.

I think for v3 we should consider having the RRDs asynchronously update with a different process -- we shouldn't slow down the main work of the factory for this.

- B
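One way to picture the asynchronous RRD update suggested above is a single writer process fed by a queue. This is only a sketch under assumptions: the names are hypothetical, the actual RRD write is replaced by a placeholder, and the `fork` start method assumes a POSIX host. Funnelling all updates through one worker would also serialize the disk writes rather than running them in parallel.

```python
import multiprocessing


def rrd_writer(updates, applied):
    """Worker: drain queued updates one at a time until the None sentinel."""
    while True:
        item = updates.get()
        if item is None:
            break
        # Placeholder for the real RRD write; here we only echo the update
        # back so the caller can observe that it was processed.
        applied.put(item)


def start_rrd_writer():
    """Start the writer process and return (worker, updates, applied)."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX host
    updates, applied = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=rrd_writer, args=(updates, applied))
    worker.start()
    return worker, updates, applied
```

The main factory loop would then call `updates.put(...)` and continue immediately, sending `None` at shutdown to stop the worker.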

#4 Updated by Parag Mhashilkar almost 8 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Parag Mhashilkar to Douglas Strain

This has been tested internally. While we wait for feedback from the factory operators, let's do the code review.

#5 Updated by Douglas Strain almost 8 years ago

  • Assignee changed from Douglas Strain to Parag Mhashilkar

I have sent my comments to Parag by email. In summary, the general code change looks good. I like the new organization a lot. The issues I see are minor tweaks to the whitelist, a duplicate function in glideFactoryInterface, missing documentation, and a possible improvement from running condor_q in parallel. I looked at the pickling of stats data as well, but it looks fine to me. Parag and I will discuss once he is back in the office.

Doug Strain

#6 Updated by Parag Mhashilkar almost 8 years ago

  • Target version changed from v2_7_x to v2_7

#7 Updated by Parag Mhashilkar almost 8 years ago

  • Status changed from Feedback to Resolved

I have merged the changes from branch_v2plus_2975 to branch_v2plus as one big diff. There is a different ticket to do this work in master. Resolving this one.

#8 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Resolved to Closed
