Feature #2905: Refactor: multiple entry points per factory process
Refactor: multiple entry points per factory process (v2plus)
- Implement a scaled down approach for v2plus as listed in #2905. This will reduce the refactoring required.
- Use message passing mechanism between processes similar to that used in frontend.
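A minimal sketch of the kind of pipe-based message passing meant here, assuming each entry's work runs in a forked child that sends a pickled result back to the parent. The names `run_in_child`, `collect` and `find_work` are hypothetical and are not taken from the frontend or factory code.

```python
import os
import pickle


def find_work(entry_name):
    """Hypothetical stand-in for the per-entry work done in a child."""
    return {"entry": entry_name, "idle": 3}


def run_in_child(work, *args):
    """Fork a child, run work(*args) there, return (pid, read_fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: send the pickled result through the pipe, then exit
        # without running any of the parent's cleanup handlers.
        os.close(read_fd)
        status = 0
        try:
            payload = pickle.dumps(work(*args))
            with os.fdopen(write_fd, "wb") as out:
                out.write(payload)
        except Exception:
            status = 1
        finally:
            os._exit(status)
    # Parent: keep only the read end of the pipe.
    os.close(write_fd)
    return pid, read_fd


def collect(pid, read_fd):
    """Read the child's pickled result (None if it sent nothing) and reap it."""
    with os.fdopen(read_fd, "rb") as src:
        data = src.read()
    os.waitpid(pid, 0)
    return pickle.loads(data) if data else None


if __name__ == "__main__":
    children = {name: run_in_child(find_work, name) for name in ("entry_a", "entry_b")}
    for name, (pid, fd) in children.items():
        print(name, collect(pid, fd))
```

Reading until end-of-file before calling `waitpid` avoids a deadlock when the result is larger than the pipe buffer.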
#1 Updated by Parag Mhashilkar almost 8 years ago
Completed first pass on the work as of yesterday.
Changes in branch_v2plus_2975
The code still needs a few minor changes and some polishing:
- Implement a wait threshold so that the group does not wait indefinitely for the entries. Once the threshold is exceeded, the entry child process should be killed. This raises the following concerns:
- What should happen if the process is in the middle of doing work and is genuinely busy?
- What happens to monitoring and new classad submission during that time?
- Address some of the TODOs and comments in the code
- Do more testing on the monitoring
- Have the entrygroup read the entries info itself rather than getting it from the command line, to avoid exceeding the command-line argument limit
- Test with multiple frontends
- When loading pickled entry info for log_stats we have to create and reload some of the objects. Can we speed this up further?
- Clean up unused/old code from glideinFactoryEntry and glideinFactoryEntryGroup
- Gracefully handle the case when no frontend has a glideclient classad for a particular entry. The factory can be smarter by managing fewer child processes while still publishing glidefactory classads.
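The wait threshold from the list above could be sketched as follows, assuming the parent reads each child's result from a pipe. `spawn` and `wait_for_child` are hypothetical names for illustration. The sketch kills the child unconditionally once the deadline passes, so it does not answer the open question of what to do with a child that is genuinely busy.

```python
import os
import select
import signal
import time


def spawn(delay, message):
    """Hypothetical helper: fork a child that sleeps, then writes message."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        time.sleep(delay)
        os.write(write_fd, message)
        os._exit(0)
    os.close(write_fd)
    return pid, read_fd


def wait_for_child(pid, read_fd, timeout):
    """Return the bytes the child wrote, or None if it was killed
    because it did not finish within timeout seconds."""
    deadline = time.monotonic() + timeout
    chunks = []
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Threshold exceeded: kill and reap the child.
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
            os.close(read_fd)
            return None
        ready, _, _ = select.select([read_fd], [], [], remaining)
        if ready:
            chunk = os.read(read_fd, 65536)
            if not chunk:
                # End of file: the child closed its end of the pipe.
                os.waitpid(pid, 0)
                os.close(read_fd)
                return b"".join(chunks)
            chunks.append(chunk)


if __name__ == "__main__":
    print(wait_for_child(*spawn(0, b"done"), timeout=5))
    print(wait_for_child(*spawn(30, b"late"), timeout=0.5))
```

A gentler variant would send SIGTERM first and escalate to SIGKILL only after a second, shorter grace period.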
#2 Updated by Igor Sfiligoi almost 8 years ago
Re: forking RRD processing:
Doing multiple RRD updates in parallel does not seem to be a brilliant idea.
Without a ramdisk, you will get heavy disk thrashing!
Plus, if RRD processing is really CPU bound, it is likely to overload the OS when one has O(500) entries.
Am I wrong?
#3 Updated by Burt Holzman almost 8 years ago
It needs to be tested without a ramdisk (which I'll do next).
Keep in mind that we already do the RRD updates in parallel in 2.6.3 and below, since each
entry updates its own RRDs. I can add a sleep to better simulate that behavior.
I think for v3 we should consider having the RRDs asynchronously update with a different process -- we shouldn't slow down the main work of the factory for this.
#5 Updated by Douglas Strain almost 8 years ago
- Assignee changed from Douglas Strain to Parag Mhashilkar
I have sent my comments to Parag by email. In summary, the general code change looks good. I like the new organization a lot. The issues I see are minor tweaks to the whitelist, a duplicate function in glideFactoryInterface, missing documentation, and a possible improvement from running condor_q in parallel. I looked at the pickling of stats data as well, but it looks fine to me. Parag and I will discuss once he is back in the office.