Bug #21569

Avoid glideFactoryEntryGroup process leaks

Added by Marco Mascheroni 4 months ago. Updated 3 months ago.

Status: Closed
Priority: High
Category: Factory
Target version:
Start date: 12/18/2018
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: Factory Ops
Duration:

Description

As in https://cdcvs.fnal.gov/redmine/issues/19360 , a bug is causing process leaks in a production factory, which consequently runs out of memory. This should not happen. I have already hotfixed the factory. The problem is in fork.py: the child glideFactoryEntryGroup process does not write anything back to the parent when an error occurs.

History

#1 Updated by Marco Mascheroni 4 months ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mascheroni to Marco Mambelli

#2 Updated by Marco Mascheroni 4 months ago

See #21570

#3 Updated by Marco Mambelli 4 months ago

  • Status changed from Feedback to Work in progress
  • Assignee changed from Marco Mambelli to Marco Mascheroni

Marco, could you find out more about the memory leak: were processes left hanging, or was something else causing it?
I thought the problem had been solved in [#19360].

#4 Updated by Marco Mascheroni 4 months ago

  • Status changed from Work in progress to Feedback

Yes, I thought I had already pushed the branch with the fix; I have done it now. Anyway, this is the fixed code. The os.write call, which I have now moved to the finally clause, was not executed in case of an exception. As a result, the parent process hit an exception and did not clean things up.

        try:
            out = function_torun(*args)
        except:
            out = {}
            logSupport.log.warning("Forked process '%s' failed" % str(function_torun))
            logSupport.log.exception("Forked process '%s' failed" % str(function_torun))
        finally:
            os.write(w, cPickle.dumps(out))
            os.close(w)
            # Exit, immediately. Don't want any cleanup, since I was created
            # just for performing the work
            os._exit(0)
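
The pattern in this fix can be tried in isolation. The following is a minimal Python 3 sketch, not the glideinwms code: `run_in_child` and `read_child_result` are hypothetical names, and `pickle` stands in for Python 2's `cPickle`. It assumes a POSIX system where `os.fork` is available. Because the write sits in `finally`, the parent should receive the fallback `{}` instead of hitting `EOFError` when the forked function raises.

```python
import os
import pickle


def run_in_child(function_torun, *args):
    """Hypothetical helper: fork a child that pickles its result into a pipe."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the write end is needed.
        os.close(r)
        out = {}  # fallback payload if the function raises
        try:
            out = function_torun(*args)
        except Exception:
            pass  # the real code logs the failure here
        finally:
            # Always send a payload, so the parent never reads an empty stream.
            os.write(w, pickle.dumps(out))
            os.close(w)
            # Exit immediately, skipping any cleanup inherited from the parent.
            os._exit(0)
    # Parent: only the read end is needed.
    os.close(w)
    return r, pid


def read_child_result(r, pid):
    """Hypothetical helper: read until EOF, reap the child, unpickle the result."""
    chunks = []
    try:
        while True:
            chunk = os.read(r, 65536)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(r)
        os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return pickle.loads(b"".join(chunks))
```

For example, `read_child_result(*run_in_child(int, "not a number"))` runs a call that raises `ValueError` in the child, and the parent still gets a well-formed (empty) result. Note that the write inside `finally` can itself fail, for instance if the pipe is broken, so the reader still needs its own error handling.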

#5 Updated by Marco Mascheroni 4 months ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli

#6 Updated by Marco Mascheroni 4 months ago

Since Marco Mambelli asked in the weekly meeting, this is the parent process exception:

[2018-12-16 14:50:25,269] ERROR: glideFactoryEntryGroup:374: Exception:
Traceback (most recent call last):
  File "/usr/sbin/glideFactoryEntryGroup.py", line 371, in iterate_one
    group_name, my_entries)
  File "/usr/sbin/glideFactoryEntryGroup.py", line 306, in find_and_perform_work
    post_work_info = forkm_obj.bounded_fork_and_collect(parallel_workers)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/fork.py", line 323, in bounded_fork_and_collect
    post_work_info_subset = fetch_ready_fork_result_list(pipe_ids)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/fork.py", line 222, in fetch_ready_fork_result_list
    out = fetch_fork_result(fd, pid)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/fork.py", line 98, in fetch_fork_result
    out = cPickle.loads(rin)
EOFError
[2018-12-16 14:50:25,269] DEBUG: glideFactoryEntryGroup:376: Group Work done: {}

#7 Updated by Marco Mascheroni 4 months ago

  • Stakeholders updated (diff)

#8 Updated by Marco Mambelli 4 months ago

  • Assignee changed from Marco Mambelli to Marco Mascheroni

New changes in 34/21569_1.
In the previous solution, 34/21569, the write() in finally could still fail (in case of system problems with the pipe).
I kept the exception handling in the reader, moved it to the common fetch_fork_result, and added FetchError for propagation.
Python 3 will allow exception chaining.
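
As a rough illustration of the reader-side handling described here, the sketch below wraps the unpickling failure in a dedicated exception and uses Python 3 chaining. It is an assumption-laden example, not the code in the branch: the `FetchError` class body, the `read_pickled_result` name, and the error message are all hypothetical, and only the idea (catch in the reader, re-raise as `FetchError`) comes from the note above.

```python
import os
import pickle


class FetchError(RuntimeError):
    """Hypothetical stand-in for the FetchError mentioned above."""


def read_pickled_result(fd, pid):
    """Hypothetical reader: drain the pipe, reap the child, unpickle or raise FetchError."""
    chunks = []
    try:
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        # Clean up even if the read fails, so neither the fd nor the child leaks.
        os.close(fd)
        os.waitpid(pid, 0)
    try:
        return pickle.loads(b"".join(chunks))
    except (EOFError, pickle.UnpicklingError) as err:
        # Python 3 chaining keeps the original error available as __cause__.
        raise FetchError("no valid result from child %d" % pid) from err
```

With this shape, a caller that collects many children can catch `FetchError` per child and carry on, while the underlying `EOFError` remains inspectable through `__cause__`.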

#9 Updated by Marco Mascheroni 4 months ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli

#10 Updated by Marco Mambelli 3 months ago

  • Status changed from Feedback to Resolved

#11 Updated by Marco Mambelli 3 months ago

  • Status changed from Resolved to Closed

