Bug #7037

[Urgent] Stuck child processes killing CERN factory

Added by Parag Mhashilkar over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Start date: 09/19/2014
% Done: 0%

Description

Hello glideinWMS team,

Our CERN factory vocms0305 has gone into an unrecoverable state, and it keeps failing in the same way every time we try to run it. After some period of time, something causes the forked child processes of glideFactoryEntryGroup.py to stay alive: rather than exiting after they finish, they just cycle between sleeping and running.

Over time the number of glideFactoryEntryGroup.py children grows to about 500; at that point they consume all of the machine's memory and it becomes unrecoverable. Yesterday the machine even crashed and had to be rebooted.
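
For context, here is a minimal Python 3 sketch of the fork-and-collect pattern that glideFactoryEntryGroup.py relies on (the factory itself runs Python 2 with cPickle, and fork_worker/collect_worker are illustrative names, not the real fork.py API). The point is that the parent has to both read each child's pipe and waitpid() it; if the collect step keeps failing, finished children are never reaped and their number keeps growing, which matches the symptom above.

import os
import pickle


def fork_worker(work):
    # Fork a child that runs work() and pickles its result back over a pipe.
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                     # child
        os.close(r)
        try:
            os.write(w, pickle.dumps(work()))
        finally:
            os.close(w)
            os._exit(0)
    os.close(w)                      # parent keeps only the read end
    return pid, r


def collect_worker(pid, r):
    # Read the child's pickled result, then reap it; close the fd either way.
    try:
        data = b""
        while True:
            chunk = os.read(r, 65536)
            if not chunk:
                break
            data += chunk
        # If the child died before writing, data is empty and pickle raises
        # EOFError -- the same failure as cPickle.loads() in the log below.
        return pickle.loads(data)
    finally:
        os.close(r)
        os.waitpid(pid, 0)           # skip this and the child is never reaped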

I've attached a tarball of the factory and group0 logs. Note that the children always seem to get stuck while processing the same entry; I'm not sure what to make of that, but in group_0.err.log you will see lines like [1], always for CMS_T2_CH_CERN_ce208. I also included the logs for that entry. As far as we can tell nothing is wrong with this entry: it is identical at the SDSC and GOC factories, and we don't observe this behavior there at all.

I tried slowing the polling loop way down, to 10 minutes:
loop_delay="600"

But these stuck processes never terminate. We are running:
glideinwms v3_2_6
condor_version
$CondorVersion: 8.2.2 Aug 07 2014 BuildID: 265643 $
$CondorPlatform: x86_64_RedHat6 $

Our machine is running SL6. Please take a look as soon as convenient; I'll make myself available tomorrow to provide any other info you think may be useful.

Thanks,
Jeff

[1]
[2014-09-19 00:47:24,731] WARNING: fork:140: Failed to extract info from child 'CMS_T2_CH_CERN_ce208'
[2014-09-19 00:47:24,732] ERROR: fork:141: Failed to extract info from child 'CMS_T2_CH_CERN_ce208'
Traceback (most recent call last):
  File "/opt/glideinwms/factory/../../glideinwms/lib/fork.py", line 137, in fetch_ready_fork_result_list
    out = fetch_fork_result(fd, pid)
  File "/opt/glideinwms/factory/../../glideinwms/lib/fork.py", line 83, in fetch_fork_result
    out = cPickle.loads(rin)
EOFError
[2014-09-19 00:47:24,775] WARNING: glideFactoryEntryGroup:371: Error occurred while trying to find and do work.
[2014-09-19 00:47:24,776] ERROR: glideFactoryEntryGroup:372: Exception:
Traceback (most recent call last):
  File "/opt/glideinwms/factory/glideFactoryEntryGroup.py", line 369, in iterate_one
    group_name, my_entries)
  File "/opt/glideinwms/factory/glideFactoryEntryGroup.py", line 305, in find_and_perform_work
    post_work_info = forkm_obj.bounded_fork_and_collect(parallel_workers)
  File "/opt/glideinwms/factory/../../glideinwms/lib/fork.py", line 223, in bounded_fork_and_collect
    post_work_info_subset = fetch_ready_fork_result_list(pipe_ids)
  File "/opt/glideinwms/factory/../../glideinwms/lib/fork.py", line 132, in fetch_ready_fork_result_list
    readable_fds = select.select(fds_to_entry.keys(), [], [], 0)[0]
error: (9, 'Bad file descriptor')
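
The two exceptions line up: a child that dies before writing its pickled result leaves the parent with an empty read, so cPickle.loads() raises EOFError; and once such a child's pipe fd is closed but left in the polling set, the next select() call fails with errno 9. A standalone Python 3 sketch (not glideinWMS code) that reproduces both symptoms:

import os
import pickle
import select

# 1. Child exits before writing its result: the parent reads EOF from the
#    pipe and unpickling the empty buffer raises EOFError.
r, w = os.pipe()
os.close(w)                          # nothing was ever written
data = os.read(r, 65536)             # b'' -- end of file
try:
    pickle.loads(data)
except EOFError as e:
    print("EOFError, as in fetch_fork_result:", e)

# 2. A file descriptor that was already closed is still handed to select():
#    the kernel rejects it with EBADF ("Bad file descriptor").
os.close(r)                          # fd is now stale...
try:
    select.select([r], [], [], 0)    # ...but still in the fd list
except OSError as e:                 # select.error in Python 2
    print("select on a stale fd:", e)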

History

#1 Updated by Parag Mhashilkar over 5 years ago

  • Subject changed from Urgent] Stuck child processes killing CERN factory to [Urgent] Stuck child processes killing CERN factory
  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to Burt Holzman

I made the changes. We may just want to give Jeff fork.py rather than the whole branch, since v3/7037 has unreleased features as well. While you review, I am testing to make sure nothing breaks.
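
Purely as an illustration of the kind of hardening this implies (a sketch, not the actual fetch_ready_fork_result_list from fork.py; fetch_ready_results, _is_open and the pipe_ids layout are assumptions), a collect loop can survive a bad child by dropping stale descriptors before select() and by always closing the pipe and reaping the pid, even when reading or unpickling fails:

import errno
import os
import pickle
import select


def fetch_ready_results(pipe_ids):
    # Illustrative only: pipe_ids maps entry name -> (pid, read_fd).
    # Collect finished children without leaving a dead child un-reaped
    # or a closed fd in the set passed to the next select() call.
    results = {}
    fd_to_entry = dict((fd, name) for name, (pid, fd) in pipe_ids.items())
    try:
        readable = select.select(list(fd_to_entry), [], [], 0)[0]
    except OSError as e:             # select.error in Python 2
        if e.errno != errno.EBADF:
            raise
        # A stale fd slipped into the set: poll only the live ones.
        live = [fd for fd in fd_to_entry if _is_open(fd)]
        readable = select.select(live, [], [], 0)[0] if live else []
    for fd in readable:
        name = fd_to_entry[fd]
        pid = pipe_ids[name][0]
        try:
            data = b""
            while True:
                chunk = os.read(fd, 65536)
                if not chunk:
                    break
                data += chunk
            results[name] = pickle.loads(data)
        except (EOFError, OSError) as e:
            print("lost result from %s: %s" % (name, e))
        finally:
            os.close(fd)             # never reuse this fd in a later select
            os.waitpid(pid, 0)       # always reap, even when reading failed
            del pipe_ids[name]
    return results


def _is_open(fd):
    # Cheap check that fd still refers to an open descriptor.
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False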

#2 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Burt Holzman to Parag Mhashilkar

These changes are already in production. Merging back.

#3 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Resolved to Closed

