Bug #7799

File descriptor limit issues with large number of entries

Added by Parag Mhashilkar almost 6 years ago. Updated over 3 years ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version:
Start date: 02/06/2015
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:
Description

From: Jeff Dost <>
Subject: Factory glideFactoryEntryGroup.py not respecting FD limits
Date: February 5, 2015 at 7:12:16 PM CST
To: "" <>

Hello glideinWMS support,

I am trying out a test factory with about 600 entries. After a reconfig, it is no longer able to stay up. The error I see in the factory log is below [1]. It looks like the EntryGroup process is held to a hard-coded maximum of 1024 FDs. Initially I tried manually increasing the hard and soft ulimit levels for the gfactory user, but that only took effect for the parent glideFactory process:

glideFactory:
grep 'open files' /proc/25342/limits
Max open files 4096 4096 files

EntryGroup:
grep 'open files' /proc/25349/limits
Max open files 1024 1024 files

Poking around the code I found this (line 329 glideFactory.py, version v3_2_7_2):
childs[group] = subprocess.Popen(command_list, shell=False,
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE,
                                 close_fds=True,
                                 preexec_fn=_set_rlimit)

and sure enough, the _set_rlimit callback hard-codes the limit to 1024:
def _set_rlimit():
    resource.setrlimit(resource.RLIMIT_NOFILE, [1024, 1024])

In my opinion there is no good reason for this. The child should just inherit whatever limits the glideFactory parent is set to. That way we can tweak the limits as needed on the gfactory user to avoid hitting them.

Can you please take a look?

Thanks,
Jeff

[1]
[2015-02-05 16:06:16,777] WARNING: glideFactory:447: EntryGroup 0 STDERR: Traceback (most recent call last):
  File "/usr/sbin/glideFactoryEntryGroup.py", line 729, in ?
  File "/usr/sbin/glideFactoryEntryGroup.py", line 672, in main
  File "/usr/lib/python2.4/site-packages/glideinwms/factory/glideFactoryEntry.py", line 91, in init
  File "/usr/lib/python2.4/site-packages/glideinwms/lib/logSupport.py", line 232, in add_processlog_handler
  File "/usr/lib/python2.4/site-packages/glideinwms/lib/logSupport.py", line 76, in init
  File "/usr/lib64/python2.4/logging/handlers.py", line 59, in init
  File "/usr/lib64/python2.4/logging/__init__.py", line 757, in init
IOError: [Errno 24] Too many open files: u'/var/log/gwms-factory/server/entry_UKI-SOUTHGRID-OX-HEP_t2ce06_longfive/UKI-SOUTHGRID-OX-HEP_t2ce06_longfive.err.log'

[2015-02-05 16:06:16,777] WARNING: glideFactory:454: EntryGroup 0 exited. Checking if it should be restarted.
[2015-02-05 16:06:16,777] WARNING: glideFactory:464: Restarting EntryGroup 0.
[2015-02-05 16:06:17,017] ERROR: glideFactory:701: Exception occurred spawning the factory:
Traceback (most recent call last):
  File "/usr/sbin/glideFactory.py", line 697, in main
    frontendDescript, entries, restart_attempts, restart_interval)
  File "/usr/sbin/glideFactory.py", line 484, in spawn
    childs[group].tochild.close()
AttributeError: 'Popen' object has no attribute 'tochild'

History

#1 Updated by Parag Mhashilkar almost 6 years ago

  • Assignee set to Burt Holzman

#2 Updated by Burt Holzman almost 6 years ago

  • Occurs In v3_0, v2_7, v2_7_1, v3_1, v2_7_2, v3_2, v3_2_1, v3_2_2, v3_2_3, v3_2_4, v3_2_5, v3_2_5_1, v3_2_6, v3_2_7, v3_2_8, v3_2_9, v3_3, v3_2_x, v3_x added

Based on further investigation:

The gFEG (glideFactoryEntryGroup) instantiates a glideFactoryEntry (gFE) for every entry point.
Each gFE sets up logging for itself, opening two (I think) FDs per entry.
This isn't really necessary, since the entry log is only used in the children that are forked from the gFEG.

One solution might be to change gFE.log to a function and initialize logging on first use.

I think this has been around since v2_7_0 (the EntryGroup refactor)!
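
As an illustration of the "init logging on first use" idea above, here is a minimal sketch. It is not the glideinWMS code: the class name LazyEntryLog, its methods, and the ".info.log" file name are hypothetical, and it assumes a modern Python rather than the 2.4 interpreter shown in the traceback.

```python
import logging
import os
import tempfile


class LazyEntryLog:
    """Hypothetical stand-in for an entry's log: no file is opened until first use."""

    def __init__(self, log_dir, entry_name):
        self.log_dir = log_dir
        self.entry_name = entry_name
        self._logger = None  # nothing opened yet, so no FD is consumed

    def get(self):
        """Return the entry logger, creating its file handler on the first call."""
        if self._logger is None:
            logger = logging.getLogger("entry." + self.entry_name)
            logger.setLevel(logging.INFO)
            logger.propagate = False
            path = os.path.join(self.log_dir, self.entry_name + ".info.log")
            logger.addHandler(logging.FileHandler(path))  # the FD is opened here
            self._logger = logger
        return self._logger

    def close(self):
        """Close and detach the handlers, releasing their FDs."""
        if self._logger is not None:
            for handler in list(self._logger.handlers):
                handler.close()
                self._logger.removeHandler(handler)
            self._logger = None


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    logs = [LazyEntryLog(log_dir, "entry_%d" % i) for i in range(600)]
    print("files after construction:", len(os.listdir(log_dir)))
    logs[0].get().info("only this entry has opened its log")
    print("files after first use:", len(os.listdir(log_dir)))
    for log in logs:
        log.close()
```

With this shape, a parent that builds hundreds of entry objects holds no log FDs, and only a child that actually writes to an entry's log opens it. On Python 2.6 and later, passing delay=True to logging.FileHandler gives a similar effect by deferring the open until the first emitted record.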

#3 Updated by Burt Holzman almost 6 years ago

To be clear: setting RLIMIT_NOFILE to 1024 was not actually the problem. We should really only need a handful of FDs per process.
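
One way to check the per-entry FD cost described above is to count a process's open descriptors before and after creating the log handlers. The sketch below is only an assumption-laden illustration: the helper names count_open_fds and fds_used_by_handlers are hypothetical, the handlers are plain logging.FileHandler objects rather than the factory's own, and the count relies on /proc/self/fd (Linux) or /dev/fd being available.

```python
import logging
import os
import tempfile


def count_open_fds():
    """Count this process's open file descriptors via /proc/self/fd or /dev/fd."""
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def fds_used_by_handlers(n_entries, log_dir):
    """Open one eager FileHandler per entry and report how many FDs that consumed."""
    before = count_open_fds()
    handlers = [
        logging.FileHandler(os.path.join(log_dir, "entry_%d.log" % i))
        for i in range(n_entries)
    ]
    after = count_open_fds()
    for handler in handlers:
        handler.close()
    return after - before


if __name__ == "__main__":
    used = fds_used_by_handlers(100, tempfile.mkdtemp())
    print("FDs consumed by 100 eagerly opened entry logs:", used)
```

If each entry opens its log files eagerly, the count grows linearly with the number of entries, which is how a few hundred entries can exhaust a 1024 limit regardless of where that limit is set.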

#4 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_9 to v3_2_10

#5 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_10 to v3_2_11

#6 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_11 to v3_2_12

#7 Updated by Parag Mhashilkar almost 5 years ago

  • Target version changed from v3_2_12 to v3_2_13

#8 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_2_13 to v3_2_14

#9 Updated by Parag Mhashilkar over 4 years ago

  • Priority changed from Normal to High
  • Target version changed from v3_2_14 to v3_2_15

#10 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Burt Holzman to Parag Mhashilkar

#11 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_2_15 to v3_2_16

#12 Updated by Parag Mhashilkar about 4 years ago

  • Target version changed from v3_2_16 to v3_2_17

#13 Updated by Parag Mhashilkar almost 4 years ago

  • Target version changed from v3_2_17 to v3_2_18

#14 Updated by Marco Mambelli almost 4 years ago

  • Target version changed from v3_2_18 to v3_2_19

#15 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Parag Mhashilkar to Dennis Box

#16 Updated by Dennis Box over 3 years ago

  • Status changed from New to Feedback
  • Assignee changed from Dennis Box to Marco Mambelli

Updated the old v3/7799 branch by merging in branch_v3_2. Modified factory/glideFactory.py so that child processes inherit file limits from the parent process by default.
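
For readers unfamiliar with how limit inheritance works here, the following is a hedged sketch, not the actual change merged into glideFactory.py. The helpers make_preexec and child_nofile_limits are hypothetical names. The sketch assumes that omitting preexec_fn lets the child keep the parent's RLIMIT_NOFILE, and that an explicit soft limit, when requested, does not exceed the inherited hard limit.

```python
import resource
import subprocess
import sys


def make_preexec(soft_limit=None):
    """Return a preexec_fn that sets the soft FD limit, or None to inherit the parent's."""
    if soft_limit is None:
        return None  # no callback: the child keeps the parent's limits

    def _set_rlimit():
        # Runs in the child before exec; keeps the inherited hard limit.
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    return _set_rlimit


def child_nofile_limits(preexec_fn=None):
    """Spawn a child Python process and return the (soft, hard) RLIMIT_NOFILE it sees."""
    code = "import resource; print(*resource.getrlimit(resource.RLIMIT_NOFILE))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        close_fds=True,
        preexec_fn=preexec_fn,
        universal_newlines=True,
        check=True,
    )
    soft, hard = (int(value) for value in result.stdout.split())
    return soft, hard


if __name__ == "__main__":
    print("parent limits:   ", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("inherited child: ", child_nofile_limits(make_preexec()))
    print("explicit child:  ", child_nofile_limits(make_preexec(64)))
```

With inheritance as the default, raising the ulimit for the gfactory user is enough to raise it for every EntryGroup child, while an explicit limit remains possible where one is wanted.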

#17 Updated by Marco Mambelli over 3 years ago

  • Assignee changed from Marco Mambelli to Dennis Box

#18 Updated by Dennis Box over 3 years ago

  • Status changed from Feedback to Resolved

Implemented the feedback suggestions and merged into branch_v3_2.

#19 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed
