Bug #7544

GWMS is not robust against failing of HTCondor commands and runs out of files

Added by Marco Mambelli over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 12/18/2014
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

I had a configuration error in the factory causing the error:

[2014-12-18 09:18:35,143] INFO: glideFactory:432: Checking EntryGroups 0
[2014-12-18 09:18:35,143] INFO: glideFactory:492: Aggregate monitoring data
[2014-12-18 09:18:35,181] INFO: glideFactory:512: Sleep 59.9105739594 secs
[2014-12-18 09:19:35,096] INFO: glideFactory:411: Checking for credentials ['ress_ITB_INSTALL_TEST_1']
[2014-12-18 09:19:35,131] DEBUG: glideFactoryCredentials:170: updating credential for frontend
[2014-12-18 09:19:35,131] DEBUG: glideFactoryCredentials:97: updating credential file /var/lib/gwms-factory/client-proxies/user_frontend/glidein_gfactory_instance/credential_fermicloud036-fnal-gov_OSG_gWMSFrontend.main_199773
[2014-12-18 09:19:35,131] DEBUG: glideFactoryCredentials:100: updating using privsep
[2014-12-18 09:19:35,145] ERROR: glideFactory:429: Error occurred processing the globals classads:
Traceback (most recent call last):
  File "/usr/sbin/glideFactory.py", line 427, in spawn
    frontendDescript)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 176, in process_global
    raise CredentialError(error_str)
CredentialError: Error occurred processing the globals classads.
Traceback:
['Traceback (most recent call last):\n', '  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 172, in process_global\n    update_credential_file(username, cred_id, cred_data, request_clientname)\n', '  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryCredentials.py", line 114, in update_credential_file\n    raise RuntimeError, "Failed to update credential %s in %s (user %s): %s" % (client_id, proxy_dir, username, e)\n', "RuntimeError: Failed to update credential 199773 in /var/lib/gwms-factory/client-proxies/user_frontend/glidein_gfactory_instance (user frontend): Unexpected Error running '/usr/bin/../sbin/condor_root_switchboard exec 0 994'. Details: Command '/usr/bin/../sbin/condor_root_switchboard exec 0 994' returned non-zero exit status 1: \n"]
[2014-12-18 09:19:35,146] INFO: glideFactory:432: Checking EntryGroups 0

I left it unresolved overnight and the following morning the main process of the factory ran out of file descriptors:

[2014-12-18 11:39:41,867] DEBUG: glideFactoryMonitorAggregator:677: aggregateRRDStats /var/lib/gwms-factory/work-dir/monitor/entry_ress_ITB_INSTALL_TEST_1/rrd_Status_Attributes.xml exception: parse_xml, IOError
[2014-12-18 11:39:41,867] DEBUG: glideFactoryMonitorAggregator:677: aggregateRRDStats /var/lib/gwms-factory/work-dir/monitor/entry_ress_ITB_INSTALL_TEST_1/rrd_Log_Completed.xml exception: parse_xml, IOError
[2014-12-18 11:39:41,868] DEBUG: glideFactoryMonitorAggregator:677: aggregateRRDStats /var/lib/gwms-factory/work-dir/monitor/entry_ress_ITB_INSTALL_TEST_1/rrd_Log_Completed_Stats.xml exception: parse_xml, IOError
[2014-12-18 11:39:41,868] DEBUG: glideFactoryMonitorAggregator:677: aggregateRRDStats /var/lib/gwms-factory/work-dir/monitor/entry_ress_ITB_INSTALL_TEST_1/rrd_Log_Completed_WasteTime.xml exception: parse_xml, IOError
[2014-12-18 11:39:41,868] DEBUG: glideFactoryMonitorAggregator:677: aggregateRRDStats /var/lib/gwms-factory/work-dir/monitor/entry_ress_ITB_INSTALL_TEST_1/rrd_Log_Counts.xml exception: parse_xml, IOError
[2014-12-18 11:39:41,868] ERROR: glideFactory:504: Error advertizing global classads:
Traceback (most recent call last):
  File "/usr/sbin/glideFactory.py", line 502, in spawn
    glideinDescript.data['PubKeyObj'])
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryInterface.py", line 689, in advertizeGlobal
    fd = file(tmpnam, "w")
IOError: [Errno 24] Too many open files: '/tmp/gfi_ad_gfg_368924381_11615'
[2014-12-18 11:39:41,869] INFO: glideFactory:512: Sleep 59.9942519665 secs
[2014-12-18 11:40:41,868] INFO: glideFactory:411: Checking for credentials ['ress_ITB_INSTALL_TEST_1']
[2014-12-18 11:40:41,869] ERROR: glideFactory:420: Error occurred retrieving globals classad -- is Condor running?
[2014-12-18 11:40:41,869] INFO: glideFactory:432: Checking EntryGroups 0
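
For reference, Errno 24 is EMFILE: the process has reached its per-process limit on open file descriptors. A minimal sketch of how that limit could be checked from Python; the helper name fd_limits is hypothetical and not part of glideinWMS:

```python
import errno
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process.

    Hypothetical helper for illustration only; not part of glideinWMS.
    """
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = fd_limits()
    # errno.EMFILE is the code reported as "[Errno 24] Too many open files" on Linux
    print("EMFILE = %d, soft limit = %s, hard limit = %s" % (errno.EMFILE, soft, hard))
```

Once the number of open descriptors reaches the soft limit, any further open() or pipe() call in the process fails with this error.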

lsof showed that the main Factory process (python /usr/sbin/glideFactory.py /var/lib/gwms-factory/work-dir) had over 1000 open file descriptors pointing to /dev/null:

python  11615 gfactory  mem    REG  252,1    14632 517365 /usr/lib64/python2.6/lib-dynload/fcntlmodule.so
python  11615 gfactory  mem    REG  252,1 99154480 438324 /usr/lib/locale/locale-archive
python  11615 gfactory    0r   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory    1w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory    2w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory    3w   REG  252,1  1068758 395609 /var/log/gwms-factory/server/factory/factory.info.log
python  11615 gfactory    4w   REG  252,1  3386488 395622 /var/log/gwms-factory/server/factory/factory.all.log
python  11615 gfactory    5w   REG  252,1  2027278 395410 /var/log/gwms-factory/server/factory/factory.err.log
python  11615 gfactory    6uW  REG  252,1       45 352779 /var/lib/gwms-factory/work-dir/lock/glideinWMS.lock
python  11615 gfactory    7r  FIFO    0,8      0t0  28646 pipe
python  11615 gfactory    8w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory    9r  FIFO    0,8      0t0  28647 pipe
python  11615 gfactory   10w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   11w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   12w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   13w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   14w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   15w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   16w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   17w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory   18w   CHR    1,3      0t0   3787 /dev/null
...
python  11615 gfactory 1013w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory 1014w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory 1015w   CHR    1,3      0t0   3787 /dev/null
python  11615 gfactory 1016r  FIFO    0,8      0t0 283622 pipe
python  11615 gfactory 1017r  FIFO    0,8      0t0 283479 pipe
python  11615 gfactory 1018w  FIFO    0,8      0t0 283479 pipe
python  11615 gfactory 1019r  FIFO    0,8      0t0 283480 pipe
python  11615 gfactory 1020w  FIFO    0,8      0t0 283480 pipe

Note that, apart from stdin (fd 0), every /dev/null descriptor is open in write mode.
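
The same count can be taken from inside a process without lsof. A minimal sketch, assuming a Linux /proc filesystem; the helper name count_open_targets is hypothetical and not part of glideinWMS:

```python
import os
from collections import Counter


def count_open_targets(pid="self"):
    """Map each open file target of a process to its number of descriptors.

    Hypothetical helper for illustration; relies on Linux /proc/<pid>/fd.
    """
    fd_dir = "/proc/%s/fd" % pid
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink()
            continue
        counts[target] += 1
    return counts


if __name__ == "__main__":
    # Print the targets that hold the most descriptors
    for target, n in count_open_targets().most_common(5):
        print("%5d %s" % (n, target))
```

A steadily growing count for a single target such as /dev/null between Factory cycles would point to a descriptor leak like the one shown above.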

History

#1 Updated by Parag Mhashilkar over 5 years ago

  • Assignee set to Marco Mambelli
  • Target version set to v3_2_8

#2 Updated by Marco Mambelli over 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Branch v3/7544.
A dup of stderr was not closed when an exception occurred.
The fix closes the file descriptor and re-raises the original exception.

Committed and tested on fermicloud365.
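
The pattern can be sketched as follows. This is not the actual glideinWMS code: the function name run_with_saved_stderr and the use of subprocess.check_call are assumptions made for illustration. In a daemon whose stderr is /dev/null, each leaked duplicate of fd 2 would show up as one more /dev/null descriptor open for writing, which would be consistent with the lsof output in the description.

```python
import os
import subprocess


def run_with_saved_stderr(cmd):
    """Run cmd with a duplicate of stderr and always release the duplicate.

    Hypothetical helper for illustration; not the actual glideinWMS function.
    """
    saved_fd = os.dup(2)  # duplicate of stderr; leaks if it is never closed
    try:
        # check_call raises CalledProcessError on a non-zero exit status
        subprocess.check_call(cmd, stderr=saved_fd)
    finally:
        # Runs on success and on failure; any exception from check_call
        # continues to propagate to the caller after the descriptor is closed
        os.close(saved_fd)
```

The try/finally form is equivalent to closing the descriptor in an except block followed by a bare raise, and it also covers the success path with a single close call.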

#3 Updated by Parag Mhashilkar over 5 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Sent feedback separately.

#4 Updated by Marco Mambelli over 5 years ago

  • Status changed from Feedback to Resolved

Created v3/7544_v2 with the same change (this time from branch_v3_2); committed and merged.

#5 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Resolved to Closed

