Bug #23636

Frontend Fails with KeyError

Added by Bruno Coimbra 23 days ago.

Status: New
Priority: Urgent
Assignee: -
Category: -
Target version: -
Start date: 11/19/2019
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

While deploying a frontend (v3.4.6) I found the following message in the error logs under "group_main":

[2019-11-19 16:51:28,892] DEBUG: glideinFrontendElement:377: 3 child query processes started
[2019-11-19 16:51:29,132] DEBUG: glideinFrontendElement:859: Schedd fermicloud086.fnal.gov has 0 running with max 5700
[2019-11-19 16:51:29,138] WARNING: fork:60: Forked process '<bound method glideinFrontendElement.subprocess_count_glidein of <__main__.glideinFrontendElement instance at 0x7f78f82dfcb0>>' failed
[2019-11-19 16:51:29,156] DEBUG: glideinFrontendLib:605: Running glidein ids at (u'fermicloud180.fnal.gov', 'fermicloud367@gfactory_instance@gfactory_service', u'') (total glideins: 0, total jobs 0, cluster matches: 0):
[2019-11-19 16:51:29,138] ERROR: fork:61: Forked process '<bound method glideinFrontendElement.subprocess_count_glidein of <__main__.glideinFrontendElement instance at 0x7f78f82dfcb0>>' failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/glideinwms/lib/fork.py", line 57, in fork_in_bg
    out = function_torun(*args)
  File "/usr/sbin/glideinFrontendElement.py", line 1866, in subprocess_count_glidein
    count_status_multi_per_cred[request_name][cred.getId()] = {}
  File "/usr/lib/python2.7/site-packages/glideinwms/frontend/glideinFrontendInterface.py", line 346, in getId
    self._id = self.file_id(self.getIdFilename())
  File "/usr/lib/python2.7/site-packages/glideinwms/frontend/glideinFrontendInterface.py", line 430, in file_id
    dn = x509Support.extract_DN(filename)
  File "/usr/lib/python2.7/site-packages/glideinwms/lib/x509Support.py", line 13, in extract_DN
    fd = open(fname, "r")
IOError: [Errno 13] Permission denied: u'/proxy/pilot_proxy'
[2019-11-19 16:51:29,172] ERROR: fork:108: Re-raising exception during read:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/glideinwms/lib/fork.py", line 104, in fetch_fork_result
    out = cPickle.loads(rin)
EOFError
[2019-11-19 16:51:29,173] WARNING: fork:248: Failed to extract info from child '('Glidein', 0)': Exception during read probably due to worker failure, original exception and trace <type 'exceptions.EOFError'>:
[2019-11-19 16:51:29,173] ERROR: fork:249: Failed to extract info from child '('Glidein', 0)': Exception during read probably due to worker failure, original exception and trace <type 'exceptions.EOFError'>:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/glideinwms/lib/fork.py", line 237, in fetch_ready_fork_result_list
    out = fetch_fork_result(fd, pid)
  File "/usr/lib/python2.7/site-packages/glideinwms/lib/fork.py", line 104, in fetch_fork_result
    out = cPickle.loads(rin)
FetchError: Exception during read probably due to worker failure, original exception and trace <type 'exceptions.EOFError'>:
[2019-11-19 16:51:29,223] ERROR: glideinFrontendElement:1787: Terminating iteration due to errors:
Traceback (most recent call last):
  File "/usr/sbin/glideinFrontendElement.py", line 1783, in do_match
    pipe_out = forkm_obj.bounded_fork_and_collect(self.max_matchmakers)
  File "/usr/lib/python2.7/site-packages/glideinwms/lib/fork.py", line 397, in bounded_fork_and_collect
    raise ForkResultError(nr_errors, post_work_info)
ForkResultError: Found 1 errors
[2019-11-19 16:51:29,224] ERROR: glideinFrontendElement:258: Unhandled exception, dying: ['Traceback (most recent call last):\n', ' File "/usr/sbin/glideinFrontendElement.py", line 252, in main\n rc = self.iterate()\n', ' File "/usr/sbin/glideinFrontendElement.py", line 282, in iterate\n done_something = self.iterate_one()\n', ' File "/usr/sbin/glideinFrontendElement.py", line 524, in iterate_one\n condorq_dict_types[\'Idle\'][\'total\'],\n', "KeyError: 'total'\n"]
Traceback (most recent call last):
  File "/usr/sbin/glideinFrontendElement.py", line 252, in main
    rc = self.iterate()
  File "/usr/sbin/glideinFrontendElement.py", line 282, in iterate
    done_something = self.iterate_one()
  File "/usr/sbin/glideinFrontendElement.py", line 524, in iterate_one
    condorq_dict_types['Idle']['total'],
KeyError: 'total'
[2019-11-19 16:51:29,792] DEBUG: glideinFrontendInterface:1659: CONDOR ADVERTISE /tmp/gfi_de_gc_524203889_2758 INVALIDATE_MASTER_ADS fermicloud180.fnal.gov False
[2019-11-19 16:51:29,982] DEBUG: glideinFrontendInterface:1659: CONDOR ADVERTISE /tmp/gfi_de_gcg_524203889_2758 INVALIDATE_MASTER_ADS fermicloud180.fnal.gov False
[2019-11-19 16:51:30,159] DEBUG: classadSupport:398: CONDOR ADVERTISE /tmp/gfi_ar_524203890_2758 INVALIDATE_ADS_GENERIC None False True
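
The EOFError at fork.py line 104 appears to be a consequence of the earlier IOError in the forked child, not a separate problem: the log itself says "probably due to worker failure". A minimal sketch of that pattern follows, written for Python 3 (the frontend in the log runs Python 2.7). The name fork_and_collect is hypothetical and this is not the glideinwms code; it only shows how a child that fails before writing its result leaves the parent with nothing to unpickle.

```python
import os
import pickle


def fork_and_collect(function_torun):
    """Run function_torun in a forked child and return its unpickled result.

    Hypothetical stand-in for the fork/collect pattern seen in the traceback;
    not the glideinwms implementation.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: write the pickled result only if the function succeeds.
        status = 0
        try:
            os.close(read_fd)
            out = function_torun()
            os.write(write_fd, pickle.dumps(out))
        except Exception:
            status = 1  # nothing is written to the pipe on failure
        finally:
            os.close(write_fd)
            os._exit(status)
    # Parent: read everything the child wrote, then unpickle it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe_in:
        raw = pipe_in.read()
    os.waitpid(pid, 0)
    return pickle.loads(raw)  # raises EOFError when raw is empty
```

If the function passed in raises (for example a Permission denied while opening the proxy), the parent sees only an EOFError, which matches the sequence of messages above.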

I'm not sure how to reproduce the problem. The error showed up after a few restarts.
My pilot proxy had the wrong ownership (see the Permission denied on /proxy/pilot_proxy above), but changing only the ownership did not reproduce the same error.
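
Since the first exception in the log is a Permission denied on /proxy/pilot_proxy, a quick check run as the user that owns the frontend process may help confirm whether the proxy is readable before a restart. This is a minimal sketch; check_readable is a hypothetical helper and not part of glideinwms.

```python
import os


def check_readable(path):
    """Return None if path can be opened for reading, else a short reason."""
    try:
        with open(path, "r"):
            return None
    except (IOError, OSError) as err:
        # Include the effective uid so ownership mismatches are easy to spot.
        return "%s (euid=%d)" % (err, os.geteuid())


if __name__ == "__main__":
    problem = check_readable("/proxy/pilot_proxy")
    print("readable" if problem is None else "not readable: " + problem)
```

Running it under the frontend account and again under another account should show whether the failure depends on who owns the file.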


