Bug #12000

Frontend crashes if it fails to talk to the WMS collector

Added by Parag Mhashilkar almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 03/21/2016
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Slave frontend calls to glideinFrontendInterface.findMasterFrontendClassads() are not wrapped in try/except, unlike the calls to findGlobals() and findGlideinClientMonitoring(). As a result, the slave frontend crashes if it hits an error talking to the WMS collector.
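
The kind of guard described above can be sketched as follows. This is a minimal illustration, not the merged fix: the local ExeError class stands in for the exception raised by condorExe in the stack trace, and find_master_classads_safely and its query parameter are hypothetical names (query stands in for glideinFrontendInterface.findMasterFrontendClassads). What the caller should do when None comes back is left open here.

```python
import logging

logger = logging.getLogger("frontend_sketch")


class ExeError(RuntimeError):
    """Stand-in for the ExeError raised by condorExe in the stack trace."""


def find_master_classads_safely(query, pool, frontend_name):
    """Run a collector query, returning None instead of raising on failure.

    `query` is a hypothetical stand-in for
    glideinFrontendInterface.findMasterFrontendClassads.
    """
    try:
        return query(pool, frontend_name)
    except ExeError:
        # Log the full traceback, but let the caller keep running.
        logger.exception("Failed to query the WMS collector %s", pool)
        return None


def unreachable_collector(pool, frontend_name):
    """Simulate a collector that cannot be contacted."""
    raise ExeError("communication error")


result = find_master_classads_safely(
    unreachable_collector, "glidein.grid.iu.edu", "CMSG-v1_0")
```

With the simulated failure, result is None and the error is logged rather than propagating up to main().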

History

#1 Updated by Parag Mhashilkar almost 4 years ago

Stacktrace from CMS slave frontend's error logs

-------------------- end script --------------------
[2016-03-21 11:27:33,802] ERROR: glideinFrontend:472: Exception occurred trying to spawn:
Traceback (most recent call last):
  File "/usr/sbin/glideinFrontend", line 463, in main
    restart_interval, restart_attempts)
  File "/usr/sbin/glideinFrontend", line 275, in spawn
    ha, mode, groups)
  File "/usr/sbin/glideinFrontend", line 354, in shouldHibernate
    master_classads = glideinFrontendInterface.findMasterFrontendClassads(factory_pool_node, master_frontend_name)
  File "/usr/lib/python2.6/site-packages/glideinwms/frontend/glideinFrontendInterface.py", line 156, in findMasterFrontendClassads
    status.load('(%s)&&(%s)' % (status_constraint, frontend_constraint))
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 285, in load
    self.stored_data = self.fetch(constraint, format_list)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 344, in fetch
    return QueryExe.fetch(self, constraint=constraint, format_list=format_list)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 271, in fetch
    xml_data = condorExe.exe_cmd(self.exe_name,"%s -xml %s %s"
    (self.resource_str,self.pool_str,constraint_str),env=self.env);
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorExe.py", line 56, in exe_cmd
    return iexe_cmd(cmd,stdin_data,env)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorExe.py", line 107, in iexe_cmd
    raise ExeError, msg
ExeError: Unexpected Error running '/usr/bin/condor_status -any -xml -pool glidein.grid.iu.edu -constraint '((GlideinMyType=?="glideclientglobal")||(GlideinMyType=?="glideclient"))&&((FrontendName=?="CMSG-v1_0")&&(FrontendHAMode=!="slave"))''. Details: Command '/usr/bin/condor_status -any -xml -pool glidein.grid.iu.edu -constraint '((GlideinMyType=?="glideclientglobal")||(GlideinMyType=?="glideclient"))&&((FrontendName=?="CMSG-v1_0")&&(FrontendHAMode=!="slave"))'' returned non-zero exit status 1: Error: communication error
CEDAR:6001:Failed to connect to <129.79.53.27:9618>
Error: Couldn't contact the condor_collector on glidein.grid.iu.edu
(<129.79.53.27:9618>).

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.

If you are the system administrator, check that the condor_collector is
running on glidein.grid.iu.edu (<129.79.53.27:9618>), check the ALLOW/DENY
configuration in your condor_config, and check the MasterLog and CollectorLog
files in your log directory for possible clues as to why the condor_collector
is not responding. Also see the Troubleshooting section of the manual.

#2 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Can you please review this? There could be more changes depending on the logs from Edgar, but I expect them to be isolated.

#3 Updated by Parag Mhashilkar almost 4 years ago

  • Subject changed from Slave frontend crashes if it fails to talk to the WMS collector to Frontend crashes if it fails to talk to the WMS collector
  • Description updated (diff)

#4 Updated by Parag Mhashilkar almost 4 years ago

  • Description updated (diff)

In a non-HA setup, the LIGO frontend crashed with the following in its logs:

[2016-03-23 11:48:35,105] DEBUG: condorExe:103: Unexpected Error running '/usr/sbin/condor_advertise UPDATE_AD_GENERIC /tmp/gfi_afm_408758864_2795439'. Details: Command '/usr/sbin/condor_advertise UPDATE_AD_GENERIC /tmp/gfi_afm_408758864_2795439' returned non-zero exit status 1: failed to send classad to <169.228.130.104:9618>

[2016-03-23 11:48:35,105] DEBUG: condorExe:104: script to reproduce failure:
-------------------- begin script --------------------
#!/bin/bash
TERM=xterm-256color
SHELL=/bin/sh
SHLVL=2
X509_USER_PROXY=/tmp/vo_proxy
PWD=/
LOGNAME=frontend
USER=frontend
HOME=/home/frontend
PATH=/sbin:/usr/sbin:/bin:/usr/bin
_CONDOR_CERTIFICATE_MAPFILE=/var/lib/gwms-frontend/vofrontend/group_itb/group.mapfile
CONDOR_CONFIG=/var/lib/gwms-frontend/vofrontend/frontend.condor_config
_=/bin/nice
/usr/sbin/condor_advertise UPDATE_AD_GENERIC /tmp/gfi_afm_408758864_2795439
--------------------  end script  --------------------
[2016-03-23 11:48:35,111] DEBUG: classadSupport:352: CONDOR ADVERTISE /tmp/gfi_afm_408758915_2795439 INVALIDATE_ADS_GENERIC None False False
[2016-03-23 11:49:15,947] WARNING: glideinFrontend:112: [main]: Traceback (most recent call last):
  File "/usr/sbin/glideinFrontendElement.py", line 43, in <module>
    from glideinwms.frontend import glideinFrontendInterface
  File "/usr/lib/python2.6/site-packages/glideinwms/frontend/glideinFrontendInterface.py", line 27, in <module>
    from glideinwms.lib import condorMonitor
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 128, in <module>
    local_schedd_cache=LocalScheddCache()
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 64, in __init__
    self.my_ips=socket.gethostbyname_ex(socket.gethostname())[2]
socket.gaierror: [Errno -2] Name or service not known

[2016-03-23 11:52:12,363] ERROR: glideinFrontend:516: Exception occurred trying to spawn: 
Traceback (most recent call last):
  File "/usr/sbin/glideinFrontend", line 507, in main
    restart_interval, restart_attempts)
  File "/usr/sbin/glideinFrontend", line 316, in spawn
    restart_attempts, "run")
  File "/usr/sbin/glideinFrontend", line 197, in spawn_iteration
    fm_advertiser.advertiseAllClassads()
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/classadSupport.py", line 241, in advertiseAllClassads
    self.advertiseClassads(self.classads.keys())
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/classadSupport.py", line 221, in advertiseClassads
    self.advertiseClassad(ad)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/classadSupport.py", line 233, in advertiseClassad
    self.doAdvertise(fname)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/classadSupport.py", line 193, in doAdvertise
    use_tcp=self.tcpAdvertiseSupport)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/classadSupport.py", line 354, in exe_condor_advertise
    is_multi, pool)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorManager.py", line 161, in condorAdvertise
    return condorExe.exe_cmd_sbin("condor_advertise",cmd_opts)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorExe.py", line 67, in exe_cmd_sbin
    return iexe_cmd(cmd,stdin_data,env)
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorExe.py", line 107, in iexe_cmd
    raise ExeError, msg
ExeError: Unexpected Error running '/usr/sbin/condor_advertise UPDATE_AD_GENERIC /tmp/gfi_afm_408758864_2795439'. Details: Command '/usr/sbin/condor_advertise UPDATE_AD_GENERIC /tmp/gfi_afm_408758864_2795439' returned non-zero exit status 1: failed to send classad to <169.228.130.104:9618>

[2016-03-23 13:45:51,620] DEBUG: glideinFrontend:465: Frontend startup time: 1458765949.91
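
The second traceback above shows a single failed condor_advertise call propagating out of advertiseAllClassads and ending the spawn loop. One possible way to contain that, sketched under assumptions: advertise_all and its advertise parameter are hypothetical names (advertise stands in for whatever ends up running condor_advertise), and the local ExeError class stands in for the one raised by condorExe. This is not the change that was merged for this ticket.

```python
import logging

logger = logging.getLogger("advertise_sketch")


class ExeError(RuntimeError):
    """Stand-in for the ExeError raised by condorExe in the traceback."""


def advertise_all(classad_files, advertise):
    """Try to advertise every file and return the list of files that failed.

    `advertise` is a hypothetical callable that raises ExeError on failure.
    """
    failed = []
    for fname in classad_files:
        try:
            advertise(fname)
        except ExeError as err:
            # Record the failure and move on to the next classad file.
            logger.warning("Could not advertise %s: %s", fname, err)
            failed.append(fname)
    return failed


def flaky_advertise(fname):
    """Simulate a collector that rejects one particular file."""
    if fname.endswith("_bad"):
        raise ExeError("failed to send classad")


failed_files = advertise_all(["/tmp/ad_ok", "/tmp/ad_bad"], flaky_advertise)
```

Returning the failed files instead of raising leaves the retry policy to the caller, which could, for example, try them again on the next iteration.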

#5 Updated by Marco Mambelli almost 4 years ago

  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Sent email with feedback.

#6 Updated by Parag Mhashilkar almost 4 years ago

Made minor changes and merged

#7 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from Feedback to Resolved

#8 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

