Bug #9639

Glidein startup aborting fast for apparently no reason (in reality condor_master is timing out a name resolution)

Added by Marco Mambelli over 4 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Category: -
Target version:
Start date: 07/15/2015
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

This problem has been reported in GGUS 114634 [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=114634]] "pilot failures on newly added IT_T3_Bologna nodes"

The symptom was that the pilots were dying a few seconds after starting up, and it was not clear why.

Here is a description of the problem, from a message by Giuseppe (in the ticket):

The problem also occurred in other cloud testbeds. Working together with the local experts, we found the cause.

In the lines of condor_startup.sh that I reported, there is a call to condor_master:
- if condor_master is OK, a pid file is created immediately and the rest of condor_startup.sh stays alive as long as the pid exists
- if condor_master does not create the pid file for some reason, after 5 seconds condor_startup.sh considers condor_master to have ended and proceeds to the end of the script, terminating the pilot itself with no error.

Now: why does condor_master not create the pid file?
The reason is that it requests a hostname resolution and, if that fails, it retries every 3 seconds, 40 times, before going ahead. 120 seconds is much more than the 5-second wait set for the pid creation...

On the other hand, at least in the cloud environment, hostname resolution often does not make much sense, so a call to `hostname -f` does not give good results.
Just adding a line for the hostname to /etc/hosts fixes the problem.
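
A minimal sketch of how a node could be checked for this condition before the glidein starts. The helper names `name_resolves` and `hosts_file_has_name` are hypothetical and are not part of condor_startup.sh; `getent` is assumed to be available, as on most glibc-based Linux nodes.

```shell
#!/bin/sh
# Hypothetical helpers, not part of condor_startup.sh.

# Succeeds if NAME resolves through the system resolver (/etc/hosts, DNS, ...).
name_resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Succeeds if NAME is listed as a hostname or alias in a hosts-format FILE.
# Usage: hosts_file_has_name FILE NAME
hosts_file_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # ignore comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

host_name=$(hostname)
if name_resolves "$host_name"; then
    echo "hostname '$host_name' resolves"
else
    echo "WARNING: hostname '$host_name' does not resolve;" \
         "condor_master may stall at startup" >&2
    if ! hosts_file_has_name /etc/hosts "$host_name"; then
        echo "WARNING: no entry for '$host_name' in /etc/hosts" >&2
    fi
fi
```

The script only reports the condition; adding the missing /etc/hosts line is left to the site, since it requires root.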

From our side, we can patch the problem.
From the glidein side, it would be better at least:
- to avoid such a strong requirement for hostname resolution
- in any case, to be able to log and report at a higher level if condor_master is stuck somewhere (launching a process in the background and pretending it is working perfectly without much checking does not seem to me a good practice anyway)

I think it could be good to extend the timeout to 3/5 min if, as it seems, condor_master starts up fine after timing out the name resolution.
And at least a meaningful error message should be reported back when the condor startup times out and the glidein is aborted.
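
The suggested behavior could be sketched roughly as follows. This is not the actual condor_startup.sh code: the function name `wait_for_pidfile`, the variable names, and the 180-second value are assumptions chosen for illustration. The value is picked to exceed the 40 x 3 s = 120 s of name-resolution retries described above.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual condor_startup.sh logic.

# Wait until PIDFILE exists and is non-empty, polling once per second.
# Returns 0 if the file appears within TIMEOUT seconds, non-zero otherwise.
# Usage: wait_for_pidfile PIDFILE TIMEOUT
wait_for_pidfile() {
    pidfile=$1
    timeout=$2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ -s "$pidfile" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Final check once the timeout has elapsed
    [ -s "$pidfile" ]
}

# Example use (commented out; PID_FILE is an assumed variable name):
#
#   STARTUP_TIMEOUT=180   # more than 40 retries x 3 s = 120 s
#   if ! wait_for_pidfile "$PID_FILE" "$STARTUP_TIMEOUT"; then
#       echo "ERROR: condor_master did not create $PID_FILE within" \
#            "${STARTUP_TIMEOUT}s (possible hostname resolution timeout)" >&2
#       exit 1
#   fi
```

Exiting with a non-zero status and an explicit message makes the startup timeout visible, instead of the pilot terminating with no error.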


Related issues

Related to GlideinWMS - Bug #21682: Glidein not killing condor processes (Closed, 01/14/2019)

History

#1 Updated by Marco Mambelli 9 months ago

  • Related to Bug #21682: Glidein not killing condor processes added

#2 Updated by Marco Mambelli 9 months ago

  • Target version set to v3_5
  • Assignee set to Marco Mambelli

#3 Updated by Marco Mambelli 9 months ago

  • Related to Bug #21682: Glidein not killing condor processes added

#4 Updated by Marco Mambelli 9 months ago

  • Related to deleted (Bug #21682: Glidein not killing condor processes)

#5 Updated by Marco Mambelli 6 months ago

  • Target version changed from v3_5 to v3_5_1

#6 Updated by Marco Mambelli about 2 months ago

  • Target version changed from v3_5_1 to v3_6_1
