Glidein startup aborts quickly for no apparent reason (condor_master is actually timing out on a name resolution)
This problem has been reported in GGUS 114634 [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=114634]] "pilot failures on newly added IT_T3_Bologna nodes"
The symptom was that pilots were dying a few seconds after starting up, and it was not clear why.
Here is a description of the problem, from Giuseppe's message in the ticket:
The problem also occurred in other cloud testbeds. Working with the local experts, we found the cause. In the lines of condor_startup.sh that I reported, there is a call to condor_master:

- if condor_master is OK, a pid file is created immediately and the rest of condor_startup.sh stays alive for as long as the pid exists
- if condor_master does not create the pid file for some reason, after 5 seconds condor_startup.sh considers condor_master to have ended and proceeds to the end of the script, causing the pilot itself to terminate with no error.

Now: why does condor_master not create the pid file? The reason is that it requests a hostname resolution and, if that fails, retries every 3 seconds, 40 times, before going ahead. 120 seconds is much longer than the 5 second wait set for the pid creation...

On the other hand, at least in the cloud environment, hostname resolution often does not make much sense, so any call to `hostname -f` does not give good results. Simply adding a line for the hostname to /etc/hosts fixes the problem.

From our side, we can patch the problem. From the glidein side, it would be better at least:

- to avoid such a strong requirement for hostname resolution
- in any case, to be able to log and report at a higher level if condor_master is stuck somewhere (launching a process in background and pretending it works perfectly without much checking does not seem to me good practice anyway)
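To make the timing mismatch in the message above concrete, here is a minimal Python sketch. It is only an illustration: the constants are the values quoted in the ticket, the function names (`worst_case_resolution_delay`, `hostname_resolves`, `hosts_line`) are hypothetical, and none of this is taken from condor_startup.sh or condor_master.

```python
import socket

# Values quoted in the ticket: 40 retries, 3 s apart, versus a 5 s pid-file wait.
RESOLVE_RETRIES = 40
RESOLVE_RETRY_INTERVAL_S = 3
PIDFILE_WAIT_S = 5


def worst_case_resolution_delay(retries=RESOLVE_RETRIES,
                                interval=RESOLVE_RETRY_INTERVAL_S):
    """Seconds spent retrying a failing lookup before moving on."""
    return retries * interval


def hostname_resolves(name, resolver=socket.gethostbyname):
    """Return True if `name` resolves to an address, False otherwise."""
    try:
        resolver(name)
        return True
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return False


def hosts_line(ip, fqdn, short_name=None):
    """Build an /etc/hosts entry: address, canonical name, optional alias."""
    names = [fqdn]
    if short_name and short_name != fqdn:
        names.append(short_name)
    return f"{ip}\t{' '.join(names)}"
```

On an affected node one could call `hostname_resolves(socket.getfqdn())`; if it returns False, the worst-case delay of 120 seconds is far beyond the 5 second pid-file wait, and a line such as the one produced by `hosts_line` (with the node's real address and names) is the kind of /etc/hosts entry the ticket describes as the workaround.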
I think it would be good to extend the timeout to 3-5 minutes if, as it seems, condor_master starts up fine once the name resolution has timed out.
At the very least, a meaningful error message should be reported back when the condor startup times out and the glidein is aborted.
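The suggested behaviour (a longer wait for the pid file, plus an explicit error when the wait expires) could look roughly like the Python sketch below. It assumes a 180 second default, the lower end of the range proposed here, and uses hypothetical names (`wait_for_pidfile`, `check_master_started`); it is not the actual logic of condor_startup.sh.

```python
import os
import sys
import time


def wait_for_pidfile(path, timeout_s=180.0, poll_s=1.0):
    """Poll until `path` exists; return True if it appeared within timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def check_master_started(path, timeout_s=180.0, poll_s=1.0, log=sys.stderr):
    """Return 0 if the pid file appeared, else log an explicit error and return 1."""
    if wait_for_pidfile(path, timeout_s, poll_s):
        return 0
    print(
        f"ERROR: condor_master did not create {path} within {timeout_s:.0f}s; "
        "possible cause: hostname resolution is timing out",
        file=log,
    )
    return 1
```

Returning a non-zero status together with a message that names the pid file and the likely cause would let the failure surface in the glidein logs instead of the pilot ending silently.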