Kill not handled properly by glidein_startup.sh
Diego Davila reported that he saw cases where a site admin killed running pilots (mostly at Vanderbilt) and then "zombie" pilots started appearing (i.e., pilots that are not present in the factory queue but for which condor is still running at the site). He tried running a pilot by hand and then killing it, and he said that glidein_startup.sh was not waiting for the condor processes to finish before exiting.
#1 Updated by Marco Mascheroni over 2 years ago
- Assignee set to Marco Mascheroni
- Target version set to v3_4_x
- Stakeholders updated (diff)
Got this email from Diego as well, reporting it for completeness:
I saw this thing by launching a glidein by hand (executing glidein_startup.sh with the proper arguments and using my proxy), then waiting for it to connect to the ITBDEV pool and fetch a payload. Once the pilot was running a payload (sleep job), I sent a SIGTERM to the pilot.
What happens is that the process executing glidein_startup.sh exits almost immediately, but the condor process keeps running (you see this with "ps auxf").
I didn't see this happening at any site (only in my local setup as described above), but I know that at Vanderbilt they get slurm to send SIGTERM signals to the pilots when they need their resources back, then wait for 5 min and hard-kill the pilot if it is still alive. My guess is that if slurm sees the main process (glidein_startup) exiting after the SIGTERM, then they proceed to clean everything and don't let condor notify the collector that it is leaving; hence the collector thinks the pilot is still alive and tries to send jobs there that will eventually fail.
I have seen many cases of the latter (the collector assigns matches to non-existing pilots) at many sites, but very often at Vanderbilt. I think this could be the cause, but I am not 100% sure.
Anyway, I think it would be better if glideinWMS waited for condor to finish before exiting.
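For illustration only, here is a minimal sketch of the behaviour requested above: a wrapper that forwards SIGTERM to its child and does not return until the child has actually exited. This is not the code in glidein_startup.sh; the names run_and_wait, on_term and child_pid are hypothetical, and the real script would pass whatever command it uses to start condor.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual glidein_startup.sh logic:
# forward a termination signal to the child and wait for it to exit.

child_pid=""

# Signal handler: pass the signal on to the child, then return so that
# the wait loop in run_and_wait keeps running.
on_term() {
    if [ -n "$child_pid" ]; then
        kill -TERM "$child_pid" 2>/dev/null || true
    fi
}

# Run the given command in the background and block until it has exited.
run_and_wait() {
    trap on_term TERM INT
    "$@" &
    child_pid=$!
    # A trapped signal makes "wait" return early with a status above 128,
    # so keep waiting for as long as the child process still exists.
    while :; do
        rc=0
        wait "$child_pid" || rc=$?
        kill -0 "$child_pid" 2>/dev/null || break
    done
    trap - TERM INT
    return "$rc"
}
```

With a wrapper like this, the main process stays alive for as long as the child needs to shut down, so a batch system that watches the main process (as in the slurm scenario described above) should not start cleaning up until the child is gone or the hard-kill timeout expires.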