Bug #20202

Kill not handled properly by

Added by Marco Mascheroni over 2 years ago. Updated about 1 year ago.

Diego Davila reported that he saw cases where a site admin killed running pilots (mostly at Vanderbilt) and then "zombie" pilots started appearing (i.e., pilots that are not present in the factory queue but whose condor is still running at the site). He tried running a pilot by hand and then killing it, and he said that it did not wait for the condor processes to finish before exiting.

Related issues

Related to GlideinWMS - Support #22509: Singularity processes left orphaned on PBS (Closed, 05/03/2019)

Is duplicate of GlideinWMS - Bug #21682: Glidein not killing condor processes (Closed, 01/14/2019)


#1 Updated by Marco Mascheroni over 2 years ago

  • Assignee set to Marco Mascheroni
  • Target version set to v3_4_x
  • Stakeholders updated (diff)

Got this email from Diego as well, reporting it for completeness:

I saw this by launching a glidein by hand (executing with the proper arguments and using my proxy), then waiting for it to connect to the ITBDEV pool and fetch a payload. Once the pilot was running a payload (a sleep job), I sent a SIGTERM to the pilot.
What happens is that the process executing exits almost immediately, but the condor process keeps running (you see this with "ps auxf").

I didn't see this happening at any site (only in my local setup as described above), but I know that at Vanderbilt they have Slurm send SIGTERM to the pilots when they need their resources back, then wait 5 minutes and hard-kill the pilot if it is still alive. My guess is that if Slurm sees the main process (glidein_startup) exit after the SIGTERM, it proceeds to clean everything up and doesn't let condor notify the collector that it is leaving; hence the collector thinks the pilot is still alive and tries to send jobs there, which eventually fail.

I have seen many cases of the latter (the collector assigning matches to non-existent pilots) at many sites, but very often at Vanderbilt. I think this could be the cause, but I'm not 100% sure.

Anyway, I think it would be better if glideinWMS waited for condor to finish before exiting.
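
For illustration only, the behaviour requested in the email (a wrapper that forwards SIGTERM to its child and exits only after the child has finished) could be sketched in Python as follows. This is not the GlideinWMS code and not the fix that was eventually applied; `run_and_wait` and the command passed to it are hypothetical names used for the sketch, and it assumes a POSIX system with the wrapper running in the main thread.

```python
import signal
import subprocess


def run_and_wait(cmd):
    """Run cmd as a child process and return its exit code.

    If this process receives SIGTERM while the child is running, the signal
    is forwarded to the child and we keep waiting, instead of exiting at once
    and leaving the child behind. (Hypothetical helper, for illustration.)
    """
    child = None

    def forward_term(signum, frame):
        # Pass the signal on so the child can shut down cleanly.
        # send_signal() does nothing if the child has already been reaped.
        if child is not None:
            child.send_signal(signum)

    # Install the handler before starting the child, so that an early
    # SIGTERM does not kill the wrapper with the default action.
    previous = signal.signal(signal.SIGTERM, forward_term)
    try:
        child = subprocess.Popen(cmd)
        # Blocks until the child exits; the wait is resumed automatically
        # after the signal handler has run.
        return child.wait()
    finally:
        # Restore whatever SIGTERM handler was installed before.
        signal.signal(signal.SIGTERM, previous)
```

A caller would use the returned value as its own exit status, for example `sys.exit(run_and_wait(["sleep", "600"]))`. The sketch leaves any hard-kill timeout to the batch system, as in the Slurm setup described above, and a real wrapper would probably need to handle other signals as well.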

#2 Updated by Marco Mambelli about 2 years ago

  • Target version changed from v3_4_x to v3_5_x

#3 Updated by Marco Mambelli about 1 year ago

  • Related to Support #22509: Singularity processes left orphaned on PBS added

#4 Updated by Marco Mambelli about 1 year ago

  • Is duplicate of Bug #21682: Glidein not killing condor processes added

#5 Updated by Marco Mambelli about 1 year ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli
  • Status changed from New to Rejected

This was solved in #21682 and #22509.
Please reopen if problems persist.
