Bug #20202

Kill not handled properly by glidein_startup.sh

Added by Marco Mascheroni over 1 year ago. Updated 15 days ago.

Status:
Rejected
Priority:
Normal
Category:
-
Target version:
Start date:
06/20/2018
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:

CMS

Duration:

Description

Diego Davila reported cases where a site admin killed running pilots (mostly at Vanderbilt) and "zombie" pilots then started appearing (i.e., pilots that are no longer present in the factory queue but whose condor processes are still running at the site). He ran a pilot by hand and then killed it, and found that glidein_startup.sh did not wait for the condor processes to finish before exiting.


Related issues

Related to GlideinWMS - Support #22509: Singularity processes left orphaned on PBS (New, 05/03/2019)

Is duplicate of GlideinWMS - Bug #21682: Glidein not killing condor processes (Closed, 01/14/2019)

History

#1 Updated by Marco Mascheroni over 1 year ago

  • Assignee set to Marco Mascheroni
  • Target version set to v3_4_x
  • Stakeholders updated (diff)

Got this email from Diego as well, reporting it for completeness:

I saw this by launching a glidein by hand (executing glidein_startup.sh with the proper arguments and using my proxy), then waiting for it to connect to the ITBDEV pool and fetch a payload. Once the pilot was running a payload (a sleep job), I sent a SIGTERM to the pilot.
What happens is that the process executing glidein_startup.sh exits almost immediately, but the condor processes keep running (you can see this with "ps auxf").

I didn't see this happening at any site (only in my local setup as described above), but I know that at Vanderbilt they have slurm send SIGTERM to the pilots when they need their resources back, then wait 5 minutes and hard-kill the pilot if it is still alive. My guess is that if slurm sees the main process (glidein_startup.sh) exit after the SIGTERM, it proceeds to clean everything up and does not give condor a chance to notify the collector that it is leaving. The collector then thinks the pilot is still alive and tries to send jobs there, which eventually fail.

I have seen many cases of the latter (the collector assigning matches to non-existent pilots) at many sites, but most often at Vanderbilt. I think this could be the cause, but I am not 100% sure.

Anyways I think it would be better if glideinWMS waits for condor to finish before exiting.
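
For illustration, the behavior requested above (forward the SIGTERM and wait for the child before exiting) could look roughly like the bash sketch below. This is not the actual glidein_startup.sh code: run_and_wait, on_term and child_pid are hypothetical names, and the sketch assumes the child process shuts down on its own when it receives SIGTERM.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the real glidein_startup.sh logic.
# Assumption: the child (e.g. the script that starts condor) exits on SIGTERM.

child_pid=""

# SIGTERM handler: pass the signal on, then block until the child is gone.
on_term() {
    if [ -n "$child_pid" ] && kill -0 "$child_pid" 2>/dev/null; then
        kill -TERM "$child_pid" 2>/dev/null
        # Wait again: the first wait was interrupted by the trap,
        # so the child has not been reaped yet.
        wait "$child_pid" 2>/dev/null
    fi
    exit 143   # 128 + 15, the conventional exit code after SIGTERM
}

# Run a command in the background and wait for it, so that the
# trap can fire while the wrapper is blocked in "wait".
run_and_wait() {
    local rc=0
    trap on_term TERM
    "$@" &
    child_pid=$!
    wait "$child_pid" || rc=$?
    return "$rc"
}
```

With a wrapper like this, a batch system that signals only the top-level process would still see that process alive until the child has finished its own shutdown. A call might look like `run_and_wait ./start_condor.sh`, where start_condor.sh is a placeholder name.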

#2 Updated by Marco Mambelli about 1 year ago

  • Target version changed from v3_4_x to v3_5_x

#3 Updated by Marco Mambelli 15 days ago

  • Related to Support #22509: Singularity processes left orphaned on PBS added

#4 Updated by Marco Mambelli 15 days ago

  • Is duplicate of Bug #21682: Glidein not killing condor processes added

#5 Updated by Marco Mambelli 15 days ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli
  • Status changed from New to Rejected

This was solved in #21682 and #22509
Please reopen if problems persist


