Support #21929

Follow up w/ OSG and HTCondor to allow a clean exit in PBS

Added by Marco Mambelli 9 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Category: -
Target version:
Start date: 02/19/2019
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

As documented in #21682, when a job submitted to a PBS system is removed, the signal is now correctly delivered to HTCondor, which receives it and shuts down.

However, PBS still sends SIGTERM and then SIGKILL only a few milliseconds later.
This is enough time for the trap to forward the first signal, but not for process termination (sending back logs, ...) and cleanup.
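
To illustrate the forwarding step, here is a minimal Python sketch of a wrapper that relays SIGTERM and SIGQUIT to a child process and then waits for the child to finish its own cleanup. It is not the actual trap used in the glidein wrapper; the function name run_with_forwarding is hypothetical. The forwarding itself is fast, but the wait only helps if the batch system leaves enough time before SIGKILL, which cannot be trapped.

```python
import signal
import subprocess


def run_with_forwarding(cmd, forward=(signal.SIGTERM, signal.SIGQUIT)):
    """Run cmd as a child process and forward the listed signals to it.

    Hypothetical helper for illustration only. Returns the child's exit
    status after the child has had the chance to run its own cleanup.
    """
    child = subprocess.Popen(cmd)

    def forward_signal(signum, _frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the forwarding handler and remember the previous handlers.
    previous = {sig: signal.signal(sig, forward_signal) for sig in forward}
    try:
        # Block until the child exits; cleanup time is decided by the child.
        return child.wait()
    finally:
        # Restore the handlers that were active before the call.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

A wrapper like this must be called from the main thread, since Python only allows installing signal handlers there.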

Either (1) a working parameter is found to increase the delay in PBS,
OR (2) HTCondorCE/BLAHP or HTCondor takes advantage of qsig, which can send a signal to a job, and does so before removing the job.
Solution (2) would have the advantage of controlling which signal is used and distinguishing a quick shutdown (SIGQUIT) from a graceful one (SIGTERM).
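
A rough Python sketch of the sequence in option (2): signal the job with qsig, wait for a grace period, then remove it with qdel. This only illustrates the ordering and is not how BLAHP or HTCondor implement job removal. The function names, the grace_seconds parameter and its default value are assumptions made for the example.

```python
import subprocess
import time


def graceful_remove_commands(job_id, signame="SIGTERM"):
    """Return the two PBS commands: signal the job first, then delete it."""
    return [["qsig", "-s", signame, job_id], ["qdel", job_id]]


def graceful_remove(job_id, signame="SIGTERM", grace_seconds=30.0,
                    run=subprocess.run, sleep=time.sleep):
    """Send signame via qsig, wait grace_seconds, then qdel the job.

    Hypothetical helper: grace_seconds is an assumed value, not a PBS
    setting. run and sleep are injectable so the sequence can be tested
    without a PBS installation.
    """
    signal_cmd, delete_cmd = graceful_remove_commands(job_id, signame)
    run(signal_cmd, check=True)   # e.g. SIGTERM (graceful) or SIGQUIT (quick)
    sleep(grace_seconds)          # leave time for log transfer and cleanup
    run(delete_cmd, check=True)   # PBS then removes whatever is left
```

For a quick shutdown the caller would pass signame="SIGQUIT" and a shorter grace period, for example graceful_remove(job_id, signame="SIGQUIT", grace_seconds=5.0), where job_id is the PBS job identifier.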

The role of GlideinWMS here is to facilitate, coordinate, and verify the solution.
I don't think changes in GWMS itself would help.

The advantage for GWMS would be receiving Glidein log files also in the case of killed jobs.


