Follow up w/ OSG and HTCondor to allow a clean exit in PBS
As documented in #21682, when a job submitted to a PBS system is removed, the signal is now correctly delivered to HTCondor, which receives it and starts shutting down.
However, PBS still sends SIGKILL only a few milliseconds after SIGTERM.
That is enough time for the trap to forward the first signal, but not for HTCondor to terminate cleanly (sending back logs, ...) and clean up.
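To make the timing problem concrete, here is a minimal Python sketch of what a signal-forwarding wrapper does. It is hypothetical: `make_forwarding_handler` and `run_with_forwarding` are illustrative names, not GlideinWMS code, and the real trap may be implemented differently (for example in shell). Relaying the signal is almost instantaneous, which is why the first signal does reach HTCondor. The slow part is `child.wait()`, which only returns once HTCondor has finished shutting down. SIGKILL cannot be caught, so when PBS follows up with it a few milliseconds later, nothing in the job can delay or intercept it.

```python
import signal
import subprocess


def make_forwarding_handler(child):
    """Return a signal handler that relays the received signal to `child`.

    `child` is any object with a send_signal(signum) method, e.g. the
    subprocess.Popen of the HTCondor process (hypothetical setup).
    """
    def handler(signum, frame):
        # Relay the same signal so the child can start its own shutdown.
        child.send_signal(signum)
    return handler


def run_with_forwarding(cmd):
    """Start `cmd`, relay SIGTERM/SIGQUIT to it, and wait for it to exit."""
    child = subprocess.Popen(cmd)
    handler = make_forwarding_handler(child)
    for sig in (signal.SIGTERM, signal.SIGQUIT):
        signal.signal(sig, handler)
    # Blocks until the child exits; this is the window a quick SIGKILL cuts short.
    # SIGKILL itself cannot be handled, so no handler is installed for it.
    return child.wait()
```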
Either (1) a working parameter is found to increase this delay in PBS,
OR (2) HTCondor-CE/BLAHP or HTCondor takes advantage of qsig, which can send a signal to a job, and uses it before removing the job.
Solution (2) would have the advantage of controlling which signal is used, distinguishing a quick shutdown (SIGQUIT) from a graceful one (SIGTERM).
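Solution (2) amounts to a signal, a grace period, and only then the removal. The following is a minimal Python sketch of that ordering. It assumes the Torque/PBS commands `qsig -s <signal> <job_id>` and `qdel <job_id>` are on the PATH. The function name and the 300-second default grace period are hypothetical, and the real logic would live in whichever component (HTCondor-CE/BLAHP or HTCondor) adopts this approach.

```python
import subprocess
import time


def signal_then_remove(job_id, graceful=True, grace_seconds=300,
                       run=subprocess.run, sleep=time.sleep):
    """Signal a PBS/Torque job with qsig, wait, then remove it with qdel.

    graceful=True sends SIGTERM (HTCondor graceful shutdown);
    graceful=False sends SIGQUIT (HTCondor fast shutdown).
    grace_seconds is an assumed value, not a recommended setting.
    """
    sig = "SIGTERM" if graceful else "SIGQUIT"
    # Deliver the chosen signal first, while the job is still running.
    run(["qsig", "-s", sig, job_id], check=True)
    # Give HTCondor time to shut down and send back logs before removal.
    sleep(grace_seconds)
    # The job may already have exited during the grace period, so a
    # failing qdel is not treated as an error.
    run(["qdel", job_id], check=False)
```

Injecting `run` and `sleep` keeps the sequence testable without a batch system. The key design point is simply that the signal chosen by HTCondor's side arrives well before qdel triggers PBS's own SIGTERM/SIGKILL sequence.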
The role of GlideinWMS here is to facilitate and coordinate the solution, and to verify it.
I don't think changes in GWMS would help.
The advantage for GWMS would be receiving the glidein log files even for jobs that are killed.
#1 Updated by Marco Mambelli 7 months ago
Some useful links (Torque documentation):