Bug #21682

Glidein not killing condor processes

Added by Marco Mambelli 9 months ago. Updated 7 months ago.

Status:
Closed
Priority:
Normal
Category:
Glidein
Target version:
Start date:
01/14/2019
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:

FactoryOps CMS

Duration:

Description

It seems that the Glidein (glidein_startup.sh and related scripts) quits (killed, or due to some timeout) but leaves behind HTCondor processes, i.e. the startd, which keep receiving and running user jobs as orphaned processes no longer managed by the host batch system.

This may be related to [#20031]

Here is an email thread from Factory Ops:

I think we've seen this enough times now that it's really time to report it to glideinWMS and see if they can provide us a fix.

I'm adding glideinwms-support here. Diego, have you been able to reproduce this by hand? If so, it would help if you tell the devs what steps you took. In the meantime we can try changing MachineMaxVacateTime, but I'm not convinced it will help if Erik is reporting that the pilots run jobs >24h after glidein_startup goes away (this should not happen!). Where should it be set, factory or frontend?

Thanks,
Jeff

On 1/14/19 7:42 AM, Gough, Erik S wrote:

Hi Diego,
Thanks for the suggestions.  If I understand correctly, you are saying that the condor_master should send a graceful shutdown to the startd, and any payload jobs that don't exit within 10 mins are then forcefully killed.  So the glideins should run for max 10 mins before everything gets cleaned up? What I see on these pilots is that they continue to run and pull jobs for >24 hours.  Do you think these changes will prevent that, or could another mechanism not be working correctly?
Thanks,
-Erik
________________________________________
From: Diego Davila Foyo <diego.davila@cern.ch>
Sent: Monday, January 14, 2019 7:33 AM
To: Jeffrey Michael Dost; Marian Zvada; Gough, Erik S; osg-gfactory-support@physics.ucsd.edu
Subject: RE: [Osg-gfactory-support] Force glidein to retire
Hello all,
As Jeff said, I explored this issue in the past and found that when you SIGTERM a pilot, the glidein_startup.sh script sends a SIGTERM to condor and exits immediately [1] instead of waiting for the condor process to finish, leaving the condor process without a parent. Then, if condor is running a payload, the condor_master will send a graceful shutdown to its children (startd) [2]. When gracefully killed, the jobs are given 10 minutes (default) of grace time before getting hard killed.
I think we need to change 2 things here:
1. we should make MachineMaxVacateTime=0 and
2. the glidein_startup script should be modified to wait for condor after the SIGTERM is performed.
Hope this helps.
Diego

[1] https://github.com/glideinWMS/glideinwms/blob/master/creation/web_base/glidein_startup.sh#L19

[2] From the condor manual:
For the condor_master, a graceful shutdown causes the condor_master to ask all of its children to perform their own graceful shutdown method.
[3] MachineMaxVacateTime:
An integer expression representing the number of seconds the machine is willing to wait for a job that has been soft-killed to gracefully shut down.
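
A minimal sketch of the change proposed in point 2 above, assuming (as in the glidein scripts) that the condor_master PID is available in condor_master2.pid; names and values are illustrative, not the actual implementation:

# Sketch only: on SIGTERM, forward the signal to condor_master and wait for it
# to exit, instead of exiting immediately and orphaning the condor processes.
on_term() {
    master_pid=$(cat "$PWD/condor_master2.pid" 2>/dev/null)
    if [ -n "$master_pid" ]; then
        kill -s TERM "$master_pid"              # ask for a graceful shutdown
        while kill -0 "$master_pid" 2>/dev/null; do
            sleep 5                             # wait until the master is really gone
        done
    fi
    exit 143                                    # 128 + SIGTERM
}
trap on_term TERM
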
________________________________________
From: Jeffrey Dost [jdost@ucsd.edu]
Sent: 12 January 2019 00:03
To: Marian Zvada; Gough, Erik S; osg-gfactory-support@physics.ucsd.edu; Diego Davila Foyo
Subject: Re: [Osg-gfactory-support] Force glidein to retire

I think you can see them from the web monitoring:
http://gfactory-1.t2.ucsd.edu/factory/monitor/factoryStatus.html?entry=CMSHTPC_T2_US_Purdue_Hammer&frontend=total&infoGroup=running&elements=StatusRunningCores,ClientCoresTotal,ClientCoresRunning,ClientCoresIdle,&rra=0&window_min=0&window_max=0&timezone=-8

Note the gap: the registered cores count is somehow higher than the running
cores count. This only makes sense if the startd's are still connected to the
global pool, but the parent glidein processes were killed. This also
makes sense based on Erik reporting they are orphaned (parent process = 1).

This gets tricky because when pilots are forcibly removed, we usually
don't get logs back.

Erik, are you ok if we leave things until next week to debug more
closely? I think Diego (CMS global pool admin) has seen / tried
debugging this situation for other sites in the past. I can follow up
with him and the factory ops team next week. We may need to get gwms
devs involved as well.

Thanks,
Jeff

On 1/11/19 2:16 PM, Marian Zvada wrote:
Hi Erik,

this will be harder to track, and to tell which are the ones to drain,
since we're looking for a "misbehaving" glidein in running status among
other OK running ones, I suppose.

@Jeff, do you have an idea where to look? I'll try to poke around more
in the meantime...

Thanks,
Marian

On 1/11/19 2:54 PM, Gough, Erik S wrote:
Hi Marian,

These entries are all valid.  We have multiple node types in a
cluster served by one CE.  Not sure about those held ones... that
might be a different problem.

Here is some info about a glidein that is currently running outside
of PBS.
MASTER_NAME = glidein_324957_607730595
GLIDEIN_SiteWMS_JobId="6002581.hammer-adm.rcac.purdue.edu" 

That is the local job ID I should see in our local scheduler, but the
job does not exist.

[goughes@hammer-fe01 ~]$ qstat 6002581
qstat: Unknown Job Id Error 6002581.hammer-adm.rcac.purdue.edu

The PBS logs show the CE tried to delete the job and it sent a SIGTERM
    6002581: 01/10/2019 23:01:19.239;08;PBS_Server.27883;Job;Job
deleted at request of cmspilot@hammer-osg.rcac.purdue.edu
    6002581: 01/10/2019 23:01:19.256;08;PBS_Server.27883;Job;Job sent
signal SIGTERM on delete

If I look on the WN, the glidein_startup.sh and condor_startup.sh
scripts have exited, but condor_master is still running.

cmspilot 332711  0.0  0.0 104536  9572 ?        Ss   Jan10 0:01
/tmp/glide_9C9Aeu/main/condor/sbin/condor_master -f -pidfile
/tmp/glide_9C9Aeu/condor_master2.pid

The PPID of that condor_master process is 1.
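
For reference, a hypothetical one-liner to list such orphaned masters on a worker node (PPID = 1; the /tmp/glide_ path pattern is just illustrative):

# condor_master processes reparented to init, started from a glidein directory
ps -eo pid,ppid,user,lstart,cmd | awk '$2 == 1 && /condor_master/ && /glide_/'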

From the PBS side, I see the job get scheduled; it runs for maybe
1-2 mins.  Then the CE tries to delete it.  PBS removes it and thinks
the job has finished.  Yet the condor_master keeps on running and
starts to pull jobs.

Since this is affecting our accounting (we are providing more CoreHr
from payload jobs than we are from glideins) it would be good to have
a mechanism to drain these glideins while we troubleshoot the issue.

Thanks,
-Erik

________________________________________
From: Marian Zvada <marian.zvada@cern.ch>
Sent: Friday, January 11, 2019 3:07 PM
To: Gough, Erik S; Jeffrey Dost; osg-gfactory-support@physics.ucsd.edu
Subject: Re: [Osg-gfactory-support] Force glidein to retire

Hi Erik,

I narrowed it down to several entries which are associated with this CE:

CMSHTPC_T2_US_Purdue_Hammer: multicore entry  -- Vassil 2016-03-23
CMSHTPC_T2_US_Purdue_HammerC: copy of CMSHTPC_T2_US_Purdue_Hammer per
Erik request; --Marian 2017-12-20
CMSHTPC_T2_US_Purdue_HammerHT: Copy of CMSHTPC_T2_US_Purdue_Hammer entry
except GLIDEIN_MaxMemMBs 2018-10-24 --Edita
CMSHTPC_T2_US_Purdue_Hammer_op

Are these all valid? I don't recall other details but can check what the
attribute difference for matchmaking the entries is, unless it rings a
bell about their purpose on your side.

I just did a quick look at the entry status and all seem to just run or be idle,
except one entry which has quite a number of held pilots:
$ entry_q CMSHTPC_T2_US_Purdue_HammerHT -held

-- Schedd: schedd_glideins3@gfactory-1.t2.ucsd.edu :
<169.228.38.36:17490?... @ 01/11/19 11:48:19
   ID         OWNER          HELD_SINCE  HOLD_REASON
7188687.0   fecmsglobal     1/11 06:45 no jobId in submission script's
output (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.-)
7188687.4   fecmsglobal     1/11 06:46 no jobId in submission script's
output (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.-)
7188716.1   fecmsglobal     1/11 06:56 submission command failed (exit
code = -15) (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.- <blah> killed by signal 15.-)
7188716.5   fecmsglobal     1/11 06:56 submission command failed (exit
code = -15) (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.- <blah> killed by signal 15.-)
7188716.8   fecmsglobal     1/11 06:56 submission command failed (exit
code = -15) (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.- <blah> killed by signal 15.-)
7188716.9   fecmsglobal     1/11 06:56 no jobId in submission script's
output (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.-)
7188831.0   fecmsglobal     1/11 07:37 submission command failed (exit
code = -15) (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.- <blah> killed by signal 15.-)
7188831.5   fecmsglobal     1/11 07:39 submission command failed (exit
code = -15) (stdout:) (stderr: <blah> execute_cmd: 30 seconds timeout
expired, killing child process.- <blah> killed by signal 15.-)

Not sure if that contributes to the problem, though. You said that your
PBS doesn't see those pilots while they are still picking up new
payload? Can you perhaps see what the JOBIDs of those are? Also, the ones
I listed don't provide any additional logs back to the factory.

Hope this helps us move forward.

Thanks,
Marian

On 1/11/19 12:08 PM, Gough, Erik S wrote:
Yes, only CMS ones.

-Erik
________________________________________
From: Jeffrey Dost <jdost@ucsd.edu>
Sent: Friday, January 11, 2019 1:07 PM
To: Gough, Erik S; osg-gfactory-support@physics.ucsd.edu
Subject: Re: [Osg-gfactory-support] Force glidein to retire

Thanks Erik,

Putting factory ops back in CC. Also I forgot to ask, these are CMS
pilots, right?

Thanks,
Jeff

On 1/11/19 9:56 AM, Gough, Erik S wrote:
Hi Jeff,

Yes, I see 90 of them currently. These are from
hammer-osg.rcac.purdue.edu.

-Erik
________________________________________
From: Jeffrey Dost <jdost@ucsd.edu>
Sent: Friday, January 11, 2019 12:32 PM
To: Gough, Erik S; osg-gfactory-support@physics.ucsd.edu
Subject: Re: [Osg-gfactory-support] Force glidein to retire

Hi Erik,

Are these untracked pilots still there? From our side it would help if
you can tell us which CE they came from, so we can look at the
correct logs.

Thanks,
Jeff

On 1/10/19 7:33 AM, Gough, Erik S wrote:
Hi,

I have a situation where a large number of glideins are running
outside our scheduler.  It looks like the CE tried to SIGTERM
them, but they continue to run and are no longer tracked by PBS.
Is there a way I can force these glideins to retire?  I don't want
to just kill them as they are still pulling/running payload jobs.

Thanks,
-Erik

pbs_script.sh (5.21 KB) - Marco Mambelli, 01/26/2019 09:16 PM
glidein_config (28.4 KB) - Marco Mambelli, 01/26/2019 09:16 PM
glidein_startup.sh (63.4 KB) - Marco Mambelli, 02/04/2019 07:43 PM
patch_21682_20190204.txt (5.47 KB) - Marco Mambelli, 02/04/2019 07:43 PM
condor_startup.sh (45.5 KB) - Marco Mambelli, 02/04/2019 07:43 PM

Related issues

Related to GlideinWMS - Support #20031: Check that jobs will run OK on all LRM also when USE_PROCESS_GROUPS is set (New, 05/26/2018)

Related to GlideinWMS - Bug #9639: Glidein startup aborting fast for apparently no reason (in reality condor_master is timing out a name resolution) (New, 07/15/2015)

Has duplicate GlideinWMS - Bug #20202: Kill not handled properly by glidein_startup.sh (Rejected, 06/20/2018)

History

#1 Updated by Marco Mambelli 9 months ago

  • Related to Support #20031: Check that jobs will run OK on all LRM also when USE_PROCESS_GROUPS is set added

#2 Updated by Marco Mambelli 9 months ago

The topic was discussed in today's HTCondor-CMS meeting.
I'm adding here some notes from the meeting:

Both condor_off and sending SIGTERM or SIGQUIT can terminate a condor_master and all its daemons.
SIGTERM and a regular condor_off (the default) trigger a graceful termination: by default the jobs have 10 min to finish before being killed hard.
SIGQUIT and condor_off -fast demand an immediate shutdown.
Any daemon under the master should kill itself if the master goes away, even if the master did not send it the kill command (hard kill).

The daemon logs (condor_master and the others) should say that they received the signal (but the info is not sent to the collector, so you need access to the logs).

The signal is more reliable than condor_off: fewer moving parts that could go wrong. W/ condor_off the daemon needs to be up and running, accepting connections, ...

GlideinWMS could wait after sending the signal and then send a hard kill to the master (e.g. 15 min later, if it did not quit). The other daemons should notice the absence of the master and quit.
The process group cannot be used reliably to identify all the processes if we ask condor to start a new process group.

In any case there should be no processes hanging after the graceful shutdown limit, and the startd should not accept new jobs after receiving the shutdown command (even while waiting for jobs to complete).
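
A minimal sketch of that escalation, assuming the master PID is available in condor_master2.pid; the 15-minute limit is just the example value from the notes above:

# Graceful shutdown first, hard kill only if the master outlives the grace period.
master_pid=$(cat "$PWD/condor_master2.pid")
kill -s TERM "$master_pid"                    # graceful: jobs get their grace time

deadline=$(( $(date +%s) + 15*60 ))
while kill -0 "$master_pid" 2>/dev/null && [ "$(date +%s)" -lt "$deadline" ]; do
    sleep 10
done

# Hard kill if it did not quit; the remaining daemons should notice the master
# is gone and terminate on their own.
kill -0 "$master_pid" 2>/dev/null && kill -s KILL "$master_pid"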

#3 Updated by Marco Mambelli 9 months ago

  • Target version changed from v3_4_3 to v3_5

#4 Updated by Marco Mambelli 9 months ago

Some updates from troubleshooting (summary of sessions and the email thread).

For jobs w/ orphaned processes:
- stdout and stderr from the job (glidein_startup.sh) are not available.
- The glidein directory is intact
- There is no trace of a signal in the condor logs

We found out that glidein_startup.sh is not started directly by the OS; there is a helper script, pbs_script.sh, probably created by the HTCondor-CE.
As visible from the attached glidein_config, it seems that glidein_startup.sh also did not receive any signal.
Test glideins from Diego seem to confirm that the glidein is not being killed.
My suspicion is that the helper script is not propagating the signal, defeating the OS's attempt to kill the job (as requested by the Factory).

Erik, Purdue's sysadmin, will try to recover the glidein_startup.sh stdout/err to confirm or refute this suspicion.

If my suspicion is correct, the problem is in the helper script and not in the glidein, and I'll submit an OSG/HTCondor ticket to get this fixed.

#5 Updated by Marco Mambelli 9 months ago

  • File glidein_startup.sh added
  • File condor_startup.sh added
  • File patch_21682_20190201.txt added

Here is a patch to better handle and forward signals, and especially to remove the latency during the initial sleep.
Changes are in condor_startup.sh and glidein_startup.sh only.

#6 Updated by Marco Mambelli 9 months ago

  • Related to Bug #9639: Glidein startup aborting fast for apparently no reason (in reality condor_master is timing out a name resolution) added

#7 Updated by Marco Mambelli 9 months ago

  • Related to Bug #9639: Glidein startup aborting fast for apparently no reason (in reality condor_master is timing out a name resolution) added

#8 Updated by Marco Mambelli 9 months ago

  • Related to deleted (Bug #9639: Glidein startup aborting fast for apparently no reason (in reality condor_master is timing out a name resolution))

#9 Updated by Marco Mambelli 9 months ago

  • File deleted (condor_startup.sh)

#10 Updated by Marco Mambelli 9 months ago

  • File deleted (glidein_startup.sh)

#11 Updated by Marco Mambelli 9 months ago

  • File deleted (patch_21682_20190201.txt)

#12 Updated by Marco Mambelli 9 months ago

  • File patch_21682_20190201b.txt added
  • File glidein_startup.sh added
  • File condor_startup.sh added

Improved startup files patch

#13 Updated by Marco Mambelli 9 months ago

Sent email to Brian Linn about improving the OSG PBS script, excerpt:

At the end PBS is sending a kill to the whole group, at least so it seems, because the glidein scripts are receiving it.
The problem seems to be that SIGKILL is sent right after the SIGTERM, not allowing time to finish external commands and propagate the signal to condor,
which therefore does not receive the signal even though it is in the same process group...
Still investigating: https://cdcvs.fnal.gov/redmine/issues/21682

Anyway, the script can definitely be improved by adding an explicit forward of the signal.
The script we saw is attached.
I think it may be generated by the CE/BLAHP.

It includes a line:
trap 'wait $job_pid; cd $old_home; rm -rf $new_home; exit 255' 1 2 3 15 24

This could be split to forward the signal (except for 0; condor-wise we are especially interested in 3 and 15):
trap 'kill -s 1 $job_pid; wait $job_pid; cd $old_home; rm -rf $new_home; exit 255' 1
...

You could use also a wrapper function like

# Register the same handler for several signals, passing the signal as an argument
function trap_with_arg {
    func="$1" ; shift
    for sig ; do
        trap "$func $sig" "$sig"
    done
}

# Forward the received signal to the job, wait for it, then clean up and exit
function on_signal {
    kill -s "$1" "$job_pid"
    wait "$job_pid"
    cd "$old_home"; rm -rf "$new_home"
    exit 255
}

if you prefer to avoid multiple trap lines
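
For completeness, the wrapper would then be installed for the same signals the original trap line covers (on Linux: 1 HUP, 2 INT, 3 QUIT, 15 TERM, 24 XCPU), for example:

# Install on_signal for each listed signal; the handler receives the signal number.
trap_with_arg on_signal 1 2 3 15 24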

#14 Updated by Marco Mambelli 9 months ago

Some more answers from Jaime about killing HTCondor. In brief:
- sending the signal is OK, probably even better than invoking condor_master -k
- TERM, QUIT and HUP are the only 3 signals with a meaning (QUIT is actually not trapped and kills the process)
- a signal is idempotent (it can be sent multiple times)
- HTCondor may wait if running some blocking call, but will not lose the signal
- the pidfile should be there only if condor is running
Here the detailed answers:

Hi Jaime,
I have a couple of questions, hopefully quick:

1. Currently, when the glidein receives a signal (SIGTERM or SIGINT), it is killing condor with:
$CONDOR_DIR/sbin/condor_master -k $PWD/condor_master2.pid

As per our discussion in a meeting, a signal could be a better way to do that.
Do you still think so?
Which signal should I use?

The condor_master -k <file> sends a SIGTERM to the pid named in the file. This results in a graceful shutdown, where daemons get a chance to do orderly cleanup. To do a fast shutdown, you would send a SIGQUIT to the condor_master process, something like this:
/bin/kill -s SIGQUIT `cat condor_master2.pid`

In either case, when the master receives the signal, it will immediately write a message to the log, then signal all of its children. When each child exits, the master will send a SIGKILL to any remaining descendants. Once all of the children exit, the master then exits.

Thanks Jaime, a few more questions about the killing (I did not find this in the docs).

0. So the signals are an accepted and OK way to terminate HTCondor. You confirm?

Yes, sending a signal to the condor_master is an acceptable way to terminate HTCondor.

1. You mentioned  signals 15) SIGTERM (graceful) and  3) SIGQUIT (fast). 
I know 1)  SIGHUP causes a reconfig.
Any other signal w/ some meaning, e.g. 2) SIGINT  ?

That’s basically it.
Odd, we don’t register a handler for SIGINT, and daemons will just exit immediately. That seems wrong.

Then in GlideinWMS we have a sleep 5 that was probably added to allow condor to create the pid file. And I read the pidfile to know what to kill and wait for. Now the sleep can be interrupted, so things may change.

2. I could get the pid with: 
$CONDOR_DIR/sbin/condor_master -f -pidfile $PWD/condor_master2.pid &
condor_pid=$!
Would that be equivalent or is better to get the PID form the file?

It’s roughly equivalent. The pid file is useful when a different process wants to kill the master, as it may not know the pid otherwise. It is possible, though highly unlikely, for the master to start and not write the pid file.

3. Is the signal idempotent (i.e. if I send the same signal multiple times, e.g. the same SIGTERM multiple times, is that the same as sending it once)?

The master will react the same way whether it gets one signal or multiple signals of the same type.

4. If the pid file is gone (after being there), does it mean that condor_master is gone? So to make the kill more robust I could repeat later something like:
[ -f "$PWD/condor_master2.pid" ] && kill -s TERM `cat "$PWD/condor_master2.pid"`

If the master shuts down cleanly, it will remove the pid file. It’s possible for the file to be removed in some other fashion. And the file won’t be removed if the master crashes. So a better option is to try sending signal 0 to the master process until that fails.
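
A sketch of that check (the variable name is illustrative; the PID could come from $! or from the pid file while it existed):

# Probe with signal 0: no signal is delivered, but the call succeeds while the
# process exists and fails once the master is gone, regardless of the pid file.
if kill -0 "$master_pid" 2>/dev/null; then
    echo "condor_master ($master_pid) still running"
else
    echo "condor_master ($master_pid) has exited"
fi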

5. In a previous GWMS ticket we had condor_master waiting on and retrying a name resolution (which kept failing due to a host misconfiguration), waiting 120 sec before timing out, creating the pid file and continuing.
If we use the PID returned by the background process (like in 1.) and kill the master in the meantime, this will still work and kill condor, correct?

Yes, that will work. The master may not react to the signal until after the blocking name resolution fails, though.

#15 Updated by Marco Mambelli 9 months ago

  • Assignee changed from Marco Mambelli to Dennis Box
  • Status changed from New to Feedback

Changes in v35/21682

#16 Updated by Marco Mambelli 9 months ago

  • File deleted (condor_startup.sh)

#17 Updated by Marco Mambelli 9 months ago

  • File deleted (patch_21682_20190201b.txt)

#18 Updated by Marco Mambelli 9 months ago

  • File deleted (glidein_startup.sh)

#20 Updated by Marco Mambelli 8 months ago

  • Assignee changed from Dennis Box to Marco Mambelli
  • Status changed from Feedback to Resolved

#21 Updated by Marco Mambelli 8 months ago

On 2/19 Diego, Erik and I had a troubleshooting session testing the changes and confirmed that they do solve the problem of the hanging HTCondor.
The signal is sent correctly to condor, which receives it and shuts down.

PBS still sends SIGTERM and then SIGKILL only a few milliseconds later.
This is enough for the trap to forward the first signal, but not for the process termination (sending back logs, ...) and cleanup.
A ticket to work on a clean shutdown has been opened: #21929

#22 Updated by Marco Mambelli 8 months ago

  • Target version changed from v3_5 to v3_4_4

#23 Updated by Marco Mambelli 8 months ago

  • Stakeholders updated (diff)

#24 Updated by Marco Mambelli 7 months ago

  • Status changed from Resolved to Closed

#25 Updated by Marco Mambelli 16 days ago

  • Has duplicate Bug #20202: Kill not handled properly by glidein_startup.sh added

