Bug #23459

The kill feature in projectgui.py appears to remove jobs one at a time for DAGs

Added by Kenneth Herner 27 days ago. Updated 26 days ago.

Status: Feedback
Priority: High
Start date: 10/21/2019
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Hi,

If I read the code (projectapp.py) right, the kill feature simply loops over the cluster IDs associated with a given submission and runs a jobsub_rm command for each one. That's fine for most cases, but does it work correctly for DAGs? For a DAG, one wants to run a single jobsub_rm command on the parent job ID. Getting all of the associated clusters won't work well there, because each job in the DAG has its own cluster ID, so you end up issuing N jobsub_rm commands, each acting on one job (where N could be in the thousands). Today we saw something like this on jobsub03: there were repeated jobsub_rm commands coming from a user who had used the kill button in projectgui.py. That drove the schedd load very high, and new job submissions to jobsub03 were blocked for a full hour. Does the function that gets the cluster IDs (it looks like it is jobs = BatchStatus.get_jobs() in projectapp.py) handle DAG jobs properly for this case, i.e. does it return only the parent ID?

History

#1 Updated by Lynn Garren 27 days ago

  • Assignee set to Herbert Greenlee
  • Status changed from New to Feedback

Herb, is this your code?

#2 Updated by Kenneth Herner 26 days ago

Hi Herb,

If you want to know a given job's parent DAGMan job ID, you can look at the JobsubParentJobId classad variable for that job. It will be the same for all jobs in a given DAG. Once you have those IDs, you can run a single kill command on each parent ID, and HTCondor will do its own magic to remove the dependent jobs.
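
To make the idea concrete, here is a rough sketch of how the list of IDs to remove could be collapsed before any kill command is issued. It is not taken from projectapp.py: the function name removal_targets and the assumption that each job is available as a dict of classad values keyed by JobsubJobId and JobsubParentJobId are both hypothetical, and the sample IDs are made up.

```python
def removal_targets(job_ads):
    """Return the de-duplicated list of job IDs that need a kill command.

    job_ads is assumed (hypothetically) to be a list of dicts holding
    classad values for each job in a submission. Jobs that belong to a
    DAG are represented by their shared parent ID; all other jobs are
    represented by their own ID.
    """
    targets = []
    seen = set()
    for ad in job_ads:
        # Prefer the DAG parent ID when present, else the job's own ID.
        target = ad.get("JobsubParentJobId") or ad["JobsubJobId"]
        if target not in seen:
            seen.add(target)
            targets.append(target)
    return targets


# Made-up example: two jobs from one DAG plus one standalone job.
ads = [
    {"JobsubJobId": "101.0@example", "JobsubParentJobId": "100.0@example"},
    {"JobsubJobId": "102.0@example", "JobsubParentJobId": "100.0@example"},
    {"JobsubJobId": "200.0@example"},
]
print(removal_targets(ads))  # ['100.0@example', '200.0@example']
```

With something along these lines, a DAG with thousands of jobs would result in one removal request for the parent rather than one per job, while non-DAG submissions would behave as they do now.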


