The kill feature in projectgui.py appears to remove jobs one at a time for DAGs
If I read the code (projectapp.py) right, the kill feature simply loops over the cluster IDs associated with a given submission and runs a jobsub_rm command for each one. That's fine for most cases, but does it work correctly for DAGs? For a DAG one wants to run a single jobsub_rm command on the parent job ID. Collecting all associated clusters won't work because each job in the DAG has its own cluster ID, so you end up issuing N jobsub_rm commands, each acting on one job (where N could be in the thousands).
Today we saw something like this on jobsub03, where repeated jobsub_rm commands were coming from a user who had used the kill button in projectgui.py. That drove the schedd load very high, and new job submissions to jobsub03 were blocked for a full hour.
Does the function that gets the cluster IDs (it looks like it is jobs = BatchStatus.get_jobs() in projectapp.py) handle DAG jobs properly for this case, i.e. does it return only the parent ID?
#2 Updated by Kenneth Herner 3 months ago
If you want to know a given job's parent dagman job ID, you can look at the JobsubParentJobId classad variable for that job. It will be the same for all jobs in a given DAG. Once you have those parent IDs, you can run a single kill command on each of them, and HTCondor will do its own magic to remove the dependent jobs.
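As a rough sketch of that idea (not the actual projectapp.py code): assuming each job is available as a dict of its classad values, the IDs to remove could be collapsed to one entry per DAG. The function name removal_targets and the "JobId" key are hypothetical placeholders; only JobsubParentJobId comes from the description above.

```python
def removal_targets(jobs):
    """Return the job IDs to remove, with one entry per DAG.

    `jobs` is a list of dicts holding classad values. "JobId" is a
    hypothetical key for the job's own ID; "JobsubParentJobId" is the
    classad variable shared by all jobs in one DAG.
    """
    targets = []
    seen = set()
    for job in jobs:
        # DAG jobs map to their shared parent ID; standalone jobs
        # (no parent ID set) map to their own ID.
        target = job.get("JobsubParentJobId") or job["JobId"]
        if target not in seen:
            seen.add(target)
            targets.append(target)
    return targets


if __name__ == "__main__":
    jobs = [
        {"JobId": "101.0", "JobsubParentJobId": "100.0"},
        {"JobId": "102.0", "JobsubParentJobId": "100.0"},
        {"JobId": "200.0"},
    ]
    for job_id in removal_targets(jobs):
        print(job_id)
```

The kill loop would then issue one removal per returned ID instead of one per cluster, so a DAG with thousands of jobs results in a single command.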