Bug #17435

Holding one job in cluster causes entire cluster to disappear from jobsub_q

Added by Kevin Retzke almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: High
Category: JobSub Client
Target version: v1.2.4
Start date: 08/11/2017
% Done: 0%

Description

Testing jobs on the newest GPGrid: after holding one job in a cluster of five, the entire cluster disappeared from jobsub_q [0]. Condor still shows all five jobs [1]. Releasing the held job causes the cluster to reappear [2]. It doesn't seem to matter which job is held [3]. If all jobs in the cluster are held, they do show up [4].

Client: 1.2.3.2
Server: jobsub-1.2.4-0.1.rc4.noarch

[0]

kretzke@novagpvm12 ~> jobsub_q -G fermilab --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
416.0@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.1@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.2@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.3@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.4@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
kretzke@novagpvm12 ~> jobsub_hold -G fermilab --jobsub-server=jobsub-dev.fnal.gov --jobid=416.0@htcjsdev01.fnal.gov 
Holding job with jobid=416.0@htcjsdev01.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

kretzke@novagpvm12 ~> jobsub_q -G fermilab --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD

[1] Condor still sees them:

[rexbatch@htcjsdev01 ~]$ condor_q kretzke

-- Schedd: htcjsdev01.fnal.gov : <131.225.154.88:9615?... @ 08/11/17 10:58:18
OWNER   BATCH_NAME                                        SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
kretzke CMD: probe_20170811_105254_1071555_0_1_wrap.sh   8/11 10:52      _      _      4      1      5 416.0-4

5 jobs; 0 completed, 0 removed, 4 idle, 0 running, 1 held, 0 suspended

[2] Releasing the held job causes them to re-appear:

kretzke@novagpvm12 ~> jobsub_release -G fermilab --jobsub-server=jobsub-dev.fnal.gov --jobid=416.0@htcjsdev01.fnal.gov 
Releasing job with jobid=416.0@htcjsdev01.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

kretzke@novagpvm12 ~> jobsub_q -G fermilab --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
416.0@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.1@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.2@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.3@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.4@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

[3] Doesn't seem to matter which job it is:

kretzke@novagpvm12 ~> jobsub_hold -G fermilab --jobsub-server=jobsub-dev.fnal.gov --jobid=416.1@htcjsdev01.fnal.gov 
Holding job with jobid=416.1@htcjsdev01.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

kretzke@novagpvm12 ~> jobsub_q -G fermilab --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD

kretzke@novagpvm12 ~> jobsub_release -G fermilab --jobsub-server=jobsub-dev.fnal.gov --jobid=416.1@htcjsdev01.fnal.gov 
Releasing job with jobid=416.1@htcjsdev01.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

kretzke@novagpvm12 ~> jobsub_q -G fermilab --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
416.0@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.1@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.2@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.3@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.4@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 I   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 

5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

[4] If all jobs in the cluster are held, they show up:

kretzke@novagpvm12 ~> jobsub_hold -G fermilab --jobsub-server=jobsub-dev.fnal.gov --jobid=416.@htcjsdev01.fnal.gov 
Holding job with jobid=416.@htcjsdev01.fnal.gov
4 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

kretzke@novagpvm12 ~> jobsub_q -G fermilab --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
416.0@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 H   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.1@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 H   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.2@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 H   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.3@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 H   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 
416.4@htcjsdev01.fnal.gov             kretzke         08/11 10:52   0+00:00:00 H   0   0.0 probe_20170811_105254_1071555_0_1_wrap.sh 

5 jobs; 0 completed, 0 removed, 0 idle, 0 running, 5 held, 0 suspended

History

#1 Updated by Dennis Box almost 2 years ago

  • Target version set to v1.2.4

I traced the problem to condor 8.7.1: condor_q has two new options, -allusers and -batch, that need to be used to query remote schedds. A fix should be out shortly.
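
For illustration, the remote query presumably ends up looking something like the sketch below. The schedd name is reused from the transcripts above; the exact flag combination the server needs is an assumption on my part, not something this ticket confirms.

# Sketch only: query a remote schedd the way the fixed server might.
# -allusers requests jobs from all owners (the server queries as a
# service account such as rexbatch, not as the job owner); -batch
# selects the grouped batch display that newer condor_q versions
# use by default.
condor_q -name htcjsdev01.fnal.gov -allusers -batch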

#2 Updated by Dennis Box almost 2 years ago

  • Status changed from New to Assigned

Fix checked in to git branch 17435 and tested on jobsub-dev; will cut a release on Monday.

#3 Updated by Dennis Box almost 2 years ago

  • Status changed from Assigned to Resolved

This was due to the changed behavior of condor_q starting with condor 8.7; see the HTCondor release notes.
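
For context, the condor_q output in [1] above is the newer grouped "batch" format, while jobsub_q prints one row per job, so a client that parses per-job rows would find nothing to parse. A minimal sketch of the kind of query that restores the older listing, again reusing the schedd from the transcripts; treating -nobatch as the relevant flag here is my assumption:

# Assumption: request the pre-change, one-row-per-job listing across
# all owners, matching the per-job rows that jobsub_q shows the user.
condor_q -name htcjsdev01.fnal.gov -allusers -nobatch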

#4 Updated by Dennis Box almost 2 years ago

  • Status changed from Resolved to Closed

