
Task #17348

Milestone #16417: Prepare FIFE tools to handle HEP Cloud Submission

Task #16423: Prepare GPGrid for HEP Cloud

Create a comprehensive test plan for jobsub testing with refactored GPGrid

Added by Tanya Levshina almost 2 years ago. Updated 11 months ago.

Status: Closed
Priority: High
Assignee:
Category: -
Target version:
Start date: 08/03/2017
Due date: 11/01/2017
% Done: 100%
Estimated time:
Duration: 91

History

#1 Updated by Joe Boyd almost 2 years ago

  • Assignee changed from Joe Boyd to Bruno Coimbra

Tests that we need to run:

- Normal jobsub submission
- DAG job tests
- Managed proxies
- Non-analysis roles
- Test glideintodie behavior; test NOvA's practice of specifying zero requested time
- Test that the jobs show up in GRACC (we probably need to involve Kevin in this)
- Test direct submission to GPGrid; make sure these jobs also show up in GRACC
- Test that you can submit to offsite resources too (maybe not, if Nick says there isn't a frontend or factory listening at the moment)

Anything else Ken and Shreyas can think up.
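
The checklist above could be driven by a small harness that records pass/fail per test. This is only a sketch: the actual jobsub_submit / jobsub_submit_dag invocations depend on the experiment and server, so they are stubbed here with `true` so the harness logic itself can be exercised anywhere.

```shell
#!/bin/bash
# Sketch of a pass/fail harness for the test checklist.
# Real runs would replace the 'true' stubs with jobsub_submit commands.

pass=0
fail=0

run_test() {
    local name="$1"
    shift
    # Run the given command; treat exit 0 as pass, anything else as fail.
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $name"
        pass=$((pass + 1))
    else
        echo "FAIL: $name"
        fail=$((fail + 1))
    fi
}

# Replace the stubs with real submissions, e.g.:
#   run_test "normal submission" jobsub_submit -G nova file://./probe.sh
run_test "normal jobsub submission" true
run_test "dag job test" true
run_test "managed proxies" true

echo "ran $((pass + fail)) tests: $pass passed, $fail failed"
```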

#2 Updated by Joe Boyd almost 2 years ago

Should also test that HAProxy is working. Submissions (and even jobsub_q) should rotate to different servers. If HAProxy is set to always send the same client IP to the same backend host, then at least different client machines should go to different backend hosts.
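
One way to check the rotation is to query the alias repeatedly and count how many distinct backends answer. The sketch below simulates the responses so the logic runs anywhere; a real check would instead parse the `JOBSUB SERVER RESPONDED` line from each jobsub_q run against jobsub-dev.fnal.gov.

```shell
#!/bin/bash
# Sketch: query the alias several times and count distinct backend hosts.
# The responses are simulated here; a real check would extract the host, e.g.:
#   host=$(jobsub_q --jobsub-server=jobsub-dev.fnal.gov 2>&1 |
#          sed -n 's|.*RESPONDED *: https://\([^:]*\).*|\1|p')

seen=""
for n in 1 2 3 4 5 6; do
    # Simulated alternating backend response.
    if [ $((n % 2)) -eq 0 ]; then
        host="htcjsdev01.fnal.gov"
    else
        host="htcjsdev02.fnal.gov"
    fi
    case " $seen " in
        *" $host "*) : ;;          # backend already recorded
        *) seen="$seen $host" ;;
    esac
done

distinct=$(echo "$seen" | wc -w)
echo "distinct backends seen: $distinct"
# With a working round-robin we expect more than one distinct backend.
```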

#3 Updated by Joe Boyd almost 2 years ago

Hi,

I believe that we have (finally) overcome all the showstopper problems that we have encountered. We would like to request help from FIFE to test the dev cluster for expected behaviors and uncover any remaining issues. The CEs are htccedev01 and htcedev02. The Jobsub servers are behind an HAProxy. This replaces the DNS round robin alias. The HAProxy alias is jobsub-dev.fnal.gov.

Please feel free to report issues via Slack and/or email to .

Thanks,

--
Anthony Tiradani
+1 630 840 4479

#4 Updated by Bruno Coimbra almost 2 years ago

There seems to be a misbehavior of the jobsub-dev.fnal.gov alias, at least with jobsub_q.

When you run jobsub_q --jobsub-server=jobsub-dev.fnal.gov, it shows jobs from either htcjsdev01.fnal.gov or htcjsdev02.fnal.gov, but never both at the same time.

Please see the example below:

-bash-4.1$ jobsub_q --group nova --user coimbra --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh

14 jobs; 0 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended
-bash-4.1$ jobsub_q --group nova --user coimbra --jobsub-server=htcjsdev01.fnal.gov
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh

15 jobs; 0 completed, 0 removed, 0 idle, 0 running, 15 held, 0 suspended
-bash-4.1$ jobsub_q --group nova --user coimbra --jobsub-server=htcjsdev02.fnal.gov
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh

14 jobs; 0 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended
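
The totals above (14 jobs via the alias, 15 on dev01, 14 on dev02) are consistent with the alias returning a single backend's queue per request rather than an aggregate. A quick sanity check is to parse the `N jobs; ...` summary line from each query and compare the counts; sketched below against the captured summary lines rather than live servers.

```shell
#!/bin/bash
# Parse the 'N jobs; ...' summary line of jobsub_q output and compare counts.
# The here-strings stand in for live queries of the alias and each backend.

count_jobs() {
    # Extract the leading job count from a 'N jobs; ...' summary line.
    sed -n 's/^\([0-9][0-9]*\) jobs;.*/\1/p'
}

alias_count=$(count_jobs <<< "14 jobs; 0 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended")
dev01_count=$(count_jobs <<< "15 jobs; 0 completed, 0 removed, 0 idle, 0 running, 15 held, 0 suspended")
dev02_count=$(count_jobs <<< "14 jobs; 0 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended")

total=$((dev01_count + dev02_count))
if [ "$alias_count" -lt "$total" ]; then
    echo "alias shows $alias_count of $total jobs: one backend per request"
fi
```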

#5 Updated by Shreyas Bhat over 1 year ago

Testing DAG jobs:

Production control test jobs:
- (no jobsub_submit args specified in DAG)
- (jobsub_submit args specified in DAG)

#6 Updated by Shreyas Bhat over 1 year ago

DAG tests

Production works:

 -bash-4.1$ jobsub_submit_dag -G nova --role=Analysis --resource-provides=usage_model=DEDICATED,OFFSITE  --expected-lifetime=1h --disk=50MB  file:///nashome/s/sbhat/TestDAG/testdag.dag 
/fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154

executing condor_submit_dag -dont_suppress_notification   -append "+Owner=\"sbhat\"" -append "+AccountingGroup=\"group_nova.sbhat\""  -append "+JobsubJobID=\"\$(Cluster).\$(Process)@fifebatch1.fnal.gov\"" /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag 

Submitting job(s).

1 job(s) submitted to cluster 19707984.

-----------------------------------------------------------------------

File for submitting this DAG to Condor           : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.condor.sub

Log of DAGMan debugging messages                 : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.dagman.out

Log of Condor library output                     : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.lib.out

Log of Condor library error messages             : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.lib.err

Log of the life of condor_dagman itself          : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.dagman.log

-----------------------------------------------------------------------

JobsubJobId of first job: 19707984.0@fifebatch1.fnal.gov

Use job id 19707984.0@fifebatch1.fnal.gov to retrieve output

The control jobs in the previous comment were also submitted to production, and they ran as expected.

But jobsub-dev does not work:

-bash-4.1$ jobsub_submit_dag -G nova --role=Analysis --resource-provides=usage_model=DEDICATED,OFFSITE  --expected-lifetime=1h --disk=50MB --jobsub-server=jobsub-dev.fnal.gov file:///nashome/s/sbhat/TestDAG/testdag.dag 
ERROR:
Error running as user sbhat using command /opt/jobsub/server/webapp/jobsub_priv runCommand /opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag:
STDOUT:
STDERR:
Exception:Command 'sudo -u sbhat -E /opt/jobsub/server/webapp/jobsub_priv runCommand /opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag' returned non-zero exit status 1: 
EXITCODE:1
STDOUT:
STDERR:Error running command: /opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag
ERROR: 'runCommand' failed with exception: Command '/opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag' returned non-zero exit status 1: 
EXITCODE:1
STDOUT:error processing command jobsub_submit  ${JOBSUB_EXPORTS} ./probeA.sh
sh: jobsub_submit: command not found

STDERR:

JOBSUB SERVER CONTACTED     : https://htcjsdev01.fnal.gov:8443
JOBSUB SERVER RESPONDED     : https://htcjsdev01.fnal.gov:8443
JOBSUB SERVER RESPONSE CODE : 500 (Failed)
JOBSUB SERVER SERVICED IN   : 1.00370407104 sec
JOBSUB CLIENT FQDN          : novagpvm01.fnal.gov
JOBSUB CLIENT SERVICED TIME : 24/Aug/2017 11:00:40
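
The `jobsub_submit: command not found` error above suggests the DAG runner's shell on htcjsdev01 does not have the jobsub client on its PATH. A first diagnostic (hypothetical, to be run in the server-side environment where jobsub_dag_runner.sh executes) could be:

```shell
#!/bin/bash
# PATH diagnostic for the 'command not found' failure above.
# On a correctly configured server, command -v prints the client's path;
# an empty result reproduces the error seen by the DAG runner.

if path=$(command -v jobsub_submit 2>/dev/null); then
    status="found at $path"
else
    status="not on PATH"
fi
echo "jobsub_submit: $status (user: $(whoami))"
```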

fifebatch-dev does work:

-bash-4.1$ jobsub_submit_dag -G nova --role=Analysis --resource-provides=usage_model=DEDICATED,OFFSITE  --expected-lifetime=1h --disk=50MB --jobsub-server=fifebatch-dev.fnal.gov file:///nashome/s/sbhat/TestDAG/testdag.dag 
/fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593

executing condor_submit_dag -dont_suppress_notification   -append "+Owner=\"sbhat\"" -append "+AccountingGroup=\"group_nova.sbhat\""  -append "+JobsubJobID=\"\$(Cluster).\$(Process)@fife-jobsub-dev01.fnal.gov\"" /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag 

Submitting job(s).

1 job(s) submitted to cluster 13756.

-----------------------------------------------------------------------

File for submitting this DAG to Condor           : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.condor.sub

Log of DAGMan debugging messages                 : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.dagman.out

Log of Condor library output                     : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.lib.out

Log of Condor library error messages             : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.lib.err

Log of the life of condor_dagman itself          : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.dagman.log

-----------------------------------------------------------------------

JobsubJobId of first job: 13756.0@fife-jobsub-dev01.fnal.gov

Use job id 13756.0@fife-jobsub-dev01.fnal.gov to retrieve output

#7 Updated by Tanya Levshina over 1 year ago

  • Status changed from Assigned to Resolved
  • % Done changed from 0 to 100

Tests have been performed and we have transitioned to production. We still missed several HTCondor bugs (CPU utilization, memory reports, etc.); we need to add new tests covering these to the test suite.

#8 Updated by Tanya Levshina over 1 year ago

  • Status changed from Resolved to Closed

#9 Updated by Tanya Levshina 11 months ago

  • Due date set to 11/01/2017

