Task #17348
Milestone #16417: Prepare FIFE tools to handle HEP Cloud Submission
Task #16423: Prepare GPGrid for HEP Cloud
Create a comprehensive test plan for jobsub testing with refactored GPGrid
100%
History
#1 Updated by Joe Boyd over 3 years ago
- Assignee changed from Joe Boyd to Bruno Coimbra
Tests that we need to run
- Normal jobsub submission
- DAG job tests
- Managed proxies
- Non-analysis roles
- Test glideintodie behavior, including NOvA's case of specifying zero requested time
- Test that the jobs show up in gracc (probably need to involve Kevin in this)
- Test direct submission to gpgrid, and make sure these jobs also show up in gracc
- Test that you can submit to offsite resources too (maybe not, if Nick says there isn't a frontend or factory listening at the moment)
Anything else Ken and Shreyas can think up.
#2 Updated by Joe Boyd over 3 years ago
Should also test that the haproxy is working. Submissions (and even jobsub_q) should rotate to different servers. If haproxy is set to always send the same client ip to the same backend host then at least different client machines should go to different machines.
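One way to check this rotation is to tally which backend answered each client call. The sketch below assumes the client prints a "JOBSUB SERVER RESPONDED : https://host:port" trailer line, as seen in the jobsub_submit_dag error output in comment #6; whether every client command prints that line is an assumption. The function name tally_backends and the sample strings are illustrative only.

```python
import re
from collections import Counter

# Matches the trailer line the jobsub client prints, e.g.
# "JOBSUB SERVER RESPONDED : https://htcjsdev01.fnal.gov:8443"
RESPONDED = re.compile(r"JOBSUB SERVER RESPONDED\s*:\s*https?://([^:/\s]+)")


def tally_backends(outputs):
    """Count how many captured client outputs were served by each backend host."""
    counts = Counter()
    for text in outputs:
        match = RESPONDED.search(text)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical captured outputs from three client calls made through the alias.
samples = [
    "JOBSUB SERVER RESPONDED : https://htcjsdev01.fnal.gov:8443",
    "JOBSUB SERVER RESPONDED : https://htcjsdev02.fnal.gov:8443",
    "JOBSUB SERVER RESPONDED : https://htcjsdev01.fnal.gov:8443",
]
counts = tally_backends(samples)
print(dict(counts))
```

If haproxy pins a client IP to one backend, a single client would show only one host here, so the same tally would have to be collected from several client machines.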
#3 Updated by Joe Boyd over 3 years ago
Hi,
I believe that we have (finally) overcome all the showstopper problems that we have encountered. We would like to request help from FIFE to test the dev cluster for expected behaviors and uncover any remaining issues. The CEs are htccedev01 and htcedev02. The Jobsub servers are behind an HAProxy. This replaces the DNS round robin alias. The HAProxy alias is jobsub-dev.fnal.gov.
Please feel free to report issues via Slack and/or email to gco@fnal.gov.
Thanks,
--
Anthony Tiradani
+1 630 840 4479
tiradani@fnal.gov
#4 Updated by Bruno Coimbra over 3 years ago
There seems to be a misbehavior of the alias jobsub-dev.fnal.gov, at least with jobsub_q.
When you run jobsub_q --jobsub-server=jobsub-dev.fnal.gov, it shows jobs from either htcjsdev01.fnal.gov or htcjsdev02.fnal.gov, but never both at the same time.
Please see the example below:
-bash-4.1$ jobsub_q --group nova --user coimbra --jobsub-server=jobsub-dev.fnal.gov
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1415.0@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.1@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.2@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.3@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.4@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.5@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.6@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.7@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.8@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.9@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.10@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.12@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.13@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.14@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
14 jobs; 0 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended
-bash-4.1$ jobsub_q --group nova --user coimbra --jobsub-server=htcjsdev01.fnal.gov
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1029.0@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.1@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.2@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.3@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.4@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.5@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.6@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.7@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.8@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.9@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.10@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.11@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.12@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.13@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
1029.14@htcjsdev01.fnal.gov coimbra 08/18 10:42 0+00:00:00 H 0 0.0 probe_20170818_104243_1433902_0_1_wrap.sh
15 jobs; 0 completed, 0 removed, 0 idle, 0 running, 15 held, 0 suspended
-bash-4.1$ jobsub_q --group nova --user coimbra --jobsub-server=htcjsdev02.fnal.gov
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1415.0@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.1@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.2@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.3@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.4@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.5@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.6@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.7@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.8@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.9@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.10@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.12@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.13@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
1415.14@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_20170818_103718_1861623_0_1_wrap.sh
14 jobs; 0 completed, 0 removed, 0 idle, 0 running, 14 held, 0 suspended
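A comparison like the one above can be automated by grouping the listed job IDs by schedd host and checking the count against the summary line. This is a minimal sketch that assumes the jobsub_q listing format shown above; the function name summarize and the shortened sample listing are illustrative, not part of jobsub.

```python
import re
from collections import defaultdict

# A job line starts with "<cluster>.<process>@<schedd host>".
JOB_LINE = re.compile(r"^(\d+)\.(\d+)@(\S+)\s")
# The summary line starts with the total, e.g. "14 jobs; 0 completed, ..."
SUMMARY = re.compile(r"^(\d+) jobs;")


def summarize(listing):
    """Return ({host: [process ids]}, job count reported in the summary line)."""
    by_host = defaultdict(list)
    reported = None
    for line in listing.splitlines():
        job = JOB_LINE.match(line)
        if job:
            by_host[job.group(3)].append(int(job.group(2)))
            continue
        summary = SUMMARY.match(line)
        if summary:
            reported = int(summary.group(1))
    return dict(by_host), reported


# Shortened, illustrative listing in the same column layout as above.
listing = """\
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1415.0@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_wrap.sh
1415.1@htcjsdev02.fnal.gov coimbra 08/18 10:37 0+00:00:00 H 0 0.0 probe_wrap.sh
2 jobs; 0 completed, 0 removed, 0 idle, 0 running, 2 held, 0 suspended
"""
by_host, reported = summarize(listing)
print(sorted(by_host), reported)
```

If a query through the alias returned jobs from every schedd, by_host would contain both htcjsdev01 and htcjsdev02; a single key reproduces the behavior reported here.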
#5 Updated by Shreyas Bhat over 3 years ago
Testing DAG jobs:
Production control test jobs:
19707284.0@fifebatch1.fnal.gov (no jobsub_submit args specified in DAG)
19707475.0@fifebatch1.fnal.gov (jobsub_submit args specified in DAG)
#6 Updated by Shreyas Bhat over 3 years ago
DAG tests
Production works:
-bash-4.1$ jobsub_submit_dag -G nova --role=Analysis --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB file:///nashome/s/sbhat/TestDAG/testdag.dag
/fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154
executing condor_submit_dag -dont_suppress_notification -append "+Owner=\"sbhat\"" -append "+AccountingGroup=\"group_nova.sbhat\"" -append "+JobsubJobID=\"\$(Cluster).\$(Process)@fifebatch1.fnal.gov\"" /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag
Submitting job(s).
1 job(s) submitted to cluster 19707984.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.condor.sub
Log of DAGMan debugging messages : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.dagman.out
Log of Condor library output : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.lib.out
Log of Condor library error messages : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.lib.err
Log of the life of condor_dagman itself : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110029.882998_1154/submit.20170824_110030.dag.dagman.log
-----------------------------------------------------------------------
JobsubJobId of first job: 19707984.0@fifebatch1.fnal.gov
Use job id 19707984.0@fifebatch1.fnal.gov to retrieve output
The control jobs in the previous comment were also submitted to production, and they ran as expected.
But jobsub-dev doesn't work
-bash-4.1$ jobsub_submit_dag -G nova --role=Analysis --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB --jobsub-server=jobsub-dev.fnal.gov file:///nashome/s/sbhat/TestDAG/testdag.dag
ERROR: Error running as user sbhat using command /opt/jobsub/server/webapp/jobsub_priv runCommand /opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag:
STDOUT:
STDERR:
Exception:Command 'sudo -u sbhat -E /opt/jobsub/server/webapp/jobsub_priv runCommand /opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag' returned non-zero exit status 1: EXITCODE:1
STDOUT:
STDERR:Error running command: /opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag
ERROR: 'runCommand' failed with exception: Command '/opt/jobsub/server/webapp/jobsub_dag_runner.sh --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB '' /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110039.483410_1295/testdag.dag' returned non-zero exit status 1: EXITCODE:1
STDOUT:error processing command jobsub_submit ${JOBSUB_EXPORTS} ./probeA.sh
sh: jobsub_submit: command not found
STDERR:
JOBSUB SERVER CONTACTED : https://htcjsdev01.fnal.gov:8443
JOBSUB SERVER RESPONDED : https://htcjsdev01.fnal.gov:8443
JOBSUB SERVER RESPONSE CODE : 500 (Failed)
JOBSUB SERVER SERVICED IN : 1.00370407104 sec
JOBSUB CLIENT FQDN : novagpvm01.fnal.gov
JOBSUB CLIENT SERVICED TIME : 24/Aug/2017 11:00:40
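The "sh: jobsub_submit: command not found" message in the output above is what a shell prints when a command is not on its PATH. A minimal sketch of a preflight check for that condition, using Python's standard shutil.which, follows. The helper name missing_commands, the placeholder directory, and the idea of running such a check in the DAG runner's environment are assumptions for illustration, not part of jobsub.

```python
import shutil


def missing_commands(commands, path=None):
    """Return the commands that cannot be found on PATH (or on `path` if given)."""
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


# Searching a directory that does not exist finds nothing, so the command
# is reported as missing. "/nonexistent/bin" is a placeholder path.
print(missing_commands(["jobsub_submit"], path="/nonexistent/bin"))
```

Called with path=None, the helper searches the PATH of the current process, which is the relevant one only if it runs in the same environment as the failing command.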
fifebatch-dev does work
-bash-4.1$ jobsub_submit_dag -G nova --role=Analysis --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=1h --disk=50MB --jobsub-server=fifebatch-dev.fnal.gov file:///nashome/s/sbhat/TestDAG/testdag.dag
/fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593
executing condor_submit_dag -dont_suppress_notification -append "+Owner=\"sbhat\"" -append "+AccountingGroup=\"group_nova.sbhat\"" -append "+JobsubJobID=\"\$(Cluster).\$(Process)@fife-jobsub-dev01.fnal.gov\"" /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag
Submitting job(s).
1 job(s) submitted to cluster 13756.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.condor.sub
Log of DAGMan debugging messages : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.dagman.out
Log of Condor library output : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.lib.out
Log of Condor library error messages : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.lib.err
Log of the life of condor_dagman itself : /fife/local/scratch/uploads/nova/sbhat/2017-08-24_110125.877419_5593/submit.20170824_110126.dag.dagman.log
-----------------------------------------------------------------------
JobsubJobId of first job: 13756.0@fife-jobsub-dev01.fnal.gov
Use job id 13756.0@fife-jobsub-dev01.fnal.gov to retrieve output
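For a scripted version of these DAG tests, a pass/fail check could look for the "JobsubJobId of first job" line in the client output. This is a sketch that assumes the output format shown in this comment; the function name first_job is hypothetical.

```python
import re

# Matches e.g. "JobsubJobId of first job: 13756.0@fife-jobsub-dev01.fnal.gov"
FIRST_JOB = re.compile(r"JobsubJobId of first job:\s*(\d+)\.(\d+)@(\S+)")


def first_job(output):
    """Return (cluster, process, schedd_host), or None if no job id was reported."""
    match = FIRST_JOB.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


# One line from a successful submission and one from a failed submission.
ok = "JobsubJobId of first job: 13756.0@fife-jobsub-dev01.fnal.gov"
failed = "JOBSUB SERVER RESPONSE CODE : 500 (Failed)"
print(first_job(ok))
print(first_job(failed))
```

A None result marks the submission as failed, as with the jobsub-dev attempt, and the returned job ID can be passed on to later steps such as fetching output.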
#7 Updated by Tanya Levshina over 3 years ago
- Status changed from Assigned to Resolved
- % Done changed from 0 to 100
Tests have been performed and we have transitioned to production. We still missed several HTCondor bugs (CPU utilization, memory reports, etc.). We have to include the new tests in the test suite.
#8 Updated by Tanya Levshina over 3 years ago
- Status changed from Resolved to Closed
#9 Updated by Tanya Levshina over 2 years ago
- Due date set to 11/01/2017