Debugging SAM job submissions

Here we walk through a job submission where something is wrong, and see how
to debug it.

First, get things set up:

novagpvm03$ typeset -f setup_nova
setup_nova ()
{
source /grid/fermiapp/nova/novaart/novasvn/srt/srt.sh;
export EXTERNALS=/nusoft/app/externals;
source $SRT_DIST/setup/setup_novasoft.sh "$@";
PRODUCTS=/grid/fermiapp/products/nova/db:$PRODUCTS;
setup ifdh_art v1_0_rc1 -q nu:e2:debug -k;
setup jobsub_tools
}
novagpvm03$ setup_nova

Then let's say we submit a job, but we misspell the name of the dataset
definition we created:

novagpvm03$ cat launch_borked
jobsub -g \
-r S12-12-12 \
-N 3 \
--dataset_definition=misspelled_dataset_name \
$IFDH_ART_DIR/bin/art_sam_wrap.sh \
-X nova \
--dest /nova/data/users/mengel \
--rename uniq \
--limit 1 \
-c /nova/app/users/anorman/NOVA-OFFLINE/cosmictrackjob.fcl
novagpvm03$ sh launch_borked
/nova/data/condor-tmp/mengel/art_sam_wrap.sh_20130117_112319_15623_1.dag
submitting....
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 7363441.
 
-----------------------------------------------------------------------
File for submitting this DAG to Condor : /nova/data/condor-tmp/mengel/art_sam_wrap.sh_20130117_112319_15623_1.dag.condor.sub
Log of DAGMan debugging messages : /nova/data/condor-tmp/mengel/art_sam_wrap.sh_20130117_112319_15623_1.dag.dagman.out
Log of Condor library output : /nova/data/condor-tmp/mengel/art_sam_wrap.sh_20130117_112319_15623_1.dag.lib.out
Log of Condor library error messages : /nova/data/condor-tmp/mengel/art_sam_wrap.sh_20130117_112319_15623_1.dag.lib.err
Log of the life of condor_dagman itself : /nova/data/condor-tmp/mengel/art_sam_wrap.sh_20130117_112319_15623_1.dag.dagman.log
-----------------------------------------------------------------------

Then we watch it "fizzle" with condor_q: the DAG starts, one job is queued, and shortly afterwards the queue is empty:

novagpvm03$ condor_q mengel
 
-- Submitter: gpsn01.fnal.gov : <131.225.67.70:57013> : gpsn01.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
7363441.0 mengel 1/17 11:23 0+00:01:52 R 0 7.3 condor_dagman
 
1 jobs; 0 idle, 1 running, 0 held
 
novagpvm03$ condor_q mengel
 
-- Submitter: gpsn01.fnal.gov : <131.225.67.70:57013> : gpsn01.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
7363441.0 mengel 1/17 11:23 0+00:01:52 R 0 7.3 condor_dagman
7363442.0 mengel 1/17 11:23 0+00:00:00 I 0 0.0 art_sam_wrap.sh_20
 
2 jobs; 1 idle, 1 running, 0 held
 
novagpvm03$ condor_q mengel
 
-- Submitter: gpsn01.fnal.gov : <131.225.67.70:57013> : gpsn01.fnal.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 
0 jobs; 0 idle, 0 running, 0 held
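
Rather than rerunning condor_q by hand, the polling can be scripted. This is only a minimal sketch: it assumes condor_q prints job lines that begin with a cluster.process ID, as in the output above. The function name wait_until_gone and the 30-second default interval are illustrative choices, not part of jobsub or Condor.

```shell
# Hypothetical helper: poll condor_q until the user has no jobs left in the queue.
wait_until_gone() {
    user=$1
    interval=${2:-30}   # seconds between polls (arbitrary default)
    # Job lines start with an ID such as "7363441.0"; header and summary lines do not.
    while condor_q "$user" | grep -q '^ *[0-9][0-9]*\.[0-9]'; do
        sleep "$interval"
    done
    echo "no jobs left in the queue for $user"
}
```

Called as "wait_until_gone mengel", it returns once the queue is empty. An empty queue only tells us the jobs are gone, not whether they succeeded, so the logs still need to be checked afterwards.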

So what went wrong? The place to look is the output of the sambegin script, which lands in $CONDOR_TMP alongside the DAG logs:

novagpvm03$ cd $CONDOR_TMP
novagpvm03$ ls -lt | head
total 2080
-rw-r--r-- 1 mengel nova 618 Jan 17 11:25 art_sam_wrap.sh_20130117_112319_15623_1.dag.dagman.log
-rw-r--r-- 1 mengel nova 29 Jan 17 11:25 art_sam_wrap.sh_20130117_112319_15623_1.dag.lib.out
-rw-r--r-- 1 mengel gpcf 6685 Jan 17 11:25 art_sam_wrap.sh_20130117_112319_15623_1.dag.dagman.out
-rw-r--r-- 1 mengel gpcf 951 Jan 17 11:25 art_sam_wrap.sh_20130117_112319_15623_1.dag.rescue001
-rw-r--r-- 1 mengel gpcf 626 Jan 17 11:25 art_sam_wrap.sh_20130117_112319_15623_1.dag.dot
-rw-r--r-- 1 mengel gpcf 642 Jan 17 11:25 sambegin-art_sam_wrap.sh_20130117_112319_15623.log
-rw-r--r-- 1 mengel gpcf 12826 Jan 17 11:25 sambegin-art_sam_wrap.sh_20130117_112319_15623.err
-rw-r--r-- 1 mengel gpcf 0 Jan 17 11:25 sambegin-art_sam_wrap.sh_20130117_112319_15623.out
-rw-r--r-- 1 mengel nova 0 Jan 17 11:23 art_sam_wrap.sh_20130117_112319_15623_1.dag.lib.err
novagpvm03$ tail sambegin-art_sam_wrap.sh_20130117_112319_15623.err
Error text is:
Definition 'misspelled_dataset_name' not found
 
+ ifdh startProject mengel-art_sam_wrap.sh_20130117_112319_15623 nova misspelled_dataset_name mengel nova
Exception:http://samweb.fnal.gov:8480/sam/nova/api/startProjectStatus: 404
Error text is:
Definition 'misspelled_dataset_name' not found

...and we see that SAM returned a 404 because no dataset definition matches the name we gave: we misspelled it.
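
The lookup of the newest sambegin error log can be wrapped in a small helper. The following is a minimal sketch, assuming the logs follow the sambegin-*.err naming seen in the listing and live in $CONDOR_TMP. The function name show_sambegin_errors is hypothetical, and the -A option assumes a grep that supports it (GNU and BSD grep do).

```shell
# Hypothetical helper: print the SAM error lines from the newest sambegin error log.
# Assumes logs are named sambegin-*.err, as in the directory listing.
show_sambegin_errors() {
    dir=${1:-$CONDOR_TMP}
    # ls -t sorts newest first; keep only the first match.
    newest=$(ls -t "$dir"/sambegin-*.err 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no sambegin-*.err files in $dir" >&2
        return 1
    fi
    echo "== $newest"
    # Show each "Exception:" / "Error text is:" line plus the line that follows it.
    grep -A 1 -e '^Exception:' -e '^Error text is:' "$newest"
}
```

Run with no arguments it searches $CONDOR_TMP; a directory can also be passed explicitly, for example "show_sambegin_errors /nova/data/condor-tmp/mengel".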