h1. Batch Submissions

h2. Contacts:

* FNAL: Howard Rubin <rubin@fnal.gov>, Olga Terlyga <terlyga@fnal.gov>

h2. Getting Started

* Job submission is done from minos25.fnal.gov under the username minospro. Make sure you have access to that machine as yourself, as minfarm, and as minospro; the current contact person is Arthur Kreymer <kreymer@fnal.gov>.

* Set up grid access as described here
http://www-numi.fnal.gov/condor/
http://www-numi.fnal.gov/condor/proxy.html

* Set up a cron job in your personal crontab to renew the proxy needed for job submission, for example:
<pre>
07 1-23/2 * * * /usr/krb5/bin/kcron /local/scratch25/grid/kproxy
07 1-23/2 * * * /usr/krb5/bin/kcron /local/scratch25/grid/kproxy_pro
</pre>

* Obtain permissions to write output files to the /pnfs/minos area; the current contact person is Arthur Kreymer <kreymer@fnal.gov>.
* Update the list of submitters to include your username:
<pre>/minos/data/minfarm/lists/.submitters</pre>
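A minimal sketch for that last step, assuming .submitters is a plain one-username-per-line file (an assumption; check the existing format first):
<pre>
# Hedged example: look at the current list, then append your username if it is missing
cat /minos/data/minfarm/lists/.submitters
grep -q "^$USER$" /minos/data/minfarm/lists/.submitters || echo "$USER" >> /minos/data/minfarm/lists/.submitters
</pre>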

h3. [[bashrc|.bashrc example for setup]]

h2. Keep-up with current version

The current version is dogwood6. Keep-up is running daily; keep-up processing is used for calibrations. More accurate physics processing is done in larger batches, after calibration sign-off (see below).

h3. *Minos batch at a glance*

http://nusoft.fnal.gov/minos/MinosBatch_AtAGlance/MinosBatch_atAGlance.html

h3. *Cron jobs*

Daily keep-up cron jobs are currently running under the minospro account. The main submission job is:

<pre>
04 07,15,23 * * * /grid/fermiapp/minos/minfarm/scripts/get_daq_submit.glide -v dogwood6 -F
</pre>

Other active cron jobs are:
<pre>
MAILTO='terlyga@fnal.gov,rubin@fnal.gov'

# Use this crontab for jobs that require the vanilla condor_q

# Note that it is the responsibility of the process to do a sufficient setup

# For safety, keep an updated copy of this crontab
10 0 * * * /usr/bin/crontab -l > $HOME/cron-pro.minos25

# Clean out old logs, submits, and cores -- Run always
20 22 * * * /grid/fermiapp/minos/minfarm/scripts/rm_logs.glide -F

</pre>

Other cron jobs running on minos25 under user minospro:
<pre>
MAILTO='terlyga@fnal.gov,rubin@fnal.gov'

# Use this crontab for jobs that don't require condor_q

# For safety, keep an updated copy of this crontab
10 0 * * * /usr/bin/crontab -l > $HOME/cron-pro.minos27

# Copy logs to AFS
55 10,22 * * * /grid/fermiapp/minos/minfarm/scripts/copy_logs

# Keep the good_runs, bad_runs, and farmsdb files up-to-date
00-55/5 * * * * /grid/fermiapp/minos/minfarm/scripts/gather_runs

# And the same for mc
02-57/5 * * * * /grid/fermiapp/minos/minfarm/scripts/gather_runs.mc

# Check that data is flowing from the detectors to pnfs
02 00-22/4 * * * /grid/fermiapp/minos/minfarm/scripts/check_delivery

# Refresh mclist from mcin_data when new stuff is coming in
# 04 06,14,22 * * * /grid/fermiapp/minos/minfarm/scripts/get_multi_mc dogwood5 near daikon_07
# 04 02,10,18 * * * /grid/fermiapp/minos/minfarm/scripts/get_multi_mc dogwood5 far daikon_07
# 04 00,08,16 * * * /grid/fermiapp/minos/minfarm/scripts/get_multi_mc dogwood5 near daikon_08

# Manage the keep-up lists
58 22 * * * /grid/fermiapp/minos/minfarm/scripts/keepup_lists B
10 23 * * Sun /grid/fermiapp/minos/minfarm/scripts/keepup_orphans
</pre>

h3. *Submit script options*

<pre>
PRO> /grid/fermiapp/minos/minfarm/scripts/get_daq_submit.glide -h
Usage: get_daq_submit.glide [-v VSN] [-V VSN2] [options]
Options -h print this message
-d print debug information in analyze
-g use root compiled with only -g
-O,o use root compiled with -g -O2
-Q n use mysql server minos-$n -- -Q db1 is default
-b bypass bfield check -- will produce ERR 100 in analyze
-n process near detector only
-f process far detector only
-a add ATMOS processing
-c do COSMIC processing ONLY
-s do SPILL processing ONLY
COSMIC and SPILL processing is the default
-v specify a version -- defaults to current_version
-y bypass field and beam checks and *do* pass b,B options to analyze
Use in shutdown when chambers and db updates run = -Bbc
-B beam down -- -F and don't signal missing lists
-F bypass beam check and run cosmic only -- will produce ERR 101
-G bypass beam check and run both passes -- will produce ERR 101
-L don't report on missing lists -- used when testing
-S do *not* submit jobs, only update bookkeeping
-T|X do *not* update tarfiles or delete daily list -- TEST MODE
-Z -S and don't write to datalist(s) -- supercedes -S
-V V add nearlist and farlist to alternate datalist.$V

PRO> /grid/fermiapp/minos/minfarm/scripts/cron_submit.glide -h
Usage: cron_submit.glide [-pn] [-asmoOACNM] [-t F|N] VSN Num_Jobs [List]
Options: -h - print this list
-d - print debug information in analyze
-D - allow duplicate submissions
-g - use root compiled with only -g
-O,o - use root compiled with -g -O2
-Q n - use mysql server minos- -- -Q mysql1 is default
-b - override bfield check
-B - override beam check
-p n - override pass check in submit_job and use pass n
-t f - count only F(ar) or N(ear); default is both
-m - allow multiple passes
-a - add ATMOS processing
-c - do COSMIC processing ONLY
-s - do SPILL processing ONLY
The following are generally useful if -c or -s. If none of these is
specified, all output streams are written, i.e. '' = -CNM
-A - write all output streams (default)
-C - write cand output (includes bcnd for FD)
-N - write ntuple output (includes bntp for FD)
-M - write mrnt output (for spill pass)

PRO> /grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -h
Usage: cron_submit.mc.glide -v date -V date [-pn] [-dgmoOACNM] [-t f|n] VSN NumJobs [InList]
Options: -h - print this list
-d - print debug information in ana_mc
-m - allow multiple passes
-p n - override pass check in submit_job and use pass n
-t f - count only f(ar), n(ear, F(mock), N(mock); default is all
-g - use root compiled with only -g
-O,o - use root compiled with -g -O2
-S - special handling of subrun > 99
-v s - string specifying start of time range: 'YYYY,MM,DD,hh,mm,ss'
-V s - string specifying end of time range: 'YYYY,MM,DD,hh,mm,ss'
The following control output streams. If none of these or -A
is specified, all output streams are written, i.e. '' = -CNM
-A - write all output streams (default)
-C - write cand output
-N - write ntuple output
-M - write mrnt output

</pre>

h3. *Location of output and log files*

There are currently no log files for submissions; all output from the submission scripts is sent to mail, which is stored, for example, in
<pre>
/var/spool/mail/terlyga
</pre>
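To read the accumulated cron mail directly from the spool, a pager is enough (a sketch; substitute the appropriate username):
<pre>
# Hedged example: browse the cron mail for the current account
less /var/spool/mail/$USER
</pre>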
If you prefer to receive actual email, you may add a MAILTO line to the crontab that runs the job, for example
<pre>
MAILTO='rubin@fnal.gov'
</pre>
To see running jobs, run
<pre>
/grid/fermiapp/minos/minfarm/scripts/lj
</pre>

Output files are written to the cand_data directory, organized by date. For example, for dogwood6 processing of near detector data collected in February 2012:
<pre>
/pnfs/minos/reco_near/dogwood6/cand_data/2012-02/
</pre>

Log files from grid processing are available while (and after) the job is running on the grid; they are written to
<pre>
/minos/data/minfarm/logs
</pre>
Twice a day, log files are archived (see the copy_logs cron job above) to
<pre>
/minos/data/users/minospro/FARMING
</pre>
Old location:
<pre>
/afs/fnal.gov/files/data/minos/farm_logs
</pre>
Note that all the files are gzipped. (Please don't unzip them!) To look at them, use 'less' or copy them to another location first.
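For example (a sketch; the file name is illustrative, and 'zless' works if plain 'less' does not decompress transparently on the node you are on):
<pre>
# Hedged example: view a gzipped grid log in place
less /minos/data/users/minospro/FARMING/2012-02/dogwood6near.log.gz

# or copy it to a writable area and work on the copy
cp /minos/data/users/minospro/FARMING/2012-02/dogwood6near.log.gz /tmp/
</pre>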

Runs that crashed will appear in the bad_runs file, for example
<pre>
/minos/data/minfarm/lists/bad_runs.dogwood6
</pre>
The error codes are as follows
<pre>
1: Input error, usually an srm problem -- rerun
2: No output streams
3: Unable to save an output stream -- dcache or farcat/nearcat -- rerun
7: Unable to locate loon script -- rerun after adding script to tar
8: Mysql server not available -- rerun
15: No asciidb files -- configuration error -- probably obsolete
90: Job runs extremely long without writing output -- killed by hand
91: Do not process -- not in measurement list -- manual entry in bad_runs
Should be caught as a suppressed run -- mostly used with atmos
processing
95: Temporary reassignment of 100 to allow flushing if not to be rerun
96: Temporary reassignment of 101 to allow flushing if not to be rerun
99: Job runs extremely long and writes massive output -- killed by hand
100: Gaps in bfield database -- usually rerun after db update
101: Gaps in beam spill database -- usually rerun after db update
132: Illegal Instruction
134: Invalid Data
136: FPE
137: Killed by system or user; rerun or manually change to 90 or 99
139: SEGV
</pre>
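To see which runs failed with a given code, a grep over the bad_runs file is usually enough (a sketch; it assumes the error code appears as a separate whitespace-delimited field on each line, so check the actual file format first):
<pre>
# Hedged example: list bad_runs entries that failed with error 100 (bfield gaps)
grep -w 100 /minos/data/minfarm/lists/bad_runs.dogwood6
</pre>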

h3. Roundup (Concatenation)

Rashid Mehdiyev <rmehdi@fnal.gov> is currently running roundup under the minospro account on minos27.fnal.gov. Roundup checks that files for all subruns from a given run are present; if any are missing, the run will not be concatenated. Log files for roundup processing are stored in, for example,
<pre>
~/ROUNDUP/LOG/2012-02/dogwood6near.log
</pre>
To make a list of subruns missing from the last pass of roundup, you can run something like
<pre>
cd /grid/fermiapp/minos/minfarm/scripts/
./pend2list d6 n
</pre>
or
<pre>
cd /grid/fermiapp/minos/minfarm/scripts/
./pend2list dogwood6 far
</pre>
Unless runs have crashed due to missing beam or b-field data in the database, the missing files will appear in the lists
<pre>
/minos/data/minfarm/lists/far_dogwood6.B
/minos/data/minfarm/lists/far_dogwood6.C
</pre>
To include runs that have crashed due to missing beam or b-field data in the database (for example, after the data have been filled into the database), run pend2list with the -k (keep) option:
<pre>
./pend2list -k dogwood6 far
</pre>

h3. Troubleshooting

To check whether a particular subrun (or list of subruns) has been processed with a particular version AND already concatenated, follow this example:
<pre>
minos25$ cat mmm.d4
F00047650_0004 2011-05
F00047670_0006 2011-05
F00047685_0010 2011-05
F00047692_0009 2011-06
F00047949_0016 2011-06
F00048191_0009 2011-08
F00048350_0005 2011-08

minos25$ while read r m; do sam_find -b c -t sntp -v d4 -m $m $r; done < mmm.d4
</pre>

There is a possibility of duplicate files. When a duplicate file comes into near_cat (before roundup) and it already exists in near_cat, it is moved to
<pre>
/minos/data/minfarm/neardet
</pre>

If a duplicate file comes in after roundup has already concatenated the first version of it, the duplicate is found by the roundup script and moved to
<pre>
/minos/data/minfarm/DUP
</pre>
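For a quick look at what has accumulated in the duplicate areas (a sketch):
<pre>
# Hedged example: most recent arrivals in DUP, and how much is sitting in the near-detector holding area
ls -lt /minos/data/minfarm/DUP | head
ls /minos/data/minfarm/neardet | wc -l
</pre>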

There are two routines that clean up duplicates caught by analyze (while submitting jobs): det2dcache and det2cat.
<pre>
PRO> det2dcache
Usage: det2dcache [-RNYnd] VSN f|n
Options: -N: Do NOT ask whether to replace non-zero files - DELETE LOCAL
-Y: Do NOT ask whether to replace non-zero files - DO IT
-d: Run srmcp with debug=true
-n: Don't copy or delete -- just show what would be done

(-R is deprecated in favor of -N)

minos25$ det2cat
Usage: det2cat [-n] VSN F|N
Option: -n - debug mode -- just show what would be done
</pre>
Running det2dcache requires the srm and sam setup; it should be run under the minospro account and on a machine other than minos27. See the [[bashrc|.bashrc]] example for setup. You may run
<pre>
det2dcache -n d6 n
</pre>
to see what would be done, and then run
<pre>
det2dcache -N d6 n
</pre>
to actually do the copies/deletions. Similarly for ntuples, run
<pre>
det2cat -n d6 n
</pre>
and then
<pre>
det2cat d6 n
</pre>

h4. Missing files

If a job ran successfully to completion on the grid but was unable to copy its output file to /pnfs/minos (for example, due to authorization problems or problems with pnfs), the run will appear in the good_runs.* list and the actual file will be moved to the same directories as duplicate files. The cleanup is the same as for duplicates.
<pre>
/minos/data/minfarm/neardet or /minos/data/minfarm/fardet
</pre>
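To confirm that a stranded run really did finish and only failed at the copy stage, check for it in the good_runs list and in the holding area (a sketch; the run name is illustrative, and the good_runs file name assumes the same version suffix as bad_runs):
<pre>
# Hedged example: the run should appear in good_runs while its file sits in the holding directory
grep F00049287 /minos/data/minfarm/lists/good_runs.dogwood6
ls /minos/data/minfarm/fardet | grep F00049287
</pre>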

h4. To delete files from pnfs

If you need to permanently delete files from sam and pnfs for any reason (for example, if any of the systems fail and duplicates occur during concatenation), use this example
<pre>
ssh minospro@minos27
PRO> . /minos/app/app/OSG1/setup.sh
PRO> SRMV2_PATH="srm://fndca1:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos"
PRO> export X509_USER_PROXY=/minos/data/minfarm/.grid/minospro_proxy
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> tdir=sntp
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> FILE=F00049284_0001.spill.sntp.dogwood6.0.root
PRO> SRM_DIR=$SRMV2_PATH/$DIR
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> FILE=F00049287_0001.spill.sntp.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> FILE=F00049287_0001.cosmic.sntp.dogwood6.0.root
PRO> tdir=mrnt
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> SRM_DIR=$SRMV2_PATH/$DIR
PRO> FILE=F00049284_0001.spill.mrnt.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> FILE=F00049287_0001.spill.mrnt.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> tdir=.bntp
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> SRM_DIR=$SRMV2_PATH/$DIR
PRO> FILE=F00049287_0001.spill.bntp.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
minos53$ sam undeclare F00049284_0001.spill.sntp.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.spill.sntp.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.cosmic.sntp.dogwood6.0.root
minos53$ ls /pnfs/minos/reco_far/dogwood6/
.bntp_data/ cand_data/ mrnt_data/ sntp_data/
minos53$ ls /pnfs/minos/reco_far/dogwood6/
minos53$ sam undeclare F00049284_0001.spill.mrnt.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.spill.mrnt.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.spill.bntp.dogwood6.0.root
PRO> tdir=sntp
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> SRM_DIR=$SRMV2_PATH/$DIR
PRO> FILE=F00049287_0001.cosmic.sntp.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> cd /minos/data/reco_far/dogwood6/sntp_data/2012-03/
PRO> for f in `ls -l | grep 0001 | awk '{print$9}'`; do echo rm $f; done
rm F00049284_0001.spill.sntp.dogwood6.0.root
rm F00049287_0001.cosmic.sntp.dogwood6.0.root
rm F00049287_0001.spill.sntp.dogwood6.0.root
PRO> rm F00049284_0001.spill.sntp.dogwood6.0.root
PRO> rm F00049287_0001.cosmic.sntp.dogwood6.0.root
PRO> rm F00049287_0001.spill.sntp.dogwood6.0.root
PRO> cd /minos/data/reco_far/dogwood6/mrnt_data/2012-03/
PRO> for f in `ls -l | grep 0001 | awk '{print$9}'`; do echo rm $f; done
rm F00049284_0001.spill.mrnt.dogwood6.0.root
rm F00049287_0001.spill.mrnt.dogwood6.0.root
PRO> rm F00049284_0001.spill.mrnt.dogwood6.0.root
PRO> rm F00049287_0001.spill.mrnt.dogwood6.0.root
PRO> cd /minos/data/reco_far/dogwood6/.bntp_data/2012-03/
PRO> for f in `ls -l | grep 0001 | awk '{print$9}'`; do echo rm $f; done
rm F00049287_0001.spill.bntp.dogwood6.0.root
PRO> rm F00049287_0001.spill.bntp.dogwood6.0.root
</pre>

h3. Job monitoring on grid

To see a list of running jobs type
<pre>
/grid/fermiapp/minos/minfarm/scripts/lj
</pre>
To remove a job
<pre>
condor_rm id#
OR
condor_rm -force jobid#
</pre>
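If you want more detail than the lj wrapper gives, plain condor_q works as well (a sketch; some of the cron jobs above rely on the vanilla condor_q being available):
<pre>
# Hedged example: show the queue for the production account, then analyze one specific job
condor_q -submitter minospro
condor_q -better-analyze jobid#
</pre>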

h2. Physics Processing

When calibrations are done, a decision is made to process a certain time period with dogwood7. You need to make a list of files from the list archives and put it in the lists directory, then add cron_submit to the crontab, for example

<pre>
04-58/10 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.glide dogwood7 300 march.dogwood7"
</pre>
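Before adding the cron_submit line, it is worth sanity-checking the list (a sketch; it assumes the lists directory is /minos/data/minfarm/lists as used elsewhere on this page, and 'march.dogwood7' is illustrative):
<pre>
# Hedged example: confirm the list is in place and looks like run/subrun names
cd /minos/data/minfarm/lists
wc -l march.dogwood7     # number of subruns to be submitted
head -3 march.dogwood7   # entries like F00047650_0004 or N00021656_0013
</pre>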

h2. Monte Carlo Requests Processing

Requests are submitted by email, for example
<pre>
Files were generated with daikon 07 and can be found in
/pnfs/minos/mcin_data/near/daikon_07/L010185N_r1/8*/n113*

They should be reconstructed with dogwood5 and for this sample please use following validity dates
2005-05-21 to 2006-02-25
</pre>

The file list is created by the get_multi_mc cron job, which looks for new files in /pnfs/minos/mcin_data (see the commented-out get_multi_mc lines in the crontab above). It updates the files in lists/mc and writes a measurement list of the form
mclist_near.VSN, for example
<pre>
/minos/data/minfarm/lists/mclist_near.dogwood5
minos25$ cat mclist_near.dogwood5
n11338020_0028_L010185N_D07_r1
n11338020_0029_L010185N_D07_r1
n11338020_0030_L010185N_D07_r1
n11338020_0031_L010185N_D07_r1
n11338020_0032_L010185N_D07_r1
n11338020_0033_L010185N_D07_r1
n11338021_0000_L010185N_D07_r1
n11338021_0001_L010185N_D07_r1
n11338021_0002_L010185N_D07_r1
n11338021_0003_L010185N_D07_r1
n11338021_0004_L010185N_D07_r1
n11338021_0005_L010185N_D07_r1
n11338021_0006_L010185N_D07_r1
n11338021_0007_L010185N_D07_r1
n11338021_0008_L010185N_D07_r1
n11338021_0009_L010185N_D07_r1
n11338021_0010_L010185N_D07_r1
n11338021_0011_L010185N_D07_r1
n11338021_0012_L010185N_D07_r1
n11338021_0013_L010185N_D07_r1
n11338021_0014_L010185N_D07_r1
</pre>
The validity dates are passed to the loon script. They are -v start_date-time and -V end_date-time, most easily in the form 'YYYY,MM,DD,hh,mm,ss'. Here are some examples of MC submissions:
<pre>
# 06,26,46 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide dogwood3 600 mclist_near.dogwood5"
02-52/20 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -v '2005,05,21,0,0,0' -V '2006,02,25,23,59,59' dogwood5 400 mclist_near.dogwood5"
# 06-56/10 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -v '2011,06,01,0,0,0' -V '2011,06,30,23,59,59' dogwood5 450 mclist_far.NOVA"
</pre>
The main run script is ana_mc, analogous to analyze.

One actual example:
<pre>
* Running MC request

A new CosmicMu sample is ready to reco. It is daikon07 far detector
sample and it was alread transfered to
/pnfs/minos/mcin_data/far/daikon_07/CosmicMu

* Running

ssh minos25
bash
minos25$ get_multi_mc dogwood5 far daikon_07
far daikon_07 AtmosNu: old and new have 7931 entries.
far daikon_07 CosmicLE: old and new have 120 entries.
far daikon_07 CosmicMu: new list mc/far_daikon_07_CosmicMu has 880 entries.
It has also been copied to /minos/data/minfarm/lists/new_far_daikon_07_CosmicMu
Appending new_far_daikon_07_CosmicMu to mclist_far.dogwood5 and
removing new_far_daikon_07_CosmicMu
far daikon_07 L010185N_r1: old and new have 2291 entries.
far daikon_07 L010185N_r2: old and new have 1320 entries.
far daikon_07 L010185N_r3: old and new have 3127 entries.
far daikon_07 L010185R: only empty groups.
far daikon_07 L010185R_r4: old and new have 4300 entries.
far daikon_07 L100200N_r7: old and new have 261 entries.
far daikon_07 L100200R_r7: old and new have 264 entries.
far daikon_07 L250200N_r1: old and new have 1144 entries.
Check far daikon_07 L250200N_r2: blocked by NORECO
far daikon_07 L250200N_r7: old and new have 258 entries.
far daikon_07 L250200R_r7: old and new have 263 entries.

cd lists
mv mclist_far.dogwood5 far_daikon_07_CosmicMu.dogwood5
crontab -e

12-52/20 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -v '2007,11,19,0,0,0' -V '2009,06,12,23,59,59' dogwood5 250 far_daikon_07_CosmicMu.dogwood5"
</pre>

h2. New Version

h2. Special Processing

h2. Notes

If files crashed with error 100 or 101 (missing beam or bfield data), the database might have been updated since; to check whether the data is now there, do the following. If the data is there now, remove the run from bad_runs and resubmit.

To check if bfield was on for a particular run
<pre>
source /grid/fermiapp/minos/minossoft/setup/setup_minossoft_FNALU.sh
setup sam

resolve_bfld_fail -r N00021656_0013
</pre>

To check if beam was on for a particular run
<pre>
source /grid/fermiapp/minos/minossoft/setup/setup_minossoft_FNALU.sh
setup sam

resolve_beam_fail -r N00021656_0013
</pre>
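If resolve_bfld_fail / resolve_beam_fail show the data is now present, one way to take the run out of bad_runs before resubmitting is (a sketch; keep a backup, and the file name and run are illustrative):
<pre>
# Hedged example: drop a now-recovered run from the bad_runs list
cd /minos/data/minfarm/lists
cp bad_runs.dogwood6 bad_runs.dogwood6.bak
grep -v N00021656_0013 bad_runs.dogwood6.bak > bad_runs.dogwood6
</pre>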