Olga Terlyga, 01/30/2013 02:10 PM


Batch Submissions

Contacts:

Getting Started

  • Job submission is done from minos25.fnal.gov under the username minospro. Make sure you can log in to that machine as yourself, as minfarm, and as minospro. The current contact person for access is Arthur Kreymer.
  • Set up a cron job in your personal crontab to renew the proxy needed for job submission, for example:
    07          1-23/2 * * * /usr/krb5/bin/kcron  /local/scratch25/grid/kproxy
    07          1-23/2 * * * /usr/krb5/bin/kcron  /local/scratch25/grid/kproxy_pro
    
  • Obtain permission to write output files to the /pnfs/minos area. The current contact person is Arthur Kreymer.
  • Update the list of submitters to include your username:
    /minos/data/minfarm/lists/.submitters
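
To confirm that the last step took effect, you can look for your username in the submitters list. The sketch below assumes the file holds one username per line, which this page does not state; the function name in_submitters is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if USER appears as a whole line in LIST.
# Assumption: the submitters list has one username per line.
in_submitters() {
    user=$1
    list=${2:-/minos/data/minfarm/lists/.submitters}
    [ -r "$list" ] && grep -qxF -- "$user" "$list"
}

# Check the current user against the default list.
# Nothing is printed if the list is unreadable or the user is absent.
if in_submitters "${USER:-$(id -un)}"; then
    echo "already listed as a submitter"
fi
```

Matching whole lines (-x) with a fixed string (-F) avoids false positives when one username is a prefix of another.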

.bashrc example for setup

Keep-up with current version

The current version is dogwood6. Keep-up runs daily and is used for calibrations. More accurate physics processing is done in larger batches, after calibration sign-off (see below).

Minos batch at a glance

http://nusoft.fnal.gov/minos/MinosBatch_AtAGlance/MinosBatch_atAGlance.html

Cron jobs

Daily keep-up cron jobs currently run under the minospro account. The main submission job is:

  04 07,15,23 * * * /grid/fermiapp/minos/minfarm/scripts/get_daq_submit.glide -v dogwood6 -F

Other active cron jobs are:

MAILTO='terlyga@fnal.gov,rubin@fnal.gov'

# Use this crontab for jobs that require the vanilla condor_q

# Note that it is the responsibility of the process to do a sufficient setup

# For safety, keep an updated copy of this crontab
  10 0 * * * /usr/bin/crontab -l > $HOME/cron-pro.minos25

# Clean out old logs, submits, and cores -- Run always
  20 22 * * * /grid/fermiapp/minos/minfarm/scripts/rm_logs.glide -F

Other cron jobs running on minos27 under user minospro:

MAILTO='terlyga@fnal.gov,rubin@fnal.gov'

# Use this crontab for jobs that don't require condor_q

# For safety, keep an updated copy of this crontab
  10 0 * * * /usr/bin/crontab -l > $HOME/cron-pro.minos27

# Copy logs to AFS
  55 10,22 * * * /grid/fermiapp/minos/minfarm/scripts/copy_logs

# Keep the good_runs, bad_runs, and farmsdb files up-to-date
  00-55/5 * * * * /grid/fermiapp/minos/minfarm/scripts/gather_runs

# And the same for mc
  02-57/5 * * * * /grid/fermiapp/minos/minfarm/scripts/gather_runs.mc

# Check that data is flowing from the detectors to pnfs
  02 00-22/4 * * * /grid/fermiapp/minos/minfarm/scripts/check_delivery

# Refresh mclist from mcin_data when new stuff is coming in
# 04 06,14,22 * * * /grid/fermiapp/minos/minfarm/scripts/get_multi_mc dogwood5 near daikon_07
# 04 02,10,18 * * * /grid/fermiapp/minos/minfarm/scripts/get_multi_mc dogwood5 far daikon_07
# 04 00,08,16 * * * /grid/fermiapp/minos/minfarm/scripts/get_multi_mc dogwood5 near daikon_08

# Manage the keep-up lists
  58 22 * * *   /grid/fermiapp/minos/minfarm/scripts/keepup_lists B
  10 23 * * Sun /grid/fermiapp/minos/minfarm/scripts/keepup_orphans

Submit scripts options

PRO> /grid/fermiapp/minos/minfarm/scripts/get_daq_submit.glide -h
Usage:  get_daq_submit.glide [-v VSN] [-V VSN2] [options]
Options -h   print this message
        -d   print debug information in analyze
        -g   use root compiled with only -g
        -O,o use root compiled with -g -O2
        -Q n use mysql server minos-$n -- -Q db1 is default
        -b   bypass bfield check -- will produce ERR 100 in analyze
        -n   process near detector only
        -f   process far detector only
        -a   add ATMOS processing
        -c   do COSMIC processing ONLY
        -s   do SPILL processing ONLY
             COSMIC and SPILL processing is the default
        -v   specify a version -- defaults to current_version
        -y   bypass field and beam checks and *do* pass b,B options to analyze
             Use in shutdown when chambers and db updates run = -Bbc
        -B   beam down -- -F and don't signal missing lists
        -F   bypass beam check and run cosmic only -- will produce ERR 101
        -G   bypass beam check and run both passes -- will produce ERR 101
        -L   don't report on missing lists -- used when testing
        -S   do *not* submit jobs, only update bookkeeping
        -T|X do *not* update tarfiles or delete daily list -- TEST MODE
        -Z   -S and don't write to datalist(s) -- supercedes -S
        -V V add nearlist and farlist to alternate datalist.$V

PRO> /grid/fermiapp/minos/minfarm/scripts/cron_submit.glide -h
Usage:   cron_submit.glide [-pn] [-asmoOACNM] [-t F|N] VSN Num_Jobs [List]
Options: -h   - print this list
         -d   - print debug information in analyze
         -D   - allow duplicate submissions
         -g   - use root compiled with only -g
         -O,o - use root compiled with -g -O2
         -Q n - use mysql server minos- -- -Q mysql1 is default
         -b   - override bfield check
         -B   - override beam check
         -p n - override pass check in submit_job and use pass n
         -t f - count only F(ar) or N(ear); default is both
         -m   - allow multiple passes
         -a   - add ATMOS processing
         -c   - do COSMIC processing ONLY
         -s   - do SPILL processing ONLY
The following are generally useful if -c or -s.  If none of these is
    specified, all output streams are written, i.e. '' = -CNM
         -A   - write all output streams (default)
         -C   - write cand output (includes bcnd for FD)
         -N   - write ntuple output (includes bntp for FD)
         -M   - write mrnt output (for spill pass)

PRO> /grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -h
Usage: cron_submit.mc.glide -v date -V date [-pn] [-dgmoOACNM] [-t f|n] VSN NumJobs [InList]
Options: -h   - print this list
         -d   - print debug information in ana_mc
         -m   - allow multiple passes
         -p n - override pass check in submit_job and use pass n
         -t f - count only f(ar), n(ear, F(mock), N(mock); default is all
         -g   - use root compiled with only -g
         -O,o - use root compiled with -g -O2
         -S   - special handling of subrun > 99
         -v s - string specifying start of time range: 'YYYY,MM,DD,hh,mm,ss'
         -V s - string specifying end of time range: 'YYYY,MM,DD,hh,mm,ss'
The following control output streams.  If none of these or -A
    is specified, all output streams are written, i.e. '' = -CNM
         -A   - write all output streams (default)
         -C   - write cand output
         -N   - write ntuple output
         -M   - write mrnt output

Location of output and log files

There are currently no log files for submissions. All output from the submission scripts is sent by mail, which is stored, for example, in

/var/spool/mail/terlyga

If you prefer to receive actual email, add a MAILTO line to the crontab that runs the job, for example
MAILTO='rubin@fnal.gov' 

To see running jobs, run
/grid/fermiapp/minos/minfarm/scripts/lj 

Output files are written to the cand_data directory by date. For example, for dogwood6 processing of near detector data collected in February 2012:

/pnfs/minos/reco_near/dogwood6/cand_data/2012-02/
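
The directory name follows a regular pattern: detector, version, data tier, then month. A minimal sketch that builds it, assuming the other tiers and the far detector use the same layout (as the reco_far paths later on this page suggest); reco_dir is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the dated output directory for a detector
# (near|far), reconstruction version, data tier and month (YYYY-MM).
reco_dir() {
    det=$1 vsn=$2 tier=$3 month=$4
    printf '/pnfs/minos/reco_%s/%s/%s_data/%s/\n' "$det" "$vsn" "$tier" "$month"
}

# Reproduces the example path above.
reco_dir near dogwood6 cand 2012-02
```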

Log files from grid processing are available while the job is running on the grid and after it finishes. They are written to

/minos/data/minfarm/logs

Twice a day log files are archived to (see active cronjob above)
/minos/data/users/minospro/FARMING

Old location:
/afs/fnal.gov/files/data/minos/farm_logs

Note that all the files are gzipped. (Please don't unzip them!) To look at them, use 'less' or copy them to another location first.
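
One way to read or search a gzipped log without unpacking it in place is to stream it through gzip -dc, which writes the decompressed text to standard output and leaves the archive untouched. The function names and the demo file below are illustrative only, not part of the farm scripts.

```shell
#!/bin/sh
# Page through a gzipped log without unpacking it on disk.
peek_log() {
    gzip -dc -- "$1" | less
}

# Search a gzipped log; prints matching lines with their line numbers.
grep_log() {
    pattern=$1 file=$2
    gzip -dc -- "$file" | grep -n -- "$pattern"
}

# Self-contained demo on a throwaway file (not a real farm log).
demo=$(mktemp)
printf 'job start\nERR 101 missing beam data\njob end\n' | gzip -c > "$demo.gz"
grep_log 'ERR' "$demo.gz"
rm -f "$demo" "$demo.gz"
```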

Runs that crashed will appear in the bad_runs file, for example

/minos/data/minfarm/lists/bad_runs.dogwood6

The error codes are as follows
   1: Input error, usually an srm problem -- rerun
   2: No output streams
   3: Unable to save an output stream -- dcache or farcat/nearcat -- rerun
   7: Unable to locate loon script -- rerun after adding script to tar
   8: Mysql server not available -- rerun
  15: No asciidb files -- configuration error -- probably obsolete
  90: Job runs extremely long without writing output -- killed by hand
  91: Do not process -- not in measurement list -- manual entry in bad_runs
      Should be caught as a suppressed run -- mostly used with atmos processing
  95: Temporary reassignment of 100 to allow flushing if not to be rerun
  96: Temporary reassignment of 101 to allow flushing if not to be rerun
  99: Job runs extremely long and writes massive output -- killed by hand
 100: Gaps in bfield database -- usually rerun after db update
 101: Gaps in beam spill database -- usually rerun after db update
 132: Illegal Instruction
 134: Invalid Data
 136: FPE
 137: Killed by system or user; rerun or manually change to 90 or 99
 139: SEGV
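
The codes above 128 follow the usual shell convention of 128 plus the number of the signal that ended the process (for example 139 = 128 + 11, SIGSEGV). The sketch below turns the table into a lookup; the function names are hypothetical and the suggested actions are taken only from the table above.

```shell
#!/bin/sh
# Hypothetical helper: suggest a follow-up for a bad_runs error code,
# based only on the table above.
error_action() {
    case $1 in
        1|3|8)   echo "rerun" ;;
        7)       echo "add loon script to tar, then rerun" ;;
        100|101) echo "rerun after db update" ;;
        137)     echo "rerun, or relabel as 90 or 99" ;;
        *)       echo "inspect by hand" ;;
    esac
}

# Codes above 128 are 128 + the number of the signal that ended the job.
signal_number() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo "not a signal exit" >&2
        return 1
    fi
}

error_action 101
signal_number 139
```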

Roundup (Concatenation)

Rashid Mehdiyev currently runs roundup under the minfarm account on minos27.fnal.gov. Roundup checks that files for all subruns of a given run are present; if any are missing, the run is not concatenated. Log files for roundup processing are stored in, for example,

~/ROUNTMP/LOG/2012-02/dogwood6near.log

To make a list of missing subruns from the last pass of roundup, run something like
cd /grid/fermiapp/minos/minfarm/scripts/
./pend2list d6 n

or
cd /grid/fermiapp/minos/minfarm/scripts/
./pend2list dogwood6 far

Unless the runs crashed because of missing beam or b-field data in the database, the missing files will appear in the lists
/minos/data/minfarm/lists/far_dogwood6.B
/minos/data/minfarm/lists/far_dogwood6.C

To include runs that crashed because of missing beam or b-field data, for example after the data have been filled into the database, run pend2list with the -k (keep) option
./pend2list -k dogwood6 far

Troubleshooting

To check whether a particular subrun (or list of subruns) has been processed with a particular version AND already concatenated, follow this example

minos25$ cat mmm.d4
F00047650_0004 2011-05
F00047670_0006 2011-05
F00047685_0010 2011-05
F00047692_0009 2011-06
F00047949_0016 2011-06
F00048191_0009 2011-08
F00048350_0005 2011-08

minos25$ while read r m; do sam_find -b c -t sntp -v d4 -m $m $r; done < mmm.d4
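
Before querying SAM for a long list, it can help to print the commands first and check them by eye. The sketch below does only that. The sam_find options are copied from the example above, while the function name and the default version d4 are assumptions for illustration.

```shell
#!/bin/sh
# Print (do not run) one sam_find command per "subrun month" line of a list.
# The sam_find options are copied from the example above; VSN defaults to d4.
print_sam_find() {
    list=$1 vsn=${2:-d4}
    while read -r run month; do
        [ -n "$run" ] || continue      # skip blank lines
        echo "sam_find -b c -t sntp -v $vsn -m $month $run"
    done < "$list"
}

# Demo on a two-line list in the same format as mmm.d4.
demo_list=$(mktemp)
printf 'F00047650_0004 2011-05\nF00047692_0009 2011-06\n' > "$demo_list"
print_sam_find "$demo_list"
rm -f "$demo_list"
```

Once the printed commands look right, they can be piped to sh in an environment where sam_find is set up.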

Duplicate files are possible. When a duplicate file arrives in near_cat (before roundup) and a copy already exists there, it is moved to

/minos/data/minfarm/neardet 

If a duplicate file arrives after roundup has already concatenated the first version of it, the roundup script finds the duplicate and moves it to

/minos/data/minfarm/DUP

Two routines, det2dcache and det2cat, clean up duplicates caught by analyze (while submitting jobs).

PRO> det2dcache
Usage:   det2dcache [-RNYnd] VSN f|n
Options: -N: Do NOT ask whether to replace non-zero files - DELETE LOCAL
         -Y: Do NOT ask whether to replace non-zero files - DO IT
         -d: Run srmcp with debug=true
         -n: Don't copy or delete -- just show what would be done

(-R is deprecated in favor of -N)

minos25$ det2cat
Usage:  det2cat [-n] VSN F|N
Option: -n - debug mode -- just show what would be done 

Running det2dcache requires the srm and sam setup. It should be run under the minospro account on a machine other than minos27. See the example .bashrc for the setup. You can first run
det2dcache -n d6 n

to see what would be done, and then run
det2dcache -N d6 n

to actually do the copies/deletions. Similarly for ntuples, run
det2cat -n d6 n

and then
det2cat d6 n 

Missing files

If a job ran successfully to completion on the grid but was unable to copy its file to /pnfs/minos (for example because of authorization problems or problems with pnfs), the run will appear in the good_runs.* list, and the file itself will be moved to the same directories as duplicate files:
/minos/data/minfarm/neardet or /minos/data/minfarm/fardet

The cleanup is the same as for duplicates.

To delete files from pnfs

If you need to permanently delete files from sam and pnfs for any reason (for example, if one of the systems fails and duplicates occur during concatenation), follow this example

ssh minospro@minos27
PRO> . /minos/app/app/OSG1/setup.sh
PRO> SRMV2_PATH="srm://fndca1:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos" 
PRO>  export X509_USER_PROXY=/minos/data/minfarm/.grid/minospro_proxy
PRO>  tdir=sntp
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> FILE=F00049284_0001.spill.sntp.dogwood6.0.root
PRO> SRM_DIR=$SRMV2_PATH/$DIR
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> FILE=F00049287_0001.spill.sntp.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO> FILE=F00049287_0001.cosmic.sntp.dogwood6.0.root
PRO>  tdir=mrnt
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO>  SRM_DIR=$SRMV2_PATH/$DIR
PRO>  FILE=F00049284_0001.spill.mrnt.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO>  FILE=F00049287_0001.spill.mrnt.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO>  tdir=.bntp
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO>  SRM_DIR=$SRMV2_PATH/$DIR
PRO>  FILE=F00049287_0001.spill.bntp.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
minos53$ sam undeclare F00049284_0001.spill.sntp.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.spill.sntp.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.cosmic.sntp.dogwood6.0.root
minos53$ ls /pnfs/minos/reco_far/dogwood6/
.bntp_data/ cand_data/  mrnt_data/  sntp_data/  
minos53$ ls /pnfs/minos/reco_far/dogwood6/
minos53$ sam undeclare F00049284_0001.spill.mrnt.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.spill.mrnt.dogwood6.0.root
minos53$ sam undeclare F00049287_0001.spill.bntp.dogwood6.0.root
PRO>  tdir=sntp
PRO> DIR=reco_far/dogwood6/${tdir}_data/2012-03
PRO> SRM_DIR=$SRMV2_PATH/$DIR
PRO>  FILE=F00049287_0001.cosmic.sntp.dogwood6.0.root
PRO> srmrm -2 $SRM_DIR/${FILE}
PRO>  cd  /minos/data/reco_far/dogwood6/sntp_data/2012-03/
PRO> for f in `ls -l | grep 0001 | awk '{print$9}'`; do echo rm $f; done
rm F00049284_0001.spill.sntp.dogwood6.0.root
rm F00049287_0001.cosmic.sntp.dogwood6.0.root
rm F00049287_0001.spill.sntp.dogwood6.0.root
PRO> rm F00049284_0001.spill.sntp.dogwood6.0.root
PRO> rm F00049287_0001.cosmic.sntp.dogwood6.0.root
PRO> rm F00049287_0001.spill.sntp.dogwood6.0.root
PRO>  cd  /minos/data/reco_far/dogwood6/mrnt_data/2012-03/
PRO> for f in `ls -l | grep 0001 | awk '{print$9}'`; do echo rm $f; done
rm F00049284_0001.spill.mrnt.dogwood6.0.root
rm F00049287_0001.spill.mrnt.dogwood6.0.root
PRO> rm F00049284_0001.spill.mrnt.dogwood6.0.root
PRO> rm F00049287_0001.spill.mrnt.dogwood6.0.root
PRO>  cd  /minos/data/reco_far/dogwood6/.bntp_data/2012-03/
PRO> for f in `ls -l | grep 0001 | awk '{print$9}'`; do echo rm $f; done
rm F00049287_0001.spill.bntp.dogwood6.0.root
PRO> rm F00049287_0001.spill.bntp.dogwood6.0.root
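
The session above repeats the same three steps for each file: build the SRM URL from the data tier and month, call srmrm, then undeclare the file in SAM. The sketch below prints those commands for one file so they can be reviewed before anything is deleted. It assumes that file names have the form RUN_SUBRUN.pass.tier.version.N.root, that a leading F means far and N means near, and that bntp files live under .bntp_data, as in the transcript. print_delete_cmds is a hypothetical name.

```shell
#!/bin/sh
SRMV2_PATH="srm://fndca1:8443/srm/managerv2?SFN=/pnfs/fnal.gov/usr/minos"

# Hypothetical helper: print the srmrm and sam undeclare commands for one file.
# Arguments: file name, month directory (YYYY-MM). Nothing is deleted here.
print_delete_cmds() {
    file=$1 month=$2
    case $file in
        F*) det=far ;;
        N*) det=near ;;
        *)  echo "unrecognised detector prefix: $file" >&2; return 1 ;;
    esac
    tier=$(echo "$file" | cut -d. -f3)   # third dot-separated field, e.g. sntp
    vsn=$(echo "$file" | cut -d. -f4)    # fourth field, e.g. dogwood6
    if [ "$tier" = bntp ]; then
        tier=.bntp                       # bntp is kept in a hidden directory
    fi
    echo "srmrm -2 $SRMV2_PATH/reco_$det/$vsn/${tier}_data/$month/$file"
    echo "sam undeclare $file"
}

print_delete_cmds F00049284_0001.spill.sntp.dogwood6.0.root 2012-03
```

Printing instead of executing keeps the destructive step explicit: the output can be checked and then pasted into a session where the proxy and SRM tools are set up.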

Job monitoring on grid

To see a list of running jobs type

/grid/fermiapp/minos/minfarm/scripts/lj 

To remove a job
condor_rm id#
OR
condor_rm -force jobid#

Physics Processing

Monte Carlo Requests Processing

Requests are submitted by email, for example

Files were generated with daikon 07 and  can be found in 
/pnfs/minos/mcin_data/near/daikon_07/L010185N_r1/8*/n113*

They should be reconstructed with dogwood5 and for this sample please use following validity dates
  2005-05-21 to 2006-02-25 

The file list is created by a cron job, get_multi_mc, which looks for new files in /pnfs/minos/mcin_data. It updates files in lists/mc and writes a measurement list of the form mclist_near.VSN, for example
/minos/data/minfarm/lists/mclist_near.dogwood5
minos25$ cat mclist_near.dogwood5
n11338020_0028_L010185N_D07_r1
n11338020_0029_L010185N_D07_r1
n11338020_0030_L010185N_D07_r1
n11338020_0031_L010185N_D07_r1
n11338020_0032_L010185N_D07_r1
n11338020_0033_L010185N_D07_r1
n11338021_0000_L010185N_D07_r1
n11338021_0001_L010185N_D07_r1
n11338021_0002_L010185N_D07_r1
n11338021_0003_L010185N_D07_r1
n11338021_0004_L010185N_D07_r1
n11338021_0005_L010185N_D07_r1
n11338021_0006_L010185N_D07_r1
n11338021_0007_L010185N_D07_r1
n11338021_0008_L010185N_D07_r1
n11338021_0009_L010185N_D07_r1
n11338021_0010_L010185N_D07_r1
n11338021_0011_L010185N_D07_r1
n11338021_0012_L010185N_D07_r1
n11338021_0013_L010185N_D07_r1
n11338021_0014_L010185N_D07_r1

The validity dates are passed to the loon script. They are given as -v start_date-time and -V end_date-time, most easily in the form 'YYYY,MM,DD,hh,mm,ss'. Here are some examples of MC submissions
# 06,26,46 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide dogwood3 600 mclist_near.dogwood5" 
  02-52/20 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -v '2005,05,21,0,0,0' -V '2006,02,25,23,59,59' dogwood5 400 mclist_near.dogwood5" 
# 06-56/10 * * * * /usr/krb5/bin/kcron "/grid/fermiapp/minos/minfarm/scripts/cron_submit.mc.glide -v '2011,06,01,0,0,0' -V '2011,06,30,23,59,59' dogwood5 450 mclist_far.NOVA" 
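
Requests usually give the validity range as plain dates, so they have to be converted to the comma-separated form. A minimal sketch of that conversion, assuming the range should cover the whole of the first and last day, as in the crontab examples above. validity_args is a hypothetical name.

```shell
#!/bin/sh
# Turn a start and end date (YYYY-MM-DD) into the -v/-V arguments,
# covering the whole of both days as in the crontab examples above.
validity_args() {
    start=$(echo "$1" | tr '-' ',')
    end=$(echo "$2" | tr '-' ',')
    printf '%s\n' "-v '$start,0,0,0' -V '$end,23,59,59'"
}

# The dates from the example request earlier on this page.
validity_args 2005-05-21 2006-02-25
```

The output can be pasted directly after cron_submit.mc.glide in the crontab line.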

The main run script is ana_mc, analogous to analyze.

New Version

Special Processing

Notes

source /grid/fermiapp/minos/minossoft/setup/setup_minossoft_FNALU.sh
setup sam

resolve_bfld_fail  -r  N00021656_0013