Minos production overview

Production workflow

Software releases

Work in progress!

Collecting data:

About the submission:

  • The submission runs on the minos51 machine (an SLF5 machine, as of June 2016).
  • The scripts run in a cron job that executes at 11:30 pm every day. The cron job uses the following arguments:

/grid/fermiapp/minos/minfarm/scripts/KEYGEAR_listbuilder.sh -v elm6 -k -O -q

  • The submission uses the minospro user and the proxy located at: /minos/app/home/minospro/grid/scratch/minospro.Production.proxy
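
The schedule described above can be written as a crontab entry. The following is only a sketch: the actual crontab line on minos51 is not reproduced on this page, so the variable names and the absence of any output redirection are assumptions. Only the time (11:30 pm daily), the script path and the arguments come from the text above.

```shell
# Sketch of a crontab entry matching the schedule described above.
# Crontab field order: minute hour day-of-month month day-of-week command.
# KEYGEAR and CRON_ENTRY are illustrative variable names, not part of the
# production setup.
KEYGEAR=/grid/fermiapp/minos/minfarm/scripts/KEYGEAR_listbuilder.sh
CRON_ENTRY="30 23 * * * $KEYGEAR -v elm6 -k -O -q"

# Print the entry so it can be reviewed before pasting it into `crontab -e`.
echo "$CRON_ENTRY"
```

The real entry may also redirect output to a log file or set up the environment first; check it with `crontab -l` as the minospro user.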

Two main scripts are involved in the submission:

/grid/fermiapp/minos/minfarm/scripts/KEYGEAR_listbuilder.sh
/grid/fermiapp/minos/minfarm/scripts/AMBROSIA_submit.sh

KEYGEAR_listbuilder.sh

If KEYGEAR determines that a subrun in the list is currently running, it will disregard it. Likewise, each subrun is checked against the good_runs and bad_runs lists in $LIST_DIR for the reconstruction of choice. If the subrun has already been successfully reconstructed, it will also be disregarded. If the subrun is found in the bad_runs list, that job will be submitted to the grid, but it will give a warning that you should remove that subrun from the bad_runs list.

If you provide a list of files, the script asks three logical questions for each file in the list:

  • Is this running? (If yes, don't submit.)
  • Did I run this successfully? (If yes, don't submit.)
  • Did I fail at running this? (If yes, submit but show a warning.)

After that, the script generates a corrected list, which is then used for the submission.
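
The three checks above can be sketched as a small shell function. This is not the code of KEYGEAR_listbuilder.sh: the function name filter_subruns, the one-subrun-per-line file format, and the idea of a plain "running" list file are all assumptions made for illustration.

```shell
# Sketch of the three per-subrun checks. Arguments (all hypothetical files,
# one subrun name per line):
#   $1 input list, $2 currently running, $3 good_runs, $4 bad_runs
# The corrected list goes to stdout, warnings go to stderr.
filter_subruns() {
    fs_list=$1; fs_running=$2; fs_good=$3; fs_bad=$4
    while IFS= read -r fs_subrun; do
        # Skip empty lines in the input list.
        [ -n "$fs_subrun" ] || continue
        # 1. Is this running? If yes, do not submit.
        if grep -qxF -- "$fs_subrun" "$fs_running"; then continue; fi
        # 2. Did it already run successfully? If yes, do not submit.
        if grep -qxF -- "$fs_subrun" "$fs_good"; then continue; fi
        # 3. Did it fail before? Submit anyway, but show a warning.
        if grep -qxF -- "$fs_subrun" "$fs_bad"; then
            echo "WARNING: $fs_subrun is in bad_runs; remove it from that list" >&2
        fi
        echo "$fs_subrun"
    done < "$fs_list"
}
```

Called as `filter_subruns input.list running.list good_runs bad_runs > corrected.list`, it would leave only the subruns that are neither running nor already reconstructed.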

Meaning of the script arguments (keep up)

-v: version of the reconstruction; "elm6" is the current one.
-k: marks this as a keep-up submission (if you use -k, you can't use -l).
-O, -o: use ROOT compiled in optimized mode.
-q: allow the use of a SAMWeb project during the submission.
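
For illustration, the flags listed above could be handled with the shell's getopts builtin, as in the sketch below. It does not reproduce the real option handling in KEYGEAR_listbuilder.sh: the function name, the variable names, and the assumption that -l takes a list file as its argument are hypothetical. Only the flag letters and the rule that -k and -l are mutually exclusive come from the list above.

```shell
# Sketch: parse the keep-up flags and enforce the -k / -l exclusion.
# Sets the (hypothetical) variables version, keepup, optimized, samweb, listfile.
parse_keygear_args() {
    version=""; keepup=0; optimized=0; samweb=0; listfile=""
    OPTIND=1   # reset so the function can be called more than once
    while getopts "v:kOoql:" opt; do
        case $opt in
            v)   version=$OPTARG ;;    # reconstruction version, e.g. elm6
            k)   keepup=1 ;;           # keep-up submission
            O|o) optimized=1 ;;        # optimized ROOT build
            q)   samweb=1 ;;           # allow a SAMWeb project
            l)   listfile=$OPTARG ;;   # assumed: explicit list of files
            *)   return 2 ;;           # unknown flag or missing argument
        esac
    done
    if [ "$keepup" -eq 1 ] && [ -n "$listfile" ]; then
        echo "error: -k and -l cannot be used together" >&2
        return 1
    fi
    return 0
}
```

With the cron arguments, `parse_keygear_args -v elm6 -k -O -q` sets version to elm6 and turns on the keep-up, optimized and SAMWeb switches.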

AMBROSIA_submit.sh

AMBROSIA_submit.sh is the actual reconstruction job running on the worker node.

Reconstruction job error codes

The Minos+ experiment has done a terrific job of clearly defining the error codes for failing reconstruction jobs. The list below is based on the output of the KEYGEAR script when executed with the -E flag (source /grid/fermiapp/minos/minfarm/scripts/KEYGEAR_listbuilder.sh -E).

Jobs
1: Input error, usually an srm problem -- rerun
2: No output streams
3: Unable to save an output stream -- dcache or farcat/nearcat -- rerun
4: Job-Restart problem -- The job was restarted and SAMWeb had already delivered all the files.
7: Unable to locate loon script -- rerun after adding script to tar
8: Mysql server not available -- rerun
15: No asciidb files -- configuration error -- probably obsolete
90: Job runs extremely long without writing output -- killed by hand
91: Do not process -- not in measurement list -- manual entry in bad_runs
    Should be caught as a suppressed run -- mostly used with atmos processing
95: Reassignment of 100 to allow roundup to flush if not to be rerun
96: Reassignment of 101 to allow roundup to flush if not to be rerun
99: Job runs extremely long and writes massive output -- killed by hand
100: Gaps in bfield database -- usually rerun after db update
101: Gaps in beam spill database -- usually rerun after db update
132: Illegal Instruction
134: Invalid Data
136: FPE
137: Killed by system or user; rerun or manually change to 90 or 99
139: SEGV
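
When going through many failed jobs, a small helper that maps an error code to a suggested action can be handy. The sketch below is only one way to group the codes above. The function name and the category labels are made up for illustration, and the grouping follows only the "rerun" hints in the list. Treating 132, 134, 136 and 139 as 128 plus a signal number follows the usual shell convention; whether the production jobs produce them that way is an assumption.

```shell
# Sketch: map a reconstruction job error code to a suggested action.
# Category labels are illustrative, not produced by KEYGEAR.
triage_exit_code() {
    code=$1
    case $code in
        1|3|8)           echo "rerun" ;;                # transient: srm, dcache, mysql
        7|100|101)       echo "rerun-after-fix" ;;      # add script to tar / update db
        137)             echo "rerun-or-reclassify" ;;  # rerun, or change to 90 or 99
        132|134|136|139) echo "crash-signal-$((code - 128))" ;;  # 128 + signal number
        *)               echo "inspect" ;;              # everything else: look at the log
    esac
}
```

For example, `triage_exit_code 100` prints rerun-after-fix, and `triage_exit_code 139` prints crash-signal-11 (signal 11 is SIGSEGV).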