Minos production overview
Work in progress!
About the submission:
- The submission runs on the minos51 machine (SLF5 machine, June 2016).
- The scripts run in a cron job that is executed at 11:30 pm every day. The cron job runs the following command with these arguments:
/grid/fermiapp/minos/minfarm/scripts/KEYGEAR_listbuilder.sh -v elm6 -k -O -q
- The submission uses the minospro user and the proxy located at: /minos/app/home/minospro/grid/scratch/minospro.Production.proxy
Two main scripts are involved in the submission: KEYGEAR_listbuilder.sh and AMBROSIA_submit.sh.
If KEYGEAR determines that a subrun in the list is currently running, it disregards that subrun. Likewise, each subrun is checked against the good_runs and bad_runs lists in $LIST_DIR for the chosen reconstruction. If the subrun has already been successfully reconstructed, it is also disregarded. If the subrun is found in the bad_runs list, the job is still submitted to the grid, but KEYGEAR warns that you should remove that subrun from the bad_runs list.
If you provide a list of files, the script asks three yes/no questions for each file in the list:
- Is this running? (if yes, don't submit)
- Did I run this successfully? (if yes, don't submit)
- Did I fail at running this? (if yes, submit but show a warning)
After that, the script generates a corrected list, which is used for the submission.
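The three checks above can be sketched in Python as follows. This is only an illustration of the selection logic, not the KEYGEAR implementation: the function name `build_submission_list`, the use of in-memory sets for the running jobs and for the good_runs and bad_runs lists, and the subrun names in the demo are all assumptions made for the example.

```python
def build_submission_list(subruns, running, good_runs, bad_runs):
    """Return (to_submit, warnings) for a list of candidate subruns.

    running, good_runs and bad_runs are sets of subrun names
    (hypothetical stand-ins for the real bookkeeping).
    """
    to_submit = []
    warnings = []
    for subrun in subruns:
        if subrun in running:
            # Is this running? If yes, don't submit.
            continue
        if subrun in good_runs:
            # Did I run this successfully? If yes, don't submit.
            continue
        if subrun in bad_runs:
            # Did I fail at running this? Submit, but show a warning.
            warnings.append(
                f"{subrun} is in bad_runs; remove it from the bad_runs list"
            )
        to_submit.append(subrun)
    return to_submit, warnings


if __name__ == "__main__":
    # Made-up subrun names, for illustration only.
    candidates = ["run1_0001", "run1_0002", "run1_0003", "run1_0004"]
    corrected, notes = build_submission_list(
        candidates,
        running={"run1_0001"},
        good_runs={"run1_0002"},
        bad_runs={"run1_0003"},
    )
    print("corrected list:", corrected)
    for note in notes:
        print("WARNING:", note)
```

The order of the checks matters in this sketch: a subrun that is running or already reconstructed is dropped before the bad_runs check, so a warning is only produced for subruns that are actually submitted.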
Meaning of the script arguments (keep up)
-v: version of the reconstruction; "elm6" is the current one.
-k: marks this as a keep-up submission (if you use -k, you can't use -l).
-O, -o: use ROOT compiled in optimized mode.
-q: allow the use of a SAMWeb project during the submission.
AMBROSIA_submit.sh is the actual reconstruction job that runs on the worker node.
Reconstruction job error codes
The Minos+ experiment has done a terrific job of clearly defining the error codes for failing reconstruction jobs. They are listed below. This list is based on the information printed by the KEYGEAR script when it is executed with the -E flag (source /grid/fermiapp/minos/minfarm/scripts/KEYGEAR_listbuilder.sh -E):
1: Input error, usually an srm problem -- rerun
2: No output streams
3: Unable to save an output stream -- dcache or farcat/nearcat -- rerun
4: Job-Restart problem -- The job was restarted and SAMWeb had already delivered all the files.
7: Unable to locate loon script -- rerun after adding script to tar
8: Mysql server not available -- rerun
15: No asciidb files -- configuration error -- probably obsolete
90: Job runs extremely long without writing output -- killed by hand
91: Do not process -- not in measurement list -- manual entry in bad_runs
    Should be caught as a suppressed run -- mostly used with atmos processing
95: Reassignment of 100 to allow roundup to flush if not to be rerun
96: Reassignment of 101 to allow roundup to flush if not to be rerun
99: Job runs extremely long and writes massive output -- killed by hand
100: Gaps in bfield database -- usually rerun after db update
101: Gaps in beam spill database -- usually rerun after db update
132: Illegal Instruction
134: Invalid Data
137: Killed by system or user; rerun or manually change to 90 or 99
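When triaging failed jobs, the list above can be kept as a small lookup table. The following Python sketch shows one way to do that. The names `ERROR_CODES`, `RERUN_CODES`, `describe` and `suggests_rerun` are hypothetical, and the set of rerun codes is only a reading of the descriptions above (codes whose text recommends a rerun, sometimes after a fix such as a database update), not something taken from the KEYGEAR script itself.

```python
# Descriptions copied from the error code list above.
ERROR_CODES = {
    1: "Input error, usually an srm problem -- rerun",
    2: "No output streams",
    3: "Unable to save an output stream -- dcache or farcat/nearcat -- rerun",
    4: "Job-Restart problem -- The job was restarted and SAMWeb had already "
       "delivered all the files.",
    7: "Unable to locate loon script -- rerun after adding script to tar",
    8: "Mysql server not available -- rerun",
    15: "No asciidb files -- configuration error -- probably obsolete",
    90: "Job runs extremely long without writing output -- killed by hand",
    91: "Do not process -- not in measurement list -- manual entry in bad_runs",
    95: "Reassignment of 100 to allow roundup to flush if not to be rerun",
    96: "Reassignment of 101 to allow roundup to flush if not to be rerun",
    99: "Job runs extremely long and writes massive output -- killed by hand",
    100: "Gaps in bfield database -- usually rerun after db update",
    101: "Gaps in beam spill database -- usually rerun after db update",
    132: "Illegal Instruction",
    134: "Invalid Data",
    137: "Killed by system or user; rerun or manually change to 90 or 99",
}

# Assumption: codes whose description recommends a rerun. Listed explicitly
# because 95 and 96 mention "rerun" only in the sense "not to be rerun".
RERUN_CODES = {1, 3, 7, 8, 100, 101, 137}


def describe(code):
    """Return the description for an exit code, or a fallback message."""
    return ERROR_CODES.get(code, f"Unknown error code {code}")


def suggests_rerun(code):
    """True if the description above recommends rerunning the job."""
    return code in RERUN_CODES


if __name__ == "__main__":
    for code in (8, 95, 137):
        action = "rerun" if suggests_rerun(code) else "inspect by hand"
        print(f"{code}: {describe(code)} [{action}]")
```

Some of the rerun codes are conditional (7 needs the script added to the tar first, 100 and 101 need a database update), so a real triage step would still need a human decision before resubmitting.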