Keepup » History » Version 16

Adam Schreckenberger, 06/09/2016 12:10 PM



KEYGEAR & AMBROSIA (KGA) are the latest pretty, mechanical, software workhorses of the keepup / batch processing production script fleet used by MINOS, MINOS+ and MINERvA. With the dropping of SL5 support at Fermi, the old scripts written by Howie Rubin were no longer able to communicate with the allocated grid slots. This necessitated the creation of new software.

Old Framework

Previously, the chain of submission worked as follows:
1. keepup_lists --> keepup_list (generated the nightly list of subruns to be submitted for processing)
2. get_daq_submit.glide (performed keepup bookkeeping and handed list to subsequent steps)
3. submit.glide (start point for manual submission) --> submit_jobs.glide (handled duplicate submission & good runs checks)
4. analyze_driver.glide --> analyze (dealt with stdout and stderr and was the reco task running on the worker node)

New Framework

KGA represents a significant shift in the paradigm of our submission process. The old software predated minos_jobsub and submitted each subrun as a single job using condor_submit commands. With the advent of the new jobsub tools, we aim to submit the entire job list as a single entity, meaning that multiple subruns will get a single cluster ID - but carry different process numbers. Additionally, many of the functions of the old scripts have been wrapped into standard jobsub commands. This allows us to consolidate our framework and make it a bit easier for the user to understand.

The KGA chain looks as follows:
1. keepup_lists --> keepup_list (now running samweb versions) [cron under minospro@minos27]
2. (sets various arguments and passes a final, checked list to the job that runs on the worker node) [cron under minospro@minos51]
3. (the reconstruction task that runs on the worker node)
4. keepup_orphans runs outside the normal chain to check that daq_lists is in good upkeep [cron under minospro@minos27]

Useful Directories
1. SCRIPTS -- /grid/fermiapp/minos/minfarm/scripts [home of KEYGEAR, AMBROSIA, WRAITH, keepup_lists, keepup_orphans]
2. LISTS/LIST_DIR -- /minos/data/minfarm/lists [where all lists are assumed to originate with manual -l submissions, and where good/bad/zap_runs bookkeeping sits]
3. daq_lists -- /minos/data/minfarm/lists/daq_lists [where the massive bookkeeping sits along with -k keepup lists]
4. $detcat -- /minos/data/minfarm/${det}cat [where output sntps from det=near,far wait for concatenation and roundup]

KEYGEAR is run either by a user to process a list of subruns to be reconstructed manually - or as a crontab job for the nightly keepup reconstruction. This script fulfils both steps 2 and 3 of the old framework through the careful introduction of flags. The script is housed in /grid/fermiapp/minos/minfarm/scripts. To see a detailed list of command line options run . -h. While this list gives a thorough overview of the options available, I will expand upon a few details to aid with posterity - and to highlight a few of the necessary inputs.

KEYGEAR requires that a reconstruction version be set. This is done using the -v flag. For example, if I wanted to run elm4 reconstruction, I would have -v elm4 as an argument.

The script also requires a list of subruns to process. The -k flag tells the script to run in keepup mode. Note that this flag is generally only used during the nightly keepup in the crontab. It should only be used otherwise by an expert who has full knowledge of how the -k option affects the bookkeeping in the lists/daq_lists directory. If the -k flag is given, the script will look for input lists in the lists/daq_lists directory. It will also append information to various daq lists and tar archives. It does not work with a given, manual list. In fact, if you attempt to use both -k and -l, the script will exit without doing anything.

The -q flag has been added to make use of sam-projects. This functionality was added so that OPOS could take over monitoring of production tasks. If -q is used, a sam-project is created from the corrected list produced by KEYGEAR, and AMBROSIA will choose a file to process based on get-next-file as opposed to the traditional method of using the process number to select a file from the corrected list.
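The traditional selection method mentioned above, where the process number picks a file from the corrected list, can be sketched as follows. This is only an illustration and not AMBROSIA's actual code: the function name, the placeholder list entries, and the assumption that process numbers start at zero are all mine.

```python
def pick_entry(corrected_list_lines, process_number):
    """Return the list entry handled by the job with this process number.

    Hypothetical helper: assumes one entry per line and zero-based
    process numbers within a cluster.
    """
    # Ignore blank lines so that the index counts real entries only.
    entries = [line.strip() for line in corrected_list_lines if line.strip()]
    if not 0 <= process_number < len(entries):
        raise IndexError(
            f"process {process_number} has no entry in a list of {len(entries)}"
        )
    return entries[process_number]


# Placeholder entries; the real list format is not reproduced here.
corrected = ["entry_A\n", "entry_B\n", "\n", "entry_C\n"]
for process in range(3):
    print(process, pick_entry(corrected, process))
```

With -q, this index lookup is replaced by asking the sam-project for the next file, so the mapping between process number and file is no longer fixed.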

Let's say I wanted to set up the crontab on the appropriate keepup submitting machine. The line below would be appropriate for running nightly keepup for elm2 and elm4 reconstructions with optimized code.

30 23 * * * . /grid/fermiapp/minos/minfarm/scripts/ -v elm4 -V elm2 -k -O -q

Once upon a time, the -I flag was used to set integration options, which made use of Adam's personal cert as that was the only way to get cron to play nice with earlier versions of the jobsub client. This function is no longer needed with the current jobsub version.

Resubmissions to deal with errors/job crashes are obviously done manually. Likewise, when we submit a production run for analysis, this is done by manually submitting a joblist. Lists are specified with the -l (as in list, not i) flag:
. -v elm4 -l testlist -O
The line above is an example. The script will look for testlist in $LIST_DIR, which is currently assigned to /minos/data/minfarm/lists. Make sure that your joblist is located in this directory. To make use of sam-projects, the command line above would become
. -v elm4 -l testlist -O -q

Once KEYGEAR has a list to work with, it will scan through to see if the jobs you wish to submit are either currently running, have completed successfully, or match a failed job. The script checks for currently running jobs by looking in $LIST_DIR/running-jobs for a file that matches the submission. When a job is successfully submitted, a file is created in this directory with a name of the following format: D000R####_S###.PASS.RECO (e.g. F00061645_0020.0.elm2 for far detector, run 61645, subrun 20, pass 0, with elm2 reconstruction). If KEYGEAR determines that a subrun in the list is currently running, it will disregard it. Likewise, each subrun is checked against the good_runs and bad_runs lists in $LIST_DIR for the reconstruction of choice. If the subrun has already been successfully reconstructed, it will also be disregarded. If the subrun is found in the bad_runs list, that job will still be submitted to the grid, but KEYGEAR will warn you that the subrun should be removed from the bad_runs list.
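The checks above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the KEYGEAR implementation: the field widths in the pattern (8-digit run, 4-digit subrun) are inferred from the single example name above, subruns are treated as opaque strings, and all function names are hypothetical.

```python
import re

# Pattern inferred from the example F00061645_0020.0.elm2 (an assumption).
JOB_NAME = re.compile(
    r"^(?P<det>[A-Z])(?P<run>\d{8})_(?P<subrun>\d{4})"
    r"\.(?P<pass_no>\d+)\.(?P<reco>\w+)$"
)


def parse_job_name(name):
    """Split a running-jobs file name into its components."""
    match = JOB_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised running-jobs name: {name}")
    return {
        "detector": match["det"],
        "run": int(match["run"]),
        "subrun": int(match["subrun"]),
        "pass": int(match["pass_no"]),
        "reco": match["reco"],
    }


def screen(subruns, running, good, bad):
    """Apply the running / good_runs / bad_runs checks to a list.

    Returns the subruns to submit and the subset that should trigger a
    bad_runs warning.
    """
    to_submit, warnings = [], []
    for subrun in subruns:
        if subrun in running or subrun in good:
            continue  # already running or already reconstructed: disregard
        if subrun in bad:
            warnings.append(subrun)  # still submitted, but flagged
        to_submit.append(subrun)
    return to_submit, warnings


print(parse_job_name("F00061645_0020.0.elm2"))
print(screen(["a", "b", "c", "d"], running={"a"}, good={"b"}, bad={"c"}))
```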

After performing these checks, KEYGEAR stores the subruns that passed the cuts in a file located in $LIST_DIR/KGA/. These files are identified as (manual/keepup).timestamp. Note that these files must be kept in the directory until the jobs are completed, as AMBROSIA will copy specified lists to the grid nodes to actually perform reconstruction.

KEYGEAR additionally sets up the arguments for the AMBROSIA_submit script depending upon the flags given. It then communicates with the grid nodes through the jobsub command, which means that we have reconstruction jobs running on worker nodes.

AMBROSIA is the actual reconstruction job running on the worker node. It creates ntuples of choice as well as DSTs for data quality purposes. It takes the arguments provided by and adjusts accordingly. AMBY also removes the associated file in $LIST_DIR/running-jobs, indicating that the job has finished one way or another, and updates the good_runs & bad_runs lists of the relevant reconstruction version.

Useful Blurbs

By default KGA will run both cosmic and spill passes and produce standard ntuples only. This works well during nightly keepup. However, during an analysis production run, we generally want both standard ntuples and muon-removed ntuples. If these scripts had existed for - say - the elm5 production run, I would have used a submission such as:
. -v elm5 -l giantelm5list -ONM
where N and M specify I want both sntp and mrnt files produced.

Remember that you have to source /grid/fermiapp/minos/scripts/ to use the new jobsub tool. If you submit a job and the cluster ID number looks odd, it is highly likely that the jobs have gone into the wrong pool and will crash. This might necessitate manual cleanup of the running-jobs directory, so take care with this step.

It's also worth noting that care must be taken when submitting FD jobs so as not to overload the database. These jobs generally run quickly. In the past, a cron job was used to partition the FD list; however, that functionality does not currently exist in the KGA framework. The -j and -r flags exist, which allow the user to skip a certain number of lines in a large list and to set the number of subruns to submit, respectively. A new wrapper will be written to make use of these flags and handle this issue in the future.
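Until that wrapper exists, the partitioning can be worked out by hand. The sketch below is purely hypothetical and is not the planned wrapper: it only computes, for a list of a given length, the pairs of values one might pass to -j (lines to skip) and -r (subruns to submit), assuming the flags behave as described above. The list length and chunk size are made-up numbers.

```python
def partition(total_lines, chunk_size):
    """Return (skip, count) pairs that cover a list in chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (skip, min(chunk_size, total_lines - skip))
        for skip in range(0, total_lines, chunk_size)
    ]


# Example: a 250-line FD list submitted 100 subruns at a time.
for skip, count in partition(250, 100):
    print(f"-j {skip} -r {count}")
```

Spacing the resulting submissions out in time, rather than launching them back to back, is what keeps the database load down.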

Error Codes Reported by Ambrosia

1: Input error, usually an srmcp / ifdh cp problem -- rerun
2: No output streams
3: Unable to save an output stream -- dcache or farcat/nearcat -- rerun
4: SAM-PROJECTS problem -- rerun with careful project accounting
7: Unable to locate loon script -- rerun after adding script to tar
8: Mysql server not available -- rerun
15: No asciidb files -- configuration error -- probably obsolete
90: Job runs extremely long without writing output -- killed by hand
91: Do not process -- not in measurement list -- manual entry in bad_runs. Should be caught as a suppressed run -- mostly used with atmos processing
95: Reassignment of 100 to allow roundup to flush if not to be rerun
96: Reassignment of 101 to allow roundup to flush if not to be rerun
99: Job runs extremely long and writes massive output -- killed by hand
100: Gaps in bfield database -- usually rerun after db update
101: Gaps in beam spill database -- usually rerun after db update
132: Illegal Instruction
134: Invalid Data
136: FPE
137: Killed by system or user; rerun or manually change to 90 or 99
139: SEGV
255: No spills - Essentially a 101 with both BeamMonSpill and SpillTimeND returning 0 Spills
If multiple passes are run (e.g. B), Amby is smart and says run cosmics only.
The only time you should see a 255 as an error code is if you're only running spills (S)
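
When triaging failed jobs, the table above can be turned into a small lookup. This is a convenience sketch, not part of KGA: the descriptions are abbreviated from the table, and the set of codes marked as rerun candidates is my reading of the entries that mention a rerun, so treat it as an assumption.

```python
ERROR_CODES = {
    1: "Input error (srmcp / ifdh cp)",
    2: "No output streams",
    3: "Unable to save an output stream",
    4: "SAM-PROJECTS problem",
    7: "Unable to locate loon script",
    8: "Mysql server not available",
    15: "No asciidb files",
    90: "Ran extremely long without output, killed by hand",
    91: "Do not process, not in measurement list",
    95: "Reassignment of 100",
    96: "Reassignment of 101",
    99: "Ran extremely long with massive output, killed by hand",
    100: "Gaps in bfield database",
    101: "Gaps in beam spill database",
    132: "Illegal Instruction",
    134: "Invalid Data",
    136: "FPE",
    137: "Killed by system or user",
    139: "SEGV",
    255: "No spills",
}

# Codes whose table entry suggests a rerun (possibly after a fix or db update).
RERUN_CODES = {1, 3, 4, 7, 8, 100, 101, 137}


def describe(code):
    """Return the short description for an error code."""
    return ERROR_CODES.get(code, "unknown code")


def needs_rerun(code):
    """True if the table suggests the job is a rerun candidate."""
    return code in RERUN_CODES


for code in (8, 91, 100, 42):
    print(code, describe(code), needs_rerun(code))
```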


Beam Database issues

Magnet Database issues

Resolved Magnet Database issues

How spill times are built