Minos 2015-03-26

  • Monitoring
    • What do they want to see?
  • Jobs
    • Types of production jobs
      • Production reco: produces ntuples, dsts
      • Daily keepup
      • Monte Carlo reco
      • Monte Carlo generation is all offsite (outside scope?)
      • All involve concatenation to merge output
    • How launched
      • Wrapper around jobsub client
      • SAM Projects -- willing to look into it.
    • Success criteria
  • Workflow
    • Info in request to approvers
  • Metrics
    • What reports/metrics would you want from the system?
Two tasks when sent to the grid
  • generate ntuples
  • produce keepup dsts (data quality monitoring)
    We perform a lot of checks, by error code; we keep track of that
    for standard ntuples; keepup dsts are not part of that...
    Recovery is centered on the ntuples.

Concern about SAM projects -- Minos currently bookkeeps through text files. Since
Minos has only 18 months left, we should maintain our bookkeeping, not try to redo it.
No one has been assigned to do it; Adam and Paul are the folks running things
now.

What's important from bookkeeping now:

Originally, when Howie's scripts were in use: running jobs via condor; not
reasonable anymore, too many jobs. Now when it says I'm going to submit this
batch, we touch a file; the job then removes it when it finishes...
When a job is successful, it populates a goodruns list. This is the basis for
success determination; otherwise you get an entry in the badruns list.
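
A minimal sketch of the marker-file bookkeeping described above, not the actual Minos scripts. The function names, the ".submitted" suffix, the "run_id error_code" line format for badruns, and the convention that error code 0 means success are all assumptions made for illustration.

```python
from pathlib import Path


def mark_submitted(marker_dir: Path, run_id: str) -> Path:
    """Touch a marker file when the job for this run is submitted."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker = marker_dir / f"{run_id}.submitted"
    marker.touch()
    return marker


def record_result(marker_dir: Path, run_id: str, error_code: int,
                  goodruns: Path, badruns: Path) -> None:
    """Append the run to goodruns or badruns, then remove its marker."""
    if error_code == 0:  # assumption: 0 means the job succeeded
        with goodruns.open("a") as f:
            f.write(f"{run_id}\n")
    else:
        with badruns.open("a") as f:
            f.write(f"{run_id} {error_code}\n")
    (marker_dir / f"{run_id}.submitted").unlink(missing_ok=True)


def still_outstanding(marker_dir: Path) -> list[str]:
    """Runs whose marker still exists: submitted but not yet finished."""
    return sorted(p.stem for p in marker_dir.glob("*.submitted"))
```

Any marker left behind after a batch should have finished points at a job that never reported back, which avoids having to query the batch system for every job.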
Summarize from badruns -- using error codes heavily. Used to decide
if we resubmit, etc. More recently:
1) need to separate errors from hard crash vs soft error (database problem, etc.);
more frequent of late. If beam is down, the database would have some trouble, same
error as database not there -- now separated, and some recovery.
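
As a rough illustration of summarizing a badruns list by error class, the sketch below splits entries into hard and soft errors. The numeric codes are hypothetical (the real Minos codes are not given in these notes), the "run_id error_code" line format is assumed, and treating soft errors as resubmit candidates is an assumption about the recovery policy.

```python
from collections import Counter

# Hypothetical error codes, chosen only for illustration.
SOFT_ERRORS = {101, 102}  # e.g. database unavailable, database record missing
HARD_ERRORS = {1, 139}    # e.g. generic crash, segmentation fault


def classify(error_code: int) -> str:
    """Map an error code to 'soft', 'hard', or 'unknown'."""
    if error_code in SOFT_ERRORS:
        return "soft"
    if error_code in HARD_ERRORS:
        return "hard"
    return "unknown"


def summarize_badruns(lines):
    """Count entries per error class and collect soft-error runs.

    Each line is assumed to look like "run_id error_code".
    Returns (counts, resubmit) where resubmit lists the soft-error runs.
    """
    counts = Counter()
    resubmit = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        run_id, code = parts[0], int(parts[1])
        kind = classify(code)
        counts[kind] += 1
        if kind == "soft":
            resubmit.append(run_id)
    return counts, resubmit
```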
One thing we do need: if you do a production pass and submit far detector jobs,
we can hammer our database into oblivion; do them in incremental chunks. Need
a means to protect the database from overload; --max-concurrent was tried, but they
were using process number to pick the file, so DAGs were problematic.

Good question for future discussion: is process number used in scripts?

Old system was 7 different scripts; now one with flags. Now he can pass in a skip,
so the file is picked by process number + skip, and he can submit small batches.
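
The process number + skip idea can be sketched as below. This is an illustration under assumptions, not the actual script: the function names and the flat input file list are hypothetical, and the process number is taken to start at 0 within each submission.

```python
def file_for_process(file_list, process_number, skip=0):
    """Pick the input file for one job: index = process number + skip.

    Returns None when the index falls outside the file list.
    """
    index = process_number + skip
    if index < 0 or index >= len(file_list):
        return None
    return file_list[index]


def batch_plan(total_files, batch_size):
    """Return (skip, n_jobs) pairs that cover all files in small batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(skip, min(batch_size, total_files - skip))
            for skip in range(0, total_files, batch_size)]
```

For example, batch_plan(10, 4) yields three submissions with skips 0, 4, and 8, so only a few jobs hit the database at a time.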

Using badruns list as resubmit list, already in that format.

Mike D.: You mentioned all your things do a merge; what are the parameters of
what gets merged into what? We merge them when all the files that should
go into a run exist. Determining what is okay not to be there is somewhat manual.
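
One way to express the merge condition is sketched below. It assumes a mapping from run to expected file names and a manually maintained set of files that are okay to be missing; both structures and the function name are hypothetical.

```python
def runs_ready_to_merge(expected, present, waived=frozenset()):
    """List runs whose expected files all exist or are waived.

    expected: dict mapping run -> iterable of expected file names
    present:  collection of file names that currently exist
    waived:   file names judged (manually) okay to be missing
    """
    present = set(present)
    waived = set(waived)
    ready = []
    for run, files in sorted(expected.items()):
        missing = set(files) - present - waived
        if not missing:
            ready.append(run)
    return ready
```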

Other Notes:

Minos has only 18 months as an experiment. Is it worth rewriting their scripts to use SAM projects?
What would be huge jobs are broken down into smaller pieces.