Mu2e2015-04-07

Marc Mengel, 04/07/2015 05:10 PM


  • Monitoring
    • What do they want to see?
  • Jobs
    • List of production job types
      • N stages of MC
      • Event mixing -- maybe
      • MC reco
      • eventually -- real reco?
      • eventually -- calibration?
        • probably won't be production group for initial phase
        • eventually
      • Get to where they're running smoothly, then hand off to production
    • How launched
      • scripts wrapped around jobsub-client
      • So far none are SAM project based; relatively soon (weeks) to use SAM projects
    • Success criteria categories
      • success reported by script
      • success post-hoc
      • data integrity
      • arbitrary user-provided logfile check
        • maybe script per job / per project / per campaign
  • Workflow
    • Info in request to approvers
      • **
  • Metrics
    • what reports/metrics would you want from system? **

Data disks full?
98% of time is spent diagnosing/triaging problems
Can the division spend time on reliability to reduce above?
Error codes are largely Art's -- if internal.
May have a period where production jobs are SAM based, and other work isn't.

Once condor_q showed empty, scanned logfiles for completion codes; one failure mode was that the
same job could complete multiple times (condor resubmit?).
This happened more often than expected... much discussion.
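
A minimal sketch of the logfile check described above, assuming the completion records have already been pulled out of the logfiles as (job id, exit code) pairs. The record layout, the job ids, and the function name are hypothetical, not taken from the actual scripts.

```python
from collections import defaultdict

def find_repeat_completions(records):
    """Return {job_id: [exit codes]} for jobs that logged more than one completion.

    records: iterable of (job_id, exit_code) pairs, one per completion
    line found in the logfiles (assumed format, for illustration only).
    """
    codes_by_job = defaultdict(list)
    for job_id, exit_code in records:
        codes_by_job[job_id].append(exit_code)
    # Keep only jobs with more than one completion record.
    return {job: codes for job, codes in codes_by_job.items() if len(codes) > 1}

# Hypothetical completion records; job ids are made up.
records = [
    ("1234.0", 0),
    ("1234.1", 0),
    ("1234.1", 0),   # same job completed twice
    ("1234.2", 1),
    ("1234.2", 0),   # failed once, then completed again
]
repeats = find_repeat_completions(records)
for job, codes in sorted(repeats.items()):
    print(job, "completed", len(codes), "times with codes", codes)
```

Keeping the exit codes per job, rather than only a count, makes it possible to tell a repeated success apart from a failure followed by a retry.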

In this upcoming phase, if our success rate is in the 90s, we need not do anything.

Idea of black hole nodes. cvmfs errors, bus errors, etc. eating jobs
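
One way the black hole idea could be checked, as a rough sketch: tally job outcomes per worker node and flag nodes that ran a reasonable number of jobs but almost never succeeded. The input format, node names, and both thresholds are assumptions chosen for illustration.

```python
from collections import defaultdict

def suspect_black_holes(job_results, min_jobs=5, max_success_rate=0.2):
    """Return a sorted list of nodes that look like black holes.

    job_results: iterable of (node_name, succeeded) pairs (assumed format).
    A node is flagged if it ran at least min_jobs jobs and its success
    rate is at or below max_success_rate. Both thresholds are arbitrary.
    """
    total = defaultdict(int)
    ok = defaultdict(int)
    for node, succeeded in job_results:
        total[node] += 1
        if succeeded:
            ok[node] += 1
    return sorted(
        node for node, n in total.items()
        if n >= min_jobs and ok[node] / n <= max_success_rate
    )

# Hypothetical outcomes: nodeA eats every job, nodeB is healthy,
# nodeC failed but ran too few jobs to judge.
results = [("nodeA", False)] * 6 + [("nodeB", True)] * 6 + [("nodeC", False)] * 2
print(suspect_black_holes(results))
```

The min_jobs floor keeps a node from being flagged on the strength of one or two unlucky jobs.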

Provide tools to check things, etc. and we'll call them.
cvmfs up to date checks in jobsub wrapper?


Thing I'd want to see is sort of progress bars, percent complete vs time, etc.
on each campaign.
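
As an illustration of the kind of per-campaign display meant here, a small text progress bar, assuming the system can report completed and total job counts for a campaign. The campaign names and counts below are invented.

```python
def progress_bar(done, total, width=40):
    """Render a text progress bar such as '[#####.....]  50.0%'."""
    if total <= 0:
        # Nothing to measure against yet.
        return "[" + " " * width + "]   n/a"
    fraction = min(max(done / total, 0.0), 1.0)  # clamp to 0..1
    filled = int(fraction * width)
    return "[" + "#" * filled + "." * (width - filled) + f"] {fraction:6.1%}"

# Hypothetical campaigns with (completed jobs, total jobs).
campaigns = {"mc_stage1": (750, 1000), "mc_reco": (120, 800)}
for name, (done, total) in campaigns.items():
    print(f"{name:12s} {progress_bar(done, total, width=20)}")
```

Sampling the same counts at intervals and storing them with a timestamp would give the percent-complete-versus-time view.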

Concatenation/merge stage projects?

Merging -- we care about when we run a grid cluster; MC generation within one cluster gets a
unique run number and subruns are a cluster number. Subruns not split across files.
Bookkeeping corners we haven't explored -- would like as much as possible to have subruns
made contiguous and in order in a merge phase.
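
A sketch of how a merge phase might keep subruns contiguous and in order, assuming each input file holds exactly one subrun and is described by a (run, subrun, filename) tuple. The tuple layout, the file names, and the batch size limit are hypothetical.

```python
def plan_merges(files, max_per_merge=3):
    """Group files into merge batches of contiguous, ordered subruns.

    files: iterable of (run, subrun, filename) tuples, one subrun per
    file (an assumption). A new batch starts when the run changes, when
    there is a gap in subrun numbers, or when the batch is full.
    """
    batches = []
    current = []
    prev = None
    for run, subrun, name in sorted(files):
        contiguous = prev is not None and prev == (run, subrun - 1)
        if current and (not contiguous or len(current) >= max_per_merge):
            batches.append(current)
            current = []
        current.append(name)
        prev = (run, subrun)
    if current:
        batches.append(current)
    return batches

# Hypothetical inputs: run 7 is missing subrun 3, run 8 has one file.
files = [(7, 2, "c"), (7, 0, "a"), (7, 1, "b"), (7, 4, "d"), (8, 0, "e")]
print(plan_merges(files))
```

Breaking a batch at a gap means a missing subrun shows up as a short merged file rather than being silently skipped over.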