• Monitoring
    • What do they want to see?
  • Jobs
    • List of production job types
      • N stages of MC
      • Event mixing -- maybe
      • MC reco
      • eventually -- real reco?
      • eventually -- calibration?
        • probably won't be production group for initial phase
        • eventually
      • Get to where they're running smoothly, then hand off to production
    • How launched
      • scripts wrapped around jobsub-client
      • So far none are SAM-project based; expect to use SAM projects relatively soon (weeks)
    • Success criteria categories
      • success reported by script
      • success post-hoc
      • data integrity
      • arbitrary user-provided logfile check
        • maybe script per job/per project/ per campaign
  • Workflow
    • Info in request to approvers
      • **
  • Metrics
    • what reports/metrics would you want from system? **

Data disks full?
98% of time is spent diagnosing/triaging problems
Can the division spend time on reliability to reduce above?
Error codes are largely from Art, if internal.
May have a period where production jobs are SAM based, and other work isn't.

Once condor_q showed empty, they scanned logfiles for completion codes. One failure mode was
that the same job could complete multiple times (condor resubmit?).
This happened more often than expected; much discussion.
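
A minimal sketch of the repeat-completion check described above. It assumes the logfile scan has already produced (job_id, exit_code) pairs, one per completion line; that record format and the function name are hypothetical, not something the existing scripts are known to produce.

```python
from collections import Counter

def find_repeat_completions(records):
    """Return {job_id: count} for jobs that reported completion more than once.

    records: iterable of (job_id, exit_code) pairs, one per completion line
    found in the logfiles (hypothetical format).
    """
    counts = Counter(job_id for job_id, _exit_code in records)
    return {job_id: n for job_id, n in counts.items() if n > 1}

# Example: job 1234.0 appears twice, as it might after a resubmit.
records = [("1234.0", 0), ("1234.1", 0), ("1234.0", 0), ("1234.2", 1)]
repeats = find_repeat_completions(records)
```

Counting completions per job, rather than only checking that each job completed at least once, is what exposes the resubmit case.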

In this upcoming phase, if our success rate is in the 90s, we need not do anything.

Idea of black hole nodes: CVMFS errors, bus errors, etc. eating jobs.
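
One possible way to flag candidate black hole nodes, sketched under assumptions: per-job results are available as (node_name, succeeded) pairs, and the thresholds min_jobs and min_fail_fraction are illustrative values, not numbers from the discussion.

```python
from collections import defaultdict

def suspect_black_holes(job_results, min_jobs=5, min_fail_fraction=0.8):
    """Return a sorted list of nodes whose failure fraction looks suspicious.

    job_results: iterable of (node_name, succeeded) pairs (hypothetical format).
    A node is flagged only if it ran at least min_jobs jobs and at least
    min_fail_fraction of them failed.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for node, succeeded in job_results:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    return sorted(
        node
        for node in totals
        if totals[node] >= min_jobs
        and failures[node] / totals[node] >= min_fail_fraction
    )

results = (
    [("node-a", False)] * 6      # fails everything it touches
    + [("node-b", True)] * 6     # healthy
    + [("node-c", False)] * 2    # too few jobs to judge
)
suspects = suspect_black_holes(results)
```

The min_jobs floor keeps a node with one or two unlucky jobs from being flagged.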

Provide tools to check things, etc. and we'll call them.
cvmfs up to date checks in jobsub wrapper?


Thing I'd want to see is sort of progress bars, percent complete vs time, etc.
on each campaign.
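
A rough sketch of the per-campaign progress display, assuming the system can count finished and total jobs for a campaign. The function names and the text rendering are illustrative only.

```python
def percent_complete(done, total):
    """Percentage of jobs finished; 0.0 when the campaign has no jobs yet."""
    if total <= 0:
        return 0.0
    return 100.0 * done / total

def progress_bar(done, total, width=20):
    """Render a fixed-width text bar followed by the percentage."""
    pct = percent_complete(done, total)
    filled = int(width * pct / 100)
    return "[" + "#" * filled + "." * (width - filled) + "] " + f"{pct:.1f}%"

# One line per stage of a campaign (counts are made up for illustration).
for stage, done, total in [("stage1", 50, 100), ("stage2", 25, 100)]:
    print(stage, progress_bar(done, total))
```

Recording (timestamp, percent) samples from the same function would give the percent-complete-versus-time view.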

Concatenation/merge stage projects?

Merging -- we care about when we run a grid cluster; MC generation within one cluster gets a
unique run number and subruns are a cluster number. Subruns not split across files.
Bookkeeping corners we haven't explored -- would like as much as possible to have subruns
made contiguous and in order in a merge phase.
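
A small sketch of the ordering and contiguity check a merge phase might apply. It assumes each input file can be described by a (filename, run, subrun) tuple; how that metadata is obtained (for example from SAM) is not specified here, and the function names are hypothetical.

```python
def merge_order(files):
    """Return filenames sorted by (run, subrun), the order a merge would read them."""
    return [name for name, _run, _subrun in sorted(files, key=lambda f: (f[1], f[2]))]

def missing_subruns(files):
    """Return {run: [missing subruns]} for gaps between the lowest and highest
    subrun seen in each run. Runs with no gaps are omitted."""
    by_run = {}
    for _name, run, subrun in files:
        by_run.setdefault(run, set()).add(subrun)
    gaps = {}
    for run, subruns in by_run.items():
        expected = set(range(min(subruns), max(subruns) + 1))
        missing = sorted(expected - subruns)
        if missing:
            gaps[run] = missing
    return gaps

files = [("c.root", 7, 3), ("a.root", 7, 1), ("b.root", 6, 2)]
ordered = merge_order(files)
gaps = missing_subruns(files)
```

Reporting gaps before merging lets the merge wait for, or explicitly skip, the missing subruns rather than silently producing non-contiguous output.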

Other Notes

Analysis computing may want to use this production system, but the scope is just for the production group for now.

Normal operation procedure:
Sit down with them (OPG) and define specs. Rob is skeptical of the request form; he feels there will need to be human contact.

They have their own script wrapped around jobsub-client.

If their script writes log files, we need to tell the experiment where to write them, so we can analyze them later.

They use a check script to check a bunch of things. After that, they use their cleanup script.

Write log information on the worker node, then use ifdh to send the logs to BlueArc.

Have generic base tests for success that will work for all experiments.

Stage 1: 50% complete
Stage 2: 25% complete (depends on stage 1 output)

Give the jobs the ability to communicate with GlideinWMS. Does the node have CVMFS? Store this data.

ifdh stages the files in and out.

Mu2e has 3 flavors of jobs.

Perhaps have the jobs broadcast their logs, via HTTP, to a dedicated log server. Maybe just the tail of the log files? This log server would contain a database, which could be queried to get relevant information.
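
A sketch of the "just the tail" idea, assuming the job wrapper can read its own logfile on the worker node. The payload field names (job_id, log_tail) are hypothetical, and the actual HTTP send and the server side are left out because no endpoint has been defined.

```python
import json
from collections import deque

def tail_lines(path, n=50):
    """Return the last n lines of a text file without loading all of it into memory."""
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=n))

def build_log_payload(job_id, path, n=50):
    """Build a JSON string a job could send to a log server (hypothetical schema)."""
    return json.dumps({"job_id": job_id, "log_tail": "".join(tail_lines(path, n))})
```

Sending only the tail keeps each message small, at the cost of losing errors that occur early in a long log.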

Log stuff should go to dCache, and eventually to tape. (Mengel)