Minerva 2015-03-24

Overview:

SAM projects?

SAM project completion vs data on tape, etc.

Validation of completion vs what is...

Reliability of getting stuff off tape. (wasn't SAM, was SFA...)

Heidi:
SAM -- We can do it. A noticeable amount of bit-rot since the last attempt. Should be doing it anyway.

Be very careful not to have too many tentacles.

People want the production group to do something; we need a schema:
"we want this code on this dataset under these conditions..."
This could be very helpful: to have it written down in a database
rather than bouncing around in emails...
Forces people to say what they want.
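
A minimal sketch of what such a written-down request could look like, in Python. The field names (requester, code_version, dataset, conditions) are assumptions made for illustration; the actual schema was not decided in this meeting.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProductionRequest:
    """One request: this code, on this dataset, under these conditions."""
    requester: str
    code_version: str   # hypothetical: software release tag to run
    dataset: str        # hypothetical: name of the input dataset definition
    conditions: dict = field(default_factory=dict)  # free-form run conditions
    requested_on: date = field(default_factory=date.today)

    def missing_fields(self):
        """Return the names of required fields that were left empty."""
        required = ("requester", "code_version", "dataset")
        return [name for name in required if not getattr(self, name).strip()]


req = ProductionRequest("some_user", "v1_0", "example_dataset",
                        {"datatier": "reco"})
print(req.missing_fields())
```

A non-empty result from missing_fields() is what "forces people to say what they want": the request is not accepted until it is complete.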

Herb Greenlee knows a lot about this user MC request thing...

This could be incredibly useful. We have an existing script...

Gabe

Goal: one-stop shopping for info. May refer out, but one place to watch.

May want an analysis coordination package for within the experiment as well.

Other requirements

  • Monitoring
    • What do they want to see?
      • Listing of datasets/state/time in state/date requested/output files on tape
      • (overall completion of campaigns... mengel)
      • failures by type/time/site, etc.
      • users looking at jobs understanding priority/estimated start/what queue
  • Jobs
    • List of production job types
      • reco
      • calibrations
        • some iterative bit
      • some jobs don't get submitted until calibration vx is ready
      • keepup/rawdigits
      • mc
        • okay if partially completed, as far as next stage
    • How launched
      • Minerva scripts over jobsub, etc.
    • Success criteria
      • output exists
      • job terminated normally
  • Workflow
    • Info in request to approvers
  • Metrics
    • what reports/metrics would you want from system?
      • What is the useful stuff?
        • digestible
        • per user/overall efficiency
        • slots of allocation used
        • opportunistic used
        • preemption
        • re-runs
        • failure patterns

You would also want to ask some of these about our analyzers...

Paola -- Met with CMS; one thing that Jen mentioned was that they spend a lot of
time working on error code management -- classifying error codes. Does Minerva
have an error-code based classification?

Providing result code summaries would perhaps help us to construct such a
classification. (Gabe)
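
A result code summary of this kind could be sketched as below. The job record layout (exit_code, site) is an assumption for illustration, not an existing Minerva or jobsub format.

```python
from collections import Counter


def summarize_result_codes(jobs):
    """Count jobs per (exit_code, site) pair, most frequent first."""
    counts = Counter((job["exit_code"], job["site"]) for job in jobs)
    return counts.most_common()


# Hypothetical job records; real ones would come from the batch system.
jobs = [
    {"exit_code": 0, "site": "site_a"},
    {"exit_code": 1, "site": "site_a"},
    {"exit_code": 1, "site": "site_a"},
    {"exit_code": 2, "site": "site_b"},
]
summary = summarize_result_codes(jobs)
for (code, site), n in summary:
    print(f"exit code {code} at {site}: {n} job(s)")
```

Looking at which pairs dominate the summary is one way to start grouping raw codes into classes.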

Other Notes

Need some coordination in the priority of jobs.

D0 had its own production database that Michael Diesburg knows about. It's a watered-down version of what we will be building.

Should Minerva use SAM projects?
-Heidi's request: not intimately integrated with SAM.

SAM has a robust tracking mechanism and monitoring tied to SAM projects. Why reinvent the wheel?

SAM project could have completed, but you would not know if it made it to tape.

SAM projects with consumers are a big beast of problems, according to Heidi.

Analysis jobs being run by the Offline Production Group is feasible for Minerva.
Gabe Perdue to Heidi- is this something we would like to do?
Heidi- yes.

Success criteria for a job?: (Diesburg to Heidi)
Heidi- is there a descendant in SAM that has a location x //a query
File exists on tape is another sanity check, but if you have lots of small files this could be a problem.
Their code either works, or it doesn't. It is very deterministic.
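
The criteria discussed here (normal termination, output that has a location) could be combined roughly as follows. The has_location callback is hypothetical: it stands in for whatever catalog query answers "does this file have a registered location", and no real SAM call is shown.

```python
def job_succeeded(exit_code, output_files, has_location):
    """Apply both criteria: normal termination and located output.

    has_location is a caller-supplied function (hypothetical) that
    returns True if the named file has a registered storage location.
    """
    if exit_code != 0:
        return False          # job did not terminate normally
    if not output_files:
        return False          # nothing was produced
    return all(has_location(name) for name in output_files)


# Stand-in for a catalog lookup: a set of files known to have a location.
known_locations = {"out_001.root"}
print(job_succeeded(0, ["out_001.root"], known_locations.__contains__))
```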

SAM dataset definitions that have a state sorted by date.
How long has it been in this state? If over 3 days we should be concerned.

Some possible states:
-requested
-approved
-submitted
-queued
-running
-completed

When and what was the last state change?
Keeping track of progress: how many files have been processed, how many more to go?
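
A minimal sketch of tracking state, last state change, and file progress, using the states listed above and the 3-day threshold mentioned. The class and attribute names are assumptions for illustration.

```python
from datetime import datetime, timedelta

STATES = ["requested", "approved", "submitted", "queued", "running", "completed"]
STALE_AFTER = timedelta(days=3)  # "if over 3 days we should be concerned"


class TrackedDataset:
    def __init__(self, name, total_files, now):
        self.name = name
        self.total_files = total_files
        self.files_done = 0
        self.state = STATES[0]
        self.last_change = now

    def advance(self, new_state, now):
        """Move to a later state and record when the change happened."""
        if STATES.index(new_state) <= STATES.index(self.state):
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.last_change = now

    def is_stale(self, now):
        """True if an unfinished dataset sat in one state for over 3 days."""
        return self.state != "completed" and now - self.last_change > STALE_AFTER

    def files_remaining(self):
        return self.total_files - self.files_done


start = datetime(2015, 3, 24, 9, 0)
ds = TrackedDataset("example_dataset", total_files=100, now=start)
ds.advance("approved", start + timedelta(hours=2))
ds.files_done = 40
print(ds.state, ds.files_remaining(), ds.is_stale(start + timedelta(days=4)))
```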

Complete provenance of the job. (Diesburg)
Stats of the machine like how many cores, what type of processor...

Querying the batch system to see what your priority is would be a nice feature, but this is a Condor thing, not our thing. (Heidi)

General analysis people will not be using this system. However, this is an 8x5 operation, so other people will need to know how to use the tool and understand the monitoring.

Some jobs have dependencies on other jobs. For instance, calibration needs to be done before reconstruction.
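
One way to sketch such dependencies is a mapping from each stage to the stages it needs, with Python's standard graphlib giving a valid run order. The stage names and the ready_to_submit helper are illustrative assumptions.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish first.
deps = {
    "calibration": set(),
    "reconstruction": {"calibration"},
}


def ready_to_submit(deps, finished):
    """Stages not yet finished whose dependencies are all finished."""
    return sorted(stage for stage, needs in deps.items()
                  if stage not in finished and needs <= finished)


order = list(TopologicalSorter(deps).static_order())
print(order)
print(ready_to_submit(deps, finished={"calibration"}))
```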

Submit job, grab the job number and send it to the sky. (Heidi)

We can have jobsub people tell our system about job submission. Is it a good idea to have these systems communicate? The more generic now, the easier it will be later. Perhaps have the people instrument their jobs to communicate to the home station?

End of year report would be nice to have.

Failure patterns, tied to hardware?

Have the ability to analyze the analyzers.

We need to be careful. We can run into the wall with DB issues trying to monitor everything. Queries can really kill performance on a DB...

Have the ability to track the efficiency of users. Margaret emails Gabe when there is a Minerva user with a low efficiency rate. Perhaps have an alert mechanism?
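
A rough sketch of such an alert, assuming efficiency means CPU time divided by wall-clock time and assuming a threshold of 0.5. Both the definition and the threshold are illustrative; neither was fixed in this discussion.

```python
def low_efficiency_users(usage, threshold=0.5):
    """Return {user: efficiency} for users below the threshold.

    usage maps a user name to (cpu_seconds, wall_seconds); efficiency
    is taken here to be cpu_seconds / wall_seconds (an assumption).
    """
    flagged = {}
    for user, (cpu_seconds, wall_seconds) in usage.items():
        if wall_seconds <= 0:
            continue  # no meaningful efficiency without wall time
        efficiency = cpu_seconds / wall_seconds
        if efficiency < threshold:
            flagged[user] = efficiency
    return flagged


# Hypothetical accounting numbers for two users.
usage = {"user_a": (900, 1000), "user_b": (200, 1000)}
flagged = low_efficiency_users(usage)
for user, eff in flagged.items():
    print(f"ALERT: {user} efficiency {eff:.0%}")
```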

We need the users to just see everything in one place, and not many different places as it is now.

"Run this on this run number with this datatier" is an example of a simple request.

Examples of why Minerva jobs would fail:
-I couldn't get my input file so I bailed.
-I couldn't write my result file so I bailed.

Using defined error codes in the job script is a slippery slope, according to Heidi.