Success

Thoughts on when a job counts as "successful" or "failed". It seems we really want a state diagram like:

         /-(Held)<-\
>(New)--+-->(Idle)--+->(Running)-->(Completed)-+--->((Succeeded/Located))
          \          \          \               \-->((Failed))
           \--------------------------------------->((Removed))        

When a job completes, it goes either to a Located state, where it is Successful and its output files are declared, or to a Failed state, because we can tell something went wrong. Similarly, an overall Submission (Task) goes to a Failed state if most (?) of its jobs are Failed. This stops a workflow from proceeding: anything that depends on a Failed Submission will not trigger.
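As a rough sketch of the diagram, the states and allowed transitions could be written down as a table. The transition set below is one reading of the ASCII art (in particular, that Held is entered from Idle and released back to Idle), and the names `JobState`, `TRANSITIONS`, `can_transition` and `submission_failed`, as well as the 0.5 threshold standing in for "most (?)", are assumptions made for illustration only.

```python
from enum import Enum


class JobState(Enum):
    NEW = "New"
    IDLE = "Idle"
    HELD = "Held"
    RUNNING = "Running"
    COMPLETED = "Completed"
    LOCATED = "Located"    # the Succeeded/Located final state
    FAILED = "Failed"
    REMOVED = "Removed"


# One reading of the diagram; the Held edges are an assumption.
TRANSITIONS = {
    JobState.NEW: {JobState.IDLE, JobState.REMOVED},
    JobState.IDLE: {JobState.RUNNING, JobState.HELD, JobState.REMOVED},
    JobState.HELD: {JobState.IDLE},
    JobState.RUNNING: {JobState.COMPLETED, JobState.REMOVED},
    JobState.COMPLETED: {JobState.LOCATED, JobState.FAILED},
    JobState.LOCATED: set(),
    JobState.FAILED: set(),
    JobState.REMOVED: set(),
}


def can_transition(src, dst):
    """Return True if the diagram (as read above) allows src -> dst."""
    return dst in TRANSITIONS[src]


def submission_failed(job_states, threshold=0.5):
    """A Submission is Failed if more than `threshold` of its jobs are Failed.

    The threshold is a placeholder for "most (?)"; it is not a decided value.
    """
    if not job_states:
        return False
    failed = sum(1 for state in job_states if state is JobState.FAILED)
    return failed / len(job_states) > threshold
```

With this table, a dependent stage would only be launched when `submission_failed(...)` is False for the Submission it depends on.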

Scenarios for success:

  • Normal Processing:
    • has a project,
    • read >0 input files,
    • used some measurable cpu,
    • copied out >0 output files,
    • exited ok
  • End of Project Processing:
    • has a project,
    • read 0 input files,
    • used little cpu,
    • copied out 0 output files,
    • exited ok
  • MC-gen:
    • has no project,
    • read 0 input files,
    • used some cpu,
    • copied out output files,
    • exited ok
  • Non-SAM Processing:
    • has no project,
    • read >0 input files,
    • used some cpu,
    • copied out output files,
    • exited ok

So basically, the presence of a SAM project and the number of input files (zero or more than zero) put a job into one of 4 categories. Within its category, the job gets a point for each of cpu usage, output files, and (user executable) exit status that matches what the category expects. If it gets 2 out of 3, we call it successful (?).
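The categorize-then-score idea above might look something like the following sketch. Everything named here is hypothetical: the `JobSummary` fields, the `MIN_CPU_SECONDS` cutoff between "little" and "some" cpu, treating exit code 0 as "exited ok", and the default of 2 points needed are assumptions, not settled choices.

```python
from dataclasses import dataclass

# Assumed cutoff between "little" and "some measurable" cpu; not a decided value.
MIN_CPU_SECONDS = 60.0


@dataclass
class JobSummary:
    has_project: bool
    input_files: int
    cpu_seconds: float
    output_files: int
    exit_code: int


def category(job):
    """Pick one of the 4 scenarios from project presence and input file count."""
    if job.has_project:
        return "normal" if job.input_files > 0 else "end_of_project"
    return "non_sam" if job.input_files > 0 else "mc_gen"


def score(job):
    """Count how many of cpu, output files, and exit status match the category."""
    if category(job) == "end_of_project":
        # End of Project jobs are expected to do almost nothing.
        cpu_ok = job.cpu_seconds < MIN_CPU_SECONDS
        output_ok = job.output_files == 0
    else:
        cpu_ok = job.cpu_seconds >= MIN_CPU_SECONDS
        output_ok = job.output_files > 0
    exit_ok = job.exit_code == 0  # assumption: "exited ok" means exit code 0
    return sum([cpu_ok, output_ok, exit_ok])


def is_successful(job, needed=2):
    """Apply the tentative "2 out of 3" rule."""
    return score(job) >= needed
```

For example, an MC-gen job that used cpu and exited ok but copied out no files scores 2 and would still be called successful under the 2-out-of-3 rule, which may or may not be what we want.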

Possible additional success-point sources:
  • successful job regex-match on job output file
  • condor job exit code
  • others?
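
If the first two of these extra sources were adopted, they could feed additional points into the same score. The sketch below is illustrative only: the pattern `SUCCESS_PATTERN` is a made-up placeholder (the real regex would presumably be user-supplied), and treating a condor exit code of 0 as "ok" is an assumption.

```python
import re

# Placeholder pattern; a real one would come from the user or the campaign setup.
SUCCESS_PATTERN = re.compile(r"^job finished ok$", re.MULTILINE | re.IGNORECASE)


def extra_points(job_output_text, condor_exit_code):
    """Return 0-2 additional success points from the optional sources."""
    points = 0
    if SUCCESS_PATTERN.search(job_output_text):
        points += 1
    if condor_exit_code == 0:  # assumption: 0 means the condor job exited ok
        points += 1
    return points
```

If extra points like these are added, the "2 out of 3" threshold would need revisiting, since the maximum score would no longer be 3.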