MeetingRequirements

This is an outline of where we meet the IF Requirements for Job Monitoring, and what needs to be added to
do so.

  1. Use cases for Users
    • Whether the system is working well -- batch monitoring page should show good/failed jobs, queues, etc.
    • What types of resources are available -- need more specifics on what this means.
    • Whether these resources will match their jobs' needs (see above)
    • Time before job will run -- [ ] need to have a completion rate * queued jobs number, userprio for user
    • State of their job -- showing running jobs, [ ] need to list queued, recently completed.
    • Failed jobs -- exit status, etc.
    • Estimate time job will start (?!?) for queued jobs. [ ] some mix of condor_userprio & jobs...
    • Once jobs begin... have per-job cpu/loadaverage graph. Condor update rate issue...
    • When jobs stop... have failed/succeeded graphs, [ ] need exit status info in completed table
    • Users may also want summary... Condor sends emails.
    • Others working/having problems -- can see everyone's jobs the same way
    • Is another user hogging resources -- have group job graphs, can do users...
    • Is another experiment hogging resources -- have group job graphs
  2. Use cases for Operators
    • Is system working -- batch monitoring page gives summaries -- [ ] need per-system info, poss. from ganglia, or net-snmp sent with job...
    • Failed jobs breakdown -- users impacted? error distribution? --[ ] multiple exit code status graph?
    • Data handling fails -- have cpn graph [x] done for now, can add sam graphs later...
    • Slots & load averages -- [ ] todo need slot utilization graph
    • user graphs (see above) [x] done
    • History of above (that's what RRD is good at) [x] done
  3. Requirements
    1. Overall system monitoring
      • slots avail per experiment [ ] todo -- priority info instead?
      • Slots with big memory, fast cpu available [ ] todo need such a slots table/graph
      • Place to post downtime info, etc. [ ] todo html table/file-let upload facility
      • Running [x]done /queued [ ] todo jobs table
      • List of jobs stalled -- running but no CPU [ ] todo
      • Jobs with low CPU/Wall Clock time [ ] todo
      • Jobs started in interval (start time info) [ ] todo
      • Jobs finished in interval w/exit codes, efficiency [ ] todo
      • Above as tables, totals graphs. [x] done
      • Job Ids listed [x] done
      • sums of jobs [x] done
      • Info by factors: experiment [x] done / job type -- [?] how do we get it?
      • Viewer sortable tables [ ] add table sorter javascript, pillage plone
      • User savable table config [ ] hmm....
      • Strip-chart history kept [x] done
      • Charts for given factors
      • Information overlayable on chart [x] done (but ugly), see rrdgraph
    2. Overall data handling monitoring
      • Tape requests -- [-] currently few, could add FTS data?
      • File requests -- [x] cpn graphs
      • Deliveries [x] cpn graphs
      • Errors -- cpn exit status?!? [ ] todo
    3. Job level monitoring
      1. basic info [-] see above
      2. completion info [-] see above
      3. log of transitions [-] avail when we have numeric state info stripchart per job
      4. Files read in [-] info not available (yet) [-] file upload above
      5. Files written -- ditto
      6. Waiting on file -- ditto job status and/or cpn info
      7. Working on file info -- ditto
      8. Job efficiency -- cpu usage available per job
      9. Log files while running(?) -- info not available yet(?)
    4. Access requirements [-] todo when nearer production
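
Several of the open items above (stalled jobs, low CPU/wall-clock jobs, time before a job will run) come down to simple arithmetic on per-job numbers. The following is a minimal sketch of that arithmetic, not the monitoring code itself. The `Job` record, the function names, the 10-minute minimum run time, and the 0.5 efficiency threshold are all hypothetical. The wait estimate treats the completion rate as jobs finished per hour and ignores user priority.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical per-job record; field names are assumptions for this sketch."""
    job_id: str          # e.g. a Condor cluster.proc id
    cpu_seconds: float   # CPU time consumed so far
    wall_seconds: float  # wall-clock time since the job started


def efficiency(job):
    """CPU time divided by wall-clock time; 0.0 if no wall time has accumulated."""
    if job.wall_seconds <= 0:
        return 0.0
    return job.cpu_seconds / job.wall_seconds


def stalled_jobs(jobs, min_wall_seconds=600.0):
    """Jobs that have run for at least min_wall_seconds but used no CPU at all."""
    return [j for j in jobs
            if j.wall_seconds >= min_wall_seconds and j.cpu_seconds == 0]


def low_efficiency_jobs(jobs, threshold=0.5, min_wall_seconds=600.0):
    """Jobs whose CPU/wall-clock ratio is below threshold (includes stalled jobs)."""
    return [j for j in jobs
            if j.wall_seconds >= min_wall_seconds and efficiency(j) < threshold]


def estimated_wait_seconds(jobs_ahead, completions_per_hour):
    """Rough wait estimate: jobs queued ahead divided by the completion rate.

    Returns None when nothing is completing, since no estimate is possible.
    """
    if completions_per_hour <= 0:
        return None
    return jobs_ahead / completions_per_hour * 3600.0


if __name__ == "__main__":
    sample = [
        Job("1.0", cpu_seconds=0, wall_seconds=1200),     # running, no CPU
        Job("2.0", cpu_seconds=300, wall_seconds=1200),   # 25% efficient
        Job("3.0", cpu_seconds=1100, wall_seconds=1200),  # healthy
        Job("4.0", cpu_seconds=0, wall_seconds=60),       # too new to judge
    ]
    print("stalled:", [j.job_id for j in stalled_jobs(sample)])
    print("low efficiency:", [j.job_id for j in low_efficiency_jobs(sample)])
    print("wait (s):", estimated_wait_seconds(30, 60))
```

The minimum run time keeps freshly started jobs out of the stalled list. A real estimate of start time would also need to weigh in condor_userprio, as noted in the user use cases.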