If a job fails, we need:

  • information on exit codes
  • job log files
  • related SAM logs
  • related FTS logs
  • related ifdh logs
  • status of services during the job

This information helps with triage, that is, deciding where reports of the job's failure should go.
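
One way to keep these items together is a single record per failed job, so that whoever does the triage can see at a glance what has been collected and what is still missing. The sketch below is only an illustration: the class name FailureReport, its field names, and the missing_items helper are all assumptions made for this example, not part of SAM, FTS, ifdh, or any existing tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FailureReport:
    """Hypothetical bundle of the items listed above for one failed job."""
    job_id: str
    exit_code: Optional[int] = None        # information on exit codes
    job_log: Optional[str] = None          # path to the job log file
    sam_log: Optional[str] = None          # path to related SAM log excerpt
    fts_log: Optional[str] = None          # path to related FTS log excerpt
    ifdh_log: Optional[str] = None         # path to related ifdh log excerpt
    service_status: Optional[dict] = None  # status of services during the job

    def missing_items(self):
        """Return the names of fields that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

A report with only the exit code and job log filled in would list the SAM, FTS, and ifdh logs and the service status as still missing.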

We also need some level of Black Hole detection: noticing nodes on which a large number of jobs have finished very quickly,
and possibly some way to send tie-up-the-node jobs there or to apply other remediation.
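
A minimal sketch of such a check follows, assuming we already have per-job records (node name and runtime) for some recent time window. The JobRecord type, the function name find_black_holes, and all three thresholds are assumptions chosen for illustration; real values would need tuning against actual job history.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JobRecord:
    node: str         # worker node the job ran on
    runtime_s: float  # wall-clock runtime in seconds


def find_black_holes(jobs, fast_threshold_s=60.0, min_fast_jobs=20,
                     min_fast_fraction=0.9):
    """Return nodes where many jobs, and most jobs, finished very fast.

    All thresholds are illustrative defaults, not recommended values.
    """
    total = Counter()
    fast = Counter()
    for job in jobs:
        total[job.node] += 1
        if job.runtime_s < fast_threshold_s:
            fast[job.node] += 1

    suspects = []
    for node, n_fast in fast.items():
        # Require both a large absolute count and a high fraction, so a
        # busy healthy node with a few short jobs is not flagged.
        if n_fast >= min_fast_jobs and n_fast / total[node] >= min_fast_fraction:
            suspects.append(node)
    return sorted(suspects)
```

The returned node list could then feed whatever remediation is chosen, such as sending tie-up-the-node jobs there.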