Initial monitoring where we have all the services listed, and a summary of what is working and what is not.

Things under maintenance, etc. -- top level.

So second layer, specific to job management.

Mike: Things you want recorded about a job/project, versus general status.

Paola: know whether things are down or degraded before we submit, and know windows where we
should not submit. If jobs fail (where we spend most of our time), is it because samweb
failed? We should be able to see easily that it was out during this job -- i.e., be able to
correlate service outages with job failures.
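The outage-correlation idea above can be sketched in a few lines: given outage windows pulled from the monitoring system and failure timestamps from the job database, flag which failures overlap an outage. The service names, job IDs, and timestamps below are made up for illustration.

```python
# Sketch: map each failed job to the services that were down when it failed.
# Assumes outage windows (service, start, end) come from the monitoring
# system and failure timestamps from the job database; data is illustrative.
from datetime import datetime

outages = [
    ("samweb", datetime(2013, 5, 1, 10, 0), datetime(2013, 5, 1, 11, 30)),
]
failed_jobs = [
    ("job_001", datetime(2013, 5, 1, 10, 45)),
    ("job_002", datetime(2013, 5, 1, 14, 0)),
]

def correlate(failures, outages):
    """For each failed job, list the services that were out at failure time."""
    report = {}
    for job, when in failures:
        report[job] = [svc for svc, start, end in outages
                       if start <= when <= end]
    return report

print(correlate(failed_jobs, outages))
```

Here `job_001` falls inside the samweb outage window and `job_002` does not, which is exactly the "retry vs. debug" split discussed below.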

Anna -- is it a problem with the service, or a problem with the code/job itself?

Main decision tree is experiment vs services.

Also, opening job logs to diagnose them is the first step. Can we distinguish errors from experiment
code from service failures? -- highlight known error messages? Correlate with service log snippets?

We know when a job fails via the samweb page. Then we need to look at the log -- retrieve the tail of the log.
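Retrieving the tail could look something like the sketch below: read backwards in blocks from the end of the file so we never pull a multi-gigabyte log over the wire just to see the last 50 lines. Path and line count are illustrative.

```python
# Sketch: return the last n lines of a log without reading the whole file.
import os

def tail(path, n=50):
    """Read blocks backwards from the end until we have n lines."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        data = b""
        # Keep prepending 4 KB blocks until enough newlines are buffered.
        while size > 0 and data.count(b"\n") <= n:
            step = min(4096, size)
            size -= step
            f.seek(size)
            data = f.read(step) + data
        return data.decode(errors="replace").splitlines()[-n:]
```

Usage would be `tail("/path/to/job.log", 50)`; the same function covers the "last 10 or 50 lines or ..." idea in the notes below.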

How to turn up debug level on a job, if it is failing, to get more information.

Mechanism to re-run a job with assorted debug cranked up.

Discussion of error log levels, and useless debug messages :-)

Log analysis scripts(?)

Nova has a different Monte Carlo workflow than I was expecting -- for keepup
we select files generated in the detector; Monte Carlo goes all the way
through from the .fcl to reconstruction output.

Combining / breaking up workflows?

Give experiments information on how to instrument their code/jobs so they
can produce logs we can make sense of.
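One way experiments could instrument their jobs (an illustrative convention, not an agreed format) is to emit one recognizable marker line per phase, so a log scanner can tell how far a job got before it died -- this also matches the "phase 1 complete, phase 2 complete" idea in the notes below.

```python
# Sketch: emit greppable phase markers from experiment job code.
import logging
import sys

logging.basicConfig(stream=sys.stdout, format="%(asctime)s %(message)s")
log = logging.getLogger("job")
log.setLevel(logging.INFO)

def phase_marker(name, status="complete"):
    """Build the marker line; the fixed 'PHASE' prefix is easy to grep for."""
    return "PHASE %s %s" % (name, status)

def mark_phase(name, status="complete"):
    log.info(phase_marker(name, status))

mark_phase("simulation")
mark_phase("reconstruction", "started")
```

A fixed prefix like `PHASE` keeps the scanner trivial (`grep '^.*PHASE '`) without dictating anything about the experiments' other log output.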

Failed jobs -- when you have a mass production of more than 20000 files, it is really hard
to do condor_q -l and get information about why they failed. We want:
  • Held reason
  • last remote host
  • exit status code
    Other info on where it ran, and on the job:
  • cpu usage, wall clock, number of files in, (and size), number of files out(and size)...
  • facility where the job ran (so we can say Wisconsin is great and Purdue isn't)
  • type of processor
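Instead of eyeballing 20000 full `condor_q -l` ads, the system could reduce each ad to just the attributes listed above. The sketch below parses ClassAd-style `Name = value` text; the attribute names used are the usual HTCondor ones (`HoldReason`, `LastRemoteHost`, `ExitCode`, ...) but should be checked against the local pool.

```python
# Sketch: reduce one job's long-form ClassAd to the few attributes we want.
WANTED = {"HoldReason", "LastRemoteHost", "ExitCode",
          "RemoteUserCpu", "RemoteWallClockTime"}

def summarize_ad(ad_text):
    """Pull the wanted 'Name = value' attributes out of condor_q -l output."""
    summary = {}
    for line in ad_text.splitlines():
        if " = " not in line:
            continue
        name, _, value = line.partition(" = ")
        name = name.strip()
        if name in WANTED:
            summary[name] = value.strip().strip('"')
    return summary

# Made-up ad fragment for illustration.
sample = '''\
JobStatus = 5
HoldReason = "Error from slot1@node42: out of disk"
LastRemoteHost = "slot1@node42.example.edu"
ExitCode = 1
'''
print(summarize_ad(sample))
```

Run over all ads, this gives a table of hold reasons and exit codes that can be grouped by site or host rather than read job by job.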

Want monitoring data about services, etc., fed into the system from other existing
monitoring systems, rather than having it go collect the data itself.

Verifying things are in place to run the job:

Monitor screen before submitting a job -- is everything up? Are our services okay?

While the job is running -- how is it doing?

Post-job -- how did it run?

When they submit a request -- important information needed, etc.

Other Notes

  • Have an announcement block for broadcasting a message to the team.
  • If a service is not working, one should be able to find out when the service stopped working, and for how long.
  • Most of the OPG's time is spent looking into logs of failed jobs to find out what is wrong.
  • If there is a service problem, then you retry, else it's a code problem and you debug.
  • Provide a mechanism to access logs quickly if a job fails. Just retrieve the last 10 or 50 lines or ...
  • Have the experiments print out statements in their jobs, like phase 1 complete, phase 2 complete, and so on. This will make debugging easier; however, it could be a can of worms.
  • Record in the database the stats of a job, in regards to successes and failures.
  • Perhaps have the system monitor time spent on the grid so wasted hours could be calculated. If it's granular enough, you can find out what job is wasting time.
  • Be able to calculate cpu usage, and wall clock time per stage of a job. Possible stages are reconstructions, and merges.
  • Have the experiments provide a log file plugin, which will determine exactly what went wrong with their job. If they don't provide a plugin, then the OPG will just tell them their job exited with code 1. Diagnosing why the jobs failed takes a lot of time.
  • Main high level view of this app is to automate the current process with scalability in mind.
  • Have a way to measure job performance. Was the job using an AMD processor? What facility was the job running in? What IP addresses? What time window: daytime, lunchtime, nighttime?
  • Perhaps have the job post a status code to the server indicating whether a phase completed successfully, instead of having to scan the logs.
  • Have some data quality monitoring (after the job has finished).
  • Paola sent an email out to everyone with a working draft of what a request form should contain. Requests are currently entered in SNOW. Need to know all the ingredients to do the job ahead of time.
  • There needs to be an initialization workflow in this system; an approval process. OPG can then properly determine the priorities.
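The experiment-provided log plugin mentioned above could be as simple as a registry of per-experiment functions that inspect a failed job's log and return a human-readable diagnosis, falling back to the bare exit code when no plugin exists. The registry shape, experiment names, and error strings below are illustrative, not a real API.

```python
# Sketch: per-experiment log-diagnosis plugins with an exit-code fallback.
PLUGINS = {}

def register(experiment):
    """Decorator: register a diagnosis function for one experiment."""
    def wrap(func):
        PLUGINS[experiment] = func
        return func
    return wrap

def diagnose(experiment, log_text, exit_code):
    """Use the experiment's plugin if present; otherwise report the exit code."""
    plugin = PLUGINS.get(experiment)
    fallback = "job exited with code %d" % exit_code  # all OPG can say today
    if plugin is None:
        return fallback
    return plugin(log_text) or fallback

@register("nova")
def nova_diagnose(log_text):
    # A real plugin would encode the experiment's known error messages.
    if "Could not open input file" in log_text:
        return "input file missing or unreadable"
    return None
```

With this shape, `diagnose("nova", log, 1)` can report a specific cause, while an experiment without a plugin still gets the generic "exited with code 1" answer the notes describe.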