Jen presented the GUI for CMS production monitoring.
- Lots of good GUI ideas.
- Failure rate by site/workflow/campaign
- summarized log bits
- finding "black holes"
- They have 24/7 coverage, we won't...
- 48 hour history?
- acknowledge alarms?
- other ideas?
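A rough sketch of the failure-rate and "black hole" ideas above, assuming job records are available as (site, workflow, succeeded) tuples. The record layout, function names, and thresholds are hypothetical and not taken from the CMS tool. "Black hole" is taken here to mean a site that receives a meaningful number of jobs and fails nearly all of them.

```python
from collections import defaultdict

def failure_rates(jobs, key_index=0):
    """Return {key: failed / total}, grouped by field (0 = site, 1 = workflow)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for job in jobs:
        key = job[key_index]
        totals[key] += 1
        if not job[2]:  # job[2] is the hypothetical "succeeded" flag
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}

def black_holes(jobs, min_jobs=3, min_failure_rate=0.9):
    """Sites with at least min_jobs jobs and a failure rate at or above the threshold."""
    counts = defaultdict(int)
    for job in jobs:
        counts[job[0]] += 1
    rates = failure_rates(jobs, key_index=0)
    return sorted(site for site, rate in rates.items()
                  if counts[site] >= min_jobs and rate >= min_failure_rate)

# Made-up sample data, for illustration only.
jobs = [
    ("site_a", "wf1", True), ("site_a", "wf1", True), ("site_a", "wf2", False),
    ("site_b", "wf1", False), ("site_b", "wf2", False), ("site_b", "wf2", False),
]
site_rates = failure_rates(jobs)
suspects = black_holes(jobs)
```

The same grouping would work per campaign if the records carried a campaign field.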
WM Stats Monitoring System
You can filter on workflows, among other attributes. What they call a workflow is what we call a project.
The URL: cmsweb.cern.ch/wmstats/index.html
Keeps track of:
-all workflows
-all jobs that are running
Are all their machines up? Machines have statuses.
-what software package is being used
You can access information from the condor logs via the web interface.
-software exit codes
-condor exit codes
-what input caused the failure
-details of the error
-type of error
-8001: product not found
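One possible way to produce "summarized log bits" from the exit codes listed above: count how often each code occurs and attach a description where one is known. This is only a sketch. The function name is hypothetical, and the lookup table contains just the one code mentioned in these notes.

```python
from collections import Counter

# Only the code named in the notes; anything else is reported as "unknown".
KNOWN_EXIT_CODES = {8001: "product not found"}

def summarize_exit_codes(exit_codes):
    """Return (code, count, description) tuples, most frequent code first."""
    counts = Counter(exit_codes)
    return [(code, n, KNOWN_EXIT_CODES.get(code, "unknown"))
            for code, n in counts.most_common()]

# Made-up sample of exit codes from failed jobs.
summary = summarize_exit_codes([8001, 8001, 1, 8001, 1, 50])
```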
Various histograms about the jobs like efficiency rate.
Most of the offline production is done onsite at Fermilab. (Paola)
A workflow has multiple steps that are daisy-chained together.
Recoverable failure? If so, retry and keep count.
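A minimal sketch of the retry-and-count idea, assuming a step is a Python callable and that a caller-supplied predicate decides whether a failure is recoverable. The names and the default retry limit are hypothetical; how the CMS system actually classifies failures was not covered.

```python
def run_with_retries(task, is_recoverable, max_retries=3):
    """Run task(); retry on recoverable errors, up to max_retries retries.

    Returns (result, attempts). An unrecoverable error, or a recoverable
    error once the retries are used up, is re-raised to the caller.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception as err:
            if not is_recoverable(err) or attempts > max_retries:
                raise

# Example: a step that fails twice with a transient error, then succeeds.
calls = {"n": 0}

def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "done"

result, attempts = run_with_retries(
    flaky_step, lambda err: isinstance(err, TimeoutError))
```

Returning the attempt count alongside the result keeps the retry history available for the monitoring display.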
Our application needs to be generic enough for multiple experiments.
They store logs in CouchDB.
-site it's failing on
In the filter:
-Monte Carlo request
Physics people create the request. Jen does not know how they do this.
Their jobs should run for 8 hours. If over 48 hours they kill them.
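The two runtime limits could be checked along these lines. The 8-hour and 48-hour values come from the notes; the function name and the "ok" / "overdue" / "kill" labels are made up for illustration.

```python
from datetime import timedelta

EXPECTED_RUNTIME = timedelta(hours=8)  # jobs should run for about 8 hours
KILL_AFTER = timedelta(hours=48)       # jobs over 48 hours get killed

def classify_runtime(elapsed):
    """Classify a job by how long it has been running."""
    if elapsed > KILL_AFTER:
        return "kill"
    if elapsed > EXPECTED_RUNTIME:
        return "overdue"
    return "ok"

status = classify_runtime(timedelta(hours=20))
```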
I recently emailed Oliver to request account access.
These are other people to consider:
-Seangchen on floor 11 //there is a test environment somewhere
-Juan & Jorge //have access to this tool
Because this isn't a 24/7 operation, we need more than just a current status monitor. We need historical statuses, maybe 3 days?
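One way to keep a rolling status history is sketched below, assuming statuses are recorded in time order and that 3 days is the retention window. The class and method names are hypothetical, and a real version would persist to a database instead of memory.

```python
from collections import deque
from datetime import datetime, timedelta

class StatusHistory:
    """Keep (timestamp, status) entries no older than the retention window."""

    def __init__(self, retention=timedelta(days=3)):
        self.retention = retention
        self._entries = deque()  # oldest entry first

    def record(self, timestamp, status):
        """Append a status and drop entries older than the window."""
        self._entries.append((timestamp, status))
        cutoff = timestamp - self.retention
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()

    def entries(self):
        return list(self._entries)

# Made-up timestamps, for illustration only.
start = datetime(2024, 1, 1, 9, 0)
history = StatusHistory()
history.record(start, "running")
history.record(start + timedelta(days=2), "failed")
history.record(start + timedelta(days=4), "running")
kept = history.entries()
```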
A failed job can be in one of two states: failed or acknowledged-failed.
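The two failure states could be modeled roughly as follows. The state names follow the notes; the enum, the helper functions, and the job IDs are hypothetical.

```python
from enum import Enum

class JobState(Enum):
    FAILED = "failed"
    ACKNOWLEDGED_FAILED = "acknowledged-failed"

def acknowledge(state):
    """Move a failed job to acknowledged-failed; leave other states alone."""
    if state is JobState.FAILED:
        return JobState.ACKNOWLEDGED_FAILED
    return state

def needs_attention(states):
    """Return the IDs of jobs that failed and have not been acknowledged."""
    return [job_id for job_id, state in states.items()
            if state is JobState.FAILED]

# Made-up job IDs, for illustration only.
states = {"job1": JobState.FAILED, "job2": JobState.FAILED}
states["job1"] = acknowledge(states["job1"])
pending = needs_attention(states)
```

Separating the two states lets the monitor show only unacknowledged failures to whoever picks up the next shift.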
The requesters should give the offline production group the error codes so that debugging failed jobs is simpler.