CMSReq2015-03-20
Jen presented the GUI for the CMS production monitoring.
- Lots of good GUI ideas.
- Reporting:
- Failure rate by site/workflow/campaign
- summarized log bits
- finding "black holes"
- They have 24/7 coverage; we won't...
- 48 hour history?
- acknowledge alarms?
- other ideas?
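A rough sketch of the failure-rate and "black hole" reports listed above. The record fields (site, workflow, campaign, status), the sample data, and the min_jobs/threshold cutoffs are all assumptions for illustration, not the WMStats data model. It also assumes a "black hole" means a site where nearly every job fails.

```python
from collections import defaultdict

def failure_rates(jobs, key):
    """Return {group: failed / total} for job records grouped by `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for job in jobs:
        group = job[key]
        totals[group] += 1
        if job["status"] == "failed":
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}

def black_holes(jobs, min_jobs=3, threshold=0.9):
    """Sites with at least `min_jobs` jobs and a failure rate >= `threshold`."""
    counts = defaultdict(int)
    for job in jobs:
        counts[job["site"]] += 1
    rates = failure_rates(jobs, "site")
    return sorted(
        site for site, rate in rates.items()
        if counts[site] >= min_jobs and rate >= threshold
    )

# Hypothetical sample records, not real production data.
jobs = [
    {"site": "SiteA", "workflow": "wf1", "campaign": "c1", "status": "success"},
    {"site": "SiteA", "workflow": "wf1", "campaign": "c1", "status": "failed"},
    {"site": "SiteB", "workflow": "wf1", "campaign": "c1", "status": "failed"},
    {"site": "SiteB", "workflow": "wf2", "campaign": "c1", "status": "failed"},
    {"site": "SiteB", "workflow": "wf2", "campaign": "c1", "status": "failed"},
]

print(failure_rates(jobs, "site"))
print(black_holes(jobs))
```

The same failure_rates call works with key="workflow" or key="campaign", which keeps the report generic across groupings.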
Other Notes:
WM Stats Monitoring System
You can filter by workflow, among other attributes. What they call a workflow is what we call a project.
The URL: cmsweb.cern.ch/wmstats/index.html
Keeps track of:
-workflow
-status
new
assignment-approved
running-closed
closed-out
-created
-queued
-pending
-running
-success
-all workflows
-all jobs that are running
Error Types:
-site error
-data error
-software error
Agents:
Are all their machines up? Machines have statuses.
Search:
-request name
-output dataset
-input dataset
-prep id
-date range
Campaign:
-what software package is being used
-failure rate
You can access information from the condor logs via the web interface.
-software exit codes
-condor exit codes
-what input caused the failure
-details of the error
-type of error
Exit Codes:
-8001: product not found
-https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes
Various histograms about the jobs like efficiency rate.
Most of the offline production is done onsite at Fermilab. (Paola)
A workflow has multiple steps that are daisy-chained together.
Recoverable failure? If so, retry and keep a count of the attempts.
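One possible shape for the retry-and-count idea, as a sketch only. RecoverableError, run_with_retries, and the max_retries limit are hypothetical names and values; the notes do not say how CMS decides that a failure is recoverable or how many retries they allow.

```python
class RecoverableError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def run_with_retries(step, max_retries=3):
    """Call step(); retry on RecoverableError. Returns (result, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return step(), attempts
        except RecoverableError:
            # Give up once the initial attempt plus max_retries have all failed.
            if attempts > max_retries:
                raise

# Hypothetical step that fails twice, then succeeds.
calls = {"n": 0}

def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError("transient site error")
    return "done"

result, attempts = run_with_retries(flaky_step)
print(result, attempts)
```

Returning the attempt count with the result lets the monitor report how many tries a step needed, not just whether it finally succeeded.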
Our application needs to be generic enough for multiple experiments.
They store logs in CouchDB.
-exit codes
-site it's failing on
In the filter:
type
-monte carlo request
Physics people create the request. Jen does not know how they do this.
Their jobs should run for 8 hours. If a job runs over 48 hours, they kill it.
I recently sent an email to Oliver to request account access.
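A small sketch of how the 8 hour and 48 hour limits could be checked. The function name and the "ok"/"overdue"/"kill" labels are made up for illustration; only the two durations come from the notes.

```python
from datetime import datetime, timedelta

EXPECTED_RUNTIME = timedelta(hours=8)   # from the notes: jobs should run for 8 hours
KILL_AFTER = timedelta(hours=48)        # from the notes: killed if over 48 hours

def classify_runtime(started, now):
    """Return 'kill', 'overdue', or 'ok' based on elapsed wall-clock time."""
    elapsed = now - started
    if elapsed > KILL_AFTER:
        return "kill"
    if elapsed > EXPECTED_RUNTIME:
        return "overdue"
    return "ok"

now = datetime(2015, 3, 20, 12, 0)
print(classify_runtime(now - timedelta(hours=2), now))
print(classify_runtime(now - timedelta(hours=10), now))
print(classify_runtime(now - timedelta(hours=50), now))
```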
These are other people to consider:
-Dave Mason
-Seangchen on floor 11 //there is a test environment somewhere
-Juan & Jorge //have access to this tool
Because this isn't a 24/7 operation, we need more than just a current status monitor. We need historical statuses, maybe 3 days?
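A minimal sketch of a history view, assuming status events are stored with a timestamp. The event fields and the 3-day default are placeholders; the window length is still an open question.

```python
from datetime import datetime, timedelta

def recent_history(events, now, window=timedelta(days=3)):
    """Return the events no older than `window`, oldest first."""
    cutoff = now - window
    kept = [event for event in events if event["time"] >= cutoff]
    return sorted(kept, key=lambda event: event["time"])

# Hypothetical status events.
now = datetime(2015, 3, 20, 9, 0)
events = [
    {"time": now - timedelta(days=5), "project": "p1", "status": "failed"},
    {"time": now - timedelta(hours=30), "project": "p1", "status": "running"},
    {"time": now - timedelta(days=2), "project": "p2", "status": "failed"},
]

for event in recent_history(events, now):
    print(event["time"], event["project"], event["status"])
```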
If a job failed, its state can be either failed or acknowledged-failed.
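The two failure states could be handled roughly like this. The state strings come from the notes; the job record layout and function names are assumptions.

```python
def acknowledge(job):
    """Return a copy of the job marked acknowledged-failed if it was failed."""
    if job["state"] == "failed":
        return {**job, "state": "acknowledged-failed"}
    return job

def unacknowledged(jobs):
    """Failed jobs that nobody has acknowledged yet."""
    return [job for job in jobs if job["state"] == "failed"]

# Hypothetical job records.
jobs = [
    {"id": 1, "state": "failed"},
    {"id": 2, "state": "running"},
    {"id": 3, "state": "failed"},
]

jobs[0] = acknowledge(jobs[0])
print([job["id"] for job in unacknowledged(jobs)])
```

With this split, an alarm view only needs to show the unacknowledged failures, which matters when nobody is watching overnight.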
The requesters should give the offline production group the error codes so debugging failed jobs can be simpler.