CMSReq2015-03-20

Jen presented the GUI for the CMS production monitoring.

  • Lots of good GUI ideas.
  • Reporting:
    • Failure rate by site/workflow/campaign
    • Summarized log bits
    • Finding "black holes"
  • They have 24/7 coverage, we won't...
    • 48 hour history?
    • acknowledge alarms?
    • other ideas?
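
A minimal sketch of the "failure rate by site/workflow/campaign" report idea from the list above. The record fields (`site`, `workflow`, `status`) and the 0.9 threshold are assumptions for illustration, not taken from WMStats, and "black hole" is assumed here to mean a site that fails nearly every job it receives.

```python
from collections import defaultdict

def failure_rates(jobs, key):
    """Return {group: failed / total} for job records grouped by `key`."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for job in jobs:
        group = job[key]
        totals[group] += 1
        if job["status"] == "failed":
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}

# Hypothetical job records; the field names are illustrative only.
jobs = [
    {"site": "site_a", "workflow": "wf1", "status": "success"},
    {"site": "site_a", "workflow": "wf1", "status": "failed"},
    {"site": "site_b", "workflow": "wf1", "status": "failed"},
    {"site": "site_b", "workflow": "wf2", "status": "failed"},
]

by_site = failure_rates(jobs, "site")
by_workflow = failure_rates(jobs, "workflow")

# Assumed threshold: flag a site as a "black hole" candidate
# when at least 90% of its jobs fail.
black_holes = [site for site, rate in by_site.items() if rate >= 0.9]
```

The same function could group by campaign if each record carried a campaign field.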

Other Notes:

WM Stats Monitoring System

You can filter by workflow, among other attributes. What they call a workflow is what we call a project.

The URL: cmsweb.cern.ch/wmstats/index.html

Keeps track of:
-workflow
-status
  -new
  -assignment-approved
  -running-closed
  -closed-out
-created
-queued
-pending
-running
-success
-all workflows
-all jobs that are running

Error Types:
-site error
-data error
-software error

Agents:
Are all their machines up? Machines have statuses.

Search:
-request name
-output dataset
-input dataset
-prep id
-date range

Campaign:
-what software package is being used
-failure rate

You can access information from the condor logs via the web interface.
-software exit codes
-condor exit codes
-what input caused the failure
-details of the error
-type of error

Exit Codes:
-8001: product not found
-https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes

Various histograms about the jobs like efficiency rate.

Most of the offline production is done onsite at Fermilab. (Paola)

A workflow has multiple steps that are daisy-chained together.

Is the failure recoverable? If so, retry, keeping count of the attempts.
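
One way the retry-and-count idea might look, as a rough sketch. `RecoverableError`, `run_with_retries`, and the default of 3 attempts are hypothetical names and values chosen for illustration; how a failure is actually judged recoverable was not covered in the meeting.

```python
class RecoverableError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def run_with_retries(step, max_attempts=3):
    """Run `step`, retrying recoverable failures.

    Returns (result, attempts). Re-raises the last RecoverableError
    once `max_attempts` is reached; other exceptions propagate at once.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return step(), attempts
        except RecoverableError:
            if attempts >= max_attempts:
                raise

# Illustrative step that fails twice, then succeeds.
calls = {"n": 0}

def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError("transient problem")
    return "done"

result, attempts = run_with_retries(flaky_step)
```

Returning the attempt count alongside the result keeps the retry history available for reporting.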

Our application needs to be generic enough for multiple experiments.

They store logs in CouchDB.
-exit codes
-site it's failing on

In the filter:
-type
  -Monte Carlo request

Physics people create the request. Jen does not know how they do this.

Their jobs should run for 8 hours. Jobs that run for more than 48 hours are killed.
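
The two runtime figures above could be checked along these lines. This is only a sketch: the function name and the intermediate "overdue" label are assumptions, and only the 8-hour and 48-hour values come from the notes.

```python
from datetime import timedelta

EXPECTED_RUNTIME = timedelta(hours=8)   # jobs should run about this long
KILL_AFTER = timedelta(hours=48)        # jobs running longer are killed

def classify_runtime(elapsed):
    """Classify a running job by how long it has been running."""
    if elapsed > KILL_AFTER:
        return "kill"
    if elapsed > EXPECTED_RUNTIME:
        return "overdue"  # assumed label: past expected runtime, not yet killed
    return "ok"
```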

I recently emailed Oliver to request account access.
These are other people to consider:
-Dave Mason
-Seangchen on floor 11 //there is a test environment somewhere
-Juan & Jorge //have access to this tool

Because this isn't a 24/7 operation, we need more than just a current-status monitor. We need a status history, maybe 3 days?

A failed job can be in one of two states: failed or acknowledged-failed.
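
A small sketch of how the failed / acknowledged-failed states and a status history window might fit together. The record layout, function names, and the choice to hide failures older than the window are assumptions; the 3-day window is the tentative figure from these notes.

```python
from datetime import datetime, timedelta

HISTORY_WINDOW = timedelta(days=3)  # tentative value from the notes

def acknowledge(job):
    """Move a failed job to acknowledged-failed; leave other states alone."""
    if job["state"] == "failed":
        job["state"] = "acknowledged-failed"
    return job

def needs_attention(jobs, now):
    """Unacknowledged failures that finished inside the history window."""
    cutoff = now - HISTORY_WINDOW
    return [j for j in jobs if j["state"] == "failed" and j["finished"] >= cutoff]

# Hypothetical job records for illustration.
now = datetime(2015, 3, 20, 12, 0)
jobs = [
    {"id": 1, "state": "failed", "finished": now - timedelta(hours=30)},
    {"id": 2, "state": "failed", "finished": now - timedelta(days=5)},
    {"id": 3, "state": "success", "finished": now - timedelta(hours=2)},
]

pending_ids = [j["id"] for j in needs_attention(jobs, now)]
acknowledge(jobs[0])
pending_after_ack = needs_attention(jobs, now)
```

With this layout, someone returning after a night or weekend sees only the failures nobody has acknowledged yet.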

The requesters should give the offline production group the error codes so that failed jobs are easier to debug.