Classification of glidein failure modes
Currently, the error messages in case of a glidein failure are only (barely) human readable.
This makes it impossible to classify errors in an automated way, thus requiring human intervention to solve the problem.
We should make them machine readable, so that automated actions can be taken.
This would greatly reduce the cost of operating a factory.
#2 Updated by Igor Sfiligoi over 8 years ago
The proposed solution is to be based on the OSG proposed standard:
#3 Updated by Igor Sfiligoi over 8 years ago
To allow for automated feedback, we need standardised error reasons. This is what I currently envision:
- Config - e.g. Impossible combinations
- Corruption - e.g. SHA1 check failed
- WN Resource - e.g. Disk full or glexec not found
- Network - e.g. Cannot talk to VO Collector
- VO Proxy - e.g. Proxy too short
- VO Data - e.g. VO SW not installed
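For automated classification, the reasons above could be held as a fixed set that consumers validate against. This is only a minimal sketch: the names `FailureCategory` and `parse_category` are made up for illustration and are not part of glideinWMS.

```python
from enum import Enum


class FailureCategory(Enum):
    """Standardised failure reasons, copied from the list above."""
    CONFIG = "Config"
    CORRUPTION = "Corruption"
    WN_RESOURCE = "WN Resource"
    NETWORK = "Network"
    VO_PROXY = "VO Proxy"
    VO_DATA = "VO Data"


def parse_category(reason):
    """Return the matching FailureCategory, or None if the reason is unknown.

    Hypothetical helper: an unknown reason is reported as None so that the
    caller can fall back to human inspection.
    """
    try:
        return FailureCategory(reason.strip())
    except ValueError:
        return None
```

Returning None for an unrecognised reason keeps the set closed: anything outside the agreed list is not silently treated as a known category.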
#4 Updated by Igor Sfiligoi over 8 years ago
A couple of examples.
<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_disk" version="7.5.4">
  <result>
    <status>OK</status>
    <metric name="diskspace" ts="2012-01-12T15:02:20" uri="local">/tmp/glidein_15432/</metric>
  </result>
  <detail>Enough disk space found.</detail>
</OSGTestResult>

<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_proxy" version="7.5.4">
  <result>
    <status>FAILED</status>
    <metric name="failure" ts="..." uri="local">VO Proxy</metric>
    <metric name="proxy" ts="2012-01-12T15:02:21" uri="local">/tmp/glidein_15432/proxy/a.proxy</metric>
  </result>
  <detail>Proxy had less than 12h left.</detail>
</OSGTestResult>
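To show how such a result could be consumed automatically, here is a rough sketch that pulls the status, the failure reason and the detail out of one document. It assumes the layout of the two examples above; `summarize_result` and `SAMPLE` are illustrative names, not existing glideinWMS code.

```python
import xml.etree.ElementTree as ET

# The failed example from above, reproduced verbatim (including the elided ts).
SAMPLE = """<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_proxy" version="7.5.4">
  <result>
    <status>FAILED</status>
    <metric name="failure" ts="..." uri="local">VO Proxy</metric>
    <metric name="proxy" ts="2012-01-12T15:02:21" uri="local">/tmp/glidein_15432/proxy/a.proxy</metric>
  </result>
  <detail>Proxy had less than 12h left.</detail>
</OSGTestResult>"""


def summarize_result(xml_text):
    """Extract the machine-readable fields from one OSGTestResult document."""
    # Strip first: the XML declaration must be the very first thing parsed.
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    if result is None:
        raise ValueError("no <result> element found")
    # Index the metrics by their name attribute, e.g. "failure" or "proxy".
    metrics = {m.get("name"): (m.text or "").strip()
               for m in result.findall("metric")}
    return {
        "test_id": root.get("id"),
        "status": (result.findtext("status") or "").strip(),
        "failure": metrics.get("failure"),  # None when the test passed
        "detail": (root.findtext("detail") or "").strip(),
    }
```

Calling `summarize_result(SAMPLE)` yields a dictionary whose `failure` entry carries the standardised reason, which is the field an automated action would key on.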
#6 Updated by Igor Sfiligoi over 8 years ago
- Status changed from New to Feedback
- % Done changed from 0 to 50
Jessica has finished working for UCSD.
I have committed her work in
I expect the code to need a bit of polishing before being merged into the production tree, but it is supposed to work.
#9 Updated by Igor Sfiligoi over 8 years ago
Tony complained that error_gen was requesting data via a file.
This is now fixed; all input is now passed through arguments.
All the callers have been modified.
I have also cleaned up the code a bit.
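As an illustration of passing all input through arguments, a generator for the result format might look roughly like the sketch below. It is not the actual error_gen code: the option names, `build_result` and `main` are hypothetical, and the `ts` timestamps seen in the examples are left out for brevity.

```python
import argparse
import xml.etree.ElementTree as ET

# Failure reasons from the proposed list in this ticket.
REASONS = ["Config", "Corruption", "WN Resource", "Network", "VO Proxy", "VO Data"]


def build_result(test_id, version, status, detail, failure=None):
    """Build an OSGTestResult document as a string (no ts attributes)."""
    root = ET.Element("OSGTestResult", {"id": test_id, "version": version})
    result = ET.SubElement(root, "result")
    ET.SubElement(result, "status").text = status
    if failure is not None:
        metric = ET.SubElement(result, "metric",
                               {"name": "failure", "uri": "local"})
        metric.text = failure
    ET.SubElement(root, "detail").text = detail
    return ET.tostring(root, encoding="unicode")


def main(argv=None):
    """Hypothetical command line: every input arrives as an argument."""
    parser = argparse.ArgumentParser(description="Emit a test result document")
    parser.add_argument("--test-id", required=True)
    parser.add_argument("--version", required=True)
    parser.add_argument("--status", required=True, choices=["OK", "FAILED"])
    parser.add_argument("--detail", required=True)
    parser.add_argument("--failure", choices=REASONS, default=None)
    args = parser.parse_args(argv)
    print(build_result(args.test_id, args.version, args.status,
                       args.detail, args.failure))
    return 0
```

For instance, `main(["--test-id", "glideinWMS.check_proxy", "--version", "7.5.4", "--status", "FAILED", "--failure", "VO Proxy", "--detail", "Proxy had less than 12h left."])` prints one document without reading any intermediate file.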
This was done with 3 commits: