Feature #2455

Classification of glidein failure modes

Added by Igor Sfiligoi almost 8 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assignee: Igor Sfiligoi
Category: -
Target version:
Start date: 09/14/2012
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Spent time: 4.00 h (Total: 22.00 h)
Stakeholders:
Duration:

Description

Currently, the error messages in case of a glidein failure are only (barely) human readable.
This makes it impossible to classify errors in an automated way, thus requiring human intervention to solve the problem.

We should make them machine readable, so we can have automated actions taken.
This would greatly reduce the cost of operating a factory.


Subtasks

Bug #2894: Document the feature developed in #2455 (Closed, Igor Sfiligoi)

Bug #3002: Finish documenting the feature from #2455 (Assigned)

Bug #2904: Fix remaining issues with #2455 (Closed, Anthony Tiradani)

Feature #2964: Cleanly separate the XML handling code in glidein_startup.sh (New)

History

#1 Updated by Igor Sfiligoi almost 8 years ago

  • Assignee set to Igor Sfiligoi

Being taken care of by one of Igor's students, Jessica.

#2 Updated by Igor Sfiligoi almost 8 years ago

The proposed solution will be based on the standard proposed by OSG:
https://twiki.grid.iu.edu/bin/view/SoftwareTools/CommonTestFormat#Alain_s_proposal_Version_4_evolu

#3 Updated by Igor Sfiligoi almost 8 years ago

To allow for automated feedback, we need standardised error reasons.

This is what I currently envision:
  • Config - e.g. Impossible combinations
  • Corruption - e.g. SHA1 check failed
  • WN Resource - e.g. Disk full or glexec not found
  • Network - e.g. Cannot talk to VO Collector
  • VO Proxy - e.g. Proxy too short
  • VO Data - e.g. VO SW not installed
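
As a minimal sketch of how these categories could drive automated handling, the Python below treats the six reason strings as a closed set and tallies failures per category. The names `FailureReason`, `classify` and `tally` are hypothetical and are not part of glideinWMS; only the category strings come from the list above.

```python
from collections import Counter
from enum import Enum


class FailureReason(Enum):
    """The standardised failure categories proposed in this ticket."""
    CONFIG = "Config"
    CORRUPTION = "Corruption"
    WN_RESOURCE = "WN Resource"
    NETWORK = "Network"
    VO_PROXY = "VO Proxy"
    VO_DATA = "VO Data"


def classify(reason_text):
    """Return the matching FailureReason, or None for an unknown string."""
    try:
        return FailureReason(reason_text.strip())
    except ValueError:
        return None


def tally(reason_texts):
    """Count failures per category; unknown strings are counted under None."""
    return Counter(classify(text) for text in reason_texts)
```

Keeping the set closed means an unrecognised reason is surfaced explicitly (as `None`) instead of being silently treated as a new category.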

#4 Updated by Igor Sfiligoi almost 8 years ago

A couple of examples.

Success:

<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_disk" version="7.5.4">

   <result>
      <status>OK</status>
      <metric name="diskspace" ts="2012-01-12T15:02:20" 
                   uri="local">/tmp/glidein_15432/</metric>
   </result>
   <detail>Enough disk space found.</detail>
</OSGTestResult>

Failure:

<?xml version="1.0"?>
<OSGTestResult id="glideinWMS.check_proxy" version="7.5.4">

   <result>
      <status>FAILED</status>
      <metric name="failure" ts="..." uri="local">VO Proxy</metric>
      <metric name="proxy" ts="2012-01-12T15:02:21" 
                   uri="local">/tmp/glidein_15432/proxy/a.proxy</metric>
   </result>
   <detail>Proxy had less than 12h left.</detail>
</OSGTestResult>
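
To illustrate why this format is machine readable, here is a hedged Python sketch that extracts the status and the failure category from a document shaped like the two examples above. The function name `parse_test_result` and the returned dictionary layout are assumptions made for illustration; only the element and attribute names are taken from the examples.

```python
import xml.etree.ElementTree as ET


def parse_test_result(xml_text):
    """Extract id, status, failure reason and detail from one OSGTestResult."""
    # Strip surrounding whitespace so the XML declaration is at the very start.
    root = ET.fromstring(xml_text.strip())
    failure = None
    for metric in root.findall("result/metric"):
        # In the examples above, the metric named "failure" carries the category.
        if metric.get("name") == "failure":
            failure = (metric.text or "").strip()
    return {
        "id": root.get("id"),
        "status": root.findtext("result/status"),
        "failure": failure,  # None when the test succeeded
        "detail": root.findtext("detail"),
    }
```

A consumer could then branch on `status` and `failure` without parsing the free-text `detail` at all.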

#5 Updated by Burt Holzman over 7 years ago

  • Target version set to v3_1

#6 Updated by Igor Sfiligoi over 7 years ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 50

Jessica has finished working for UCSD.

I have committed her work in
branch_v2_5_5plus_2455

I expect the code to need a bit of polishing before being merged into the production tree, but it is supposed to work.

#7 Updated by Igor Sfiligoi over 7 years ago

  • Assignee changed from Igor Sfiligoi to Burt Holzman

#8 Updated by Burt Holzman over 7 years ago

  • Assignee changed from Burt Holzman to Anthony Tiradani

Tony has volunteered to review and polish this.

#9 Updated by Igor Sfiligoi over 7 years ago

Tony complained that error_gen was requesting data via a file.
This is now fixed; all input is now passed through arguments.
All the callers have been modified.
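
For illustration only, the idea of building the result purely from arguments can be sketched in Python as below. This is not the actual error_gen code, whose interface is not shown in this ticket; `build_test_result` and its parameters are hypothetical, and the XML layout follows the examples in comment #4.

```python
import xml.etree.ElementTree as ET


def build_test_result(test_id, version, status, metrics, detail):
    """Build an OSGTestResult string from arguments only (no input file).

    metrics is a list of (name, ts, uri, value) tuples.
    """
    root = ET.Element("OSGTestResult", {"id": test_id, "version": version})
    result = ET.SubElement(root, "result")
    ET.SubElement(result, "status").text = status
    for name, ts, uri, value in metrics:
        metric = ET.SubElement(
            result, "metric", {"name": name, "ts": ts, "uri": uri}
        )
        metric.text = value
    ET.SubElement(root, "detail").text = detail
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")
```

Passing everything as arguments keeps each caller self-contained: there is no temporary file to create, clean up, or race on.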

I have also cleaned up the code a bit.
This was done with 3 commits:
1a4c00384808bc628b9f41c27ada635d48a842ae
9a6eca1492229ac1642e824216d4e1b4020df1ff
780be07224693a009ef2efb547321b34c2921371

Please re-review.

Igor

#10 Updated by Igor Sfiligoi over 7 years ago

Tony claims all his objections have been addressed.

#11 Updated by Igor Sfiligoi over 7 years ago

  • % Done changed from 50 to 70

Did a final cleanup pass; the code now looks good enough.

I have also created a new branch with a single commit for easier merging into the production branches:
branch_v2_5_5plus_2455_one

#12 Updated by Igor Sfiligoi over 7 years ago

Merged branch_v2_5_5plus_2455_one into both branch_v2plus and HEAD.

#13 Updated by Igor Sfiligoi over 7 years ago

The documentation is still missing.
Any suggestions on where we should place it?

#14 Updated by Parag Mhashilkar over 7 years ago

  • Assignee changed from Anthony Tiradani to Igor Sfiligoi
  • Target version changed from v3_1 to v2_6_1

#15 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Feedback to Assigned

#16 Updated by Parag Mhashilkar over 7 years ago

Maybe the factory troubleshooting page?

#17 Updated by Igor Sfiligoi over 7 years ago

This is not really troubleshooting.
It is about the format all validation scripts are supposed to adhere to.

This applies both to the scripts provided out of the box and to those written by factory and frontend administrators.

#18 Updated by Parag Mhashilkar over 7 years ago

Then the design page seems like the closest place for a link to a new page containing this info. The configuration page could link to it as well, if anything is specific to configuration.

#19 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Assigned to Resolved

Closing this ticket as part of 2.6.1.

Opening a new child issue for the documentation.

#20 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Resolved to Closed

