Feature #2454

Advertise classad in case of glidein failure

Added by Igor Sfiligoi about 8 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee: Igor Sfiligoi
Category: -
Target version:
Start date: 04/18/2013
Due date:
% Done: 50%
Estimated time: (Total: 0.00 h)
Stakeholders:
Duration:

Description

Currently, when a glidein fails, the VO is completely unaware of it.

The proposal is to instead advertise an "error" ClassAd to the VO collector, so the VO can monitor what is happening.


Subtasks

Feature #3615: Add failed glidein monitoring to the Frontend (Closed, Parag Mhashilkar)

Feature #6869: Frontend should log the failed classad stats (Closed, Igor Sfiligoi)

Idea #3722: Improve debug classad implementation in advertise_failure.helper (New, Parag Mhashilkar)

Feature #4435: Advertize ClassAd if factory removes held glidein (New)

Bug #4657: Error classads still not working in v3_1 (Closed, Igor Sfiligoi)


Related issues

Related to GlideinWMS - Idea #3384: Add plugin for APFMon (Assigned, 01/31/2013)

Related to GlideinWMS - Idea #3389: Add a Collector for glidein monitoring to the factory (New, 09/08/2014)

History

#1 Updated by Igor Sfiligoi about 8 years ago

  • Assignee set to Igor Sfiligoi

Handled by Igor's student: Benjamin

#2 Updated by Burt Holzman almost 8 years ago

  • Target version set to v3_1

#3 Updated by Igor Sfiligoi about 7 years ago

Benjamin is gone... will try to finish it myself.

#4 Updated by Igor Sfiligoi about 7 years ago

  • Status changed from New to Feedback

I have a working version in
branch_v2plus_igor_2454_v2
(branched from a recent v2plus).
Last commit: #a58519b5b0cbd5df379c9958481d0b2ef90330e9

May want to polish it a little more, but this one does work (i.e. I tested it).

I have pushed the condor binaries and the needed support scripts to be downloaded very early,
so I can now advertise an error classad for most validation errors.

The error classad looks like this (cut out non-essential attributes from the example):

MyType = "Machine" 
TargetType = "Job" 
Name = "glidein_4449@cabinet-3-3-2.t2.ucsd.edu" 
Machine = "cabinet-3-3-2.t2.ucsd.edu" 
State = "Drained" 
Activity = "Retiring" 
START = false
MyAddress = "<127.0.0.1:1>" 
GLIDEIN_ADVERTISE_ONLY = 1
GLIDEIN_Failed = true
GLIDEIN_FAILURE_REASON = "Glidein failed while running main/glexec_setup.sh. Keeping node busy until 1363998709 (Fri Mar 22 17:31:49 PDT 2013)."
GLIDEIN_EXIT_CODE = 1
GLIDEIN_LAST_SCRIPT = "main/glexec_setup.sh" 
GLIDEIN_ToDie = 1363998709
GLIDEIN_Entry_Name = "CMS_T2_US_UCSD_gw4" 
GLIDECLIENT_Name = "UCSDOSG-itb-v2_0.cms" 
GLIDEIN_CMSSite = "T2_US_UCSD" 
... all the other glidein attributes ...
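
To illustrate how a monitoring script might consume such an ad, here is a minimal Python sketch that parses the flat "Attribute = value" text shown above and extracts the failure fields. The function names parse_flat_classad and summarize_failure are hypothetical (they are not part of GlideinWMS or HTCondor), and the parser only handles quoted strings, integers and booleans, not full ClassAd expressions.

```python
def parse_flat_classad(text):
    """Parse 'Name = value' lines into a dict (simple literals only)."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip blank lines and free-form text
        name, _, raw = line.partition("=")
        name, raw = name.strip(), raw.strip()
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            ad[name] = raw[1:-1]              # quoted string
        elif raw.lower() in ("true", "false"):
            ad[name] = raw.lower() == "true"  # boolean literal
        else:
            try:
                ad[name] = int(raw)           # integer literal
            except ValueError:
                ad[name] = raw                # keep anything else as text
    return ad


def summarize_failure(ad):
    """Return the key failure fields, or None if the ad is not an error ad."""
    if ad.get("GLIDEIN_Failed") is not True:
        return None
    return {
        "entry": ad.get("GLIDEIN_Entry_Name"),
        "last_script": ad.get("GLIDEIN_LAST_SCRIPT"),
        "exit_code": ad.get("GLIDEIN_EXIT_CODE"),
        "reason": ad.get("GLIDEIN_FAILURE_REASON"),
    }


# A subset of the attributes from the example ad above.
SAMPLE = '''
Name = "glidein_4449@cabinet-3-3-2.t2.ucsd.edu"
START = false
GLIDEIN_Failed = true
GLIDEIN_EXIT_CODE = 1
GLIDEIN_LAST_SCRIPT = "main/glexec_setup.sh"
GLIDEIN_ToDie = 1363998709
GLIDEIN_Entry_Name = "CMS_T2_US_UCSD_gw4"
'''

if __name__ == "__main__":
    print(summarize_failure(parse_flat_classad(SAMPLE)))
```

In a real deployment the ads would come from the VO collector rather than from pasted text; the sketch only shows which attributes carry the failure information.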

If anyone else can check it out, I would be interested in feedback.

PS: I would like to get it into a v2+ release, if at all possible.

#5 Updated by Igor Sfiligoi about 7 years ago

There are two things that may need to be worked on.

1) Should we have a knob to allow the FEs to opt-out?
I think we should... just did not have time this week to implement it.

2) Should the error classad be de-advertised the moment the glidein leaves the WN?
Should we allow both modes, and let the FE decide? What should be the default?
(currently it does not de-advertise)

I am unsure about this one.

On one hand, the Machine classad semantics is to represent a glidein. So, when the glidein terminates, it should pull the ad. Period.

However, the failed glidein lives for a very short period of time, so it may be hard for a FE to notice that it is happening. Leaving it there longer would raise the chance for it to be noticed. But then we need to have a good way to distinguish in what "status" it is.

Any suggestions?

#6 Updated by Parag Mhashilkar about 7 years ago

Igor Sfiligoi wrote:

There are two things that may need to be worked on.

1) Should we have a knob to allow the FEs to opt-out?
I think we should... just did not have time this week to implement it.

Yes. I would prefer the default to be opted out, to keep the current behavior.

2) Should the error classad be de-advertised the moment the glidein leaves the WN?
Should we allow both modes, and let the FE decide? What should be the default?
(currently it does not de-advertise)

I am unsure about this one.

On one hand, the Machine classad semantics is to represent a glidein. So, when the glidein terminates, it should pull the ad. Period.

However, the failed glidein lives for a very short period of time, so it may be hard for a FE to notice that it is happening. Leaving it there longer would raise the chance for it to be noticed. But then we need to have a good way to distinguish in what "status" it is.

Any suggestions?

Though I agree with cleaning up right away in general, this seems to be a case where keeping the classad longer is more useful.
You already know until what time the node is held. We can use this to add an easily accessible attribute:

GLIDEIN_WN_RELINQUISHED = (CurrentTime > GLIDEIN_ToDie)

Or embed this info in the State or GLIDEIN_State using classad expressions.
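
The proposed expression simply compares the current time with GLIDEIN_ToDie. As a rough consumer-side equivalent, a script reading the ad could apply the same test itself. The sketch below assumes the ad is available as a plain dict; the helper name wn_relinquished is hypothetical, and GLIDEIN_WN_RELINQUISHED is only the attribute name suggested in this comment, not necessarily one that was implemented.

```python
import time


def wn_relinquished(ad, now=None):
    """Return True once the current time has passed GLIDEIN_ToDie.

    Mirrors the suggested expression (CurrentTime > GLIDEIN_ToDie).
    `now` is a Unix timestamp; it defaults to the current time.
    """
    if now is None:
        now = int(time.time())
    return now > int(ad["GLIDEIN_ToDie"])


if __name__ == "__main__":
    ad = {"GLIDEIN_ToDie": 1363998709}  # value from the example ad
    print(wn_relinquished(ad, now=1363998000))  # node still held
    print(wn_relinquished(ad, now=1364000000))  # node already released
```

Evaluating the comparison on the reader's side avoids depending on when the collector last refreshed the ad, at the cost of requiring reasonably synchronized clocks.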

#7 Updated by Igor Sfiligoi about 7 years ago

  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar

I have implemented the switch that allows the FE to decide if it wants to see the failure classad, and if it should survive the glidein, too.

The parameter name is GLIDEIN_Report_Failed.
I have documented it in the manual, too.
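
The switch described above effectively selects among three behaviors: do not report, report only while the glidein is alive, or report and let the ad outlive the glidein. The Python sketch below models that decision. The mode labels are illustrative placeholders only, not the actual values accepted by GLIDEIN_Report_Failed (see the manual for those), and both helper functions are hypothetical.

```python
from enum import Enum


class ReportMode(Enum):
    # Placeholder labels, NOT the real GLIDEIN_Report_Failed values.
    OFF = "off"                  # FE opted out: never advertise the error ad
    WHILE_ALIVE = "while_alive"  # advertise, then pull the ad when the glidein exits
    SURVIVE = "survive"          # advertise, and leave the ad after the glidein exits


def should_advertise(mode):
    """An error ad is sent unless the FE opted out."""
    return mode is not ReportMode.OFF


def should_deadvertise_on_exit(mode):
    """The ad is pulled at exit only in the 'while alive' mode."""
    return mode is ReportMode.WHILE_ALIVE


if __name__ == "__main__":
    for mode in ReportMode:
        print(mode.name, should_advertise(mode), should_deadvertise_on_exit(mode))
```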

Can you please check if it is good enough to be merged back?
(it has been tested)

Last commit in branch_v2plus_igor_2454_v2:
commit:eb569a6d399a21c90443cfcd6ce927883cc51e7a

#8 Updated by Burt Holzman about 7 years ago

This ought to go in v3, not v2. If you fork off master, we'll review and merge that, but this won't go into v2.

#9 Updated by Igor Sfiligoi about 7 years ago

Created branch
branch_v3plus_igor_2454_v2
out of master, and merged in all the code.

Have not tested it though.

commit:00854f416c7ce740d9437fe39aa65eaa672374c6

#10 Updated by Parag Mhashilkar almost 7 years ago

Looks ok to me. I just have one comment. In advertise_failure.helper, you create the classad and then append to it. I wonder if it makes sense to move the static chunk of the classad into a template, particularly since we are templatizing a bunch of things in v3. What do you think?

#11 Updated by Parag Mhashilkar almost 7 years ago

  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

#12 Updated by Igor Sfiligoi almost 7 years ago

Using a template for the static part may be an idea, long term.

However, I would prefer to keep it "fully dynamic" for now... while we gain some experience.
A lot of the values that are currently hardcoded to fake values are ones we may eventually want to define properly...
that was just beyond the scope of the base implementation.

I am creating a subticket, so we don't forget.

#13 Updated by Igor Sfiligoi almost 7 years ago

  • Status changed from Feedback to Resolved

Merged into master.
(commit:ba35b33c75f1bcebfeeb6bcd94948c1d41e3a0fc)

#14 Updated by Igor Sfiligoi almost 7 years ago

Parag pointed out I had no documentation of the error classad, so I summarily documented it.

Committed to both branch_v3plus_igor_2454_v2 and master.

#15 Updated by Parag Mhashilkar almost 7 years ago

  • Status changed from Resolved to Closed

