Bug #3510

Glideins don't recognize misconfigured Condor

Added by Igor Sfiligoi over 6 years ago. Updated over 3 years ago.

Status: Assigned
Priority: Normal
Assignee: Parag Mhashilkar
Category: Factory
Target version:
Start date: 02/21/2013
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

I have just found an example where the Condor config was bad, resulting in Condor dying almost immediately,
but the glidein scripts reported the glidein as successful.

This makes monitoring for such problems almost impossible.

job.349645.1.err (44.6 KB) Igor Sfiligoi, 02/21/2013 04:38 PM
job.349645.1.out (12.7 KB) Igor Sfiligoi, 02/21/2013 04:38 PM

History

#1 Updated by Igor Sfiligoi over 6 years ago

Attached are the log files, where the problem is clearly seen.

The factory was running v2_6_2.

#2 Updated by Burt Holzman over 6 years ago

  • Target version set to v2_7_x

#3 Updated by Burt Holzman over 6 years ago

In this case, HTCondor was misconfigured in such a way that the startd crashed/exited. Before the master could restart it (by default it waits 10 seconds), the MASTER.DAEMON_SHUTDOWN kicks in -- and the exit code sadly isn't configurable (modulo some undocumented knob).

I think the best we could do is to check how long condor_startup.sh took to execute, and count it as "failed" if it's under a given threshold (60s?)
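
The heuristic proposed above can be sketched roughly as follows. This is a minimal Python illustration, not the actual condor_startup.sh change: the function and constant names are hypothetical, and only the 60-second threshold comes from the comment above.

```python
# Hypothetical sketch of the "died too quickly" heuristic; all names are illustrative.
MIN_CONDOR_RUNTIME_S = 60  # threshold suggested in the comment above


def effective_exit_code(condor_exit_code: int, elapsed_s: float,
                        min_runtime_s: float = MIN_CONDOR_RUNTIME_S) -> int:
    """Return the exit code the glidein should report.

    A zero exit code is not trusted when Condor ran for less than
    min_runtime_s: it is rewritten to 1 so the glidein counts as failed.
    Non-zero exit codes are passed through unchanged.
    """
    if condor_exit_code == 0 and elapsed_s < min_runtime_s:
        return 1
    return condor_exit_code


print(effective_exit_code(0, 9))     # short run, reported as a failure
print(effective_exit_code(0, 3600))  # long run, exit code kept
```

The trade-off is that a glidein which legitimately finishes in under the threshold would also be flagged, which is why the threshold value matters.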

#4 Updated by Burt Holzman over 6 years ago

I added a check for whether Condor's elapsed runtime is less than 60 seconds, but I haven't tested it yet.

#5 Updated by Burt Holzman over 6 years ago

  • Status changed from New to Assigned

#6 Updated by John Weigand over 6 years ago

Burt,

I need a little more information on this.
I assume we are talking about the condor_config of the Condor
coming down with the pilot.
If so, I do you screw that up.
Need a clue as to how to simulate the problem.

John Weigand

#7 Updated by John Weigand over 6 years ago

That was supposed to say "how do you screw that up"

John Weigand

#8 Updated by Igor Sfiligoi over 6 years ago

Look at the attached logs:
cat_StartdLog.py job.349645.1.err |tail
02/21/13 14:24:39 (pid:22401) ERROR "Syntax error in START expression: '((True) && (True) && (ifthenelse(JOB_Max_Mins=!=UNDEFINED,(GLIDEIN_ToDie-MyCurrentTime)>(JOB_Max_Mins*60),ifthenelse(JOB_Is_Short3h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(3*3600),ifthenelse(JOB_Is_Long22h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(22*3600),ifthenelse(JOB_Is_Long33h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(33*3600),ifthenelse(JOB_Is_Long44h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(44*3600),(GLIDEIN_ToDie-MyCurrentTime)>(12*3600))))))) && ((DESIRED_Sites=?="any")||(DESIRED_Sites=?=UNDEFINED)||stringListMember(GLIDEIN_Site,DESIRED_Sites))&&(Job_Is_BigMem8G=!=True))) && (((GLIDEIN_ToRetire =?= UNDEFINED) || (CurrentTime < GLIDEIN_ToRetire)))'" at line 489 in file /slots/01/dir_15970/userdir/src/condor_startd.V6/util.cpp

The config had a syntax error in the START expression.
Should be trivial to reproduce.

Igor

#9 Updated by John Weigand over 6 years ago

Igor,

Is this only visible using this cat_StartdLog.py script?
I don't see it in the error file you attached, at least when viewing
it in Firefox.

John Weigand

#10 Updated by Igor Sfiligoi over 6 years ago

Sort of.
The Condor log files are compressed and encoded in the logs returned by the glideins.

But unless you are a Matrix-style person, you have to use the cat_* tools to make sense of them.
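
As a rough illustration of what such a tool has to do, the sketch below round-trips a log through gzip plus base64 and pulls it back out of a larger block of text. The marker lines, the choice of base64, and the function names are all assumptions made for this example; the real glidein output layout may differ, so use the cat_* tools on real logs.

```python
import base64
import gzip

# Hypothetical marker lines; the real glidein stderr layout may differ.
BEGIN = "=== BEGIN StartdLog (gzip+base64) ==="
END = "=== END StartdLog ==="


def embed_log(log_text: str) -> str:
    """Compress and encode a log so it can be embedded in plain-text output."""
    payload = base64.b64encode(gzip.compress(log_text.encode("utf-8")))
    return "\n".join([BEGIN, payload.decode("ascii"), END])


def extract_log(stderr_text: str) -> str:
    """Find the embedded block in stderr_text and return the decoded log."""
    lines = stderr_text.splitlines()
    start = lines.index(BEGIN) + 1
    stop = lines.index(END, start)
    payload = "".join(lines[start:stop])
    return gzip.decompress(base64.b64decode(payload)).decode("utf-8")


original = "synthetic StartdLog line used only for this example"
stderr_text = "some glidein output\n" + embed_log(original) + "\nmore output\n"
print(extract_log(stderr_text))
```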

#11 Updated by John Weigand over 6 years ago

Igor,

Since I am not familiar with how you create these updates to the glidein condor config, can you tell me what you are using to do this?

Is this a helper script? If so, can you attach it and the xml line in the config file referencing it?

If something else, what?

Thanks
John Weigand

#12 Updated by Igor Sfiligoi over 6 years ago

I think it was just a typo in the START attribute of either the factory or the frontend XML file.
So a tweak to the XML and a reconfig should trigger it.

Or are you trying to reproduce it fully by hand?

#13 Updated by John Weigand over 6 years ago

Igor,

I assume you are talking about this xml stanza in a factory config file, which I have tried:

<attr name="GLIDEIN_Start" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="boolean" value="((True) && (True) && (ifthenelse(JOB_Max_Mins=!=UNDEFINED,0,-100)))"/>

I've never done one of these and I keep getting a syntax error on the reconfig pointing to the first '&&'.

If it is an attr element, show me yours that you corrected so I can figure out what I am doing wrong.

John Weigand

#14 Updated by Igor Sfiligoi over 6 years ago

Well, yes...
&& is invalid in XML ;)

Use &amp; any time you need & in the config.
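
For reference, a literal & inside an XML attribute value has to be written as the entity &amp;. The short Python sketch below shows the escaping with the standard library and checks that a parser returns the original expression. The expression is a made-up example, not a real GLIDEIN_Start value.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

expr = "((True) && (True))"  # example expression, not a real GLIDEIN_Start value

# quoteattr() escapes & (and <, >) and adds the surrounding quotes.
attr_xml = f'<attr name="GLIDEIN_Start" value={quoteattr(expr)}/>'
print(attr_xml)

# The escaped form parses, and the parser hands back the original text.
parsed = ET.fromstring(attr_xml)
print(parsed.get("value") == expr)

# The raw, unescaped form is rejected by the parser.
try:
    ET.fromstring(f'<attr name="GLIDEIN_Start" value="{expr}"/>')
    raw_ok = True
except ET.ParseError:
    raw_ok = False
print(raw_ok)
```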

#15 Updated by John Weigand over 6 years ago

For my future reference, I simulated the error condition using:

attr name="GLIDEIN_Start" const="True" glidein_publish="True" 
      job_publish="True" parameter="True" publish="True" type="string" 
      value="((JOB_Max_Mins=!=UNDEFINED,0,-100)))" 

John Weigand

#16 Updated by John Weigand over 6 years ago

Testing was successful. As currently coded, when Condor fails within 60 seconds,
condor_startup.sh terminates with an exit code of 1, resulting in the following outputs.

1. stdout shows

=== Last script ended Wed Jul 10 07:21:40 CDT 2013 (1373458900) with code 1 after 9 ===

=== Glidein ending Wed Jul 10 07:41:39 CDT 2013 (1373460099) with code 1 after 1212 ===

2. xml

  :
  <result>
    <status>ERROR</status>
  :
    <metric name="failure" ts="2013-07-10T07:21:40-05:00" uri="local">Unknown</metric>
    <metric name="CondorOneMinuteShutdown" ts="2013-07-10T07:21:40-05:00" uri="local">True</metric>
  :
  </result>
  <detail>
     Validation failed in condor_startup.sh.
    See Condor logs for details
  </detail>
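
For monitoring purposes, a report like the one above could be checked along these lines. This is only a sketch: the fragment is trimmed from the output shown, the report wrapper element is hypothetical (added so the fragment has a single root), and the function name and returned keys are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment modelled on the report above. The <report> wrapper is
# hypothetical and only exists to give the fragment a single root element.
SAMPLE = """
<report>
  <result>
    <status>ERROR</status>
    <metric name="failure" ts="2013-07-10T07:21:40-05:00" uri="local">Unknown</metric>
    <metric name="CondorOneMinuteShutdown" ts="2013-07-10T07:21:40-05:00" uri="local">True</metric>
  </result>
  <detail>Validation failed in condor_startup.sh. See Condor logs for details</detail>
</report>
"""


def summarize(xml_text: str) -> dict:
    """Extract the status and the named metrics from a report fragment."""
    root = ET.fromstring(xml_text.strip())
    metrics = {m.get("name"): (m.text or "").strip() for m in root.iter("metric")}
    return {
        "status": root.findtext("result/status", default="").strip(),
        "quick_shutdown": metrics.get("CondorOneMinuteShutdown") == "True",
        "metrics": metrics,
    }


summary = summarize(SAMPLE)
print(summary["status"], summary["quick_shutdown"])
```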

Note that with an exit code of 1 set, the glidein_startup.sh job still sits for 20 minutes before
terminating, and these processes are still on the WN. The stdout/err are not available until the glidein terminates
20 minutes later.

wms21d0  22533 22527  0 07:21 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_22527/condor_exec.exe -v std ....
wms21d0  26021 22533  0 07:21 ? 00:00:00 sleep 1199
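
The two stdout marker lines quoted earlier in this comment can be cross-checked against the sleep process listed above. The sketch below is an illustration only: the regular expression is inferred from those two lines, not from the glidein scripts, and the function name is hypothetical.

```python
import re

# Pattern inferred from the two stdout lines quoted in this comment.
LINE_RE = re.compile(
    r"^=== (?P<what>Last script ended|Glidein ending) .*"
    r"\((?P<epoch>\d+)\) with code (?P<code>\d+) after (?P<secs>\d+) ===$"
)

STDOUT = """\
=== Last script ended Wed Jul 10 07:21:40 CDT 2013 (1373458900) with code 1 after 9 ===
=== Glidein ending Wed Jul 10 07:41:39 CDT 2013 (1373460099) with code 1 after 1212 ===
"""


def parse_markers(text: str) -> dict:
    """Return epoch, exit code and duration for each recognized marker line."""
    found = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            found[m["what"]] = {
                "epoch": int(m["epoch"]),
                "code": int(m["code"]),
                "secs": int(m["secs"]),
            }
    return found


markers = parse_markers(STDOUT)
gap = markers["Glidein ending"]["epoch"] - markers["Last script ended"]["epoch"]
print(gap)  # 1199, the same value as the "sleep 1199" process listed above
```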

If this is acceptable, then we are good to go.

John Weigand

#17 Updated by Parag Mhashilkar over 6 years ago

A 20-minute dead time after any failure in the glidein's bootstrap is required to avoid the black hole effect, so this is the intended behavior.

#18 Updated by Igor Sfiligoi over 6 years ago

Hi John.

Where does the "if less than 60s => fail" logic come from?
Did you just implement it, or was it added independently?
(as you can see from the original logs, it was not there initially)

#19 Updated by John Weigand over 6 years ago

Igor,

That is Burt's logic. I just tested to verify that it does not
break anything and documented the resulting outputs. There
was nothing in the initial / subsequent comments on the ticket
indicating the desired behavior/results.

The intent of that line is to assume condor broke since it died
quickly. The exit code from condor is 0 even on failure to start
under the conditions you described. By resetting the condor_ret
to 1, it prevents the black hole that Parag commented on.

John Weigand

#20 Updated by Igor Sfiligoi over 6 years ago

Right... it was actually in the ticket above... should have read better :(

That said, can we add a couple of things?
  1. Make the timeout configurable... 60s should be OK most of the time, but it would be good to have a knob to alter this for edge cases.
  2. Please improve the error message. It should state that the failure is due to Condor terminating too fast.
    Adding the relevant metrics (actual time and min time) would also be highly desirable.
    (of course, getting even more details would be even better, but that's probably for a separate ticket)
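
One possible shape for these two requests, sketched in Python for illustration. Everything named here is hypothetical: the GLIDEIN_MIN_CONDOR_RUNTIME knob, the metric names, the message wording, and the function names are assumptions, not existing glideinWMS settings. Only the 60-second default comes from the thread.

```python
import os

DEFAULT_MIN_RUNTIME_S = 60  # the value discussed in this ticket


def min_runtime_from_env(env=os.environ) -> int:
    """Read the threshold from a hypothetical GLIDEIN_MIN_CONDOR_RUNTIME knob.

    Falls back to the default when the knob is unset, not an integer,
    or negative.
    """
    raw = env.get("GLIDEIN_MIN_CONDOR_RUNTIME", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MIN_RUNTIME_S
    return value if value >= 0 else DEFAULT_MIN_RUNTIME_S


def quick_exit_report(elapsed_s: int, min_runtime_s: int):
    """Return (message, metrics) when Condor ended too quickly, else None."""
    if elapsed_s >= min_runtime_s:
        return None
    message = (f"Condor terminated too fast: ran {elapsed_s}s, "
               f"minimum expected runtime is {min_runtime_s}s. "
               "See Condor logs for details")
    # Hypothetical metric names carrying the actual and minimum times.
    metrics = {"CondorDuration": elapsed_s, "CondorMinDuration": min_runtime_s}
    return message, metrics


print(quick_exit_report(9, min_runtime_from_env({})))
```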

#21 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Burt Holzman to Parag Mhashilkar
  • Target version changed from v2_7_x to v3_x

