Glideins don't recognize misconfigured Condor
I have just found an example where the Condor config was bad, resulting in Condor dying almost immediately,
but the glidein scripts reported the glidein as successful.
This makes monitoring for such problems almost impossible.
#3 Updated by Burt Holzman about 8 years ago
In this case, HTCondor was misconfigured in such a way that the startd crashed/exited. Before the master could restart it (by default it waits 10 seconds), the MASTER.DAEMON_SHUTDOWN kicks in -- and the exit code sadly isn't configurable (modulo some undocumented knob).
I think the best we could do is to check how long condor_startup.sh took to execute, and count it as "failed" if it's under a given threshold (60s?)
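A minimal sketch of such a check, assuming bash. The function name `run_and_check_duration` and the variable `min_runtime` are hypothetical and are not taken from condor_startup.sh:

```shell
#!/bin/bash
# Hypothetical sketch: treat a command that exits too quickly as failed,
# even if its own exit code is 0.

# run_and_check_duration MIN_SECONDS COMMAND [ARGS...]
# Runs COMMAND, then returns:
#   the command's exit code if it is non-zero,
#   1 if the command exited with 0 in under MIN_SECONDS,
#   0 otherwise.
run_and_check_duration() {
    local min_runtime=$1
    shift
    local start_time end_time elapsed ret

    start_time=$(date +%s)
    "$@"
    ret=$?
    end_time=$(date +%s)
    elapsed=$((end_time - start_time))

    if [ "$ret" -eq 0 ] && [ "$elapsed" -lt "$min_runtime" ]; then
        echo "command ran only ${elapsed}s (< ${min_runtime}s); counting it as failed" >&2
        ret=1
    fi
    return "$ret"
}
```

For example, `run_and_check_duration 60 true` returns 1 because `true` exits successfully but almost instantly, which is the same pattern as a startd that dies on a bad config.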
#8 Updated by Igor Sfiligoi almost 8 years ago
Look at the attached logs:
cat_StartdLog.py job.349645.1.err |tail
02/21/13 14:24:39 (pid:22401) ERROR "Syntax error in START expression: '((True) && (True) && (ifthenelse(JOB_Max_Mins=!=UNDEFINED,(GLIDEIN_ToDie-MyCurrentTime)>(JOB_Max_Mins*60),ifthenelse(JOB_Is_Short3h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(3*3600),ifthenelse(JOB_Is_Long22h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(22*3600),ifthenelse(JOB_Is_Long33h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(33*3600),ifthenelse(JOB_Is_Long44h=?=True,(GLIDEIN_ToDie-MyCurrentTime)>(44*3600),(GLIDEIN_ToDie-MyCurrentTime)>(12*3600))))))) && ((DESIRED_Sites=?="any")||(DESIRED_Sites=?=UNDEFINED)||stringListMember(GLIDEIN_Site,DESIRED_Sites))&&(Job_Is_BigMem8G=!=True))) && (((GLIDEIN_ToRetire =?= UNDEFINED) || (CurrentTime < GLIDEIN_ToRetire)))'" at line 489 in file /slots/01/dir_15970/userdir/src/condor_startd.V6/util.cpp
The config had a syntax error in the START expression.
Should be trivial to reproduce.
#11 Updated by John Weigand almost 8 years ago
Since I am not familiar with how you create these updates to the glidein condor config, can you tell me what you are using to do this?
Is this a helper script? If so, can you attach it and the XML line in the config file referencing it?
If something else, what?
#13 Updated by John Weigand almost 8 years ago
I assume you are talking about this XML stanza in a factory config file, which I have tried:
<attr name="GLIDEIN_Start" const="True" glidein_publish="True" job_publish="True" parameter="True" publish="True" type="boolean" value="((True) && (True) && (ifthenelse(JOB_Max_Mins=!=UNDEFINED,0,-100)))"/>
I've never done one of these and I keep getting a syntax error on the reconfig pointing to the first '&&'.
If it is an attr element, show me yours that you corrected so I can figure out what I am doing wrong.
#16 Updated by John Weigand almost 8 years ago
Testing was successful. As currently coded, when condor fails within 60 secs, the
condor_startup.sh terminates with an exit code of 1 resulting in the following outputs.
1. stdout shows
=== Last script ended Wed Jul 10 07:21:40 CDT 2013 (1373458900) with code 1 after 9 ===
=== Glidein ending Wed Jul 10 07:41:39 CDT 2013 (1373460099) with code 1 after 1212 ===
:
<result>
  <status>ERROR</status>
  :
  <metric name="failure" ts="2013-07-10T07:21:40-05:00" uri="local">Unknown</metric>
  <metric name="CondorOneMinuteShutdown" ts="2013-07-10T07:21:40-05:00" uri="local">True</metric>
  :
</result>
<detail>
  Validation failed in condor_startup.sh. See Condor logs for details
</detail>
Note that with an exit code of 1 set, the glidein_startup.sh job still sits for 20 minutes before
terminating, and these processes are still on the WN. The stdout/err are not available until the
glidein terminates 20 minutes later.
wms21d0 22533 22527 0 07:21 ? 00:00:00 /bin/bash /var/lib/condor/execute/dir_22527/condor_exec.exe -v std ....
wms21d0 26021 22533 0 07:21 ? 00:00:00 sleep 1199
If this is acceptable, then we are good to go.
#19 Updated by John Weigand almost 8 years ago
That is Burt's logic. I just tested to verify that it does not
break anything and documented the resulting outputs. There
was nothing in the initial / subsequent comments on the ticket
indicating the desired behavior/results.
The intent of that line is to assume condor broke since it died
quickly. The exit code from condor is 0 even on failure to start
under the conditions you described. By resetting the condor_ret
to 1, it prevents the black hole that Parag commented on.
#20 Updated by Igor Sfiligoi almost 8 years ago
Right... it was actually in the ticket above... should have read better :(
That said, can we add a couple of things?
- Make the timeout configurable... 60s should be OK most of the time, but would be good to have a knob to alter this for edge cases.
- Please improve the error message. It should be stating that the failure is due to Condor terminating too fast.
Adding the relevant metrics (actual time and min time) would also be highly desirable.
(of course, getting even more details would be even better, but that's probably for a separate ticket)
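One possible shape for these two requests, as a sketch only. The knob name `MIN_CONDOR_RUNTIME`, the function names, and the message wording are all illustrative assumptions, not existing glidein attributes or code:

```shell
#!/bin/bash
# Hypothetical sketch: configurable minimum runtime plus a descriptive
# failure message. MIN_CONDOR_RUNTIME is an illustrative knob name, not an
# existing glidein attribute.

# Threshold in seconds; defaults to 60 when the knob is unset or empty.
min_runtime=${MIN_CONDOR_RUNTIME:-60}

# fast_exit_message ELAPSED_SECONDS MIN_SECONDS
# Prints an error line that states the cause and both timings.
fast_exit_message() {
    printf 'Condor terminated too fast: ran %ss, minimum expected %ss\n' "$1" "$2"
}

# check_runtime ELAPSED_SECONDS
# Returns 1 and prints the message to stderr if ELAPSED_SECONDS is below
# the threshold; returns 0 otherwise.
check_runtime() {
    if [ "$1" -lt "$min_runtime" ]; then
        fast_exit_message "$1" "$min_runtime" >&2
        return 1
    fi
    return 0
}
```

With the default threshold, `check_runtime 9` (the 9 seconds seen in the test run above) fails and reports both the actual and the minimum time, while `check_runtime 120` passes. Exporting `MIN_CONDOR_RUNTIME=120` before this runs would raise the threshold for edge cases.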