Bug #6732

Glideins report 100% success rate even if startd fails to start/run a matched job.

Added by Parag Mhashilkar over 5 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Glidein
Target version:
Start date: 08/01/2014
Due date:
% Done: 100%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: CMS, OSG
Duration:

Description

See attached files for glidein and condor logs.

From: Igor Sfiligoi
Subject: Example log files where glideins are blind to errors
Date: August 1, 2014 11:34:53 AM CDT
To: glideinwms
Cc: jeffrey dost

Hi all.

Attached is the output of one glidein that failed to successfully start a single user job,
yet the "parsed summary" reports 100% success rate!

Can someone fix it?
At least for this use-case?

Thanks,
Igor

job.2284914.1.err.bz2 (161 KB) - Parag Mhashilkar, 08/01/2014 11:51 AM
job.2284914.1.out.bz2 (4.59 KB) - Parag Mhashilkar, 08/01/2014 11:51 AM

History

#1 Updated by Parag Mhashilkar about 5 years ago

  • Target version changed from v3_2_x to v3_2_9

#2 Updated by Parag Mhashilkar almost 5 years ago

  • Stakeholders updated (diff)

#3 Updated by Parag Mhashilkar almost 5 years ago

  • Target version changed from v3_2_9 to v3_2_x

#4 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Marco Mambelli to HyunWoo Kim
  • Target version changed from v3_2_x to v3_2_12

#5 Updated by HyunWoo Kim over 4 years ago

  • % Done changed from 0 to 70

I ran the following experiment to check whether our current code (mainly parsing_starterlog.awk)
can handle a situation similar to the one described in this ticket.

Each of the following 6 jdf files passes a different integer argument, which the job returns as its exit status:

condor_submit test-0.jdf
condor_submit test-1.jdf
condor_submit test-2.jdf
condor_submit test-3.jdf
condor_submit test-4.jdf
condor_submit test-5.jdf

[hyunwoo@fermicloud159 ~]$ cat test-0.jdf
executable = testscript.py
arguments = 0
universe = vanilla
output = test0.out
error = test0.error
log = test0.log
queue
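
The other five jdf files presumably differ only in the integer argument and the matching output/error/log file names. For illustration, a small hypothetical helper (not part of the original test) could generate all six:

#!/usr/bin/python
# Hypothetical helper: writes test-0.jdf ... test-5.jdf, each passing a
# different integer argument that testscript.py returns as its exit status.

JDF_TEMPLATE = """executable = testscript.py
arguments = %(code)d
universe = vanilla
output = test%(code)d.out
error = test%(code)d.error
log = test%(code)d.log
queue
"""

for code in range(6):
    with open("test-%d.jdf" % code, "w") as jdf:
        jdf.write(JDF_TEMPLATE % {"code": code})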

[hyunwoo@fermicloud159 ~]$ cat testscript.py
#!/usr/bin/python
import sys
import time

def main(argv=None):
    time.sleep(30)
    return int(argv)

if __name__ == "__main__":
    print sys.argv
    sys.exit(main(sys.argv[1]))

Then I configured my Frontend to allow only one glidein.

In fermicloud159:/etc/gwms-frontend/frontend.xml
<config>
<running_glideins_total curb="90000" max="1"/>
so that there is only one glidein_startup.sh

[root@fermicloud158 ~]# condor_q g
-
Schedd: : <131.225.155.181:9615?sock=18522_d8bd_9>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
5912.0 frontend 10/9 11:27 0+00:13:36 R 0 0.1 glidein_startup.sh

So all 6 jobs went to this one glidein,
which can be seen on fermicloud025, which is the CE to which my Factory deploys glideins:

[root@fermicloud025 tmp]# grep "Process exited" glide_08OiAZ/log/StarterLog
10/09/15 11:32:09 (pid:22xxx) Process exited, pid=22xxx, status=0
10/09/15 11:32:09 (pid:22xxy) Process exited, pid=22xxy, status=1
10/09/15 11:32:09 (pid:22092) Process exited, pid=22096, status=2
10/09/15 11:32:42 (pid:22117) Process exited, pid=22121, status=3
10/09/15 11:33:59 (pid:22176) Process exited, pid=22283, status=4
10/09/15 11:34:31 (pid:22297) Process exited, pid=22301, status=5

Finally, I looked at job.5912.0.out as follows:
[root@fermicloud158 entry_ITB_FC_CE2]# less /var/log/gwms-factory/client/user_frontend/glidein_gfactory_instance/entry_ITB_FC_CE2/job.5912.0.out
This shows the correct percentages now:

=== Stats of main ===
10/09/15 11:28:27 Starting job 131.0 from 131.225.155.182:9615?sock=4806_02fd_10
10/09/15 11:28:57 Terminated job 131.0 from 131.225.155.182:9615?sock=4806_02fd_10 status 0 duration 30
10/09/15 11:31:08 Starting job 132.0 from 131.225.155.182:9615?sock=4806_02fd_11
10/09/15 11:31:38 Terminated job 132.0 from 131.225.155.182:9615?sock=4806_02fd_11 status 1 duration 30
10/09/15 11:31:39 Starting job 133.0 from 131.225.155.182:9615?sock=4806_02fd_11
10/09/15 11:32:09 Terminated job 133.0 from 131.225.155.182:9615?sock=4806_02fd_11 status 2 duration 30
10/09/15 11:32:11 Starting job 134.0 from 131.225.155.182:9615?sock=4806_02fd_11
10/09/15 11:32:42 Terminated job 134.0 from 131.225.155.182:9615?sock=4806_02fd_11 status 3 duration 31
10/09/15 11:33:29 Starting job 135.0 from 131.225.155.182:9615?sock=4806_02fd_12
10/09/15 11:33:59 Terminated job 135.0 from 131.225.155.182:9615?sock=4806_02fd_12 status 4 duration 30
10/09/15 11:34:01 Starting job 136.0 from 131.225.155.182:9615?sock=4806_02fd_12
10/09/15 11:34:31 Terminated job 136.0 from 131.225.155.182:9615?sock=4806_02fd_12 status 5 duration 30
=================
Total jobs 6 utilization 181
Total goodZ jobs 1 (16.6667%) utilization 30 (16.5746%)
Total goodNZ jobs 5 (83.3333%) utilization 151 (83.4254%)
Total badSignal jobs 0 (0%) utilization 0 (0%)
Total badOther jobs 0 (0%) utilization 0 (0%)
=== End Stats of main ===

It turns out our code is OK.

So our guess about the situation in which the user observed his job.xxxx.0.out
is as follows:
each of the jobs that went to the same glidein must have been a wrapper around a real job.
It looks like the wrapper script returned 0 regardless of the return code of the real job it wrapped,
and judging from the fact that all the jobs lasted only about 30 seconds, the wrapped real jobs did not run successfully (see the sketch below).
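
To illustrate that hypothesis, here is a hypothetical wrapper sketch (the wrapper and real_job.sh are made up for illustration, not taken from the user's setup): unless the wrapper explicitly propagates its child's exit status, it exits 0 and the StarterLog records status 0 even when the real job failed.

#!/usr/bin/python
# Hypothetical wrapper sketch; "real_job.sh" is an illustrative name.
import subprocess
import sys

# Run the real payload and capture its exit status.
rc = subprocess.call(["./real_job.sh"] + sys.argv[1:])

# The suspected buggy wrapper effectively stops here (implicitly exiting 0),
# so the starter logs "status 0" regardless of rc.
# A correct wrapper propagates the payload's exit status instead:
sys.exit(rc)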

I also compared the format of the Condor messages in StarterLog with the format expected
by the current parsing_starterlog.awk, which is the script that actually parses StarterLog.
They are the same, so we can rule out the possibility that parsing_starterlog.awk needs changes
because of a change in the Condor message format.
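
For reference, the arithmetic behind the parsed summary can be illustrated with a small Python sketch. This is not the actual parsing_starterlog.awk, only an approximation of its goodZ/goodNZ bookkeeping: jobs terminating with exit status 0 count as goodZ, non-zero statuses as goodNZ, and utilization is the sum of the reported durations.

#!/usr/bin/python
# Illustrative approximation only, NOT the real parsing_starterlog.awk.
import re
import sys

TERMINATED = re.compile(r"Terminated job \S+ from \S+ status (\d+) duration (\d+)")

def summarize(lines):
    jobs = [(int(m.group(1)), int(m.group(2)))
            for m in (TERMINATED.search(line) for line in lines) if m]
    if not jobs:
        return
    total_util = sum(d for _, d in jobs)
    print("Total jobs %d utilization %d" % (len(jobs), total_util))
    for label, keep in (("goodZ", lambda s: s == 0), ("goodNZ", lambda s: s != 0)):
        subset = [(s, d) for s, d in jobs if keep(s)]
        util = sum(d for _, d in subset)
        print("Total %s jobs %d (%.4f%%) utilization %d (%.4f%%)"
              % (label, len(subset), 100.0 * len(subset) / len(jobs),
                 util, 100.0 * util / max(total_util, 1)))

if __name__ == "__main__":
    summarize(sys.stdin.readlines())

Feeding it the "=== Stats of main ===" section above reproduces the 16.6667% / 83.3333% split for 1 goodZ and 5 goodNZ jobs.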

#6 Updated by HyunWoo Kim over 4 years ago

I need to point out one more thing:

job.2284914.1.out shows

=== Stats of main ===
08/01/14 08:34:47 Starting job 543367.0 from 169.228.38.40:9615?sock=85533_8f88_102147
08/01/14 08:35:18 Terminated job 543367.0 from 169.228.38.40:9615?sock=85533_8f88_102147 status 0 duration 31
08/01/14 09:08:55 Starting job 544385.0 from 169.228.38.40:9615?sock=85533_8f88_103995
08/01/14 09:09:43 Terminated job 544385.0 from 169.228.38.40:9615?sock=85533_8f88_103995 status 0 duration 48

These messages are generated by parsing_starterlog.awk, which parses the log/StarterLog file.
Each line that says "Terminated job 543367.0 from 169.228.38.40:9615?sock=85533_8f88_102147 status 0 duration ..."
indicates that the startd actually started a starter, and the starter launched a child process that actually ran the user job.
So the title of this ticket, "even if startd fails to start/run a matched job", is not accurate.

#7 Updated by HyunWoo Kim over 4 years ago

  • Status changed from New to Resolved
  • % Done changed from 70 to 100

Parag was wondering whether the current code handles the case of multicore jobs,
and condor_startup.sh shows that it does:
slotlogs="`ls -1 ${main_starter_log} ${main_starter_log}.slot* 2>/dev/null`"
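
For illustration only (the path is hypothetical), the same collection step expressed in Python shows the intent: gather the main StarterLog plus any per-slot StarterLog.slot* files, so that jobs from every slot of a multicore glidein end up in the summary.

#!/usr/bin/python
# Illustrative sketch of the slotlogs glob above; the path is hypothetical.
import glob

main_starter_log = "log/StarterLog"
slotlogs = [main_starter_log] + sorted(glob.glob(main_starter_log + ".slot*"))
print(slotlogs)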

So we will close this ticket.
In conclusion, we made sure that the current code (glidein_startup.sh and its child scripts)
handles error reporting properly.

#8 Updated by Parag Mhashilkar about 4 years ago

  • Status changed from Resolved to Closed

