Project

General

Profile

Bug #2081

Glidein monitoring improperly accounting preempted jobs

Added by Igor Sfiligoi over 8 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Factory Monitoring
Target version:
Start date:
10/31/2011
Due date:
% Done:

0%

Estimated time:
1.00 h
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

I just noticed that the glidein monitoring flags preempted jobs (at glidein deadline) as successfully finished!

Don't know if this was happening in the past as well.
But we should fix it.

Igor

[entry_CMS_T2_US_Purdue_rossmann]$ tail -100 job.831.1.out
... === Last script starting Mon Oct 31 14:20:26 EDT 2011 (1320085226) after validating for 40 ===

Saving PATH into GLIDEIN_PATH_magic321 === Condor starting Mon Oct 31 14:20:27 EDT 2011 (1320085227) === === Condor started in background, now waiting on process 31997 === === Condor ended Mon Oct 31 15:12:27 EDT 2011 (1320088347) after 3120 ===

=== Stats of main ===
10/31/11 14:21:12 Starting job 314224.0 from 169.228.130.26:9615?sock=567_40b0_949100
10/31/11 15:09:30 Terminated job 314224.0 from 169.228.130.26:9615?sock=567_40b0_949100 status 0 duration 2898 =================
Total jobs 1 utilization 2898
Total goodZ jobs 1 (100%) utilization 2898 (100%)
Total goodNZ jobs 0 (0%) utilization 0 (0%)
Total badSignal jobs 0 (0%) utilization 0 (0%)
Total badOther jobs 0 (0%) utilization 0 (0%) === End Stats of main ===
...

[entry_CMS_T2_US_Purdue_rossmann]$ cat_StarterLog.py job.831.1.err
10/31/11 14:21:11 (pid:32126) Job 314224.0 set to execute immediately
10/31/11 14:21:11 (pid:32126) Starting a VANILLA universe job with ID: 314224.0
10/31/11 14:21:11 (pid:32126) IWD: /tmp/glide_e29692/execute/dir_32126
10/31/11 14:21:11 (pid:32126) Renice expr "0" evaluated to 0
10/31/11 14:21:11 (pid:32126) Using wrapper /tmp/glide_e29692/condor_job_wrapper.sh to exec /tmp/glide_e29692/execute/dir_32126/condor_exec.exe
10/31/11 14:21:12 (pid:32126) Create_Process succeeded, pid=32203
10/31/11 15:09:29 (pid:32126) Got SIGQUIT. Performing fast shutdown.
10/31/11 15:09:29 (pid:32126) ShutdownFast all jobs.
10/31/11 15:09:30 (pid:32126) Process exited, pid=32203, status=0
10/31/11 15:09:31 (pid:32126) Last process exited, now Starter is exiting
10/31/11 15:09:31 (pid:32126) ** condor_starter (condor_STARTER) pid 32126 EXITING WITH STATUS 0

sfiligoi@submit-2 ~/submit/itb$ grep 314224.000 a.log
000 (314224.000.000) 10/31 11:18:32 Job submitted from host: <169.228.130.26:9615?sock=356_5644_2>
001 (314224.000.000) 10/31 11:21:11 Job executing on host: <172.18.48.20:46760?CCBID=169.228.130.23:9742#53968&noUDP>
006 (314224.000.000) 10/31 11:26:20 Image size of job updated: 12560
004 (314224.000.000) 10/31 12:09:31 Job was evicted.
001 (314224.000.000) 10/31 12:13:07 Job executing on host: <172.18.48.13:39277?CCBID=169.228.130.23:9865#56420&noUDP>
004 (314224.000.000) 10/31 13:00:11 Job was evicted.
001 (314224.000.000) 10/31 13:00:56 Job executing on host: <172.18.48.13:56588?CCBID=169.228.130.23:9832#61178&noUDP>

History

#1 Updated by Burt Holzman over 8 years ago

  • Assignee set to Parag Mhashilkar
  • Target version set to v2_5_3

#2 Updated by Parag Mhashilkar over 8 years ago

  • Category set to Factory Monitoring
  • Status changed from New to Assigned

Isn't this duplicate of Bug #1492 which I fixed recently. If we correctly treat 004 equivalent to 007 (for now) I think we should get the right behavior.

#3 Updated by Igor Sfiligoi over 8 years ago

No... this is a WN problem... the awk script called from condor_startup.sh.

Bug #1492 was about a problem of Condor-G monitoring.

Igor

#4 Updated by Parag Mhashilkar over 8 years ago

There is not enough information from the starter logs below except the SIGQUIT signal that the job was evicted. What if the job indeed exited with 0 during the process? Then the accounting will be off again. How many such cases do you see? I am trying to see if it is worth delaying v2.5.3 for this one.

10/31/11 14:21:12 (pid:32126) Create_Process succeeded, pid=32203
10/31/11 15:09:29 (pid:32126) Got SIGQUIT. Performing fast shutdown.
10/31/11 15:09:29 (pid:32126) ShutdownFast all jobs.
10/31/11 15:09:30 (pid:32126) Process exited, pid=32203, status=0
10/31/11 15:09:31 (pid:32126) Last process exited, now Starter is exiting

#5 Updated by Parag Mhashilkar over 8 years ago

  • Target version changed from v2_5_3 to v2_5_4

#6 Updated by Igor Sfiligoi over 8 years ago

Not critical; I don't expect it to be in 2.5.3.

And I don't have an answer on how to fix it.
I just noticed one problem, now someone needs to have a deeper look and look what is the proper way to fix it.
I am happy to give that person access to the logs... just let me know.

#7 Updated by Parag Mhashilkar over 8 years ago

It will most likely be me looking at it. If there is nothing critical in the relevant log files can you please upload the logs with this issue? This will make it easier to transfer the issue along with all the info to someone else if need be. Or else let me know how to get the required logs.

#8 Updated by Igor Sfiligoi over 8 years ago

After our recent discussion of moving to parsing proper startd history instead of the generic starter log, I am leaning toward abandoning this ticket, as it may be a waste of time.

However, we better make sure the alternative actually works:
http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#param:StartdHistory

Igor

#9 Updated by Parag Mhashilkar over 8 years ago

  • Target version changed from v2_5_4 to v2_5_5

#10 Updated by Burt Holzman over 8 years ago

  • Priority changed from Normal to Low

#11 Updated by Parag Mhashilkar over 8 years ago

  • Target version changed from v2_5_5 to v2_5_6

#12 Updated by Parag Mhashilkar over 8 years ago

  • Target version changed from v2_5_6 to v2_5_7

#13 Updated by Burt Holzman over 8 years ago

  • Target version changed from v2_5_7 to v2_7_x

#14 Updated by Burt Holzman about 8 years ago

  • Priority changed from Low to Normal

#15 Updated by Parag Mhashilkar almost 8 years ago

Talked to Igor. The link above is now broken. It is actually referring to STARTD_HISTORY attribute.

What we want is
  • the glidein should configure condor daemons to write STARTD_HISTORY
  • parse the required info from STARTD_HISTORY rather than the starter logs

#16 Updated by Parag Mhashilkar almost 8 years ago

Note: STARTD_HISTORY is only available since 7.4 but that should be ok as we don't support old condor releases that far in past.

#17 Updated by Parag Mhashilkar about 7 years ago

  • Target version changed from v2_7_x to v3_x

#18 Updated by Parag Mhashilkar over 6 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

#19 Updated by Parag Mhashilkar over 3 years ago

  • Assignee changed from Marco Mambelli to Dennis Box
  • Target version changed from v3_x to v3_2_17

#20 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_17 to v3_2_18

#21 Updated by Marco Mambelli over 3 years ago

  • Target version changed from v3_2_18 to v3_2_19

#22 Updated by Dennis Box over 3 years ago

I took a look at this. From Parags old comments:

Talked to Igor. The link above is now broken. It is actually referring to STARTD_HISTORY attribute.
What we want is

the glidein should configure condor daemons to write STARTD_HISTORY
parse the required info from STARTD_HISTORY rather than the starter logs

The first part was done some time ago, we create and bring back the StartdHistoryLog with every glidein.

To understand the second part, I set up a stand alone condor pool to see what the startd_history shows for preempted jobs. I was able to evict jobs with lower priorities to run jobs with higher ones. Unfortunately, when a job is evicted, the startd_history file shows it completing normally as well, i.e. transitioning from jobstatus 2 to jobstatus 4. I am not sure this will help us, unless I am looking at the wrong part of the log.

#23 Updated by Dennis Box over 3 years ago

I looked again at the StartdHistoryLog. While Jobstatus doesn't tell you what you want to know for an evicted job, numjobstarts, numshadowstarts, ExitSignal and probably some other classad attributes do give useful information.

#24 Updated by Dennis Box over 3 years ago

I am able to set up preemption on glideins but do not see the errors originally reported by Igor 5 years ago.

Here is the starterlog of a glidein with a pre-emption:

[root@fermicloud365 entry_ITB_FC_CE2]# cat_StarterLog.py job.1739.0.err
03/16/17 11:59:17 (pid:8525) DockerProc::Detect()
03/16/17 11:59:17 (pid:8525) Failed to run '/usr/bin/docker -v'.
03/16/17 11:59:17 (pid:8525) DockerAPI::detect() failed to detect the Docker version; assuming absent.
03/16/17 11:59:35 (pid:8825) **********************************************
03/16/17 11:59:35 (pid:8825) * condor_starter (CONDOR_STARTER) STARTING UP
[.......]
03/16/17 11:59:35 (pid:8825) Running job as user same uid as parent: personal condor
03/16/17 11:59:35 (pid:8825) Create_Process succeeded, pid=8832
03/16/17 11:59:35 (pid:8825) Unable to write into oom_adj file for the starter: (errno=13, Permission denied)
03/16/17 12:06:55 (pid:8825) Got SIGTERM. Performing graceful shutdown.
03/16/17 12:06:55 (pid:8825) ShutdownGraceful all jobs.
03/16/17 12:06:55 (pid:8825) Process exited, pid=8832, signal=15
03/16/17 12:06:56 (pid:8825) Last process exited, now Starter is exiting
03/16/17 12:06:56 (pid:8825) *
condor_starter (condor_STARTER) pid 8825 EXITING WITH STATUS 0
03/16/17 12:06:56 (pid:27027)
**************************************************
03/16/17 12:06:56 (pid:27027) * condor_starter (CONDOR_STARTER) STARTING UP
03/16/17 12:06:56 (pid:27027) *
/tmp/glide_07vpwY/main/condor/sbin/condor_starter
[.......]
03/16/17 12:06:56 (pid:27027) Running job as user same uid as parent: personal condor
03/16/17 12:06:56 (pid:27027) Create_Process succeeded, pid=27068
03/16/17 12:06:56 (pid:27027) Unable to write into oom_adj file for the starter: (errno=13, Permission denied)
03/16/17 12:16:56 (pid:27027) Process exited, pid=27068, status=0
03/16/17 12:16:57 (pid:27027) Got SIGQUIT. Performing fast shutdown.
03/16/17 12:16:57 (pid:27027) ShutdownFast all jobs.
03/16/17 12:16:57 (pid:27027) ** condor_starter (condor_STARTER) pid 27027 EXITING WITH STATUS 0

And here from the stdout file job.1739.0.out:

Total jobs/goodZ jobs/goodNZ jobs/badSignal jobs/badOther jobs below are normalized to 1 Core === Stats of main ===
03/16/17 11:59:35 Starting job 178.5 from 131.225.155.130:9615?addrs=131.225.155.130-9615&noUDP&sock=886427_c2e7_5150
03/16/17 12:06:55 Terminated job 178.5 from 131.225.155.130:9615?addrs=131.225.155.130-9615&noUDP&sock=886427_c2e7_5150 signal 15 duration 440
03/16/17 12:06:56 Starting job 180.7 from 131.225.155.130:9615?addrs=131.225.155.130-9615&noUDP&sock=886427_c2e7_5159
03/16/17 12:16:56 Terminated job 180.7 from 131.225.155.130:9615?addrs=131.225.155.130-9615&noUDP&sock=886427_c2e7_5159 status 0 duration 600 =================
Total jobs 2 utilization 1040
Total goodZ jobs 1 (50%) utilization 600 (57.6923%)
Total goodNZ jobs 0 (0%) utilization 0 (0%)
Total badSignal jobs 1 (50%) utilization 440 (42.3077%)
Total badOther jobs 0 (0%) utilization 0 (0%) === End Stats of main ===

I can get the same information from the StartdHistoryLog that I get from the stdout file (job.1739.0.out) but I don't see the point.

Am I preempting wrong? Did this get fixed as a side effect of something else in the intervening 5 years?

#25 Updated by Marco Mambelli about 3 years ago

  • Target version changed from v3_2_19 to v3_2_20

#26 Updated by Marco Mambelli almost 3 years ago

  • Target version changed from v3_2_20 to v3_2_21

#27 Updated by Marco Mambelli over 2 years ago

  • Target version changed from v3_2_21 to v3_2_22

#28 Updated by Dennis Box over 2 years ago

  • Estimated time set to 1.00 h

I intend to look at this ticket one last time briefly and then if what I believe is the case is so cancel it. I believe it was fixed as a side effect of something else quite some time ago.

#29 Updated by Dennis Box over 2 years ago

  • Status changed from Assigned to Closed

This ticket should have been closed when commit 22852e957dcb762ace438766ded5387110d6ec48 was made as part of #6415
Log message:

commit 22852e957dcb762ace438766ded5387110d6ec48
Author: Igor Sfiligoi <>
Date: Wed Jun 4 15:19:54 2014 -0700

6415# Create the StartdHistoryLog and ship it back. Add the tool to extract it

Closing ticket



Also available in: Atom PDF