Bug #5807

Non-delivery of log and output files from a SAM job

Added by Matthew Tamsett over 5 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date: 04/02/2014
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Hi All,

I've been trying to run the absolute calibration for Luke. I'm using Gavin's multifile method, followed by a postscript that copies back my output files. The project submits and runs fine:

http://samweb.fnal.gov:8480/station_monitor/nova/stations/nova/projects/tamsett-PCHitMC-S14-03-24-20140401_1736

Yet only 1,000 of the 4,800 output histogram files have been delivered:

ls /nova/ana/users/tamsett/AbsCalib_April1_3 | grep -c hist

And NONE of the log files have been written:

ls -lh /nova/data/condor-tmp/tamsett/ | grep "_1736.sh"

So it's very hard to tell what's going on.

Any ideas how to fix either/both of these?

Thanks

Matthew

p.s. Sorry if this is the wrong place to post a ticket. I can move it.

History

#1 Updated by Christopher Backhouse over 5 years ago

Are you sure your output files aren't overwriting each other?

Absence of logs is odd. I've never seen that. Usually the files are there, even if they're empty.

#2 Updated by Matthew Tamsett over 5 years ago

I'm not sure they're not overwriting each other; I'll check that.
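
One quick way to check for overwriting is to count repeated destination names. This is a minimal sketch, assuming you can list the output file name each job would write; the function name and the example file names are hypothetical and not taken from this project:

```python
from collections import Counter


def colliding_names(output_names):
    """Return {name: count} for every output name written by more than one job."""
    counts = Counter(output_names)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical example: two jobs would both write hist_0.root.
print(colliding_names(["hist_0.root", "hist_1.root", "hist_0.root"]))
```

If the result is non-empty, fewer files than jobs will arrive even when every copy-back succeeds.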

Also, what I meant to say was that the log files have appeared, but with zero size, and they have never been updated to include any contents, even after the apparent conclusion of their processes.
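
To put a number on the zero-size logs, something like the following could be used. It is only a sketch: the helper name `empty_logs` is made up here, and it simply mirrors the `ls | grep` check from the description.

```python
from pathlib import Path


def empty_logs(log_dir, tag):
    """Return sorted names of zero-byte files in log_dir whose name contains tag."""
    return sorted(
        p.name
        for p in Path(log_dir).iterdir()
        if p.is_file() and tag in p.name and p.stat().st_size == 0
    )
```

For the project above this would be called as `empty_logs("/nova/data/condor-tmp/tamsett/", "_1736.sh")`.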

#3 Updated by Gavin Davies over 5 years ago

Was there a resolution to this?

#4 Updated by Dominick Rocco over 5 years ago

No, this hasn't been resolved. Denis Box suggested this:
-l 'when_to_transfer_output = ON_EXIT_OR_EVICT' \

It puts that line in the condor .cmd file. Sadly, it does nothing. This might be because the ON_EXIT option is already set and the options don't override each other the way one might hope. Alternatively, SYSTEM_PERIODIC_REMOVE might do something weird with the glideins that just breaks things. Log files also aren't returned when you do a condor_rm, which is frustrating.

Including a ulimit in jobs can get around the SYSTEM_PERIODIC_REMOVE, but not other cases. Sadly, we don't have that tuned correctly now. The default virtual memory limit in runNovaSAM.py is 3.9*1024^3, but that's too high because the grid uses 4,000,000. We need to dial it back slightly and allow some overhead for art_sam_wrap.sh to do its thing.

Still, this only helps us when we use runNovaSAM.py. Any custom grid scripts which are less careful will suffer from the same problem.
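
The ulimit mismatch can be sketched roughly as follows. Two assumptions are mine and not confirmed in this thread: that the 3.9*1024^3 default is in bytes and that the grid's 4,000,000 is in KiB. The 200,000 KiB headroom is a placeholder, not a tuned value, and the function names are hypothetical rather than anything in runNovaSAM.py.

```python
import resource

GRID_LIMIT_KIB = 4_000_000             # figure quoted above; KiB unit is an assumption
CURRENT_DEFAULT_BYTES = 3.9 * 1024**3  # default quoted above; bytes unit is an assumption


def safe_vmem_limit_bytes(grid_limit_kib=GRID_LIMIT_KIB, headroom_kib=200_000):
    """Return a virtual-memory cap in bytes that sits below the grid's limit.

    headroom_kib is a placeholder allowance for wrapper-script overhead.
    """
    return (grid_limit_kib - headroom_kib) * 1024


def apply_vmem_limit(limit_bytes):
    """Lower this process's soft address-space limit, like `ulimit -v` in a shell."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))


# Under the unit assumptions, the current default is above the grid's limit,
# so the grid would remove the job before the job's own ulimit could trigger.
print(CURRENT_DEFAULT_BYTES > GRID_LIMIT_KIB * 1024)
```

With a cap below the grid's limit, an over-sized job would fail on its own allocation error and exit normally, instead of being removed from outside.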

There's an off chance that this doesn't happen anymore when we move to jobsub_client, but I'd say it's pretty unlikely.

#5 Updated by Dominick Rocco over 5 years ago

I'll add that if jobsub did this:
-l 'when_to_transfer_output = ON_EXIT_OR_EVICT' \

by default, we might not have that problem. Not sure why you ever wouldn't want that to be the case.

#6 Updated by Christopher Backhouse over 5 years ago

I think the problem with ON_EXIT_OR_EVICT is that eviction is what happens when a job gets checkpointed because something with higher priority replaces it, which is something we never do.

Presumably what PERIODIC_REMOVE does is something else, equivalent to killing the job.

Agreed that jobsub_client probably doesn't help. I think this is at heart a condor problem. Still, we could complain about it now in the hope that people currently have a better idea of how this all works, since they are working on jobsub_client.

#7 Updated by Gavin Davies over 5 years ago

  • Status changed from New to Resolved

This is resolved with the new jobsub_tools v1_3_1_1_2:

From the README:

v1_3_1_1_2 8/27/14 ==============================================================
Emergency patch of v1_3_1_1
- INC000000435312 and related - if users clean up the $TMP directory
before exiting, log and err files do not come back

At least we can call it resolved, since there have been no complaints of this occurring with the new jobsub.
We should keep an eye out for this.

#8 Updated by Alexander Himmel about 3 years ago

  • Status changed from Resolved to Closed

