Non-delivery of log and output files from a SAM job
I've been trying to run the absolute calibration for Luke. I'm using Gavin's multifile method, followed by a postscript that copies back my output files. The project submits and runs fine:
Yet only 1,000 of the 4,800 output histogram files have been delivered:
ls /nova/ana/users/tamsett/AbsCalib_April1_3 | grep -c hist
And NONE of the log files have been written:
ls -lh /nova/data/condor-tmp/tamsett/ | grep "_1736.sh"
So it's very hard to tell what's going on.
Any ideas how to fix either/both of these?
p.s. Sorry if this is the wrong place to post a ticket. I can move it.
#2 Updated by Matthew Tamsett over 6 years ago
I'm not sure they're not overwriting each other; I'll check that.
Also, what I meant to say was that the log files have appeared, just with zero size, and they've never been updated to include any contents, even after the apparent conclusion of their processes.
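For anyone wanting to check how many of their logs are in this state, here is a minimal sketch that lists the zero-size files in a directory. The helper name `empty_files` and the `suffix` filter are made up for illustration and are not part of any NOvA or jobsub tool.

```python
import os


def empty_files(directory, suffix=""):
    """Return the sorted names of zero-byte regular files in `directory`.

    `suffix` is an optional filename filter; the default matches every file.
    (Hypothetical helper, not part of any existing tool.)
    """
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and anything that does not match the filter.
        if not name.endswith(suffix) or not os.path.isfile(path):
            continue
        if os.path.getsize(path) == 0:
            result.append(name)
    return result
```

Calling `empty_files("/nova/data/condor-tmp/tamsett/", ".log")` would list the empty logs in the directory above, assuming the log files end in `.log`, which may not match the real naming.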
#4 Updated by Dominick Rocco over 6 years ago
No, this hasn't been resolved. Denis Box suggested this:
-l 'when_to_transfer_output = ON_EXIT_OR_EVICT' \
It puts that line in the condor .cmd file. Sadly, it does nothing. It might be because the ON_EXIT option is already set and the options don't overwrite each other the way one might hope. Alternatively, SYSTEM_PERIODIC_REMOVE might do something weird with the glideins that just breaks things. Log files also aren't returned when you do a condor_rm, which is frustrating.
Including a ulimit in jobs can get around the SYSTEM_PERIODIC_REMOVE, but not other cases. Sadly, we don't have that tuned correctly now. The default virtual memory limit in runNovaSAM.py is 3.9*1024^3, but that's too high because the grid uses 4,000,000. We need to dial it back slightly and allow some overhead for art_sam_wrap.sh to do its thing.
Still, this only helps us when we use runNovaSAM.py. Any custom grid scripts which are less careful will suffer from the same problem.
There's an off chance that this doesn't happen anymore when we move to jobsub_client, but I'd say it's pretty unlikely.
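To make the memory-limit mismatch above concrete, here is a rough sketch of the arithmetic. It assumes the runNovaSAM.py default of 3.9*1024^3 is in bytes and the grid's 4,000,000 is in KiB (the unit `ulimit -v` uses); neither unit is stated above, so treat both as assumptions. The 200,000 KiB overhead for art_sam_wrap.sh is a placeholder, not a measured number.

```python
# Assumption: the runNovaSAM.py default is expressed in bytes.
DEFAULT_LIMIT_BYTES = 3.9 * 1024**3
# Assumption: the grid's limit is expressed in KiB, as `ulimit -v` expects.
GRID_LIMIT_KIB = 4_000_000


def bytes_to_kib(n_bytes):
    """Convert a byte count to KiB."""
    return n_bytes / 1024


def suggested_limit_kib(grid_limit_kib, overhead_kib):
    """Hypothetical helper: a job limit that leaves room for the wrapper."""
    return grid_limit_kib - overhead_kib


default_kib = bytes_to_kib(DEFAULT_LIMIT_BYTES)

# Under these assumptions the default sits above the grid's limit, so the
# grid would remove the job before the job's own ulimit could trigger.
print(default_kib > GRID_LIMIT_KIB)

# Placeholder overhead of 200,000 KiB; the real value needs measuring.
print(suggested_limit_kib(GRID_LIMIT_KIB, 200_000))
```

Under those assumptions the default works out to roughly 4.09 million KiB, just over the grid's limit, which would explain why the ulimit never gets the chance to fire first.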
#6 Updated by Christopher Backhouse over 6 years ago
I think the problem with ON_EXIT_OR_EVICT is that eviction is what happens when a job gets checkpointed because something with higher priority replaces it, which is something we never do.
Presumably what PERIODIC_REMOVE does is something else, the same as killing the job.
Agreed that jobsub_client probably doesn't help. I think this is at heart a condor problem. Still, we could complain about it now, in the hope that people currently have a better idea of how this all works from working on jobsub_client.
#7 Updated by Gavin Davies about 6 years ago
- Status changed from New to Resolved
This is resolved with the new jobsub_tools v1_3_1_1_2:
From the README:
Emergency patch of v1_3_1_1
- INC000000435312 and related - if users clean up the $TMP directory
before exiting, log and err files do not come back
At least we can call it resolved, since there have been no complaints of this occurring with the new jobsub.
We should keep an eye out for this.