Treat Disconnect and Job Failure Differently
A thought I've had as we try to move more of our processing off-site. Many OSG sites will disconnect a job if priorities on the site change, and then put the job back in the queue. Condor's logic is that a job will do the same thing every time it runs, so putting it back in the queue is equivalent to trying again. With a SAM project, however, the file the job was working on when it was evicted is just marked "skipped" and never retried. Would it be possible to flag these disconnects and have SAM put the file back in the queue of files to be processed?
This is not a pressing request, obviously, but something to turn over in your mind to better get the different FIFE tools working together.
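To make the request concrete, here is a rough sketch of the behaviour I have in mind. It is a toy model in plain Python, not SAM or Condor code: the `FileQueue` class, the `Outcome` values, the status strings other than "skipped", and the `max_retries` limit are all hypothetical names made up for illustration. It also assumes the eviction can be detected and reported somehow, which is not a given.

```python
from collections import deque
from enum import Enum


class Outcome(Enum):
    """How a job's work on one file ended (hypothetical categories)."""
    DONE = "done"
    FAILED = "failed"      # the job itself failed on this file
    EVICTED = "evicted"    # the batch system removed the job mid-file


class FileQueue:
    """Toy stand-in for a project's list of files to process."""

    def __init__(self, files, max_retries=2):
        self.pending = deque(files)
        self.status = {}
        self.attempts = {f: 0 for f in files}
        # Assumed cap so a file that is always evicted cannot loop forever.
        self.max_retries = max_retries

    def next_file(self):
        """Hand out the next file, or None when nothing is left."""
        if not self.pending:
            return None
        f = self.pending.popleft()
        self.attempts[f] += 1
        return f

    def report(self, f, outcome):
        """Record the outcome; only evictions put the file back in the queue."""
        if outcome is Outcome.EVICTED and self.attempts[f] <= self.max_retries:
            self.pending.append(f)
            self.status[f] = "requeued"
        elif outcome is Outcome.DONE:
            self.status[f] = "consumed"
        else:
            # Genuine failures, and evictions past the retry cap, are skipped.
            self.status[f] = "skipped"
```

The point of the sketch is only the split in `report`: a disconnect sends the file back to the end of the queue, while a real job failure still ends up "skipped" as it does today.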
#1 Updated by Robert Illingworth about 4 years ago
I've already been thinking along these lines. Putting them back in the queue is doable, but complex. And there are other issues, like how do we even detect that Condor has preempted a job? So it's something we're already thinking about, but it's been lower down the priority list.