Feature #6843

Don't produce orphaned RootOutput-* and TFileService-* files

Added by Christopher Backhouse about 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Category: Application
Target version:
Start date: 08/20/2014
Due date:
% Done: 100%
Estimated time: 1.00 h
Spent time:
Scope: Internal
Experiment: NOvA
SSI Package: art
Duration:

Description

I understand the purposes of these files: 1) to avoid creating incomplete files under the output filename, and 2) to avoid assuming that any particular tmp directory is large enough.

But I get huge numbers of them cluttering up all my working directories. Very frequently in code development the job is forcibly killed.

In principle it's possible to do a sequence like this:
  • Create your RootOutput file and keep a file handle open to it.
  • rm the RootOutput file. As long as you keep the handle open, the file still exists, but if the job crashes or is killed, nothing is left behind.
  • At the end of the job, link your file back into the filesystem at the output filename.

See attachment for proof of principle.

I think for ROOT output you could create and unlink the file first and then open the TFile on the /proc/self/fd/ location.
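The sequence above can be sketched with plain POSIX calls. This is an illustration only, not the attached test_tmp.cxx and not how art handles its files. The function names and the `RootOutput-XXXXXX` template are hypothetical. Whether a fully unlinked file can be linked back by name is platform dependent, so this sketch copies the data out through the still-open descriptor at the end of the job instead of relinking. That costs extra I/O and, briefly, twice the disk space.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Create a scratch file in `dir` and unlink it straight away.
// The returned descriptor keeps the data alive, but no directory entry
// remains, so nothing is left behind if the process is killed.
// Returns -1 on failure.
int open_anonymous_scratch(const std::string& dir)
{
  std::string tmpl = dir + "/RootOutput-XXXXXX";  // hypothetical name pattern
  int fd = mkstemp(&tmpl[0]);                     // creates and opens a unique file
  if (fd < 0) return -1;
  if (unlink(tmpl.c_str()) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// At the end of a successful job, write the scratch data to the final name.
// Assumption: copying stands in for "linking the file back".
bool publish(int fd, const std::string& output)
{
  assert(fd >= 0);
  if (lseek(fd, 0, SEEK_SET) < 0) return false;
  int out = open(output.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (out < 0) return false;

  char buf[65536];
  bool ok = true;
  ssize_t n = 0;
  while (ok && (n = read(fd, buf, sizeof buf)) > 0) {
    ssize_t off = 0;
    while (off < n) {                 // handle short writes
      ssize_t w = write(out, buf + off, static_cast<size_t>(n - off));
      if (w < 0) { ok = false; break; }
      off += w;
    }
  }
  if (n < 0) ok = false;              // read error
  if (close(out) != 0) ok = false;
  return ok;
}
```

A job would call `open_anonymous_scratch()` at start-up, write through the descriptor, and call `publish()` only once it has finished cleanly.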

If this is too elaborate, is it possible to put logic into the signal-handler for when I press Ctrl-C twice to delete these temporary files? This is by far the main way I create them. Explicitly killing from another shell is rarer.
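A minimal sketch of the signal-handler variant, assuming the temporary file names are registered up front. All names here are hypothetical and this is not art's actual signal handling. The first SIGINT is left for a clean shutdown; the second removes the registered files and sets a flag for the main loop to act on. The handler restricts itself to `unlink()`, which is async-signal-safe, and to fixed-size storage.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <cstring>
#include <unistd.h>

// Fixed-size storage, because a signal handler must not allocate memory.
constexpr int kMaxTmpFiles = 8;
constexpr int kMaxPathLen = 256;
static char g_tmp_paths[kMaxTmpFiles][kMaxPathLen];
static volatile std::sig_atomic_t g_tmp_count = 0;
static volatile std::sig_atomic_t g_sigint_count = 0;
static volatile std::sig_atomic_t g_abort_requested = 0;

// Record a temporary file. Call from normal code, before the file is created.
bool register_tmp_file(const char* path)
{
  assert(path != nullptr);
  if (g_tmp_count >= kMaxTmpFiles || std::strlen(path) >= kMaxPathLen) return false;
  std::snprintf(g_tmp_paths[g_tmp_count], kMaxPathLen, "%s", path);
  g_tmp_count = g_tmp_count + 1;  // make the entry visible only once it is complete
  return true;
}

void on_sigint(int)
{
  g_sigint_count = g_sigint_count + 1;
  if (g_sigint_count < 2) return;   // first Ctrl-C: leave files for a clean shutdown
  for (int i = 0; i < g_tmp_count; ++i)
    unlink(g_tmp_paths[i]);         // second Ctrl-C: drop the temporary files
  g_abort_requested = 1;            // main loop is expected to stop when it sees this
}

bool install_sigint_cleanup()
{
  struct sigaction sa;
  std::memset(&sa, 0, sizeof sa);
  sa.sa_handler = on_sigint;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGINT, &sa, nullptr) == 0;
}
```

This does nothing for SIGKILL or a kill from another shell with an uncatchable signal, which matches the narrower request above.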

test_tmp.cxx (630 Bytes) Christopher Backhouse, 08/21/2014 06:09 PM

History

#1 Updated by Marc Paterno about 6 years ago

  • Status changed from New to Rejected

The behavior you are observing is actually a requirement from our stakeholders: even when a program fails (that is, the program is shutting down due to an exception), and when a program is shut down in response to a signal (the Ctrl-C from the command line), the art executable does as much as it can to avoid losing data. So not only are the files not deleted, but significant effort is applied to making sure the files are closed cleanly.

If the NOvA experiment wants to raise a different requirement, please let us know, and we will schedule the discussion for a Stakeholders meeting.

#2 Updated by Kanika Sachdev about 6 years ago

Is it not possible to make this behaviour optional, at the very least, so that we can configure with a fhicl switch?

#3 Updated by Christopher Backhouse about 6 years ago

From just a few hours of work on my module I see I had 25 RootOutput files and the same number of TFileService files.

This isn't valuable detector data that needs to be preserved; it is all completely worthless. I would support Kanika's switch, and have it default to off. Anyone who cares that much about the output of failed jobs can turn it on.

#4 Updated by Christopher Backhouse about 6 years ago

Is it necessary for the temporary files to have a unique name every time?
Can't they just be ${OUTPUT}.partial.root?
If a job is run with the same output name twice it's going to overwrite its previous result if it succeeds, so this would be no different for the temporary file.
This would address my main problem, which is that hundreds of these things build up.
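To make the fixed-name idea concrete, here is a hedged sketch using std::filesystem (C++17). The helper `write_via_partial` and its callback signature are hypothetical and are not part of art or ROOT. The job writes to `<output>.partial.root` and renames it to the output name only on success, so repeated failed runs leave at most one stray file per output name.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <functional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Run `fill` against "<output>.partial.root" and rename the result to `output`
// only if `fill` reports success and the stream closed without error.
bool write_via_partial(const fs::path& output,
                       const std::function<bool(std::ofstream&)>& fill)
{
  assert(!output.empty());
  fs::path partial = output;
  partial += ".partial.root";       // same name every run, so it is overwritten

  std::ofstream os(partial, std::ios::binary | std::ios::trunc);
  if (!os) return false;

  const bool ok = fill(os);
  os.close();
  if (!ok || !os) return false;     // failed job: the single partial file stays

  std::error_code ec;
  fs::rename(partial, output, ec);  // same directory; replaces an existing output
  return !ec;
}
```

For example, `write_via_partial("hist.root", [](std::ofstream& os) { os << "data"; return true; })` leaves only `hist.root` behind, while a callback that returns false leaves only `hist.root.partial.root`.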

#5 Updated by Christopher Green about 6 years ago

  • Tracker changed from Bug to Feature
  • Category set to I/O
  • Status changed from Rejected to Feedback
  • SSI Package art added
  • SSI Package deleted ()

We will schedule this for discussion at the next stakeholders meeting.

#6 Updated by Christopher Green about 6 years ago

I should add that, if clutter is your only problem, you can set the RootOutput module parameter tmpDir to /tmp or /var/tmp, and the files will be out of your sight and cleaned up anyway after 10 days. However, they will still be available if you need them. If the files are sufficiently large and/or numerous to cause concern about filling up these central areas, you can set up your own scratch directory and run the standard command tmpwatch yourself as a regular cron job (optionally setting a much smaller interval than the usual 10 days).

#7 Updated by Christopher Backhouse about 6 years ago

Interesting.

Of course, that would be much more useful if I could configure that behaviour by default via an environment variable or ~/.artrc or whatever. But suggestions for those sorts of things have suffered in the past from concerns about reproducibility of people's problems when they all have distinct setups.

(I'd also set fastCloning: false in such a global config. It's pretty unusual for that not to wind up being necessary for some reason or other.)

#8 Updated by Christopher Green about 6 years ago

  • Category changed from I/O to Application
  • Status changed from Feedback to Assigned
  • Assignee set to Christopher Green
  • Target version set to 1.13.00
  • Estimated time set to 1.00 h
  • Experiment NOvA added
  • Experiment deleted (-)

Per discussion in stakeholder meeting, we will implement a --tmpdir option.

#9 Updated by Christopher Green about 6 years ago

  • Target version changed from 1.13.00 to 1.12.00
  • % Done changed from 0 to 100

--tmpdir option implemented with 0af5798.

#10 Updated by Christopher Green almost 6 years ago

  • Status changed from Assigned to Resolved

#11 Updated by Christopher Green over 5 years ago

  • Status changed from Resolved to Closed
