Support #4040

Investigate if the events in dsag:/data/complete/_ds50daq_eb00_20130609-220613.root can be recovered

Added by Kurt Biery about 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assignee: -
Target version: -
Start date: 06/10/2013
Due date:
% Done: 0%
Estimated time:
Duration:

Description

This file was not gracefully closed at end-run time because the EventBuilder (in a single EB and no AG system) hung at end run and had to be killed.

I am working on copying this file to cluck:/home/biery/scratch/ds50Data, but this will take a while - this is a 146 GB file.

History

#1 Updated by Kurt Biery about 6 years ago

Here is an email excerpt from Paul explaining that ROOT file recovery is essentially impossible:

I'm afraid that for all practical purposes file recovery in the real world is impossible.

The tree metadata and branch data buffers don't get written to file when it is not closed properly and so a lot of information about the trees is just destroyed.

...

With extreme effort it would be possible to recover branch data that had been flushed out of TBasket buffers, but it would take very careful hand work, and data that had not been flushed would be missing. It could take weeks to do.

In the art case, the situation is much worse, there is a great deal of information about the contents of the file that is not written out until file close time, and when the program crashes this is never done at all, and the vital art information about the events in the file just disappears (things like the event index, the parentage information, the parameter sets, the process history, etc.).

...

In an earlier email, Paul described a bit about the ROOT data storage format:

The reason that recovery is difficult is that internally a root file looks very much like a unix filesystem, there is a superblock, a list of free blocks, a top-level directory, etc. So a failed close, or missing close on a root file is much like the effect of a system crash on a mounted filesystem, and root files are not journaled like ext3 or ext4. They are more like ext2 was.

#2 Updated by Kurt Biery about 6 years ago

Some additional information from Paul:

So let me amplify a little:

In root a TTree can be configured to checkpoint itself after every X many bytes is written to the tree. This makes it so that when the improperly closed file is opened in update mode, which triggers a recovery operation, the tree will reset itself to appear like it did at the last checkpoint (metadata and data).

Unfortunately this does not play well with the rest of art, because of the way that all the art metadata is not even written until formal file close time.
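To make the checkpoint behaviour Paul describes concrete, here is a toy sketch in Python. It is not the ROOT file format and not art code: `ToyTreeWriter`, `recover`, `simulate`, the sidecar `.ckpt` file and the 64-byte interval are all hypothetical names and values invented for illustration. (In ROOT itself the checkpoint interval is, as far as I know, set with `TTree::SetAutoSave`.) The sketch shows the two effects discussed above: after a crash the data rolls back to the last checkpoint, and metadata that is only written at close time is simply absent.

```python
import json
import os
import struct
import tempfile

CHECKPOINT_EVERY = 64  # payload bytes between checkpoints (toy value)


class ToyTreeWriter:
    """Append-only record file plus a sidecar checkpoint (NOT the ROOT format)."""

    def __init__(self, path):
        self.path = path
        self.ckpt_path = path + ".ckpt"
        self.fh = open(path, "wb")
        self.entries = 0
        self.since_ckpt = 0

    def fill(self, payload):
        # Each record is a 4-byte little-endian length followed by the payload.
        self.fh.write(struct.pack("<I", len(payload)) + payload)
        self.entries += 1
        self.since_ckpt += len(payload)
        if self.since_ckpt >= CHECKPOINT_EVERY:
            self.checkpoint()

    def checkpoint(self, run_metadata=None):
        # Flush the data, then atomically record how much of the file is valid.
        self.fh.flush()
        os.fsync(self.fh.fileno())
        state = {"entries": self.entries, "size": self.fh.tell(),
                 "run_metadata": run_metadata}
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.ckpt_path)
        self.since_ckpt = 0

    def close(self, run_metadata):
        # Run-level metadata is written only here, mimicking the art behaviour
        # described above.
        self.checkpoint(run_metadata)
        self.fh.close()


def recover(path):
    """Roll the data file back to the last checkpoint and read what is left."""
    ckpt_path = path + ".ckpt"
    if not os.path.exists(ckpt_path):
        return [], None  # crashed before the first checkpoint: nothing usable
    with open(ckpt_path) as f:
        state = json.load(f)
    records = []
    with open(path, "r+b") as f:
        f.truncate(state["size"])  # drop bytes written after the checkpoint
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("<I", header)
            records.append(f.read(length))
    return records, state["run_metadata"]


def simulate(n_entries, crash):
    """Write n_entries 16-byte records, then either crash or close cleanly."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "toy.dat")
        writer = ToyTreeWriter(path)
        for i in range(n_entries):
            writer.fill(bytes([i % 256]) * 16)
        if crash:
            writer.fh.close()  # process killed: no final checkpoint, no metadata
        else:
            writer.close({"run": 1})
        records, metadata = recover(path)
        return len(records), metadata
```

With 16-byte payloads a checkpoint is taken after every fourth entry, so `simulate(10, crash=True)` recovers 8 entries and no run metadata, while `simulate(10, crash=False)` recovers all 10 together with the metadata written at close.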

#3 Updated by Kurt Biery about 6 years ago

  • Status changed from New to Closed

