Ideas for station tar/merge file retrieval
The intention is to recover data files that have been archived in tar format as transparently as possible using the sam station.
Currently files with no valid locations are not included in snapshots. A file which is only physically available in a tar file will therefore not be in the snapshot. One solution to this is to add tar locations as "virtual" file locations pointing to the relevant tar file. These could be calculated on the fly, but this would likely be expensive as it involves checking all direct descendants of every file that otherwise has no location. Alternatively, the location could be precomputed when the tar file is declared (or undeclared). This would require backfilling all the existing tar files, but it shouldn't be too difficult to do so.
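The precomputed option can be illustrated with a minimal sketch. It assumes an in-memory catalogue and a "tar:" prefix for virtual locations; the class, its method names, and the prefix are all hypothetical and are not part of the existing sam schema or API.

```python
class LocationCatalogue:
    """Hypothetical in-memory stand-in for the file location table."""

    def __init__(self):
        # file name -> set of location strings (real or virtual)
        self.locations = {}

    def declare_tar(self, tar_name, member_names):
        # When a tar file is declared, record a virtual location
        # for every file it contains.
        for name in member_names:
            self.locations.setdefault(name, set()).add("tar:" + tar_name)

    def undeclare_tar(self, tar_name, member_names):
        # When a tar file is undeclared, remove the virtual locations again.
        for name in member_names:
            locs = self.locations.get(name)
            if locs is None:
                continue
            locs.discard("tar:" + tar_name)
            if not locs:
                del self.locations[name]

    def in_snapshot(self, name):
        # A file qualifies for a snapshot if it has any location at all.
        return bool(self.locations.get(name))
```

Backfilling the existing tar files would then amount to calling the equivalent of `declare_tar` once for each tar file that has already been declared.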
If the station gets a request for a file that has no directly available locations, but is available from a tar file, it needs to be able to locate the relevant file. Adding virtual locations as described above would be the simplest way of doing this. Requests for files contained in the same tar file should be grouped for the most efficient use of retrieved tar files.
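The grouping step might look something like the following sketch. It assumes the virtual locations are available as a simple mapping from file name to tar file name; the function name and the mapping are hypothetical, not existing station code.

```python
from collections import defaultdict


def group_by_tar(requested, virtual_locations):
    """Group requested file names by the tar file that contains them.

    virtual_locations is a hypothetical mapping: file name -> tar file name.
    Files without a virtual location are returned separately.
    """
    groups = defaultdict(list)
    unresolved = []
    for name in requested:
        tar_name = virtual_locations.get(name)
        if tar_name is None:
            unresolved.append(name)
        else:
            groups[tar_name].append(name)
    return dict(groups), unresolved
```

With the requests grouped this way, each tar file only has to be fetched and opened once, however many of its members have been requested.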
If the tar file is not currently in the cache, it needs to be pulled from the tape location. This can be done by creating an internal station lightweight project (similar to what is done for station to station file routing).
When the tar file is available, the desired files need to be extracted from it. This requires at least one new RPC call for the stager, which has to do the physical unpacking. When the files are extracted, they are added to the cache as ordinary cached files and handed off to the projects in the usual way.
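As a rough illustration of the unpacking step only, the sketch below uses Python's standard tarfile module to pull selected members out of a tar file. The function name and arguments are hypothetical, and it does not show the stager RPC interface, which is not defined here.

```python
import os
import shutil
import tarfile


def extract_members(tar_path, wanted, dest_dir):
    """Extract only the named regular files from tar_path into dest_dir.

    Returns the paths of the extracted files.
    """
    wanted = set(wanted)
    extracted = []
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.name not in wanted or not member.isfile():
                continue
            # Write under the base name only, so that a member path
            # cannot place a file outside dest_dir.
            target = os.path.join(dest_dir, os.path.basename(member.name))
            src = tar.extractfile(member)
            with src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Passing every member name as `wanted` would give the "unpack the whole lot" behaviour, so the same call could serve either policy.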
Is it better to extract only the requested files from the tar file, or to unpack the whole lot in case the others are wanted? If only the requested files are unpacked, how long should the tar file be retained in the cache in case somebody does want one of the other files in it?