File Merging

This article describes how files produced during production are automatically merged. The purpose of merging is to increase the average size of files stored on tape, that is, to avoid storing large numbers of small files. More details about merging workflows can be found in docdb 25439, presented at the Sept. 12, 2019 Analysis Tools meeting.

Merging SAM Parameters

Merging work flows are controlled by two SAM parameters.

  • merge.merge
  • merge.merged

Each SAM parameter takes an integer value of 0 or 1. If parameter merge.merge is 1, the file is a candidate for merging. If parameter merge.merged is 1, the file has been merged. Once generated, SAM parameter merge.merge is never modified. SAM parameter merge.merged may be modified by subsequent processing.

Configuring Merging Work Flows

The merging SAM parameters are generated when a file is first produced, according to fcl parameters of art service FileCatalogMetadataMicroBooNE. Parameter merge.merge is set according to whether fcl parameter "Merge" is true or false (default is false). Parameter merge.merged is always set to 0 initially.
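
The mapping from the fcl parameter to the two SAM parameters can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the code of the art service; the function name initial_merge_metadata is hypothetical.

```python
def initial_merge_metadata(merge_flag=False):
    """Return the merging SAM parameters for a newly produced file.

    merge_flag stands in for fcl parameter "Merge" (default false).
    Hypothetical helper, for illustration only.
    """
    return {
        "merge.merge": 1 if merge_flag else 0,  # candidate for merging?
        "merge.merged": 0,                      # always 0 when first produced
    }
```

With "Merge" set to true, the sketch yields merge.merge 1 and merge.merged 0, which is the combination the merge processing later queries for.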

The fcl configuration of art service FileCatalogMetadataMicroBooNE is typically not included in standard fcl files stored in a uboonecode release. Rather, the complete fcl configuration of FileCatalogMetadataMicroBooNE is added using a fcl wrapper generated by project.py. The generated fcl configuration is controlled by xml parameters. To set fcl parameter "Merge" to be true, include the following xml element in the xml stage configuration.

<merge>1</merge>

For the record, inside project.py, the complete fcl configuration of art service FileCatalogMetadataMicroBooNE is generated by function "get_sam_metadata" of python module "experiment_utilities."

FTS Merging Dropbox

In any project.py work flow, when a file is copied to an FTS dropbox directory, the dropbox directory is chosen by function "get_dropbox" of python module "experiment_utilities." The choice of dropbox is determined mainly by the SAM metadata of the file. If SAM parameter merge.merge is 1, the file will usually be sent to a special FTS merging dropbox. If SAM parameter merge.merge is 0 or missing, the file is sent to a standard tape dropbox. There is one exception: if a file is larger than 1 GB (hardwired value), it is copied to the tape storage dropbox, as if SAM parameter merge.merge had been 0.
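
The dropbox decision described above can be summarized by the following sketch. It is not the actual "get_dropbox" function; the function name choose_dropbox, the return labels "merge" and "tape", and the interpretation of 1 GB as 10**9 bytes are all assumptions made for illustration.

```python
SIZE_LIMIT_BYTES = 10**9  # assumption: the hardwired 1 GB limit taken as 10**9 bytes


def choose_dropbox(metadata, file_size):
    """Return "merge" or "tape" for a file (hypothetical labels).

    metadata  -- dict of SAM metadata; merge.merge may be missing
    file_size -- file size in bytes
    """
    wants_merge = int(metadata.get("merge.merge", 0)) == 1
    # Files larger than the limit go to tape as if merge.merge were 0.
    if wants_merge and file_size <= SIZE_LIMIT_BYTES:
        return "merge"
    return "tape"
```

For example, a 200 MB file with merge.merge 1 would go to the merging dropbox, while a 2 GB file with the same metadata would go directly to the tape dropbox.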

Files sent to the FTS merging dropbox are copied to, and get a SAM location in, a non-tape-backed area of dCache (specifically the scratch area /pnfs/uboone/scratch). This disk storage location is temporary and lasts only until the file is processed by follow-on merge processing.

Merge Processing

Merge processing is handled by script merge2.py, which can be found in uboone suite package ubutil/scripts. The logic of merge.py is outlined in this section. Merge.py interacts with SAM, with the batch system, and with its own merging database. Bookkeeping is handled by a combination of SAM parameters (merge.merge and merge.merged) and the merging database.

Files that are candidates for merging are identified by the following combination of criteria.

  • Files are queried using SAM dimension "merge.merge 1 and merge.merged 0".
  • Each queried file is checked for a SAM disk location.
  • Each queried file is checked to confirm that it actually exists at the specified SAM disk location.

If a queried file already has a tape location (for example, because its size was greater than 1 GB), its SAM parameter merge.merged is updated to have value 1.
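
The selection logic above can be sketched as a small classification function. This is a simplified illustration and not the code of the merge script: the location records, the exists_on_disk callback, the result labels, and the choice to test for a tape location first are all assumptions. The source does not say what happens to a file that has neither a tape location nor an existing disk copy, so the sketch simply reports it as "not_ready".

```python
# SAM dimension quoted in the text above.
MERGE_QUERY = "merge.merge 1 and merge.merged 0"


def classify(locations, exists_on_disk):
    """Classify one file returned by MERGE_QUERY.

    locations      -- list of dicts with keys "type" ("disk" or "tape")
                      and "path" (hypothetical record layout)
    exists_on_disk -- callable taking a path and returning True or False
    """
    # A file that already reached tape only needs its bookkeeping updated.
    if any(loc["type"] == "tape" for loc in locations):
        return "mark_merged"
    # A merge candidate must have a disk location that really exists.
    for loc in locations:
        if loc["type"] == "disk" and exists_on_disk(loc["path"]):
            return "candidate"
    return "not_ready"
```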

If queried files meet all of the above criteria (i.e. have disk locations), then files are grouped with similar files for merging. The actual merging is done by batch jobs. Batch jobs are not submitted unless or until merge candidates meet either of the following two criteria.

  • Files to be merged exceed some minimum size.
  • Files to be merged exceed some minimum age (usually three days).
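
A minimal sketch of this submission test follows. It assumes that the size criterion applies to the total size of a group and that the age criterion applies to the oldest file in the group; neither detail is stated above. The function name, the record layout, and the min_total_size argument are hypothetical.

```python
from datetime import datetime, timedelta


def ready_to_submit(files, now, min_total_size, min_age=timedelta(days=3)):
    """Return True if a group of merge candidates should be submitted.

    files          -- list of dicts with "size" (bytes) and "created" (datetime)
    now            -- current time as a datetime
    min_total_size -- size threshold in bytes (value not given in the text)
    min_age        -- age threshold; three days is the usual value
    """
    if not files:
        return False
    total_size = sum(f["size"] for f in files)
    oldest = min(f["created"] for f in files)
    # Either criterion alone is enough to trigger a merging batch job.
    return total_size > min_total_size or (now - oldest) > min_age
```

The age criterion ensures that a small group of files is not left waiting indefinitely on scratch disk when no further similar files arrive.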

After a merging batch job runs, the results are checked. If merging was successful, and after merged files have tape locations, SAM parameter merge.merged of the original unmerged files is set to 1, disk locations are removed from SAM, and the files are deleted from disk. Files generated by merging batch jobs have SAM parameter merge.merge set to 0 (because they are not candidates for further merging), and are stored directly to tape.
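
The bookkeeping for the unmerged files can be illustrated with in-memory records. This sketch does not call SAM or touch any files; the function name finalize_merge and the record layout are hypothetical, and clearing the "locations" list stands in for both removing the disk location from SAM and deleting the file from disk.

```python
def finalize_merge(job_succeeded, merged_has_tape_location, unmerged_records):
    """Update unmerged file records after a merging batch job.

    Returns True if the records were updated, False if nothing was done.
    """
    # Nothing is changed until the merged output is safely on tape.
    if not (job_succeeded and merged_has_tape_location):
        return False
    for record in unmerged_records:
        record["merge.merged"] = 1  # mark the original file as merged
        record["locations"] = []    # disk location removed, file deleted
    return True
```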

Unmerged files that have been successfully merged remain in the SAM database as so-called virtual files (that is, files without any location). Merged files have as their parents both the unmerged virtual files from which they were made, and the parents of the unmerged files.
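
The parentage rule can be sketched as follows. The function name merged_parents and the parents_of mapping are hypothetical; in practice the parentage is read from and declared to SAM.

```python
def merged_parents(unmerged, parents_of):
    """Return the parent list for a merged file, without duplicates.

    unmerged   -- names of the unmerged files that were merged
    parents_of -- dict mapping each unmerged file name to its own parents
    """
    # Start with the unmerged (virtual) files themselves.
    result = list(dict.fromkeys(unmerged))
    # Then add the parents of each unmerged file.
    for name in unmerged:
        for parent in parents_of.get(name, []):
            if parent not in result:
                result.append(parent)
    return result
```

Declaring both sets of parents means that the ancestry of a merged file can still be traced from SAM even though the intermediate unmerged files no longer have any location.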

How Files are Grouped for Merging

Files that are candidates for being merged together must have the same seven cardinal SAM metadata, and must have the same run number. That is, they must agree with respect to the following eight SAM dimensions.

  • file_type
  • file_format
  • data_tier
  • data_stream
  • ub_project.name
  • ub_project.stage
  • ub_project.version
  • run_number
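
Grouping by these eight dimensions amounts to using them as a dictionary key. The sketch below assumes each file is described by a flat dict whose keys are the dimension names listed above plus "file_name"; actual SAM metadata records may be structured differently (for example, run information is typically stored as a list), and the function name group_for_merging is hypothetical.

```python
from collections import defaultdict

GROUP_KEYS = (
    "file_type",
    "file_format",
    "data_tier",
    "data_stream",
    "ub_project.name",
    "ub_project.stage",
    "ub_project.version",
    "run_number",
)


def group_for_merging(files):
    """Group file names by the eight dimensions in GROUP_KEYS.

    files -- list of flat metadata dicts, each with a "file_name" key
    """
    groups = defaultdict(list)
    for metadata in files:
        key = tuple(metadata.get(k) for k in GROUP_KEYS)
        groups[key].append(metadata["file_name"])
    return dict(groups)
```

Two files that differ in any one of the eight values, such as the run number, end up in different groups and are never merged together.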