This article describes the process by which files produced during production are automatically merged. The purpose of merging is to increase the average size of files stored on tape, that is, to avoid storing large numbers of small files. More details about merging workflows can be found in docdb 25439, presented at the Sept. 12, 2019 Analysis Tools meeting.
Merging SAM Parameters¶
Merging work flows are controlled by two SAM parameters, merge.merge and merge.merged.
Each parameter takes an integer value of 0 or 1. If parameter
merge.merge is 1, this file is a candidate for merging. If parameter
merge.merged is 1, this file has been merged. Once generated, SAM parameter
merge.merge is never modified. SAM parameter
merge.merged may be modified by subsequent processing.
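The meaning of the two flags can be summarized in a short Python sketch. The function name `merge_state` and the plain dictionary standing in for a file's SAM metadata are assumptions made for illustration; they are not part of any SAM client API. A missing parameter is treated as 0, which matches how a missing merge.merge is handled when a dropbox is chosen.

```python
def merge_state(metadata):
    """Classify a file from its merging SAM parameters.

    `metadata` is a plain dict standing in for the file's SAM metadata
    (an assumption for illustration, not a SAM client object).
    """
    # A missing parameter is treated as 0.
    merge = int(metadata.get("merge.merge", 0))
    merged = int(metadata.get("merge.merged", 0))

    if merge != 1:
        return "not a merge candidate"
    if merged == 1:
        return "merge candidate, already merged"
    return "merge candidate, awaiting merge"
```

For example, `merge_state({"merge.merge": 1, "merge.merged": 0})` describes a file that is still waiting to be merged.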
Configuring Merging Work Flows¶
The merging SAM parameters are generated when a file is first produced, according to fcl parameters of art service FileCatalogMetadataMicroBooNE. SAM parameter
merge.merge is set according to whether fcl parameter "Merge" is true or false (default is false). Parameter
merge.merged is always set to 0 initially.
The fcl configuration of art service
FileCatalogMetadataMicroBooNE is typically not included in standard fcl files stored in a
uboonecode release. Rather, the complete fcl configuration of
FileCatalogMetadataMicroBooNE is added using a fcl wrapper generated by
project.py. The generated fcl configuration is controlled by xml parameters. To set fcl parameter "Merge" to true, include the following xml element in the xml stage configuration.
For the record, inside
project.py, the complete fcl configuration of art service
FileCatalogMetadataMicroBooNE is generated by function "
get_sam_metadata" of python module "
FTS Merging Dropbox¶
In the project.py work flow, when a file is copied to an FTS dropbox directory, the dropbox directory is decided by function "
get_dropbox" of python module "
experiment_utilities." The choice of dropbox is mainly decided by SAM metadata of a particular file. If SAM parameter
merge.merge is 1, the file will usually be sent to a special FTS merging dropbox. If SAM parameter
merge.merge is 0 or missing, the file is sent to a standard tape dropbox. There is an exception if a file is larger than 1 GB (hardwired value), in which case the file is copied to the tape storage dropbox, as if SAM parameter
merge.merge had been 0.
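The dropbox choice described above can be sketched in Python as follows. This is a simplified illustration, not the actual get_dropbox code: the function name `choose_dropbox`, its arguments, and the byte value used for the 1 GB limit are all assumptions.

```python
# Assumed byte value of the hardwired 1 GB limit; the real constant may differ.
ONE_GB = 1_000_000_000


def choose_dropbox(metadata, file_size, merge_dropbox, tape_dropbox):
    """Return the dropbox directory for a file.

    `metadata` is a dict standing in for the file's SAM metadata, and the
    two dropbox arguments are caller-supplied directory names (hypothetical).
    """
    # merge.merge missing is treated the same as merge.merge 0.
    merge_flag = int(metadata.get("merge.merge", 0))

    # Files larger than the limit go to tape as if merge.merge were 0.
    if merge_flag == 1 and file_size <= ONE_GB:
        return merge_dropbox
    return tape_dropbox
```

Passing the two dropbox directories as arguments keeps the sketch free of site-specific paths.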
The FTS merging dropbox has the property that files are copied, and get a SAM location, in a non-tape-backed area of dCache (specifically in the scratch area
/pnfs/uboone/scratch). This disk storage location is temporary until the file is processed by follow-on merge processing.
Merge processing is handled by script
merge2.py, which can be found in uboone suite package
ubutil/scripts. The logic of
merge.py is outlined in this section.
Merge.py interacts with SAM, with the batch system, and with its own merging database. Bookkeeping is handled by a combination of SAM parameters (
merge.merged) and the merging database.
Files that are candidates for merging are identified by the following combination of criteria.
- Files are queried using SAM dimension "
merge.merge 1 and merge.merged 0".
- Queried files are checked for a SAM disk location.
- Queried files are checked to confirm that they actually exist at the specified SAM disk location.
If queried files already have a tape location (e.g. because the size was greater than 1 GB), the SAM parameter
merge.merged is updated to have value 1.
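The selection steps above can be sketched in Python. The record layout (dicts with `name`, `disk_location`, and `tape_location` keys), the function name `select_candidates`, and the order of the checks are assumptions for illustration. The sketch works on records that are assumed to come from a SAM query with the dimension shown in `QUERY_DIMENSION`, and it does not contact SAM itself.

```python
import os

# SAM dimension used to find files awaiting merging (from the text above).
QUERY_DIMENSION = "merge.merge 1 and merge.merged 0"


def select_candidates(files, exists=os.path.exists):
    """Split queried files into those to merge and those to mark as merged.

    `files` is a list of dicts with hypothetical keys 'name',
    'disk_location', and 'tape_location'. `exists` is injectable so the
    on-disk check can be replaced in tests.
    """
    to_merge = []
    mark_merged = []
    for f in files:
        if f.get("tape_location"):
            # Already on tape: merge.merged would be updated to 1.
            mark_merged.append(f["name"])
        elif f.get("disk_location") and exists(
            os.path.join(f["disk_location"], f["name"])
        ):
            # Has a SAM disk location and really exists there.
            to_merge.append(f["name"])
        # Files with no usable location are skipped.
    return to_merge, mark_merged
```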
If queried files meet all of the above criteria (i.e. have disk locations), they are grouped with similar files for merging. The actual merging is done by batch jobs. Batch jobs are not submitted until the merge candidates meet at least one of the following two criteria.
- Files to be merged exceed some minimum size.
- Files to be merged exceed some minimum age (usually three days).
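The two submission criteria can be sketched in Python as a single test. This is an illustration under stated assumptions: the function name `ready_to_submit`, the record keys `size` and `create_time`, the use of the group's total size, and the use of the oldest file to measure age are all hypothetical. The three-day default reflects the usual value mentioned above.

```python
import time

THREE_DAYS = 3 * 24 * 3600  # usual minimum age, in seconds


def ready_to_submit(group, min_size, min_age=THREE_DAYS, now=None):
    """Return True if a group of merge candidates should be submitted.

    `group` is a list of dicts with hypothetical keys 'size' (bytes) and
    'create_time' (seconds since the epoch).
    """
    if not group:
        return False
    now = time.time() if now is None else now

    # Assumption: size is the combined size of the group.
    total_size = sum(f["size"] for f in group)
    # Assumption: age is measured from the oldest file in the group.
    oldest = min(f["create_time"] for f in group)

    return total_size > min_size or (now - oldest) > min_age
```

The age criterion ensures that a small group of files is eventually merged even if it never reaches the minimum size.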
After a merging batch job runs, the results are checked. If merging was successful, and after merged files have tape locations, SAM parameter
merge.merged of the original unmerged files is set to 1, disk locations are removed from SAM, and the files are deleted from disk. Files generated by merging batch jobs have SAM parameter
merge.merge set to 0 (because they are not candidates for further merging), and are stored directly to tape.
Unmerged files that have been successfully merged remain in the SAM database as so-called virtual files (that is, files without any location). Merged files have as their parents both the unmerged virtual files from which they were made, and the parents of the unmerged files.
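The parentage rule can be illustrated with a small Python sketch. The function name `merged_file_parents` and the dict-based input are assumptions; the sketch only computes the list of parent names and does not declare anything to SAM.

```python
def merged_file_parents(unmerged_files):
    """Compute the parents of a merged file.

    `unmerged_files` maps each unmerged file name to the list of that
    file's own parents (a hypothetical layout). The result contains the
    unmerged files themselves plus their parents, without duplicates.
    """
    parents = []
    seen = set()
    for name, own_parents in unmerged_files.items():
        for parent in [name] + list(own_parents):
            if parent not in seen:
                seen.add(parent)
                parents.append(parent)
    return parents
```

Keeping the original parents on the merged file means that ancestry queries still work even though the unmerged files no longer have locations.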
How Files are Grouped for Merging¶
Files that are candidates for being merged together must have the same seven cardinal SAM metadata, and must have the same run number. That is, they must agree with respect to the following eight SAM dimensions.