Understanding storage volumes » History » Version 17

Kevin Retzke, 11/06/2017 04:25 PM
clarify retention for tape-backed areas

Understanding Storage Volumes

Quick (and incomplete) Overview of Storage Volumes

This is a quick reference guide to storage volumes at Fermilab. This doesn't cover all use cases or scenarios. Read below for a more complete discussion of each volume. Please note that "grid accessible" means you can copy to and from the area in question with ifdh cp; it does not imply direct read and write access on the worker node.

Overview of Storage Volumes at Fermilab

Volume | Quota/Space | Retention Policy | Tape Backed? | Retention Lifetime on disk | Use for | Path | Grid Accessible
Persistent dCache | No/~100 TB | Managed by Experiment | No | Till manually deleted | immutable files w/ long lifetime | /pnfs/<experiment>/persistent | Yes
Scratch dCache | No/no limit | LRU eviction - least recently used file deleted | No | Approx 30-60 days | immutable files w/ short lifetime | /pnfs/<experiment>/scratch | Yes
Tape backed dCache | No/O(400) TB | LRU eviction (from disk) | Yes | Greater than 200 days | Long-term archive | /pnfs/<experiment>/archive | Yes
BlueArc Data | Yes (~1 TB)/ ~100 TB total | Managed by Experiment | No | Till manually deleted | Storing final analysis samples | /<experiment>/data | No (Note 1)
BlueArc App | Yes (~100 GB)/ ~3 TB total | Managed by Experiment | No | Till manually deleted | Storing and compiling software | /<experiment>/app | No (Note 2)

Note 1: BlueArc Data areas are not directly accessible from grid worker nodes. Access to these volumes via ifdh is limited in both bandwidth and number of active connections such that moving the files via dCache will always be faster and more efficient. Just do it. Furthermore, access via ifdh cp from worker nodes will no longer be available in January 2018.

Note 2: There is no write access to BlueArc App areas from grid worker nodes. Direct read access to BlueArc App areas on worker nodes will cease in January 2018.

Discussion of all the storage elements you may encounter at Fermilab

There are three types of storage volumes that you will encounter at Fermilab: local hard drives, network attached storage, and distributed storage. Each has its own advantages and limitations, and knowing which one to use when isn't always straightforward or obvious. But with a small amount of foresight, you can avoid some of the common pitfalls that have caught out other users. To find out what types of volumes are available on a node, use the "df" command, which lists information about each volume (total size, available size, mount point, device location).

Local Volumes

Local hard drives are the storage volumes that are the most familiar to users. These volumes are mounted on the machine with direct links to the /dev/ location. An example from the MicroBooNE interactive node would be the /var volume:

<> df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda5       7.9G  964M  6.6G  13% /var

This volume is locally mounted (note the /dev/vda5 in the Filesystem column), has a total size of 7.9 GiB with 964 MiB used, and is reached by changing directory with "cd /var". Locally mounted volumes have several advantages: access speed to the volume is usually very high, the volume has full POSIX access, and they are the ONLY type of storage volume where you are permitted to keep authentication certificates and tickets. It is also good practice to keep proxies only on locally mounted volumes. These volumes are also commonly small and should not be used to store data files or for code development areas. Note that /tmp and /var/tmp are for temporary storage and files are removed after ~30 days if they haven't been accessed. It's a temporary directory - it's right there in the name. You should also know that local volumes are not accessible as an output location from grid worker nodes.
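Since files in /tmp and /var/tmp disappear once they have gone unread for long enough, it can help to see which of your files are getting close. The following is a minimal sketch, not an official tool: the function name stale_files and the 25-day default are assumptions chosen for illustration, and the result is only as good as the access times recorded by the filesystem.

```shell
# List regular files under a directory that have not been accessed for more
# than a given number of days. Defaults (/tmp, 25 days) are illustrative only.
stale_files() {
    dir=${1:-/tmp}
    days=${2:-25}
    # -atime +N selects files whose last access is more than N whole days old.
    # Errors from unreadable subdirectories are silenced.
    find "$dir" -type f -atime +"$days" 2>/dev/null
}

# Usage:
#   stale_files /tmp 25
#   stale_files /var/tmp 25
```

Anything the function lists and that you still need should be moved to a more suitable volume before the cleanup removes it.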

NAS Volumes

The next most common type of storage element that a user will utilize is network attached storage (NAS). In general, traditional NAS volumes are composed of RAIDs that have full POSIX capabilities. At Fermilab, almost all NAS volumes are mounted using NFSv4 protocols. This makes the volumes behave almost identically to a locally mounted volume, with the exception that access speeds are limited by the network bandwidth and latency. The most common examples for experiments are the /<exp>/data and /<exp>/app volumes. Again using MicroBooNE as an example, we can see the properties of the volumes using df:

<> df -h /uboone/app /uboone/data
Filesystem                         Size  Used Avail Use% Mounted on
                                   3.3T  2.9T  394G  88% /uboone/app
                                    52T   43T  9.7T  82% /uboone/data

You can identify a volume as NAS since there is a server name (e.g. for /uboone/app) in the Filesystem column. These volumes have full POSIX access from the nodes they are mounted on (not the case for dCache), but the two volumes serve different purposes. Note you should NOT store any certificates, keytabs, tickets, or proxies on these volumes since information passes across the network.
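The two rules of thumb above (a /dev/ entry means local, a server name means NAS) can be wrapped in a small helper. This is a rough sketch: the function names are made up for illustration, and it assumes NFS mounts appear in the Filesystem column in the usual "server:/export" form. Anything else is simply reported as "other".

```shell
# Classify the "Filesystem" field printed by df. Heuristic only.
classify_filesystem() {
    case "$1" in
        /dev/*) echo "local" ;;   # block device on this machine
        *:/*)   echo "nas"   ;;   # server:/export, as NFS mounts are shown
        *)      echo "other" ;;   # tmpfs, overlay, etc.
    esac
}

# Classify the volume that holds a given path.
volume_type() {
    # df -P prints one line per filesystem; line 2, field 1 is the Filesystem.
    fs=$(df -P "$1" | awk 'NR==2 {print $1}')
    classify_filesystem "$fs"
}

# Usage:
#   volume_type /var
#   volume_type /uboone/app
```

A check like this is handy before writing a proxy or ticket somewhere: if volume_type does not report "local" for the target directory, pick another location.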

Experiment app volumes are designed to be used primarily for code and script development, having slightly lower latency but also smaller total storage. The lower latency is good for writing lots of smaller files, such as when compiling analysis code. The /<exp>/app volume is also the location where experiment software coordinators should distribute their software packages, in the /<exp>/app/products UPS product area. Some of the experiment /<exp>/app volumes are mounted on GPGrid worker nodes, but none of these volumes are mounted on OSG worker nodes. These volumes are quickly being unmounted, so you should avoid any references to them in your workflows: do not expect a script which references /<exp>/app to succeed when submitted to run on the grid. (More on this later.) Note also that the /<exp>/app volumes are not writable from grid worker nodes, so you cannot use them as the output directory for grid jobs. The quota for each user on app volumes is determined by the experiment offline management.

The experiment data volumes (/<exp>/data) are designed to store ntuples and small datasets, but they have higher latency than the app volumes. Using data volumes can be necessary when you have files that need full POSIX access (read/write/modify), which isn't available through dCache (more on that later). None of the experiment data volumes are mounted on any worker nodes (GPGrid or OSG). And while transfer to the /<exp>/data volume is possible, users should know that the transfer is highly throttled, with a maximum of 5 transfers to any /<exp>/data volume possible at any given time. Transferring from or to a data volume can cause considerable inefficiency for grid jobs since every job must queue up waiting to get through just those 5 doors. The Fermilab dCache has much greater capacity for transfer doors and bandwidth. The quota for each user on data volumes is determined by the experiment offline management.
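Because neither /<exp>/app nor /<exp>/data can be relied on from a worker node, it is worth scanning a job script for such references before submitting it. The sketch below is one way to do that with grep. The function name find_nas_refs is an assumption, and the pattern is a heuristic that deliberately skips dCache paths such as /pnfs/<exp>/data.

```shell
# Print (with line numbers) the lines of a script that reference the
# /<exp>/app or /<exp>/data mounts. Returns non-zero when none are found.
find_nas_refs() {
    exp=$1      # experiment name, e.g. uboone
    script=$2   # path to the job script to scan
    # The leading character class requires that the path does not continue
    # an existing path, so /pnfs/<exp>/data is not reported.
    grep -nE "(^|[^A-Za-z0-9_./-])/${exp}/(app|data)(/|[^A-Za-z0-9_.-]|\$)" "$script"
}

# Usage (myjob.sh is a hypothetical script name):
#   find_nas_refs uboone myjob.sh
```

Any line it reports is a candidate for replacement with a dCache location or a CVMFS path.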

Users' home areas on the General Purpose Virtual Machines (GPVMs) are also NAS attached storage and so certificates, keytabs, tickets, and proxies should not be stored in users' home area.

Special BlueArc volumes

There are several BlueArc volumes that have been created for unique purposes. These volumes are /grid/fermiapp, /grid/app, and /grid/data. The /grid/fermiapp volume was created specifically for distribution of Scientific Computing Division (SCD) maintained software (e.g. /grid/fermiapp/products/common/ and /grid/fermiapp/products/larsoft/) through UPS product areas on the GPGrid worker nodes. There are also directories on this volume for use by experiments (e.g. /grid/fermiapp/products/uboone/), but these directories are going away since the combination of /<exp>/app and CVMFS distribution provides all the needed capability previously supplied by /grid/fermiapp/products/<exp>/. It is important to stress that users should NOT be using /grid/fermiapp, /grid/app, or /grid/data as a location for their worker node output. Those volumes are for the use of SCD to perform grid testing and their use by experiments is strongly discouraged. Experiments with /<exp>/app, /<exp>/data, and dCache space should never use any of these volumes, and users should have no expectation that data stored there is kept permanently. Those volumes may be modified and data removed with very limited notice.

dCache Volumes

dCache volumes are a special form of storage volume, utilized at Fermilab to provide more than 3 PiB of RAID-based storage and access to tape storage with a capacity of more than 50 PiB. As well as being a large storage element, dCache also has large bandwidth for transfers to and from grid worker nodes. It serves as the main storage element for both experiment production groups and analyzers. And while dCache has many capabilities, it also has some very specific limitations that must be understood in order to effectively utilize these volumes. One of the most important things to understand is that files written to dCache are immutable, meaning that once a file is written into a dCache volume it cannot be modified, only read, moved, or deleted.
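Immutability has a practical consequence for grid jobs: output should be written once, under a name that does not already exist. The following sketch builds such a name and hands the file to ifdh cp. The users/<user> subdirectory, the timestamp-plus-PID naming scheme, and both function names are assumptions for illustration and not a site convention stated on this page.

```shell
# Build a destination path in an experiment's scratch dCache area with a
# name that is unlikely to collide with an existing (immutable) file.
unique_scratch_path() {
    exp=$1    # experiment name, e.g. uboone
    user=$2   # user name
    file=$3   # local file to be copied
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    base=${file##*/}   # strip any leading directories
    # Assumed layout: /pnfs/<exp>/scratch/users/<user>/<UTC stamp>_<pid>_<name>
    echo "/pnfs/${exp}/scratch/users/${user}/${stamp}_$$_${base}"
}

# Copy a local file to scratch dCache and print where it went.
# Requires ifdh to be set up in the job environment.
copy_to_scratch() {
    dest=$(unique_scratch_path "$1" "$2" "$3") || return 1
    ifdh cp "$3" "$dest" && echo "$dest"
}

# Usage (inside a grid job):
#   copy_to_scratch uboone "$USER" out.root
```

If a file needs to change after it has been written, delete it and write a new copy rather than trying to edit it in place.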

There are three distinct types of dCache storage volumes that users should be aware of: persistent, scratch, and tape-backed volumes. The general properties of each type are listed in the table at the top of the page, and you should understand that each volume behaves and handles files differently.

The volume that will act the most similarly to a standard hard drive is the persistent volume (a.k.a. /pnfs/<experiment>/persistent/). This volume stores files persistently on disk and they are not copied to tape storage. This means that the data in the file is actively available for reads at any time. As well, the file will not be deleted or removed from the volume unless manually removed by someone on the experiment.

The next type of dCache volume is the scratch volume. This is a large storage volume (~ 1 PiB) that is shared across all experiments, with each experiment having an access point /pnfs/<experiment>/scratch/. This is an important aspect of the volume because it means that the actions of other experiments will affect files stored by your experiment. These files are stored on disk but they are not copied to tape. When a new file is written to scratch space, old files are removed in order to make room for the newer file: the least recently read files in the scratch volume are removed until there is space for the new file. This removal is permanent and is done without any notification to the owner of the file. The lifetime of files in the volume can be seen here:

Note again that you will not be notified that your file will be removed, so only use the scratch area as temporary storage of files, and know the expected lifetime of the files before they need to either be deleted or transferred to a separate volume.
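To make the eviction rule concrete, here is a toy model of least-recently-read removal. It is only an illustration of the policy described above, not how dCache implements it: the input format (last-read time in epoch seconds, size in bytes, path without whitespace) and the function name lru_evict are assumptions.

```shell
# Toy model of least-recently-read eviction.
# stdin:  one file per line as "<last_read_epoch> <size_bytes> <path>"
# $1:     number of bytes that must be freed for the incoming file
# stdout: paths that would be removed, oldest read first
lru_evict() {
    needed=$1
    # Sort by last-read time (oldest first), then remove files until
    # the space freed so far covers what is needed.
    sort -n -k1,1 | awk -v needed="$needed" '
        freed < needed { print $3; freed += $2 }
    '
}

# Usage:
#   printf '%s\n' '300 10 /c' '100 10 /a' '200 10 /b' | lru_evict 15
# prints /a and then /b, the two least recently read files.
```

Note that the owner of a file plays no role in the model: a file from one experiment can be pushed out by writes from any other, which is the behaviour to plan for when using scratch.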

The tape-backed volumes for an experiment are just that: they are disk-based storage areas that have their contents mirrored to permanent storage on Enstore tape. The names of these volumes depend on experiment configuration (e.g. /pnfs/uboone/data/, /pnfs/nova/archive/). While the files in this area will not be deleted from tape storage without explicit commands, they are NOT always available for immediate read from disk. A file that is only on tape must first be copied from Enstore onto the dCache disk that sits in front of the Enstore storage facility (this is why dCache contains the word "cache"). Once staged from tape, the data in the file is available for read. This staging is commonly referred to as "prestaging" data, since transferring the file from tape to disk (a.k.a. Enstore to dCache) should be done before the files are to be processed by grid jobs. The most efficient way to prestage files is to utilize the SAM command: "samweb prestage-dataset <dataset_name>". Users should always consider the size of the dataset that they are prestaging and not prestage large datasets (> 50 TB) without consulting their Offline Coordinators.
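The 50 TB guideline is easy to build into a wrapper so that a large prestage is never started by accident. The sketch below assumes you already know the dataset size in bytes (how you obtain it is up to you) and that a terabyte means 10^12 bytes. The function names are invented for illustration, and the samweb call uses the form quoted above.

```shell
# 50 TB expressed in bytes (decimal terabytes, 10^12 bytes each).
PRESTAGE_LIMIT_BYTES=50000000000000

# Succeeds (exit status 0) when the given size in bytes is within the limit.
within_prestage_limit() {
    [ "$1" -le "$PRESTAGE_LIMIT_BYTES" ]
}

# Prestage a dataset only if its size is within the limit.
# $1: dataset name, $2: dataset size in bytes (supplied by the caller)
prestage_if_small() {
    dataset=$1
    size_bytes=$2
    if within_prestage_limit "$size_bytes"; then
        samweb prestage-dataset "$dataset"
    else
        echo "Dataset $dataset is larger than 50 TB; consult your Offline Coordinators first." >&2
        return 1
    fi
}

# Usage (my_dataset is a hypothetical dataset name, size is 2 TB):
#   prestage_if_small my_dataset 2000000000000
```

The wrapper only refuses to start the prestage. It does not replace the conversation with the Offline Coordinators for datasets above the limit.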