
Understanding Storage Volumes

Quick (and incomplete) Overview of Storage Volumes

This is a quick reference guide to storage volumes at Fermilab. This doesn't cover all use cases or scenarios. Read below for a more complete discussion of each volume. Please note that "grid accessible" means you can copy to and from the area in question with ifdh cp; it does not imply direct read and write access on the worker node.

Overview of Storage Volumes at Fermilab

Volume             | Quota/Space                       | Retention Policy                                | Tape Backed? | Retention Lifetime on Disk        | Use For                                                                         | Path                            | Grid Accessible
Persistent dCache  | No quota / ~100 TB per experiment | Managed by experiment                           | No           | Until manually deleted            | Immutable files w/ long lifetime                                                | /pnfs/<experiment>/persistent   | Yes
Scratch dCache     | No quota / no limit               | LRU eviction (least recently used file deleted) | No           | Varies, ~30 days (NOT guaranteed) | Immutable files w/ short lifetime                                               | /pnfs/<experiment>/scratch      | Yes
Resilient dCache   | No quota / no limit               | Periodic eviction if file not accessed          | No           | Approx. 30 days                   | Input tarballs with custom code for grid jobs (do NOT use for grid job outputs) | /pnfs/<experiment>/resilient    | Yes
Tape-backed dCache | No quota / O(4) PB                | LRU eviction (from disk)                        | Yes          | Approx. 30 days                   | Long-term archive                                                               | /pnfs/<experiment>/rest_of_path | Yes
BlueArc Data       | Yes (~1 TB) / ~100 TB total       | Managed by experiment                           | No           | Until manually deleted            | Storing final analysis samples                                                  | /<experiment>/data              | No
BlueArc App        | Yes (~100 GB) / ~3 TB total       | Managed by experiment                           | No           | Until manually deleted            | Storing and compiling software                                                  | /<experiment>/app               | No

Discussion of all the storage elements you may encounter at Fermilab

There are three types of storage volumes that you will encounter at Fermilab: local hard drives, network attached storage, and distributed storage. Each has its own advantages and limitations, and knowing which one to use isn't always straightforward or obvious. But with a small amount of foresight, you can avoid some of the common pitfalls that have caught out other users. You can find out which volumes are available on a node with the "df" command, which lists information about each mounted volume (total size, available space, mount point, and device location).

Local Volumes

Local hard drives are the storage volumes most familiar to users. These volumes are mounted on the machine with direct links to the /dev/ location. An example from the MicroBooNE interactive node is the /var volume:

<uboonegpvm01.fnal.gov> df -h /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda5       7.9G  964M  6.6G  13% /var

This volume is locally mounted (note the /dev/vda5 in the Filesystem column), has a total size of 7.9 GiB with 964 MiB used, and is available by changing directories with "cd /var". Locally mounted volumes have several advantages: access speed is usually very high, the volume has full POSIX access (https://en.wikipedia.org/wiki/POSIX), and they are the ONLY type of storage volume where you are permitted to keep authentication certificates and tickets. It is also good practice to keep proxies only on locally mounted volumes. These volumes are also commonly small and should not be used to store data files or for code development areas. Note that /tmp and /var/tmp are for temporary storage and files there are removed after ~30 days if they haven't been accessed. It's a temporary directory - it's right there in the name. You should also know that local volumes are not accessible as an output location from grid worker nodes.
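As a quick sanity check (the locations shown in the comments are typical, not guaranteed), you can confirm that your Kerberos ticket cache and grid proxy live on a local volume such as /tmp:

klist | grep -i "cache"    # Kerberos ticket cache, typically FILE:/tmp/krb5cc_<uid>_...
voms-proxy-info -path      # location of your grid proxy, typically /tmp/x509up_u<uid> (if you use VOMS proxies)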

NAS Volumes

The next most common type of storage element that a user will utilize is network attached storage (NAS). In general, traditional NAS volumes are composed of RAID arrays (https://en.wikipedia.org/wiki/RAID) and have full POSIX capabilities. At Fermilab, almost all NAS volumes are mounted using NFSv4 protocols. This makes the volumes behave almost identically to locally mounted volumes, with the exception that access speeds are limited by network bandwidth and latency. The most common examples for experiments are the /<exp>/data and /<exp>/app volumes. Again using MicroBooNE as an example, we can see the properties of these volumes using df:

<uboonegpvm01.fnal.gov> df -h /uboone/app /uboone/data
Filesystem                         Size  Used Avail Use% Mounted on
if-nas-0.fnal.gov:/microboone/app  3.3T  2.9T  394G  88% /uboone/app
blue3.fnal.gov:/microboone/data     52T   43T  9.7T  82% /uboone/data

You can identify a volume as NAS because there is a server name (e.g. if-nas-0.fnal.gov for /uboone/app) in the Filesystem column. These volumes have full POSIX access from the nodes they are mounted on (not the case for dCache), but the two volumes serve different purposes. Note that you should NOT store any certificates, keytabs, tickets, or proxies on these volumes, since that information passes across the network.

Experiment app volumes are designed primarily for code and script development; they have slightly lower latency but also smaller total storage. The lower latency is good for writing many small files, such as when compiling analysis code. The /<exp>/app volume is also the location where experiment software coordinators should distribute their software packages, in the /<exp>/app/products UPS product area. The experiment /<exp>/app volumes are mounted on neither the GPGrid nor OSG worker nodes. The quota for each user on app volumes is determined by the experiment offline management.
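As a hedged illustration of how such a UPS product area is typically used (the product name, version, and qualifiers below are placeholders, and the exact setup procedure varies by experiment):

export PRODUCTS=/uboone/app/products:$PRODUCTS   # add the experiment product area to the UPS search path (assumes UPS itself is already set up)
ups list -aK+ mycode                             # list available versions of a placeholder product
setup mycode v1_0 -q e17:prof                    # set up a placeholder version with placeholder qualifiers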

The experiment data volumes (/<exp>/data) are designed to store ntuples and small datasets; they have higher latency than the app volumes. Data volumes are useful when you have files that need full POSIX access (read/write/modify), which is not available through dCache (more on that later). None of the experiment data volumes are mounted on any worker nodes (GPGrid or OSG). The quota for each user on data volumes is determined by the experiment offline management.

Users' home areas on the General Purpose Virtual Machines (GPVMs) are also NAS attached storage, so certificates, keytabs, tickets, and proxies should not be stored in users' home areas.

Special BlueArc volumes

There are several BlueArc volumes that have been created for unique purposes: /grid/fermiapp, /grid/app, and /grid/data. The /grid/fermiapp volume was created specifically for distribution of Scientific Computing Division (SCD) maintained software (e.g. /grid/fermiapp/products/common/ and /grid/fermiapp/products/larsoft/) through UPS product areas. There are also directories on this volume for use by experiments (e.g. /grid/fermiapp/products/uboone/), but these directories are going away, since CVMFS distribution provides all of the capability previously supplied by /grid/fermiapp/products/<exp>/.

dCache Volumes

dCache volumes are a special form of storage volume used at Fermilab to provide more than 3 PiB of RAID-based storage along with access to tape storage with a capacity of more than 50 PiB. In addition to being a large storage element, dCache also has large bandwidth for transfers to and from grid worker nodes, and it serves as the main storage element for both experiment production groups and analyzers. While dCache has many capabilities, it also has some very specific limitations that must be understood in order to use these volumes effectively. One of the most important is that files written to dCache are immutable: once a file is written into a dCache volume it cannot be modified, only read, moved, or deleted.
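For example (the paths below are placeholders), files are typically copied to and from dCache with ifdh, and because files are immutable, "updating" a file really means removing it and copying in a replacement:

ifdh cp results.root /pnfs/uboone/scratch/users/$USER/results.root          # copy a file into dCache
ifdh cp /pnfs/uboone/scratch/users/$USER/results.root ./results_copy.root   # copy it back out
ifdh rm /pnfs/uboone/scratch/users/$USER/results.root                       # cannot modify in place; remove first...
ifdh cp results_v2.root /pnfs/uboone/scratch/users/$USER/results.root       # ...then copy in the replacement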

There are four distinct types of storage volumes that users should be aware of: persistent, scratch, resilient, and tape-backed volumes. The general properties of each type are listed in the table at the top of the page and you should understand that each volume behaves and handles files differently.

Persistent dCache

The volume that acts most like a standard hard drive is the persistent volume (a.k.a. /pnfs/<experiment>/persistent/). This volume stores files persistently and they are not copied to tape storage. This means that the data in a file is actively available for reads at any time. In addition, a file will not be deleted or removed from the volume unless manually removed by someone on the experiment. However, there are quotas on these areas, and no new files can be written to them when they are full. The quota size varies between experiments; the value of 100 TB in the table is just a rough typical value.

Scratch dCache

The next type of dCache volume is the scratch volume. This is a large storage volume (currently about 1.5 PiB) that is shared across all experiments, with each experiment having an access point /pnfs/<experiment>/scratch/. This is an important aspect of the volume because it means that the actions of other experiments affect files stored by your experiment. These files are stored on disk but are not copied to tape. When a new file is written to scratch space, old files are removed to make room for it. This removal is done without any notification to the owner of the file and is permanent: the least recently read files in the scratch volume are removed until there is space for the new file. The lifetime of files in the volume can be seen here: http://fndca.fnal.gov/dcache/lifetime//PublicScratchPools.jpg Note again that you will not be notified when your files are removed, so only use the scratch area as temporary storage, and know the expected lifetime of your files so they can either be deleted or transferred to a separate volume before they expire.

While we quote some typical retention lifetimes in the table above, it is important to remember that a file's lifetime is actually driven by how full the physical pool where the file resides is, not by the file's path. It is therefore possible for different files in the same logical directory (/pnfs/<experiment>/scratch/users/<username>) to have different lifetimes, and there is no minimum lifetime for all of "your" files that can be 100% guaranteed at a given moment. It is best practice to move files from scratch to a more permanent location at the earliest possible time, as sketched below.
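A minimal sketch of that practice (the paths are placeholders):

# Copy a file you want to keep from scratch to persistent before it can be evicted.
ifdh cp /pnfs/uboone/scratch/users/$USER/good_ntuple.root \
        /pnfs/uboone/persistent/users/$USER/good_ntuple.root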

Resilient dCache

Users sometimes have to import custom code to their grid jobs, often in the form of a tarball. Staging these tarballs to dCache can be problematic if they are large and many jobs of the same type start at once. Having hundreds or thousands of jobs trying to copy the same file from a single dCache pool can quickly overwhelm it, rendering other files on the pool inaccessible. The resilient dCache area automatically replicates files across many pools, reducing the risk of a problem when many jobs start at once, so it is a good choice for these tarballs. In fact, the --tar_file_name jobsub option will stage tarballs here when copying to grid jobs (see the sketch below). These areas are not tape backed, and in order to keep overall space usage down (remember every file is copied 20x), files may be deleted if they are not accessed for some time, or after a fixed amount of time if they were used in conjunction with the --tar_file_name option. Users are advised to keep a backup copy of their tarball in another location so that it can quickly be restored if needed. Users are also asked to delete their files from the resilient areas once they are no longer needed. Grid job output should NEVER be written to the resilient areas; any such outputs are subject to deletion without warning.
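A hedged sketch of that workflow (the script, tarball, and experiment names are placeholders; consult jobsub_submit --help for the exact option syntax in your jobsub version):

tar czf my_code.tar.gz my_analysis/    # bundle your custom code
jobsub_submit -G uboone --tar_file_name dropbox://$PWD/my_code.tar.gz file://run_my_analysis.sh
# jobsub unpacks the tarball on the worker node; see the jobsub documentation for how its location is exposed to your script.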

Tape-backed dCache

The tape-backed volumes for an experiment are just that: disk-based storage areas that have their contents mirrored to permanent storage on Enstore tape. The paths of these volumes depend on the experiment configuration (e.g. /pnfs/uboone/data/, /pnfs/nova/archive/). While files in these areas will not be deleted from tape storage without explicit commands, they are NOT necessarily available for immediate read from disk. For a file to be available for immediate reading, it must first be copied from Enstore tape onto the dCache disk that sits in front of the Enstore storage facility (this is why dCache contains the word "cache"); once staged, the data in the file is available for read. This staging is commonly referred to as "prestaging" data, since transferring files from tape to disk (a.k.a. Enstore to dCache) should be done before the files are processed by grid jobs. The most efficient way to prestage files is to use the SAM command "samweb prestage-dataset <dataset_name>". Users should always consider the size of the dataset they are prestaging, and should not prestage large datasets (> 50 TB) without consulting their Offline Coordinators.
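A hedged sketch of prestaging and then checking whether a file is on disk (the dataset definition and file paths are placeholders, and option names may differ between samweb versions):

samweb prestage-dataset --defname=my_dataset_name   # ask SAM to stage every file in the dataset from tape to dCache disk
# On volumes that support the dCache "dot commands", you can check a single file's locality
# (ONLINE means on disk, NEARLINE means on tape only):
cat "/pnfs/uboone/data/path/to/.(get)(myfile.root)(locality)"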