Rapid Code Distribution Service (RCDS) via CVMFS using Jobsub

To skip straight to the usage instructions, see "How to use it" below.

Introduction

For the last couple of years, FIFE users have been staging custom code to worker nodes via tar files. At first, these were stored on scratch dCache pools and copied into jobs from there. However, dCache often became overwhelmed when many jobs tried to access the same tar file simultaneously. To remedy this, dCache and FIFE set up the resilient pools, where files are replicated 20x, and provided a jobsub interface to seamlessly upload tarballs at submission time and transfer them to jobs at runtime (see http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=6763). Although this solution has mitigated the earlier issues, it still presents a number of challenges and puts strain on the dCache infrastructure.

To address these remaining problems, a new solution was proposed: use CVMFS to publish tarballs, make the resulting repositories widely available, and handle cleanup, all within one service. This allows similarly rapid code distribution while taking advantage of the caching and wide availability inherent to CVMFS repositories across the OSG.

After this service was stood up, the jobsub code was extended to allow for native support for distributing code in this manner. This article will first give instructions on how to use the service, and then detail how it works behind the scenes.

How to use it

The short version

  • Use --use-cvmfs-dropbox on the jobsub_submit line
  • Find your tar contents (e.g. for an uploaded tarball called mytar.tar) at ${CONDOR_DIR_INPUT}/mytar in your job.

More details

The jobsub_client API is mostly unchanged. The dropbox/tardir/-f URIs behave as before (see "Tardir and dropbox URIs"). For now, using the RCDS requires one extra flag on the jobsub_submit command: --use-cvmfs-dropbox.

An example submit command looks like this:

jobsub_submit -G my_experiment <other args> --tar_file_name [dropbox|tardir]:///path/to/[mytarfile|dir] --use-cvmfs-dropbox

This will submit the job, upload the tarball to an RCDS CVMFS publish server, and submit a request to publish. Jobsub considers the job submission a success once the publish request has been submitted successfully. Upon publishing, the RCDS service untars the uploaded tarball and makes its contents available in one of the RCDS repositories (users don't need to know which, as jobsub handles storing that information). When a job starts to run, it waits for publication to finish if it has not already (waiting is usually not necessary), and then creates a softlink in the job's $CONDOR_DIR_INPUT to the CVMFS repository directory where the files reside.

For example, if a user uploads a tarball called "mytar.tar" by doing the following:

jobsub_submit -G nova --tar_file_name dropbox:///path/to/mytar.tar --use-cvmfs-dropbox

the contents of mytar.tar will be available in the job at ${CONDOR_DIR_INPUT}/mytar.
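
Inside the job script, the unpacked code can then be located from the tarball name. The following is a minimal sketch, not part of jobsub: the helper name tar_contents_dir is hypothetical, and it assumes the directory name is the tarball's base name with the ".tar" suffix removed, which is the only case the example above covers.

```shell
#!/bin/bash
# Hypothetical helper (not provided by jobsub): map an uploaded tarball name
# to the directory where its contents are expected to appear in the job.
# Assumption: the directory is named after the tarball with ".tar" stripped,
# as in the mytar.tar -> ${CONDOR_DIR_INPUT}/mytar example.
tar_contents_dir() {
    local name
    name=$(basename "$1" .tar)
    printf '%s/%s\n' "${CONDOR_DIR_INPUT}" "${name}"
}

# Example use in a job script (the script name run_analysis.sh is made up):
#   code_dir=$(tar_contents_dir mytar.tar)
#   "${code_dir}/run_analysis.sh"
```

Because the directory is a softlink into a read-only CVMFS repository, a job that needs to modify the files would have to copy them to its own scratch area first.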

Like the resilient dCache service, this is meant for user code, NOT production code. Similarly, make sure to only put code in these tarballs - not flux files, git files, etc.

How it works

Upon the execution of the jobsub_submit command, the jobsub client will calculate the hash of the tarball (or in the case of using the tardir URI, will create a tarball and then calculate its hash). It then checks with one of the RCDS publishing servers whether or not a tarball with that same hash has been published and is available in the RCDS repositories.
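
The hashing step can be sketched as follows. This is an illustration only: the article does not say which hash jobsub uses, so SHA-256 (via the coreutils sha256sum tool) is an assumption, and the function names tarball_hash and tardir_hash are hypothetical.

```shell
#!/bin/bash
# Assumption: SHA-256 stands in for whatever hash jobsub actually computes.
tarball_hash() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# For the tardir case: create a tarball from the directory first, then hash it.
tardir_hash() {
    local tmp_tar
    tmp_tar=$(mktemp)
    tar -cf "$tmp_tar" -C "$1" .
    tarball_hash "$tmp_tar"
    rm -f "$tmp_tar"
}

# Example:
#   tarball_hash /path/to/mytar.tar
#   tardir_hash /path/to/dir
```

Two tarballs with identical bytes produce the same hash, which is what lets the client skip the upload when that content has already been published.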

If the tarball is not present in the RCDS repos, the jobsub client will upload it to one of the publish servers. The server will unpack the tarball into one of the RCDS repositories (four as of this writing), located at /cvmfs/fifeuser[1-4].opensciencegrid.org. Once the tarball is unpacked, it should take less than a minute to publish.
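
Since the contents may land in any of the four repositories, locating them amounts to checking each one in turn. The sketch below is illustrative only: the function name find_rcds_dir and the CVMFS_ROOT override are hypothetical, and it assumes the hash directory sits at the top level of the repository, which this article does not confirm.

```shell
#!/bin/bash
# Hypothetical helper: print the first RCDS repository path that contains a
# directory named after the tarball hash.
# Assumptions: the hash directory is at the top level of the repository (the
# real layout may differ); CVMFS_ROOT is a made-up override used only so the
# sketch can be tried outside a grid job.
find_rcds_dir() {
    local hash=$1
    local root=${CVMFS_ROOT:-/cvmfs}
    local n candidate
    for n in 1 2 3 4; do
        candidate="${root}/fifeuser${n}.opensciencegrid.org/${hash}"
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example:
#   find_rcds_dir "<hash>" || echo "not published yet"
```

In normal use none of this is needed, because jobsub records the repository and creates the softlink itself.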

The CVMFS Stratum 1 checks these repositories twice per minute for updates, and the Stratum 1 squids check for updates after caching for one minute. The CVMFS clients across the OSG check for updates after caching for 15 seconds. The combined effect is that the turnaround from tarball upload to availability in jobs is much shorter than for a standard CVMFS-published file.
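
As a rough illustration of the resulting turnaround, the intervals quoted above can be added up. This is a back-of-the-envelope sketch, not a measured figure: it assumes the delays simply add in the worst case and ignores upload, unpack, and network transfer time.

```shell
#!/bin/bash
# All values are taken from the text; the serial worst-case model is an assumption.
publish_s=60        # publishing takes "less than a minute" after unpacking
stratum1_poll_s=30  # the Stratum 1 checks twice per minute
squid_cache_s=60    # Stratum 1 squids cache for one minute
client_cache_s=15   # CVMFS clients cache for 15 seconds

worst_case_s=$((publish_s + stratum1_poll_s + squid_cache_s + client_cache_s))
echo "Rough upper bound: ${worst_case_s} seconds"
```

Under these assumptions the bound comes to 165 seconds, i.e. under three minutes.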

In the meantime, jobsub will pass the hash to the job. When the job starts, the wrapper script that jobsub runs every job in will wait for the hash directory to appear before executing the payload (the user job).
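
The waiting step is essentially a polling loop. The following is a minimal sketch of such a loop, not the actual jobsub wrapper code: the function name wait_for_dir, the default timeout, and the default polling interval are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: wait_for_dir DIR [TIMEOUT_SECONDS] [POLL_SECONDS]
# Returns 0 once DIR exists, or 1 if it has not appeared within the timeout.
# The defaults (600 s timeout, 15 s poll) are illustrative, not jobsub's values.
wait_for_dir() {
    local dir=$1
    local timeout=${2:-600}
    local interval=${3:-15}
    local waited=0
    while [ ! -d "$dir" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${waited}s waiting for ${dir}" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example: run the payload only once the published directory is visible.
#   wait_for_dir "/path/to/published/dir" && ./payload.sh
```

A bounded wait like this keeps a job from hanging indefinitely if publication fails; how the real wrapper handles that case is not described here.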

All tarball hash directories are removed from the RCDS repositories 30 days after the last time they were uploaded or checked by either the jobsub client or a job.

Release schedule and updates

This feature is now released: jobsub_client v1.3, which includes it, has been made 'current' in ups.