IFDH Output staging via SRM

This document is currently an early DRAFT. It may contain horrible errors or omissions, and/or contain ideas which will be discarded as utter rubbish in the near future. You have been warned.


Currently, our IF jobs at Fermilab use our SAM-Web data handling system effectively for obtaining input files.
However, for output files, we are not so advanced, and the tools we are using (CPN locking, gridftp)
are not well suited to jobs running at remote sites, due to the potential for flooding gridftp servers.
This wiki page discusses some possible solutions.

The Current State

Our current design and implementation can be summarized in the following diagram:

This shows a job DAG on the left, with a leader job, N worker jobs, and a trailer job.
The jobs use our Project/getNextFile interfaces to request input files from SAM-Web (green
arrows in the diagram); those files are then generally transferred via gridftp (blue
arrows in the diagram) from a stager disk. Red arrows indicate output file transfers.

Input Files

As the figure above hints, the delivery of input files is controlled by SAM handing out getNextFile results,
and so file copies can be controlled by the data handling system. Also, the potential exists to run multiple
stager nodes to deliver input files to jobs effectively.
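
The pull model described above can be illustrated with a minimal sketch. It is not the SAM-Web client API: get_next_file and process are hypothetical callables standing in for the real getNextFile interface and the experiment's processing step, and the "None means no more files" convention is an assumption made for this example.

```python
def run_worker(get_next_file, process):
    """Pull input files one at a time until the project has none left.

    get_next_file: hypothetical callable returning the next file location,
                   or None when no more files will be handed out (assumed).
    process:       hypothetical callable that handles one input file.
    Returns the number of files this worker processed.
    """
    handled = 0
    while True:
        location = get_next_file()
        if location is None:
            # The data handling system has nothing more for this worker.
            break
        process(location)
        handled += 1
    return handled
```

Because each worker only receives a file when it asks for one, the data handling system decides how many input copies are in flight at any moment.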

Output Files

Output files are staged back by each worker job at the end of processing an input file
or files. This has the potential to flood our BlueArc fileservers when there are a large
number of jobs (a few hundred) running in tandem (red arrows in our diagram). Currently
this overload is avoided by using a locking system called CPN which causes nodes to block
until they get a "copy slot". Waiting for a copy slot can cause a lot of idle CPU in worker
job slots, as they wait for a chance to run a copy.
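
To get a feel for how much CPU sits idle, here is a small model of the copy-slot idea. It is only a sketch and not the CPN implementation: it assumes a fixed number of slots, the same copy duration for every job, and first-come-first-served ordering. The function name and parameters are invented for this example.

```python
import heapq


def simulate_copy_slots(ready_times, copy_seconds, slots):
    """Return the total seconds jobs spend waiting for a copy slot.

    ready_times:  times (seconds) at which each job is ready to copy.
    copy_seconds: duration of one copy, assumed equal for all jobs.
    slots:        number of copies allowed to run at once.
    """
    # Min-heap holding the time at which each slot next becomes free.
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    total_wait = 0.0
    for ready in sorted(ready_times):
        slot_free = heapq.heappop(free_at)
        start = max(ready, slot_free)  # wait if no slot is free yet
        total_wait += start - ready
        heapq.heappush(free_at, start + copy_seconds)
    return total_wait
```

For example, simulate_copy_slots([0, 0, 0, 0], 10, 2) gives 20.0: the first two jobs copy immediately and the other two each wait 10 seconds. In this model a longer copy time or fewer slots directly increases the waiting total.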

Production files are delivered to a "dropbox" directory, where one of our FTS servers watches for
newly delivered files, extracts the appropriate metadata, and files them away in SAM, generally
in our "enstore" mass storage system.

Output via Gridftp

However, the grid jobs run as a particular user account (e.g. "minervaana") which is not
the account of the user who submitted the job, which makes accounting for disk quota etc. problematic.
Hence many such jobs use a per-experiment gridftp server that maps file ownership more appropriately.

Here once again, we use CPN locking to keep from running too many gridftp streams at once.

Moving off-site

However, to move this configuration off site, we have issues:

  • the NFS access for our CPN locking is not available
  • transfer rates from worker jobs directly to our per-experiment gridftp servers are lower,
    so jobs would need to wait a longer time to get a copy slot. This translates to slots with
    lots of idle time waiting for a chance to copy.

There are possible solutions.

Stage via Site OSG_DEFAULT_SE

A possible solution is to use local SRM Storage Elements at the various OSG sites. We can have
each job stage its data there, and have some entity (in this diagram, the trailer job) come
along and copy all the data back to the spool directory at Fermilab.

This takes advantage of the SRM's throttling capabilities to prevent overload, and lets just
one job slot be tied up for the I/O transfers back to Fermilab.
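
The two-step flow can be sketched as a transfer plan. This is an illustration only: the function name, the per-DAG subdirectory layout on the storage element, and the example URLs are assumptions, and how a site's OSG_DEFAULT_SE setting maps to a full SRM base URL is site specific and not shown.

```python
import posixpath


def plan_staging(output_files, se_base_url, dag_id, spool_dir):
    """Plan the copies for staging output via a site-local storage element.

    output_files: local paths of files produced by the worker jobs.
    se_base_url:  base URL of the site-local SE (assumed to be known).
    dag_id:       hypothetical identifier keeping one DAG's files together.
    spool_dir:    final spool directory at the home site.
    Returns (worker_copies, trailer_copies), each a list of
    (source, destination) pairs.
    """
    worker_copies = []
    trailer_copies = []
    for path in output_files:
        name = posixpath.basename(path)
        staged = f"{se_base_url.rstrip('/')}/{dag_id}/{name}"
        # Step 1: each worker copies its output to the nearby SE.
        worker_copies.append((path, staged))
        # Step 2: one trailer job later moves everything back home.
        trailer_copies.append((staged, posixpath.join(spool_dir, name)))
    return worker_copies, trailer_copies
```

The point of separating the two lists is that only the trailer's copies cross the wide-area link, so only that one job slot waits on the slow transfers.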

Other variations

The other variations have to do with who copies what, and when. For example, if the
output files were declared to SAM when they were copied to the site-local SRM, then
SAM tools could be configured to migrate the files back to enstore here at Fermilab.
A disadvantage of this solution is that some output files that need to be transferred
back are currently not declared to SAM; this approach would require them to be
declared. It is not clear if this is a bug or a feature.

Another possibility is the one currently being implemented at SMU: files are transferred to
the local SRM, but a SAM FTS server is run on a system which sees that storage as local, and
can treat it as the dropbox directly.

And of course, we could have hybrids of these approaches, or some other variation which we
may soon learn of from the CMS folks.