Overview

Goal

The goal of the "End user SAM" tools is to make SAM data handling more useful to individual analysis users,
as opposed to production coordinators and the like. In particular, we want to make it easier for users to take
files they have generated, turn them into datasets, analyze and manage those datasets, and possibly discard
them when they are done.

Assumptions

We assume here that analysis users are frequently:

  1. developing analysis tools/scripts, etc., and
  2. running them repeatedly, often over the same input data, then
  3. trying to do further analysis on their generated output, and finally
  4. either discarding their generated data (and going back to step 2) or
  5. wanting to keep their generated data more permanently.

Implications

This workflow has interesting implications:

  • further analysis of generated output (item 3) would be helped by a way for end users to declare their own
    datasets, and then use that dataset as input to one or more jobs to test whether their generated data is up to snuff.
  • these generated files, since they are more often than not discarded, would be wasteful to write to tape
    before they are vetted; however, if they are kept in scratch, they might get hit by LRU cache replacement if
    there is a lull in the analysis cycle -- so a way to manage the lifetime of the files is important.
  • discarding that generated data then reduces to discarding/retiring all the files in a dataset, and then deleting
    the dataset itself.
  • keeping files more permanently means copying them to tape-backed storage, and perhaps making them available to
    other experimenters, etc.
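
To make the first point concrete: in SAM, "declaring a dataset" amounts to creating a named definition, i.e. a
saved dimensions query. A minimal sketch using the samweb command-line client follows; the dataset name and
the dimensions query are made up for illustration, and $EXPERIMENT stands in for your experiment name.

    # Illustrative only: the definition name and dimensions query are made up.
    # Creating a definition saves the query under a name that jobs can use as input.
    samweb -e $EXPERIMENT create-definition ${USER}_myskim_v1 \
        "user $USER and file_name ${USER}_skim%.root"
    # Sanity-check what the new definition matches before feeding it to jobs:
    samweb -e $EXPERIMENT count-files "defname: ${USER}_myskim_v1"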

Supporting Tools

These considerations have led to the choice to develop a set of tools to assist in this process (an end-to-end
sketch follows the list):

  • sam_add_dataset -- a script to let users create datasets from files they have stashed, say, in dCache scratch.
    Once the add is complete, they can immediately begin using that new dataset as input to SAM-based jobs.
    This involves
    • declaring metadata for the individual files,
    • declaring locations for those files where they currently live, and
    • making a dataset that includes all those files.
  • sam_pin_dataset -- a script to manage file lifetimes by pinning files in dCache for a limited time, so that the
    user has a chance to vet the files before they expire from scratch.
  • sam_clone_dataset -- a script to copy a dataset's files to an alternate location, and declare the added location in SAM.
  • sam_validate_dataset -- because perhaps it took you longer to get back to looking at those files than you meant to,
    and now that they're no longer pinned, you want to see how much of the dataset is still left in cache.
  • sam_unclone_dataset -- a way to undo sam_clone_dataset, or to clean up the scratch copy once the tape-backed copy is made.
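
Put together, a typical cycle through these tools might look like the sketch below. The dataset name, paths, and
flag spellings are illustrative assumptions rather than verified usage (options vary between fife_utils versions;
consult each script's --help), but the order of operations is the workflow described above.

    # NOTE: flag spellings below are assumed, not verified -- check each tool's --help.
    # 1. Turn files already sitting in dCache scratch into a SAM dataset:
    sam_add_dataset -e $EXPERIMENT -n ${USER}_mytest_v1 \
        -d /pnfs/$EXPERIMENT/scratch/users/$USER/mytest
    # 2. Pin the files so LRU cache replacement cannot evict them while you vet them:
    sam_pin_dataset -e $EXPERIMENT -n ${USER}_mytest_v1
    # 3. Later, once the pins have expired, see how much is still in cache:
    sam_validate_dataset -e $EXPERIMENT -n ${USER}_mytest_v1
    # 4. Decided to keep it: copy to tape-backed storage (destination path is a
    #    made-up example) and declare the new location...
    sam_clone_dataset -e $EXPERIMENT -n ${USER}_mytest_v1 \
        -d /pnfs/$EXPERIMENT/archive/users/$USER/mytest
    # 5. ...then retire the scratch copy and its location record:
    sam_unclone_dataset -e $EXPERIMENT -n ${USER}_mytest_v1 \
        -d /pnfs/$EXPERIMENT/scratch/users/$USER/mytest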

Each of these tools has uses beyond this workflow (e.g., sam_clone_dataset could be used to pre-place files at a
remote institution before running production jobs there, as sketched below), but this is where their need was most clear.
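
For example, such pre-placement might look something like this, with the destination URL hypothetical and the
flag spellings assumed as in the sketch above:

    # Hypothetical remote destination; flags are assumptions, not verified usage.
    sam_clone_dataset -e $EXPERIMENT -n ${USER}_mytest_v1 \
        -d gsiftp://se.remote-site.example.edu/dcache/users/$USER/mytest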

These tools are also available for use from Python; see Using_fife_utils_in_your_python_scripts.