User Datasets (SAM for Users)

Neat Tricks you can do with SAM for Users.

In addition to this documentation, please see the pnfs tutorial from July 2015: DocDB 13747

Setup

Set up novasoft as usual. This gives you access to the following commands (see note 1 under Notes):

sam_add_dataset              (makes a new dataset)
sam_retire_dataset           (retires a dataset)
sam_validate_dataset         (checks that all the files are present, and reports any that are not)
sam_clone_dataset            (makes a replica of a dataset in a different location, i.e. copies it)
sam_unclone_dataset          (removes the replicas of a dataset in a specific location, i.e. cleans up a copy)
sam_modify_dataset_metadata  (applies or modifies the metadata associated with a dataset)

Each of these tools is designed to work with a complete dataset which has been defined by the user.

Quick Start

If you are already familiar with SAM then:

  • Make sure you have an X509 certificate
    (kinit followed by kx509)
  • Make some files (or find them)
    They should be on a supported storage area (e.g. bluearc or dCache. See "Find/Register your locations" below.)
  • Define a Dataset

From a directory with your files (Alt: use -f <textfile> to pass in a text file with your file locations listed)

sam_add_dataset -d <path to directory> -n <name of dataset>

This registers all the files in the specified directory (or in the file list, if you passed one). It creates a dataset named "<name of dataset>" and tags each file (in the Dataset.Tag field) with that name. Each file will have its location set properly.

You can now use this dataset to run standard SAM analysis projects.

  • Delete a Dataset

When you are done with a dataset, you can delete it. This will unregister the files and delete the dataset definition. If you want to keep the files around then also use the "--keep_files" option (otherwise the files are deleted from the disk too). See details below.

sam_retire_dataset -n <name of dataset> [--keep_files]

  • Copy a Dataset

Use this if you want to copy the files to some other location (e.g. bluearc to dCache scratch, or dCache scratch to tape). SAM replica information will be updated automatically. Make sure that the destination path is group writable and that it contains no double slash (//).

sam_clone_dataset -n <name of dataset> -d <destination path>

  • Remove a Dataset

Use this if you want to remove the files that have been copied to some location as described above (e.g. bluearc to dCache scratch, or dCache scratch to tape). SAM replica information will be updated automatically. This is very useful if you want to remove the files from a local disk such as /nova/ana/ after the copy to dCache is done.

sam_unclone_dataset -n <name of dataset> -d <destination path>
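
The two destination-path requirements for the copy step (group writable, no double slash) can be checked before calling sam_clone_dataset. The following is only a minimal sketch: check_clone_destination is a hypothetical helper, not part of the SAM tools, and it assumes the destination is a mounted directory whose permission bits find can read.

```shell
# Hypothetical helper (not part of the SAM tools): sanity-check a destination
# directory before passing it to "sam_clone_dataset -d".
check_clone_destination() {
    dest="$1"

    # Reject any path that contains a double slash.
    case "$dest" in
        *//*) echo "error: double slash in path: $dest" >&2; return 1 ;;
    esac

    # The destination must be an existing directory.
    if [ ! -d "$dest" ]; then
        echo "error: not a directory: $dest" >&2
        return 1
    fi

    # find prints the directory only if its group-write bit is set.
    if [ -z "$(find "$dest" -maxdepth 0 -perm -g+w)" ]; then
        echo "error: not group writable: $dest" >&2
        return 1
    fi

    echo "ok: $dest"
}
```

For example, running check_clone_destination /pnfs/nova/scratch/users/<myusername>/mycopy before the clone reports which of the requirements, if any, is not met.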

Complete documentation

Below are the instructions for completing specific tasks, and the details of the procedures that should be used. In general, all the utilities have command-line help, which can be accessed through the --help flag, e.g.:

> sam_retire_dataset --help
Usage: sam_retire_dataset [options] dataset [dataset ...]
 delete files, undeclare locations, and delete dataset

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -k, --keep_files      do not delete actual files, just retire them from SAM
  -m DELETE_MATCH, --delete_match=DELETE_MATCH
                        delete only files matching this regexp
  -e EXPERIMENT, --experiment=EXPERIMENT
  -n NAME, --name=NAME  dataset name to retire

For all these commands you can set the SAM_EXPERIMENT environment variable (e.g. export SAM_EXPERIMENT=nova) or use the -e EXPERIMENT option on the commands.

Making and Using a dataset

Follow steps 1-3 to get a working dataset using the SAM for users tools.

Step 1 -- Define a dataset

For analysis, a common need is to group a set of files (e.g. the output files of some stage of analysis) together as a single entity, so that the SAM framework can be used to do things with them (e.g. to run more analysis on them). These files can be the output of art analysis jobs, histogram files, ntuples, log files, or photos of toy poodles. The file content and format do not matter. All that matters is that they are files with a non-zero size.

To define a dataset out of files that SAM does not know about yet, you use the "sam_add_dataset" tool. (To define one out of files that SAM does already know about, see directions on "create-definition" in Basic SAM Functions.)

There are two general modes of operation of this tool.

  1. Declaration of a list of files. You pass the program a text file (with -f) that contains the full path to each file that you want to be part of the dataset.
  2. Declaration of all files in a directory. You pass the program the path to a directory (with -d) that contains some files. All files in this directory are added to the dataset.
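
For the file-list mode, the text file can be produced with standard tools. The following is a sketch under stated assumptions: make_file_list is a hypothetical helper (not part of the SAM tools), it looks only at files directly inside the given directory, and it skips zero-length files because a dataset file must have a non-zero size.

```shell
# Hypothetical helper: write the full paths of all non-empty regular files
# directly inside a directory to a text file, one path per line.
make_file_list() {
    src_dir="$1"
    list_file="$2"

    # Resolve the directory to an absolute path so every entry is a full path.
    abs_dir=$(cd "$src_dir" && pwd) || return 1

    # -maxdepth 1: do not descend into subdirectories
    # -type f:     regular files only
    # -size +0c:   skip zero-length files
    find "$abs_dir" -maxdepth 1 -type f -size +0c | sort > "$list_file"
}

# Possible usage (write the list outside the directory being scanned):
#   make_file_list ./my_output /tmp/my_files.txt
#   sam_add_dataset -f /tmp/my_files.txt -n <name of dataset>
```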

The sam_add_dataset command requires that you also specify a number of options. The important options are:

-n NAME or --name=NAME   The dataset name.   Default value is <userdataset+user+date>

This is the name of your dataset. This can be any string, but choose wisely since it is what you will use to refer to your collection of files.

Examples of GOOD dataset names:
  • "norman_analysisskim_nue_custom_pid_v4"
  • "andrew_awesome_ntuples_23NOV2015"
  • "exoticsgroup_monopole_skim_prelim_DEC2015"

The common features of all of these are that they A) have some ownership identifier (norman, andrew, exoticsgroup), B) describe what they are (analysisskim, awesome ntuples, monopole skim), and C) have extra info to distinguish them from similar datasets (custom pid v4, a date, prelim + date).

Examples of BAD dataset names:
  • "data", "mydata", "stuff" (all of these are nondescript, non-unique, etc.)
  • "nue_data", "official_nue_dataset", "zomg_use_this_nue_data" (confusing, and could collide with "official" datasets provided to the collaboration)
  • Anything that is not unique or descriptive
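
One way to keep names consistent with the owner / description / distinguishing-info pattern is to build them from those three parts. This is only an illustrative sketch; make_dataset_name is a hypothetical helper, and the underscore-joined layout is a convention taken from the good examples, not something the SAM tools require.

```shell
# Hypothetical helper: compose a dataset name from an owner, a description
# and a distinguishing tag (version, date, ...).
make_dataset_name() {
    owner="$1"
    description="$2"
    tag="$3"

    if [ -z "$owner" ] || [ -z "$description" ] || [ -z "$tag" ]; then
        echo "usage: make_dataset_name <owner> <description> <tag>" >&2
        return 1
    fi

    # Join the parts with underscores and turn any spaces into underscores.
    printf '%s_%s_%s\n' "$owner" "$description" "$tag" | tr ' ' '_'
}

# make_dataset_name norman "analysisskim nue" custom_pid_v4
# prints: norman_analysisskim_nue_custom_pid_v4
```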

Step 2 -- Find/Register your locations

Not all storage is created equal!

An important component to working with your data is knowing where it is located. SAM is "aware" of a number of different major storage systems and can interact with them transparently. HOWEVER, if SAM doesn't know about a storage location (like your laptop's hard drive or some random computer at a home university) then it can't help.

Currently, for NOvA, SAM knows about the following locations that normal users can access:

Location                     Name                 Description
dcache:/pnfs/nova/scratch    dCache Scratch       The dCache scratch system (preferred)
enstore:/pnfs/nova/          Enstore/dCache Tape  Tape-backed parts of the dCache/Enstore system

N.B. You cannot access file locations on any BlueArc disk from a worker node (e.g. the ana / app / prod disks).

So:

  • Figure out where your files are
  • If your files are NOT in one of these areas, then move them there (e.g. copy them to /pnfs/nova/scratch/users/)
    • Note: moving files to /pnfs/nova/scratch/users is NOT as simple as using cp!!!
    • To move to /pnfs/nova/scratch/users use "ifdh cp <myfile> /pnfs/nova/scratch/users/<myusername>/"
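
For more than a handful of files, the ifdh cp step can be wrapped in a loop. The following is a sketch, not an official tool: copy_to_scratch and the DRY_RUN variable are hypothetical, and each file is copied with a separate ifdh cp <source> <destination> call using an explicit destination filename. With DRY_RUN=1 the commands are only printed.

```shell
# Hypothetical helper: copy every regular file in a directory to a
# destination directory with "ifdh cp", one file at a time.
copy_to_scratch() {
    src_dir="$1"
    dest_dir="${2%/}"    # drop one trailing slash so the join adds no "//"

    for f in "$src_dir"/*; do
        # Skip subdirectories (and the unexpanded pattern if nothing matches).
        [ -f "$f" ] || continue
        target="$dest_dir/$(basename "$f")"

        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be run.
            echo "ifdh cp $f $target"
        else
            ifdh cp "$f" "$target" || return 1
        fi
    done
}

# Possible usage:
#   DRY_RUN=1 copy_to_scratch ./my_output /pnfs/nova/scratch/users/<myusername>/
```

Running with DRY_RUN=1 first makes it easy to confirm the source and destination paths before anything is transferred.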

Once your files are in a supported location you can register them easily:

# If you are in the directory with your files
sam_add_dataset -d . -n <myAwesomeDatasetName> 

If all goes well then you will be able to use SAM to work with your files.

Example:

# List all your files
samweb list-files "defname: <myAwesomeDatasetName>"

# List the locations of your files
samweb locate-file <filename>

Step 3 -- Run on your data

At this point you have a completely valid SAM dataset with files in locations that can be delivered to your offline jobs.

Follow the instructions for setting up and running a job against a standard SAM dataset (found here)

Notes

To avoid name collisions (your file conflicting with someone else's file) the files in your dataset are renamed automatically by the sam_add_dataset tool. The new filename will have the form:

<unique prefix>-<original filename>

The prefix is a UUID (a special number that is unique), but don't worry, you'll never need to type it. SAM will handle that for you.
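
If you ever need the original filename back (for example, to compare against a local listing), the prefix can be stripped with sed. This is a sketch that assumes the prefix is a standard 36-character UUID followed by a hyphen, as in the example file name shown under "Special Metadata"; strip_uuid_prefix is a hypothetical helper, not part of the SAM tools.

```shell
# Hypothetical helper: recover the original filename from a renamed one.
# Assumes the prefix is a standard 8-4-4-4-12 hex UUID followed by "-".
strip_uuid_prefix() {
    hex='[0-9a-fA-F]'
    printf '%s\n' "$1" |
        sed -E "s/^${hex}{8}-${hex}{4}-${hex}{4}-${hex}{4}-${hex}{12}-//"
}

# strip_uuid_prefix 1164c6e4-17bb-4139-898c-82d25f3a6b53-fardet_r00013114_s16_t00.raw
# prints: fardet_r00013114_s16_t00.raw
```

Names that do not start with a UUID are printed unchanged.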

Additional Tools

The following tools are also available to help you work with datasets that you have defined. These tools all work on the dataset as a whole. You can also use any of the standard samweb tools to work on individual files or SAM catalog entries and searches.

Dataset Validation

Depending on where your files are stored, it may be desirable to "verify" that all the files you think are in your dataset are actually available.

To do this:

sam_validate_dataset --name=<dataset>

This will check the registered locations of the individual files to see if the files are actually there. You will get a report of which files are missing.

Note: This utility is really only needed if you are using the "scratch" dCache area and you have NOT used or touched your files for a long time (meaning > 30 days). In this case validating your dataset can tell you if your files have been purged from that cache area.

Missing Files -- What to do

If the validate utility finds missing files there are basically three things you can do:

  • Replace the files

If you are able to locate copies of the files somewhere else, or are able to regenerate them, then you can put them where they should be.

  • Prune the dataset

In this mode you remove from the dataset any files which have disappeared. The resulting dataset is then smaller (a subset of the original) but doesn't have any missing files, which makes it easy to run over.

To prune you use:

sam_validate_dataset --prune --name <dataset>

  • Nothing

The files are just missing and you don't care. You'll get errors when you run jobs that try to grab the missing files.

Removing/Deleting Datasets

There will come a time when you are done with your dataset (and its data) and you'll want to remove it. To do this you do the following:

sam_retire_dataset --name=<dataset>

BUT... there are a number of variants of this, depending on what is actually deleted and what is retained.

The general options are:
  • Delete everything
  • Delete the SAM dataset definition, KEEP the corresponding files
  • Delete a subset of the files, KEEP the SAM file and dataset entries

Each of these is detailed below.

Delete everything

In this mode the utility completely cleans up all files that are associated with the dataset both in SAM and on disk. It also removes the dataset definition in SAM.

The way to invoke this is:

#
# Will not actually do anything, just report what would be done
sam_retire_dataset --just_say --name=<dataset>
#
# Deletes everything
sam_retire_dataset --name=<dataset>

Delete the definition, Keep the files

In this mode you keep the files but remove the dataset definition.

#
# Does not delete the files (only the SAM entries)
sam_retire_dataset --keep_files --name=<dataset>

Delete a subset of the files

You may want to prune down your dataset (remove files from it). To do this you can use a regular expression that will be matched against the name of the files:

sam_retire_dataset --delete_match=<regex> --name=<dataset>
#
# Example: Remove files that end in ".log" from your set
sam_retire_dataset --delete_match='\.log$' --name=andrews_awesome_data
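
Before deleting anything, it can help to preview which filenames a regular expression selects; note that an unescaped dot matches any character, so a pattern meant to match a ".log" extension should escape it. The following is only an approximation: preview_delete_match is a hypothetical helper that uses grep -E, and the regex dialect and matching rules of sam_retire_dataset may differ, so the tool's own --just_say option remains the authoritative dry run.

```shell
# Hypothetical helper: read filenames on stdin and print those that match
# the given extended regular expression.
preview_delete_match() {
    # grep exits with status 1 when nothing matches; that is not an error here.
    grep -E -- "$1" || true
}

# Possible usage with the dataset's file list:
#   samweb list-files "defname: <dataset>" | preview_delete_match '\.log$'
#
# Why the dot should be escaped:
#   printf '%s\n' job1.log hist.root catalog | preview_delete_match '.log$'
#   prints job1.log and catalog, because "." also matches the "a" in "catalog"
```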

Notes

1. Requires Python version 2.7 or later. Using an older version will give errors.

Special Metadata

When you use these tools you can specify your own metadata for your files. There are, however, a number of fields that are automatically filled in for you and that constitute the minimum amount of metadata needed to find the file.

These are:

     File Name: 1164c6e4-17bb-4139-898c-82d25f3a6b53-fardet_r00013114_s16_t00.raw
       File Id: 97347520
   Create Date: 2015-03-05T18:30:46+00:00
          User: anorman
     File Type: unknown
   File Format: unknown
     File Size: 7206088
      Checksum: (none)
Content Status: good
   Dataset.Tag: NORMAN_RUN13114_TEST

The most important one of these is the Dataset.Tag field. This field has the same name as your dataset.

For example, the above file was declared using:

sam_add_dataset -n NORMAN_RUN13114_TEST -d .

where the file was in the current directory.
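
To find out which dataset a given file belongs to, the Dataset.Tag line can be pulled out of a metadata listing in the "Field: value" layout shown above. This is a sketch under stated assumptions: dataset_tag_from_metadata is a hypothetical helper, and it assumes the listing comes from samweb get-metadata <filename> in that same text layout.

```shell
# Hypothetical helper: read a "Field: value" metadata listing on stdin and
# print the value of the Dataset.Tag field.
dataset_tag_from_metadata() {
    # Strip the (optionally indented) field label and print what remains.
    sed -n 's/^ *Dataset\.Tag: *//p'
}

# Possible usage:
#   samweb get-metadata <filename> | dataset_tag_from_metadata
```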

Troubleshooting

Common Errors:

Your certificate is expired (or doesn't exist)

Error Example:

[anorman@novagpvm01 test4]$ sam_add_dataset -n NORMAN_RUN13114_TEST -d .
oops: SSL error: [Errno 1] _ssl.c:510: error:14094415:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate expired

Solution:

kx509

You are using Python 2.7.9

Error Example:

[anorman@novagpvm01 test4]$ sam_add_dataset -n NORMAN_RUN13114_TEST -d .
oops: SSL error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)

Temporary workaround:

export SSL_CERT_DIR=/etc/grid-security/certificates