UserGuide

Introduction

This guide describes a set of scripts designed to help regular experimenters
(rather than analysis coordinators or other specialists) use SAM
effectively in their analysis and other work. We will provide
an overview of the terminology, and provide several examples of how
to use the tools to get analysis work done.

SAM Terminology

Dataset: a dataset is a group of files that has a name defined in the SAM database.
It is generally defined by a search query using metadata, and so the list of files
associated with the name can change over time; a snapshot of the dataset can be taken at any
given time, and used as a file list for analysis, etc.

File: a single bundle of bits stored on a disk or tape somewhere, which might be
part of a dataset, and which SAM might have metadata about, or information
about what is in the file.

Project: an entity that can deliver the files in a dataset to one or more *consumer*s

Consumer: a program that gets files from a Project and (usually) does something with them.

Metadata: information about a file, like its size, checksum, the date it was generated,
other parent files whose data went into making it, what software was used to generate it, etc.

Snapshot: the specific list of files a given Dataset referred to at some specific time.

Why

Very often, your experiment stores data in SAM, the files of interest for analysis are available
as datasets, and there are stock scripts for doing production work on those datasets. You would like to
use the same tools for your personal, intermediate files as you did for the "official" experiment
data files:

You may want to run some other code over those intermediate files, etc., using those same stock tools.
It may then turn out that that batch of intermediate files has a problem, and you want
to scrap them, make a new batch, and work on those new files, etc.

Or you may want to generate a small batch of unusual MonteCarlo data, using the standard tools, but your
own config files.

MonteCarlo Example

Say you want to take a given MonteCarlo batch and re-run it with a different value for some parameter. You
can fetch the .fcl files used, change them, make a dataset of your changed files, and run the MonteCarlo
using that new dataset:

   # make a new directory in a data area
   cd /nova/data/users/myusername
   mkdir myfclfiles
   cd myfclfiles
   # grab an existing batch of .fcl files
   ifdh_fetch --dims "defname:S14-01-20GenieND-FCLOnly" 
   # change something
   perl -pi -e 's{NOVA.HornConfig:\s*"mn000z200i"}{NOVA.HornConfig: "LE250z200"}' *.fcl
   # make a dataset of the new stuff
   setup fife_utils
   sam_add_dataset --directory `pwd` --name my-GenieND-Alternate-horn1 
   # launch the MonteCarlo on the new dataset of .fcl files
   launchMC --dataset=my-GenieND-Alternate-horn1 --dest=/pnfs/scratch/myexperiment/users/myusername/funnyhornmc

(This assumes your experiment has a "launchMC" script to run the MonteCarlo on a dataset of .fcl files.)

Then you could run over your new files as a dataset, by defining it and using it:

    setup fife_utils
    sam_add_dataset --directory /pnfs/scratch/myexperiment/users/myusername/funnyhornmc -s sam_metadata_dumper --name mydataset2
    my_launcher --dataset mydataset2 --fcl `pwd`/myanalysis.fcl

Cleaning up

Let's say now you have made a dataset 'mydataset2' from a batch of files you previously generated, and you've now realized that the data is Horribly Wrong. You can:

   setup fife_utils
   sam_retire_dataset mydataset2

and the sam_retire_dataset script will hunt down all of those files, delete them, retire them from SAM, and delete the "mydataset2" dataset name, as well.

Analysis example

Now let's say you have some reconstructed files, perhaps in /pnfs/scratch/myexperiment/users/myusername/funnyhornmc,
and you want to run an analysis over them all as a batch job. You have a .fcl file that you have run on one or two files
interactively, and you want to try running it on all of those files as a batch job; see LaunchScripts for your
experiment's submission scripts.
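
As a minimal sketch (reusing the hypothetical my_launcher wrapper and myanalysis.fcl from the MonteCarlo
example above; substitute your experiment's actual launch script):

    setup fife_utils
    # declare the reconstructed files as a dataset
    sam_add_dataset --directory /pnfs/scratch/myexperiment/users/myusername/funnyhornmc --name myrecodataset1
    # submit the batch job running your .fcl over that dataset
    my_launcher --dataset myrecodataset1 --fcl `pwd`/myanalysis.fcl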

Keeping Data around Longer

So let's say you have a dataset of files defined in 'mydataset3' over in /pnfs/yourexperiment/scratch, and you know that
they will eventually disappear, and you want to make sure they stay around for a while, because you're going to
get around to looking at them properly in a week or so. You can:

    setup fife_utils
    sam_move2persistent_dataset --name=mydataset3

and it will move all the files in your dataset to the experiment persistent area.

Keeping Data more Permanently

So let's say you have a dataset of files defined in 'mydataset3' over in /pnfs/yourexperiment/scratch, and you know that
they will eventually disappear, and you want to file them in your experiment's tape-backed area.
You can simply:

    setup fife_utils
    sam_archive_dataset mydataset3

and it will copy each of the files into the tape-backed area, and add the locations for them to SAM.

Reference:

Here is a summary of the end-user SAM commands, options, and description of usage.

sam_add_dataset

Usage: sam_add_dataset [options]

Add a group of files to SAM and create a dataset out of it.

Options:
   --version             show program's version number and exit
  -h, --help            show this help message and exit
  -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server defaults to $SAM_EXPERIMENT
                        if not set
  -u USER, --user=USER  default is $USER
  -t TAG, --tag=TAG     the value for Dataset.Tag which will be used to
                        distinguish this new dataset default format is
                        user+date
  -n NAME, --name=NAME  the dataset name default is userdataset+user+date
  -d DIRECTORY, --directory=DIRECTORY
                        directory of files to create dataset with
  -r, --recurse         walk down all levels of directory
  -f FILE, --file=FILE  file of file paths to create dataset with
  -m METADATA, --metadata=METADATA
                        json file of metadata you would like added to all
                        files
  -s SUBPROCESS, --subprocess=SUBPROCESS
                        execute a child program in a new process to extract
                        metadata only sam_metadata_dumper currently supported
  -c CERT, --cert=CERT  x509 certificate for authentication. If not specified,
                        use $X509_USER_PROXY, $X509_USER_CERT/$X509_USER_KEY
                        or standard grid proxy location
  -v, --verbose         returns verbose output

This script will take a group of files, give them metadata with a Dataset.Tag = TAG, and make a dataset
with the given NAME. There are several requirements:

  • The files must exist in an area SAM knows about (i.e. experiment DCache areas).
  • You must have a certificate/proxy for your experiment.

Other useful tidbits:

  • Files will be renamed to have a unique prefix, as all files in a given experiment SAM instance
    need unique names
  • You can specify an already used Dataset.Tag to expand an existing dataset; this will however
    give you two datasets (the old one, and a new one) with the same files in them.
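
For example, a hedged sketch using only the options documented above (the experiment name, directory path,
dataset name, and tag here are hypothetical):

    setup fife_utils
    # declare every file under the directory (recursively) and build a dataset from them
    sam_add_dataset -e myexperiment --recurse \
        --directory /pnfs/myexperiment/scratch/users/myusername/mydata \
        --name myusername-mydata --tag myusername-mydata-tag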

sam_archive_dataset

Usage: sam_archive_dataset [options] 
 copy files in named dataset to archive and declare.
  (Use sam_archive_dataset --help for full options list)

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        copy then declare in batches of this size
  -n NAME, --name=NAME  dataset name to copy
  -k, --keep            keep existing copies
  -d DEST, --dest=DEST  override destination to archive to, default is
                        /pnfs/$EXPERIMENT/archive/sam_managed_users/$USER/data

This script copies all the files in a given dataset to the experiment's tape-backed
/pnfs/$EXPERIMENT/archive area, declares those locations, and then cleans out copies
which exist elsewhere.

If you merely want a copy without cleaning up others, use the --keep option, or
use sam_clone_dataset with a specific destination.
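
For example, a hedged sketch using the dataset name from the examples earlier in this guide:

    setup fife_utils
    # dry run first: show what would be copied to the archive area
    sam_archive_dataset --just_say --name mydataset3
    # then do it for real, keeping the existing copies in place as well
    sam_archive_dataset --keep --name mydataset3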

sam_archive_directory_image

Usage: sam_archive_directory_image [options] 
 archive files in source directory to the dcache archive area and declare.
  (Use sam_archive_directory_image --help for full options list)

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -n NAME, --name=NAME  dataset name/tag to put in: default is
                        "$USER_archive_images" 
  -s SRC, --src=SRC     source directory to archive
  -d DEST, --dest=DEST  override destination to archive to, default is 
                        /pnfs/$EXPERIMENT/archive/sam_managed_users/$USER/image/$DATE
  -4, --nfs4            write tarfile directly via nfs4

This script basically makes a tarfile backup of a directory, stores it in the experiment's tape-backed area, and declares it to SAM. It also makes, if it does not already exist, a dataset $USER_archive_images (with $USER replaced by your username) which lists all your archive images, and puts the path which was archived in the metadata of the file.

It is very useful if there is a disk area in your DCache persistent or BlueArc areas which you want to free up, but you're not sure you should throw away; you can archive it and then remove the directory, and you can unpack it later if needed -- see sam_restore_directory_image.
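
For example, a hedged sketch archiving such a directory before cleaning it out (the source path is hypothetical):

    setup fife_utils
    sam_archive_directory_image -e myexperiment \
        --src /pnfs/myexperiment/persistent/users/myusername/old_analysis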

Note: while you could schedule regular backups of directories with this tool and a cron job, the incremental backup tools provided by the systems folks at the lab are far more efficient. If you find yourself considering running a nightly or weekly backup this way, please see the FNAL Site Backups page, and use their tools, instead.

sam_audit_dataset

Usage: sam_audit_dataset [options] --name dataset --dest location
 Audit all files at destination location to see what files are and are not in named dataset

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -d DEST, --dest=DEST  location to audit
  -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server defaults to $SAM_EXPERIMENT
                        if not set
  -k KEEPLISTS, --keeplists=KEEPLISTS
                        keep file lists in directory KEEPLISTS
  -n NAME, --name=NAME  dataset name to audit

This utility is useful if you have attempted to move a large dataset to a particular directory tree rooted at "dest", and expect that those should be the only files there; it will compare the recursive listing of the directory with the contents of the dataset, and give you a summary and optional file lists of files that are in the directory tree but not in the dataset, files that are in the dataset but not in the directory tree, files that are there but not declared properly, etc., so that you can clean up from transient errors that might have occurred in the transfer.

It is also useful for reviewing a directory you might want to clean up, to see if there are files there that are not tracked by SAM, etc. as long as you can name a dataset that should be a superset of the files that should be there.
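
For example, a hedged sketch auditing a directory tree against a dataset and keeping the resulting file lists
(the destination path is hypothetical):

    setup fife_utils
    sam_audit_dataset --name mydataset3 \
        --dest /pnfs/myexperiment/persistent/sam_managed_users/myusername \
        --keeplists ./audit_lists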

sam_clone_dataset

Usage: sam_clone_dataset [options]
 copy files in dataset to destination and declare

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -n, --name xxx        name of dataset
  -d, --dest url        destination 
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        copy then declare in batches of this size
  -0, --zerodeep        make no subdirectories in destination
  -1, --onedeep         make subdirectories one deep in destination
  -2, --twodeep         make subdirectories two deep in destination

This utility makes copies of the files in a dataset at a given location.
It will by default make hashed subdirectories so as not to put too many files
in a given directory. It can be useful for prestaging files at a given site
(e.g. in Amazon S3 storage) or putting a copy of a set of BlueArc files in
DCache scratch before using the dataset in Grid jobs.
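
For example, a hedged sketch staging a copy into DCache scratch ahead of a grid submission (the destination
path is hypothetical):

    setup fife_utils
    sam_clone_dataset --name mydataset3 \
        --dest /pnfs/myexperiment/scratch/users/myusername/staging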

sam_condense_dataset

Usage: sam_condense_dataset [options] 
 slurp dataset through program and generate summed files

Options:
  -h, --help            show this help message and exit
  -p PROJNAME, --projname=PROJNAME
  -v, --verbose         
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        run in batches of this size
  -n NAME, --name=NAME  dataset name to sum
  -1 PHASE_1, --phase-1=PHASE_1
                        command to generate summary
  -2 PHASE_2, --phase-2=PHASE_2
                        command to sum summary files
  -w WORK, --work=WORK  working directory location
  -a, --art             phase 1 command is an art executable

This script attempts to apply a pair of programs to a dataset of files.
The first "phase-1" program reads the input files and generates some sort
of merged/summary/histogram file. The second "phase-2" program combines
merged/summary/histogram files into a single file. For small datasets,
it can run interactively, and just run the phase-1 program; for large datasets
it can launch a set of grid jobs which will run phase-1 on files, and generate
some number of intermediate files, and then run phase-2 on those to generate
a final output.
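
For example, a hedged sketch where the phase-1 command is a hypothetical histogramming script and the
phase-2 command merges the resulting ROOT files with hadd (exactly how arguments are passed to each phase
is up to the tool; this only illustrates the options):

    setup fife_utils
    sam_condense_dataset --name mydataset3 \
        --phase-1 "make_histograms.sh" \
        --phase-2 "hadd -f summed.root" \
        --work /pnfs/myexperiment/scratch/users/myusername/condense_work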

sam_copy2scratch_dataset

Usage: sam_copy2scratch_dataset [options] 
 copy files in named dataset to scratch and declare.
  (Use sam_copy2scratch_dataset --help for full options list)

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        copy then declare in batches of this size
  -n NAME, --name=NAME  dataset name to copy
  -d DEST, --dest=DEST  override destination to archive to, default is
                        /pnfs/$EXPERIMENT/scratch/sam_managed_users/$USER

This script is basically a wrapper on sam_archive_dataset that copies your data to the experiment DCache scratch area without you having to specify a location.
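
For example, letting the script pick the default scratch destination:

    setup fife_utils
    sam_copy2scratch_dataset -e myexperiment --name mydataset3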

sam_dataset_duplicate_kids

Usage: sam_dataset_duplicate_kids [options] --dims dimensions 
 Check files in dims for duplicate children of same parent

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server defaults to $SAM_EXPERIMENT
                        if not set
  --dims=DIMS           dimension query for files to check
  --include_metadata=INCLUDE_METADATA
                        metadata field to include in comparisons
  --mark_bad            mark duplicate files as 'bad' in content_status
  --retire_file         retire duplicate files
  --delete              delete duplicate files
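
For example, a hedged sketch that only reports duplicates; without the --mark_bad, --retire_file, or
--delete options, nothing should be changed:

    setup fife_utils
    sam_dataset_duplicate_kids -e myexperiment --dims "defname:mydataset3"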

sam_dataset_stage_status

Usage: sam_dataset_stage_status [options] dataset [dataset ...] 
 make sure files in dataset actually exist

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -p, --prune           prune locations we cannot reach
  -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server defaults to $SAM_EXPERIMENT
                        if not set
  -n NAME, --name=NAME  dataset name to validate
  -f FILE, --file=FILE  single file to validate
  -l, --locality        check DCache locations to see what is staged
  -L, --listtapes       list what tapes SAM thinks files are on
  -T, --tapeloc         check DCache locations to see what tapes things are on
  --location=LOCATION   only check matching locations
  --stage_status        generate staging status report

This is really an alias for sam_validate_dataset with options --stage_status --location=/pnfs.

sam_modify_dataset_metadata

Usage: sam_modify_dataset_metadata [options] 
 modify metadata on all files in dataset

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -n NAME, --name=NAME  dataset name to modify
  -m METADATA, --metadata=METADATA
                        metadata file with updates

This script will run through all the files in a dataset and basically call samweb modify-metadata on each file with the provided metadata file. This lets you tag a whole dataset with additional information, as needed.
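
For example, a hedged sketch applying the fields in a JSON file to every file in a dataset (the file name is
hypothetical; see the samweb documentation for the accepted metadata fields):

    setup fife_utils
    # updates.json holds the metadata fields to set on each file
    sam_modify_dataset_metadata -e myexperiment --name mydataset3 -m updates.json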

sam_move2archive_dataset

Usage: sam_move2archive_dataset [options] 
 copy files in named dataset to archive and declare and clean out other locations.
  (Use sam_move2archive_dataset --help for full options list)

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        copy then declare in batches of this size
  -n NAME, --name=NAME  dataset name to copy
  -d DEST, --dest=DEST  override destination to archive to, default is
                        /pnfs/$EXPERIMENT/archive/sam_managed_users/$USER/data
  -k, --keep            Do not clean up other copies of the data

This command will move all the files in a given dataset to the tape-backed experiment archive area, and then clean out the copies elsewhere, updating the SAM bookkeeping as to the locations. You can use the --keep option to prevent the cleaning out of other copies.

sam_move2persistent_dataset

Usage: sam_move2persistent_dataset [options] 
 copy files in named dataset to persistent and declare and clean out other locations.
  (Use sam_move2persistent_dataset --help for full options list)

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        copy then declare in batches of this size
  -n NAME, --name=NAME  dataset name to copy
  -d DEST, --dest=DEST  override destination to archive to, default is
                        /pnfs/$EXPERIMENT/persistent/sam_managed_users/$USER

This command will move all the files in a given dataset to the experiment persistent DCache area, and then clean out the copies elsewhere, updating the SAM bookkeeping as to the locations. You can use the --keep option to prevent the cleaning out of other copies.

sam_move_dataset

Usage: sam_move_dataset [options] 
 copy files in named dataset to named location and declare and clean out other locations.
  (Use sam_move_dataset --help for full options list)

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -b BATCH_SIZE, --batch_size=BATCH_SIZE
                        copy then declare in batches of this size
  -n NAME, --name=NAME  dataset name to copy
  -d DEST, --dest=DEST  destination to move to
  -k, --keep            Do not clean up other copies

This moves all the files for a given dataset to a given location -- first by copying them there, and then cleaning up copies of the files in other locations. You can prevent the cleanup pass by using --keep.
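
For example, a hedged sketch (the destination path is hypothetical):

    setup fife_utils
    sam_move_dataset --name mydataset3 \
        --dest /pnfs/myexperiment/persistent/users/myusername/keepers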

(sam_pin_dataset)

NOTICE: This command has been dropped, as it relied on grid tools which are no longer available.
If there is sufficient interest, we may re-add a version based on newer DCache API's.

sam_prestage_dataset

Prestage a dataset

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --help-commands       list available commands

  Base options:
    -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server. If not set, defaults to
                        $SAM_EXPERIMENT.
    --dev               use development server
    -s, --secure        always use secure (SSL) mode
    --cert=CERT         x509 certificate for authentication. If not specified,
                        use $X509_USER_PROXY, $X509_USER_CERT/$X509_USER_KEY
                        or standard grid proxy location
    --key=KEY           x509 key for authentication (defaults to same as
                        certificate)
    -r ROLE, --role=ROLE
                        specific role to use for authorization
    -z TIMEZONE, --timezone=TIMEZONE
                        set time zone for server responses
    -v, --verbose       Verbose mode

  prestage-dataset options:
    --defname=DEFNAME   
    --snapshot_id=SNAPSHOT_ID
    --max-files=MAX_FILES
    --station=STATION   
    --parallel=PARALLEL
                        Number of parallel processes to run
    --delivery-location=DELIVERY_LOCATION
                        Location to which the files should be delivered
                        (defaults to the same as the node option)
    --node=NODE         The current node name. The default is the local
                        hostname, which is appropriate for most situations

As you can probably tell by the help output, this is simply a wrapper that calls samweb prestage-dataset, which will go through the files in your dataset and pull any that are on tape into DCache. This can make processes that do not use SAM projects to fetch their files find them ready and waiting, or it can be used in advance of scripts that do use SAM, to reduce file access latency.
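
For example, a hedged sketch pulling a dataset back from tape with a few parallel transfers:

    setup fife_utils
    sam_prestage_dataset --defname=mydataset3 --parallel=4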

sam_project_caffeine

Usage: sam_project_caffeine [options] 
 keep production SAM projects alive longer for production
 (Use sam_project_caffeine --help for full options list)

Options:
  -h, --help            show this help message and exit
  -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server defaults to $SAM_EXPERIMENT
                        if not set
  --station=STATION     use this station name -- defaults to $SAM_EXPERIMENT
                        if not set

This script will keep SAM projects awake when batch jobs are taking a long time to run.
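
For example, a hedged sketch, left running (say, in the background) while the long-running jobs are active:

    setup fife_utils
    sam_project_caffeine -e myexperiment --station myexperiment &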

sam_remove_location_dataset

Usage: sam_remove_location_dataset [options] dataset dest_url 
 remove copies of files in dataset under destination

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT
  -n NAME, --name=NAME  dataset name to clean
  -d DEST, --dest=DEST  destination pattern to match

See sam_unclone_dataset

sam_retire_dataset

Usage: sam_retire_dataset [options]
 delete files, undeclare locations, and delete dataset

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -n, --name            name of dataset (required) 
  -j, --just_say        do not actually copy, just say what you would do
  -k, --keep_files      do not delete actual files, just retire them from SAM
  -m DELETE_MATCH, --delete_match=DELETE_MATCH
                        delete only files matching this regexp
  -e EXPERIMENT, --experiment=EXPERIMENT

This command is useful when you have made a dataset of preliminary files, and then discover you want to just throw them out and start over. You can call sam_retire_dataset on the dataset, and the files will be removed from wherever SAM thinks they are, the locations will be undeclared from SAM, the files will be retired in SAM, and the dataset name deleted. You can then re-run whatever generated the dataset initially, declare the new files, and go on with the new data.
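
For example, a hedged sketch: do a dry run first, then retire the SAM records while leaving the physical
files on disk:

    setup fife_utils
    # dry run: show what would be removed
    sam_retire_dataset --just_say --name mydataset2
    # retire the files from SAM but keep them on disk
    sam_retire_dataset --keep_files --name mydataset2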

sam_unclone_dataset

Usage: sam_unclone_dataset [options] 
 remove copies of files in dataset under destination

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -n, --name NAME       dataset name (required)
  -d, --dest DEST       destination url match (required)
  -j, --just_say        do not actually copy, just say what you would do
  -e EXPERIMENT, --experiment=EXPERIMENT

This command removes copies of files in a dataset stored at a given location -- for example, if you previously
staged a copy in your experiment persistent area, you might:

    sam_unclone_dataset --name=mydataset3 --dest=/pnfs/$EXPERIMENT/persistent

and it would clean the files from that dataset out of that area.

Note: sam_unclone_dataset will not remove a file if it is the only remaining copy of the file. So it may not always clean a given directory out entirely.

sam_validate_dataset

Usage: sam_validate_dataset [options] dataset [dataset ...] 
 make sure files in dataset actually exist

Options:
  -h, --help            show this help message and exit
  -v, --verbose         
  -j, --just_say        do not actually copy, just say what you would do
  -p, --prune           prune locations we cannot reach
  -e EXPERIMENT, --experiment=EXPERIMENT
                        use this experiment server defaults to $SAM_EXPERIMENT
                        if not set
  -n NAME, --name=NAME  dataset name to validate
  -l, --locality        check DCache locations to see what is staged
  -L, --listtapes       list what tapes SAM thinks files are on
  -T, --tapeloc         check DCache locations to see what tapes things are on

This command is particularly targeted at scratch DCache -- it will check the locations that SAM thinks it has for files in the given dataset, and let you know if any are not present. With the --prune option it can also clean out the location data for those unreachable locations.
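
For example, a hedged sketch checking what is actually staged for a dataset, then pruning the locations that
cannot be reached:

    setup fife_utils
    sam_validate_dataset --locality mydataset3
    sam_validate_dataset --prune mydataset3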