SAMLite or SAM for User Datasets¶
- Table of contents
- SAMLite or SAM for User Datasets
- Quick Start
- Complete documentation
- Making and Using a dataset
- Additional Tools
- Special Metadata
Neat Tricks you can do with SAM for Users.
Detailed documentation is in the fife_utils Wiki
In addition to this documentation please see the pnfs tutorial from July 2015 here: DocDB 13747
The user tools for using SAM with small (unofficial) data sets are found in the "FIFE Utilities" package.
To setup this package use:
# setup your experiment framework first, if you haven't done so
setup fife_utils
This will give you access to the following commands¹:
- sam_add_dataset (makes a new dataset)
- sam_validate_dataset (validates that all the files are present, or reports which aren't)
- sam_clone_dataset (makes a replica of a dataset in a different location, i.e. copies it)
- sam_unclone_dataset (removes the replicas of a dataset in a specific location, i.e. cleans up a copy)
- sam_modify_dataset_metadata (applies or modifies the metadata associated with a dataset)
- sam_retire_dataset (retires a dataset)
- sam_archive_dataset (copies a dataset to a tape-backed area and removes the old copy)
- sam_archive_directory_image (copies a tarfile of a directory to a tape-backed area)
- sam_copy2scratch_dataset (copies a dataset to the scratch area)
- sam_move2archive_dataset (alias for sam_archive_dataset)
- sam_move2persistent_dataset (copies a dataset to the persistent area)
- sam_move_dataset (alias for sam_clone_dataset)
- sam_prestage_dataset (tells dCache to prestage all the files in a dataset)
- sam_project_caffeine (keeps production SAM projects awake, for when job submission is slow)
- sam_remove_location_dataset (alias for sam_unclone_dataset)
- sam_restore_directory_image (inverse of sam_archive_directory_image)
- sam_revert_names (undoes the uniquifying of names done by sam_add_dataset)
- sam_extract_dataset_metadata (v3_2_8+; runs an extractor program and updates metadata on files)
- sam_dataset_duplicate_kids (v3_2_8+; looks for multiple children of the same file)
- sam_dataset_stage_status (v3_2_8+; alias for sam_validate_dataset with the staging report turned on)
Each of these tools (except the first) is designed to work with a complete dataset definition, which can be defined by the user.
If you are already familiar with SAM then:
- Make some files (or find them)
They should be in a supported storage area (e.g. Bluearc or dCache).
- Define a Dataset
From a directory with your files (Alt: use -f <textfile> to pass in a text file with your file locations listed)
sam_add_dataset -d <path to file> -n <name of dataset>
This will register all the files in the specified directory (or in the filelist, if you passed one). It will create a dataset named "<name of dataset>" and tag each file (in the Dataset.Tag field) with that name. Each file will have its location set properly.
You can now use this dataset to run standard SAM analysis projects.
- Delete a Dataset
When you are done with a dataset, you can delete it. This will unregister the files and delete the dataset definition. If you want to keep the files around then also use the "--keep_files" option (otherwise the files are deleted from the disk too). See details below.
sam_retire_dataset -n <name of dataset> [ optional --keep_files ]
- Copy a Dataset
If you want to copy the files to some other location (e.g. Bluearc to dCache scratch, or dCache scratch to tape), use sam_clone_dataset. SAM replica information will be updated automatically. Please be sure that the destination path is group-writable and that there is no double slash (//) in the path name.
sam_clone_dataset -n <name of dataset> -d <destination path>
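The double-slash caveat is easy to check before you run the clone. A minimal sketch (the destination path below is a made-up example) that collapses any accidental run of slashes:

```shell
# Hypothetical destination path that accidentally contains a double slash
dest="/pnfs/nova/scratch//users/anorman/myskim"

# Collapse any run of slashes into a single slash before handing the
# path to sam_clone_dataset (a // in the path can cause trouble)
clean_dest=$(printf '%s' "$dest" | sed 's|//*|/|g')

echo "$clean_dest"
```

The `sed` expression replaces every run of one or more slashes with a single slash, so triple slashes are handled too.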
- Remove a Dataset
If you want to remove files that have been copied to some location as described above (e.g. Bluearc to dCache scratch, or dCache scratch to tape), use sam_unclone_dataset. SAM replica information will be updated automatically. This is very useful if you want to remove the files from a local disk like /nova/ana/ after the copy to dCache is done.
sam_unclone_dataset -n <name of dataset> -d <destination path>
Detailed documentation is in the fife_utils Wiki
Below are instructions for completing specific tasks and details of the procedures that should be used. In general all the utilities have command-line help, accessible through the --help flag, e.g.:
> sam_retire_dataset --help
Usage: sam_retire_dataset [options] dataset [dataset ...]

delete files, undeclare locations, and delete dataset

Options:
  -h, --help            show this help message and exit
  -v, --verbose
  -j, --just_say        do not actually copy, just say what you would do
  -k, --keep_files      do not delete actual files, just retire them from SAM
  -m DELETE_MATCH, --delete_match=DELETE_MATCH
                        delete only files matching this regexp
  -e EXPERIMENT, --experiment=EXPERIMENT
  -n NAME, --name=NAME  dataset name to retire
For all these commands you can set the SAM_EXPERIMENT environment variable (i.e. export SAM_EXPERIMENT=nova) or use the -e EXPERIMENT option on the commands.
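For example, to set the experiment once for a NOvA session instead of passing -e to every command:

```shell
# Set the experiment once per session; all the sam_* tools pick it up
export SAM_EXPERIMENT=nova

echo "$SAM_EXPERIMENT"
```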
Making and Using a dataset¶
Follow steps 1-3 to get a working dataset using the SAM for users tools.
Step 1 -- Define a dataset¶
For analysis use, the common need is to group a set of files (e.g. the output files of some stage of analysis) together as a single entity, making it easy to use the SAM framework to do things with them (such as running more analysis on them). These files can be the output of art analysis jobs, histogram files, ntuples, log files, or photos of toy poodles. The file content and format do not matter. All that matters is that they are files with non-zero size.
To define a dataset you use the "sam_add_dataset" tool.
There are two general modes of operation of this tool.
- Declaration of a list of files
- Declaration of all files in a directory
- In the first mode you pass the program a list of files (in a text file) containing the full path to each file that you want to be part of the dataset.
- In the second mode you pass the program the path to a directory which contains some files. All files in this directory are added to the dataset.
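As a sketch of the first mode, the filelist can be built with ordinary shell tools (the directory and filenames here are stand-ins, not real analysis output):

```shell
# Create a couple of stand-in files (in practice these are your outputs)
mkdir -p /tmp/sam_demo
touch /tmp/sam_demo/ntuple_a.root /tmp/sam_demo/ntuple_b.root

# Build a filelist containing the FULL path to each file, one per line
ls -1 /tmp/sam_demo/*.root > /tmp/myfilelist.txt

cat /tmp/myfilelist.txt
```

The resulting text file is what you would hand to sam_add_dataset with its filelist option.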
The sam_add_dataset command requires that you also specify a number of options. The important options are:
-n NAME or --name=NAME The dataset name. The default value is <userdataset+user+date>
This is the name of your dataset. It can be any string, but choose wisely since it is what you will use to refer to your collection of files.

GOOD dataset names share three features: they A) have some ownership identifier (norman, andrew, exoticsgroup), B) describe what they are (analysisskim, awesome ntuples, monopole skim), and C) have extra info to distinguish them from similar datasets (a custom pid version, a date, prelim plus a date).

Examples of BAD dataset names:
- ""data", "mydata", "stuff" (all of these are non-descript, non-unique, etc....)
- "nue_data" "official_nue_dataset" "zomg_use_this_nue_data" (confusing and could collide with "official" datasets provide to the collaboration)
- Anything that is not unique or descriptive
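The naming advice above can be turned into a little shell habit. A sketch (the description "analysisskim" is a placeholder, and "anorman" is used as a fallback if $USER is unset):

```shell
owner=${USER:-anorman}    # A) ownership identifier
desc="analysisskim"       # B) what the files are (placeholder description)
stamp=$(date +%Y%m%d)     # C) extra info to distinguish similar datasets

name="${owner}_${desc}_${stamp}"
echo "$name"
```

A name built this way is unique to you, descriptive, and dated, which is exactly what the GOOD examples have in common.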
Step 2 -- Find/Register your locations¶
Not all storage is created equal!¶
An important component of working with your data is knowing where it is located. SAM is "aware" of a number of different major storage systems and can interact with them transparently. HOWEVER, if SAM doesn't know about a storage location (like your laptop's hard drive or some random computer at a home university) then it can't help.
Currently, for NOvA, SAM knows about the following locations that normal users can access:
|dcache:/pnfs/nova/scratch|dCache Scratch|The dCache scratch system|
|novadata:/nova/ana|Bluearc Disk|The entire /nova/ana volume|
|novadata:/nova/prod|Bluearc Disk|The entire /nova/prod volume|
|enstore:/pnfs/nova/|Enstore/dCache Tape|Tape-backed parts of the dCache/Enstore system|
And a few special ones for DAQ and MC.
|novadata:/nova/data/rawdata|Bluearc Disk|Raw data collection area (do not touch)|
|novadata:/nova/data/mc|Bluearc Disk|Monte Carlo files (deprecated)|
- Figure out where your files are
- If your files are NOT in one of these areas, then move them there (e.g. copy them to the /nova/ana/users/ area or to the /pnfs/nova/scratch/users/ area)
- Note: moving files to /pnfs/nova/scratch/users is NOT as simple as using cp!!!
- To move to /pnfs/nova/scratch/users use "ifdh cp <myfile> /pnfs/nova/scratch/users/<myusername>/"
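A sketch of moving several files at once with ifdh (the username and filenames are placeholders; here the ifdh cp commands are only collected and echoed, so you can inspect the plan before running it for real):

```shell
dest="/pnfs/nova/scratch/users/anorman"   # placeholder username

# Build one ifdh cp command per local file; remove the surrounding
# echo/plan-file machinery to actually run the copies (plain cp does
# NOT work for the /pnfs scratch area)
for f in myhist_1.root myhist_2.root; do
    echo "ifdh cp $f $dest/$f"
done > /tmp/copy_plan.txt

cat /tmp/copy_plan.txt
```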
Once your files are in a supported location you can register them easily:
# If you are in the directory with your files
sam_add_dataset -d . -n <myAwesomeDatasetName>
If all goes well then you will be able to use SAM to work with your files.
# List all your files
samweb list-files "defname: myAwesomeDataSetName"

# List the locations of your files
samweb locate-file <filename>
Step 3: Run on your data¶
At this point you have a completely valid SAM dataset with files in locations that can be delivered to your offline jobs.
Follow the instructions for setting up and running a job against a standard SAM dataset (found here)
To avoid name collisions (your file conflicting with someone else's file) the files in your dataset are renamed automatically by the sam_add_dataset tool. The new filename will have the form:
<unique prefix>-<original filename>
The prefix is a UUID (a universally unique identifier), but don't worry: you'll never need to type it. SAM will handle that for you.
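The renaming scheme is easy to illustrate by hand (sam_add_dataset does this for you; the original filename below is taken from the metadata example later on this page):

```shell
orig="fardet_r00013114_s16_t00.raw"

# Generate a UUID; this is what guarantees the new name is unique
prefix=$(python3 -c 'import uuid; print(uuid.uuid4())')

newname="${prefix}-${orig}"
echo "$newname"
```

The original filename is always recoverable: it is everything after the first 36 characters and the joining hyphen (and sam_revert_names will undo the renaming for you).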
The following tools are also available to help you work with datasets that you have defined. These tools all work on the dataset as a whole. You can also use any of the standard samweb tools to work on individual files or SAM catalog entries and searches.
Depending on where your files are stored, it may be desirable to "verify" that all the files you think are in your dataset are actually available.
To do this:

sam_validate_dataset --name <dataset>
This will check the registered locations of the individual files to see if the files are actually there. You will get a report of which files are missing.
Note: This utility is really only needed if you are using the "scratch" dCache area and you have NOT used or touched your files for a long time (meaning > 30 days). In this case validating your dataset can tell you if your files have been purged from that cache area.
Missing Files -- What to do¶
If the validate utility finds missing files there are basically three things you can do:
- Replace the files
If you are able to locate a copy of the files from somewhere else, or are able to regenerate them, then you can put them back where they should be.
- Prune the dataset
In this mode you remove from the dataset any files which have disappeared. The resulting dataset is then smaller (a subset of the original) but has no missing files, which makes it easy to run over.
To prune you use:
sam_validate_dataset --prune --name <dataset>
- Ignore the problem

The files are just missing and you don't care. You'll get errors when you run jobs that try to grab the missing files.
There will come a time when you are done with your dataset (and its data) and you'll want to remove it. The sam_retire_dataset tool handles this, but there are a number of variants regarding what is actually deleted and what is retained. The general options are:
- Delete everything
- Delete the SAM dataset definition, KEEP the corresponding files
- Delete a subset of the files, KEEP the SAM file and dataset entries
Each of these are detailed below.
In this mode the utility completely cleans up all files that are associated with the dataset both in SAM and on disk. It also removes the dataset definition in SAM.
The way to invoke this is:
#
# Will not actually do anything, just report what would be done
sam_retire_dataset --just_say --name=<dataset>

#
# Deletes everything
sam_retire_dataset --name=<dataset>
Delete the definition, keep the files¶
Use this when you want to keep the files but remove the dataset definition:
#
# Does not delete the files (only the SAM entries)
sam_retire_dataset --keep_files --name=<dataset>
Delete a subset of the files¶
You may want to prune down your dataset (remove files from it). To do this you can use a regular expression that will be matched against the name of the files:
sam_retire_dataset --delete_match=<regex> --name=<dataset>

#
# Example: Remove files that end in ".log" from your set
sam_retire_dataset --delete_match=".log$" --name=andrews_awesome_data
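Since --delete_match takes a regular expression, you can preview what a pattern would match with grep before retiring anything. A sketch (the filenames are stand-ins for a dataset listing, e.g. from samweb list-files):

```shell
# Stand-in filenames from a hypothetical dataset
printf '%s\n' job_01.log job_01.root job_02.log > /tmp/names.txt

# Files ending in ".log"; note the escaped dot, so "mylogs.root"
# would not accidentally match
grep '\.log$' /tmp/names.txt
```

An unescaped dot in a regexp matches any character, so escaping it keeps the match literal.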
1 Requires Python version 2.7+. Using an older version will give errors.
When you use these tools you can specify your own metadata for your files. There are, however, a number of fields that are automatically filled in for you; these constitute the minimum amount of metadata needed to find the file.
File Name: 1164c6e4-17bb-4139-898c-82d25f3a6b53-fardet_r00013114_s16_t00.raw
File Id: 97347520
Create Date: 2015-03-05T18:30:46+00:00
User: anorman
File Type: unknown
File Format: unknown
File Size: 7206088
Checksum: (none)
Content Status: good
Dataset.Tag: NORMAN_RUN13114_TEST
The most important of these is the Dataset.Tag field. Its value is the name of your dataset.
i.e. The above file was declared using:
sam_add_dataset -n NORMAN_RUN13114_TEST -d .
Where the file was in the current directory.
Your certificate is expired (or doesn't exist)¶
[anorman@novagpvm01 test4]$ sam_add_dataset -n NORMAN_RUN13114_TEST -d .
oops: SSL error: [Errno 1] _ssl.c:510: error:14094415:SSL routines:SSL3_READ_BYTES:sslv3 alert certificate expired
You are using Python 2.7.9¶

Error example:
[anorman@novagpvm01 test4]$ sam_add_dataset -n NORMAN_RUN13114_TEST -d .
oops: SSL error: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)