Data Handling and SAM

Setting up your environment

Interactive setup

Initialize ups using the microboone standard initialization script.

source /grid/fermiapp/products/uboone/setup_uboone.sh

The easiest way to setup all required ups products and set all required environment variables is to setup the top-level ups product uboonecode.

setup uboonecode v03_02_00 -q e6:prof

Overview of SAM ups products

sam_web_client

Sam_web_client provides full-featured samweb clients for command line (samweb command) and python (import samweb_cli).

Ifdhc

Ifdhc (Intensity Frontier Data Handling Client) provides portable data handling tools, including samweb clients for command line (ifdh command), python, and c++. Ifdh tools come with a strong guarantee to be portable and grid-friendly.

Ifdh_art

Ifdh_art supplies a layer on top of the ifdh c++ samweb client in the form of an art service. Using ifdh_art, users can configure their art programs to interact with the samweb server to read files from SAM.

Authentication

Some samweb commands require authentication, and some don't. Generally, any command that modifies the SAM database requires authentication.

Becoming a registered SAM user

In order for any secure samweb command to succeed, you need to be a registered SAM user. You can get a list of registered users using the line mode command (does not require authentication):

samweb list-users

If you are not a registered SAM user, ask the Analysis Tools conveners to add you.

Using a kca x509 certificate for authentication

Samweb accepts a kca x509 certificate as authentication. Obtaining an x509 certificate is a two-step process of getting a kerberos ticket using kinit, then converting it to an x509 certificate using kx509.

kinit
kx509   # or get-cert

The kx509 command stashes the x509 certificate in a standard place where samweb knows to look for it.

Using a grid proxy for authentication

A grid proxy is a souped up kca x509 certificate that is accepted as authentication by standard grid tools, like gridftp. The SAM server also accepts grid proxies as authentication. While the x509 certificate is good enough for most samweb commands (including commands that update the SAM database), operations that make use of grid tools, such as fetching files from cache, may require a grid proxy.

You can obtain a grid proxy interactively (after obtaining a kca x509 certificate) using the following command.

voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/uboone/Role=Analysis

If you plan on using SAM interactively at all, it is advisable to wrap the kx509 + voms-proxy-init commands in an alias.
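
For example, you might define something like the following (the alias name is arbitrary) in your login script:

alias get_proxy='kx509 && voms-proxy-init -noregen -rfc -voms fermilab:/fermilab/uboone/Role=Analysis'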

The following command will print out information about your current grid proxy or kca x509 certificate.

voms-proxy-info -all

Testing your authentication

You can determine if you are authenticated to the SAM server using this command. The option "-s" instructs samweb to contact the secure server.

samweb -s server-info

If you are properly authenticated, you should get a response like this:
$ samweb -s server-info 
SAMWeb API for uboone
Server version: 1.6.4
Cherrypy version: 3.2.2
SQLAlchemy version: 0.8.2
Connected to: oracle+cx_oracle://samdbs:******@csoravprd.fnal.gov:1526/microbp1
HTTP User-Agent: SAMWebClient/v1_3 (samweb) python/2.4.3
User information:
  Untrusted identity: greenlee@uboonegpvm02.fnal.gov
  Authenticated username: greenlee

If you aren't authenticated, you will see a result like this:
$ samweb -s server-info 
SSL error: (1, 'error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure'): no client certificate found

If you omit the "-s" option, you should get the following response whether or not you are authenticated. You can use this command to test whether the samweb server is reachable at all.
$ samweb server-info
SAMWeb API for uboone
Server version: 1.6.4
Cherrypy version: 3.2.2
HTTP User-Agent: SAMWebClient/v1_3 (samweb) python/2.4.3
User information:
  Untrusted identity: greenlee@uboonegpvm02.fnal.gov
  Unauthenticated

Authentication in batch jobs

In general, you don't need to manually authenticate in batch. Jobsub takes care of getting a valid grid proxy automatically.

Files and datasets

The SAM database stores information (metadata) about data files. SAM metadata can be used to query files from the SAM database.

Dataset Definitions

A dataset definition is a memorized query with a name. The following command will list all known dataset definitions.

samweb list-definitions

MCC5 datasets can be viewed from this web page:
http://www-microboone.fnal.gov/at_work/AnalysisTools/mc/mcc5.0/

You can view a particular dataset definition using the following command.

$ samweb describe-definition prodgenie_bnb_nu_uboone_mcc5.0
Definition Name: prodgenie_bnb_nu_uboone_mcc5.0
  Definition Id: 667
  Creation Date: 2014-09-25T01:21:57
       Username: uboonepro
          Group: uboone
     Dimensions: file_type mc and data_tier reconstructed and ub_project.name prodgenie_bnb_nu_uboone and ub_project.stage mergeana and ub_project.version v02_05_01 and availability: anylocation

File queries

Basic samweb commands for executing a file query are:

samweb list-files "<query>" 
samweb count-files "<query>" 

The query syntax is described on the samweb wiki. Here is one particularly simple and useful query for executing the query associated with a dataset definition.

$ samweb count-files "defname: prodgenie_bnb_nu_uboone_mcc5.0" 
125

Before leaving this topic, here are two modified versions of the above query to consider.

$ samweb count-files "defname: prodgenie_bnb_nu_uboone_mcc5.0 and availability: physical" 
87
$ samweb count-files "defname: prodgenie_bnb_nu_uboone_mcc5.0 and availability: virtual" 
38

The extra clause "and availability: physical" selects files that have a location stored in the SAM database. The clause "and availability: virtual" selects files that do not have a physical location (only metadata).

Pre-staging files for processing

Files NOT located in /pnfs/uboone/persistent or /pnfs/uboone/scratch live in tape-backed storage, Enstore. While such files are permanently stored on tape, they must be staged to disk within dCache before they can be processed. For a file to be staged from tape, the system must wait for a tape drive to become available and for the file to be located on that tape. This process can take from minutes up to many hours (20+ hours depending on Enstore load). You are therefore strongly encouraged to stage files well ahead of time so that reading them will be prompt and not cause delay. (Trying to interactively read a file that is not staged from tape will cause the interactive session to hang waiting for access; again, this can take hours.) To prestage a dataset, issue the following commands:

$ kx509
$ samweb prestage-dataset --defname=<your_dataset_name_here>

Note that you must set up uboonecode in your environment before issuing these commands. You also need to keep the session in which you issued the command active while it is running. Otherwise, your credentials generated with the "kx509" command will be flushed from the system when you log out.

Please make sure that you understand the size of the dataset you are trying to prestage. If you prestage all of the raw binary data (>4 PB), you will exceed the size of the disk cache in front of Enstore (2 PB, shared by all FIFE experiments), and you will basically be trying to shove 20 pounds of fertilizer into a 10 pound bag. It will NOT go well.
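
You can check the number of files and total size of a dataset before prestaging it (assuming your version of samweb supports the --summary option):

samweb list-files --summary "defname: <your_dataset_name_here>"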

If you want to check if a file is staged to disk, you can use this command interactively:

<uboonegpvm06.fnal.gov> cat /pnfs/uboone/data/uboone/raw/online/crt_seb/v2_0/00/00/00/00/".(get)(ProdRun20170629_161007-crt03.1.crtdaq)(locality)" 

The output of the command can contain two tokens: "ONLINE" and/or "NEARLINE". If the response is "ONLINE_AND_NEARLINE", then the file is on tape (the "NEARLINE" part) and staged to disk for immediate reading (the "ONLINE" part). If the response only says "NEARLINE", the file must be staged from tape before it can be read. If the response only says "ONLINE", the file is on disk and available for immediate reading, but is NOT stored on tape.
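
The ".(get)(<filename>)(locality)" construction is a dCache "dot command" interpreted through the NFS mount. A small shell function (a sketch; the function name is hypothetical) saves some typing:

check_locality() {
  # Usage: check_locality /pnfs/path/to/file
  cat "`dirname $1`/.(get)(`basename $1`)(locality)"
}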

Accessing files in SAM

Copying files from a SAM dataset to local disk (including laptops)

You need to get a grid proxy onto the destination node. If you are on an interactive node, you can run these commands and the proxy will be set up for you.

<uboonegpvm01.fnal.gov> cigetcert -i "Fermi National Accelerator Laboratory" 
<uboonegpvm01.fnal.gov> voms-proxy-init -noregen -voms fermilab:/fermilab/uboone/Role=Analysis

If you are transferring to a laptop, you will then need to securely copy the file generated (e.g. /tmp/x509up_u8957) to your laptop using scp. Once located on your laptop, you must set the variable X509_USER_PROXY to point at that file.
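
For example (the proxy file name and interactive node will differ for you):

scp uboonegpvm01.fnal.gov:/tmp/x509up_u8957 /tmp/
export X509_USER_PROXY=/tmp/x509up_u8957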

Next you must set up uboonecode in your terminal. You can use a local install, but if you have CVMFS mounted, it is recommended to use that source (see the CVMFS installation instructions).

mac-124102:~ kirby$ source /cvmfs/uboone.opensciencegrid.org/products/setup_uboone.sh 
Setting up larsoft UPS area... /cvmfs/fermilab.opensciencegrid.org/products/larsoft/
Setting up uboone UPS area... /cvmfs/uboone.opensciencegrid.org/products/
mac-124102:~ kirby$ setup uboonecode v06_25_00 -q e10:prof
mac-124102:~ kirby$ export X509_USER_PROXY=/tmp/x509up_u8957 #You must change the file location!!!!
mac-124102:~ kirby$ samweb run-project --defname=prodcosmics_corsika_cmc_uboone_intime_mcc8_detsim --max-files=2 "source /cvmfs/uboone.opensciencegrid.org/products/setup_uboone.sh; setup uboonecode v06_25_00 -q e10:prof; globus-url-copy -vb %fileurl /tmp/" 

Change the defname value to your dataset, the destination directory, and the number of files to be copied to the destination.

Copy one file in SAM to a local disk

Sometimes you just want to copy one file from SAM to a local disk for testing. This can be done simply using the command ifdh_fetch. The command:

ifdh_fetch <file>

will copy the specified file from SAM to your current directory. Note in the above command <file> is just the name of a file in SAM, not a path. For the above command to work, the specified file must have one or more locations stored in the SAM database.

SAM Projects

Often you will want to access not just one file, but many or all of the files in a SAM dataset. Furthermore, you may want to distribute processing of these files to many consumer processes (perhaps batch jobs). You do this using a SAM project. There are several ways to do this, with various steps handled manually or automatically.

For the record, here are all of the steps involved in running a SAM project.
  1. Select (or create) a SAM dataset definition to be used as input.
  2. Generate a unique SAM project name.
  3. Start the SAM project.
  4. Define your local scratch directory.
  5. Start a consumer process.
  6. Request the location (uri) of the next file.
  7. Copy the file to the local scratch disk.
  8. Process the file.
  9. Release the file.
  10. Delete the file from the local scratch disk.
  11. Repeat steps 6-10 as often as desired, or until no more files are available.
  12. Stop the consumer process.
  13. Stop the SAM project.

A single SAM project can have many consumer processes (steps 4-12). Any of the above steps that require interaction with the SAM server can be done using samweb client command line tools, ifdh client command line tools, python clients, or c++ clients.

Running a SAM project using samweb command line tools

This section explains how to do every step of running a SAM project using samweb line mode commands. These instructions are not intended as a practical way of running SAM projects, but are included as an example for framework developers or script authors who may want to include these steps in their programs or scripts (perhaps using other samweb clients).

In this section we will show how to run a SAM project interactively using (mostly) samweb command line tools. You will need to make sure your environment is set up, and that you are authenticated, as described in the previous sections. These examples assume bash-style command line syntax.

Choose dataset

We will store the name of the SAM dataset in a shell variable.

def=prodgenie_bnb_nu_uboone_mcc5.0

Generate unique SAM project name

The name can be anything, but should be unique for all time. The project name is stored permanently in the SAM database.

prjname=${USER}_${def}_`date +%Y%m%d_%H%M%S`
echo $prjname
greenlee_prodgenie_bnb_nu_uboone_mcc5.0_20141006_214839

The python samweb client has a built in command to help you with this step.

Start the project

This command returns the url of the running project.

prjurl=`samweb start-project --defname=$def $prjname`
echo $prjurl
http://samweb.fnal.gov:8480/sam/uboone/api/projects/uboone/greenlee_prodgenie_bnb_nu_uboone_mcc5.0_20141006_214839

Define your local scratch directory

Interactively, use the environment variable TMPDIR to specify the location of your local scratch directory, and make sure this directory exists. If you neglect to set TMPDIR for interactive SAM jobs, ifdh will default to /var/tmp, which will probably fill up /var/tmp on the uboonegpvmXX nodes.

export TMPDIR=/uboone/data/users/$USER/temp
mkdir -p $TMPDIR

In a batch job, ifdh will use the directory specified by environment variable _CONDOR_SCRATCH_DIR in preference to TMPDIR.

Start the consumer process

This command returns the consumer process id (an integer).

cpid=`samweb start-process --appfamily=art --appname=lar --appversion=v03_01_00 $prjurl --max-files=10 --schemas=root`
echo $cpid
8551

The application (family, name, version) can be anything. The --max-files and --schemas options may be omitted.

Request the location of the next file

This command returns a location in the form of a "uri" that is understandable to ifdh.

fileuri=`samweb get-next-file $prjurl $cpid`
echo $fileuri
gsiftp://fndca1.fnal.gov:2811/mc/uboone/reconstructed/prodgenie_bnb_nu_uboone/mergeana/v02_05_01/prodgenie_bnb_nu_uboone_53047_51_gen_53053_51_g4_53065_51_detsim_53067_51_reco2D_61211_51_reco3D_75646_32_merged.root

Copy the file to the local scratch disk

This step requires the ifdh command, as opposed to a samweb command. Depending on how ifdh chooses to copy the file (i.e., whether ifdh chooses to use grid tools or not), this step may require a valid grid proxy (see above).

If you encounter permission failures when running interactively, set the environment variable IFDH_FORCE to "expftp".

export IFDH_FORCE=expftp   # Optional.  Only do this for interactive jobs.
loc=`ifdh fetchInput $fileuri | grep $TMPDIR`
echo $loc
/uboone/data/users/greenlee/temp/ifdh_22029/prodgenie_bnb_nu_uboone_53047_51_gen_53053_51_g4_53065_51_detsim_53067_51_reco2D_61211_51_reco3D_75646_32_merged.root

The location of the fetched file is printed on standard output by the "ifdh fetchInput" command, along with possibly other output. The above syntax is one way to capture the location of the file in a shell variable.

Process the file

Do whatever you want with the file.

cp $loc /where/I/want/it

Release the file

samweb release-file $prjurl $cpid `basename $fileuri`

A record is kept in the SAM database that this file was consumed by this project.

Delete the file from the local scratch disk

rm -f $loc
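
Steps 6 through 10 can be repeated in a loop until the project runs out of files. Here is a minimal sketch (assuming the shell variables defined above, and assuming get-next-file prints an empty string once no more files are available):

while true
do
  fileuri=`samweb get-next-file $prjurl $cpid`
  if [ -z "$fileuri" ]; then break; fi
  loc=`ifdh fetchInput $fileuri | grep $TMPDIR`
  # ... process the file at $loc here ...
  samweb release-file $prjurl $cpid `basename $fileuri`
  rm -f $loc
done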

Stop the consumer process

samweb stop-process $prjurl $cpid

Stop the project

samweb stop-project $prjname

Reading from SAM in art programs.

Art programs can interact with the samweb server using the ifdh c++ client. As already mentioned above, the ifdh c++ client is packaged as an art service by the ifdh_art ups product.

The steps involved in running a SAM project involving an art program are the same as a command line interactive project, except that locating, fetching, processing, and deleting data files are handled internally by the art program. The initialization steps, up to starting the consumer process, and the finalization steps, beginning with stopping the consumer process, must be done external to the art program. If you encounter permission failures running interactively, you can set environment variable IFDH_FORCE as "expftp," the same as in the command line case.

The steps involved in running an art program SAM project can be summarized as follows.
  1. Choose a SAM dataset definition for input.
  2. Generate a unique SAM project name.
  3. Start the SAM project.
  4. Define your local scratch directory.
  5. Start a consumer process.
  6. Generate a SAM wrapper fcl job file.
  7. Run larsoft program (lar -c wrapper.fcl).
  8. Stop the consumer process.
  9. Stop the SAM project.

To summarize, the initial steps are the same as for a manual project.

def=prodgenie_bnb_nu_uboone_mcc5.0
prjname=${USER}_${def}_`date +%Y%m%d_%H%M%S`
prjurl=`samweb start-project --defname=$def $prjname`
export TMPDIR=/uboone/data/users/$USER/temp
mkdir -p $TMPDIR
cpid=`samweb start-process --appfamily=art --appname=lar --appversion=v03_01_00 $prjurl`

SAM fcl configuration

Configuring an art program to read data from SAM requires four art services,
  • IFDH
  • FileCatalogMetadata
  • IFFileCatalog
  • IFFileTransfer

plus the art RootInput module and optionally the RootOutput module. The IFDH service contains the c++ samweb client proper.

If you have an fcl job configuration that works for ordinary file or file list input, the first step in making a SAM-capable fcl configuration is to add the FileCatalogMetadata service to your fcl configuration. Here is a typical configuration.

#include "services_microboone.fcl" 

services:
{
  FileCatalogMetadata:  @local::art_file_catalog_mc
}

Additionally, if you have a RootOutput module, you should include the "dataTier" parameter, like this:
outputs:
{
 out1:
 {
   module_type: RootOutput
   fileName:    "output.root" 
   dataTier:    "reconstructed" 
   compressionLevel: 1
 }
}

The standard fcl configurations in uboonecode/fcl all include configurations for the FileCatalogMetadata service already, and also include the dataTier parameter.

A convenient way to generate fcl configurations for the remaining art services is using a wrapper fcl file (this is the way project.py and condor_lar.sh do it). The wrapper fcl file needs to incorporate two parameters that were obtained during project initialization, namely, the project url ($prjurl in the above examples), and the consumer process id ($cpid).

Here is a shell script fragment for generating a wrapper fcl. This fragment assumes that your job fcl is called copy.fcl, which is a standard fcl file in uboonecode/fcl/utility.

cat > wrapper.fcl <<EOF
#include "copy.fcl" 

services.FileCatalogMetadata.processID: "${cpid}" 

services.user.IFDH:
{
  IFDH_BASE_URI: "http://samweb.fnal.gov:8480/sam/uboone/api" 
}

services.user.CatalogInterface:
{
  service_provider: "IFCatalogInterface" 
  webURI: "${prjurl}" 
}

services.user.FileTransfer:
{
  service_provider: "IFFileTransfer" 
}

source.fileNames: [ "${cpid}" ]

EOF

After the wrapper fcl is generated, the lar executable is invoked in the usual way, without specifying any input file(s) on the command line. Other standard art command line options are allowed (such as -n, number of events, in the following example).

lar -c wrapper.fcl -n 5

And to summarize, the finalization steps are the same as for a manual project.

samweb stop-process $prjurl $cpid
samweb stop-project $prjname

Reading from SAM in batch

Running a sam project in batch involves the same steps as an interactive sam project. However, in the case of batch jobs, the steps are divided among the submission script, separate batch jobs for starting and stopping the project, and batch worker jobs. Here is how these steps should normally be divided (best practice).

Submission script
  1. Choose a sam dataset definition for input.
  2. Generate a unique sam project name (which will be passed as a parameter to all batch jobs).
Start project batch job
  1. Start the sam project.
Batch worker
  1. Define your local scratch directory.
  2. Get project url using project name.
  3. Start a consumer process.
  4. Generate a sam wrapper fcl job file.
  5. Run larsoft program (lar -c wrapper.fcl).
  6. Stop the consumer process.
Stop project batch job
  1. Stop the sam project.

The standard tools (project.py and condor_lar.sh) are able to perform all of the steps needed for running SAM projects. Here is an example xml file for submitting batch jobs that read from SAM. The key field is the <inputdef>...</inputdef> field, which specifies the name of the input SAM dataset.

<?xml version="1.0"?>

<!-- Production Project -->

<!DOCTYPE project>

<project name="test">

  <!-- Group -->
  <group>uboone</group>

  <!-- Project size -->
  <numevents>10</numevents>

  <!-- Batch OS -->
  <os>SL6</os>

  <!-- Batch resources -->
  <resource>DEDICATED,OPPORTUNISTIC</resource>
  <server>-</server>

  <!-- Larsoft information -->
  <larsoft>
    <tag>v03_02_00</tag>
    <qual>e6:prof</qual>
  </larsoft>

  <!-- Project stages -->

  <stage name="copy">
    <inputdef>prodgenie_bnb_nu_uboone_mcc5.0</inputdef>
    <fcl>copy.fcl</fcl>
    <outdir>/pnfs/uboone/scratch/users/greenlee/test/testsam</outdir>
    <workdir>/uboone/app/users/greenlee/work/test/testsam</workdir>
    <numjobs>10</numjobs>
  </stage>
</project>

Submit batch jobs using the following command.

project.py --xml test.xml --submit --clean

When the batch jobs finish, check output using:
project.py --xml test.xml --check
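
If some jobs failed, project.py can use the SAM database to help generate makeup jobs (a sketch, assuming your version of project.py supports the --makeup option):

project.py --xml test.xml --makeup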

Defining SAM Datasets

The easiest way to read data from sam is to use predefined sam dataset definitions maintained by the microboone production team. These predefined datasets, including Monte Carlo dataset names, can be found on the microboone Analysis Tools web page (scroll down to "Datasets"). However, at some point it may be necessary to make your own dataset definitions.

Making SAM dataset definitions from scratch

Let's take a look at a predefined sam dataset.

samweb describe-definition prodgenie_bnb_nu_uboone_mcc5.0
Definition Name: prodgenie_bnb_nu_uboone_mcc5.0
  Definition Id: 667
  Creation Date: 2014-09-25T01:21:57
       Username: uboonepro
          Group: uboone
     Dimensions: file_type mc and data_tier reconstructed and ub_project.name prodgenie_bnb_nu_uboone and ub_project.stage mergeana and ub_project.version v02_05_01 and availability: anylocation

The key part of the output is the line labeled "Dimensions:". The dimensions are the query part of the dataset definition. We can define a new dataset with the identical query by cutting and pasting the dimension query into the following command.

samweb create-definition greenlee_test_definition "file_type mc and data_tier reconstructed and ub_project.name prodgenie_bnb_nu_uboone and ub_project.stage mergeana and ub_project.version v02_05_01 and availability: anylocation" 

We can obtain information about this definition in the usual ways.

samweb describe-definition greenlee_test_definition
samweb count-files "defname: greenlee_test_definition" 
samweb list-files "defname: greenlee_test_definition" 

Note that for this dataset, and for standard microboone MC datasets generally, files are mainly defined by a unique combination of three microboone-specific sam metadata attributes: ub_project.name, ub_project.stage, and ub_project.version. If you want to put your own files into sam, you can define your own unique values of these attributes. Another reasonable strategy would be to define your own unique values of the built in sam metadata attributes application_family, application_name, and application_version.
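
For example, to define a dataset covering your own files, you might use a dimension query like this (hypothetical attribute values):

samweb create-definition mydef "ub_project.name my_project and ub_project.stage my_stage and ub_project.version my_v01"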

Making SAM dataset definitions from existing dataset definitions

You can make sam dataset definitions by adding constraints to existing sam datasets. One reason you might want to do this is that an existing dataset definition contains too many files to process in a single batch submission. In general, you can "inherit" an existing definition by putting the string "defname: existing_definition" inside your dimension query. Here are some examples.

Adding a run number constraint

samweb create-definition mydef1234 "defname: existing_data_def and run_number >= 1000 and run_number < 2000" 

Adding a date/time constraint

samweb create-definition mydef1235 "defname: existing_def and start_time > '2013-11-29T17:10:00'" 
samweb create-definition mydef1236 "defname: existing_def and end_time > '2013-11-29T17:10:00'" 
samweb create-definition mydef1237 "defname: existing_def and create_date > '2013-11-29T17:10:00'" 

Here are some more tips about timestamps.
  • There are many accepted date/time formats.
  • Times should be quoted inside dimension strings.
  • The start_time and end_time parameters are built-in sam metadata attributes which generally reflect when the data file was opened and closed by the art program (in the case of art files). In principle, it is possible to have data files that don't have start_time and end_time (although standard microboone files should always have these attributes).
  • The create_date is a database built-in attribute that reflects when the file was declared to the sam database. All files in the sam database should have this attribute.

Getting a full list of SAM database dimensions
If you want to make custom datasets, you will need to know the full list of dimensions that can be queried. This command will print all available dimensions.

samweb list-files --help-dimensions

Sam bookkeeping

Snapshots and Frozen Datasets

A SAM dataset definition is a memorized query. As such, the contents of a dataset can change over time. Sometimes you want to define a collection of files that doesn't change. A snapshot is the SAM concept that corresponds to a collection of files that doesn't change or evolve.

You can create a snapshot based on a dataset definition using the command "samweb take-snapshot".

samweb take-snapshot existing-def

Snapshots are identified by a snapshot id (an integer). The snapshot id will be printed out by the "samweb take-snapshot" command, which you can use to create a frozen dataset definition.

samweb create-definition frozen-def "snapshot_id 123456" 
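
The two steps can be combined in a shell fragment (a sketch, assuming take-snapshot prints only the snapshot id on standard output):

snapid=`samweb take-snapshot existing-def`
samweb create-definition frozen-def "snapshot_id $snapid"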

Whenever you run a sam project, a snapshot is implicitly created for that project. You can determine the snapshot id corresponding to a particular project using the following command.

samweb project-summary myproject

Project Bookkeeping

Sam supports a number of project-related dimension constraints that can be used in support of bookkeeping. For example, the following are some kinds of queries that can be used to find out which files were processed by a particular sam project or list of projects.

samweb list-files "project_name myproject and consumed_status consumed" 
samweb list-files "project_name myproject1, myproject2 and consumed_status consumed" 
samweb list-files "project_name %prodgenie_bnb_nu_uboone_mcc5.0% and consumed_status consumed" 

Note that "%" is a database wildcard (the equivalent of "*" for unix shell globbing). Since project names are generally derived in such a way as to include the name of the dataset definition, a frequently useful database wildcard is one derived from a definition name, as "%definition_name%".

Sam also supports queries based on consumer process ids.

samweb list-files "consumer_process_id id1, id2 and consumed_status consumed" 

Recovery datasets.

Suppose you want to process a dataset definition mydef. You submit a sam project called myproject, but myproject doesn't process all of the files because some jobs failed, or the dataset is too large. Then you can define a recovery dataset like this.

samweb create-definition mydef_recover "defname: mydef minus (project_name myproject and consumed_status consumed)" 

Samweb provides a command line tool for automatically generating recovery dimensions (slightly different from the above).

samweb project-recovery myproject
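
The printed dimensions can be substituted directly into a new definition (a sketch, assuming project-recovery writes the dimensions to standard output):

samweb create-definition myproject_recover "`samweb project-recovery myproject`"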

Recursive datasets

Recursive datasets are a type of recovery dataset that can be reused indefinitely. Suppose you want to process a (possibly large, possibly growing) dataset called mydef. Then, you can define a recursive dataset like this.

samweb create-definition mydef_recur "defname: mydef minus (project_name %mydef_myproject% and consumed_status consumed)" 

The dataset definition mydef_recur will initially contain all files contained in mydef. Then you should submit analysis projects with project names that match the database wildcard %mydef_myproject%. Every time a file is successfully processed (consumed), it drops out of mydef_recur. You can keep submitting projects until no files are left in mydef_recur. In the case of a growing dataset, you can periodically submit projects that will only see newly added files.

Generating sam metadata.

Microboone sam metadata are described in document 2414 (public) in the microboone docdb.

Art has the feature of being able to generate and store sam metadata internally in artroot files. Technically, sam metadata are stored in the internal sqlite database (the same object where fcl parameters are stored). Art internal sam metadata are formatted as a json string.

Art supplies a utility program called sam_metadata_dumper, which is invoked by specifying an art file name as a single argument.

sam_metadata_dumper mydata.root

Sam_metadata_dumper dumps internal sam metadata in a human-readable json format. Sam_metadata_dumper is used by the metadata extractor (see below), as well as being useful for verifying the correctness of generated sam metadata.

The generation of art internal sam metadata is controlled by the following art services and modules.
  • RootOutput module (built in art module).
  • FileCatalogMetadata service (built in art service).
  • FileCatalogMetadataMicroBooNE service (MicroBooNE-specific art service).
  • TFileMetadataMicroBooNE service (MicroBooNE-specific art service).

RootOutput Configuration

RootOutput has two fcl parameters that are relevant for generating sam metadata.

streamName
dataTier

The streamName parameter is optional. If missing, streamName defaults to the RootOutput module label. In art versions prior to 1.17 (larsoft and uboonecode versions prior to v04_27_00), there was an art bug such that the stream name was always set to the RootOutput module label, regardless of the fcl parameter. The swizzler production releases, which are based on larsoft v04_26_04, are affected by this bug.

The microboone metadata proposal specifies the following allowed values for dataTier (more allowed values may be defined if needed).

raw
generated
simulated
detector-simulated
reconstructed-2d
reconstructed
root-tuple
root-histogram
unknown

The dataTier parameter has additional significance to RootOutput: specifying a non-empty value for this parameter triggers the storing of sam metadata. If dataTier is not specified, or is the empty string, no sam metadata will be stored in the output file.

FileCatalogMetadata Configuration

The built-in art service FileCatalogMetadata may be configured to generate additional per-job (the same for every output file) sam metadata. FileCatalogMetadata has the following fcl parameters, which add the corresponding metadata to each output file.

applicationFamily
applicationVersion
fileType
runType

By convention in microboone, the applicationFamily is always "art," the applicationVersion corresponds to the larsoft release version, and the fileType may be "data" or "mc."

Fcl parameters for FileCatalogMetadata should be configured in the "services" block of the fcl job file. Default values may be specified using standard include files as follows.

#include "services_microboone.fcl" 

services:
{
  FileCatalogMetadata:  @local::art_file_catalog_mc
}

or

#include "services_microboone.fcl" 

services:
{
  FileCatalogMetadata:  @local::art_file_catalog_data
}

If you are submitting batch jobs using project.py, the work flow may generate overrides for some FileCatalogMetadata parameters based on project xml fields.

FileCatalogMetadataMicroBooNE Configuration

FileCatalogMetadataMicroBooNE is an art service that lives in uboonecode (Utilities package). FileCatalogMetadataMicroBooNE adds MicroBooNE-specific per-job metadata.

The following fcl parameters are understood by the FileCatalogMetadataMicroBooNE service.

FCLName
FCLVersion
ProjectName
ProjectStage
ProjectVersion

Fcl parameters for FileCatalogMetadataMicroBooNE should be configured in the "services" block of the fcl job file. None of these parameters have default values. That is, the configuration of FileCatalogMetadataMicroBooNE service needs to be completely specified for each job. For jobs submitted using project.py, a complete configuration of the FileCatalogMetadataMicroBooNE service is generated as part of the work flow and included in a wrapper fcl based on parameters in the project xml file. Here is an example configuration found in the wrapper fcl for some job.

services.FileCatalogMetadataMicroBooNE: {
  FCLName: "prod_muminus_0.1-2.0GeV_isotropic_uboone.fcl" 
  FCLVersion: "v05_14_00" 
  ProjectName: "greenlee_prod_muminus_0.1-2.0GeV_isotropic_uboone" 
  ProjectStage: "gen" 
  ProjectVersion: "v05_14_00" 
}

TFileMetadataMicroBooNE Configuration

TFileMetadataMicroBooNE is an art service that lives in uboonecode (Utilities package). TFileMetadataMicroBooNE generates sam metadata for non-artroot files.

The art framework does not provide any support for generating sam metadata for non-artroot files. Therefore, TFileMetadataMicroBooNE is able to generate all sam metadata (built-in and MicroBooNE-specific) for the files that it supports. TFileMetadataMicroBooNE is able to generate sam metadata for one non-artroot file per art program execution, which file would typically be the histogram and/or ntuple root file managed by art's TFileService, but could be any non-artroot file created by an art program. Generated sam metadata are output as a json file.

The following fcl parameters are understood by the TFileMetadataMicroBooNE service.

GenerateTFileMetadata
JSONFileName
dataTier
fileFormat

The first parameter GenerateTFileMetadata is a boolean which determines whether sam metadata should be generated or not. Parameter JSONFileName should conventionally be set to match the name of the non-artroot file for which sam metadata are being generated, plus an extra ".json" at the end (this is the naming convention that is normally expected by the work flow). Parameters dataTier and fileFormat are usually set as "root-tuple" and "root" respectively. Remaining sam metadata are captured using callbacks or inherited from the other sam metadata art services (FileCatalogMetadata and FileCatalogMetadataMicroBooNE).

Fcl parameters for TFileMetadataMicroBooNE should be configured in the "services" block of the fcl job file. A default configuration may be specified using standard include files as follows.

#include "services_microboone.fcl" 

services:
{
  TFileMetadataMicroBooNE:  @local::microboone_tfile_metadata
}

In the default configuration, fcl parameter JSONFileName is specified as "ana_hist.root.json," which is appropriate for the default AnalysisTree ntuple file.

For jobs submitted by project.py, the work flow adds the above default configuration to the wrapper fcl file that it generates, if appropriate (that is, if it determines that sam metadata are being generated), so there is normally no need to include a configuration for TFileMetadataMicroBooNE in a job file. If you want to override some fcl parameters of TFileMetadataMicroBooNE to nondefault values in a job that is being run by project.py, the correct way to do that is to redefine the alias microboone_tfile_metadata, rather than to configure TFileMetadataMicroBooNE directly in the services block.

Storing files in sam.

Extracting sam metadata.

Before files can be declared to sam, art internal sam metadata need to be "extracted," which is to say converted into some external format that is understandable by samweb.

Given that there is no lab-wide convention for the format of internal sam metadata (other than it consists of name-value pairs), metadata extraction is experiment-specific, and there is no standard utility for doing this.

Microboone has a python script called extractor_dict.py, which is part of the ubtools ups product. When invoked as a command line tool with an art file as argument, extractor_dict.py will convert art internal sam metadata into a .json file, which can be used with samweb command line tools. Extractor_dict.py can also be used as a python module (imported), in which case it can convert art internal sam metadata into a python dictionary, which can be used with the samweb python client.

In command line mode, extractor_dict.py is invoked with a single argument, which is an art file containing internal sam metadata. External (.json) format sam metadata is written to standard output.

extractor_dict.py mydata.root > mydata.root.json

Declaring files to sam.

Declaring a file to sam simply means storing metadata in the sam database. The line mode command for declaring a file is as follows.

samweb declare-file mydata.root.json

Adding file locations.

The sam database is used to store file locations. Locations are considered to be separate from sam metadata. A file can have no locations or many locations. Locations can be on disk or tape.

The following command may be used to list all of the locations associated with a file.

samweb locate-file prodgenie_bnb_nu_t0_uboone_10433669_49_gen_10433828_49_g4_11828882_49_detsim_11874447_49_reco_merged.root
uboonedata:/uboone/data/uboonepro/reco/S2013.10.21/prodgenie_bnb_nu_t0_uboone/12002360_0
enstore:/pnfs/uboone/mc/uboone/reconstructed/prodgenie_bnb_nu_t0_uboone/merge/v1_5(173@vp8134)

The location prefix "uboonedata:" means the bluearc disk. The location prefix "enstore:" means tape.

Use the following command to add a disk location.

samweb add-file-location mydata.root uboonedata:/uboone/data/users/me/my/path/

File locations can also be removed.

samweb remove-file-location mydata.root uboonedata:/uboone/data/users/me/my/path/

Note that the location is just the directory, not including the file name.

Uploading files to tape.

Files are uploaded to tape (enstore) by the file transfer service. Uploading is triggered by copying files to a specially designated drop box directory on either dCache or BlueArc disk. Microboone currently has six drop boxes. The preferred drop boxes are the three located on dCache (/pnfs/uboone/scratch/uboonepro/dropbox):

/pnfs/uboone/scratch/uboonepro/dropbox/data/uboone/raw/
/pnfs/uboone/scratch/uboonepro/dropbox/data/uboone/reconstructed/
/pnfs/uboone/scratch/uboonepro/dropbox/mc/uboone/reconstructed/

The other three drop boxes are located on the BlueArc volume /uboone/data and are not recommended but are maintained as a backup for dCache.

/uboone/data/uboonepro/dropbox/data/uboone/raw/
/uboone/data/uboonepro/dropbox/data/uboone/reconstructed/
/uboone/data/uboonepro/dropbox/mc/uboone/reconstructed/

In general, the drop box directory should follow the pattern /uboone/data/uboonepro/dropbox/<file-type>/<group>/<data-tier>, based on the corresponding sam metadata.
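
For example, you might copy a reconstructed MC file into the corresponding dCache drop box using ifdh (a sketch; whether you have write permission depends on your role):

ifdh cp mydata.root /pnfs/uboone/scratch/uboonepro/dropbox/mc/uboone/reconstructed/mydata.root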

Microboone SAM Tools

There are a number of microboone tools, available as scripts in the ubtools ups product (setup ubtools), that support various aspects of running sam projects.

SAM wrapper fcl

Script make_sam_wrapper.sh can be used to generate a sam wrapper fcl file.

make_sam_wrapper.sh <fcl file> <project url> <consumer process id> > wrapper.fcl

Start and stop project batch scripts

Ubtools contains two scripts called condor_start_project.sh and condor_stop_project.sh, which are batch-ready scripts for starting and stopping sam projects.

condor_start_project.sh --sam_defname=<definition-name> --sam_project=<project-name> --outdir=<dir>
condor_stop_project.sh --sam_project=<project-name> --outdir=<dir>

Batch worker script

The general purpose larsoft batch worker script condor_lar.sh knows how to do the following actions in support of sam projects.
  1. Start consumer process.
  2. Generate sam wrapper.
  3. Stop consumer process.
  4. Save the sam project name in a file called sam_project.txt.
  5. Save the consumer process id in a file called cpid.txt.

These actions are triggered if condor_lar.sh is invoked with options --sam_defname and --sam_project.

condor_lar.sh --sam_defname=<definition-name> --sam_project=<project-name>

Metadata extractor

For converting art internal metadata to external format.

extractor_dict.py mydata.root > mydata.root.json

Production script

The production script project.py also supports many functions related to sam and sam metadata.

Project.py will do the following things for you when reading files from sam.
  • Generate a unique sam project name.
  • Generate a .dag file and submit it using dagNabbit.py.
  • Use the sam database to help with bookkeeping and generating makeup jobs.

Sam input is selected in project xml files using the xml tag <inputdef>.

<stage name="ana">
  <inputdef>prodgenie_bnb_nue_t0_cosmic_3window_uboone_summer2013</inputdef>

Project.py will do the following things for you when generating sam metadata or storing files in sam.
  • Generate a wrapper fcl job file for configuring RootOutput, FileCatalogMetadata, and FileCatalogMetadataExtras based on sam metadata stored in project (.xml) file. This reduces the need to maintain fcl job files.
  • Declare all files associated with a project (project.py --declare).
  • Disk location management.
      project.py --check_locations
      project.py --add_locations
      project.py --remove_locations
      project.py --clean_locations
    
  • Upload files to tape (project.py --upload).
  • Create sam dataset definition (project.py --define).

All of the above functions make use of data stored in the project (.xml) file.

Copying a file with xrootd

macOS instructions

These instructions have been tested with macOS, but they should be adaptable to any Linux distribution with small changes.
It is possible to transfer files to your local laptop using xrootd, without going through the mounted filesystem (which is what happens when you use scp or rsync). To do that you will need the following on your laptop:
  • a locally installed version of ROOT with xrootd enabled (it should be enabled by default; check that you have the xrdcp command)
  • CVMFS
  • Virtual Organization Membership Service client (voms)

CVMFS
Since macOS does not provide kx509, we will need to mount the CVMFS Fermilab directory and set up the kx509 product.
First of all, you need to install FUSE (https://osxfuse.github.io) and the CVMFS macOS package (https://cernvm.cern.ch/portal/filesystem/downloads). Then, you need to edit the /etc/cvmfs/default.local file, setting

CVMFS_REPOSITORIES=fermilab.opensciencegrid.org

In case you get an error about CVMFS_HTTP_PROXY not being set, you may also need to add the following line to your /etc/cvmfs/default.local file:
CVMFS_HTTP_PROXY=DIRECT

Now, you can mount it and set up the environment with
mkdir -p /cvmfs/fermilab.opensciencegrid.org
sudo mount -t cvmfs fermilab.opensciencegrid.org /cvmfs/fermilab.opensciencegrid.org
source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup

Also, you may want to mount the uboone directory with
mkdir -p /cvmfs/uboone.opensciencegrid.org
sudo mount -t cvmfs uboone.opensciencegrid.org /cvmfs/uboone.opensciencegrid.org

You now need to load the corresponding ups module:
setup kx509

Virtual Organization Membership Service client
Now, we need to set up the Virtual Organization Membership Service client. On macOS, you can install the voms package with the brew package manager (https://brew.sh). Equivalent packages are available for the most common Linux distributions, too.
Once you've installed voms, you need the right configuration files. You can copy them from any uboonegpvm machine: you need the /etc/grid-security directory and the /etc/vomses file, copied to the same paths on your local machine.
Now, you do:

kinit
kx509
voms-proxy-init -noregen -voms fermilab:/fermilab/uboone/Role=Analysis
xrdcp xroot://fndca1.fnal.gov/your/xroot/path /your/local/destination

For example, the xroot address for a file located in /pnfs/uboone/scratch becomes xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/uboone/scratch/

Access to dCache through xrootd always takes the form xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/ followed by the regular path to the file or directory.
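
For example, to copy a single file (hypothetical path) from scratch dCache to the current directory:

xrdcp xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/uboone/scratch/users/<username>/myfile.root .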

Alternative to voms
If you can't or don't want to install voms on your laptop, there is an alternative method. Connect to any uboonegpvm machine and run:

source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup
kx509
voms-proxy-init -noregen -voms fermilab:/fermilab/uboone/Role=Analysis

Now, you have to securely copy (with scp) the proxy that has been generated in /tmp/ (something like /tmp/x509xxxxxx) to your local /tmp/ folder and then set
export X509_USER_PROXY=/tmp/x509xxxxxx

After that you should be able to copy files using xrdcp.

Note: you still have to install a version of ROOT with xrootd enabled as well as CVMFS (see instructions above).
Also, you need to copy the /etc/grid-security directory from a gpvm machine to the same path locally (otherwise you may get an "unknown CA" error).

A script to copy files from a grid output directory

The following script indexes all the subdirectories that you will find at the end of multiple batch jobs and copies the output root files to your local directory.
Usage: source ./copy_files.sh username/path/to/outdir/ output_filename.root

For example, if all the output files have been saved in
/pnfs/uboone/scratch/users/mdeltutt/v06_26_01_10/grid_output/*/output.root,
then you will do
source ./copy_files.sh mdeltutt/v06_26_01_10/grid_output/ output.root.

copy_files.sh:

#!/bin/bash
# Get a grid proxy for xrootd access.
kx509
voms-proxy-init -noregen -voms fermilab:/fermilab/uboone/Role=Analysis
# List the job output subdirectories under the user's scratch area.
DIRS=`xrdfs xroot://fndca1.fnal.gov/ ls /pnfs/fnal.gov/usr/uboone/scratch/users/$1`
c=0
for d in $DIRS
do
  echo "Copying from $d ..." 
  # Append a sequence number so files from different subdirectories don't collide.
  file="${2%.*}"_$c.root
  # Copy the file named by the second argument (the original hardcoded one filename here).
  xrdcp xroot://fndca1.fnal.gov/$d/$2 $file
  let c=c+1
done