How to use Recursive Datasets in Production

This article explains how to use recursive datasets for production, in particular with POMS, but the same concepts can be used in standalone project.py projects.

A recursive dataset is a sam dataset definition that includes a "minus" clause to automatically remove files from a static input dataset after they have been processed.

Snapshot Recursion

With snapshot recursion, a file will drop out of a recursive dataset as soon as it is included in a sam project snapshot. The basic template for this kind of dataset definition is as follows:

samweb create-definition recurdef "defname: staticdef minus snapshot_for_project_name recurdef_%" 

where "staticdef" is the static input dataset definition, and "recurdef" is the recursive definition. To configure a project with snapshot recursion, and have project.py create the recursive dataset for you, include the following elements in your project.py xml file.
<inputdef>staticdef</inputdef>   <!-- Static input dataset -->
<recurdef>recurdef</recurdef>    <!-- Recursive input dataset -->
<recurtype>snapshot</recurtype>
<prestart>1</prestart>           <!-- Prestart flag -->

The element <prestart> instructs project.py to start the sam project before job submission, rather than from a batch job. This removes files from the recursive dataset at the earliest possible time, which reduces the chance of input files being delivered more than once.

Step by Step Instructions

  • Define a static sam dataset definition containing the files you want to process. Add this dataset definition in an <inputdef> element of your xml file stage.
  • Invent a unique name for your recursive dataset definition (it can be anything and doesn't have to match <inputdef>). Add this name in a <recurdef> element of your xml file stage. Do not create this definition yourself; project.py will create it.
  • Add "<recurtype>snapshot</recurtype>" in your xml file stage.
  • Add "<prestart>1</prestart>" in your xml file stage.
  • If you want to restart a project from scratch, use a different name for <recurdef>. A complete example follows this list.
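
Putting these steps together, a minimal sketch might look like the following. The dataset name mydata_mc, the recursive name mydata_mc_recur, and the file dimensions are hypothetical; substitute your own. First create the static definition:

samweb create-definition mydata_mc "file_type mc and data_tier reconstructed and run_number 1234"

then add these elements to the stage in your xml file:
<inputdef>mydata_mc</inputdef>
<recurdef>mydata_mc_recur</recurdef>
<recurtype>snapshot</recurtype>
<prestart>1</prestart>

On submission, project.py creates mydata_mc_recur as "defname: mydata_mc minus snapshot_for_project_name mydata_mc_recur_%".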

Advantages of Snapshot Recursion

  • Easy to configure.
  • Low chance of getting duplicate files.
  • Can be used even if output files aren't declared to sam or given locations in sam.

Disadvantages of Snapshot Recursion

  • Single pass processing. There is no built-in mechanism to retry failed jobs. You may have to create a makeup dataset definition by hand at some point.

Child Recursion (Basic)

With child recursion, a file will drop out of a recursive dataset as soon as a child file appears in sam. The basic template for this kind of dataset definition is as follows:

samweb create-definition recurdef "defname: staticdef minus isparentof: ( <child dimensions> with availability physical )" 

where "staticdef" is the static input dataset definition, "recurdef" is the recursive definition, and "<child dimensions>" are a set of sam query dimensions that can identify child files. To configure a project with basic child recursion, and have project.py create the recursive dataset for you, include the following elements in your project.py xml file.
<inputdef>staticdef</inputdef>   <!-- Static input dataset -->
<recurdef>recurdef</recurdef>    <!-- Recursive input dataset -->
<recurtype>child</recurtype>
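
For illustration only, a definition generated for child recursion might look like the following, assuming child files can be identified by their data tier; the actual child dimensions depend on your stage configuration:

samweb create-definition mydata_mc_recur "defname: mydata_mc minus isparentof: ( data_tier reconstructed with availability physical )"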

The main advantage of child recursion is that failed files can be resubmitted automatically.

The main disadvantage of basic child recursion is that there is a possibility of the same file being delivered more than once, due to the time delay between job submission and child files becoming available in sam. To avoid duplicate processing of input files, you need to wait before submitting new batch jobs until previously submitted batch jobs have finished and all output files have been stored and given locations in sam.
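
Between submissions, you can check how many unprocessed (or failed) files remain by counting the files in the recursive dataset (the definition name here is hypothetical):

samweb count-files "defname: mydata_mc_recur"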

Obviously, child recursion can only be used if output files are declared and given locations in sam.

Step by Step Instructions

  • Define a static sam dataset definition containing the files you want to process. Add this dataset definition in an <inputdef> element of your xml file stage.
  • Invent a unique name for your recursive dataset definition (it can be anything and doesn't have to match <inputdef>). Add this name in a <recurdef> element of your xml file stage. Do not create this definition yourself; project.py will create it.
  • Add "<recurtype>child</recurtype>" in your xml file stage.
  • Wait for previously submitted jobs to finish before submitting new jobs. See the example after this list.
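
Putting these steps together, a minimal sketch of the stage elements (names hypothetical):
<inputdef>mydata_mc</inputdef>
<recurdef>mydata_mc_recur</recurdef>
<recurtype>child</recurtype>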

Advantages of Basic Child Recursion

  • Easy to configure.
  • Failed files can be retried automatically.

Disadvantages of Basic Child Recursion

  • Possibility of duplicate files.

Child Recursion (Advanced)

Advanced child recursion attempts to correct the main defect of basic child recursion, which is the possibility of processing files more than once. It does this by adding additional minus clauses in the recursive dataset definition to remove files that are currently being processed, but don't yet have children in sam. The template for advanced child recursion is as follows.

samweb create-definition recurdef "defname: staticdef minus isparentof: ( <child dimensions> with availability physical ) minus <files in active sam projects> minus <files waiting to be stored>" 

As sam does not provide built-in query dimensions for the additional minus clauses, project.py will create and/or update sam dataset definitions for active sam projects and for files waiting to be stored. The template for advanced child recursion then looks like this:

samweb create-definition recurdef "defname: staticdef minus isparentof: ( <child dimensions> with availability physical ) minus defname: activedef minus defname: waitingdef"

In the above definition, "activedef" and "waitingdef" are sam dataset definitions that are created and/or updated by project.py when you submit new jobs. To configure a project with advanced child recursion, include the following elements in your xml stage.
<inputdef>staticdef</inputdef>      <!-- Static input dataset -->
<recurdef>recurdef</recurdef>       <!-- Recursive input dataset -->
<recurtype>child</recurtype>
<prestart>1</prestart>              <!-- Prestart flag -->
<activebase>activedef</activebase>  <!-- Project name stem for matching -->
<dropboxwait>3</dropboxwait>        <!-- Max wait for file locations (days) -->

The purpose of adding the prestart flag is so that files will be added to the active projects dataset as soon as possible. Element <activebase> represents a stem to be matched against sam project names; it should be the same as <recurdef> or a truncated version of <recurdef>. Element <dropboxwait> should be a floating point number that represents a maximum waiting period, in days, between the time that a file is declared to sam and the time when it will have a location. The assumption is that files that have been declared to sam and still don't have a location after <dropboxwait> days will never get a location, and can therefore be resubmitted.

Step by Step Instructions

  • Define a static sam dataset definition containing the files you want to process. Add this dataset definition in an <inputdef> element of your xml file stage.
  • Invent a unique name for your recursive dataset definition (it can be anything and doesn't have to match <inputdef>). Add this name in a <recurdef> element of your xml file stage. Do not create this definition yourself; project.py will create it.
  • Add "<recurtype>child</recurtype>" in your xml file stage.
  • Add "<prestart>1</prestart>" in your xml file stage.
  • Add an <activebase> element in your xml stage that is compatible with <recurdef>.
  • Add a <dropboxwait> element in your xml stage. A full example is shown below.
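
Putting these steps together, a minimal sketch (names hypothetical; note that <activebase> is a truncated version of <recurdef>):
<inputdef>mydata_mc</inputdef>
<recurdef>mydata_mc_recur_v2</recurdef>
<recurtype>child</recurtype>
<prestart>1</prestart>
<activebase>mydata_mc_recur</activebase>
<dropboxwait>3</dropboxwait>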

Advantages of Advanced Child Recursion

  • Low chance of duplicate files.
  • Failed files can be retried automatically.

Disadvantages of Advanced Child Recursion

  • More complicated to configure.

One-to-one Processing

Use one-to-one processing when you want to ensure that each input file produces one output file. To do this, add the following element in your xml file:
<maxfilesperjob>1</maxfilesperjob>
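
In context, a minimal sketch (the <numjobs> value and dataset name are illustrative):
<inputdef>mydata_mc</inputdef>
<numjobs>100</numjobs>                <!-- Maximum number of workers -->
<maxfilesperjob>1</maxfilesperjob>    <!-- One input file per worker -->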

Merging

Sometimes it is beneficial to process multiple input files into a single output file. A typical use case is filtering, where some input events are dropped from the output.

There are three xml elements that allow you to control how many input files are processed in a single job.

<maxfilesperjob>10</maxfilesperjob>      <!-- Max input files per worker -->
<targetsize>200000000000</targetsize>    <!-- Target input size per worker (bytes) -->
<singlerun>1</singlerun>                 <!-- Deliver files from a single run only -->

For any kind of merging set element <maxfilesperjob> to some value larger than one. Set element <targetsize> to the total size, in bytes, of input files that should be delivered to each worker. Based on the number of files and average size of files in the input dataset, project.py may use <targetsize> to adjust the number of submitted jobs downward (from a maximum specified by <numjobs>) to achieve the desired amount of data for each worker. Use element <singlerun> to specify that files delivered to each worker will come from the same run. This will allow run selection (e.g. for data quality) to be made at the sam level later on.
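
As a worked example (numbers hypothetical): if the input dataset contains 1000 files averaging 2 GB each, and <targetsize> is 20000000000 (20 GB), each worker should receive about 10 files. In that case project.py will reduce the number of submitted jobs to about 100 even if <numjobs> is larger, and <maxfilesperjob> should be at least 10.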

Prestaging

Use element <prestagefraction> to request prestaging of files by a single batch job before starting workers.

<prestagefraction>1</prestagefraction>

The numeric argument of element <prestagefraction> is the fraction of files in the project snapshot that should be prestaged before workers are allowed to start. It should be a floating point number between 0 and 1, where 1 means prestage every file in the snapshot.
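
Independently of project.py, a dataset can also be prestaged by hand with samweb, assuming your samweb version provides the prestage-dataset command (the definition name here is hypothetical):

samweb prestage-dataset --defname=mydata_mc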

Best Practices

  • Prefer Advanced Child Recursion. Avoid snapshot recursion.
  • Configure either one-to-one processing or merging. That is, always specify element <maxfilesperjob>. If <maxfilesperjob> is larger than one, also specify a <targetsize> and single run processing (<singlerun>1</singlerun>). A good target size is one that will generate output files with a typical size of 2 GB (see the worked example below).
  • Always use prestaging (i.e. include element <prestagefraction> with numeric argument close to one).
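
As a worked example of choosing a target size (numbers hypothetical): if a filter stage writes output roughly one tenth the size of its input, then about 20 GB of input per worker, i.e. <targetsize>20000000000</targetsize>, will yield output files of about 2 GB.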