Definition Management with defman


The defman definition management script is designed to automate the process of creating dataset definitions. In particular, the script optimizes creation of a chain of definitions for various data tiers where files in all tiers match a base set of constraints.


The defman interface is through the command line, with usage as follows:

     defman  [-h] [--dry DRY] [--prodCAF] [--linear] [--dq] [--out OUT] [--delete]

The script has one required argument: a JSON file which specifies the data tiers and constraints that will be used to create the definitions. There are just a few other options:

  --dry DRY          Dryness level, integer. Three modes:
                       0: make datasets.
                       1: count-files and show constraints (default).
                       2: just show constraints.
  --prodCAF, -p      Create standard production suite of
                     reco-pidpart-lem-pid-caf, including funny lem draining
                     things.
  --linear           Make draining datasets assuming linear flow. Uses
                     'children' specified in each tier block. Branching is ok,
                     but makes no attempt at combining branches.
  --dq               Make a set of datasets with the good run criterion
                     (dq.isgoodrun true) corresponding to all input datasets.
  --out OUT, -o OUT  File to which output is directed.
  --delete           Delete all datasets that are specified in

NB: In a pinch, it is very handy to use the -h or --help option to jog your memory.

The -p, --prodCAF option is handy since that mode understands the crazy LEM definitions. For processing structures that branch but never re-merge (so that no tier requires multiple parents), the children field in the configuration, together with --linear, handles the draining datasets. The --dq option duplicates every definition requested in the configuration file, producing a second set that adds the dq.isgoodrun true constraint and includes _goodruns in the definition name. Use the -o, --out option to produce a record of the definitions created. The --delete option deletes the definitions if they already exist, assuming that they can be deleted without the "admin" role for samweb.
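The effect of --dq can be pictured with a small Python sketch. This is not defman's code: the function name is hypothetical, and it assumes that _goodruns is appended to the end of the definition name and that the good-run constraint is joined on with "and".

```python
def add_goodruns_definitions(definitions):
    """Return a copy of `definitions` extended with good-run variants.

    `definitions` maps a definition name to its constraint string.
    """
    result = dict(definitions)
    for name, constraints in definitions.items():
        # Assumption: the suffix goes at the end of the definition name.
        goodruns_name = name + "_goodruns"
        # The extra constraint named in the text above.
        result[goodruns_name] = constraints + " and dq.isgoodrun true"
    return result


requested = {
    "prod_reco_S15-05-04_nd_genie_fhc_nonswap_v2":
        "data_tier reco and reconstructed.base_release S15-05-04",
}
doubled = add_goodruns_definitions(requested)
for name in sorted(doubled):
    print(name, "->", doubled[name])
```

Each requested definition is kept, and one additional definition is produced for it.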

The --dry option itself has three modes, as listed above. The first, --dry 0, makes the definitions. The other two are test modes for verifying that the file counts and the constraints are as expected.
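A cautious workflow is to check with a test mode first and only then create the definitions. The Python sketch below only assembles the command lines, using the flags documented above; the helper name defman_command is hypothetical, and placing the JSON file before the options follows the example in this page rather than any stated requirement.

```python
def defman_command(config_path, dry=1, out=None):
    """Build a defman argument list for the given dryness level."""
    if dry not in (0, 1, 2):
        raise ValueError("dry must be 0, 1, or 2")
    command = ["defman", config_path, "--dry", str(dry)]
    if out is not None:
        command += ["--out", out]
    return command


# First count files and show constraints, then make the definitions.
check = defman_command("name_of_file.json", dry=1)
create = defman_command("name_of_file.json", dry=0, out="name_of_file.out")
print(" ".join(check))
print(" ".join(create))
```

On a machine where defman is installed, each list could be passed to subprocess.run(command, check=True).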

Basic Example

To run defman at the command line, provide the name of the JSON file and any other options:

$ defman  name_of_file.json --out name_of_file.out

The command above uses the --out option to write a log of the definitions. An example of the output written to the log is as follows:

****** artdaq *****
Definition name: prod_artdaq_FA14-10-03x.a_nd_genie_fhc_nonswap_v2
Constraints: data_tier artdaq  and  defname: parent_FA14-10-03x.a_nd_genie_fhc_nonswap_v2  and  simulated.base_release FA14-10-03x.a

****** reco *****
Definition name: prod_reco_S15-05-04_nd_genie_fhc_nonswap_v2
Constraints: data_tier reco  and  defname: parent_FA14-10-03x.a_nd_genie_fhc_nonswap_v2  and  reconstructed.base_release S15-05-04

****** pidpart *****
Definition name: prod_pidpart_S15-05-04a_nd_genie_fhc_nonswap_v2
Constraints: data_tier pidpart  and  defname: parent_FA14-10-03x.a_nd_genie_fhc_nonswap_v2  and  pidpart.base_release S15-05-04a  and  reconstructed.base_release S15-05-04

****** lemsum *****
Definition name: prod_lemsum_S15-05-04_nd_genie_fhc_nonswap_v2
Constraints: data_tier lemsum  and  defname: parent_FA14-10-03x.a_nd_genie_fhc_nonswap_v2  and  pidpart.base_release S15-05-04  and  reconstructed.base_release

This output is very similar to what is printed to the screen, which also includes some extra output depending on the --dry setting. With --dry 1, the default, the screen output includes file counts for each definition. With --dry 0, it includes the result of dataset creation.

JSON Configuration File

The definitions are configured through a JSON file. JSON is a simple data format for strings, numbers and booleans, which can be stored in lists and in sets of key-value pairs called objects. It takes just a few minutes to master the standard; the homepage which describes the standard is a good place to start.

In short, the allowed values are true, false, null, any string wrapped in double quotes, and any number. These values can be stored in lists wrapped in square brackets ([ ]), with entries separated by commas. The other allowed data structure is called an object, which is a set of key:value pairs wrapped in curly braces ({ }) and separated by commas. Keys must be strings; values can be any of those described above, or another object or a list.
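As a quick illustration of these rules, the following sketch parses a small document with Python's standard json module. The keys and values are made up for the example and have no meaning to defman. Note that the JSON standard, and therefore this parser, accepts only double-quoted strings.

```python
import json

# One object holding each kind of value: booleans, null, a number,
# a string, a list, and a nested object.
text = """
{
  "flag": true,
  "missing": null,
  "count": 3,
  "name": "reco",
  "tags": ["a", "b"],
  "nested": {"key": "value"}
}
"""
data = json.loads(text)
print(data["flag"], data["missing"], data["tags"], data["nested"]["key"])

# Single-quoted strings are not valid JSON and are rejected.
try:
    json.loads("{'key': 'value'}")
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False
print("single quotes accepted:", single_quotes_accepted)
```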

General Parameters

The "general" parameters exist at top level in the JSON file/object. These parameters affect all of the datasets which will be created. The general parameters are:

  Parameter  Description
  base       Base constraints which will be used for all tiers. These usually
             include detector, flavor, trigger, run range, etc. Leading and
             trailing "and" operators should not be included; defman takes
             care of those. Tier names and software releases should be
             specified in the tier blocks, along with arbitrary additional
             constraints. See the tier-specific parameters section for more
             details.
  special    Name for the definitions which describes what makes them
             special. The definition names all begin with prod_tier_release_
             and end with the value of special. The production standard for
             this would be detector_flavorset_special. Leading and trailing
             underscores should not be included, but ones in the middle
             should be.
  tiers      JSON object which enumerates all tiers for which definitions
             will be made. See the tier-specific parameters section for more
             details.
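The naming pattern described for special can be sketched in Python as follows. The function is hypothetical and is not taken from defman; it simply joins the pieces with underscores, which reproduces the reco definition name shown in the log example earlier on this page.

```python
def definition_name(tier, release, special):
    """Join the pieces as prod_<tier>_<release>_<special>."""
    # Leading and trailing underscores in `special` are not wanted,
    # so strip them defensively before joining.
    return "_".join(["prod", tier, release, special.strip("_")])


name = definition_name("reco", "S15-05-04", "nd_genie_fhc_nonswap_v2")
print(name)
```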

Tier-specific Parameters

The tier-specific parameters are held within the object mapped to the general "tiers" key described above. Each tier for which definitions will be made is represented in this block by a key and an object with configurable parameters. The data-tier will be taken from the key for each block, unless the actual-tier parameter is set. The parameters available to configure each tier/definition are as follows:

  Parameter      Description
  release        NOvA software version in which the files will be or were
                 processed. Example: S12-12-12. If this varies across the
                 dataset, the any-release parameter should be set instead.
  release-field  Metadata parameter which is used to store the release for
                 this data-tier. Example: reconstructed.base_release
  any-release    Indicates that the files in this tier come from a variety of
                 releases, so that release can be excluded from the
                 definition name and constraints. Overrides release and
                 release-field.
  actual-tier    Indicates that the key for this tier block does not actually
                 match the data-tier. Mainly intended for --prodCAF mode,
                 when particular keys are required but tiers could actually
                 change. Example: for MRCC, the data-tier is mrccreco, but
                 the JSON tier key would be reco.
  extra          Any other constraints which should be added to the
                 definition for this tier only. Leading and trailing "and"
                 operators should be excluded.
  children       JSON list of children tiers, identified by key in the tiers
                 block (as opposed to the actual-tier parameter). In --linear
                 or --prodCAF mode, draining datasets are created which
                 account for all children.
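To show how these parameters might combine into a constraint string, here is a minimal Python sketch. It is an illustration only, not defman's implementation: the function name, the order of the pieces (tier, base, release, extra), the single-space " and " separator, and the treatment of any-release as set when its value reads "true" are all assumptions, chosen to resemble the log example earlier on this page.

```python
def tier_constraints(key, block, base):
    """Build a constraint string for one tier block (hypothetical helper)."""
    # actual-tier, when present, replaces the block key as the data tier.
    tier = block.get("actual-tier", key)
    parts = ["data_tier " + tier, base]
    # Assumption: any-release counts as set when its value reads "true".
    any_release = str(block.get("any-release", "")).lower() == "true"
    if not any_release:
        parts.append(block["release-field"] + " " + block["release"])
    if block.get("extra"):
        parts.append(block["extra"])
    return " and ".join(parts)


base = "defname: parent_FA14-10-03x.a_nd_genie_fhc_nonswap_v2"
reco = {"release": "S15-05-04", "release-field": "reconstructed.base_release"}
print(tier_constraints("reco", reco, base))

# MRCC-style block: the key stays "reco", the data tier is mrccreco.
mrcc = {"actual-tier": "mrccreco", "any-release": "True", "extra": "nova.subversion 8"}
print(tier_constraints("reco", mrcc, base))
```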

In --prodCAF mode, all standard production tiers must be included in the tiers block: "artdaq", "reco", "pidpart", "lemsum", "lempart", "pid", "caf".
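Before running in --prodCAF mode, it can help to confirm that a configuration names all seven tiers. The check below is a hypothetical pre-flight helper in Python, not part of defman; it operates on a configuration that has already been loaded into a dictionary.

```python
PRODCAF_TIERS = ("artdaq", "reco", "pidpart", "lemsum", "lempart", "pid", "caf")


def missing_prodcaf_tiers(config):
    """Return the standard production tiers absent from config["tiers"]."""
    tiers = config.get("tiers", {})
    return [tier for tier in PRODCAF_TIERS if tier not in tiers]


# A deliberately incomplete configuration, for illustration only.
config = {"tiers": {"artdaq": {}, "reco": {}, "caf": {}}}
missing = missing_prodcaf_tiers(config)
print("missing tiers:", missing)
```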

Example Configuration File

  # Base constraints, can add more for a given tier with "extra" below
  "base": "nova.detectorid nd and simulated.generator genie and simulated.genieflavorset nonswap",

  # Special name.  Standard is detector_flavor_special.  
      # Tagged release of nova software
      # Metadata parameter for which to match release
      # Optional parameter to allow any release, use instead of release and release-field 
      # "any-release":"True", 
      # Additional constraints, separated by "and", but not prepended
      "extra":"nova.subversion 8",  
      # Child tiers, multiple allowed.  
      "children":["pidpart", "lemsum"]

A Comment on Comments

Note that the example above includes # style line comments. Comments are not part of the JSON standard, but a comment parser has been built into defman to make the configuration files easier to work with. Be aware that these comments do not conform to the JSON standard, so a commented configuration file will fail in any standard JSON parser.
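If a commented configuration needs to be read outside of defman, the comments have to be removed first. The Python sketch below shows one way to do that; it is not defman's comment parser, whose behavior is not documented here, and the configuration it parses is a made-up example with hypothetical values for "base" and "special".

```python
import json


def strip_hash_comments(text):
    """Remove '#' comments that appear outside double-quoted strings."""
    cleaned_lines = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        kept = []
        for ch in line:
            if in_string:
                kept.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
                kept.append(ch)
            elif ch == "#":
                break  # the rest of the line is a comment
            else:
                kept.append(ch)
        cleaned_lines.append("".join(kept))
    return "\n".join(cleaned_lines)


CONFIG_TEXT = """
{
  # Base constraints shared by every tier (hypothetical values)
  "base": "nova.detectorid nd and simulated.generator genie",
  "special": "nd_genie_example",
  "tiers": {
    "reco": {
      "release": "S12-12-12",   # tagged release
      "release-field": "reconstructed.base_release",
      "extra": "nova.subversion 8",
      "children": ["pidpart", "lemsum"]
    }
  }
}
"""

config = json.loads(strip_hash_comments(CONFIG_TEXT))
print(sorted(config))
print(config["tiers"]["reco"]["children"])
```

Tracking whether the scan is inside a string keeps a # that is part of a value from being mistaken for a comment.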