TPC Swizzling using POMS

Software Versions

The procedures listed in this article depend on production scripts (in the ups products larbatch and ubutil) that are available by setting up uboonecode v06_26_01_30 or later and larbatch v01_47_04 or later. These versions should be set up in the POMS launch template.
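
For interactive testing, the corresponding environment can be set up by hand. This is a minimal sketch only; the cvmfs setup script path and the e10:prof qualifiers are assumptions and should be adjusted to your local conventions.

# Set up the MicroBooNE ups environment (path and qualifiers are assumptions; adjust as needed).
source /cvmfs/uboone.opensciencegrid.org/products/setup_uboone.sh
setup uboonecode v06_26_01_30 -q e10:prof
setup larbatch v01_47_04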

Requirements

The procedures described in this article for TPC swizzling satisfy the following requirements.

  • Runs the standard TPC swizzler, with CRT merging. As of uboonecode v06_26_01_30, the TPC swizzler produces 11 artroot output streams, plus one histogram ("daq_hist") file.
  • Batch jobs are submitted using a DAG, consisting of "start," "worker," and "stop" jobs. The start job differs from a standard sam workflow in that it prestages both main (TPC binary) and secondary (CRT swizzled) input files before releasing the workers.
  • Output files produced by one batch job are merged within one stream and one run to increase their size, up to some maximum size.
  • Duplicate-processed artroot output files are deleted automatically before being merged, declared to sam, or copied to the FTS dropbox.

SAM datasets.

Input datasets.

Start by defining an input sam dataset consisting of TPC binary data for the period of interest. Here is a typical definition.

samweb create-definition prod_swizzle_binary_crt_merge_run4a \
  "data_tier raw and file_type data and file_format 'binaryraw-compressed', 'binaryraw-uncompressed' \
  and run_number 18961-19752 \
  and not file_name %Calibration%,Test% \
  and defname: crt_swizzled_ready" 

The definition should include the following elements.

  • Basic sam metadata (file_type, file_format, data_tier).
  • Run range.
  • Exclusions for non-software-triggered data.
  • CRT readiness mix-in.

Use the following CRT readiness definitions for different running periods.

  • "crt_swizzled_ready3" (reswizzled stream 1 / top panel) after the CRT GPS clock fix (runs 14117 and later).
  • "crt_swizzled_ready2" before the CRT GPS clock fix (runs 14116 and earlier).

Create an indirection or snapshot dataset that refers to files from the main input dataset. Initially, this can just be a copy of the main input dataset.

samweb create-definition prod_swizzle_binary_crt_merge_run4a_snap1 "defname: prod_swizzle_binary_crt_merge_run4a" 

Later (after the input data is frozen, or if the sam queries become too complicated for sam to handle), this definition can be replaced with a snapshot.
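
For example, once the input data is frozen, the indirection definition can be repointed at a snapshot roughly as follows. This is a sketch only, assuming the "snapshot_id" dimension and that the indirection definition can simply be deleted and recreated under the same name; substitute the snapshot id printed by samweb.

# Take a snapshot of the main input dataset; samweb prints the new snapshot id.
samweb take-snapshot prod_swizzle_binary_crt_merge_run4a

# Recreate the indirection definition on top of the snapshot
# (replace 12345 with the snapshot id printed above).
samweb delete-definition prod_swizzle_binary_crt_merge_run4a_snap1
samweb create-definition prod_swizzle_binary_crt_merge_run4a_snap1 "snapshot_id 12345"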

Output datasets.

Create output dataset definitions for each of the 11 trigger streams and for the histogram output. Here is an example artroot definition.

samweb create-definition prod_bnb_swizzle_crt_inclusive_v7 \
  "file_type data and file_format artroot and data_tier raw \
  and ub_project.name swizzle_crt_merge_run4 and ub_project.version prod_v06_26_01_20 \
  and data_stream outbnb" 

The definition should include the following elements.

  • Basic sam metadata (file_type, file_format, data_tier).
  • MicroBooNE-specific metadata (ub_project.name, ub_project.version). Do not include ub_project.stage.
  • Data stream, corresponding to a specific trigger and swizzler output stream.
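
Since the per-stream definitions differ only in the stream name, the remaining ones can be created in a loop. The sketch below is not authoritative: it assumes the data_stream values follow the "out" + stream pattern of the "outbnb" example, and it omits the bnb definition already created above. Verify the actual swizzler stream names before running it.

# Create the remaining per-stream artroot output definitions.
# Assumption: data_stream values are "out" + stream name (e.g. "outnumi").
for stream in bnbhsnc0 bnbunbiased extbnb exthsnc0 extnumi extunbiased \
              mucs notpc numi numiunbiased
do
  samweb create-definition prod_${stream}_swizzle_crt_inclusive_v7 \
    "file_type data and file_format artroot and data_tier raw \
    and ub_project.name swizzle_crt_merge_run4 and ub_project.version prod_v06_26_01_20 \
    and data_stream out${stream}"
done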

Create a dataset definition which is the union of artroot files from all trigger streams. This is done by omitting the data_stream clause from the above definitions.

samweb create-definition prod_allstreams_swizzle_crt_inclusive_v7 \
  "file_type data and file_format artroot and data_tier raw \
  and ub_project.name swizzle_crt_merge_run4 and ub_project.version prod_v06_26_01_20" 

Finally, create a dataset definition for the histogram output files.

samweb create-definition prod_daq_hist_swizzle_crt_inclusive_v7 \
  "file_type data and file_format root and data_tier 'root-tuple' \
  and ub_project.name swizzle_crt_merge_run4 and ub_project.version prod_v06_26_01_20" 

Once these datasets are created, you should post them on a web page.
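
When documenting the datasets, "samweb describe-definition" can be used to capture each definition's dimension string for the web page. For example, listing a few of the definitions created above:

# Dump the dimension strings of the new output definitions, e.g. for a wiki page.
for def in prod_bnb_swizzle_crt_inclusive_v7 \
           prod_allstreams_swizzle_crt_inclusive_v7 \
           prod_daq_hist_swizzle_crt_inclusive_v7
do
  samweb describe-definition $def
done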

Recursive datasets.

Best practice for defining recursive (aka draining) swizzling dataset definitions is to use child-type recursion (same as any POMS project). Although project.py has facilities for creating recursive dataset definitions automatically, for swizzling you should not use these facilities, because they don't handle multiple data streams. Instead, you should manually define recursive input dataset definitions. Start by doing this for each trigger stream.

samweb create-definition prod_swizzle_binary_crt_merge_run4a_bnb_recur1 \
  "defname: prod_swizzle_binary_crt_merge_run4a_snap1 \
  minus isparentof:( defname: prod_bnb_swizzle_crt_inclusive_v7 with availability physical ) \
  minus defname: prod_swizzle_binary_crt_merge_run4a_active \
  minus defname: prod_swizzle_binary_crt_merge_run4a_wait" 

The recursive definition should start with the indirection input dataset defined above, and should include three "minus" clauses.

  • The first minus clause should be an "isparentof:" clause based on the output dataset definition. Be sure to include a "with availability physical" subclause.
  • The second minus clause is for active sam projects.
  • The third minus clause is for virtual files that are not yet stored.
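
The remaining per-stream recursive definitions follow the same pattern and can be created in a loop. This is a sketch, assuming the stream names match the output dataset definitions and that the "_active" and "_wait" definitions referenced above already exist; the bnb definition created above is omitted from the list.

# Create the remaining per-stream recursive (draining) input definitions.
# The bnb definition was created by hand above.
for stream in bnbhsnc0 bnbunbiased extbnb exthsnc0 extnumi extunbiased \
              mucs notpc numi numiunbiased
do
  samweb create-definition prod_swizzle_binary_crt_merge_run4a_${stream}_recur1 \
    "defname: prod_swizzle_binary_crt_merge_run4a_snap1 \
    minus isparentof:( defname: prod_${stream}_swizzle_crt_inclusive_v7 with availability physical ) \
    minus defname: prod_swizzle_binary_crt_merge_run4a_active \
    minus defname: prod_swizzle_binary_crt_merge_run4a_wait"
done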

Also create a recursive dataset definition corresponding to the "allstreams" output dataset. Normally this can be done by replacing the stream name with "allstreams" in the definition name and in the first minus clause.

samweb create-definition prod_swizzle_binary_crt_merge_run4a_allstreams_recur1 \
  "defname: prod_swizzle_binary_crt_merge_run4a_snap1 \
  minus isparentof:( defname: prod_allstreams_swizzle_crt_inclusive_v7 with availability physical ) \
  minus defname: prod_swizzle_binary_crt_merge_run4a_active \
  minus defname: prod_swizzle_binary_crt_merge_run4a_wait" 

Finally, create a dataset definition that is the union of all 11 single-stream recursive definitions.

samweb create-definition prod_swizzle_binary_crt_merge_run4a_everystream_recur1 \
  "defname: prod_swizzle_binary_crt_merge_run4a_bnb_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_bnbhsnc0_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_bnbunbiased_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_extbnb_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_exthsnc0_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_extnumi_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_extunbiased_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_mucs_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_notpc_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_numi_recur1 \
  or defname: prod_swizzle_binary_crt_merge_run4a_numiunbiased_recur1" 

These recursive dataset definitions operate as follows.

  • A TPC binary file will drop out of the single-stream recursive datasets when a swizzled file is available for that trigger stream.
  • A TPC binary file will drop out of the all-streams recursive dataset when a swizzled file is available for any trigger stream.
  • A TPC binary file will drop out of the every-stream recursive dataset when a swizzled file is available for all trigger streams.
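
Draining progress can be monitored by counting the files remaining in each recursive dataset; a minimal sketch, using the definition names from this article:

# Number of binary files still waiting to be swizzled, per recursive dataset.
for def in prod_swizzle_binary_crt_merge_run4a_bnb_recur1 \
           prod_swizzle_binary_crt_merge_run4a_allstreams_recur1 \
           prod_swizzle_binary_crt_merge_run4a_everystream_recur1
do
  echo "$def: $(samweb count-definition-files $def)"
done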

Submitting jobs using the all-streams recursive dataset definition has the lowest risk of creating duplicate-processed files. Submitting jobs using the every-stream recursive dataset offers the most inclusive recovery potential, but risks creating duplicate-processed swizzled files for some trigger streams. Submitting jobs using any single-stream recursive dataset also risks duplicate-processing some trigger streams (other than the trigger stream for which the dataset is defined). Obviously, any time batch jobs are submitted using the every-stream or single-stream input datasets, the workflow must include some protection against storing duplicate-processed swizzled files.

XML File

This section contains instructions for configuring a project.py xml file for TPC swizzling using POMS. In general, the xml file should contain 13 stages: one stage for each of the 11 trigger streams, plus an "all-streams" stage and an "every-stream" stage. Each stage should be identical, except for the <recurdef> element. Here is an element-by-element breakdown of what should be included in the xml file.

Project elements.

  • <project name=...> - Project name should match "ub_project.name" clause in output dataset definitions.
  • <numevents>1000000</numevents> - Some large number.
  • <os>SL6</os> - For any MCC8 release (v06_26_01_xx) os can only be SL6. SL7 should work eventually. For now, it is recommended to stick with SL6 for maximum compatibility.
  • <resource>DEDICATED,OPPORTUNISTIC,OFFSITE</resource> - Starting with uboonecode v06_26_01_30, can include "OFFSITE" for running on OSG.
  • <larsoft><tag> and <larsoft><qual> - Should be compatible with requirements specified at the top of this article, and should match POMS launch template.
  • <version>...</version> - Should match "ub_project.version" clause in output dataset definitions.
  • <check>1</check> - Enable on-worker validation.
  • <copy>1</copy> - Enable on-worker storing of files.

Stage elements.

  • <stage name=...> - Stage name doesn't matter (provided the output dataset definitions don't include a "ub_project.stage" clause).
  • <inputdef> - This element isn't actually used when recursive dataset definitions are created manually. It is good to specify it as the main input dataset for TPC binary files, as a reminder.
  • <recurdef> - Use the manually defined recursive dataset definitions created as described above (different for each stage).
  • <recurtype>child</recurtype> - Child recursion.
  • <singlerun>0</singlerun> - Single run processing no longer needed (or leave this element out).
  • <prestagefraction>1</prestagefraction> - Prestage all files in DAG start job.
  • <activebase>...</activebase> - Generally should match <inputdef>.
  • <dropboxwait>3</dropboxwait> - Waiting period for virtual files to be stored (in days) before attempting recovery.
  • <fcl>swizzle_software_trigger_streams_optfilter_crt_merge_extra_v06_26_01_13_nomerge.fcl</fcl> - Use "nomerge" fcl files for this workflow, which means that RootOutput is configured with "fileProperties: { maxInputFiles: 1 }." Different fcl files are available for different CRT merging and optical filtering options.
  • <outdir> - In /pnfs/uboone/scratch.
  • <logdir> - Should match <outdir>.
  • <workdir> - In /pnfs/uboone/resilient.
  • <bookdir> - In /uboone/data. In general, this area isn't used unless you manually invoke "project.py --check."
  • <numjobs>100</numjobs> - Maximum number of jobs that will be submitted at a time.
  • <maxfilesperjob>10</maxfilesperjob> - Maximum number of files to process in one batch job. This is limited by cpu time, and more particularly, by the amount of scratch disk space available on each worker.
  • <targetsize>20000000000</targetsize> - Specifies the ideal size of input files to be delivered to each worker. This parameter may cause the number of batch jobs submitted to be reduced below <numjobs> to boost the amount of data delivered to each worker.
  • <memory>4000</memory> - Standard swizzle jobs can't run in 2000 MB.
  • <submitscript>maxjobs2.sh n1 10 -j1 start.*-swizzle_crt_merge -n2 500 -j2 .*-swizzle_crt_merge</submitscript> - Throttle job submissions.
  • <jobsub>--expected-lifetime=medium --subgroup=prod</jobsub> - Extra jobsub options for DAG worker jobs.
  • <jobsub_start>--expected-lifetime=medium --memory=1000MB --subgroup=prod</jobsub_start> - Extra jobsub options for start/stop DAG jobs.
  • <startscript>condor_start_project_crt_merge_v06_26_01_13.sh</startscript> - DAG start batch worker script. This element should be included for any batch job that includes CRT merging. Use one of the following scripts.
    • condor_start_project_crt_merge_v06_26_01_13.sh - For post-CRT-GPS-fix data (runs 14117 and later).
    • condor_start_project_crt_merge_v06_26_01_26.sh - For pre-CRT-GPS-fix data (runs 14116 and earlier).
  • <endscript>consolidate_swizzled.py</endscript> - This post-processing script filters duplicate-processed files and merges artroot files within a single stream and run up to some maximum size.
  • <schema>gsiftp</schema> - Gsiftp schema needed for reading any kind of non-root file (including TPC binary).
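
Before wiring the xml file into POMS, the configuration can be exercised by hand with project.py; a sketch, using a hypothetical xml file name and stage name:

# Parse the xml file and print the status of each stage.
project.py --xml swizzle_crt_merge_run4.xml --status

# Manually validate the output of a completed stage (this uses <bookdir>).
project.py --xml swizzle_crt_merge_run4.xml --stage bnb --check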

POMS Configuration

POMS configuration is pretty simple. Observe the following points.

  • Clone a campaign or add a stage to an existing campaign using the campaign GUI editor.
  • Edit the newly created stage as follows.
    • The launch template should set up uboonecode and larbatch versions compatible with the requirements specified at the top of this article. The launch template uboonecode version should match the uboonecode version specified in the xml file.
    • Edit the stage parameters as follows.
      • Version should match uboonecode version specified in xml file.
      • Specify input dataset as "none."
      • Specify input dataset type as "draining."
    • Edit the parameter overrides to override parameters --xml and --stage. Specify the full path of the newly created xml file, and select one of the stages.
    • The job type and campaign stage should be configured to submit jobs using a command like this:
      project.py --xml <xml-file> --stage <stage-name> --submit
      

In general, it should be sufficient to define a campaign with a single stage. The POMS campaign stage may be reconfigured to use different xml stage names as needed. In the early stages of a campaign, it is probably best to submit jobs using one of the single-stream xml stages. Successful jobs will produce all trigger streams as a side effect, but some files may need to be made up. In the final wrap-up part of a campaign, switch to the "every stream" stage. The only reason for not doing this at the beginning is that it would put a larger load on sam, with greater risk of sam errors.
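
Before switching the POMS stage for the wrap-up phase, it can be useful to check how many binary files still lack at least one swizzled stream, and, if desired, launch a makeup submission by hand. This is a sketch with a hypothetical xml file name and stage name.

# Files still missing at least one swizzled trigger stream.
samweb count-definition-files prod_swizzle_binary_crt_merge_run4a_everystream_recur1

# Manual submission against the every-stream xml stage (POMS normally launches this).
project.py --xml swizzle_crt_merge_run4.xml --stage everystream --submit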