Configuring Offline PUBS Projects

Pubs projects are configured by three things: the pubs configuration, which is stored in the pubs database (and usually also as a .cfg file in the pubs repository); the project.py xml file (or pubs xml template); and the larsoft fcl file(s).

PUBS repository organization

Offline scripts and configurations normally live in subdirectory "dstream_prod" of the pubs repository. The main production script for most projects is called "production.py". Pubs configurations live in subdirectory dstream_prod/cfg. Xml files and xml templates live in dstream_prod/xml.

PUBS project configuration.

Pubs project configurations consist of parameters (name-value pairs). Some built-in pubs parameters are intended to be understood and used by the daemon. Others are intended to be used by the project script. The latter are called "resources". Built-in parameters are encoded as name-value pairs, one per line in the configuration file. Resources are encoded like this:

RESOURCE <resource name> => <resource value>

Resources are available to the project script. Resource names and values can be anything.
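
As an illustration of the RESOURCE syntax above, here is a minimal Python sketch that collects the resource lines of a configuration into a dictionary. The function name parse_resources and the example values are hypothetical; this is not the parser used by pubs itself, and lines that are not RESOURCE lines (such as built-in parameters) are simply skipped.

```python
def parse_resources(text):
    """Collect 'RESOURCE <name> => <value>' lines into a dict.

    Illustrative only: this is not the parser used by pubs.
    """
    resources = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("RESOURCE "):
            continue  # built-in parameters and blank lines are ignored here
        name, sep, value = line[len("RESOURCE "):].partition("=>")
        if not sep:
            continue  # no '=>' separator, so not a well-formed resource line
        resources[name.strip()] = value.strip()
    return resources


# Hypothetical configuration fragment; names come from the list of
# production.py resources, values are made up for the example.
example = """
RESOURCE NRUNS => 100
RESOURCE STAGE_NAME => gen:g4:detsim
RESOURCE STAGE_STATUS => 0:10:20
"""

resources = parse_resources(example)
print(resources["STAGE_STATUS"])
```

All values are kept as strings; converting them (for example to integers or per-stage lists) is left to the project script.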

Resources for production.py.

Here is the complete list of resources understood by production.py.

PARENT
PARENT_STATUS
NRUNS
XMLFILE
XML_TEMPLATE
XML_OUTDIR
PUBS_XMLVAR_*
NRESUBMISSION
EXPERTS
STAGE_NAME
STAGE_STATUS
MIN_RUN
MAX_RUN
MIN_SUBRUN
MAX_SUBRUN
MAX_STATUS
NSUBRUNS
STORE
STOREANA
ADD_LOCATION
ADD_LOCATION_ANA
CHECK
CHECKANA
NJOBS_LIMIT
NJOBS_TOTAL_LIMIT

The meanings of these parameters are as follows.

NRUNS - The maximum number of status changes that will be processed in one invocation of production.py.

PARENT, PARENT_STATUS - Do not process any (run, subrun) unless this (run, subrun) has the specified status in the specified parent project.

XMLFILE - Path of project.py xml file. Specify either XMLFILE or (XML_TEMPLATE, XML_OUTDIR).

XML_TEMPLATE - Path of xml template.

XML_OUTDIR - Generate xml file from template in this directory.

PUBS_XMLVAR_* - Substitute tail part of any resource matching this pattern in xml template.

NRESUBMISSION - Number of automatic resubmissions before marking a (run, subrun) with error status.

EXPERTS - E-mails of experts.

MIN_RUN, MIN_SUBRUN - Minimum (run, subrun) to process.

MAX_RUN, MAX_SUBRUN - Maximum (run, subrun) to process.

Multistage projects and resources.

The following resources are specified as a colon-separated list of values, such that each value applies to one stage.

NSUBRUNS
STAGE_NAME
STAGE_STATUS
STORE
STOREANA
ADD_LOCATION
ADD_LOCATION_ANA
CHECK
CHECKANA
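
To make the colon-separated convention concrete, the following Python sketch splits several per-stage resources and regroups them by stage. The helper names (split_per_stage, as_flag), the stage names, and the values are all hypothetical, and the sketch assumes every list has one entry per stage.

```python
def split_per_stage(value, cast=str):
    """Split a colon-separated resource value into one entry per stage."""
    return [cast(item) for item in value.split(":")]


def as_flag(item):
    """Interpret '0' or '1' as a boolean."""
    return bool(int(item))


# Hypothetical three-stage project; stage names and values are made up.
resources = {
    "STAGE_NAME": "gen:g4:detsim",
    "STAGE_STATUS": "0:10:20",
    "NSUBRUNS": "1:20:20",
    "STORE": "0:0:1",
}

# Regroup the lists so that each stage carries its own settings.
stages = [
    {"name": name, "status": status, "nsubruns": nsubruns, "store": store}
    for name, status, nsubruns, store in zip(
        split_per_stage(resources["STAGE_NAME"]),
        split_per_stage(resources["STAGE_STATUS"], int),
        split_per_stage(resources["NSUBRUNS"], int),
        split_per_stage(resources["STORE"], as_flag),
    )
]

for stage in stages:
    print(stage)
```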

The interpretation of these parameters for each stage is as follows.

NSUBRUNS - Number of subruns processed in a single batch submission. The number of batch workers contained in one batch submission is variable. For generator jobs, the number of batch workers is specified using xml parameter <numjobs>. For any stage that is reading some kind of input, the number of batch workers is calculated according to the following formula.

njobs = (NSUBRUNS + <maxfilesperjob> - 1) / <maxfilesperjob>

Note the following limits and special cases.
  • If NSUBRUNS is one, the number of batch workers will always be one.
  • If <maxfilesperjob> is one, the number of batch workers will be the same as NSUBRUNS (i.e. one file per batch job).
  • If <maxfilesperjob> is very large, the number of batch workers will be one. In this case, all NSUBRUNS input files will be read by that single batch worker.
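
The formula and its special cases can be checked with a short Python sketch. It assumes the division in the formula is integer division (so the result is NSUBRUNS / <maxfilesperjob> rounded up), which is consistent with the special cases listed above; the function name batch_workers is hypothetical.

```python
def batch_workers(nsubruns, maxfilesperjob):
    """Number of batch workers for a stage that reads input files.

    Assumes the formula uses integer division, i.e. it rounds
    nsubruns / maxfilesperjob up to the next whole number.
    """
    if nsubruns < 1 or maxfilesperjob < 1:
        raise ValueError("both arguments must be positive integers")
    return (nsubruns + maxfilesperjob - 1) // maxfilesperjob


print(batch_workers(1, 5))      # NSUBRUNS is one
print(batch_workers(20, 1))     # one file per batch job
print(batch_workers(20, 1000))  # very large <maxfilesperjob>
print(batch_workers(20, 8))     # intermediate case
```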

STAGE_NAME - Should match stage name specified in xml file.

STAGE_STATUS - Conventionally, increment by ten for each stage in a multistage project. For example:

RESOURCE STAGE_STATUS => 0:10:20:30:40:50

STORE - Boolean (0 or 1) to indicate whether to store artroot output files from each stage in enstore. By default, final stage is stored in enstore.

STOREANA - Boolean (0 or 1) to indicate whether to store analysis root output files from each stage in enstore. By default, final stage is stored in enstore.

ADD_LOCATION - Boolean (0 or 1) to indicate whether to add disk location of artroot output files in sam database. Default is false.

ADD_LOCATION_ANA - Boolean (0 or 1) to indicate whether to add disk location of analysis root output files in sam database. Default is false.

CHECK - Boolean (0 or 1) to indicate whether to do standard artroot output checks (project.py --check). Default is true. Set this parameter to false if this stage doesn't generate artroot output.

CHECKANA - Boolean (0 or 1) to indicate whether to do standard analysis output checks (project.py --checkana). Default is false. If CHECK is false, you should probably set CHECKANA to be true.

Recommended settings for different use cases.

The use cases described in this section apply to a single stage in case of multistage projects.

Generator stage.

  • Specify xml project parameter <numevents> to be the number of events per file.
  • Specify xml stage parameter <numjobs> to be 1.
  • Specify pubs parameter NSUBRUNS to be one for the generator stage (it can and should be larger than one for later stages).
  • Use pubs parameter MAX_SUBRUN to specify the size of the generator sample (number of generated events = MAX_SUBRUN * <numevents>).

The reason for setting <numjobs> and NSUBRUNS to be one is to ensure that each generated file gets a unique subrun number. If either parameter is greater than one, then multiple files will be generated with the same subrun number.
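
As a worked example of the sample-size relation above, the following Python sketch computes the number of generated events from MAX_SUBRUN and <numevents>, and the MAX_SUBRUN needed to reach a target sample size. The function names and the numbers are hypothetical.

```python
def generated_events(max_subrun, numevents):
    """Total generated events: MAX_SUBRUN * <numevents>."""
    return max_subrun * numevents


def max_subrun_for(target_events, numevents):
    """Smallest MAX_SUBRUN that yields at least target_events events."""
    return -(-target_events // numevents)  # integer ceiling division


# Hypothetical numbers: 50 events per file, 200 subruns.
print(generated_events(200, 50))
print(max_subrun_for(10000, 50))
print(max_subrun_for(10001, 50))
```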

One-to-one pipeline.

This section refers to later stages of multistage projects such that each batch job will read one input file from the previous stage.

  • Do not specify any xml input (one of <inputfile>, <inputlist>, <inputdef>).
  • You can use xml parameter <previousstage> to specify the input stage, if different from the preceding stage in the xml file.
  • Xml parameters <numevents> and <numjobs> do not matter.
  • Specify xml stage parameter <maxfilesperjob> to be 1 or omitted.
  • Specify pubs parameter NSUBRUNS to be a moderately sized integer (values in the range 10-100 are convenient). This parameter specifies the number of files/subruns processed in each batch job cluster. The exact value is not critical.

Many-to-one pipeline (merge stage).

Same as one-to-one pipeline, except:

  • Specify xml stage parameter <maxfilesperjob> to be at least as large as the desired merge factor.
  • Specify xml stage parameter <targetsize> to be the maximum desired file size.
  • Specify pubs parameter NSUBRUNS to be at least as large as the desired merge factor.

The merge factor will roughly be the minimum of pubs parameter NSUBRUNS and xml parameter <maxfilesperjob>.
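
A minimal Python sketch of this rule of thumb follows. The function name is hypothetical, and the result is only approximate, since other settings such as <targetsize> also influence the actual merging.

```python
def approximate_merge_factor(nsubruns, maxfilesperjob):
    """Rough merge factor: the smaller of NSUBRUNS and <maxfilesperjob>."""
    return min(nsubruns, maxfilesperjob)


# Hypothetical settings.
print(approximate_merge_factor(20, 8))   # limited by <maxfilesperjob>
print(approximate_merge_factor(20, 50))  # limited by NSUBRUNS
```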

Input from sam dataset.

  • Specify xml parameter <inputdef> to contain the name of the input sam dataset definition.
  • Xml parameters <numevents> and <numjobs> do not matter.

The average merging factor is specified in a manner similar to the pipeline cases using pubs parameter NSUBRUNS and xml parameter <maxfilesperjob>. If <maxfilesperjob> is specified, sam will strictly enforce that as the maximum number of files delivered to a single batch job. If <maxfilesperjob> is not specified in the xml stage definition, the average number of files delivered to a single batch job will be one, but individual jobs may get more than one file.

Processing multiple files in a single batch job without merging.

By default, art will merge multiple input files into a single output file. Sometimes you want to open a new output file whenever a new input file is opened. This behavior can be configured in art using fcl parameters. Do the following:

  • Configure the "scheduler" art service as follows:
    scheduler: { fileMode: NOMERGE }
    
  • Each RootOutput module must be configured to have an output file name template that is capable of generating different output file names for different input file names. Here is a typical example:
    module_type: RootOutput
    fileName: "%ifb_%tc_bnb.root"