Construct a Draining Dataset

Draining datasets allow you to reprocess files which have not yet produced children. The general pattern is:

samweb create-definition my_draining_dataset "<parent_query> minus isparentof:( <child_query> )" 

Often, the <parent_query> is just a definition name, which you can refer to using dataset_def_name_newest_snapshot <defname>. This almost always works well because a snapshot is taken whenever you start a project. If for whatever reason the most recent snapshot is out of date, you can update it with samweb take-snapshot <defname>.

The <child_query> can, in principle, also be a full definition name, but performance is generally poor in that case. You will get better performance by constructing a minimal description of the types of children you are checking for. For example, if you started with artdaq files and want to rerun jobs which produce pid files in release R16-03-03-prod2reco.f, your child_query could be as simple as: data_tier pid and nova.release R16-03-03-prod2reco.f. You can make fairly broad queries here since they are only applied to the children of files in the <parent_query>.

So, pulling it all together you might have the following for a full chain reco/pid job (which produces pid files) that ran in the R16-03-03-prod2reco.f tag:

samweb create-definition my_draining_dataset "dataset_def_name_newest_snapshot some_artdaq_files minus isparentof:( data_tier pid and nova.release R16-03-03-prod2reco.f )" 

You can work out the above constraint from your submission log or configuration file. The data_tier is one of your --outTier arguments, and R16-03-03-prod2reco.f is the value passed to --tag.
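The general pattern can be wrapped in a small shell function so that the query is assembled the same way each time. This is only a sketch: draining_dims is a hypothetical helper name, not part of samweb, and the command is printed rather than run so that the dimension string can be checked first.

```shell
# Hypothetical helper (not part of samweb): print the draining dimension
# string for a parent definition, an output data tier, and a release tag.
draining_dims() {
  printf 'dataset_def_name_newest_snapshot %s minus isparentof:( data_tier %s and nova.release %s )\n' \
    "$1" "$2" "$3"
}

# Dry run: print the samweb command instead of executing it.
echo "samweb create-definition my_draining_dataset \"$(draining_dims some_artdaq_files pid R16-03-03-prod2reco.f)\""
```

Once the printed command looks right, it can be pasted into the shell to create the definition.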

Jobs with multiple output files

Some jobs, like the "full chain" reco/pid jobs, produce multiple output files. Generally you can just drain on a single file type (for full chain this could be pid) since in most circumstances a job will produce all its output files or none of them. However, occasionally some jobs produce only some of their output files. The symptom of this is inconsistent counts across the different tiers of files.

In this circumstance you can construct a single draining definition, but it will generally have very poor performance. You are better off creating multiple draining definitions and combining them after the fact.

Because files for the reco stage are no longer made, the example below does not include them. If you do need them, add the corresponding lines with data_tier reco substituted:

samweb create-definition my_pid_draining_dataset "dataset_def_name_newest_snapshot some_artdaq_files minus isparentof:( data_tier pid and nova.release R16-03-03-prod2reco.f )" 
samweb create-definition my_caf_draining_dataset "dataset_def_name_newest_snapshot some_artdaq_files minus isparentof:( data_tier caf and nova.release R16-03-03-prod2reco.f )" 
samweb take-snapshot my_pid_draining_dataset
samweb take-snapshot my_caf_draining_dataset
samweb create-definition my_combo_draining_dataset "dataset_def_name_newest_snapshot my_pid_draining_dataset or dataset_def_name_newest_snapshot my_caf_draining_dataset"
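When more tiers are involved, the same sequence can be generated with a loop. The following is a minimal sketch under the assumption that every tier is a direct child of the parent definition. The variable names are illustrative, and the commands are printed rather than executed.

```shell
parent_def=some_artdaq_files
release=R16-03-03-prod2reco.f
combo_dims=""

for tier in pid caf; do
  def="my_${tier}_draining_dataset"
  # One draining definition per output tier, followed by its snapshot.
  echo "samweb create-definition $def \"dataset_def_name_newest_snapshot $parent_def minus isparentof:( data_tier $tier and nova.release $release )\""
  echo "samweb take-snapshot $def"
  # Join the per-tier snapshots with "or" to build the combined query.
  combo_dims="${combo_dims:+$combo_dims or }dataset_def_name_newest_snapshot $def"
done

echo "samweb create-definition my_combo_draining_dataset \"$combo_dims\""
```

Adding another tier to the for list extends both the per-tier definitions and the combined query.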

The Far Detector Data

Draining the far detector data is especially challenging for two reasons: the datasets contain a very large number of files, and each input file (artdaq) produces 3 children (pid, caf, restrictedcaf) and 2 grandchildren (decaf, restricteddecaf). So, a fully expanded example is below for creating draining definitions for periods 1-3 of the FD data in production 3.

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2 "defname: prod_artdaq_S15-03-11_fd_numi_period1_snapshot20170530 or defname: prod_artdaq_S15-03-11_fd_numi_period2_snapshot20170531 or defname: prod_artdaq_S15-03-11_fd_numi_period3_snapshot20170531"
samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2_pid_draining  "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2 minus isparentof:( data_tier pid and nova.release R17-03-01-prod3reco.k)" 

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2_caf_draining  "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2 minus isparentof:( data_tier caf and nova.release R17-03-01-prod3reco.k)" 

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2_blindcaf_draining  "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2 minus isparentof:( data_tier restrictedcaf and nova.release R17-03-01-prod3reco.k)" 

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2_decaf_draining  "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2 minus isancestorof:( data_tier decaf and nova.release R17-03-01-prod3reco.k)" 

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2_blinddecaf_draining  "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2 minus isancestorof:( data_tier restricteddecaf and nova.release R17-03-01-prod3reco.k)" 

samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_pid_draining
samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_caf_draining
samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_blindcaf_draining
samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_decaf_draining
samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_blinddecaf_draining

samweb create-definition prod_artdaq_S15-03-11_fd_numi_period123_v2_combo_draining "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_pid_draining or dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_caf_draining or dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_blindcaf_draining or dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_decaf_draining or dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_blinddecaf_draining"

samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_combo_draining

samweb count-files "dataset_def_name_newest_snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_combo_draining" 
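The far detector commands above use isparentof for the three child tiers and isancestorof for the two grandchild tiers. A sketch of how that choice could be encoded is shown here. relation_for_tier is a hypothetical helper, and the definition names are derived from the tier name, so they differ slightly from the blindcaf and blinddecaf names used above. The commands are printed, not executed.

```shell
# Hypothetical helper: choose the SAM relation for a given output tier.
relation_for_tier() {
  case "$1" in
    pid|caf|restrictedcaf) echo isparentof ;;     # direct children of artdaq
    decaf|restricteddecaf) echo isancestorof ;;   # grandchildren of artdaq
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

base=prod_artdaq_S15-03-11_fd_numi_period123_v2
release=R17-03-01-prod3reco.k

for tier in pid caf restrictedcaf decaf restricteddecaf; do
  relation=$(relation_for_tier "$tier") || exit 1
  echo "samweb create-definition ${base}_${tier}_draining \"dataset_def_name_newest_snapshot $base minus ${relation}:( data_tier $tier and nova.release $release )\""
done
```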

Updating draining datasets

Because the draining datasets often contain dataset_def_name_newest_snapshot, which relies on snapshots being up-to-date, complex draining datasets (like those shown above) are difficult to update by hand. You can use the script update_snapshots.py (from NovaGridUtils) to recursively update all the dataset snapshots used in a definition:

$ update_snapshots.py prod_artdaq_S15-03-11_fd_numi_period123_v2_combo_draining
 Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2_combo_draining
   Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2_pid_draining
     Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2
       samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2
     samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_pid_draining
   Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2_caf_draining
     samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_caf_draining
   Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2_blindcaf_draining
     samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_blindcaf_draining
   Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2_decaf_draining
     samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_decaf_draining
   Updating definition: prod_artdaq_S15-03-11_fd_numi_period123_v2_blinddecaf_draining
     samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_blinddecaf_draining
   samweb take-snapshot prod_artdaq_S15-03-11_fd_numi_period123_v2_combo_draining