Tutorial for Analyzers Using the Grid

This page walks new analyzers through running the DUNE software on the Fermilab batch system (aka "the grid"). It is intended for those who know nothing about using the grid or DUNE analysis code. It doesn't assume any knowledge of Linux per se, but you will probably get the most out of it if you are vaguely familiar with the concepts of a command line and programming. Currently, this tutorial focuses on the generation and simulation of Monte Carlo. In the future, it will be expanded to cover reconstruction and analysis of data.

Prerequisites

For starters, you will require a Fermilab computing account and access to the DUNE gpvms (dunegpvmXX.fnal.gov: [01-10] run SLF6, [11-15] run SLF7). You will also need a grid proxy to run jobs in the batch system. Instructions for acquiring these can be found here: https://wiki.dunescience.org/wiki/DUNE_Computing/Getting_Started_Tutorial#Getting_Accounts_and_Logging_In_at_Fermilab
Make sure you are able to log on to one of the dunegpvms and have X11 forwarding enabled. If either of these doesn't work, ask your advisor for help.

Getting Started

Once you have managed to log onto a DUNE machine, you need to set up ups and dunetpc:

source /cvmfs/dune.opensciencegrid.org/products/dune/setup_dune.sh
setup dunetpc v08_03_00 -q e17:prof

Note that this version of dunetpc was chosen as a stable production release for running the tutorial, but you can find more up-to-date releases here: https://wiki.dunescience.org/wiki/DUNE_LAr_Software_Releases
The first line sets up ups, the framework DUNE and LArSoft use to manage the different builds of their software. It also sets up some experiment-specific environment variables you will need for batch running and for using SAM.
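
If you want to confirm that the setup worked, you can inspect the active ups products. This is a minimal sketch; the DUNETPC_DIR variable follows the usual ups convention of defining <PRODUCT>_DIR for an active product, which is an assumption here:

ups active | grep dunetpc     # should show dunetpc v08_03_00 with the e17:prof qualifiers
echo $DUNETPC_DIR             # assumption: ups defines this for the active dunetpc product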

Once you have dunetpc set up, you should be able to start the tutorial by running "make_dune_tutorial.sh":

mkdir -p /dune/app/users/$USER/project_py_tutorial
cd /dune/app/users/$USER/project_py_tutorial
/dune/app/users/kirby/make_dune_tutorial.sh

This will set up some files in your working directory. The one we are going to start with is "prodgenie_anu_dune10kt_1x2x6.xml". Note that this tutorial is based off of the larbatch package that is fully documented here: https://cdcvs.fnal.gov/redmine/projects/larbatch/wiki/User_guide
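
If you are curious what the script created, a simple listing of your working directory shows the xml files used in the rest of this tutorial (the exact set of files may differ from what is described here):

ls /dune/app/users/$USER/project_py_tutorial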

How to generate Monte Carlo

XML files and You

Next, open prodgenie_anu_dune10kt_1x2x6.xml in your favorite text editor (for some advice on choosing a text editor, look here).

XML is a markup language, similar to HTML, designed to encapsulate data in a human- and computer-readable format. Like HTML, XML elements are enclosed within <element></element> tags. The first element you should look for is the <numevents> element:

<numevents>50</numevents>

This element tells you that the total size of the simulation is 50 events. You will notice this element is embedded in another element:

<project name="&name;">
A project consists of multiple stages. Each stage can be thought of as a separate grid job which runs a particular step of the simulation or reconstruction. Projects contain settings which are common between multiple stages, such as:
  1. the number of events to simulate
  2. the version of the simulation being used
  3. how metadata is declared
  4. etc.

Do not worry about these for now, instead scroll down to the "gen" stage:

<stage name="gen">

There are several elements here, but let's focus on <numjobs> for now. This is the number of jobs which will be launched on the grid when you invoke the job submission command. If we wanted to generate a large number of events (say, 500), we would not want to do it in one job, but would instead break it into multiple submissions.
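
If you want a quick look at these settings without scrolling through the whole file, a simple grep works. This is just a convenience sketch; the element names are exactly the ones discussed above:

grep -E '<numevents>|<numjobs>' prodgenie_anu_dune10kt_1x2x6.xml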

Checking your xml file

Before launching your jobs, you should check to make sure that your xml file is properly formatted and contains all the needed information:

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --status

You should get an output like this:

Project prodgenie_anu_dune10kt_1x2x6:

Stage gen output directory does not exist.
Stage gen batch jobs: 0 idle, 0 running, 0 held, 0 other.

Project prodgenie_anu_dune10kt_1x2x6_reco1:

Stage reco1 output directory does not exist.
Stage reco1 batch jobs: 0 idle, 0 running, 0 held, 0 other.

Don't worry about the output directories not existing; project.py will create these directories for you when jobs are launched.

All of the actions project.py is capable of running may be found by running (warning! lots of text):

project.py --help

Launching your First Job

Let's get some simulation running. From the command line, type the following command:

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --submit

You should get an output like this:

Stage gen:
Invoke jobsub_submit
jobsub_submit finished.

If not, contact an expert, as it's likely something is wrong.

jobsub_submit sends your command to generate 50 anti-neutrino events to the grid, where it needs to wait for a free computer to run on. You can check the status of your job by typing:

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --status

You should see something like this:


Project prodgenie_anu_dune10kt_1x2x6:

Stage gen: 0 art files, 0 events, 0 analysis files, 0 errors, 0 missing files.
Stage gen batch jobs: 1 idle, 0 running, 0 held, 0 other.

Project prodgenie_anu_dune10kt_1x2x6_reco1:

Stage reco1 output directory does not exist.
Stage reco1 batch jobs: 0 idle, 0 running, 0 held, 0 other.

Note that project.py tracks all the stages in your xml file, not just the one currently running. Depending on how many jobs are currently running on the grid, your job may run right away or wait for an open slot (idle).

Fermilab's Computing Division maintains a variety of tools to measure and monitor the conditions of the grid. A quick command-line check of your own jobs is shown below.
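
One simple check uses jobsub_q directly to list your jobs in the queue. This is a minimal sketch; option names can differ between jobsub versions, so consult jobsub_q --help if it complains:

jobsub_q -G dune --user $USER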

After your Job Finishes: Checking the Output

Once your job finishes (remember you can check the status with project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --status), the next step is to check the output and make sure everything ran normally. This is accomplished through project.py and --check:

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --check

You should see something like this (your directories will be different):

Stage gen:
Checking directory /pnfs/dune/scratch/users/kirby/tutorial/v08_03_00/detsim/prodgenie_anu_dune10kt_1x2x6
50 total good events.
1 total good root files.
1 total good histogram files.
0 processes with errors.
0 missing files.

Voila! You now have 50 good events run through the generator stage! In addition to checking all the root files, project.py also creates a "files.list" file in the directory "/pnfs/dune/scratch/users/kirby/tutorial/v08_03_00/detsim/prodgenie_anu_dune10kt_1x2x6/". This file is used by project.py to determine the input for the next stage.
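
If you want to look at the output yourself, the files.list created by --check points to the art/ROOT files. A minimal sketch, assuming the output directory printed by --check above (yours will contain your own username and may differ):

OUTDIR=/pnfs/dune/scratch/users/$USER/tutorial/v08_03_00/detsim/prodgenie_anu_dune10kt_1x2x6
cat $OUTDIR/files.list   # the good art/ROOT files from the gen stage, one per line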

Launching your Second Grid Job

The next step in the simulation is to take the generated events and run them through reconstruction. This is accomplished by running project.py again, only this time specifying the stage to be reco1:

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml  --stage reco1 --submit

Note that you do not have to specify an input file at any point. project.py does this for you. It is of course possible to specify a certain input file or dataset, and we will get to that eventually.

In the meantime, you can check the status of these jobs by running:

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml  --stage reco1 --status

You should see output that looks like this:


Project prodgenie_anu_dune10kt_1x2x6:

Stage gen: 1 art files, 50 events, 1 analysis files, 0 errors, 0 missing files.
Stage gen batch jobs: 0 idle, 1 running, 0 held, 0 other.

Project prodgenie_anu_dune10kt_1x2x6_reco1:

Stage reco1 output directory does not exist.
Stage reco1 batch jobs: 0 idle, 0 running, 0 held, 0 other.

You'll notice you still have 50 events in the gen stage. project.py will keep track of the entire history of your project, so long as the directories and files are there.

Advanced Usage of project.py

By now, you've probably noticed all of our project.py commands follow a pattern of <xml file>, <some stage>, <some action>. The XML file describes the stages of an analysis, the order in which they are run, and how each stage receives its input files. Each time you call project.py with a specific xml file and stage, it looks into the xml file to decide what to run and how to run it.

Actions, by contrast, specify what you want project.py to do. For most users, the actions used most often are submit (which submits jobs to the grid), check (which checks the output of jobs after they've returned from the grid), and status (which tells you whether your jobs are running or idle). More advanced actions include declare (which declares files to SAM) and define (which creates a SAM dataset definition based on your configuration).
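
These SAM-related actions are invoked the same way as submit and check. A sketch using the tutorial xml file (only useful once you have output files to declare):

project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --declare   # declare output files to SAM
project.py --xml prodgenie_anu_dune10kt_1x2x6.xml --stage gen --define    # create a SAM dataset definition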

All of the actions project.py is capable of running may be found by running (warning! lots of text):

project.py --help

90% of the work you will do with project.py follows this paradigm of --stage <stage> --submit, --stage <stage> --check. There are a few tricks for speeding up this process detailed below; in the next section we will introduce how to run over actual DUNE data.

Reconstructing Data using Project.py

Fcl Files for Reconstruction

The full chain of reconstruction code is constantly being modified and improved. As a result, it would be impossible to keep this tutorial at the cutting edge of reconstruction, or even in step with the official reconstruction versions used for analysis. The fcl files included in the following section will work with the input datasets, but there is no guarantee they should be used for a physics analysis. Chances are, a more modern version of the reconstruction is being used by analyzers. Make sure you consult with your working group, or the production team, about what the most up-to-date reconstruction version is.

That being said, the following instructions (and dataset) should be usable with any version of the reconstruction. You can swap fcl files (and dunetpc versions) at will and cook up your own xml file.

Using SAM to Reconstruct Data

DUNE runs reconstruction in two stages, an initial 2-dimensional reconstruction and a more sophisticated 3-dimensional reconstruction. Open the xml file titled "prod_reco_data.xml," and scroll down to stage "reco1."

You'll notice a few differences from the xml file we used to generate MC events. The first is that we now have a stage parameter, <inputdef>. This parameter tells project.py to take its input from a dataset defined in SAM (Sequential Access via Metadata). Most of DUNE's data is stored on magnetic tape and is not immediately accessible from disk. SAM facilitates accessing and copying these files from tape. There is some more information about SAM and how to use it here: Basics_with_SAM_dCache_and_Tape

You can see how many files and events make up a data set by running:

samweb list-definition-files --summary kirby_np04_raw_cosmics_3ms_run6693_v07_08_00_05_v2

Alternatively, you can run:

samweb list-files --summary defname:kirby_np04_raw_cosmics_3ms_run6693_v07_08_00_05_v2

You should see output like this for both commands:

File count:    13
Total size:    107193148950
Event count:    1672

The dataset kirby_np04_raw_cosmics_3ms_run6693_v07_08_00_05_v2 contains 1672 events spread over 13 different files. If you want to see the parameters which make up a definition, you can use samweb describe-definition:

samweb describe-definition kirby_np04_raw_cosmics_3ms_run6693_v07_08_00_05_v2

This is a good way to learn what SAM metadata is available and how to use it.
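
A couple of related samweb commands are worth knowing at this point. This is a sketch; <file_name> is a placeholder for a file from the dataset (for example, the first one returned by samweb list-files with a limit of 1):

samweb count-files "defname:kirby_np04_raw_cosmics_3ms_run6693_v07_08_00_05_v2"   # just the file count
samweb get-metadata <file_name>                                                   # full SAM metadata of one file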

Look in prod_reco_data.xml for the line containing <numjobs>. You will see <numjobs> is set to 13, which is the same as the number of files in the dataset. You should also look for the field <maxfilesperjob>, which is set to 1.

When you submit a job with project.py using a dataset as input, project.py splits the dataset up into multiple jobs. This is because datasets can be arbitrarily large (up to millions of events!), and we do not want jobs running on the grid for multiple days. project.py makes this decision based on how <numjobs> and <maxfilesperjob> are specified. project.py will always launch <numjobs> jobs regardless of how large the dataset is. If you launch a reconstruction job with 10k events in 1 job, be prepared to wait a long time for that job to finish. In fact, it won't finish, since by default your job will automatically terminate after 8 hours of run time.

<maxfilesperjob> tells project.py that SAM should deliver at most that many files to a given job. Combined with <numjobs>, <maxfilesperjob> determines how many files from a dataset are processed:

Nfiles = <maxfilesperjob>*<numjobs>

If you do not specify <maxfilesperjob>, project.py will distribute the files across <numjobs> jobs as evenly as it can. Doing this with <numjobs> = 1 is potentially disastrous, as SAM will try to copy potentially thousands of files to a single grid node. Do not do this!

With 13 files in our dataset, <numjobs> = 13 and <maxfilesperjob> = 1, we expect this xml file to launch 13 jobs, one file per job, and reconstruct the entire dataset. Go ahead and check my math with:

project.py --xml prod_reco_data.xml --stage reco1 --submit

Make sure to run check when your jobs finish:

project.py --xml prod_reco_data.xml --stage reco1 --check

Understanding Numevents

If you look in prod_reco_data.xml, you'll notice the <numevents> field is set to 1M events. But we know our dataset only contains 1672 events. Where do the extra events come from?

The answer involves understanding what <numevents> actually means. <numevents> tells lar, the underlying event processing framework, the maximum number of events to process in a given grid job. Setting this to 1M in prod_reco_data.xml tells project.py to process up to 1M events per job; since each of our jobs reads a single file containing far fewer events, that limit is simply never reached.

Archiving your Files with SAM4Users

Files located on /pnfs/dune/scratch have limited lifetimes. The scratch space is a read/write pool, and is designed for quickly accessing files across many different locations. It is not designed as a permanent home for a large number of processed data or simulation files. To keep your files permanently, you have two options:

  1. Copy your files to a permanent storage directory such as bluearc (/dune/data) or persistent dCache space (/pnfs/dune/persistent). Neither of these is recommended, as DUNE's storage space on both is limited.
  2. Archive your files on tape using sam4users to declare and catalog your files. This is the preferred option.

Creating your own Dataset with SAM4Users

Since SAM4Users is not set up by default with dunetpc, we need to set up another ups product, fife_utils:

setup fife_utils

NOTE THAT YOU SHOULD NOT SPECIFY A VERSION WHEN SETTING UP FIFE_UTILS. The product is actively being developed and bugs are being fixed all the time. Make sure you are using the most up-to-date version!

The basic SAM4Users command registers your files in SAM and creates a dataset which encapsulates them. After running project.py --check, you run it like this:

sam_add_dataset -n ${USER}_prod_reco_data_tutorial -f /pnfs/dune/scratch/users/kirby/tutorial/reco1/v07_08_00_05/prod_reco_data/files.list

The first thing SAM4Users does is rename your files, because files in SAM must have unique file names. The practical effect of this is that if you want to run a downstream stage (say reco2 or mergeana) on these files, you need to re-run project.py --check so your files.list will pick up the updated file names.
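
To confirm the dataset was created, you can query it with samweb. A sketch, assuming the dataset name used above:

samweb describe-definition ${USER}_prod_reco_data_tutorial
samweb list-files --summary "defname:${USER}_prod_reco_data_tutorial"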

Moving Files to Tape

Once your files have been declared, it's trivial to move them to the tape-backed area by running:

sam_move2archive_dataset -n ${USER}_prod_reco_data_tutorial

Working with Larger Samples

The preceding sections should give you enough knowledge to run jobs using input from SAM, as well as to generate your own MC samples. The following sections deal with more advanced topics, like generating large (10k event) MC samples and running over entire datasets.

Running your own code with project.py

Project.py can run virtually anything which invokes "lar" as the primary executable. This means you can use arbitrary .fcl files, read in arbitrary lar files, and execute arbitrary art modules on the grid using project.py. But there are a few rules to keep in mind:

  • All input files must be located on /pnfs space. This means no copying data from /dune/data or /nashome/. Your job will most likely not work if you do.
  • The grid knows nothing about your environment. It is possible to pass environment variables to the grid using the <jobsub> field, but it is not done by default.
  • Files that your code needs to run must be included in the dunetpc package. dunetpc is a ups product that is regularly tagged, built, and visible to the grid.

Project.py with custom input files

You can always specify files to be copied and read in by grid jobs by defining <inputlist> at the stage level. Project.py will launch <numjobs> jobs, dividing the input list as evenly as it can while still respecting <maxfilesperjob>. Just be sure the input files are located on /pnfs!

Example:

<inputlist>/pnfs/uboone/scratch/users/jiangl/mcc7.0_val/v05_08_00/reco2/prodgenie_single_proton_uboone_23/files.list</inputlist>
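
Building such a list yourself is straightforward. A hypothetical sketch; the directory name is a placeholder, and the files you list must live on /pnfs (the list file itself can live in your working area):

ls /pnfs/dune/scratch/users/$USER/my_reco_inputs/*.root > my_files.list

You would then point the <inputlist> element of the relevant stage at the resulting my_files.list.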

Project.py with a custom build of larsoft or dunetpc

Checking out and building ups packages is beyond the scope of this tutorial; however, there is a good walkthrough here.

Fortunately, after building your own version of dunetpc, getting it to run on the grid is fairly simple. After you've built your release, run:

make_tarball.sh ${USER}_local_`date +%F-%H%M`.tar

This will create a tarball of your release in the current working directory. You then need to copy the file to resilient storage (a subdirectory of /pnfs/dune/resilient/users/&user;/) and then add a line in the <larsoft> block of your xml file to tell project.py to copy the tarball from there to the grid:

  <larsoft>
    <tag>&relreco1;</tag>
    <qual>e17:prof</qual>
    <local>FullPathToTarBall/$USER_local_<date>.tar</local>
  </larsoft>
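
The copy to resilient storage can be done directly from a gpvm, since /pnfs is mounted there. A minimal sketch; the destination directory is created first if it does not already exist:

mkdir -p /pnfs/dune/resilient/users/$USER
cp ${USER}_local_*.tar /pnfs/dune/resilient/users/$USER/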

It is important to have your tarball in "resilient" storage rather than in /dune/app or some other place. Using the wrong storage volume might cause problems with dCache.
See Understanding Storage Volumes for more information about the use of different storage volumes, and Sample_XML_file for a sample XML file.

Best Practices for Submitting Analysis Jobs

Now that you have access to all of the data and MC, you are ready to submit thousands of jobs to run over millions of events, right?

DUNE currently has very large data products and a lot of data. As a result, running over full datasets amounts to transferring hundreds of TB of data to remote nodes, possibly from tape. This can put a significant strain on production resources, and it is very inefficient because remote nodes sit idle waiting for input. Please adhere to the following guidelines when developing, testing and running analysis code.

Best Practices for Using the Grid to Develop Analysis Code
  • Always test analyzer modules interactively before submitting to the grid.
  • In order to get an input file to test on, you can follow this procedure:
    1. Find the definition name for the data set you wish to test
    2. Get the name of an input file from your chosen data set
        samweb list-files "defname:<definition> with limit 1"
        

    3. Find the path to the directory containing the file
        samweb locate-file <file_name>
        

    4. Check if the file has been staged for use (if not expect the next step to take some time)
        cat <path_to_directory>/".(get)(<file_name>)(locality)" 
        # E.g:
        # cat /pnfs/uboone/data/uboone/raw/swizzle_trigger_streams/mergebnb/prod_v04_26_04_05/00/00/60/98//".(get)(PhysicsRun-2016_4_28_0_17_54-0006098-00045_20160428T143449_bnb_20160501T071825_merged.root)(locality)" 
        
  • ONLINE or ONLINE_AND_NEARLINE - means the file has been staged and can be used immediately
  • NEARLINE - means the file is only available on tape
  • If your file is only on tape, then it will need to be staged before it can be copied to your user area for your tests. Staging happens automatically when you try to execute the copy, but expect it to be a slow process.
    5. Get the access URL of the file and copy it to your user area
        samweb get-file-access-url <file_name>
        ifdh cp -D <file_url> .
        
  • Before launching jobs on the grid, you should first check that your input files have been staged from tape using the command above. If you don't do this, your jobs will sit idle while the files are staged making them very inefficient.
  • You can pre-stage a whole dataset using samweb
        samweb prestage-dataset --defname=<definition>
        
  • Issue this command around 24 hours before you need to submit your jobs
  • Consider using nohup or screen as this command can take a while
  • If the dataset has already been staged then this command will tell you and you can submit your jobs right away
  • The files will remain on disk for around 30 days since the last time they were accessed before being flushed
  • When you are ready to launch jobs on the grid, test your code on an escalating number of jobs. First try one job, then 10, and then 100 (if you need that many).
  • Use developmental datasets instead of full datasets while developing. Developmental datasets contain roughly 200k events, so you should have reasonable statistics.
  • Grid jobs should take between 1 and 8 hours to complete. This can be tricky depending on what kind of module you are running. It is possible to combine files together in a single lar job by increasing the <maxfilesperjob> parameter, and if you have a very fast module with a total output less than 10 GB per file, this is a good option.
  • If you plan to launch more than 1k jobs, please get in touch with the data management conveners or physics coordinators so we can understand your requirements.

Best Practices for Using the Grid to Run Analysis Code

When you have your analyzer nearly complete and are ready to run over large scale data or MC samples, please follow these guidelines:

  • First and foremost, notify the physics coordinators and DM conveners of your request. Be specific: indicate what sample(s) you want to run over, what code you are running, what your timeline is, etc.
  • Once the DM team understands your request, begin prestaging your anticipated dataset. Prestaging begins the process of copying files from tape to cache space and speeds up the transfers to interactive nodes.
  • If you are trying to isolate a rare (less than 10%) process for MC, or are only interested in a select few data events (again, less than 10%), write a filter module which strips only events you are interested in to a separate file. You can then run over the stripped down file much faster and much more efficiently.
  • If running over an entire dataset, it is strongly preferred that your analysis code be part of a tagged dunetpc release. This way, your code can be mounted directly on a grid worker node over cvmfs, which avoids the need to copy thousands of tarballs to remote workers.
  • Ideally, you want to limit the rate of job submission to 1k per minute. In practice, this can mean carving the total dataset up into several smaller datasets and submitting them one by one. The easiest way to do this is via "limit" and "offset" predicates in SAM:
samweb create-definition <dataset_first1k> defname:<dataset> with limit 1000
samweb create-definition <dataset_second1k> defname:<dataset> with limit 1000 and offset 1000
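
If you need more than a couple of chunks, a small shell loop saves typing. A hypothetical sketch; the chunk size, number of chunks, and dataset names are placeholders you should adapt to your own dataset:

# create four 1000-file chunks of <dataset>, named <dataset>_chunk0 .. <dataset>_chunk3
for i in 0 1 2 3; do
  samweb create-definition <dataset>_chunk$i "defname:<dataset> with limit 1000 and offset $((i*1000))"
done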