CAFAna on the grid

First, are you sure you need to run on the grid?

How long is your macro expected to take (look at the progress bar)? If it's less than about an hour, the grid is definitely more hassle than it's worth (not for CAFAna reasons, just in general).

Second, are you really sure you need to run on the grid?

The goal of CAFAna is for macros to run relatively quickly interactively. If your runtimes are very large, one of the approaches below may well help.

Optimized build

An optimized build probably won't help much if your problem is large input files, but in other cases it can be a big help (2-3x).

When you run setupnova:

setupnova -r $REL -b maxopt

Or you can do it later

export SRT_QUAL=maxopt
srt_setup

Remember that if you're using a test release, your local changes won't have any effect unless you've built them with the same qualifier (debug vs maxopt) that you're running with.

Running over fewer files

The least satisfying solution, but also the easiest. Can you get away with running over fewer files? Pass an option like -s 10 to cafe to run over only 1/10th of the files in your dataset or wildcard.
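
For example (my_macro.C is a stand-in for your own macro):

cafe -s 10 my_macro.C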

Particularly when developing your macro you want as quick a turnaround as possible, and probably don't care so much what the results actually look like.

If you spend most of your time rerunning a multiple-minutes macro just to make a series of changes to the plotting styles, it will be well worth your time to split your macro up. A good pattern is one (slower) macro that simply does all the loading from files and immediately writes out all the Spectrum, Prediction, etc. objects you need using SaveTo, and a second (very quick) macro that loads those back in (LoadFrom) and plots them, which you can iterate as many times as necessary.
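
To make this concrete, here's a minimal sketch of the two-macro pattern. The dataset definition, variable, binning and directory name are all placeholders, and the exact headers and Var/SaveTo/LoadFrom signatures have varied between CAFAna releases, so treat it as a template rather than copy-paste code.

// make_spectra.C -- the slow part: run over the dataset once
#include "CAFAna/Core/Binning.h"
#include "CAFAna/Core/Cut.h"
#include "CAFAna/Core/Spectrum.h"
#include "CAFAna/Core/SpectrumLoader.h"
#include "CAFAna/Core/Var.h"
#include "TFile.h"

using namespace ana;

void make_spectra()
{
  SpectrumLoader loader("my_dataset_definition"); // placeholder SAM definition

  // Placeholder variable: calorimetric energy of the slice
  const Var kCalE({"slc.calE"},
                  [](const caf::StandardRecord* sr){return sr->slc.calE;});

  Spectrum sEnergy("Calorimetric energy (GeV)", Binning::Simple(40, 0, 5),
                   loader, kCalE, kNoCut);

  loader.Go(); // this is the expensive step

  // Write everything out so the plotting macro never touches the CAFs
  TFile fout("spectra.root", "RECREATE");
  sEnergy.SaveTo(fout.mkdir("energy"));
}

// plot_spectra.C -- the fast part: iterate on styles as often as you like
#include "CAFAna/Core/Spectrum.h"
#include "TFile.h"

using namespace ana;

void plot_spectra()
{
  TFile fin("spectra.root");
  auto sEnergy = Spectrum::LoadFrom(fin.GetDirectory("energy"));
  sEnergy->ToTH1(sEnergy->POT())->Draw("hist");
}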

Reducing file size (sumdecaf/concat)

Take a look at the size of your input files:

samweb list-definition-files --summary my_query

or

du -sh my_dir/

If your dataset is large (more than a gigabyte or so) you can definitely go faster with smaller files.

Are there sumdecafs/concats you should be using rather than CAFs? The analysis groups take the CAFs produced by production and make analysis-specific sumdecafs/concats with a preselection applied (some combination of quality, containment and loose cosmic-rejection cuts). These files also drop some excessive truth information carried around by full CAFs, so they are usually substantially smaller than the corresponding CAFs.

If the existing sumdecafs are unsuitable for you for some reason, e.g. they cut out events you need, or you could make your files substantially smaller by applying more stringent cuts, look into making your own sumdecafs (Making decafs). It's not that hard, and could save you a lot of time in future. If your sumdecafs might be broadly useful, please discuss the cuts and metadata with the relevant people (group conveners, production conveners) and look into making them broadly available.

Reducing file number (concat)

If you have a large number of input files (thousands), time can be dominated by opening and closing files, especially from pnfs over xrootd, instead of by doing useful work. If it only takes a second or so to process each individual file, concatenating your inputs could make a big difference. Production should provide a concatenated version of every completed decaf dataset from the Second Analysis onwards at /pnfs/nova/production/concat/ and /nova/prod/concat/. If you have some unusual requirements, or you need to concatenate your own decafs, look at Concatenating CAFs.

So you really do need to run on the grid

First, all the regular grid rules apply. Remember that the BlueArc areas are completely inaccessible (no reading, writing or execution) from grid jobs, though you should never read any large files from BlueArc anyway. Best practice is to write your output files to /pnfs/nova/scratch/users/$USER/. If you need to be certain of keeping the files around for a while, transfer them to /pnfs/nova/persistent/users/$USER/ or /nova/ana/users/$USER/, but remember to tidy these files up, as the disks are limited.

submit_cafana.py is much like submit_nova_art.py. It will submit N copies of your macro to the grid. Each one automatically sets the --stride and --offset arguments to cafe so that, between them, the jobs see all the files in whatever datasets the macro uses, divided evenly. submit_cafana.py automatically copies all output files (*.root, but also e.g. *.txt in case you're making event lists) to the specified output directory. Outputs are automatically named e.g. myoutput.1_of_10.root. It also copies back a log file for each process.
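
A typical invocation (the macro name and output directory here are just examples) looks something like:

submit_cafana.py -n 10 -r $REL -o /pnfs/nova/scratch/users/$USER/myjob my_macro.C

The full usage is: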

usage: submit_cafana.py [-h] [-f FILE] -n N -r REL [-i INPUT_FILE] -o DIR
                        [-t DIR] [-ep PRODUCT:VERSION] [-ss] [-off] [-d] [-x]
                        [--dedicated] [--disk DISK] [--memory MEMORY]
                        [--lifetime LIFETIME] [--source SOURCE]
                        [--reuse_tarball | --user_tarball USER_TARBALL]
                        [--print_jobsub] [--test] [--ifdh_debug]
                        macro.C [args [args ...]]

Submit a CAFAna macro with datasets split between N jobs

positional arguments:
  macro.C               The CAFAna macro to run
  args                  Arguments to the macro

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Text file containing any arguments to this utility
  -n N, --njobs N       Number of grid processes
  -r REL, --rel REL     Release to use
  -i INPUT_FILE, --input_file INPUT_FILE
                        Copy this input file to work area on worker node
  -o DIR, --outdir DIR  Directory output files will go to
  -ss, --snapshot       Use latest snapshot instead of requerying
  -off, --offsite       Run this cafana job offsite
  -d, --drain           Recover files missing from your output directory
  -x, --xrootdebug      Add extra xrootd debugging information

Job control options:
  These optional arguments help control where and how your jobs land.

  -t DIR, --testrel DIR
                        Use a test release at location TESTREL. It will be
                        tarred up, and sent to the worker node. Conflicts with
                        --user_tarball
  --user_tarball USER_TARBALL
                        Use existing test release tarball in specified
                        location rather than having jobsub make one for you
                        (conflicts with --testrel, and is redundant with
                        --reuse_tarball)
  --reuse_tarball       Do you want to reuse a tarball that is already in
                        resilient space? If using this option avoid trailing
                        slash in --testrel option. (redundant with
                        --user_tarball)
  --dedicated           Only run on dedicated nodes on fermigrid (default is
                        to run opportunistically)
  --disk DISK           Local disk space requirement for worker node in MB
                        (default is 2000MB).
  --memory MEMORY       Local memory requirement for worker node in MB
                        (default is 1900MB).
  --lifetime LIFETIME   Expected job lifetime. Valid values are an integer
                        number of seconds. (default is 10800=3h)
  --source SOURCE       Source script SOURCE:par1:par2:..
  -ep PRODUCT:VERSION, --extproduct PRODUCT:VERSION
                        Setup this external product on the worker node in
                        format <product>:<version>

Debugging options:
  These optional arguments can help debug your submission.

  --print_jobsub        Print jobsub command
  --test                Do not actually do anything, just run tests and print
                        jobsub cmd
  --ifdh_debug          Verbose output for pinning down IFDH/dCache issues

To prevent multiple simultaneous SAM requests from hammering the SAM database, CAFAna automatically detects when it is being run on the grid and shares a single snapshot of each dataset among all the processes.

Once the jobs are complete you can combine the results using hadd_cafana. This behaves exactly like regular hadd, except that it can handle the TObjString contents of CAFAna output files, which trip up regular hadd. In particular, all it does is sum the various histograms, so you can only combine things for which that is sensible. This is true for regular Spectrum objects and for PredictionNoExtrap. It's not currently true for e.g. MichelDecomp, but in general these problems can be resolved by changes to the serialization format of the unsuitable classes.
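
For example, to combine the outputs of the 10-job submission above, relying on the automatic naming scheme:

hadd_cafana myoutput_combined.root myoutput.*_of_10.root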

Tips

  1. Make sure to run your hadd_cafana command on a BlueArc disk (preferably /nova/ana/users/$USER/...). Avoid using dCache (/pnfs/nova/...).
  2. If combining 1000 files, it is more efficient to run 10 incantations of hadd_cafana over a reasonable number of files each (O(100)) and then combine the results, rather than running a single command over all 1000 files. This is because dCache restricts the number of simultaneous disk access requests it allows. Coding a simple bash/python script to do this shouldn't be too difficult; see the sketch after this list.
  3. You can hadd_cafana over files in pnfs with
    hadd_cafana out.root `pnfs2xrootd /pnfs/.../in*.root`
    
  4. To view your log files, cat doesn't work from pnfs, but you can dccp the file to stdout, which has the same effect
    dccp /pnfs/.../log.txt -
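
As a sketch of the chunked combination in tip 2, assuming 1000 grid outputs following the automatic naming scheme (the chunk size and file names are just examples):

# combine 1000 outputs in 10 chunks of 100, then merge the partials
for i in $(seq 0 9); do
  hadd_cafana partial.$i.root \
    $(seq -f "myoutput.%g_of_1000.root" $((100*i+1)) $((100*i+100)))
done
hadd_cafana combined.root partial.*.root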