CAFAna on the grid¶
First, are you sure you need to run on the grid?
How long is your macro expected to take (look at the progress bar)? If it's less than about an hour the grid is definitely more hassle than it's worth (not for CAFAna reasons, just in general).
Second, are you really sure you need to run on the grid?
The goal of CAFAna is for macros to run relatively quickly interactively. If you have very large runtimes there may well be something you can do.
Building in maxopt (optimized) mode rather than debug probably doesn't help much if your problem is large file sizes, but it can be a big help (2-3x) in other cases.
You can select it when you run setupnova:
setupnova -r $REL -b maxopt
Or you can switch later:
export SRT_QUAL=maxopt
srt_setup
Remember that if you're using a test release, your local changes won't have any effect unless you've built in the same version (debug vs maxopt) as you're running.
Running over fewer files¶
The least satisfying solution, but also the easiest. Can you get away with running over fewer files? Pass an option like -s 10 to cafe to run over only 1/10th of the files in your dataset or wildcard.
Particularly when developing your macro you want as quick a turnaround as possible, and probably don't care so much what the results actually look like.
If you spend most of your time rerunning a multi-minute macro to make a series of changes to the plotting styles, it will be well worth your time to split your macro up. A good pattern is one (slower) macro that simply does all the loading from files and immediately writes out all the Prediction etc. objects you need using SaveTo, and a second (very quick) macro that loads those in (LoadFrom) and plots them, which you can iterate on as many times as necessary.
Reducing file size (sumdecaf/concat)¶
Take a look at your input files
samweb list-definition-files --summary my_query
du -sh my_dir/
If your dataset is large (more than a gigabyte or so) you can definitely go faster with smaller files.
Are there sumdecafs/concats you should be using rather than CAFs? The analysis groups take the CAFs produced by production and produce analysis specific sumdecafs/concats where a preselection has been applied (some combination of quality, containment and loose cosmic rejection cuts). These files also drop some excessive truth information carried around by full CAFs. They are usually substantially smaller than the corresponding CAFs.
If the existing sumdecafs are unsuitable for you for some reason, e.g. they cut out events you need, or you could make your files substantially smaller by applying more stringent cuts, look into making your own sumdecafs (Making decafs). It's not that hard, and could save you a lot of time in future. If your sumdecafs might be broadly useful, please discuss the cuts and metadata with the relevant people (group conveners, production conveners) and look into making them broadly available.
Reducing file number (concat)¶
If you have a large number of input files (thousands), time can be dominated by opening and closing files, especially from pnfs over xrootd, instead of doing useful work. If it only takes a second or so to process each individual file, concatenating your inputs could make a big difference. Production should provide a concatenated version of every completed decaf dataset for the Second Analysis onwards at /nova/prod/concat/. If you have some unusual requirements, or you need to concatenate your own decafs, look at Concatenating CAFs.
Reducing release tarball size¶
Since the jobs can no longer access the bluearc areas directly, your local release is tarred up and sent to the worker nodes when you submit jobs. Occasionally, however, your test release can be bloated by temporary files or other files that are not necessary for running on the grid. Creating a tarball of such a sizeable release can fill up the temporary directory on the GPVM nodes and hence prevent you from submitting jobs. The script testrel_tarball is useful in this case, as it lets you reduce the size of your release tarball. It trims down the release directory by ignoring common files and directories not required for grid running, including any temporary directories, documentation or manuals. Intermediate build files (*.o and *.d) and debug builds (*debug) are also left out of the produced tarball. testrel_tarball also minimizes the sizes of libraries and executables using the GNU strip utility. To run the script, specify your test release directory and the directory in which to place the output tarball:
testrel_tarball </path/to/test_release> </path/to/output_tarball>
This will produce a tarball output_tarball.tar.bz2 in the specified output directory. You can then tell submit_cafana.py to use this user-made tarball with the --user_tarball option.
Note that the --user_tarball option conflicts with --testrel. Therefore, once you specify a particular tarball to use, you no longer need to specify the test release directory with -t DIR or --testrel DIR. By the same token, --reuse_tarball is redundant with --user_tarball.
In the case of a failed job submission, it is always good to check the temporary directory on the GPVM nodes with:
ls -lhS /tmp
and make sure to remove any abnormally large resulting temporary files.
So you really do need to run on the grid¶
First, all the regular grid rules apply (LEARN MORE). Remember that the bluearc areas are completely inaccessible for reading/writing/execution, though you should never read any large files from bluearc anyway. Best practice is to write your output files to /pnfs/nova/scratch/users/$USER/. If you need to be certain that you keep the files around for a while, transfer them to /nova/ana/users/$USER/, but remember to tidy these files up as the disks are limited.
submit_cafana.py is much like submit_nova_art.py. It will submit N copies of your macro to the grid. Each one will automatically set the stride (-s) and --offset arguments to cafe so that between them the jobs see all the files in whatever datasets the macro uses, divided evenly between them.
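To illustrate the splitting described above, here is a minimal Python sketch. It assumes that a stride of N with offset k selects every Nth file of the list starting at position k; that convention, the helper name files_for_job and the file names are assumptions for illustration and are not taken from the cafe source.

```python
def files_for_job(files, stride, offset):
    """Return the subset of `files` one job would see under the assumed convention."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if not 0 <= offset < stride:
        raise ValueError("offset must satisfy 0 <= offset < stride")
    # Every `stride`-th file, starting from position `offset`.
    return files[offset::stride]


# Hypothetical dataset of 10 files split between 3 jobs.
dataset = [f"file_{i:02d}.root" for i in range(10)]
njobs = 3
per_job = [files_for_job(dataset, njobs, k) for k in range(njobs)]

for k, subset in enumerate(per_job):
    print(f"job {k}: {len(subset)} files")
```

Under this assumption every file is seen by exactly one job, and the per-job file counts differ by at most one.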
submit_cafana.py automatically copies all output files (*.root but also e.g. *.txt in case you're making event lists) to the specified output directory. Outputs will be named e.g. myoutput.1_of_10.root automatically. It also copies back a log file for each process.
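Given this naming scheme, a quick check for jobs that have not delivered output might look like the sketch below. The pattern is inferred from the single example name myoutput.1_of_10.root, and the assumption that indices run from 1 to N is not confirmed by the text; the helper name missing_parts and the listing are hypothetical.

```python
import re

# Assumed pattern, based on the example name myoutput.1_of_10.root.
PART_RE = re.compile(r"^(?P<stem>.+)\.(?P<index>\d+)_of_(?P<total>\d+)\.root$")


def missing_parts(filenames, stem, njobs):
    """Return the sorted job indices (assumed to run 1..njobs) with no output file."""
    seen = set()
    for name in filenames:
        match = PART_RE.match(name)
        if match and match["stem"] == stem and int(match["total"]) == njobs:
            seen.add(int(match["index"]))
    return sorted(set(range(1, njobs + 1)) - seen)


# Hypothetical directory listing: jobs 2 and 4 of 5 have not delivered output.
listing = ["myoutput.1_of_5.root", "myoutput.3_of_5.root",
           "myoutput.5_of_5.root", "events.1_of_5.txt"]
print(missing_parts(listing, "myoutput", 5))
```

In practice the listing would come from the output directory itself. The --drain option of submit_cafana.py is described in its help text as recovering files missing from your output directory.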
usage: submit_cafana.py [-h] [-f FILE] -n N -r REL [-i INPUT_FILE] -o DIR
                        [-t DIR] [-ep PRODUCT:VERSION] [-ss] [-off] [-d] [-x]
                        [--dedicated] [--disk DISK] [--memory MEMORY]
                        [--lifetime LIFETIME] [--source SOURCE]
                        [--reuse_tarball | --user_tarball USER_TARBALL]
                        [--print_jobsub] [--test] [--ifdh_debug]
                        macro.C [args [args ...]]

Submit a CAFAna macro with datasets split between N jobs

positional arguments:
  macro.C               The CAFAna macro to run
  args                  Arguments to the macro

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Text file containing any arguments to this utility
  -n N, --njobs N       Number of grid processes
  -r REL, --rel REL     Release to use
  -i INPUT_FILE, --input_file INPUT_FILE
                        Copy this input file to work area on worker node
  -o DIR, --outdir DIR  Directory output files will go to
  -ss, --snapshot       Use latest snapshot instead of requerying
  -off, --offsite       Run this cafana job offsite
  -d, --drain           Recover files missing from your output directory
  -x, --xrootdebug      Add extra xrootd debuggin information

Job control options:
  These optional arguments help control where and how your jobs land.

  -t DIR, --testrel DIR
                        Use a test release at location TESTREL. It will be
                        tarred up, and sent to the worker node. Conflicts
                        with --user_tarball
  --user_tarball USER_TARBALL
                        Use existing test release tarball in specified
                        location rather than having jobsub make one for you
                        (conflicts with --testrel, and is redunant with
                        --reuse_tarball)
  --reuse_tarball       Do you want to reuse a tarball that is already in
                        resilient space? If using this option avoid trailing
                        slash in --testrel option. (redundant with
                        --user_tarball)
  --dedicated           Only run on dedicated nodes on fermigrid (default is
                        to run opportunistically)
  --disk DISK           Local disk space requirement for worker node in MB
                        (default is 2000MB).
  --memory MEMORY       Local memory requirement for worker node in MB
                        (default is 1900MB).
  --lifetime LIFETIME   Expected job lifetime. Valid values are an integer
                        number of seconds. (default is 10800=3h)
  --source SOURCE       Source script SOURCE:par1:par2:..
  -ep PRODUCT:VERSION, --extproduct PRODUCT:VERSION
                        Setup this external product on the worker node in
                        format <product>:<version>

Debugging options:
  These optional arguments can help debug your submission.

  --print_jobsub        Print jobsub command
  --test                Do not actually do anything, just run tests and print
                        jobsub cmd
  --ifdh_debug          Verbose output for pinning down IFDH/dCache issues
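If you build submissions from a script, a small wrapper can catch the documented option conflict before anything is sent. The sketch below only assembles an argument list from flags that appear in the usage text above; the function name build_submit_command and all the example values are hypothetical, and the wrapper does not run anything.

```python
import shlex


def build_submit_command(macro, njobs, release, outdir,
                         testrel=None, user_tarball=None, dry_run=True):
    """Assemble a submit_cafana.py argument list using only flags from the usage text."""
    if testrel and user_tarball:
        # The usage text states that --testrel conflicts with --user_tarball.
        raise ValueError("--testrel conflicts with --user_tarball")
    cmd = ["submit_cafana.py", "-n", str(njobs), "-r", release, "-o", outdir]
    if testrel:
        cmd += ["--testrel", testrel]
    if user_tarball:
        cmd += ["--user_tarball", user_tarball]
    if dry_run:
        # --test: "Do not actually do anything, just run tests and print jobsub cmd"
        cmd.append("--test")
    cmd.append(macro)
    return cmd


# Hypothetical values for illustration only.
cmd = build_submit_command("macro.C", 10, "my_release", "/path/to/outdir",
                           user_tarball="/path/to/output_tarball.tar.bz2")
print(" ".join(shlex.quote(part) for part in cmd))
```

Defaulting to --test keeps the first attempt a dry run; pass dry_run=False once the printed command looks right.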
To prevent multiple simultaneous SAM requests from hammering the SAM database, CAFAna automatically detects when it is being run on the grid and shares a single snapshot of each dataset among all the processes.
Once the jobs are complete you can combine the results using hadd_cafana. This behaves exactly the same as regular hadd, except that it can handle the TObjString contents in CAFAna output files that cause regular hadd problems. In particular, all it does is sum the various histograms, so you can only combine things where that does something sensible. This is true for regular Spectrum objects, and for PredictionNoExtrap. It's not true right now for e.g. MichelDecomp, but in general these problems can be resolved by changes in the serialization format of unsuitable classes.
- Make sure to run your hadd_cafana command on a BlueArc disk (preferably /nova/ana/users/$USER/...). Avoid using dCache.
- If combining 1000 files, it is more efficient to run 10 incantations of hadd_cafana, each over a reasonable number of files (O(100)), and then combine the results, rather than running a single command over all 1000 files. This is because of a restriction on the number of simultaneous disk access requests allowed on dCache. Coding a simple bash/python script to do this shouldn't be too difficult.
- You can hadd_cafana over files in pnfs with
hadd_cafana out.root `pnfs2xrootd /pnfs/.../in*.root`
- To view your log files, cat doesn't work from pnfs, but you can dccp the file to stdout, which has the same effect:
dccp /pnfs/.../log.txt -
- To submit a macro containing a custom header file, you need to move both the macro and the header file to your scratch area. All files you send to the grid are put in $CONDOR_DIR_INPUT, so the include statement in your macro doesn't need the full file path; #include "header.h" is all you need:
submit_cafana.py -n <jobs> -r <rel> -i </path/to/header/in/scratch>/header.h -o <output directory> macro.C
- Similarly, if your macro needs an input file, move that file and your macro to your scratch area and pass the input file with the -i option in the same way as above. Additionally, inside your macro, open the input file via:
TFile *f1 = TFile::Open(pnfs2xrootd("/pnfs/nova/scratch/path/to/file/input.root").c_str());
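The two-stage hadd_cafana tip above can be sketched as a short Python script. It assumes the invocation form hadd_cafana OUTPUT INPUT... shown earlier; the intermediate file names (partial_N.root), the batch size default and the helper names are hypothetical choices for illustration. The plan is only built and printed here; call run_plan to execute it.

```python
import subprocess


def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_two_stage_hadd(inputs, final_output, batch_size=100):
    """Return the hadd_cafana commands for a two-stage merge, as argument lists."""
    commands, partials = [], []
    for n, batch in enumerate(chunked(inputs, batch_size)):
        partial = f"partial_{n}.root"  # hypothetical intermediate file name
        partials.append(partial)
        commands.append(["hadd_cafana", partial] + batch)
    # Final pass: merge the intermediate files into one output.
    commands.append(["hadd_cafana", final_output] + partials)
    return commands


def run_plan(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical inputs: 1000 job outputs merged in batches of 100.
inputs = [f"myoutput.{i}_of_1000.root" for i in range(1, 1001)]
plan = plan_two_stage_hadd(inputs, "merged.root")
print(len(plan), "commands")
```

Keeping the planning step separate from execution makes it easy to inspect the commands before running anything.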