
These notes were originally published by Francesco Tortorici on August 10, 2018. They should be considered obsolete once ICARUS manages production with POMS.

How to monitor production jobs @ FNAL

by Francesco Tortorici

Version 2018-08-04

Introduction

In order to keep track of the status of the production jobs on Fermigrid, you need to know how to:

• Launch/kill a job;
• Check the status of the jobs on grid;
• Identify the output directory corresponding to a given (held) job.

Of course, you also need a computer account at FNAL.

Useful documentation

Before we start, let me suggest some useful links. Although the information therein is not strictly required for the purposes of this document, it does not hurt to have a basic knowledge of LArSoft:

Environment setup

Once you log in to icarusgpvm01.fnal.gov, issue the following command:

source /cvmfs/icarus.opensciencegrid.org/products/icarus/setup_icarus.sh

If you have logged in using the shared account icaruspro, then also execute:
export X509_USER_PROXY=/opt/icaruspro/icaruspro.Production.proxy

otherwise, i.e. if you are using your own account USERNAME, the command kx509 will suffice. It will create a grid certificate valid for at least 24 hours.
Next, you ought to set up, just in case, a version of icaruscode. At the time of writing, the latest version is v07_00_01, with qualifiers e15:prof, so the command to issue is:
setup icaruscode v07_00_01 -q e15:prof

To simplify this procedure, you might want to put all these commands into your login config file.
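
For reference, here is a minimal sketch of such a block (assuming bash and your own, non-icaruspro, account; adapt the icaruscode version and qualifiers as needed):

# ICARUS production environment (sketch; update the version when needed)
source /cvmfs/icarus.opensciencegrid.org/products/icarus/setup_icarus.sh
kx509                                    # grid certificate, valid for at least 24 hours
setup icaruscode v07_00_01 -q e15:prof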

Launch/kill a job (using project.py)

In order to submit a job to the grid, just cd to the directory containing the xml file you want to use (you can find an example in appendix A), decide which stage therein you want, and invoke project.py with at least the following options:

project.py --xml XMLFILE --stage STAGE --submit

where XMLFILE and STAGE are of course placeholders. Typical stage names are gen, g4, detsim and reco.
In a few seconds, an output similar to the following should appear on the terminal:
Stage gen:
Invoke jobsub_submit
jobsub_submit finished.

At this time, the output, log and work directories are created. Their paths are usually set in the xml. An important quirk to know is that a job will not be submitted (unless you force it) if any of these directories is not empty. The easiest solution is to just erase them and try again. A good practice is to set all three of them inside a common base directory. In this way, once you locate one of them, in case of problems you just need to remove the base directory.
There are two useful options of project.py that you should know, namely --check and --makeup. They both apply after the jobs have finished (or have been killed), and are used in place of --submit. The parameters --xml and --stage are still required, though. The former (--check) verifies whether all the produced root files are good (it may take a while to run) and lets you know if there are issues. In that case, --makeup should relaunch only the needed jobs. Warning: if you do the operations --check and --makeup out of order, project.py will complain. Not a problem, just do --check first and nobody will get hurt.
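For example, after the jobs of a stage have finished:

project.py --xml XMLFILE --stage STAGE --check
project.py --xml XMLFILE --stage STAGE --makeup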
To kill a job, you need to know its JOBID. Please refer to section 5 to learn how to get a list of the jobs currently on the grid. Then issue the following:
jobsub_rm -G icarus --jobid=JOBID
or
jobsub_rm --user USERNAME --jobid=JOBID

Note that you can kill more than one job at a time: just substitute a comma-separated list of IDs for the placeholder JOBID above.
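For instance, using the (purely illustrative) job IDs appearing in the sample output of section 5:

jobsub_rm -G icarus --jobid=10506357.0@jobsub01.fnal.gov,10506666.0@jobsub01.fnal.gov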

Check the status of the jobs on grid

This is easily done with

jobsub_q -G icarus
or, if you are interested in your jobs only, as opposed to every ICARUS job on the grid:
jobsub_q --user USERNAME

In any case, the output should look like the following:
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 10506357.0@jobsub01.fnal.gov icaruspro 08/01 07:50 0+01:37:00 R test_smazza_2018-v06_85_00.sh_20180801_075059_1058854_0_1_wrap.sh
...
10506666.0@jobsub01.fnal.gov icaruspro 08/01 08:31 0+00:00:00 I test_smazza_2018-v06_85_00.sh_20180801_083114_1223639_0_1_wrap.sh
283 jobs; 0 completed, 0 removed, 22 idle, 261 running, 0 held, 0 suspended

For a given job, the most important pieces of information are the JOBID in the first column (the whole content, including the machine name), the submission time stamp (columns 3 and 4), the run time (column 5) and the status (column 6).
The status letter can be interpreted as follows:

  • I: the job is idle, that is, on queue;
  • R: the job is running;
  • H: the job is held, that is, locked for some reason.
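
As a quick example, you can filter on the status letter to list only the held jobs (the same grep trick is used again in the last section; adjust the pattern if the column spacing differs):

jobsub_q -G icarus | grep " H "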

The most common reason for a job to become held is that it kept running for longer than the assigned “expected lifetime”. The default value is 8 hours. This value can be changed in the xml file corresponding to the job (before launching it!) and/or via an option of project.py. See the online help of project.py for more info.
Another common reason for a job to be held is that it used too much RAM. In this case, just like for the expected lifetime, you can increase the default value via the xml and/or via an option of project.py.
To minimize the number of held jobs, it is important to find optimal values for those two parameters and put them in the xml (see the <memory> and <jobsub> elements in appendix A). Note that if the values you put in are too small, the job is almost guaranteed to be held; on the other hand, if you request too many resources, your job will, on average, wait in the queue for longer than necessary. The suggested way to proceed is to do some tests on a local machine, as opposed to a job on the grid, using lar, and inspect the logs; the most important files to check are the .json one (file size should be > 0), larStage0.err (if you encounter problems) and larStage0.out (if the process was successful, at the end you will find the cpu/ram used). See the documentation in the links section for additional information.
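As a sketch of the log inspection, assuming you have cd'ed into the log directory of one of the jobs:

ls -l *.json          # the .json file size should be > 0
less larStage0.err    # check here if you encounter problems
less larStage0.out    # if the process was successful, the cpu/ram used is reported at the end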
Even if you set memory and cpu time to optimal values, from time to time some jobs will be held anyway. This can be due, for example, to the load-balancing algorithms of the grid: they might lower, at any given time, the percentage of cpu power dedicated to some of your jobs, which will then require a longer time to finish, and possibly go over the maximum runtime allocated.

Identify the log/work/output directories corresponding to a given (held) job

This section can normally be ignored, thanks to the --check and --makeup options of project.py. However, from time to time manual intervention may be needed.
If all the jobs are held, just kill them (see section 4). Before re-launching the jobs, you need to delete the log/work/out directories corresponding to the failed jobs. To locate the output directories of the successfully finished jobs, just do a simple

find . -name "*.root"
so, of course, all the existing output directories not in that list are the ones you are looking for (see below for a possible workflow). At this point, if the person who launched the jobs has followed the good practice explained in section 4, you just need to remove the base directory of each held job and re-launch it. If that person has not, have fun grep'ing the xml's. From now on, let us suppose s/he did.
Now let us consider the difficult case: you have both held and running/idle jobs. The procedure described below is mostly heuristic (a script sketch putting the steps together is given after the list). First, cd into a directory containing all the job sub-directories.

  1. As said in section 5, one of the most important info given by jobsub_q is the submission time stamp. Take note of the held jobs timestamps (quick);
    • Suggestion:
      jobsub_q | grep H | awk '{print $4" "$3" "$1}' > held_list
    • Here and in the following suggested commands in this section, piping to awk is optional. The rationale for using it is to remove unnecessary information and to move the time to a comfortable position for sorting.
  2. Build a list of the timestamps of the successfully-done-jobs directories (time consuming);
    • Suggestion:
      find . -name "*.root" -ls | awk '{print $8" "$6" "$7" "$9}' > dir_list_root
  3. Build a list of the timestamps of all the existing job directories (quick);
    • Suggestion (ls options are lowercase L and lowercase D):
      ls -ld * | awk '{print $8" "$6" "$7" "$9}' > dir_list_all
  4. If you have done the suggested commands, you will have three files in which the first column is a time, the last column is a name (JOBID / directory), and there is a date in the middle column(s).
    • You might want to sort the files numerically with sort -n.
  5. For each row in held_list:
    • consider the timestamp, and search for it in the dir_list* files via grep. You should obtain two sublists of “candidates”, one for each dir_list* file. Warning: for some reason find -ls and ls -l give the date in different formats. It is possible in principle to pipe through the date command to apply a conversion. Usually I do not care.
    • Now look at the list of candidates coming from dir_list_all, and remove the lines such that the directory contains a root file (you know which these are from the list of candidates coming from dir_list_root). The surviving lines are what we are after!
    • The result of the previous step is a (final) list of candidates. Each of them is potentially the directory corresponding to the held job we are considering. If there is only one directory candidate, just remove it. Congratulations, you can now safely relaunch the job! Otherwise, you are out of luck, sorry. Because there are still idle/running jobs, there is no way, yet, to uniquely identify the current held job directory. Wait for some more jobs to complete and try again.
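
For convenience, here is a rough, untested sketch that strings the steps above together. It assumes the suggested file names, that you run it from the directory containing all the job sub-directories, and it does not convert between the different date formats of find -ls and ls -l, so treat the printed directories as candidates only:

#!/bin/bash
# 1. held jobs: time, date, JOBID (adjust the grep if the column spacing differs)
jobsub_q -G icarus | grep " H " | awk '{print $4" "$3" "$1}' | sort -n > held_list
# 2. directories already containing a root file (successfully finished jobs)
find . -name "*.root" -ls | awk '{print $8" "$6" "$7" "$9}' | sort -n > dir_list_root
# 3. all existing job directories
ls -ld * | awk '{print $8" "$6" "$7" "$9}' | sort -n > dir_list_all
# 4.-5. for each held job, print the directories whose timestamp matches
#       and which do not contain any root file
while read htime hdate jobid; do
  echo "== held job $jobid (submitted $hdate $htime) =="
  grep "^$htime " dir_list_all | while read t m d dir; do
    grep -q "/$dir/" dir_list_root || echo "  candidate: $dir"
  done
done < held_list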

Appendix A

Here is an example of xml, suitable to produce 1000 νe events spread over 50 jobs, via the fcl simulation_genie_icarus_Aug2018_nue.fcl, present in the repository, with version v07_00_01 of icaruscode. Note that the time/memory requirements in this example have been tested only for the ratio 1000/50 = 20 events per job. PLEASE change at least the out/log/work directories before use! Good paths are somewhere on /pnfs, as opposed to your home on /nashome, because the latter is not visible from the jobs running on the grid, so they would not be able to save files. Although the contents of the !ENTITY strings (release, file_type, run_type, name and prod_version) can be anything, it is strongly advised to put sensible information therein.

<?xml version="1.0"?>
<!-- Production Project -->
<!DOCTYPE project [
<!ENTITY release "v07_00_01">
<!ENTITY file_type "mc">
<!ENTITY run_type "physics">
<!ENTITY name "prod_nue">
<!ENTITY prod_version "test_v07">
]>

<job>

<project name="&name;">

  <!-- Project size -->
  <numevents>1000</numevents>

  <!-- Operating System -->
  <os>SL6</os>

  <!-- Batch resources -->
  <resource>DEDICATED,OPPORTUNISTIC</resource>

  <!-- Larsoft information -->
  <larsoft>
    <tag>&release;</tag>
    <qual>e15:prof</qual>
  </larsoft>

  <!-- Project stages -->

  <stage name="gen">
    <fcl>simulation_genie_icarus_Aug2018_nue.fcl</fcl>
    <outdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/gen/&name;/out</outdir>
    <logdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/gen/&name;/log</logdir>
    <workdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/gen/&name;/work</workdir>
    <numjobs>50</numjobs>
    <datatier>generated</datatier>
    <memory>2000</memory>
    <defname>&name;_&prod_version;_gen</defname>
  </stage>

  <stage name="g4">
    <fcl>standard_g4_icarus.fcl</fcl>
    <outdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/g4/&name;/out</outdir>
    <logdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/g4/&name;/log</logdir>
    <workdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/g4/&name;/work</workdir>
    <numjobs>50</numjobs>
    <datatier>simulated</datatier>
    <jobsub>--memory=6000 --expected-lifetime=2h</jobsub>
    <defname>&name;_&prod_version;_g4</defname>
  </stage>

  <stage name="detsim">
    <fcl>standard_detsim_icarus.fcl</fcl>
    <outdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/detsim/&name;/out</outdir>
    <logdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/detsim/&name;/log</logdir>
    <workdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/detsim/&name;/work</workdir>
    <numjobs>50</numjobs>
    <datatier>detector-simulated</datatier>
    <jobsub>--memory=2000 --expected-lifetime=1h</jobsub>
    <defname>&name;_&prod_version;_detsim</defname>
  </stage>

  <stage name="reco">
    <fcl>reco_icarus_driver_reco_all.fcl</fcl>
    <outdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/reco/&name;/out</outdir>
    <logdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/reco/&name;/log</logdir>
    <workdir>/pnfs/icarus/scratch/icaruspro/&prod_version;/&release;/reco/&name;/work</workdir>
    <numjobs>50</numjobs>
    <datatier>reconstructed</datatier>
    <jobsub>--memory=8000 --expected-lifetime=16h</jobsub>
    <defname>&name;_&prod_version;_reco</defname>
  </stage>

  <!-- file type -->
  <filetype>&file_type;</filetype>

  <!-- run type -->
  <runtype>&run_type;</runtype>

</project>

</job>
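
As a usage example, if this xml were saved as, say, prod_nue.xml (a file name chosen here purely for illustration), the first stage would be launched with:

project.py --xml prod_nue.xml --stage gen --submit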