Running on the Grid

Job Logistics

Running on the grid (either Fermi or the OSG) means one must tune the unit of work to match semi-arbitrary constraints. Jobs must run long enough that the start-up costs (fetching input files, setting up the environment) are swamped by the real work done; file handling at the end (returning results) is also a concern. But one is penalized heavily for jobs that run a long time: jobs must be submitted with an estimated run time, long estimates hurt one's ability to get slots, and exceeding the estimate results in the job being killed mid-calculation (invalidating all the work done by that job).

What appear to be natural ways to segment the work don't map easily onto this model. Any given run of art can only run a single "exptsetup" (beam particle, energy, and target), but running a reasonable number of beam particles for one "exptsetup" and one universe is too small a unit of work.

Individual Job Scripts

As part of runartg4tk we wrote a script, genana_g4vmp_proclevel_condor.sh, that allows one to submit jobsub (i.e. Condor) "clusters" for the same "exptsetup", where each job in the cluster handles a contiguous range of universes. By choosing the number of universes run by each job process one can tune the run time.
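
A minimal sketch of the range arithmetic, assuming each job derives its universe slice from the Condor-supplied $PROCESS and a universes-per-job count (the variable names below are illustrative, not necessarily those used in genana_g4vmp_proclevel_condor.sh):

UPERJOB=${UPERJOB:-10}    # universes handled by each job process
PROCESS=${PROCESS:-0}     # supplied by jobsub/condor for each job in the cluster

UFIRST=$(( PROCESS * UPERJOB ))      # first universe for this job
ULAST=$(( UFIRST + UPERJOB - 1 ))    # last universe of the contiguous range

# zero-padded tag as it appears in the fcl names, e.g. U0000_0009
UTAG=$(printf "U%04d_%04d" ${UFIRST} ${ULAST})
echo "process ${PROCESS} handles universes ${UFIRST}..${ULAST} (${UTAG})"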

An example job fcl for piminus_on_C_at_5GeV (which has experimental data from both HARP and ITEP-771), covering universes 0 (the default physics) through 9, looks like:

# this is piminus_on_C_at_5GeV_U0000_0009.fcl
#
# MULTIVERSE        e.g. multiverse170208_Bertini  (fcl base) [multiverse170208_Bertini]
# G4HADRONICMODEL   e.g. Bertini                              [Bertini]
# PROBENAME         e.g. piplus, piminus, proton              [piminus]
# PROBEPDG          e.g. 211,    -211,    2212                [-211]
# PROBEP            e.g. 5.0 6.5   # (GeV)                    [5]
# PROBEPNODOT       e.g. 5   6p5   # used in dossier PROLOGs,
#                                  # but, no trailing 'p0's   [5]
# TARGET            e.g. Cu                                   [C]
# NEVENTS           e.g. 5000                                 [500000]

#include "multiverse170208_Bertini.fcl" 
#include "HARP_dossier.fcl" 
#include "ITEP_dossier.fcl" 

process_name: genanaXpiminusC5GeVU00000009

source: {

   module_type: EmptyEvent
   maxEvents: 500000

} # end of source:

services: {

   message: {
      debugModules : ["*"]
      suppressInfo : []
      destinations : {
         LogToConsole : {
            type : "cout" 
            threshold : "DEBUG" 
            categories : { default : { limit : 50 } }
         } # end of LogToConsole
      } # end of destinations:
   } # end of message:

   RandomNumberGenerator: {}
   TFileService: {
      fileName: "piminus_on_C_at_5GeV_U0000_0009.hist.root" 
   }

   ProcLevelSimSetup: {
      HadronicModelName:  "Bertini" 
      TargetNucleus:  "C" 
      RNDMSeed:  1
   }
   # leave this on ... documentation of what was set
   PhysModelConfig: { Verbosity: true }

} # end of services:

outputs: {

   outroot: {
      module_type: RootOutput
      fileName: "piminus_on_C_at_5GeV_U0000_0009.artg4tk.root" 
   }

} # end of outputs:

physics: {

   producers: {

      PrimaryGenerator: {
         module_type: EventGenerator
         nparticles : 1
         pdgcode:  -211
         momentum: [ 0.0, 0.0, 5 ] // in GeV
      }

      BertiniDefault            : @local::BertiniDefault
      BertiniRandomUniv0001     : @local::BertiniRandomUniv0001
      BertiniRandomUniv0002     : @local::BertiniRandomUniv0002
      BertiniRandomUniv0003     : @local::BertiniRandomUniv0003
      BertiniRandomUniv0004     : @local::BertiniRandomUniv0004
      BertiniRandomUniv0005     : @local::BertiniRandomUniv0005
      BertiniRandomUniv0006     : @local::BertiniRandomUniv0006
      BertiniRandomUniv0007     : @local::BertiniRandomUniv0007
      BertiniRandomUniv0008     : @local::BertiniRandomUniv0008
      BertiniRandomUniv0009     : @local::BertiniRandomUniv0009

   } # end of producers:

   analyzers: {

     BertiniDefaultHARP:
     {
        module_type: AnalyzerHARP
        ProductLabel: "BertiniDefault" 
        IncludeExpData:
        {
            DBRecords:  @local::HARP_piminus_on_C_at_5GeV
        }
     }

     BertiniDefaultITEP:
     {
        module_type: AnalyzerITEP
        ProductLabel: "BertiniDefault" 
        IncludeExpData:
        {
            DBRecords:  @local::ITEP_piminus_on_C_at_5GeV
        }
     }

     BertiniRandomUniv0001HARP:
     {
        module_type: AnalyzerHARP
        ProductLabel: "BertiniRandomUniv0001" 
        IncludeExpData:
        {
            DBRecords:  @local::HARP_piminus_on_C_at_5GeV
        }
     }

     BertiniRandomUniv0001ITEP:
     {
        module_type: AnalyzerITEP
        ProductLabel: "BertiniRandomUniv0001" 
        IncludeExpData:
        {
            DBRecords:  @local::ITEP_piminus_on_C_at_5GeV
        }
     }

...

     BertiniRandomUniv0009HARP:
     {
        module_type: AnalyzerHARP
        ProductLabel: "BertiniRandomUniv0009" 
        IncludeExpData:
        {
            DBRecords:  @local::HARP_piminus_on_C_at_5GeV
        }
     }

     BertiniRandomUniv0009ITEP:
     {
        module_type: AnalyzerITEP
        ProductLabel: "BertiniRandomUniv0009" 
        IncludeExpData:
        {
            DBRecords:  @local::ITEP_piminus_on_C_at_5GeV
        }
     }

   } # end of analyzers:

   path1:     [ PrimaryGenerator
              , BertiniDefault
              , BertiniRandomUniv0001
              , BertiniRandomUniv0002
              , BertiniRandomUniv0003
              , BertiniRandomUniv0004
              , BertiniRandomUniv0005
              , BertiniRandomUniv0006
              , BertiniRandomUniv0007
              , BertiniRandomUniv0008
              , BertiniRandomUniv0009
              ] // end-of path1

   path2:     [
                BertiniDefaultHARP
              , BertiniDefaultITEP
              , BertiniRandomUniv0001HARP
              , BertiniRandomUniv0001ITEP
              , BertiniRandomUniv0002HARP
              , BertiniRandomUniv0002ITEP
              , BertiniRandomUniv0003HARP
              , BertiniRandomUniv0003ITEP
              , BertiniRandomUniv0004HARP
              , BertiniRandomUniv0004ITEP
              , BertiniRandomUniv0005HARP
              , BertiniRandomUniv0005ITEP
              , BertiniRandomUniv0006HARP
              , BertiniRandomUniv0006ITEP
              , BertiniRandomUniv0007HARP
              , BertiniRandomUniv0007ITEP
              , BertiniRandomUniv0008HARP
              , BertiniRandomUniv0008ITEP
              , BertiniRandomUniv0009HARP
              , BertiniRandomUniv0009ITEP
              ] // end-of path2

   stream1:       [ outroot ]
   trigger_paths: [ path1 ]
   end_paths:     [ path2, stream1 ]

} # end of physics:

These individual fcl files are generated by the job itself on the worker node, based on the passed parameters and the $PROCESS number within the "cluster".
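
A rough sketch of how such a file might be stamped out on the worker node (the real logic lives in genana_g4vmp_proclevel_condor.sh; the shell variable names and the truncation below are illustrative only):

UTAG=$(printf "U%04d_%04d" ${UFIRST} ${ULAST})
FCLNAME=${PROBENAME}_on_${TARGET}_at_${PROBEPNODOT}GeV_${UTAG}.fcl
cat > ${FCLNAME} <<EOF
#include "${MULTIVERSE}.fcl"
#include "HARP_dossier.fcl"
#include "ITEP_dossier.fcl"

process_name: genanaX${PROBENAME}${TARGET}${PROBEPNODOT}GeV${UTAG//_/}
# ... source, services, outputs and physics blocks filled in from the
#     passed PROBEPDG/PROBEP/NEVENTS parameters and the universe range ...
EOF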

Submitting Jobs

As this work is being done for the experiments, and the Geant4 group has no significant grid allocation, I submitted the jobs under the dune group.

Keeping track of which jobs have been submitted and which have completed is difficult. The following sequence builds a script of jobs to submit (jobs_to_submit.sh). It also caches some of the information (e.g. multiverse170208_Bertini_complete_exptsetup.txt), so run it from the same place every time.

source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup

setup jobsub_client
export JOBSUB_GROUP=dune

### one-time creation of directories in $DESTDIRTOP ...

#  near the top of this file are important hardcoded (for now) values
${MRB_SOURCE}/runartg4tk/scripts/cleanup_hists.sh

#   near the top of this file are some important hardcoded (for now) values
${MRB_SOURCE}/runartg4tk/scripts/check_for_missing.sh

./jobs_to_submit.sh
# a record of submitted jobs is kept in jobs_to_submit.jobsub.log
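
The bookkeeping idea can be sketched roughly as follows; the expected-range list, the cache format, and the output-file layout are assumptions here, and the real logic lives in check_for_missing.sh and the generated jobs_to_submit.sh:

CACHE=multiverse170208_Bertini_complete_exptsetup.txt
OUT=jobs_to_submit.sh
: > ${OUT}
# hypothetical list of (exptsetup, first universe, last universe) that should exist
while read -r setup ufirst ulast; do
   grep -q "^${setup} ${ufirst} ${ulast}" ${CACHE} 2>/dev/null && continue
   utag=$(printf "U%04d_%04d" ${ufirst} ${ulast})
   # a job is considered done if its histogram file exists and is non-empty
   if [ ! -s ${DESTDIRTOP}/${setup}/${setup}_${utag}.hist.root ]; then
      echo "jobsub_submit ...   # resubmit ${setup} ${utag} (see the example below)" >> ${OUT}
   fi
done < expected_exptsetup_ranges.txt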

Individual submissions of clusters look similar to:

jobsub_submit -g -N 100 \
    file:///geant4/app/rhatcher/mrb_work_area-2018-03-05/srcs/runartg4tk/scripts/genana_g4vmp_proclevel_condor.sh \
    --probe=piplus --target=Be --pz=5 --universes multiverse170208_Bertini,0,10 \
    --tarball=localProducts_runartg4tk_v0_03_00_e15_prof_2018-03-21.tar.bz2

Output

Jobs are configured to put their histogram output (the important bits) in DESTDIRTOP=/pnfs/geant4/persistent/${USER}/genana_g4vmp/${MULTIVERSE} (where MULTIVERSE=multiverse170208_Bertini here).
The art root files are put in the same path with persistent replaced by scratch, so as not to fill the limited persistent space.
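
A minimal sketch of the path arithmetic, assuming plain shell parameter substitution is used for the persistent-to-scratch swap:

MULTIVERSE=multiverse170208_Bertini
DESTDIRTOP=/pnfs/geant4/persistent/${USER}/genana_g4vmp/${MULTIVERSE}   # histogram output
ARTDIRTOP=${DESTDIRTOP/persistent/scratch}                              # art root files
echo ${ARTDIRTOP}   # -> /pnfs/geant4/scratch/<user>/genana_g4vmp/multiverse170208_Bertini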

This expects that the appropriate "exptsetup" directories have already been created in $DESTDIRTOP ... how was that done again? (Extract this from the older script create_submit_all.sh and push it to git?)

Dealing with Failures

The multiverse we chose generates 1000 separate universes.

When running a cluster we generally found that groups of 10 universes were manageable in a standard job (i.e. a job cluster was 100 jobs of 10 universes).

But for some cases (particular "exptsetup" values and particular sets of universes) this would exceed the allocated grid run time. Our strategy then was to run clusters with fewer universes per job; make the adjustment in check_for_missing.sh by modifying USTRIDE (normally 10).
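
The trade-off is simple arithmetic: with 1000 universes, halving the stride doubles the number of jobs in the cluster but roughly halves each job's run time. A sketch, assuming the cluster size is derived directly from USTRIDE:

NUNIVERSES=1000
USTRIDE=5                                          # reduced from the usual 10
NJOBS=$(( (NUNIVERSES + USTRIDE - 1) / USTRIDE ))  # 200 jobs of 5 universes instead of 100 of 10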

There were also instances where the histogram "file" existed in PNFS but had no contents. These can usually be identified by comparing the file size to others in the directory; they were weeded out by hand: the cached completion entry was removed from the multiverse170208_Bertini_complete_exptsetup.txt file and check_for_missing.sh was re-run.
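
A sketch of one way to hunt for such empty outputs; the size threshold, the example file name, and the cache format below are all hypothetical, and the actual weeding was done by hand:

cd ${DESTDIRTOP}/piminus_on_C_at_5GeV              # one "exptsetup" directory
find . -name '*.hist.root' -size -100k -print      # threshold is a guess; compare against neighbours
# for each bad file: remove it, drop the matching cached entry, then re-run check_for_missing.sh
# rm piminus_on_C_at_5GeV_U0120_0129.hist.root
# grep -v 'piminus_on_C_at_5GeV.*U0120_0129' multiverse170208_Bertini_complete_exptsetup.txt > tmp.txt
# mv tmp.txt multiverse170208_Bertini_complete_exptsetup.txt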