Running on the Grid¶
Job Logistics¶
Running on the grid (either FermiGrid or the OSG) means one must tune the unit of work to match semi-arbitrary constraints. Jobs must run long enough that the start-up costs (fetching input files, setting up the environment) are swamped by the real work done. File handling at the end (returning results) is also a concern. But one is penalized heavily for jobs that run a long time: jobs must be submitted with an estimated run time. Long estimates hurt one's ability to get slots, but exceeding an estimate results in the job being killed mid-calculation, invalidating all the work done by that job.
What appear to be natural ways to segment the work don't easily map to this model. Any given run of art can only run a single "exptsetup" (beam particle, energy, and target). But running a reasonable number of beam particles for one "exptsetup" for one universe is too small a unit of work.
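As a rough sizing exercise, the universes-per-job count can be derived from a measured per-universe time and the requested job lifetime. The sketch below is purely illustrative; none of the numbers are measurements from these jobs:
# rough job-sizing sketch; all numbers are illustrative placeholders
SEC_PER_UNIVERSE=2500              # hypothetical wall-clock cost of one universe
JOB_LIFETIME_SEC=$(( 8 * 3600 ))   # requested run-time estimate for the job
MARGIN_SEC=1800                    # allow for setup and copy-back of results
UPERJOB=$(( ( JOB_LIFETIME_SEC - MARGIN_SEC ) / SEC_PER_UNIVERSE ))
echo "pack at most ${UPERJOB} universes into each job process"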
Individual Job Scripts¶
We wrote a script, genana_g4vmp_proclevel_condor.sh, as part of runartg4tk that allows one to submit jobsub (i.e. condor) "clusters" of the same "exptsetup", where each job in the cluster handles a contiguous range of universes. By choosing the number of universes run by each job process one can tune the run time.
One example job fcl for piminus_on_C_at_5GeV (which has experimental data for both HARP and ITEP-771), covering universes 0 (default physics) to 9, looks like:
# this is piminus_on_C_at_5GeV_U0000_0009.fcl
#
# MULTIVERSE e.g. multiverse170208_Bertini (fcl base) [multiverse170208_Bertini]
# G4HADRONICMODEL e.g. Bertini [Bertini]
# PROBENAME e.g. piplus, piminus, proton [piminus]
# PROBEPDG e.g. 211, -211, 2212 [-211]
# PROBEP e.g. 5.0 6.5 # (GeV) [5]
# PROBEPNODOT e.g. 5 6p5 # used in dossier PROLOGs,
# # but, no trailing 'p0's [5]
# TARGET e.g. Cu [C]
# NEVENTS e.g. 5000 [500000]
#include "multiverse170208_Bertini.fcl"
#include "HARP_dossier.fcl"
#include "ITEP_dossier.fcl"
process_name: genanaXpiminusC5GeVU00000009
source: {
module_type: EmptyEvent
maxEvents: 500000
} # end of source:
services: {
message: {
debugModules : ["*"]
suppressInfo : []
destinations : {
LogToConsole : {
type : "cout"
threshold : "DEBUG"
categories : { default : { limit : 50 } }
} # end of LogToConsole
} # end of destinations:
} # end of message:
RandomNumberGenerator: {}
TFileService: {
fileName: "piminus_on_C_at_5GeV_U0000_0009.hist.root"
}
ProcLevelSimSetup: {
HadronicModelName: "Bertini"
TargetNucleus: "C"
RNDMSeed: 1
}
# leave this on ... documentation of what was set
PhysModelConfig: { Verbosity: true }
} # end of services:
outputs: {
outroot: {
module_type: RootOutput
fileName: "piminus_on_C_at_5GeV_U0000_0009.artg4tk.root"
}
} # end of outputs:
physics: {
producers: {
PrimaryGenerator: {
module_type: EventGenerator
nparticles : 1
pdgcode: -211
momentum: [ 0.0, 0.0, 5 ] // in GeV
}
BertiniDefault : @local::BertiniDefault
BertiniRandomUniv0001 : @local::BertiniRandomUniv0001
BertiniRandomUniv0002 : @local::BertiniRandomUniv0002
BertiniRandomUniv0003 : @local::BertiniRandomUniv0003
BertiniRandomUniv0004 : @local::BertiniRandomUniv0004
BertiniRandomUniv0005 : @local::BertiniRandomUniv0005
BertiniRandomUniv0006 : @local::BertiniRandomUniv0006
BertiniRandomUniv0007 : @local::BertiniRandomUniv0007
BertiniRandomUniv0008 : @local::BertiniRandomUniv0008
BertiniRandomUniv0009 : @local::BertiniRandomUniv0009
} # end of producers:
analyzers: {
BertiniDefaultHARP:
{
module_type: AnalyzerHARP
ProductLabel: "BertiniDefault"
IncludeExpData:
{
DBRecords: @local::HARP_piminus_on_C_at_5GeV
}
}
BertiniDefaultITEP:
{
module_type: AnalyzerITEP
ProductLabel: "BertiniDefault"
IncludeExpData:
{
DBRecords: @local::ITEP_piminus_on_C_at_5GeV
}
}
BertiniRandomUniv0001HARP:
{
module_type: AnalyzerHARP
ProductLabel: "BertiniRandomUniv0001"
IncludeExpData:
{
DBRecords: @local::HARP_piminus_on_C_at_5GeV
}
}
BertiniRandomUniv0001ITEP:
{
module_type: AnalyzerITEP
ProductLabel: "BertiniRandomUniv0001"
IncludeExpData:
{
DBRecords: @local::ITEP_piminus_on_C_at_5GeV
}
}
...
BertiniRandomUniv0009HARP:
{
module_type: AnalyzerHARP
ProductLabel: "BertiniRandomUniv0009"
IncludeExpData:
{
DBRecords: @local::HARP_piminus_on_C_at_5GeV
}
}
BertiniRandomUniv0009ITEP:
{
module_type: AnalyzerITEP
ProductLabel: "BertiniRandomUniv0009"
IncludeExpData:
{
DBRecords: @local::ITEP_piminus_on_C_at_5GeV
}
}
} # end of analyzers:
path1: [ PrimaryGenerator
, BertiniDefault
, BertiniRandomUniv0001
, BertiniRandomUniv0002
, BertiniRandomUniv0003
, BertiniRandomUniv0004
, BertiniRandomUniv0005
, BertiniRandomUniv0006
, BertiniRandomUniv0007
, BertiniRandomUniv0008
, BertiniRandomUniv0009
] // end-of path1
path2: [
BertiniDefaultHARP
, BertiniDefaultITEP
, BertiniRandomUniv0001HARP
, BertiniRandomUniv0001ITEP
, BertiniRandomUniv0002HARP
, BertiniRandomUniv0002ITEP
, BertiniRandomUniv0003HARP
, BertiniRandomUniv0003ITEP
, BertiniRandomUniv0004HARP
, BertiniRandomUniv0004ITEP
, BertiniRandomUniv0005HARP
, BertiniRandomUniv0005ITEP
, BertiniRandomUniv0006HARP
, BertiniRandomUniv0006ITEP
, BertiniRandomUniv0007HARP
, BertiniRandomUniv0007ITEP
, BertiniRandomUniv0008HARP
, BertiniRandomUniv0008ITEP
, BertiniRandomUniv0009HARP
, BertiniRandomUniv0009ITEP
] // end-of path2
stream1: [ outroot ]
trigger_paths: [ path1 ]
end_paths: [ path2, stream1 ]
} # end of physics:
These individual FCL files are generated by the job itself on the worker node, based on the passed parameters and the $PROCESS number within the "cluster".
Submitting Jobs¶
As this work is being done for the experiments, and the Geant4 group has no significant grid allocation, I submitted the jobs as dune.
Keeping track of which jobs have been submitted and which have completed is difficult. The following sequence of commands builds a script of jobs to submit (jobs_to_submit.sh). It also caches some of the information (e.g. multiverse170208_Bertini_complete_exptsetup.txt), so run it from the same place every time.
source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup
setup jobsub_client
export JOBSUB_GROUP=dune
### one-time creation of directories in $DESTDIRTOP ...
# near the top of this file are important hardcoded (for now) values
${MRB_SOURCE}/runartg4tk/scripts/cleanup_hists.sh
# near the top of this file are some important hardcoded (for now) values
${MRB_SOURCE}/runartg4tk/scripts/check_for_missing.sh
./jobs_to_submit.sh
# a record of submitted jobs is kept in jobs_to_submit.jobsub.log
Individual submissions of clusters look similar to:
jobsub_submit -g -N 100 \
  file:///geant4/app/rhatcher/mrb_work_area-2018-03-05/srcs/runartg4tk/scripts/genana_g4vmp_proclevel_condor.sh \
  --probe=piplus --target=Be --pz=5 --universes multiverse170208_Bertini,0,10 \
  --tarball=localProducts_runartg4tk_v0_03_00_e15_prof_2018-03-21.tar.bz2
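Here -N 100 submits 100 job processes, and the --universes argument appears to carry the multiverse fcl base, the first universe, and the per-job universe count (0 and 10), so the cluster as a whole covers universes 0 through 999. Read that as an inference from the USTRIDE discussion below rather than as documented option syntax.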
Output¶
Jobs are configured to put their histogram output (the important bits) in DESTDIRTOP=/pnfs/geant4/persistent/${USER}/genana_g4vmp/${MULTIVERSE} (where MULTIVERSE=multiverse170208_Bertini here). The art root files are put in the same path with persistent replaced by scratch, so as not to fill the limited persistent space.
This expects that the appropriate "exptsetup" directories have already been created in $DESTDIRTOP ... how was that done again? (Extract this from the older script create_submit_all.sh and push it to git?)
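A minimal sketch of the two output areas, assuming the scratch location really is just the persistent path with one component swapped:
# sketch of the two output locations described above
MULTIVERSE=multiverse170208_Bertini
DESTDIRTOP=/pnfs/geant4/persistent/${USER}/genana_g4vmp/${MULTIVERSE}   # histogram output
ARTDIRTOP=${DESTDIRTOP/persistent/scratch}                              # art root files
echo "histogram output under ${DESTDIRTOP}"
echo "art root files under ${ARTDIRTOP}"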
Dealing with Failures¶
Our choice of the multiverse was to generate 1000 separate universes.
When running a cluster we generally found that groups of 10 universes were manageable in a standard job (i.e. a job cluster was 100 jobs of 10 universes).
But for some cases (some "exptsetup" and particular sets of universes) this would exceed the allocated grid run time. Our strategy then was to run clusters with fewer universes per job: make the adjustment in check_for_missing.sh by modifying USTRIDE (normally 10).
There were also instances where the histogram "file" existed in PNFS but didn't contain any contents. These can usually be identified by comparing the file size with the others in the directory, and they were weeded out by hand: the cached completion entry was removed from the multiverse170208_Bertini_complete_exptsetup.txt file and check_for_missing.sh was re-run.
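A hedged sketch of how such empty histogram files might be flagged by size instead of by eye; the 100k cutoff and the example "exptsetup" directory name are illustrative, not vetted values:
# list suspiciously small .hist.root files for one "exptsetup"
EXPTSETUPDIR=${DESTDIRTOP}/piminus_on_C_at_5GeV     # hypothetical example directory
find ${EXPTSETUPDIR} -name '*.hist.root' -size -100k -printf '%s\t%p\n' | sort -n
# any file flagged here gets removed, its line dropped from
# multiverse170208_Bertini_complete_exptsetup.txt, and check_for_missing.sh re-run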