How to submit and monitor production jobs with POMS
- First, and most important: make sure that your kerberos ID is associated with the icaruspro account. Log in with your kerberos ID at https://pomsgpvm01.fnal.gov/poms/ and check that you can open the page and edit the job types and campaigns. If you do not have permission, request access by submitting a Service Desk ticket asking for permission to run production under icaruspro.
- Create a configuration file. Log in to email@example.com, set up the environment needed to run as icaruspro, and go to the POMS configuration directory as follows:
$ ssh firstname.lastname@example.org
Last login: Thu Apr 11 12:11:33 2019 from 220.127.116.11
[01:01:26 ~]$ setup_icaruspro v08_13_02 e17
Setting up LArSoft from "CVMFS":
 - executing '/cvmfs/larsoft.opensciencegrid.org/products/setup'
 - appending '/cvmfs/fermilab.opensciencegrid.org/products/common/db'
Setting up artdaq from "CVMFS":
 - appending '/cvmfs/fermilab.opensciencegrid.org/products/artdaq'
Setting up ICARUS from "CVMFS":
 - prepending '/cvmfs/icarus.opensciencegrid.org/products/icarus'
[01:01:47 ~]$ cd /icarus/app/poms_test/cfg/
All of the configuration files needed to run the SBN workshop production currently live in this directory:
[01:02:31 /icarus/app/poms_test/cfg]$ ls -1 *workshop*
icarus_workshop_cosmicmuon_launch.cfg
icarus_workshop_cosmicmuon_launch_injectrun.cfg
icarus_workshop_cosmiconly.cfg
icarus_workshop_cosmiconly_injectrun.cfg
icarus_workshop_fastoptical_injectrun.cfg
icarus_workshop_intrinsic_nue_injectrun.cfg
icarus_workshop_nominal_bnb_instrinsic_nue.cfg
icarus_workshop_nominal_bnb_instrinsic_nue_injectrun.cfg
icarus_workshop_nominal_bnb_neutrino.cfg
icarus_workshop_nominal_bnb_neutrino_injectrun.cfg
icarus_workshop_nominal_bnb_oscillated_nue.cfg
icarus_workshop_nominal_bnb_oscillated_nue_injectrun.cfg
icarus_workshop_osc_nue_injectrun.cfg
icarus_workshop_single_electron.cfg
icarus_workshop_single_electron_injectrun.cfg
icarus_workshop_single_electronpiplus_injectrun.cfg
icarus_workshop_single_muon_bnb_injectrun.cfg
icarus_workshop_single_muon_parallel_injectrun.cfg
icarus_workshop_single_muons.cfg
icarus_workshop_single_muons_injectrun.cfg
icarus_workshop_single_pi0_injectrun.cfg
icarus_workshop_standard_singles_neutrino.cfg
Most of the neutrino and single-particle samples have separate configuration files for the gen stage, but because both sample types share a similar production workflow and similar memory requirements, there is a skeleton configuration file, icarus_workshop_standard_singles_neutrino.cfg, that handles the production workflow from the g4 stage through reco. When using this configuration file, each sample is differentiated by a global parameter, global.sample, which tags the directory the output files are written into with the name of the produced sample. To re-run a production sample under a new icaruscode release, you do not need to create a new configuration file; simply change the software version inside the configuration file to the current one. For example, setting
version = v08_13_02
will set icaruscode to v08_13_02, which is the version used to run the sample production. IMPORTANT: always change the software version inside the configuration file; changing it through the POMS editor will not set icaruscode to the correct version.
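For orientation, the relevant part of the configuration file might look like the fragment below. This is a hedged sketch in the usual fife_launch INI style; the section and key names (and the qualifier values) are illustrative and should be checked against the actual workshop .cfg files.

```
; Illustrative fife_launch-style fragment -- verify against the real .cfg.
[global]
experiment = icarus
version    = v08_13_02   ; icaruscode release used for this production
quals      = e17:prof    ; illustrative build qualifiers
sample     = numu_bnb    ; default sample tag, overridden per sample
```

Only the `version` line needs to change when moving to a new icaruscode release.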
- Create a campaign workflow. If you are re-running a previously requested SBN sample, the campaign already exists.
If you want to run the same campaign with a different tag, you can do so with the clone function (click the clone icon of the respective campaign, the blue highlighted box in the picture below) and then rename the campaign. In the example below, I copied the full campaign name and appended "with_CRTgeomfix" to the end:
- Override the global.sample parameter for each stage of the production workflow with the name of the new tag for the new sample (e.g. "cosmics_muon_3ms_fixedCRTgeom"). If you forget to set this parameter, the files will be written to a "default" directory. Currently, this default directory is:
/pnfs/icarus/scratch/users/icaruspro/dropbox/mc1/poms_production/MCC1_poms_icarus_prod_numu_bnb_v08_13_02, because the default sample parameter in the configuration file is "numu_bnb". Please remember to use exactly the same sample name for every stage of a given sample, regardless of the stage; this keeps all of the files for the different stages under the same directory.
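When launching through POMS, such per-stage overrides are typically passed to fife_launch using its -O override syntax; a hedged example (the tag value is illustrative):

```
; fife_launch override passed by the POMS stage, e.g. as a launch parameter:
-Oglobal.sample=cosmics_muon_3ms_fixedCRTgeom
```

The same override must be set for every stage of the sample so all stages write under the same directory.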
- Specify the memory needed for each stage. For the gen and g4 stages (single-particle/neutrino samples), 1000-2500 MB is usually sufficient to run a job. Cosmic samples usually need much more memory and wall time. You can also see the memory profiling for the different samples on these pages:
- cosmic 3ms gen stage: https://fifemon.fnal.gov/monitor/d/otZRzhImk/poms-campaign?from=now-30d&to=now&var-Campaign=3005
- cosmic 3ms g4 stage: https://fifemon.fnal.gov/monitor/d/otZRzhImk/poms-campaign?from=now-30d&to=now&var-Campaign=3004
- cosmic 3ms detsim stage: https://fifemon.fnal.gov/monitor/d/otZRzhImk/poms-campaign?from=now-30d&to=now&var-Campaign=3006
- cosmic 3ms reco stage: https://fifemon.fnal.gov/monitor/d/otZRzhImk/poms-campaign?from=now-30d&to=now&var-Campaign=3007
These give you an idea of the memory and disk you should request when running this sample. A good rule of thumb is to run a test sample of ~10 jobs with the new software version through the whole production flow and collect the maximum memory and wall time used; this gives you a baseline estimate of the wall time and memory to request for the production jobs at each stage.
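Once you have those baseline numbers, the resource requests typically go in the submit section of the configuration file. A hedged sketch follows; the key names mirror the usual jobsub options, and the values are placeholders, not recommendations:

```
; Illustrative [submit] fragment -- verify keys against the actual .cfg.
[submit]
memory            = 2500MB   ; ~max observed in the ~10-job test, plus headroom
disk              = 10GB
expected-lifetime = 8h       ; wall-time request
```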
- (Not needed, but it will make your production life less complicated): use the POMS recovery options for jobs that are held due to memory. POMS runs the number of jobs we specify at the gen stage. For each stage downstream of gen, I have added the following line to the configuration file:
n_files_per_job = 1
This ensures that once the files from the previous stage are complete, POMS only runs jobs for the files it actually located, and that when the recovery option is run it re-submits only the missing files, not the full number of jobs submitted at the gen stage.
- Now that you have everything in place, you can start the campaign production by clicking "Launch" or the rocket symbol in POMS.
But what if I have to run a fhicl file that is not in the repository? If possible, always make sure that the fhicl file you need is in the repository; if it is not, don't worry, you can work around the problem. If the fhicl file you want to run is not available in the repository for any stage other than gen, you can simply copy the fhicl file to the worker node on the grid.
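A minimal sketch of that copy step, assuming the fhicl file has been staged somewhere the job can read (e.g. your /pnfs scratch area) and that ifdh is set up in the job environment; the helper falls back to a plain cp for interactive testing, and all paths and file names are hypothetical:

```shell
#!/bin/bash
# Sketch: fetch a non-repository fhicl file onto the grid worker node.
# Assumptions: the file was staged to a location the job can read, and
# ifdh is available in the job environment. Paths are illustrative.

fetch_fcl() {
  local src=$1 dst=$2
  if command -v ifdh >/dev/null 2>&1; then
    ifdh cp "$src" "$dst"        # grid-aware copy on the worker node
  else
    cp "$src" "$dst"             # plain copy for interactive testing
  fi
}

# Example usage inside the job script (hypothetical fhicl path):
# fetch_fcl /pnfs/icarus/scratch/users/icaruspro/fcl/my_custom_g4.fcl ./my_custom_g4.fcl
# lar -c ./my_custom_g4.fcl -s input.root -o output.root
```

The job then points lar at the local copy instead of a fhicl name resolved from the release.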