Keepup Jobs

Keepup jobs are currently run with the following configurations.

------------------------------------------------------------------------------------------------------------------------------------
Keepup Config                       Raw2Root Rel                        Reco Rel                                 Novapro    Running?  
------------------------------------------------------------------------------------------------------------------------------------
Broken_Config_Reco_ND_ligo.cfg       S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97              
Config_OffsiteProbe.cfg                                                                                                               
Config_Raw2Root_FD_cosmics.cfg       S20-10-30                                                                              X         
Config_Raw2Root_FD_numi.cfg          S20-10-30                                                                              X         
Config_Raw2Root_FD_others.cfg        S20-10-30                                                                              X         
Config_Raw2Root_ND_cosmics.cfg       S20-10-30                                                                              X         
Config_Raw2Root_ND_numi.cfg          S20-10-30                                                                              X         
Config_Raw2Root_ND_others.cfg        S20-10-30                                                                              X         
Config_Raw2Root_ND_test.cfg          S20-10-30                                                                                        
Config_Raw2Root_TB_activity.cfg      R19-09-24-testbeam-production.c                                                        X         
Config_Raw2Root_TB_beamline.cfg      R19-09-24-testbeam-production.c                                                        X         
Config_Raw2Root_TBBL_beamline.cfg    R19-09-24-testbeam-production.c                                                        X         
Config_Reco_FD_DDHmu.cfg             S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_Reco_FD_DDMichel.cfg          S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_Reco_FD_DDMoon.cfg            S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_Reco_FD_ddsnews.cfg           S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_Reco_FD_ligo.cfg              S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_Reco_ND_ddsnews.cfg           S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_Reco_ND_ligo.cfg              S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97    X         
Config_RecoProd5_FD_numi.cfg         S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v05.02    X         
Config_RecoProd5_ND_bnb.cfg          S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.97              
Config_RecoProd5_ND_numi.cfg         S20-10-30                           R19-11-18-prod5reco.muremove-hotfix.b    v04.98    X         

This summary can be generated with the list_versions.sh script in novaproduction/keepup/ConfigFile.

Setting Up for Submission of Keepups

For simplicity, keepups are run by one of the production conveners.

To submit the keepup jobs with the proper KCA proxy and keepup submission script, follow these instructions, which are reproduced from the FIFE wiki.

NOTE: This example uses the interactive node novagpvm02.fnal.gov.
      There is nothing special about this node, but you should use the same node consistently.

Initial setup to be done only once
Log in as yourself to novagpvm02.fnal.gov.
Type the following command and follow the instructions it prints:

kcroninit
# follow the instructions printed on the terminal

This command creates a cron principal of the form:
<username>/cron/<hostname>@FNAL.GOV

Then you can run the command:

/usr/bin/kcron kx509 -o /tmp/${USER}.cron.cert

to create a certificate in /tmp/${USER}.cron.cert, from which you can read your "Subject" value (the distinguished name, DN).
To display it, run the command:

voms-proxy-info -all -file /tmp/${USER}.cron.cert

The "subject" line gives the DN, which looks like:

/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=Robots/CN=<hostname>/CN=cron/CN=<Full Name>/CN=UID:<username>
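
To avoid copying the DN by hand, the subject line can be pulled out of the command output. The following is a minimal sketch, not part of the keepup scripts: the function name extract_subject_dn is hypothetical, and it assumes the output contains a line of the form "subject   : /DC=...".

```shell
# Hypothetical helper: print the DN from the first "subject" line read on stdin.
# Assumes the line looks like "subject   : /DC=org/...".
extract_subject_dn() {
    sed -n 's/^subject[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Typical use on the node (needs the certificate created above):
# voms-proxy-info -all -file "/tmp/${USER}.cron.cert" | extract_subject_dn
```

The printed value is what goes into the registration form.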

You need to register this DN with your account at FNAL, by using this ServiceNow form:
https://fermi.servicenowservices.com/wp?id=evg_sc_cat_item&sys_id=11cd3721db8adf4096f5ff621f9619fc&spa=1

NOTE: You should be able to file a single ServiceNow (SNOW) ticket listing every user and subject. Make sure that the information is correct for each user.

Setting Up The Crontab

Log in as yourself on novagpvm02 and run the following script:

/nova/app/users/novapro/production_keepups/novaproduction/keepup/update_NOvA_keepup_crontab.sh

If there are no errors, you should see the following output:

 *** crontab update succeeded ***

This script will update your crontab with the entries needed to:
  1. keep your crontab configuration updated
  2. renew your Production proxy
  3. submit KeepUp jobs

To view your crontab, type crontab -l. It should eventually look like the following:


### START Keep-up crontab config ###
### Don't modify this section

# test command
55 21 * * * echo "-> BEGIN: Running Keep-up Crontab" 

# this is needed for SL7, for SL7, user home directories are mounted NFS4 which are mounted using kerberos
# so the cron daemon can't access
HOME=/nova/app/users/novapro/production_keepups/novaproduction/keepup

# update the crontab configuration from the template
01 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/update_NOvA_keepup_crontab.sh > /dev/null

# For SL7 cron: generate proxy
05 00 * * * /usr/bin/kcron /nova/app/users/novapro/production_keepups/novaproduction/keepup/GenerateProxy.sh 

# Raw2Root keepup
30 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_TBBL_beamline.cfg
35 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_TB_beamline.cfg > /dev/null
40 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_TB_activity.cfg  > /dev/null

45 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_numi.cfg > /dev/null
50 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_cosmics.cfg > /dev/null
55 00 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_others.cfg > /dev/null

00 01 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_ND_numi.cfg > /dev/null
05 01 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_ND_cosmics.cfg > /dev/null
10 01 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_ND_others.cfg > /dev/null

# Reco keepup
30 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_DDHmu.cfg > /dev/null
35 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_DDMichel.cfg > /dev/null
40 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_DDMoon.cfg > /dev/null
45 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_ligo.cfg > /dev/null
50 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_ND_ligo.cfg > /dev/null
55 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_ddsnews.cfg > /dev/null
00 03 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_ND_ddsnews.cfg > /dev/null
05 03 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_RecoProd5_FD_numi.cfg > /dev/null
10 03 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_RecoProd5_ND_numi.cfg > /dev/null

# generate report on exit codes at the start and end of the day
00 09 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/GenerateReport.sh --keepup --email
00 17 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/GenerateReport.sh --keepup --email

# not currently used
# 15 01 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_ND_bnb.cfg > /dev/null
# 00 02 * * * /nova/app/users/novapro/production_keepups/novaproduction/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_RecoProd5_ND_bnb.cfg > /dev/null

### END Keep-up crontab config ###

Each entry is a command that cron runs at the scheduled time. The five leading fields are minute, hour, day of month, month, and day of week, so 30 00 * * * means "run every day at 00:30", for instance.
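
As a quick check when reading the schedule, the simple daily entries used here can be translated mechanically. This is only an illustrative sketch: describe_daily_cron is a hypothetical name, and it handles only entries whose last three fields are all "*".

```shell
# Hypothetical helper: describe a "MIN HOUR * * *" schedule as "daily at HH:MM".
# Prints nothing and returns 1 for anything that is not a simple daily entry.
describe_daily_cron() {
    printf '%s\n' "$1" | awk '
        NF == 5 && $3 == "*" && $4 == "*" && $5 == "*" &&
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            printf "daily at %02d:%02d\n", $2, $1
            ok = 1
        }
        END { exit ok ? 0 : 1 }'
}

describe_daily_cron '30 00 * * *'
```

Quote the schedule when calling the function so that the shell does not expand the "*" fields.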

Monitoring the KeepUp jobs

First check that the submission was successful, then monitor the submitted jobs.

Submission Monitoring

The keepup submission log is emailed to the #keepup Slack channel, with one email per configuration.

The first part of the submission log summarizes the submission at a glance: which detector, stream, and release the configuration uses, how many jobs were submitted, and whether the submission succeeded. Keepups are submitted for raw data from the last 30 days with a stride of 2 days. In the snippet below, the first column of the "N files", "N jobs", and "Exit code" lists shows that 24 files from today were submitted in 2 jobs and that the submission succeeded (exit code 0). The second column shows that there are zero files from 3 days ago, and the following columns show zero files from 5, 7, 9, 11, ... days ago; exit code 28 in those columns means that the selected dataset contained no files.

N files list: 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N jobs list: 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Exit code list: 0 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28
------------------------------------------
== Submission summary:
- Detector: neardet
- Stream: numi
- Raw2Root Release: S20-10-30
- EXTRADIMENSION: Online.RunNumber > 12086
================================
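
When several of these emails arrive at once, the three lists can be condensed into one line. The sketch below is an illustration rather than an existing tool: summarize_submission is a hypothetical name, and treating exit code 28 as harmless is an assumption based on its description in the exit-code list.

```shell
# Hypothetical helper: read a submission log on stdin and print the total files,
# total jobs, and the number of columns whose exit code is neither 0 nor 28.
# Assumption: 28 ("No files in the selected dataset") is not a failure.
summarize_submission() {
    awk -F': *' '
        /^N files list:/   { n = split($2, f, " "); for (i = 1; i <= n; i++) files += f[i] }
        /^N jobs list:/    { n = split($2, j, " "); for (i = 1; i <= n; i++) jobs += j[i] }
        /^Exit code list:/ { n = split($2, e, " "); for (i = 1; i <= n; i++) if (e[i] != 0 && e[i] != 28) bad++ }
        END { printf "files=%d jobs=%d unexpected_exit_codes=%d\n", files, jobs, bad }'
}

# Example: summarize_submission < submission_log.txt   (file name is a placeholder)
```

For the log shown above, this reports 24 files, 2 jobs, and no unexpected exit codes.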

Submission exit codes are as follows:

#-------------------------------------------------------------------------------------
# Exit codes:
# 0: Status OK
#10: Nothing wrong, show list of arguments
#18: Script disabled
#21: General error in the arguments
#22: Invalid detector
#23: Invalid stream
#24: Wrong combination of streams
#25: Invalid test argument
#26: Invalid release
#27: Some options are missing, --det, --r2r-release and --stream or --Sstream are mandatory
#28: No files in the selected dataset, the script continues to look for the next dataset
#29: Something was wrong with samweb, the script continues to look for the next dataset
#30: Nothing wrong, this is a test
#31: Choose one and only one option among --raw2root, --reconstruction, --calibration
#32: Something was wrong with job submission, the script continues to look for the next dataset
#33: Choose one and only one option among --keepup, --manual, --OffsiteProbe
#34: --runrange and --timerange options need 2 comma-separated arguments
#35: With --OffsiteProbe you also need to use: --offsite_only, --det DET_NAME and --definition DEFINITION
#36: The --recommended_sites option is not compatible with the --site option
#37: The --site option requires the --offsite_only or --offsite option
#38: The --offsite_only and --offsite options require the --os option
#39: The --definition, --fulldimension and --extradimension options are incompatible
# and do not allow the use of the --runrange, --timerange, --stream, --Sstream options
# For Reconstruction and Calibration the --definition option is incompatible with the --r2r-release option
#40: The configuration file is not accessible
#41: You don't have a valid proxy/ticket
#42: You don't have right access to this file/script
#45: Wrong number of fields to store in the DB, the script continues
#100: This feature is not implemented yet
#110: The selected production type is unknown, you can choose among --raw2root, --reconstruction, --calibration
#150: There are extra arguments not associated to any option
#152: An option has wrong argument
#200: Invalid Shifter user
#201: This script should not be sourced
#202: The script is attempting to run in a crontab not owned by the chosen Shifter user
#203: The Shifter file is not accessible
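
For triage, the codes above fall into a few rough groups. The grouping below is one possible reading of the list, not something defined by the submission script, and the function name and labels are hypothetical.

```shell
# Hypothetical helper: map a submission exit code to a rough triage label.
# The grouping is an assumption drawn from the descriptions in the list above.
classify_exit_code() {
    case "$1" in
        0)        echo "ok" ;;
        10|30)    echo "informational" ;;          # described as "Nothing wrong"
        28)       echo "no files" ;;               # empty dataset, script moves on
        29|32|45) echo "continued after error" ;;  # script reports a problem and continues
        *)        echo "error" ;;
    esac
}

classify_exit_code 28
```

Codes labelled "error" or "continued after error" are the ones worth following up.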

If the submission succeeded, the corresponding SAMweb projects should have been created and the jobs should have been submitted. If the jobs have not yet completed, they should still be visible in the job queue in a "valid" state (queued or running).

Job Monitoring

Twice daily (at 09:00 and 17:00, per the GenerateReport.sh crontab entries), a summary of the submitted keepup jobs is emailed to the #keepup channel and to the email address.

For each configuration, each set of jobs that was run is listed along with some useful information. An example for a specific config is given below.

Summary for Config_RecoProd5_ND_numi.cfg

-- jobid 40740199.0@jobsub02.fnal.gov -- r: 0 i: 0 h: 0 -- 23/24 jobs failed | ec: 250 n: 23 | ec: 0 n: 1
-- jobid 40740200.0@jobsub02.fnal.gov -- r: 0 i: 0 h: 0 -- 2/23 jobs failed | ec: 252 n: 2 | ec: 0 n: 21
-- jobid 40740201.0@jobsub02.fnal.gov -- r: 0 i: 0 h: 0 -- 1/23 jobs failed | ec: 252 n: 1 | ec: 0 n: 22
-- jobid 40740202.0@jobsub02.fnal.gov -- r: 0 i: 0 h: 0 -- 0/10 jobs failed | ec: 0 n: 10
-- jobid 40740203.0@jobsub02.fnal.gov -- r: 0 i: 0 h: 0 -- 0/2 jobs failed | ec: 0 n: 2
-- jobid 40740204.0@jobsub02.fnal.gov -- r: 0 i: 0 h: 0 -- 0/6 jobs failed | ec: 0 n: 6

This tells you that there are 0 jobs running (r), 0 idle (i) and 0 held (h). For each set of jobs it also tells you how many jobs failed and lists the exit codes (ec) together with the number of jobs (n) that returned each one. For instance, of the 23 jobs with jobid 40740200.0@jobsub02.fnal.gov, 2 failed, both with exit code 252, and 21 succeeded with exit code 0.
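
To pick out only the job sets that need attention, the summary lines can be filtered. This is a sketch under the assumption that every line keeps the "-- jobid ID -- ... -- F/T jobs failed" layout shown above; list_failed_jobsets is a hypothetical name.

```shell
# Hypothetical helper: read summary lines on stdin and print "JOBID FAILED/TOTAL"
# for every job set with at least one failed job.
list_failed_jobsets() {
    awk '$2 == "jobid" {
        for (i = 1; i < NF; i++) {
            # The field just before "jobs failed" is the FAILED/TOTAL fraction.
            if ($(i + 1) == "jobs" && $(i + 2) == "failed") {
                split($i, frac, "/")
                if (frac[1] > 0) print $3, frac[1] "/" frac[2]
            }
        }
    }'
}

# Example: list_failed_jobsets < report_email.txt   (file name is a placeholder)
```

Job sets with zero failures are omitted, so empty output means every listed job succeeded.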

Modifying the keep-up configuration (expert)

If you are the shifter, you generally do not need to read this documentation: Modifying keep-up configuration