Paola Buitrago, 07/21/2016 11:38 AM


Keepup Jobs

The OPOS group runs several keepup jobs for NOvA. These jobs are managed through service-desk requests. At present, the keepup jobs run by OPOS are:

  • Reconstruction keepup
    • FD keepup uses S15-12-07. Only NuMI trigger stream files are processed. Reconstruction with this version began with FD run 20923 (the start of epoch 3b).
    • ND keepup uses S15-12-07. Only the NuMI and BNB trigger stream files are processed. Reconstruction with this version began with ND run 11250 (the start of epoch 3b).
  • Raw2root keepup
    • FD keepup uses S15-03-11. The NuMI and Cosmics trigger streams are both processed. OPOS also processes all other trigger streams except 04 in a set of files referred to as "other." Raw2root with this version began with FD run 12942 (the start of period 1).
    • ND keepup uses S15-08-12. The NuMI, Cosmics and "Other" (as with FD) streams are all processed. Raw2root with this version began with ND run 10377 (the start of period 1).

Getting ready to submit keepup as yourself

To submit the keepup jobs as yourself, using the proper KCA proxy and the OPOS keepup submission script, follow these steps:

Prepare credentials

NOTE: This example uses the interactive node novagpvm02.fnal.gov (typically more responsive than the often-busy novagpvm01). There is nothing special about this node, but you should consistently use whichever node you choose, since the node name appears in the certificate subject obtained in step 7.

  1. Log in as yourself to novagpvm02.fnal.gov
  2. Type: kcroninit
    Press Return
    Press Return
    Enter your Kerberos principal
    Enter your Kerberos password
  3. Type kcron
  4. Type source /grid/fermiapp/products/common/etc/setups.sh
  5. Type setup kx509
  6. Type kx509
  7. Type voms-proxy-info -all
    Note the subject in the output; it should look like:
    subject= /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=novagpvm02.fnal.gov/CN=cron/CN=Paola Buitrago/CN=UID:pbuitrag
  8. Open a SNOW request for the NOvA experiment under the category ‘Batch Submission group’, asking for the subject obtained in the previous step to be added to your VOMS user credentials (example request: RITM0318692).

NOTE: A single SNOW ticket can list several user/subject pairs. Make sure that the information is correct for each user.
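When collecting subjects for several users, the subject line can be pulled out of saved voms-proxy-info output with a few lines of Python. This is only a minimal sketch: the function name is hypothetical, and it assumes the subject appears on a line starting with "subject" followed by "=" or ":" as in the example above.

```python
import re
from typing import Optional


def extract_subject(proxy_info_output: str) -> Optional[str]:
    """Return the DN from the first 'subject' line, or None if there is none."""
    for line in proxy_info_output.splitlines():
        # Accept 'subject= /DC=...' (as shown above) as well as 'subject : /DC=...'.
        match = re.match(r"\s*subject\s*[=:]\s*(/.+?)\s*$", line)
        if match:
            return match.group(1)
    return None


output = (
    "subject= /DC=gov/DC=fnal/O=Fermilab/OU=Robots/"
    "CN=novagpvm02.fnal.gov/CN=cron/CN=Paola Buitrago/CN=UID:pbuitrag"
)
print(extract_subject(output))
```

The returned string is what goes into the SNOW request for that user.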

Setting up your crontab

Log in as yourself to novagpvm02 and run the following script:


/nova/app/home/novapro/OPOS/keepup/update_NOvA_keepup_crontab.sh
This script will update your crontab with the entries needed to:
  1. keep your crontab configuration updated
  2. renew your Production proxy
  3. submit KeepUp jobs

Your crontab should eventually look like the following:


### START Keep-up crontab config ###
### Don't modify this section

#update the crontab configuration from the template
10 0 * * * /nova/app/home/novapro/OPOS/keepup/update_NOvA_keepup_crontab.sh

#renew the production proxy
15 0 * * * /nova/app/home/novapro/OPOS/keepup/renewShifterCredentials_CILogon.sh

#Raw2Root keepup
30 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_numi.cfg
35 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_cosmics.cfg
40 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_others.cfg
45 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_ND_numi.cfg
50 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_ND_cosmics.cfg
55 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_ND_others.cfg

#Reco keepup
00 02 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_numi.cfg
10 02 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_ND_numi.cfg
20 02 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_ND_bnb.cfg

### END Keep-up crontab config ###
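To double-check that the crontab contains the expected submissions, the output of crontab -l can be summarised as a list of (time, config file) pairs. The following is a minimal sketch, not part of the OPOS scripts; the function name is hypothetical and it assumes the entries have the daily "minute hour * * *" form shown above.

```python
import re

# Matches daily entries such as:
# 30 01 * * * /path/ProductionKeepUp_NOvA.sh --configfile ConfigFile/X.cfg
KEEPUP_LINE = re.compile(
    r"^(?P<minute>\d+)\s+(?P<hour>\d+)\s+\*\s+\*\s+\*\s+"
    r"\S*ProductionKeepUp_NOvA\.sh\s+--configfile\s+(?P<config>\S+)"
)


def keepup_schedule(crontab_text):
    """Return (HH:MM, config file) pairs for keepup submissions, sorted by time."""
    entries = []
    for line in crontab_text.splitlines():
        match = KEEPUP_LINE.match(line.strip())
        if match:  # comments and non-keepup entries are skipped
            hhmm = f"{int(match['hour']):02d}:{int(match['minute']):02d}"
            entries.append((hhmm, match["config"]))
    return sorted(entries)


crontab_text = """\
#renew the production proxy
15 0 * * * /nova/app/home/novapro/OPOS/keepup/renewShifterCredentials_CILogon.sh
30 01 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_numi.cfg
00 02 * * * /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Reco_FD_numi.cfg
"""
for when, config in keepup_schedule(crontab_text):
    print(when, config)
```

With the full crontab shown above, the list should contain nine entries: six for Raw2Root and three for Reco.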

Cleaning up the crontab

Whenever the shifter changes, the KeepUp section of the crontab should be updated accordingly. This can be done by running the following script:


/nova/app/home/novapro/OPOS/keepup/clean_NOvA_keepup_crontab.sh

Setting up the OPOS Keep-up scripts for a new shifter

While logged in as novapro, edit the following file and add the user name of the new shifter in the obvious spot:


/nova/app/home/novapro/OPOS/keepup/shifter.txt

Testing the Submission

  • Log in to novagpvm02.
  • Run the following script to make sure you have a valid Production proxy:
    
    /nova/app/home/novapro/OPOS/keepup/renewShifterCredentials_CILogon.sh
    
  • If you are the shifter, you can test the script with the command:


 /nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_numi.cfg --test full
 
  • If you are not the shifter, you can instead test the script with this command:


/nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_numi.cfg --shifter ${USER} --test full

Either command runs the workflow for Raw2Root processing of Far Detector, NuMI stream files without submitting any jobs.

This will only work once the SNOW ticket you submitted earlier has been fulfilled.

Monitoring the KeepUp jobs

First, check that the submission was successful, then perform global monitoring. Global monitoring tells whether the entire workflow (processing plus declaring and storing the output files) has succeeded. If it has not been completely successful, the next step is to look more closely at each of the two following parts of the workflow:
  1. Job execution.
  2. Transfer to permanent storage and declaration in the file catalog (SAM).

Submission Monitoring

To monitor the keepup submission, the shifter needs to check the keepup submission logs, which are available in the Slack channel #keepup. Please contact the NOvA production mailing list if you need help accessing it. The logs can also be checked via messages from the OPOS mailing list, as long as you are subscribed.

How to read the submission logs:

One submission log is produced for each active combination of production stage, detector, stream and software release. An example of a successful submission log for the combination Raw2Root, fardet, numi with release S15-03-11 follows:


== Submission summary:
- Detector: fardet
- Stream: numi
- Raw2Root Release: S15-03-11
================================
Exit code list: 0 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28
N files list: 538 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N jobs list: 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------

ShifterFile selected /nova/app/home/novapro/OPOS/keepup/shifter.txt
The Shifter is bzamoran
/nova/app/home/novapro/OPOS/keepup/ProductionKeepUp_NOvA.sh --configfile ConfigFile/Config_Raw2Root_FD_numi.cfg

ConfigFile selected /nova/app/home/novapro/OPOS/keepup/ConfigFile/Config_Raw2Root_FD_numi.cfg
ConfigFile content:
---------8<------------------------
--raw2root
--r2r-release S15-03-11
--keepup
--det fardet
--stream numi
--enablePOMS
--enableOPOSDB
#--OPOSuser pbuitrag
#--OPOSuser ahandres
#--proxy /var/tmp/novapro.Paola.temporal050416.Production.proxy
#start at 2 days ago, to wait to clear the backlog
#--extradimension "Online.SubRunEndTime > 1456812000" 
---------8<------------------------

---------8<------------------------
options:
--raw2root
--r2r-release S15-03-11
--keepup
--det fardet
--stream numi
--enablePOMS
--enableOPOSDB  
---------8<------------------------

***
This is a crontab job of user bzamoran 
***

Production stage: Raw2Root

== Submission summary:
- Detector: fardet
- Stream: numi
- Raw2Root Release: S15-03-11
================================

This is a *KeepUp* session
Setup environment for Production release S15-03-11
PWD: /nashome/b/bzamoran

Setup POMS for Raw2Root campaign
POMS_CAMPAIGN_ID = 15
POMS_TASK_DEFINITION_ID = 25

------------------------------------------

### Days ago: 1
Creating dataset bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-05 for fardet, with dimensions:

file_type       = 'importedDetector' and
data_tier       = 'raw'  and
Online.Detector = 'fardet' and
Online.SubRunEndTime >= '1467694800'  and
Online.SubRunEndTime <= '1467781200'  and
( Online.Stream = 0 ) and
Online.TotalEvents > '0' and
not isparentof: ( data_tier = 'artdaq' and
DAQ2RawDigit.base_release 'S15-03-11' ) and
file_size < '1288490189'  
minus NOVA.ProductionSkip true

Dataset definition 'bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-05' has been created with id 536211
Dataset bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-05 contain 538 files

/grid/fermiapp/products/nova/externals/NovaGridUtils/v01.93/NULL/bin/submit_nova_art.py
--memory 1900
--jobname bzamoran-raw2root-keepup-Fermigrid-S15-03-11-fardet-numi-1_days_ago
--defname bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-05
--njobs 29
--files_per_job 20
--print_jobsub
--config daq2rawdigitjob.fcl
--tag S15-03-11
--maxopt
--dest /pnfs/nova/scratch/fts/dropbox
--production
--copyOut
--outTier out1:artdaq

SAM_STATION DEFINED AS nova
http://samwebgpvm03.fnal.gov:8480/sam/nova/stations/nova/projects/name/bzamoran-bzamoran-raw2root-keepup-Fermigrid-S15-03-11-fardet-numi-1_days_ago-20160706_0130
Definition name: bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-05
start proj returned 0
Station monitor: http://samweb.fnal.gov:8480/station_monitor/nova/stations/nova/projects/bzamoran-bzamoran-raw2root-keepup-Fermigrid-S15-03-11-fardet-numi-1_days_ago-20160706_0130
jobsub_submit \
    -N 29 \
    --resource-provides=usage_model=DEDICATED \
    --disk=10000MB \
    --memory=1900MB \
    --expected-lifetime=10800s \
    -G nova \
    -e SAM_PROJECT_NAME    -e SAM_STATION    -e IFDH_BASE_URI    -e IFDH_DEBUG    -e EXPERIMENT    -e GRID_USER \
    --role=Production \
     file:///nova/app/condor-exec/bzamoran/bzamoran-bzamoran-raw2root-keepup-Fermigrid-S15-03-11-fardet-numi-1_days_ago-20160706_0130.sh \
       --limit 20 \
     --multifile \
      --export DEST=/pnfs/nova/scratch/fts/dropbox \
      --config daq2rawdigitjob.fcl \
      --source /grid/fermiapp/nova/novaart/novasvn/setup/setup_nova.sh:-r:S15-03-11:-b:maxopt \
      -X runNovaSAM.py \
        --hashDirs \
        --jsonMetadata \
        --copyOut \
        --logs \
        --zipLogs \
        --outTier out1:artdaq 
no valid Krb5 cache found
/fife/local/scratch/uploads/nova/novapro/2016-07-06_013022.153793_6390

/fife/local/scratch/uploads/nova/novapro/2016-07-06_013022.153793_6390/bzamoran-bzamoran-raw2root-keepup-Fermigrid-S15-03-11-fardet-numi-1_days_ago-20160706_0130.sh_20160706_013022_938466_0_1_.cmd

submitting....

Submitting job(s).............................

29 job(s) submitted to cluster 8734520.

JobsubJobId of first job: 8734520.0@fifebatch1.fnal.gov

Use job id 8734520.0@fifebatch1.fnal.gov to retrieve output

Information to store in the DB
nova Raw2Root fardet numi S15-03-11 null 2016-07-06_01:30:24_CDT bzamoran-bzamoran-raw2root-keepup-Fermigrid-S15-03-11-fardet-numi-1_days_ago-20160706_0130 bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-05 2016-07-05 538 8734520.0@fifebatch1.fnal.gov 29

3706

------------------------------------------

### Days ago: 3
Creating dataset bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-03 for fardet, with dimensions:

file_type       = 'importedDetector' and
data_tier       = 'raw'  and
Online.Detector = 'fardet' and
Online.SubRunEndTime >= '1467522000'  and
Online.SubRunEndTime <= '1467608400'  and
( Online.Stream = 0 ) and
Online.TotalEvents > '0' and
not isparentof: ( data_tier = 'artdaq' and
DAQ2RawDigit.base_release 'S15-03-11' ) and
file_size < '1288490189'  
minus NOVA.ProductionSkip true

Dataset definition 'bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-03' has been created with id 536231
WARNING: The dataset does not contain files.
No files to process. No jobs to submit.

------------------------------------------

### Days ago: 5
Creating dataset bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-01 for fardet, with dimensions:

file_type       = 'importedDetector' and
data_tier       = 'raw'  and
Online.Detector = 'fardet' and
Online.SubRunEndTime >= '1467349200'  and
Online.SubRunEndTime <= '1467435600'  and
( Online.Stream = 0 ) and
Online.TotalEvents > '0' and
not isparentof: ( data_tier = 'artdaq' and
DAQ2RawDigit.base_release 'S15-03-11' ) and
file_size < '1288490189'  
minus NOVA.ProductionSkip true

Dataset definition 'bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-07-01' has been created with id 536251
WARNING: The dataset does not contain files.
No files to process. No jobs to submit.

------------------------------------------

...

### Days ago: 31
Creating dataset bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-06-05 for fardet, with dimensions:

file_type       = 'importedDetector' and
data_tier       = 'raw'  and
Online.Detector = 'fardet' and
Online.SubRunEndTime >= '1465102800'  and
Online.SubRunEndTime <= '1465189200'  and
( Online.Stream = 0 ) and
Online.TotalEvents > '0' and
not isparentof: ( data_tier = 'artdaq' and
DAQ2RawDigit.base_release 'S15-03-11' ) and
file_size < '1288490189'  
minus NOVA.ProductionSkip true

Dataset definition 'bzamoran_opos_prod_Raw2Root_raw_S15-03-11_fd_numi_keepup_2016-06-05' has been created with id 536911
WARNING: The dataset does not contain files.
No files to process. No jobs to submit.

N files list: 538 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

N jobs list: 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Exit code list: 0 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28

Raw2Root
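The Online.SubRunEndTime bounds in the dataset dimensions of the log are Unix epoch seconds. In the example they correspond to one calendar day from midnight to midnight at UTC-5 (CDT). The sketch below reproduces those bounds for a given date. It is an illustration only: the function name is hypothetical, and the fixed UTC-5 offset is inferred from the values in this log, not taken from the keepup script itself.

```python
from datetime import date, datetime, timedelta, timezone


def subrun_window(day, utc_offset_hours=-5):
    """Return (start, end) epoch seconds covering one local calendar day."""
    # Fixed offset (assumption): -5 h matches CDT, as seen in the example log.
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(day.year, day.month, day.day, tzinfo=tz)
    end = start + timedelta(days=1)
    return int(start.timestamp()), int(end.timestamp())


# "Days ago: 1" block of the example log (dataset ..._keepup_2016-07-05)
print(subrun_window(date(2016, 7, 5)))  # (1467694800, 1467781200)
```

This makes it easy to confirm which day a given dataset definition covers when reading a log.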

If the submission was done properly, the corresponding SAMweb projects should have been created and the jobs should have been submitted. If the jobs have not yet finished, they should still be visible in the job queue in a "valid" state (queued or running).
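The three summary lists in the log ("Exit code list", "N files list" and "N jobs list") give a quick overview of the submission, with one entry per "Days ago" block. The following minimal sketch parses them and flags entries where files were found but the exit code is not 0. The function names are hypothetical. In the example log every empty dataset reports exit code 28, so 28 appears to mean "no files to process", but that reading is an assumption drawn from this single log.

```python
SUMMARY_PREFIXES = {
    "Exit code list:": "exit_codes",
    "N files list:": "n_files",
    "N jobs list:": "n_jobs",
}


def parse_summary(log_text):
    """Return the three summary lists of a submission log as lists of ints."""
    summary = {}
    for line in log_text.splitlines():
        stripped = line.strip()
        for prefix, key in SUMMARY_PREFIXES.items():
            if stripped.startswith(prefix):
                # The lists appear at the top and bottom of the log; the last one wins.
                summary[key] = [int(v) for v in stripped[len(prefix):].split()]
    return summary


def suspicious_slots(summary):
    """Return indices where files were found but the exit code is not 0."""
    pairs = zip(summary["exit_codes"], summary["n_files"])
    return [i for i, (code, n_files) in enumerate(pairs) if n_files > 0 and code != 0]


example = """\
Exit code list: 0 28 28 28 28 28 28 28 28 28 28 28 28 28 28 28
N files list: 538 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N jobs list: 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""
summary = parse_summary(example)
print(summary["n_jobs"][0], suspicious_slots(summary))  # 29 []
```

An empty list from suspicious_slots only says that nothing looks wrong at submission time; it does not replace the global monitoring described below.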

Global Monitoring

Global monitoring aims to check whether all the input files taken by the keepup processing have made it through the workflow. The criterion for completion is that the expected output files are declared in the file catalog (SAM) and are defined as children of the input files. This check is done on a daily basis, and its syntax varies with the production stage and processing combination (the software release is the most likely thing to change, but anything is possible). To perform global monitoring, one can either check POMS or use SAMweb projects together with specific SAM commands. For the latter, one needs to identify the daily input files and be familiar with the characteristics of the expected output files.

Global monitoring with POMS

In POMS, navigate to the 'active campaigns' tab (https://pomsgpvm01.fnal.gov/poms/show_campaigns), click on the name of a keepup campaign and then select the "Day by Day Spreadsheet" link from the "Actions" menu. Check that the "pending" column is zero for all recent days: if the number of pending files is zero, the processing has been completely successful for that day. If a value is not zero, find out where the processing failed and why, and, if it can be fixed, take the necessary measures. For deeper monitoring, proceed to the "Job Execution Monitoring" section below.

The current keepup campaigns registered in POMS are:

  • Nova raw2root keepup ND
  • NOvA Reco Keepup FD
  • Nova raw2root keepup FD
  • NOvA Reco Keepup ND

Global monitoring with Samweb projects

  1. Identify the input files: find the snapshot IDs of all the keepup SAMweb projects of the day.
  2. Find pending files: determine how many of the files in the snapshots still lack corresponding output files declared in SAM. To do this, query how many of these files do not have children with the expected dimensions, which are defined by the processing stage and the software release.
Production Stage: Raw2Root
    Children dimensions: data_tier = artdaq and (daq2rawdigit.base_release ${nova_release})
    Checking pending files example: snapshot_id 111111 minus isparentof:( data_tier = artdaq and (daq2rawdigit.base_release S15-03-11 ))
Production Stage: Reco
    Children dimensions: data_tier = reco and Reconstructed.base_release ${nova_release}
    Checking pending files example: snapshot_id 222222 minus isparentof:( data_tier = reco and Reconstructed.base_release S15-12-07)

Here ${nova_release} is the NOvA software release used when processing that snapshot; it usually depends on the processing stage and the detector. As of July 2016, the software releases used are:

Raw2Root

FD NuMI S15-03-11
FD Cosmics S15-03-11
FD Others S15-03-11

ND NuMI S16-02-02
ND Cosmics S16-02-02
ND Others S16-02-02

Reco

ND BNB S16-03-04
ND NuMI S16-03-04

FD NuMI S15-12-07
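The pending-files dimensions can be assembled from the snapshot ID, the production stage and the release. The sketch below builds the query string only. The names are hypothetical, the release table simply copies the July 2016 values listed above (one release per stage and detector, since all streams share it), and the snapshot IDs are the placeholders used in the examples. The resulting string could then be passed to a SAM file-counting command such as samweb count-files, assuming the samweb client is set up in your environment.

```python
# Children dimensions per production stage, as listed above.
CHILD_DIMENSIONS = {
    "Raw2Root": "data_tier = artdaq and (daq2rawdigit.base_release {release})",
    "Reco": "data_tier = reco and Reconstructed.base_release {release}",
}

# Releases as of July 2016 (copied from the list above; update when they change).
KEEPUP_RELEASES = {
    ("Raw2Root", "FD"): "S15-03-11",
    ("Raw2Root", "ND"): "S16-02-02",
    ("Reco", "FD"): "S15-12-07",
    ("Reco", "ND"): "S16-03-04",
}


def pending_files_query(stage, detector, snapshot_id):
    """Build SAM dimensions selecting snapshot files without the expected children."""
    release = KEEPUP_RELEASES[(stage, detector)]
    children = CHILD_DIMENSIONS[stage].format(release=release)
    return f"snapshot_id {snapshot_id} minus isparentof:( {children} )"


print(pending_files_query("Raw2Root", "FD", 111111))
print(pending_files_query("Reco", "FD", 222222))
```

A count of zero for every snapshot of the day means that no files are pending for that day.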

Job Execution Monitoring

If there is a non-zero value for pending files, it could be due to the job execution being incomplete or due to an error in the workflow.

  • To check whether the job execution is still in progress:
  1. Using SAMweb projects: open the corresponding keepup SAMweb project(s) and check whether any processes are listed as active. Active processes suggest that some jobs may still be running, but this alone is not conclusive.
  2. Using the job queue: get the cluster ID of the submissions and check the job queue for jobs from that cluster in running status.
  • To check whether something went wrong with the job execution:
  1. Using the condor logs: get the cluster ID of the submissions and look in the condor logs for jobs that finished with a non-zero return value.
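Scanning a condor log for non-zero return values can be automated. The sketch below counts the return values reported in job-termination events. It assumes the standard HTCondor user-log wording "Normal termination (return value N)" and that the log text has already been retrieved by whatever means you normally use. The function names are hypothetical, and abnormal terminations (for example, jobs killed by a signal) are not covered.

```python
import re
from collections import Counter

# Standard HTCondor user-log wording for a job that exited on its own.
RETURN_VALUE = re.compile(r"Normal termination \(return value (-?\d+)\)")


def return_value_counts(condor_log_text):
    """Count how many jobs finished with each return value."""
    return Counter(int(value) for value in RETURN_VALUE.findall(condor_log_text))


def has_failures(condor_log_text):
    """Return True if any job reported a non-zero return value."""
    return any(code != 0 for code in return_value_counts(condor_log_text))
```

For a cluster of 29 jobs, a fully successful run would give a single entry, return value 0 with a count of 29; any other key points to jobs worth inspecting.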

Declaration and Transfer of Files Monitoring

Modifying the keep-up configuration (expert)

If you are the shifter, you generally don't need to read this documentation: Modifying keep-up configuration