Monitoring Minos production

Monitoring Reco KeepUp

The reconstruction processing works in Keep Up mode. Daily, a cronjob executes the KEYGEAR Minos+ script and submits all available files. The cronjob is in the minospro@minos51 crontab. The OPOS team monitors this submission and reports back to the experiment if any job or file fails.

Check the following sections to perform the daily monitoring and reporting (if necessary).

Monitoring

These instructions assume you are going to monitor the Keep Up submission for a single day.

First, you will use sam projects to check the submission:

  • Open the Minos+ Sam monitoring station (http://samweb.fnal.gov:8480/station_monitor/minos/stations/minos/projects) and look for the project you want to monitor.
    • Get the name of the project from the output email sent to the OPOS mailing list and Adam Schrekenberger. The mail is sent daily around 11:30 pm.
    • The names of the projects associated with the Reco Keep Up submission follow this convention:

[software version].keepup.[timestamp]
Example:
elm4.keepup.20150510233001
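
As an illustration of the naming convention above, the following minimal sketch splits a project name into its parts. It assumes the timestamp is a 14-digit YYYYMMDDHHMMSS value, which is inferred from the single example and not confirmed elsewhere; the helper name parse_project_name is hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern: "<software version>.keepup.<YYYYMMDDHHMMSS>"
PROJECT_RE = re.compile(r"^(?P<version>[^.]+)\.keepup\.(?P<stamp>\d{14})$")


def parse_project_name(name):
    """Return (software_version, submission_datetime) for a keepup project name.

    Raises ValueError if the name does not match the assumed pattern.
    """
    match = PROJECT_RE.match(name)
    if match is None:
        raise ValueError("not a keepup project name: %r" % name)
    submitted = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
    return match.group("version"), submitted


version, submitted = parse_project_name("elm4.keepup.20150510233001")
print(version, submitted.isoformat())  # elm4 2015-05-10T23:30:01
```

Under that assumption, the example project was created at 23:30:01 on May 10th, 2015, which agrees with the mail being sent around 11:30 pm.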

Second, you will use the scripts to monitor the submission.

Log into the minos51 machine.

ssh minospro@minos51

Then you need to set up the tools:

PRO > setup_jobsub
PRO > setup_samweb

Then you need to use the report_minos.py script located at: /minos/app/home/minospro/OPOS/report_generation/py_classes/

This script reports on the submission for a single day. You need to provide one numeric argument, which determines the monitoring day as a number of days before today.

For example:

Let's say today is July 20, 2016.
  • python report.py 1: goes back one day and checks the submission for July 19th, 2016.
  • python report.py 2: goes back two days and checks the submission for July 18th, 2016.
  • python report.py 3: goes back three days and checks the submission for July 17th, 2016.
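
The mapping from the numeric argument to the monitoring day in the examples above can be sketched as follows. This only illustrates the described behaviour; how the script itself computes the date is not shown here, and the function name monitoring_day is hypothetical.

```python
from datetime import date, timedelta


def monitoring_day(days_ago, today=None):
    """Return the date that lies `days_ago` days before `today`."""
    if today is None:
        today = date.today()
    return today - timedelta(days=days_ago)


# Reproduce the examples, taking July 20, 2016 as "today".
for days_ago in (1, 2, 3):
    print(days_ago, monitoring_day(days_ago, today=date(2016, 7, 20)))
```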

The script reports the error code frequency (from the condor logs) and other useful information such as bad processes and number of files (from the samweb project).

For example, from the information below you can see that 84 jobs ended with error code 0 and 3 jobs ended with error code 4.

####################
Overall metrics
Error code frequency {'0': 84, '4': 3}
Total files: 87
Total processes: 90
Completed processes: 84
Bad processes: 3
Error processes: 3
Other processes: 0
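
To decide quickly whether a report is needed, the error code frequency can be summarized as in the minimal sketch below. It assumes the frequency is available as a Python dictionary with string keys, as printed in the output above, and that only code '0' means success; the function name summarize is hypothetical.

```python
def summarize(error_code_frequency):
    """Split an error code frequency dict into succeeded and failed job counts."""
    total = sum(error_code_frequency.values())
    # Every code other than "0" is treated as a failure (assumption).
    failed_by_code = {
        code: count
        for code, count in error_code_frequency.items()
        if code != "0"
    }
    failed = sum(failed_by_code.values())
    return {
        "total": total,
        "succeeded": total - failed,
        "failed": failed,
        "failed_by_code": failed_by_code,
        "needs_report": failed > 0,
    }


summary = summarize({"0": 84, "4": 3})
print(summary["succeeded"], summary["failed"], summary["needs_report"])  # 84 3 True
```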

Then run python List_report.py 1. As with the previous script, the numeric argument selects the monitoring day. A recovery file will be generated at /minos/data/minfarm/lists/OPOSlists/recovery_file_[submission-date]; this output file contains the files that were in a process marked as bad or error.

Third, you need to update the tracking spreadsheet and report problems:

  • Check that all the jobs have completed successfully.
  • Update the Minos+ monitoring spreadsheet: https://goo.gl/fttKAS
  • If you find at least one job that failed, report back to the experiment (check instructions in following section).
    • If there are some jobs still running, hold until they complete before reporting back to the experiment.

Repeat each of these steps if you're monitoring the Keep Up submissions of more than one day, for instance the submissions of a whole weekend.

Reporting

Perform this task only when you have found at least one job that failed. To report back to the experiment, fill in the reporting form and send it as an email message to .

Date | Submitted Jobs | Failed Jobs | Error Code | Error Description | List of Failed Files
5/10/2015 | 119 | 95 | 136 | FPE (Floating Point Exception) | ~minospro/OPOS/Reports/Reco-KeepUp/05-10-15/Recovery_List_Error_136.txt

Each row in this table holds the information for one Keep Up submission. Let's go through each of the columns, explaining its meaning and how to get the information to fill it in.

  • Date: Date of the submission. You can get this information from the sam web project monitoring page.
  • Submitted jobs: Number of jobs submitted. This allows the receiver of the report to estimate the ratio of failing jobs. This number can be obtained from the sam web project monitoring page or from the output submission mail (which should be available in the OPOS mailbox).
  • Failed jobs: Number of jobs that failed in the submission. This number can be obtained from the sam web project monitoring page or from the output submission mail (which should be available in the OPOS mailbox).
  • Error code: Error code of the failing jobs. You get this information from the condor log files, in particular by grepping "return value" in the condor log files with extension *.log (this has been automated by the report.py script).
  • Error description: Brief description of the error code. The error codes are defined by the KEYGEAR and AMBROSIA scripts. For a complete list of error codes with their descriptions, go to https://goo.gl/2b7kvL.
  • List of failed files: For each submission and each error code, you must create a text file that contains the names of the files that failed with that particular error code in that particular submission. That file should be in a folder named after the date of the submission (e.g. 05-10-15), which in turn should be under the folder ~minospro/OPOS/Reports/Reco-KeepUp (this has been automated by the List_report.py script).
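
The "return value" grep and the naming of the failed-files list described above can be sketched in Python as follows. This is a minimal illustration and not the code of report.py or List_report.py. The sample log lines are made up for the example, the folder date format (month-day-year) and the file name pattern are inferred from the single example path in the table, and both function names are hypothetical.

```python
import posixpath
import re
from collections import Counter
from datetime import date

RETURN_VALUE_RE = re.compile(r"return value (\d+)")
REPORT_ROOT = "~minospro/OPOS/Reports/Reco-KeepUp"


def count_return_values(log_lines):
    """Count how often each return value appears in condor *.log lines."""
    counts = Counter()
    for line in log_lines:
        match = RETURN_VALUE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def failed_list_path(submission_date, error_code):
    """Build the path of the failed-files list for one date and error code."""
    folder = submission_date.strftime("%m-%d-%y")  # assumed MM-DD-YY
    file_name = "Recovery_List_Error_%s.txt" % error_code
    return posixpath.join(REPORT_ROOT, folder, file_name)


# Made-up sample lines, for illustration only.
sample_lines = [
    "(1) Normal termination (return value 0)",
    "(1) Normal termination (return value 136)",
    "(1) Normal termination (return value 136)",
]
print(dict(count_return_values(sample_lines)))
print(failed_list_path(date(2015, 5, 10), 136))
```

With real logs, the lines would come from reading each *.log file of the submission instead of the sample list.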

Reco KeepUp Overview

The following pie chart gives an overview of the Reco Keep Up status for the period from April 21st, 2015 (the date on which the OPOS team started monitoring this process with the help of samweb project monitoring) to June 18th, 2015.

A return value of zero represents jobs that completed successfully. Return value 101 represents jobs that failed due to "Gaps in beam spill database -- usually rerun after db update", which is associated with a no-beam scenario. Return value 136 represents jobs that failed due to a floating point exception, which is usually caused by unusual entries in the calibration database. Return value (error code) 4 is also of interest and represents jobs that failed due to a problem with sam web projects.
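
For convenience when filling in the report, the return values discussed above can be kept in a small lookup table. The sketch below covers only the four codes mentioned in this section; the complete list is the one maintained with the KEYGEAR and AMBROSIA scripts, and the names ERROR_DESCRIPTIONS and describe are hypothetical.

```python
# Only the return values described in this section; not the complete list.
ERROR_DESCRIPTIONS = {
    0: "Completed successfully",
    4: "Problem with sam web projects",
    101: "Gaps in beam spill database -- usually rerun after db update",
    136: "FPE (Floating Point Exception)",
}


def describe(return_value):
    """Return a short description for a return value given as int or str."""
    return ERROR_DESCRIPTIONS.get(
        int(return_value),
        "Unknown here; check the complete error code list",
    )


for code in ("0", "4", "101", "136"):
    print(code, describe(code))
```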