What to do on production shift

Your main job as the production shifter is to keep the grid busy! At shift changeover (production meeting), you should have heard about:
  • What samples the last shifter completed
  • What samples the last shifter was still working on or didn't get to
  • The production convener's plan for your shift.

Keeping the queue tidy

When there are lots of production jobs running, it's inevitable that some of them wind up in the "held" state: a job may hit a particularly messy event that needs far more memory than was budgeted for simulation and/or reconstruction, run longer than the lifetime it was submitted with, and so on. Eventually a cluster with held jobs will run down until nothing but held jobs remain, and it will then sit in the queue that way for a week, cluttering it up.

As the shifter, it's your job to keep the queue tidy. When a cluster has nothing remaining but held jobs:
  • Ensure the reason they are held doesn't point to some larger issue. Check the "Why are my jobs held?" link on the novapro FIFEmon page. If you notice any patterns like the following, discuss them with the production convener:
    • Are a large fraction (more than a couple of percent) of the jobs in a cluster being held for running out of memory or running too long?
    • Are a large fraction of jobs being sent to a particular OSG site being held for the same reason?
    • etc.
  • Once you're confident there are no systemic reasons for jobs being held, remove clusters that consist only of held jobs. prodjob-summary -H will list such clusters; they can be removed with jobsub_rm --role=Production --jobid <cluster ID>.
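
For example, one cleanup pass might look like the following (the cluster ID is a made-up placeholder; use the real IDs from the prodjob-summary output):

# List clusters that contain nothing but held jobs:
prodjob-summary -H

# Remove each such cluster by its ID (placeholder shown):
jobsub_rm --role=Production --jobid 12345678@fifebatch1.fnal.gov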

Fulfilling production requests

Production requests are tracked on the Trello board. (You should have access to Trello; if you don't, ask the production convener to invite you.) Any samples in status "Ready to Submit," "Running," or "Running and Defined" are your responsibility. Act on them according to the priorities communicated by the convener.

"Ready to Submit"

These jobs are ready to start. Use the information in the Trello card and the Running Jobs links to submit them.
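
The exact command comes from the card and the Running Jobs documentation, but as a rough sketch of the shape, a FIFE batch submission generally looks something like this (every option and path below is a placeholder, not the real production command):

# Placeholder sketch only; take the real script and options from the Trello card:
jobsub_submit -G nova --role=Production \
    --memory=2000MB --expected-lifetime=24h \
    file:///path/to/job_script.sh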

Once they're running, move the Trello card to "Running."

"Running"

The hardest part is done! You should keep an eye on the jobs using the links under Monitoring Jobs to make sure they run successfully.

While they're running, you should make any dataset definitions appropriate for the sample; see How to make definitions. Then move the Trello card to "Running and Defined."
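
If you're making the definitions by hand with samweb on a gpvm, the pattern is roughly as follows (the definition name and dimensions here are placeholders; follow How to make definitions for the real ones):

# Create a dataset definition from SAM dimensions (names are placeholders):
samweb create-definition my_sample_reco "defname: my_sample and data_tier reco"

# Double-check the result:
samweb describe-definition my_sample_reco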

"Running and Defined"

Now you get to sit back, wait, and monitor them. Or, more likely, get started on the next set of jobs. Or watch the Keepup. :)

If it looks like jobs aren't running successfully, read the information in When things go wrong.

Once all of the output files have been produced, move the card to "Complete".

If jobs didn't finish successfully, you may need to re-submit some of them using a draining dataset.
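
A draining definition selects the input files whose output hasn't appeared yet, so resubmitting from it reruns only the failed jobs. As a hedged sketch using SAM's minus and isparentof operators (all names here are placeholders):

# Input files that are not yet parents of any file in the output definition:
samweb create-definition my_sample_draining \
    "defname: my_sample minus isparentof:( defname: my_sample_reco )"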

"Complete"

This sample's done! Announce it to #production on Slack and celebrate.

Keepup

Monitoring

You should also check the status of the keepup jobs every day. Look at the emails sent to the #keepup channel on Slack. The most important part of each one looks like:

================================
Exit code list: 0 0 28 28 28 28 28 28 28 28 28 28 28 28 28 28
N files list: 446 105 0 0 0 0 0 0 0 0 0 0 0 0 0 0
N jobs list: 25 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------

The first column of numbers tells you about the jobs submitted with inputs from yesterday or the day before. The second column concerns files from 3 or 4 days ago; the next one, 5-6 days ago, etc. If everything is working correctly, there will be nonzero "N files" and "N jobs" numbers only in the first few columns, with "0" exit code (success). The other columns will have zero (or very few) jobs, and exit code 28 (no jobs to process). (The full list of exit codes is here.)

When things are not working correctly, you may see something like

================================
Exit code list: 0 0 0 28 28 28 28 28 28 28 28 28 28 28 28 28
N files list: 5822 4941 1685 0 0 0 0 0 0 0 0 0 0 0 0 0
N jobs list: 294 250 87 0 0 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------

where many jobs from several days ago are left over (indicating that they are not completing successfully), or perhaps

================================
Exit code list: 28 28 28 28 28 28 0 28 28 28 28 28 28 28 28
N files list: 0 0 0 0 0 0 292 0 0 0 0 0 0 0 0
N jobs list: 0 0 0 0 0 0 61 0 0 0 0 0 0 0 0
------------------------------------------

which indicates that there was a problematic batch of jobs which should be investigated. Further monitoring notes can be found in Keepup_Jobs.

Troubleshooting

The first thing to do is check the log files. Look for a line like

Use job id 14892093.0@fifebatch1.fnal.gov to retrieve output

in the email on the #keepup channel you were examining above. You can put the jobsub cluster ID into the boxes on the upper left of the FIFEmon page: the cluster is the number preceding the .0, and the schedd is either fifebatch1 or fifebatch2, depending on the job ID you saw in the email. Alternatively, you can execute the following on a gpvm:

jobsub_fetchlog --jobid <job id> --dest-dir <some dir>

and read the logs in <some dir>.

Look for errors in the log files.
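
A quick first pass is to grep the fetched logs for common failure keywords (this pattern is just a starting point, not an exhaustive list):

grep -iE 'error|fatal|exception|segmentation' <some dir>/*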

Report that keepup is having problems, along with the symptoms you observed and any errors you've found, on #production.

Credentials

Once upon a time, keepup jobs were run using the current production shifter's credentials (grid proxy etc.). Handing these over from one shifter to the next caused far more headaches than it was worth, so the jobs are now always run using the convener's credentials.