Production Troubleshooting

This page documents problems commonly encountered when running production jobs, and their solutions. It is not intended as a cookbook, but rather as a list of suggestions for how to investigate problems and understand what went wrong. It presumes you are already familiar with our production procedures. Only production-specific issues are documented here. If you run into a more general problem, consult the general Troubleshooting and Gotchas page.

Step 0: Check the Logs

There are several ways to check the log files for your jobs.

Common Crashes

Function checkReservedForCafe is Not allowed for version 3.

This error (which forces a crash) is normally encountered during raw2root keepup. It indicates that the input file is corrupt. Notify the data quality and DAQ groups, as well as the production coordinator. The production coordinator will mark the file for skipping by production and update the list of corrupt files.

/usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found

This occurs at submission time. The full error is

ifdh: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by ifdh)
ifdh: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ifdh)
ifdh: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ifdh)

This can happen when you first set up nova, then ksu to novapro, and then issue some nova-related command (ifdh in this example). The correct procedure is to start from a clean shell, ksu to novapro, and only then set up nova. You should also check whether you actually need to be novapro for the task at hand. For example, running make_sim_fcl generally requires it, but submit_mc_gen does not.
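The two crash signatures described above can be spotted automatically when scanning a log. The following is a minimal sketch, not an existing production tool: the function name scan_log and the regular expressions are illustrative assumptions, and the advice strings simply paraphrase this page.

```python
import re

# Crash signatures taken from this page, mapped to the suggested action.
# The patterns are an illustrative choice, not an official list.
KNOWN_CRASHES = [
    (re.compile(r"Function checkReservedForCafe is Not allowed for version"),
     "Input file is probably corrupt: notify data quality, DAQ, "
     "and the production coordinator."),
    (re.compile(r"libstdc\+\+\.so\.6: version `(GLIBCXX|CXXABI)_[\d.]+' not found"),
     "Environment problem: start a clean shell, ksu to novapro, "
     "and only then set up nova."),
]


def scan_log(lines):
    """Return the distinct advice strings for known crashes, in first-seen order."""
    found = []
    for line in lines:
        for pattern, advice in KNOWN_CRASHES:
            if pattern.search(line) and advice not in found:
                found.append(advice)
    return found


if __name__ == "__main__":
    sample = [
        "ifdh: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by ifdh)",
        "ifdh: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ifdh)",
    ]
    for advice in scan_log(sample):
        print(advice)
```

Both sample lines match the same signature, so the advice is reported only once.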

My jobs appear to finish successfully, but the dataset looks incomplete

This same symptom can be caused by multiple underlying problems:
  • The file is declared to SAM, but the file never made it to the dropbox.
  • The file is declared to SAM, but has the wrong metadata.
  • The file got stuck in the dropbox, and may or may not have been declared to SAM.
  • The file is simply lost (not on disk or in SAM).
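The four causes above can be told apart with a few yes/no observations about a missing file, gathered with the checks described in the rest of this section. The sketch below is only a way of writing that reasoning down. The function name, its arguments, and the returned labels are hypothetical and do not correspond to any existing tool.

```python
def classify_missing_file(declared_in_sam, has_location, in_dropbox, matches_definition):
    """Map four yes/no observations about one file to a likely cause.

    declared_in_sam:    SAM has metadata for the file
    has_location:       SAM lists a location for the file (it is not virtual)
    in_dropbox:         the file is still sitting in the dropbox
    matches_definition: the file metadata satisfies the output definition
    """
    if in_dropbox:
        # Stuck files may or may not have been declared, so check this first.
        return "stuck in the dropbox"
    if not declared_in_sam:
        return "lost (not on disk or in SAM)"
    if not has_location:
        return "declared to SAM but never reached the dropbox"
    if not matches_definition:
        return "declared to SAM with metadata that does not match the definition"
    return "no problem identified"


if __name__ == "__main__":
    # A virtual file that is not in the dropbox.
    print(classify_missing_file(True, False, False, True))
```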

Check your draining dataset

Don't have a draining dataset? Construct a draining dataset.

If there are files in your draining dataset, then some files were not processed, for any of a variety of reasons. Repeat your initial submission, but use the draining dataset instead of your original dataset. If the draining dataset is empty, continue to the next check.

Check to see if there are "virtual" files

Virtual files are files that SAM knows about but which do not actually exist on tape yet. See if you have virtual files in your definition:

samweb list-files "defname: <your definition> with availability virtual"

If you have virtual files:

In this case the files have the correct metadata, but did not make it to their final location. Check whether they were successfully copied to the dropbox, starting with whether the file is still sitting there:

ls `dropbox_path <filename>`

If the file is there, then it is stuck in FTS for some reason. See Debugging FTS.

If there is no file there, then the job declared the file to SAM but failed to copy it back to the dropbox. The virtual files will need to be retired, and then draining jobs will need to be submitted. Prepare a list of the files that need to be retired and send it to the production convener.
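When many virtual files are involved, the two cases above can be separated in one pass. This is a minimal sketch under stated assumptions: triage_virtual_files and format_retire_list are hypothetical helper names, and the dropbox check is passed in as a function so that no assumption is made about how dropbox_path is called from a script.

```python
from typing import Callable, Iterable, List, Tuple


def triage_virtual_files(
    virtual_files: Iterable[str],
    in_dropbox: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split virtual files into (stuck in FTS, needs to be retired)."""
    stuck_in_fts: List[str] = []
    to_retire: List[str] = []
    for name in virtual_files:
        if in_dropbox(name):
            stuck_in_fts.append(name)   # file exists in the dropbox: debug FTS
        else:
            to_retire.append(name)      # declared but never copied back
    return stuck_in_fts, to_retire


def format_retire_list(names: Iterable[str]) -> str:
    """One file name per line, ready to send to the production convener."""
    names = list(names)
    return "\n".join(names) + "\n" if names else ""


if __name__ == "__main__":
    # Hypothetical file names and a fake dropbox check, for illustration only.
    fake_dropbox = {"file_b.root"}
    stuck, retire = triage_virtual_files(
        ["file_a.root", "file_b.root", "file_c.root"],
        lambda name: name in fake_dropbox,
    )
    print("Stuck in FTS:", stuck)
    print(format_retire_list(retire), end="")
```

In real use, in_dropbox would wrap whatever check you trust for the dropbox location, for example a test that the path reported by dropbox_path exists.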

If you do not have virtual files:

First, you need to determine the name of a missing file. You can do that by checking the logs of your jobs or by asking for files which are children of your input dataset but are not in your output dataset:

samweb list-files "data_tier <tier> and nova.release <release> and ischildof:(defname: <input dataset>) minus defname: <output dataset>"
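Because this query is long and easy to mistype, it can help to assemble it programmatically. The sketch below only builds the dimension string and the command line shown above; it does not contact SAM. The function names and the placeholder values are hypothetical.

```python
import shlex


def missing_children_query(tier, release, input_def, output_def):
    """Build the dimension string for children of input_def missing from output_def."""
    return (
        f"data_tier {tier} and nova.release {release} "
        f"and ischildof:(defname: {input_def}) "
        f"minus defname: {output_def}"
    )


def samweb_list_files_command(dimensions):
    """Return the argument list for 'samweb list-files' with one quoted query."""
    return ["samweb", "list-files", dimensions]


if __name__ == "__main__":
    # Placeholder values only; substitute your own tier, release, and definitions.
    query = missing_children_query("my_tier", "my_release", "my_input_def", "my_output_def")
    # shlex.join quotes the query so it can be pasted into a shell as one argument.
    print(shlex.join(samweb_list_files_command(query)))
```

Passing the argument list to subprocess.run would execute the command without going through shell quoting at all, which avoids unbalanced-quote mistakes.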

If the file was copied back successfully, then it may be in SAM but not appearing in the expected output dataset. First confirm that the file is in SAM by using the command:

samweb get-metadata <file_name>

If the file does exist in SAM, there is probably some mismatch between the file metadata and the constraints in the SAM definition. (Remember that a definition is not a fixed list of files, but a set of metadata constraints. As file metadata changes, new files show up, or old files are deleted, the list of files in a dataset can change.) Compare the file metadata to the constraints in the definition:

samweb describe-definition <defname>

You may learn that the file metadata is wrong in some way and needs to be corrected, or that the job needs to be resubmitted. Alternatively, you may find that there is a mistake in the construction of the SAM definition. (As a side note, we could really use an automatic tool for doing this comparison, since it can be a little bit tedious.)
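As a starting point for the comparison described above, here is a minimal sketch. It assumes you have already transcribed the file metadata and the definition's simple "field value" constraints into dictionaries by hand. It does not parse real SAM dimension strings, and find_mismatches is a hypothetical name, not an existing tool.

```python
def find_mismatches(metadata, constraints):
    """Return {field: (expected, actual)} for each constraint the metadata fails.

    Values are compared as strings, and a field missing from the metadata is
    reported with actual=None. Only plain equality constraints are handled.
    """
    mismatches = {}
    for field, expected in constraints.items():
        actual = metadata.get(field)
        if actual is None or str(actual) != str(expected):
            mismatches[field] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    file_metadata = {"data_tier": "tier_a", "nova.release": "release_1"}
    definition_constraints = {"data_tier": "tier_a", "nova.release": "release_2"}
    for field, (expected, actual) in find_mismatches(
        file_metadata, definition_constraints
    ).items():
        print(f"{field}: definition wants {expected!r}, file has {actual!r}")
```

A mismatch reported here does not say which side is wrong. The file metadata may need correcting, or the definition may have been constructed incorrectly.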

Debugging FTS

There are multiple ways to see what the FTS is doing with the file. This script is slow, but will provide the most complete information:

findFileInProdFTS.py <filename>

Use its output to see if you can figure out what went wrong in FTS.

You can also do:

which_fts <filename>

to see which FTS instance holds the file, and then check that instance's status page at novasamgpvmNN.fnal.gov:8888/fts/status, where NN runs from 01 to 08.
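If you do not know which instance to look at, the full set of status pages can be listed by expanding NN. This is a small sketch. fts_status_urls is a hypothetical helper, and no URL scheme is prepended because this page does not state one.

```python
def fts_status_urls(first=1, last=8):
    """Expand novasamgpvmNN.fnal.gov:8888/fts/status for NN = first..last."""
    return [
        f"novasamgpvm{nn:02d}.fnal.gov:8888/fts/status"
        for nn in range(first, last + 1)
    ]


if __name__ == "__main__":
    for url in fts_status_urls():
        print(url)
```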

Sidebar: How Virtual Files are Handled

The treatment of virtual files can be confusing, so here's an example that will hopefully make it clearer.

Say we have three parent files:
  • Parent 1 produces Child 1 successfully
  • Parent 2 declares Child 2, but because of a copy failure it is only virtual in SAM
  • Parent 3 fails to produce a file
Now let's look at how these show up in definitions:
  • Counting the parent definition will give you 3 files.
  • Counting the child definition will give you only 1 file, since by default only files with locations are counted.
  • Counting a draining dataset will also give you only 1, since virtual files are counted when deciding whether a file has a child or not.
Notice that the virtual file will seem to get "lost," since you can't remake it with the draining dataset, but it also doesn't show up in the child dataset. You can get around that with extra SAM query parameters about "availability":
  • Counting the child definition, but adding "with availability virtual", will give you 1, since now you are seeing only the virtual file.
  • Counting the child definition, but adding "with availability anyLocation", will give you 2, since now the virtual file is included.
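The counting rules in this example can be checked with a toy model. The sketch below does not talk to SAM. It only encodes the three parents from the example and the counting behaviour described above, and every name in it is hypothetical.

```python
# State of each parent's child: "located", "virtual", or None (no child declared).
PARENTS = {"Parent 1": "located", "Parent 2": "virtual", "Parent 3": None}


def count_parents(parents):
    """Every parent file is counted."""
    return len(parents)


def count_children(parents, availability="default"):
    """Count children the way the example describes each availability setting."""
    states = [state for state in parents.values() if state is not None]
    if availability == "default":
        return sum(state == "located" for state in states)   # only files with locations
    if availability == "virtual":
        return sum(state == "virtual" for state in states)   # only virtual files
    if availability == "anyLocation":
        return len(states)                                    # located and virtual
    raise ValueError(f"unknown availability: {availability}")


def count_draining(parents):
    """Parents with no child at all; a virtual child still counts as a child."""
    return sum(state is None for state in parents.values())


if __name__ == "__main__":
    print("parent definition:", count_parents(PARENTS))
    print("child definition:", count_children(PARENTS))
    print("draining dataset:", count_draining(PARENTS))
    print("child, availability virtual:", count_children(PARENTS, "virtual"))
    print("child, availability anyLocation:", count_children(PARENTS, "anyLocation"))
```

The model makes the "lost" file visible: Parent 2 is absent from both the default child count and the draining count, and only the availability settings bring it back.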