This page documents problems commonly encountered when running production jobs, along with their solutions. It is not intended as a cookbook, but rather as a list of suggestions for how to investigate problems and understand what went wrong. It presumes you are already familiar with our production procedures. Note that only production-specific issues are documented here. You may run into problems that are more general, in which case you should consult the general Troubleshooting and Gotchas page.
Step 0: Check the Logs¶
There are several ways to check the log files for your jobs.
- If you submitted with --poms, the easiest way is to drill down to the job in POMS and click the downward arrow to see job details. Links to the logs will be available.
- Alternatively, if you are on the Fermilab network (or connected via VPN), you can get the logs online at: https://fifebatch1OR2.fnal.gov:8443/jobsub/acctgroups/nova/sandboxes/novapro/CLUSTER.0@fifebatch1OR2.fnal.gov/
- Finally, you can fetch the logs from the command line with:
jobsub_fetchlog --role=Production --jobid=<cluster>.0@fifebatch<1 or 2>.fnal.gov
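If you do not know which of the two servers holds the job, one option is simply to try both. The following is a minimal sketch, not an official tool: the helper names build_jobid and fetch_logs_any_server are hypothetical, it uses only the jobsub_fetchlog options shown above, and it assumes jobsub_fetchlog exits with a non-zero status when the fetch fails.

```shell
# Hypothetical helper: build the job ID string in the format used above.
#   $1 = cluster number, $2 = fifebatch server number (1 or 2)
build_jobid() {
    printf '%s.0@fifebatch%s.fnal.gov\n' "$1" "$2"
}

# Hypothetical helper: try each server in turn and stop at the first success.
# Assumption: jobsub_fetchlog returns non-zero if the logs cannot be fetched.
fetch_logs_any_server() {
    local cluster="$1" server
    for server in 1 2; do
        if jobsub_fetchlog --role=Production --jobid="$(build_jobid "$cluster" "$server")"; then
            return 0
        fi
    done
    echo "Could not fetch logs for cluster $cluster from either server" >&2
    return 1
}

# Usage (cluster number is a placeholder):
#   fetch_logs_any_server 1234567
```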
Function checkReservedForCafe is Not allowed for version 3.¶
This error (which forces a crash) is normally encountered during raw2root keepup. It indicates that the input file is corrupt. Notify the data quality and DAQ groups, as well as the production coordinator. The production coordinator will mark the file for skipping by production and update the list of corrupt files.
/usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found¶
This occurs at submission time. The full error is:
ifdh: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by ifdh)
ifdh: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ifdh)
ifdh: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by ifdh)
This can happen when you first setup nova, then ksu to novapro, and then issue some nova-related command (ifdh in this example). The correct approach is to start from a clean shell, ksu to novapro, and only then setup nova. You should also check whether you actually need to be novapro for the task at hand. Running make_sim_fcl generally requires this, for example, but not
My jobs appear to finish successfully, but the dataset looks incomplete¶
This symptom can have several underlying causes:
- The file is declared to SAM, but the file never made it to the dropbox.
- The file is declared to SAM, but has the wrong metadata.
- The file got stuck in the dropbox, and may or may not have been declared to SAM.
- The file is simply lost (not on disk or in SAM).
Check your draining dataset¶
Don't have a draining dataset? Construct a draining dataset.
If there are files in your draining dataset, then some files were not processed, for any of a variety of reasons. Repeat your initial submission, but use the draining dataset instead of your original dataset. If the draining dataset is empty, continue to the next step.
Check to see if there are "virtual" files¶
Virtual files are files that SAM knows about but which do not actually exist on tape yet. See if you have virtual files in your definition:
samweb list-files "defname: <your definition> with availability virtual"
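If you only need a yes/no answer, counting the files is enough. The sketch below wraps the query above in two hypothetical helpers (virtual_dims and has_virtual_files are names invented here); it assumes samweb count-files prints a single integer for the given dimensions.

```shell
# Hypothetical helper: build the dimensions string for the virtual-file query above.
virtual_dims() {
    printf 'defname: %s with availability virtual\n' "$1"
}

# Hypothetical helper: exit status 0 if the definition contains virtual files,
# 1 if it contains none, 2 if the samweb call itself failed.
# Assumption: samweb count-files prints a single integer.
has_virtual_files() {
    local n
    n="$(samweb count-files "$(virtual_dims "$1")")" || return 2
    [ "$n" -gt 0 ]
}

# Usage (definition name is a placeholder):
#   if has_virtual_files my_definition; then echo "virtual files present"; fi
```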
If you have virtual files:¶
Then the files have the correct metadata, but did not make it to their final location. The next step is to find out whether they were successfully copied to the dropbox. Check whether the file is sitting in the dropbox:
ls `dropbox_path <filename>`
If the file is there, then it is stuck in FTS for some reason. See Debugging FTS.
If there is no file there, then the job declared the file to SAM but failed to copy it back to the dropbox. The virtual files will need to be retired, and then draining jobs will need to be submitted. Prepare a list of the files that need to be retired and send it to the production convener.
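With many virtual files, the dropbox check can be looped. The sketch below sorts file names into two lists: those still in the dropbox (stuck in FTS) and those missing from it (candidates for retiring). The function name triage_virtual_files and the DROPBOX_PATH_CMD variable are hypothetical, and the sketch assumes dropbox_path prints the full path of the file, as the ls command above suggests.

```shell
# Hypothetical helper: read file names on stdin and sort them into two lists.
#   $1 = output list for files still sitting in the dropbox (stuck in FTS)
#   $2 = output list for files missing from the dropbox (candidates for retiring)
# DROPBOX_PATH_CMD is an assumption of this sketch: it defaults to dropbox_path
# and can be pointed at a stand-in command for testing.
# Assumption: the resolver prints the full path of the file in the dropbox.
triage_virtual_files() {
    local stuck="$1" to_retire="$2" resolver="${DROPBOX_PATH_CMD:-dropbox_path}" name path
    : > "$stuck"
    : > "$to_retire"
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        path="$("$resolver" "$name")"
        if [ -n "$path" ] && [ -e "$path" ]; then
            echo "$name" >> "$stuck"
        else
            echo "$name" >> "$to_retire"
        fi
    done
}

# Usage:
#   samweb list-files "defname: <your definition> with availability virtual" \
#       | triage_virtual_files stuck_in_fts.txt to_retire.txt
```

Note that a file also lands in the second list if the path lookup itself fails, so check that dropbox_path works in your environment before sending the list on.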
If you do not have virtual files:¶
First, you need to determine the name of a missing file. You can do that by checking the logs of your jobs or by asking for files which are children of your input dataset but are not in your output dataset:
samweb list-files "data_tier <tier> and nova.release <release> and ischildof:(defname: <input dataset>) minus defname: <output dataset>"
If the file was copied back successfully, then it may be in SAM but not appearing in the expected output dataset. First confirm that the file is in SAM by using the command:
samweb get-metadata <file_name>
If the file does exist in SAM, there is probably some mismatch between the file metadata and the constraints in the SAM definition. (Remember that a definition is not a fixed list of files, but a set of metadata constraints. As file metadata changes, new files show up, or old files are deleted, the list of files in a dataset can change.) Compare the file metadata to the constraints in the definition:
samweb describe-definition <defname>
You may learn that the file metadata is messed up in some way and it needs to be corrected or the job resubmitted. You may find that there is a mistake in the construction of the SAM definition. (As a side note, we could really use an automatic tool for doing this comparison, since it can be a little bit tedious.)
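Until such a tool exists, a small wrapper can at least put the two outputs next to each other. This is only a convenience sketch: compare_file_to_definition is a hypothetical name, and it does no automatic comparison, it just prints the output of the two commands above for reading by eye.

```shell
# Hypothetical helper: print a file's metadata followed by the definition,
# so the metadata can be checked against the constraints by eye.
#   $1 = file name, $2 = definition name
compare_file_to_definition() {
    local file="$1" defname="$2"
    echo "=== Metadata for $file ==="
    samweb get-metadata "$file" || return 1
    echo "=== Definition $defname ==="
    samweb describe-definition "$defname"
}

# Usage (names are placeholders):
#   compare_file_to_definition <file_name> <defname> | less
```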
There are multiple ways to see what the FTS is doing with the file. This script is slow, but will provide the most complete information:
to see if you can figure out what went wrong in FTS.
You can also do:
to see which FTS instance the file is located in and check the website,
novasamgpvmNN.fnal.gov:8888/fts/status, where NN goes from 01 to 08.
Sidebar: How Virtual Files are Handled¶
The treatment of virtual files can be confusing, so here is an example that will hopefully make it clearer. Say we have three parent files:
- Parent 1 produces Child 1 successfully
- Parent 2 declares Child 2, but because of a copy failure it is only virtual in SAM
- Parent 3 fails to produce a file
In that case:
- Counting the parent definition will give you 3 files.
- Counting the child definition will give you only 1 file, since by default only files with locations are counted.
- Counting a draining dataset will also give you only 1 file, since virtual files are counted when deciding whether a file has a child or not.
- Counting the child definition, but adding with availability virtual, will give you 1, since now you are seeing only the virtual files.
- Counting the child definition, but adding with availability anyLocation, will give you 2, since now the virtual file is included.
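The same bookkeeping can be written out as a toy model. The sketch below does not talk to SAM at all; it encodes the three parents from this example in a small table and reproduces the counting rules described here, with hypothetical function names.

```shell
# Toy model of the three parents above, one line per parent: "<parent> <child state>".
# The child state is physical (has a location), virtual (declared only), or none.
MODEL='parent1 physical
parent2 virtual
parent3 none'

# Count children the way the sidebar describes for each availability setting:
#   default     -> only children with a location
#   virtual     -> only virtual children
#   anyLocation -> both
count_children() {
    printf '%s\n' "$MODEL" | awk -v mode="$1" '
        mode == "default"     && $2 == "physical" { n++ }
        mode == "virtual"     && $2 == "virtual"  { n++ }
        mode == "anyLocation" && $2 != "none"     { n++ }
        END { print n + 0 }'
}

# A parent is in the draining dataset only if it has no child at all;
# a virtual child still counts as a child.
count_draining() {
    printf '%s\n' "$MODEL" | awk '$2 == "none" { n++ } END { print n + 0 }'
}

count_children default      # prints 1
count_children virtual      # prints 1
count_children anyLocation  # prints 2
count_draining              # prints 1
```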