How to get help

  • Step 1, look at your log files.
    • Is there an obvious error message? Is there a seg fault?
    • If you don't have a log file because you are running interactively, you can easily capture the output of your job while still seeing it on the screen using the "tee" program:
    • nova -c myjob.fcl | tee myjob.log
  • Step 2, if the solution is not obvious, check this page to see if the solution is here.
    • For example, if you have a seg fault, you should look at the "Debugging" guides.
  • Step 3, if you are still having trouble, email .
    • This will quickly get your question to all the NOvA experts.
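
A plain pipe into tee only captures standard output, while many error messages are written to standard error. The sketch below merges the two streams before logging. The helper name run_logged is hypothetical and not part of novasoft.

```shell
# Hypothetical helper: run a command, show its output on screen, and save
# both stdout and stderr to a log file.
run_logged() {
    logfile=$1
    shift
    # 2>&1 merges stderr into stdout so that tee sees error messages too.
    "$@" 2>&1 | tee "$logfile"
}

# Example (assumes the nova executable and myjob.fcl from the text above):
#   run_logged myjob.log nova -c myjob.fcl
```

Note that the exit status of the pipeline is that of tee, not of the job; in bash, "set -o pipefail" makes it reflect the job instead.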

novasoft Troubleshooting

Investigating a job's run conditions

You can run in a debug mode which will simply print out the configuration of the job by doing:
% ART_DEBUG_CONFIG=1 nova -c myjob.fcl

To find out after the fact which values from the fhicl files were used to fill the fhicl::ParameterSet passed to a module, dump the configuration stored in the art output file:
% config_dumper art_framework_event_file.root

Checking memory usage with SimpleMemoryCheck service

The SimpleMemoryCheck service is an ART service that keeps track of when memory and swap space usage changes during a job. Instructions for using it are here.

Debugging with Allinea DDT

Just had a job quit with a seg fault? Here is how you can debug it. First, set up your test release using the debug build of NOvASoft:

srt_setup -a SRT_QUAL=debug

Now you need to rebuild your test release to have the debugging symbols in the libraries

novasoft_build -test

This information is based on DUNE's writeup.

Allinea Forge is a commercially-available suite including a debugger (ddt) and a profiling tool (map). The ddt debugger is a GUI frontend to the gdb debugging tool.

Fermilab has a few licenses available: see details below

Debugging video demos and tutorials are available at https://www.allinea.com/debugger-videos

On the interactive GPVM machines at Fermilab, it is available as a pre-installed UPS product.
There is no default version, so you have to specify the version on the setup line. It is generally deployed in /grid/fermiapp/products/local, which you may have to add to the PRODUCTS path if your experiment setup doesn't already. As an alternative, you can add the argument -z /grid/fermiapp/products/local to your UPS commands. These two commands will first have you discover the available versions, and then set one up:

ups list -aK+ allinea
setup allinea <version>

If you need to install Allinea forge on a computer that doesn't yet have it using upd, follow the instructions here: https://cdcvs.fnal.gov/redmine/projects/art/wiki/Getting_started_with_Allinea_MAP_and_DDT

The debugger is called ddt, and it has built-in help for starting it

ddt --help

in addition to the web manual (both ddt and map) at http://www.allinea.com/user-guide/forge/userguide.html

Tips

Multi-threading and MPI

DUNE found that starting it with --nompi works to get started if you get this error message,

Allinea DDT detected you are using MPI, but could not detect which implementation.

There are a few things in art that are multi-threaded, such as the messaging system and interaction with the SQLite database when running time or memory profiling services, though your program should be able to run single-threaded as well. Feel free to experiment with it.

Command line

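You will have to specify the full path name of the executable; its arguments can be given on the command line. The quickest way is to use the which command within backticks. The example below comes from the DUNE writeup and uses the lar executable; for NOvA, substitute nova and your own .fcl file: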

ddt --nompi `which lar` -c prodsingle.fcl &

Source code

It can be tricky to get ddt to find source code. Right-clicking in the Project Files window lets you add source directories. Subdirectories are searched for appropriate source, but you need to specify a full path including unique versions to get the source you want. The path to the source that was used to compile the program is stored along with the debug symbols, and Allinea checks to see if the source version matches that in the executable and shared libraries. Due to our installation procedure, this check will fail, and it is up to you to make sure that your source and executable match.

When you step into a file that Allinea does not find, it will show a button that lets you search for the source code. You can point it to the sources, which are (almost) always distributed with the UPS product. If, for example, you are trying to dig into ROOT, check where the ROOT product lives by running

echo "$ROOT_DIR"

on the terminal you started ddt from, and then browse with Allinea to find the source subdirectory of that directory.
Fortunately, once Allinea has been told where one file is, it can find the others in the same directory by itself.

Searching extra directories

If the executable or source files have been moved since compilation, the source files cannot be found automatically.

Extra directories to search for source files can be added by right-clicking whilst in the Project Files tab, and selecting Add/view Source Directory(s). You can also specify extra source directories on the command line using the --source-dirs command line argument (separate each directory with a colon).

You can also add an individual file by right-clicking in the Project Files tab and selecting the Add File option.

Any directories or files added are saved and restored when you use the Save Session and Load Session commands inside the File menu.

There doesn't seem to be a way to do a recursive check (e.g. by setting /products as the source dir), but ddt does remember the settings for a given executable from session to session. You may need to clear out the .allinea/autosaves directory if you change the version of a package you use.

Experiments can add to their environment setup an environment variable that collects all the XXX_DIR environment variables into a path-like variable with entries of the form $XXX_DIR/source/XXX, for each product where that directory exists.

Console output of the job

The output of the job goes to a window at the bottom selectable with the tab Input/Output.
The messages in red are from standard error stream, and depending on your message facility configuration they might duplicate the ones in standard output (in black).

Licenses

Fermilab has purchased a limited number of licenses for use of Allinea Forge. If you are declined access, it means that the maximum number of allowed users are already running it.

Please make sure you don't leave an Allinea instance open, since that could prevent other people from using it.

If you are declined access, please provide feedback to your computing division liaison. (Need your services account to access the link.) It may be a sign that more licenses should be purchased.

Debugging from Command-line with gdb

If a GUI is not for you, you can run your job under the gdb debugger from the command line. Here is a link to a great tutorial on getting started with gdb on YouTube: https://www.youtube.com/watch?v=xQ0ONbt-qPs&t=1210

To get started running gdb on an ART module, do as follows:

% gdb --args nova -c somejob.fcl -s somefile.root

gdb then displays its start-up messages. At the prompt type "run" or just "r":
(gdb) run

When the process reaches the point of error, gdb displays the line in the code that causes the error. In case you want a more detailed back trace, try:
(gdb) backtrace

or "bt". To quit gdb,
(gdb) quit

or "q". The above procedure would work well in case of seg-fault. In case you need to track down where your program is throwing an exception, before 'run', do:
(gdb) catch throw

and then run. When the exception occurs, gdb catches it. You can see where the exception occurred with backtrace.
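
The interactive steps above can also be run unattended, which is convenient when the output should go to a log file. This is a minimal sketch using gdb's -batch and -ex options; the helper name gdb_backtrace and the GDB override variable are hypothetical.

```shell
# Hypothetical helper: run a job under gdb non-interactively and print a
# backtrace at the first thrown exception or crash.
# The GDB variable can override the debugger executable; it defaults to gdb.
gdb_backtrace() {
    # -batch exits after the -ex commands have run; everything after
    # --args is the program to debug and its arguments.
    "${GDB:-gdb}" -batch \
        -ex 'catch throw' \
        -ex 'run' \
        -ex 'backtrace' \
        --args "$@"
}

# Example (assumes the job from the text above):
#   gdb_backtrace nova -c somejob.fcl -s somefile.root > gdb.log 2>&1
```

If the job throws exceptions that are caught and handled internally, the first stop may not be the one you are looking for; in that case the interactive session is the better tool.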

You can also break when control reaches a specific function, or line number. Just provide the full function name, or filename:linenumber to the "break" command.

$ gdb --args nova -c somejob.fcl -s somefile.root
(gdb) br foo()
Function "foo()" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (foo()) pending.
(gdb) run

gdb can't be sure what you're talking about until the libraries built from that file are loaded. So if you want to be sure that your breakpoint gets set correctly, break at the main function first and set it from there.

$ gdb --args nova -c somejob.fcl -s somefile.root
(gdb) br main
(gdb) run
...
(gdb) br MyFile.cxx:123
(gdb) cont

If you are using gdb to track down a seg-fault, it may get a bit tricky sometimes. Gdb may change the way the memory is allocated so that the seg-fault goes away when you run your job under gdb. In that case, you can start the job normally (NOT under gdb), log into the same VM and obtain the process-id of the job that you started running in the other terminal. Then start gdb and attach it to the job using the process id like this:

> gdb
(gdb) attach <pid>

For more information on gdb, you can refer to http://www.cs.cmu.edu/~gilpin/tutorial/

Debugging with Valgrind

Valgrind can be a useful debugging tool (http://valgrind.org/docs/manual/quick-start.html). The version in SL5 is laughably old, and chokes with the error DWARF2 CFI reader: unhandled CFI instruction.

Type

> setup valgrind

to get a slightly more modern one that works

> valgrind nova -c myjob.fcl myfile.root

Furthermore, ROOT provides a file that suppresses spurious valgrind output originating in ROOT code. To use it, add the --suppressions option to your valgrind command:

> valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp nova -c myjob.fcl myfile.root
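
Valgrind output is long and is mixed with the job's own messages, so it helps to send it to its own file. The sketch below combines the ROOT suppression file with valgrind's --log-file option. The helper name run_valgrind and the VALGRIND override variable are hypothetical, and ROOTSYS is assumed to be set by your novasoft setup.

```shell
# Hypothetical helper: run a job under valgrind with the ROOT suppression
# file and a separate valgrind log.
# The VALGRIND variable can override the executable; it defaults to valgrind.
run_valgrind() {
    # %p in --log-file is replaced by valgrind with the process id.
    "${VALGRIND:-valgrind}" \
        --suppressions="$ROOTSYS/etc/valgrind-root.supp" \
        --log-file=valgrind.%p.log \
        "$@"
}

# Example (assumes the job from the text above):
#   run_valgrind nova -c myjob.fcl myfile.root
```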

Recursive #include directive

This type of error message:

Failed to parse the configuration file 'job/t962g4ana.fcl' with exception
---- Recursive #include directive: BEGIN
job/simulationservices.fcl => /grid/fermiapp/lbne/lar/code/larsoft/releases/development/job/simulationservices.fcl
included from line 2 of file "./job/t962g4ana.fcl" 
---- Recursive #include directive: END

is telling you that you have included the indicated .fcl file twice somehow in your job's .fcl. Most likely one of the other included files also includes the offending file. Simply remove the offending file from your job .fcl file. The message also tells you which line of your job .fcl file to fix, in this case line 2.
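
To see which of your files pull in the offending .fcl file, you can search for the #include lines that name it. This is a sketch only: the helper name find_fcl_includes is hypothetical, and it searches a single directory tree rather than the full FHICL_FILE_PATH.

```shell
# Hypothetical helper: list every #include line that names a given .fcl
# file under a directory, so that a duplicate include can be spotted.
find_fcl_includes() {
    name=$1          # e.g. simulationservices.fcl
    dir=${2:-.}      # directory to search; defaults to the current one
    # -r recurses, -n prints line numbers, -F treats the name as a fixed
    # string; --include (a GNU grep option) restricts the search to .fcl files.
    grep -rnF --include='*.fcl' "$name" "$dir" | grep -F '#include'
}

# Example:
#   find_fcl_includes simulationservices.fcl job
```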

Service unable to find requested service with compiler type ...

The following type of error message:

%MSG-e BeginJob:  MyMod:mymod@BeginJob 18-Feb-2011 13:36:26 CST  MF-online :0
A cet::exception is going through WorkerT<EDAnalyzer>:

%MSG
%MSG-s ArtException:  MyJob::mymod@BeginJob 18-Feb-2011 13:36:26 CST  MF-online
Module failed due to an exception
---- NotFound BEGIN
 Service  unable to find requested service with compiler type name 'myserv::MyService'.
---- NotFound END

%MSG

is telling you that the framework was unable to locate the service myserv::MyService and you should check your fcl file to be sure it is defined in the services block. One potential problem is that you declared the service in the services block, but not in the user subblock. All NOvA services need to be in the user subblock.

Product Not Found

Errors of this type have the form of

---- ProductNotFound BEGIN
getByLabel: Found zero products matching all criteria
Looking for type: std::vector<rb::Track>
Looking for module label:
Looking for productInstanceName:

---- ProductNotFound END

This error is telling you exactly what the problem is: it can’t find any data product in the file that is of type std::vector<rb::Track> because no module label was provided to the getByLabel function. To resolve the issue, check that you are filling the module label value correctly and that the file contains data products of that type produced under that module label. You can check the contents of the file using the eventdump.fcl file, i.e.

$ nova -c eventdump.fcl file.root

Insert failure

This error happens when a module attempts to put a collection of objects into the art::Event record, but that product has not been registered by the module. An example is:

---- EventProcessorFailure BEGIN
  An exception occurred during current event processing
  ---- ScheduleExecutionFailure BEGIN
    ProcessingStopped.

    ---- InsertFailure BEGIN
      Illegal attempt to retrieve an unregistered product.
      No product is registered for
        process name:                'Skimming'
        module label:                'skimmer'
        product friendly class name: 'me::SlcMEs'
        product instance name:       'nue'
        branch type:                 'Event'
      Registered products currently:

The object doing the putting was the NueSkimmer. The fix was to add produces() calls in the DataProductSkimmer_module.cc for me::SlcME and me::TrkME along with associations between those objects and the rb::Cluster representing the slice.

Undefined symbol errors

These errors typically have messages of the form:

%MSG-s cet::exception: NOvARawInputSource:source{*ctor*} 13-Oct-2010 15:45:12 CDT pre-events
art::exception caught in mute
---- PluginLibraryLoadError BEGIN
unable to load libXXX_module.so because /nova/app/users/brebel/artsrt/test/lib/Linux2.6-GCC/libRawDataUtils.so: undefined symbol: _ZN3art14RawInputSource15getNextItemTypeEv
Error occurred while creating source XXX
---- PluginLibraryLoadError END

%MSG

First Possible Solution

Notice that an undefined symbol was named, _ZN3art14RawInputSource15getNextItemTypeEv. (The word "Error" that follows it in the output is the start of the next message, not part of the symbol.)

This problem is the result of a missing library in the link list for building libRawInputSource_source.so. To figure out what the problem is, you need to identify the library that the symbol comes from, i.e. what the symbol means. To figure out what the symbol is, do

% c++filt symbol

For this example

% c++filt _ZN3art14RawInputSource15getNextItemTypeEv
art::RawInputSource::getNextItemType()

We see that the symbol is art::RawInputSource::getNextItemType(), a symbol defined in the art::RawInputSource. The solution is to put the correct library, in this case libart_Framework_IO_Sources.so into the link list in the GNUmakefile,

LIBLINK := $(LOADLIBES) -L$(FRAMEWORK_DIR)/sl5.x86_64/lib -lart_Framework_IO_Sources -L$(SRT_PRIVATE_CONTEXT)/lib/$(SRT_SUBDIR) -L$(SRT_PUBLIC_CONTEXT)/lib/$(SRT_SUBDIR) -l$(PACKAGE)
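
When a log contains several of these messages, the mangled names can be pulled out and demangled in one go. The sketch below assumes the "undefined symbol: <name>" wording shown in the error above; the helper name demangle_undefined is hypothetical.

```shell
# Hypothetical helper: extract every mangled name that follows
# "undefined symbol: " in a log file and demangle it with c++filt.
demangle_undefined() {
    # sed keeps only the first space-free word after "undefined symbol: ";
    # sort -u removes repeats; c++filt demangles names read from stdin.
    sed -n 's/.*undefined symbol: \([^ ]*\).*/\1/p' "$1" | sort -u | c++filt
}

# Example:
#   demangle_undefined myjob.log
```

Names that the loader has already demangled (and which therefore contain spaces) are cut at the first space, so this is only useful for mangled symbols.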

Second Possible Solution

Perhaps the undefined symbol is because the symbol is declared in the header file of the object in question but not implemented in the .cxx file. If that is the case, either implement the method in the .cxx file or remove it from the header file.

Third Possible Solution

Maybe the package library needs to have the library containing the unresolved symbol linked in when it is built. Do that by adding

override LIBLIBS += -L$(ENV_VARIABLE_POINTING_TO_LIBRARY) -lXXX

where XXX is the name of the missing library and ENV_VARIABLE_POINTING_TO_LIBRARY is the location of the missing library.

After doing any of the above you need to do a clean build of your package.

Fourth Possible Solution

It may be that the package in question still inherits from TObject. See Converting Data Objects from FMWK for details on how to fix this.

No dictionary for class

Errors of this kind are produced at run time and typically look like

%MSG-s cet::exception:  PostModule 14-Oct-2010 16:54:39 CDT Run: 1 Event: 1
cet::exception caught in nova
---- EventProcessorFailure BEGIN
EventProcessingStopped
---- ScheduleExecutionFailure BEGIN
ProcessingStopped
---- DictionaryNotFound BEGIN
NoMatch TypeID::className: No dictionary for class St6vectorIPN4simb7MCTruthESaIS2_EE
cet::exception going through module NeutrinoAna/neutrinoana run: 1 event: 1
---- DictionaryNotFound END
Exception going through path doit
---- ScheduleExecutionFailure END
an exception occurred during current event processing
cet::exception caught in EventProcessor and rethrown
---- EventProcessorFailure END

%MSG

The error is the result of not linking the proper libraries for your _module.so or _service.so file. The solution is to figure out first what the symbol means. This is a mangled type name rather than a function symbol, so c++filt needs the -t option:

$ c++filt -t St6vectorIPN4simb7MCTruthESaIS2_EE
std::vector<simb::MCTruth*, std::allocator<simb::MCTruth*> >

So the symbol is a vector of pointers to simb::MCTruth objects, thus you need the library corresponding to the simb namespace, i.e. libSimulationBase.so.

Now add the library to the LIBLIBS list in your GNUmakefile. The GNUmakefile should have a line like

LIBLIBS := $(LOADLIBES) -L$(SRT_PRIVATE_CONTEXT)/lib/$(SRT_SUBDIR) -L$(SRT_PUBLIC_CONTEXT)/lib/$(SRT_SUBDIR) -lXXX

in it already. Just add

-lSimulationBase

to the end of the line and do a clean build of your code, and you should be ready to go.

These errors also come about if the class is not properly defined in your package's LinkDef.h, classes.h, or classes_def.xml files.

Unable to load requested library

Errors of this kind are produced at run time and typically look like:

 %MSG-i MF_INIT_OK:  nova 11-Oct-2011 09:28:55 CDT JobSetup
 Messagelogger initialization complete.
 %MSG
 terminate called after throwing an instance of 'cet::coded_exception<art::errors::ErrorCodes, &(art::detail::translate(art::errors::ErrorCodes))>'
 what():  ---- Configuration BEGIN
 Unable to load requested library /nova/app/users/raddatz/arttest/lib/Linux2.6-GCC/libMCCheater_dict.so
 /nova/app/users/raddatz/arttest/lib/Linux2.6-GCC/libMCCheater_dict.so: undefined symbol: vtable for rb::Prong
---- Configuration END

The error is the result of not linking the proper libraries in the GNUmakefile. The solution is to figure out first what the undefined symbol is:

rb::Prong

So the symbol is a rb::Prong object, thus you need a link to the library corresponding to the rb namespace (also see next item).

Now add the library to the LIBLIBS list in your GNUmakefile. The GNUmakefile should have a line like

LIBLIBS := $(LOADLIBES) -L$(SRT_PRIVATE_CONTEXT)/lib/$(SRT_SUBDIR) -L$(SRT_PUBLIC_CONTEXT)/lib/$(SRT_SUBDIR) -lXXX

in it already. Just add

-lRecoBase

to the end of the line and do a clean build of your code, and you should be ready to go. If the problem persists, check that your local lib folder is up to date e.g. by doing

ls -lrt

If the culprit library is the oldest one in that folder, delete it and rebuild.

Another example of this inability to load requested libraries is:

cet::exception caught in art
---- Configuration BEGIN
 Unable to load requested library /grid/fermiapp/nova/novaart/novasoft/releases/S12-10-04/lib/Linux2.6-GCC-debug/libMagneticField_service.so
 libG4analysis.so: cannot open shared object file: No such file or directory
---- Configuration END

This example is usually seen when executing a job on the GRID, and is not seen while running interactively. It is caused by the job setting up an older version of novasoft instead of sourcing the current setup scripts. For those running jobs on the GRID, make sure the script contains:

echo "source /grid/fermiapp/nova/novaart/novasoft/srt/srt.sh" >> $jobcmd
echo "export EXTERNALS=/nusoft/app/externals" >> $jobcmd
echo "source /grid/fermiapp/nova/novaart/novasoft/setup/setup_novasoft.sh" >> $jobcmd

And remove any other sourcing of novasoft to prevent confusion.

Finding the source for an undefined symbol

Does your code not link, or a library not load, because of an undefined symbol? Often it is enough to understand what the "demangled" name is. Does the name look weird, like:

_ZN3cet9exceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKS0_

That's because it's "mangled" (a compiler-dependent encoding scheme for mapping C++ entities to an internal string). Try:
$ c++filt _ZN3cet9exceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKS0_
cet::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cet::exception const&)

This makes it clear that it's something related to cet::exception. If you know which library holds that code then you'll need to link it in.

But what if you don't know where the code lives? Then you might turn to find_global_symbol.sh to do the tedious work of tracking it down.

$ ~rhatcher/bin/find_global_symbol.sh _ZN3cet9exceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKS0_
Searching for mangled symbol '_ZN3cet9exceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKS0_'
Found in path /cvmfs/nova.opensciencegrid.org/externals/cetlib_except/v1_02_00/slf6.x86_64.e15.debug/lib/...
    Found in libcetlib_except.so
        Entry: 35:0000000000008588 T _ZN3cet9exceptionC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKS0_
        Translates to 0000000000008588 T cet::exception::exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cet::exception const&)

Another example based on the previous item:

$ ~rhatcher/bin/find_global_symbol.sh -f -d "vtable for rb::Prong" 
Searching for demangled symbol 'vtable for rb::Prong'
nm: libgcc_s.so: File format not recognized
nm: libgcc_s.so: File format not recognized
Skipping ./lib/Linux2.6-GCC-debug
Found in path /cvmfs/nova-development.opensciencegrid.org/novasoft//releases/development/lib/Linux2.6-GCC-debug/...
    Found in libRecoBase.so
        Entry: 488:0000000000229b98 V vtable for rb::Prong
        Translates to 0000000000229b98 V _ZTVN2rb5ProngE

Note that  U       <symbol>    means the symbol is undefined (required) here
           T, W, V <symbol>    is defined here

See info here: find global symbol script

The script find_global_symbol.sh is available from various locations:

  • setup the larutils package from LArSoft
  • download from the above linked page
  • also found at ~rhatcher/bin/find_global_symbol.sh (probably easiest for NOvA-ians; just use the whole path on the command line)

One of the export sub-branches is not present in the import TTree

Errors of this type are usually the result of trying to output an object from an input file whose data member list has changed since the file was created. There are two ways to configure your job .fcl file to avoid the error:

  1. Add the "outputCommands" option to the output stream configuration:
    outputs:
    {
     out1:
     {
        module_type: RootOutput
        fileName:    "tracks.root" 
        outputCommands: [ "keep *", "drop sim::Particles_geant_*_GenieGen" ]
     }      
    }
    

    This will configure your job to simply not write out the offending object, in this case sim::Particles created by module labeled geant in process GenieGen.
  2. Add the "fastCloning" option to the output stream configuration:
    outputs:
    {
     out1:
     {
        module_type: RootOutput
        fileName:    "tracks.root" 
        fastCloning: false
     }      
    }
    

    This will write the object out to the output file and add the missing data member along the way. It is a slower option than the previous option.

The export branch and the import branch do not have the same streamer type

Add the "fastCloning" option to the output stream configuration:

outputs:
{
 out1:
 {
    module_type: RootOutput
    fileName:    "tracks.root" 
    fastCloning: false
 }      
}

This will write the object out to the output file without fast cloning. It is slower than writing with fast cloning enabled.

Failed to parse the configuration file

The FHICL parser will tell you if there are errors in your configuration file. One example is

%MSG-s ArtException:  nova 27-May-2011 10:45:46 CDT JobSetup
Failed to parse the configuration file 'job/somejob.fcl' with exception 
---- Can't find key BEGIN
some_parameter (at part "some_parameter")
---- Can't find key END

%MSG

The message is telling you that the configuration file (ie somejob.fcl) is missing a definition of the key some_parameter. The solution is to make sure the key is defined somewhere, possibly in a #include file.

Can't find key BEGIN

These errors have the form

%MSG-s ArtException:  nova 27-May-2011 10:45:46 CDT JobSetup
cet::exception caught in art
---- Can't find key BEGIN
 some_parameter
---- Can't find key END
%MSG

and are telling you that the configuration file (ie xxx.fcl) is missing a definition of the key some_parameter. The solution is to make sure the key is defined somewhere, possibly in a #include file.

SQLExecutionError BEGIN

These errors have the form

%MSG-s ArtException:  nova 04-Oct-2016 11:38:40 CDT JobSetup
cet::exception caught in art
---- OtherArt BEGIN
 ServiceCreation
 ---- SQLExecutionError BEGIN
   database or disk is full
 ---- SQLExecutionError END
 cet::exception caught during construction of service type art::TimeTracker:
---- OtherArt END
%MSG
Art has completed and will exit with status 1.

and they are telling you that the /var/tmp is full for that machine as shown below:

<novagpvm14.fnal.gov> df -h | ack /var
/dev/mapper/rootvg-system_var      7.9G  7.5G     0 100% /var
/dev/mapper/rootvg-system_cvmfs    9.9G  855M  8.6G   9% /var/cache/cvmfs2

Ideally, setup_nova cleans this directory automatically, but if that doesn't work, the alternatives are for the user(s) responsible to remove their tmp files, or for you to use another novagpvm machine.
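
Before asking others to clean up, you can check how much of the space is taken by your own files. This is a minimal sketch; the helper name my_tmp_usage is hypothetical.

```shell
# Hypothetical helper: show how full the filesystem holding a directory is
# and list your own files there, largest first (sizes in KiB).
my_tmp_usage() {
    dir=${1:-/var/tmp}
    df -h "$dir"
    # -xdev stays on one filesystem; -user selects files owned by you.
    find "$dir" -xdev -user "$(id -u)" -type f -exec du -k {} + 2>/dev/null |
        sort -rn | head -n 20
}

# Example:
#   my_tmp_usage /var/tmp
```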

Type mismatch / narrowing conversion

These can be caused by code such as

    fZ0 = pset.get<int>("Z0");

and a .fcl file that has:
  Z0:  202.3

While both the value 202.3 and the destination variable are of double type, there is an attempt to squeeze it through an intermediate int.

The C++0x standard (which seems to be enabled w/ gcc 4.6.1) is more pedantic about type conversions that will lose precision.

For reference, see http://www2.research.att.com/~bs/C++0xFAQ.html#narrowing or the draft standard 8.5.4

A narrowing conversion is an implicit conversion
  • from a floating-point type to an integer type, or
  • from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly), or
  • from an integer type or unscoped enumeration type to a floating-point type, except where the source is a constant expression and the actual value after conversion will fit into the target type and will produce the original value when converted back to the original type, or
  • from an integer type or unscoped enumeration type to an integer type that cannot represent all the values of the original type, except where the source is a constant expression and the actual value after conversion will fit into the target type and will produce the original value when converted back to the original type.

REFLEX: Attempt to change the size of the class...

If your job fails with an error that looks like this:

%MSG
terminate called after throwing an instance of 'Reflex::RuntimeError'
  what():  REFLEX: Attempt to change the size of the class cheat::SimHit
Aborted 

It means that there's a disagreement between different parts of your release about what size some object is (in this case, SimHit). Assuming this isn't happening to everyone, you need to figure out what part of your test release is to blame (likely out of date).

In this case, the MCCheater package (where SimHit is defined) was itself fine. You can get the effect of building your package against just the base release by removing all the files under lib/ in your test release, and by removing (or moving out of the way) the links under include/. In this kind of case paranoia is your friend: rebuild packages with "make clean && make" to ensure they pick up all the details of their new environment properly. I then added packages back one by one (relinking them into include/ and building them clean) until I found the culprit. Here, RawDigit (the grandparent class of SimHit) was out of date, and a simple "cvs up" fixed it. You might also need to check that ClassVersion fields in xml files were updated when they should have been.

Calibrator cannot find at least one shape table file

If your job fails with an error that looks like this:

---- OtherArt BEGIN
 ServiceCreation
 ---- Calibrator BEGIN
   Cannot find at least one shape table file: 
   FDMC:   
   FDData: 
   NDMC:   
   NDData: 
 ---- Calibrator END
 cet::exception caught during construction of service type calib::Calibrator:
---- OtherArt END

It looks like you are not setting up the novasoft ups product. Be sure to have in your bash_profile a function that includes:

source /nova/app/home/novasoft/nova_offline_software/externals/setup
setup mrb
source localProducts_nova_develop_eXX_sYY_grid_prof/setup
cd $MRB_BUILDDIR
mrbsetenv

And to have set up the novasoft ups:

 setup novasoft develop -q eXX:sYY:buildtype 

CalibUtils: too many cell surfaces intersected by trajectory

When running a job fcl you get the following exception

%MSG-s ArtException:  PostCloseFile 29-Nov-2016 12:52:16 CST PostEndRun
cet::exception caught in art
---- EventProcessorFailure BEGIN
 An exception occurred during current event processing
 ---- ScheduleExecutionFailure BEGIN
   ProcessingStopped.

   ---- CalibUtils BEGIN
     too many cell surfaces intersected by trajectory: 3
     cet::exception going through module SomeModule/somemodule run: 18027 subRun: 33 event: 745
   ---- CalibUtils END
   Exception going through path makecaf
 ---- ScheduleExecutionFailure END
 cet::exception caught in EventProcessor and rethrown
---- EventProcessorFailure END
%MSG

Status: investigating (11/26/2016).

sh: -c: line 0: syntax error near unexpected token `('

sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `svnserve --tunnel-user (null)  -t' 

This error appears when trying to add a package, or when attempting to install a new release.
It is caused by a Kerberos authentication problem. Make sure you have the necessary GSSAPI options in your .ssh/config file. See Using_NOvASoft_on_the_GPVM_nodes.
Make sure those options apply to the host cdcvs.fnal.gov. (The link above achieves this by wildcarding the whole of fnal.gov.)

no ServiceRegistry has been set for this thread

Errors of the form

terminate called after throwing an instance of 'cet::coded_exception<art::errors::ErrorCodes, &(art::ExceptionDetail::translate(art::errors::ErrorCodes))>'
  what():  ---- NotFound BEGIN
  Service  no ServiceRegistry has been set for this thread
---- NotFound END

Aborted

indicate that there is a service that is not configured properly. This may mean that a service is being used which has not been included in the services block of the .fcl file.

It could also mean that you are attempting to use a service in the constructor of a stored object, which is a bad idea and should not be done. Avoiding this resolves the issue. This includes not making global calls to service handles.

unknown branch -> BranchIDLists

This is one (of several?) possible errors obtained when attempting to open a file produced in ART 2 using ART 1.

cet::exception caught in art
---- FatalRootError BEGIN
  Fatal Root Error: @SUB=TTree::SetBranchAddress
  unknown branch -> BranchIDLists
---- FatalRootError END

Solution: be sure to use consistent software releases.

GENIE_3665/src/make/Make.config: line x: SOMETHING_DIR: command not found

These errors will occur for offsite builds of novasoft using the prebuilt version of geant. The reason is the syntax used in the Make.config file for path variables, together with the fact that packages using genie parse this file. These are not build errors: as long as they are the only "errors" you get, the build has still succeeded. However, these messages make it extremely tedious to read the output of the build script and spot real errors.

To fix this you will have to change the 8 offending lines in the Make.config file. These are path variables, so replace them with absolute paths. Replace $EXTERNALS with the absolute path for your install:

GENIE_INSTALLATION_PATH=$EXTERNALS/genie/v3665a/Linux64bit+2.6-2.5-e2-debug
GOPT_WITH_PYTHIA6_LIB=$EXTERNALS/pythia/v6_4_26/Linux64bit+2.6-2.5-gcc47-debug/lib
GOPT_WITH_LHAPDF_LIB=$EXTERNALS/lhapdf/v5_8_8/Linux64bit+2.6-2.5-e2-debug/lib
GOPT_WITH_LHAPDF_INC=$EXTERNALS/lhapdf/v5_8_8/Linux64bit+2.6-2.5-e2-debug/include
GOPT_WITH_LIBXML2_INC=$EXTERNALS/libxml2/v2_8_0/Linux64bit+2.6-2.5-gcc47-debug/include/libxml2
GOPT_WITH_LIBXML2_LIB=$EXTERNALS/libxml2/v2_8_0/Linux64bit+2.6-2.5-gcc47-debug/lib
GOPT_WITH_LOG4CPP_INC=$EXTERNALS/log4cpp/v1_1/Linux64bit+2.6-2.5-e2-debug/include
GOPT_WITH_LOG4CPP_LIB=$EXTERNALS/log4cpp/v1_1/Linux64bit+2.6-2.5-e2-debug/lib

Of course, if you're having the problem with the prof/maxopt build then change the file in the -prof package with the appropriate paths instead.
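
The substitution described above can be scripted rather than done by hand. The following is a minimal sketch, assuming the offending lines contain the literal text $EXTERNALS and that the install path contains no "|" or "&" characters. The function name expand_externals is hypothetical and not part of novasoft.

```shell
# Replace every literal "$EXTERNALS" in a file with an absolute path.
# Usage: expand_externals <file> <absolute path of the externals install>
expand_externals() {
    file="$1"
    externals_path="$2"
    # Keep a backup of the original next to the edited file.
    cp "$file" "$file.bak" || return 1
    # '|' is the sed delimiter because paths contain '/';
    # '\$' matches a literal dollar sign rather than end-of-line.
    sed 's|\$EXTERNALS|'"$externals_path"'|g' "$file.bak" > "$file"
}
```

Run it once on the Make.config of the affected build, and compare against the .bak copy before rebuilding.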

classes.h:X: error: explicit instantiation of 'struct art::Ptr<namespace::class>' before definition of template

In the classes.h files, a template must be defined before you explicitly instantiate it for a class, so make sure you include the corresponding header file first. In this case

#include "art/Persistency/Common/Ptr.h" 

fixes the problem.

IOManip Errors

Errors of the form:

‘setiosflags’ is not a member of ‘std’
‘setprecision’ is not a member of ‘std’
‘setw’ is not a member of ‘std’

are confusing because these functions really are members of the std namespace. However, they are declared in a header that your code has not included, so the compiler cannot see them. Include the header that declares them:
#include <iomanip>

Type Errors

Errors of the form:

error: 'uint8_t' does not name a type

are similar to the ones above: the code is simply missing the correct include directive. Adding
#include <stdint.h>

fixes the problem.

Message Logger errors

error: ‘LOG_DEBUG’ was not declared in this scope

The code is missing the messagefacility/MessageLogger header file. Adding
#include "messagefacility/MessageLogger/MessageLogger.h" 

fixes the problem. For more information on how to properly use the MessageLogger go here.

I can't commit to a nusoft package

You need to be added to the separate nusoft committers list, and promise to be very careful.

If you're already on that list, you probably checked it out wrong. Correct syntax is:

addpkg_svn -h -d svn+ssh://p-nusoftart@cdcvs.fnal.gov/cvs/projects/nusoftsvn -s nutools <package name>

Segfault after modifying GeometryBase class

You need to checkout and recompile the Geometry package as well.

SVN issues

local edit, incoming delete upon update

This happens when you edit a file while someone else deletes it and commits first. As a good svn citizen you do an update before a commit, and now you have a conflict. Realising that deleting the file is the right thing to do, you delete the file from your working copy. Instead of being content, svn now complains that the local files are missing, in addition to the conflicting update which ultimately wants to see the files deleted. To get out of this state, recreate the files, revert them, and then delete them again, as in the last three commands below:

$ svn st
!  +  C foo
      >   local edit, incoming delete upon update
!  +  C bar
      >   local edit, incoming delete upon update
$ touch foo bar
$ svn revert foo bar
$ rm foo bar

Bad Owner or Permissions on ~/.ssh/config

If, when trying to do anything with svn, you get the following message:

Bad owner or permissions on ~/.ssh/config
svn: To better debug SSH connection problems, remove the -q option from 'ssh' in the [tunnels] section of your Subversion configuration file.
svn: Network connection closed unexpectedly

svn fails because ssh refuses to use a config file whose permissions are too open. This can be resolved by running:

chmod 600 ~/.ssh/config

This properly sets the permissions for your ssh config so that it's only readable/writable by you. After doing that, try running your svn command again.
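
If you want to check the permissions before changing them, the following is a minimal sketch. The function name fix_ssh_config_perms is hypothetical; it only wraps the chmod command given above and reads the mode string from ls.

```shell
# Report and, if needed, tighten the permissions on an ssh config file.
# Usage: fix_ssh_config_perms [path]   (defaults to ~/.ssh/config)
fix_ssh_config_perms() {
    cfg="${1:-$HOME/.ssh/config}"
    [ -f "$cfg" ] || { echo "no such file: $cfg" >&2; return 1; }
    # The first ten characters of `ls -ld` are the mode string, e.g. -rw-------
    mode=$(ls -ld "$cfg" | cut -c1-10)
    if [ "$mode" != "-rw-------" ]; then
        echo "changing $cfg from $mode to -rw-------"
        chmod 600 "$cfg"
    fi
}
```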

assert(fPECorr >= 0 && "Not calibrated")

This is telling you, surprise, that the cell does not have good calibration constants. You MUST check RecoHit::IsCalibrated() before trying to use any of the calibrated values: PECorr, MIP, or GeV.

Assertion 'fGeV >=0 && "Not Calibrated"'

Similar to above. The solution is to include the following lines of code:

// Defines an example of recohit
rb::RecoHit rhit = track->RecoHit(cellhit);
//Not Calibrated Aborted error solution
if(!rhit.IsCalibrated()) continue;

Test release modifications apparent running interactively, but not on the grid

This is probably because you set up the maxopt (prof) version of novasoft, but did not compile your test release in maxopt. The linker failed to find your libraries in $SRT_PRIVATE_CONTEXT/lib/lib/Linux2.6-GCC-maxopt/ and fell back to $SRT_PUBLIC_CONTEXT/lib/lib/Linux2.6-GCC-debug/. You can resolve this problem by compiling your test release in maxopt. To do this, setup novasoft with "-b maxopt" and then compile your test release.

Fatal Root Error: @SUB=TSystem::ExpandFileName

A grid job may fail with an error message that looks like this:

terminate called after throwing an instance of 'cet::coded_exception<art::errors::ErrorCodes, &art::ExceptionDetail::translate>'
    what():  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TSystem::ExpandFileName
    input: $HOME/.root.mimes, output: $HOME/.root.mimes
---- FatalRootError END

The solution is apparently to add

export DISPLAY=localhost0.0

to your job script.

If anyone figures out what's actually going on here, and how to stop ROOT doing it in the first place, that would be great!

SSL Handshake error

139975637681992:error:14094418:SSL routines:SSL3_READ_BYTES:tlsv1 alert unknown ca:s3_pkt.c:1259:SSL alert number 48
139975637681992:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:184:

This is a sign that a proxy has gotten mangled. To fix it, force a new proxy to be generated by doing the following:

setup -t NovaGridUtils
setup_fnal_security --force

TVector3 and caf::SRVector3D Warnings and errors (SRMRCCParent SRTrkME SRVertexDT)

Several different errors can appear when using CAF files from R17-03-01-prod3reco.*:

Fatal in <TBranchElement::InitializeOffsets>: Could not find the real data member 'vtx.fUniqueID' when constructing the branch 'vtx.vdt' 
Likely an internal error, please report to the developers.

error: 'const struct caf::SRVector3DProxy' has no member named 'fZ'

Warning in <TStreamerInfo::BuildOld>: Cannot convert caf::SRVertexDT::vtx from type:TVector3 to type:caf::SRVector3D, skip element
Warning in <TStreamerInfo::BuildOld>: Cannot convert caf::SRMRCCParent::muonstart from type:TVector3 to type:caf::SRVector3D, skip element

Fatal Root Error: @SUB=TStreamerInfo::BuildOld
        Cannot convert caf::SRMRCCParent::muonstart from type:TVector3 to type:caf::SRVector3D, skip element
        cet::exception going through module RootOutput/out1 run: 24942 subRun: 3 event: 1

These errors were introduced by revision r24831: there are three instances in StandardRecord where the change from TVector3 to caf::SRVector3D did not happen in time for production. The solution is either to use the same release the files were produced with (and, for example, use FileReducer), or to use a tweaked StandardRecord in your test release.

If you are not using the variables that did not get switched to caf::SRVector3D in time, but you are seeing this error, you can add these lines back into classes_def.xml in StandardRecord and it will suppress the error: https://cdcvs.fnal.gov/redmine/projects/novaart/repository/revisions/26799/diff/trunk/StandardRecord/classes_def.xml

Grid running troubleshooting

WARNING: File /fife/local/scratch/uploads/nova/... is not writable by condor.

mkdir: cannot create directory `/fife': Permission denied
chgrp: cannot access `/fife/local/scratch/uploads/nova/<username>': No such file or directory
chmod: cannot access `/fife/local/scratch/uploads/nova/<username>': No such file or directory

Your jobs are fine and you can safely ignore this message. This is some internal error which occurs when a user submits too many jobs at one time. Somehow the system uses up all of the available file handles and condor can't touch any more files.

ifdh: /grid/fermiapp/products/nova/externals/gcc/v4_8_2/Linux64bit+2.6-2.12/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ifdh)

This occurs when ifdhc has not been set up in your grid job. Add the following to your script:

setup ifdhc v1_8_5

SysError in <TFile::ReadBuffer>: error reading from file

root [0] 
Attaching file <filename>.root as _file0...
SysError in <TFile::ReadBuffer>: error reading from file <filename>.root (Input/output error)
Error in <TFile::Init>: <filename>.root failed to read the file type data.

These errors occur when trying to open a file on pnfs directly in ROOT. You cannot open dCache files directly; instead, use xrootd as described on the Grid Running page.
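
As an illustration of what an xrootd URL looks like, the following is a minimal sketch that rewrites a /pnfs path into the root:// form. The door host (fndca1.fnal.gov) and the /pnfs/fnal.gov/usr prefix are assumptions copied from the log output quoted in the "Internal timeout" entry on this page, and the function name pnfs_to_xrootd is hypothetical. The Grid Running page remains the reference for the supported method.

```shell
# Convert a /pnfs/<experiment>/... path into an xrootd URL.
# The door host and the /pnfs/fnal.gov/usr prefix are assumptions taken
# from the log output quoted elsewhere on this page.
pnfs_to_xrootd() {
    case "$1" in
        /pnfs/fnal.gov/usr/*) echo "root://fndca1.fnal.gov/$1" ;;
        /pnfs/*) echo "root://fndca1.fnal.gov//pnfs/fnal.gov/usr/${1#/pnfs/}" ;;
        *) echo "not a pnfs path: $1" >&2; return 1 ;;
    esac
}
```

The resulting URL can then be handed to ROOT in place of the bare /pnfs path.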

SSL, pycurl and Proxy errors

Some example error messages include:

SSL error: unable to open private key file
'cannot obtain credentials for protocol: Secgsi: ErrParseBuffer: error getting user proxies: kXGS_init: unable to get protocol object.'
ERROR: Couldn't find valid credentials to generate a proxy.
HTTP response:0 PyCurl Error 77: Problem with the SSL CA cert (path? access rights?)

These are all signs of an authentication issue. Try following the instructions at the beginning of Finding Data with SAM. You may need to create the .globus directory in your home directory.

Library Specification "<your module>": does not correspond to any library in LD_LIBRARY_PATH of type "module"

This error occurs when running in a different version of novasoft than the one used to build your test release. For example, using the regular version locally but maxopt when submitting on the grid.

550 File exists error

If the output of the *.out file after retrieving it has the string

error: globus_ftp_client: the server responded with an error
550 File exists

ifdh cp failed at: Thu Jan 11 19:46:25 2018

Thu Jan 11 19:46:25 UTC 2018 ./cafe_grid_script.sh COMPLETED with exit status 3

that error just means there's already a file at the output path it's trying to write to. Simply create a new output folder with a different name and give it a try. Things are configured so that you cannot overwrite files by accident.
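
One way to avoid reusing an output folder is to pick the name automatically. The following is a minimal sketch, assuming the output area is visible as an ordinary directory from where you submit; the function name next_free_dir and the _1, _2, ... suffix scheme are hypothetical.

```shell
# Print the first of <base>, <base>_1, <base>_2, ... that does not exist yet.
# Usage: next_free_dir <base path>
next_free_dir() {
    base="$1"
    candidate="$base"
    n=0
    while [ -e "$candidate" ]; do
        n=$((n + 1))
        candidate="${base}_${n}"
    done
    echo "$candidate"
}
```

The function only prints a name; creating the directory and passing it to your submission command is left to you.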

Submitting grid jobs with an old release

This will often produce errors like:

Error encountered when setting up product: totalview
ERROR: Product 'totalview' (with qualifiers ''), has no v8_9_0a version (or may not exist)
Error encountered when setting up product: fife_utils
ERROR: Product 'fife_utils' (with qualifiers ''), has no current chain (or may not exist)
/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft/releases/S16-12-07/bin/Linux2.6-GCC-maxopt/setup_testrel: line 9: pushd: /nova/app/users/ynitin/art/S16-12-07: No such file or directory
No GNUmakefile in current or parent directory. Assuming no SoftRelTools.
srt_int_info failed.
/cvmfs/nova.opensciencegrid.org/novasoft/slf6/novasoft/releases/S16-12-07/bin/Linux2.6-GCC-maxopt/setup_testrel: line 11: popd: directory stack empty

or

Traceback (most recent call last):
  File "/cvmfs/nova.opensciencegrid.org/externals/NovaGridUtils/v03.19/NULL/bin/runNovaSAM.py", line 5, in <module>
    import samweb_client, ifdh
ImportError: No module named samweb_client

Details on this page: Running Jobs on Grid using Older release

ERROR in submit_cafana.py /pnfs/nova/scratch/users/<you> is not group writable, but should be

Make your area group writable by issuing the command

chmod g+w /pnfs/nova/scratch/users/<you>

where you replace <you> with your username.
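
To check the permission before submitting, the following is a minimal sketch. The function name is_group_writable is hypothetical; it reads the group-write bit from the ls mode string.

```shell
# Succeed if the group-write bit is set on the given path.
# Usage: is_group_writable <path>
is_group_writable() {
    # Character 6 of the `ls -ld` mode string (e.g. drwxrwxr-x) is the
    # group write bit.
    [ "$(ls -ld "$1" | cut -c6)" = "w" ]
}
```

For example, running is_group_writable on your scratch area followed by "|| chmod g+w" on the same path only changes the permissions when needed.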

Miscellaneous

warning: Clock skew detected. Your build may be incomplete.

This warning message can occur from time to time during code compilation on the nodes. It indicates that the client and server clocks are not in sync. Usually it gives a time of milliseconds or less and a message like:

make[2]: Warning: File "blah" has modification time 0.0023 s in the future

so unless you managed to change any of your files within that time discrepancy you can safely ignore this. If the skew is large (multiple seconds to minutes) then there's a real problem, but this is highly unlikely. For the most part this is a harmless warning message.
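
If you want to see which files make is complaining about, the following is a minimal sketch that lists files whose modification time is later than the local clock. The function name files_from_future is hypothetical; it compares against a freshly created temporary file, so it assumes the temporary directory uses the local clock.

```shell
# List files under a directory whose modification time is later than "now".
# Usage: files_from_future [dir]   (defaults to the current directory)
files_from_future() {
    dir="${1:-.}"
    # A freshly created file carries the current time as its mtime.
    ref=$(mktemp) || return 1
    find "$dir" -type f -newer "$ref"
    rm -f "$ref"
}
```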

Internal timeout(error code: 3012)

Let's suppose you have substantial CAFs in your pnfs scratch area and you have also created a dataset from them. After running your script to fill some spectra you encounter the following annoying error

[======================================================>   ] 4m15s    161130 11:52:36 3901 Xrd: CheckErrorStatus: Server [fndca1.fnal.gov] declared: Internal timeout(error code: 3012)
161130 11:52:36 3901 Xrd: Open: Error opening the file /pnfs/fnal.gov/usr/nova/scratch/users/jasq/HadCellsEdgeCAFs-NonSwap/f1e1b132-bb62-4679-8957-31e503fd8fbd-fardet_genie_nonswap_nog\
enierw_fhc_v08_1000_r00015345_s60_c000_development_v1_20160212_103458_sim.caf.root on host fndca4a.fnal.gov:1094
Error in <TXNetFile::CreateXClient>: open attempt failed on root://fndca1.fnal.gov//pnfs/fnal.gov/usr/nova/scratch/users/jasq/HadCellsEdgeCAFs-NonSwap/f1e1b132-bb62-4679-8957-31e503fd8\
fbd-fardet_genie_nonswap_nogenierw_fhc_v08_1000_r00015345_s60_c000_development_v1_20160212_103458_sim.caf.root
root.exe: FileListSource.cxx:101: virtual TFile* ana::FileListSource::GetNextFile(): Assertion `fFile' failed.

According to SCD folks, this might be related to heavy load on the scratch pools. People are investigating. An empirical way to fix this is by exporting the following variable

export XRD_CONNECTIONRETRY=32

and then re-running your script.
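
If you would rather not change the variable for your whole session, the following is a minimal sketch that sets it for a single command only. The wrapper name with_xrd_retry is hypothetical; the value 32 is the one suggested above, and an already exported XRD_CONNECTIONRETRY takes precedence.

```shell
# Run one command with a larger xrootd connection retry count, without
# changing the rest of the shell session.
# Usage: with_xrd_retry <command> [args...]
with_xrd_retry() {
    env XRD_CONNECTIONRETRY="${XRD_CONNECTIONRETRY:-32}" "$@"
}
```

Prefix your usual command with with_xrd_retry to apply the setting to that run alone.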

HTTP response:0 PyCurl Error 56: SSL read: errno -12224

This error appears when you try to run any command related to the jobsub interface, e.g. jobsub_q --user=<username>, and looks like this:

File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_2/NULL/jobsub_q", line 245, in <module>
    sys.exit(main(sys.argv))
  File "/grid/fermiapp/products/common/db/../prd/jobsub_client/v1_2_2/NULL/jobsub_q", line 227, in main
    js_client = JobSubClient(options.jobsubServer, options.acctGroup, None, [], extra_opts=optDict)
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_2/NULL/jobsubClient.py", line 108, in __init__
    self.serverAuthMethods()
  File "/grid/fermiapp/products/common/prd/jobsub_client/v1_2_2/NULL/jobsubClient.py", line 768, in serverAuthMethods
    traceback.print_stack()
HTTP response:0 PyCurl Error 56: SSL read: errno -12224

The solution is simply to type

setup_fnal_security --force

and run the jobsub command again.

Error in locking authority file

/usr/bin/xauth:  error in locking authority file /nashome/u/user/.Xauthority

This error appears after ssh to a gpvm, as a result of other unsuccessful login attempts. Delete any extra files of the form

/nashome/u/user/.Xauthority-*
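
The cleanup can be sketched as a small function. This is a minimal sketch, assuming the leftover files all match .Xauthority-* in your home directory; the function name clean_xauth_locks is hypothetical. The .Xauthority file itself is not touched, because the pattern requires the trailing hyphen.

```shell
# Remove leftover files of the form .Xauthority-* from a home directory.
# Usage: clean_xauth_locks [home dir]   (defaults to $HOME)
clean_xauth_locks() {
    home_dir="${1:-$HOME}"
    for f in "$home_dir"/.Xauthority-*; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        echo "removing $f"
        rm -f "$f"
    done
}
```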

No space left on device

Errors of the form

-bash: cannot create temp file for here-document: No space left on device

(which usually occur when attempting to setup_nova) mean that a disk area is full. Nearly always this is the /tmp area on a nova GPVM. The solution is to look at the files in /tmp, identify the large file that is there, and have its owner remove it. To list the files by size, with the largest at the bottom of the list, try
ls -lhrS /tmp
(Often this situation results from a failed file transfer, and the file is called something like tmp6bYmrA, but this is not the only filename pattern.)
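
On a busy node the full listing can be long. The following is a minimal sketch that shows only the last few entries of the same listing; the function name largest_files and the default of five entries are hypothetical.

```shell
# Print the N largest entries in a directory in long format, largest last.
# The third column of each line is the owner of the file.
# Usage: largest_files [dir] [count]   (defaults: /tmp, 5)
largest_files() {
    dir="${1:-/tmp}"
    count="${2:-5}"
    ls -lhrS "$dir" | tail -n "$count"
}
```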

Gotchas

A guy walks in to a bar. What does he say?

Ouch!

Gotcha!