Bug #13719

platform-dependent static TLS exception

Added by Martin Haigh almost 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Other
Target version:
Start date: 08/31/2016
Due date:
% Done: 0%
Estimated time:
Occurs In:
Experiment: DUNE
Co-Assignees:
Duration:

Description

When running standard reconstruction using the cvmfs build with the latest versions of larsoft and dunetpc, the error "dlopen: cannot load any more object with static TLS" is encountered (full output at the bottom of this report). It can be reproduced on our system with a standard setup of the code from a clean shell:

source /cvmfs/fermilab.opensciencegrid.org/products/larsoft/setup
source /cvmfs/dune.opensciencegrid.org/products/dune/setup
setup dunetpc v06_04_01 -q e10:prof
lar -c standard_reco_dune10kt_1x2x6.fcl {input file}

where the input file is either a "detsim" file produced using the same version of the code, using the standard scripts in dunetpc, or a non-existent file.

This occurs on an x86_64 machine running Red Hat Server 6.6, with kernel build 2.6.32-504.12.2.el6.x86_64. It has been verified that this error does not occur when running the same code on the Fermilab GPVM nodes.

Full output:

%MSG-s ArtException: lar 31-Aug-2016 10:22:09 BST JobSetup
cet::exception caught in art
---- Configuration BEGIN
The following were encountered while processing the module configurations:
ERROR: Configuration of module with label pandora encountered the following error:
---- Configuration BEGIN
Unable to load requested library /cvmfs/fermilab.opensciencegrid.org/products/larsoft/larpandora/v06_00_06/slf6.x86_64.e10.prof/lib/liblarpandora_LArPandoraInterface_StandardPandora_module.so
dlopen: cannot load any more object with static TLS
---- Configuration END

---- Configuration END
%MSG
%MSG-s ArtException: lar 31-Aug-2016 10:22:09 BST JobSetup
cet::exception caught in art
---- Configuration BEGIN
The following were encountered while processing the module configurations:
ERROR: Configuration of module with label pandora encountered the following error:
---- Configuration BEGIN
Unable to load requested library /cvmfs/fermilab.opensciencegrid.org/products/larsoft/larpandora/v06_00_06/slf6.x86_64.e10.prof/lib/liblarpandora_LArPandoraInterface_StandardPandora_module.so
dlopen: cannot load any more object with static TLS
---- Configuration END

---- Configuration END
%MSG
Art has completed and will exit with status 9.

strace.log (2.16 MB), Martin Haigh, 09/01/2016 10:26 AM
strace.log (2.15 MB), Martin Haigh, 09/01/2016 10:46 AM

History

#1 Updated by Lynn Garren almost 3 years ago

What is the output of lsb_release -a on this machine?

#2 Updated by Martin Haigh almost 3 years ago

lsb_release -a output:

LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.6 (Santiago)
Release: 6.6
Codename: Santiago

#3 Updated by Lynn Garren almost 3 years ago

Thanks Martin. I should have also asked if you are using the standard kernel provided by that release.

#4 Updated by Lynn Garren almost 3 years ago

Am I correct that this problem is limited to larpandora? If so, we should be looking at the larpandoracontent dependencies.

$ ups depend larpandoracontent v02_07_09 -q +prof:+e10
larpandoracontent v02_07_09 -f Linux64bit+2.6-2.12 -z /products -q e10:prof
|__cetpkgsupport v1_10_02 -f NULL -z /products -g current
|__pandora v02_07_00b -f Linux64bit+2.6-2.12 -z /products -q e10:nu:prof
   |__root v6_06_04b -f Linux64bit+2.6-2.12 -z /products -q e10:nu:prof
      |__clhep v2_3_2_2 -f Linux64bit+2.6-2.12 -z /products -q e10:prof
      |  |__gcc v4_9_3a -f Linux64bit+2.6-2.12 -z /products
      |__fftw v3_3_4 -f Linux64bit+2.6-2.12 -z /products -q prof
      |__gsl v2_1 -f Linux64bit+2.6-2.12 -z /products -q prof
      |__pythia v6_4_28e -f Linux64bit+2.6-2.12 -z /products -q gcc493a:prof
      |__postgresql v9_3_12 -f Linux64bit+2.6-2.12 -z /products -q p2711
      |  |__python v2_7_11 -f Linux64bit+2.6-2.12 -z /products
      |     |__sqlite v3_12_02_00 -f Linux64bit+2.6-2.12 -z /products
      |__mysql_client v5_5_48a -f Linux64bit+2.6-2.12 -z /products -q e10
      |__libxml2 v2_9_3 -f Linux64bit+2.6-2.12 -z /products -q prof
      |__xrootd v3_3_4e -f Linux64bit+2.6-2.12 -z /products -q e10:prof

#5 Updated by Martin Haigh almost 3 years ago

Confirmed that the issue is not present when all modules with "pandora" in name are removed from process. Yes, the kernel version is the standard one for the distribution.

#6 Updated by Marc Paterno almost 3 years ago

Please try running the program using strace:

strace -o strace.log -e open lar -c standard_reco_dune10kt_1x2x6.fcl {input file}

Please post the resulting log file.

#7 Updated by Martin Haigh almost 3 years ago

strace log attached.

#8 Updated by Lynn Garren almost 3 years ago

Where did these files come from? This might be your problem.

open("/data/t2k/phsmaj/lbne/software_v6/build_slf6.x86_64/larana/lib/tls/x86_64/libart_Framework_Art.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/data/t2k/phsmaj/lbne/software_v6/build_slf6.x86_64/larana/lib/tls/libart_Framework_Art.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/data/t2k/phsmaj/lbne/software_v6/build_slf6.x86_64/larana/lib/x86_64/libart_Framework_Art.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/data/t2k/phsmaj/lbne/software_v6/build_slf6.x86_64/larana/lib/libart_Framework_Art.so", O_RDONLY) = -1 ENOENT (No such file or directory)
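One way to spot this kind of stray search path is to list every directory the loader probed for shared libraries that lies outside the expected install areas. The following is a minimal Python sketch and not part of the original exchange. The function name unexpected_search_dirs and the default list of allowed prefixes are assumptions made for illustration, and it only understands the plain open("...", ...) = N lines that "strace -e open" writes.

```python
import os
import re

# Matches lines such as: open("/some/path/lib.so", O_RDONLY) = -1 ENOENT (...)
OPEN_RE = re.compile(r'^open\("([^"]+)",[^)]*\)\s*=\s*(-?\d+)')

# Assumed "expected" locations; adjust these for the local site.
DEFAULT_ALLOWED = ("/cvmfs/", "/lib64/", "/usr/", "/etc/", "/proc/", "/dev/")


def unexpected_search_dirs(lines, allowed=DEFAULT_ALLOWED):
    """Return the sorted directories probed for .so files outside `allowed`."""
    found = set()
    for line in lines:
        match = OPEN_RE.match(line.strip())
        if match is None:
            continue  # not an open() call in the expected format
        path = match.group(1)
        if ".so" not in os.path.basename(path):
            continue  # only shared-library lookups are of interest here
        if not path.startswith(allowed):
            found.add(os.path.dirname(path))
    return sorted(found)
```

Passing the open log file to it, for example unexpected_search_dirs(open("strace.log")), would list directories such as the larana build area shown above. Both failed and successful lookups are reported, since a failed probe already shows that the directory is on the search path.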

#9 Updated by Martin Haigh almost 3 years ago

Must have run strace in a polluted environment, sorry. New log is attached running as in original bug report.

#10 Updated by Lynn Garren almost 3 years ago

  • Status changed from New to Accepted

#11 Updated by Marc Paterno almost 3 years ago

I don't see anything unusual in the strace log file.

It would be useful to know the version of libc being used. Can you please post the output of running the following?

/lib64/libc.so.6

#12 Updated by Martin Haigh almost 3 years ago

Checking the libc output gives:

GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-9).
Compiled on a Linux 2.6.32 system on 2015-01-19.
Available extensions:
The C stubs add-on version 2.1.2.
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

#13 Updated by Marc Paterno almost 3 years ago

  • Status changed from Accepted to Feedback

We are attempting to reproduce this failure locally. We do not yet understand what difference between your installation and other installations is causing the failure.

Can you try asking your system manager to update the libc.so version? It is possible (but not certain) that this could eliminate the problem.

#14 Updated by Martin Haigh almost 3 years ago

I am looking at the version of libc on GPVM and it appears to be 2.12, same as on our system. Is this right? Is there a specific version we should try?

#15 Updated by Lynn Garren almost 3 years ago

The specific concern is "Compiled on a Linux 2.6.32 system on 2015-01-19." on your machine. Compare to the output from uboonegpvm01, which is SLF 6.4 and has "Compiled on a Linux 2.6.32 system on 2016-02-16." So we wonder if your OS can be upgraded to a newer release.

<uboonegpvm01> /lib64/libc.so.6
GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-16).
Compiled on a Linux 2.6.32 system on 2016-02-16.
Available extensions:
The C stubs add-on version 2.1.2.
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
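Because both machines report version 2.12, the build date in the banner is the only field that tells the two libc builds apart. Below is a small Python sketch for pulling that date out and comparing hosts. It is an illustration only: the function names libc_build_date and read_libc_banner are made up here, and the regular expression assumes the banner wording shown in this thread.

```python
import datetime
import re
import subprocess

# Matches the banner line, e.g. "Compiled on a Linux 2.6.32 system on 2016-02-16."
BUILD_DATE_RE = re.compile(
    r"Compiled on a Linux \S+ system on (\d{4})-(\d{2})-(\d{2})\."
)


def libc_build_date(banner):
    """Return the build date from the banner text, or None if it is absent."""
    match = BUILD_DATE_RE.search(banner)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return datetime.date(year, month, day)


def read_libc_banner(path="/lib64/libc.so.6"):
    """Execute the libc shared object, which prints its banner to stdout."""
    result = subprocess.run(
        [path], stdout=subprocess.PIPE, universal_newlines=True
    )
    return result.stdout
```

On a given host, libc_build_date(read_libc_banner()) returns the date to compare. A later date does not by itself prove that a particular fix is included, so the package changelog remains the more reliable check.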

#16 Updated by Lynn Garren almost 3 years ago

We also wonder if building the complete code base on your machine will resolve the problem.

If you are willing to try that:
  • work from a clean environment (check your login script)
  • start from a fresh login
  • make an empty directory, <my product dir>, which has at least 20 GB available
  • cd <my product dir>
  • wget http://scisoft.fnal.gov/scisoft/bundles/tools/pullProducts
  • chmod +x pullProducts
  • pullProducts `pwd` source larsoft-v06_05_00
  • ./buildFW -b e10 -s s41 `pwd` prof larsoft-v06_05_00
  • be prepared to wait some hours

That will get you larsoft. You'll also need to use mrb to build a local copy of dunetpc.

#17 Updated by Martin Haigh almost 3 years ago

Ok, trying this. Hopefully everything will be available when I come in tomorrow.

#18 Updated by Ben Morgan almost 3 years ago

I suspect the underlying cause is as detailed in this bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=1124987

So whilst it's an underlying issue in one or more of the libraries loaded up to and including larpandoracontent, it can be fixed by an update of glibc (which will be applied on our system). For RHEL/CentOS 6, the fix appears to have been applied in glibc-2.12-1.183 (I can't access the Bugzilla entry referenced in the changelog, but it does appear related):

* Mon Dec 14 2015 Carlos O'Donell <carlos@redhat.com> - 2.12-1.183
- Increase the limit of shared libraries that can use static TLS (#1198802).
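To check whether an installed glibc package is at or beyond that changelog entry, the version-release string can be compared numerically. The Python sketch below is deliberately simplified and is not a full implementation of RPM's version comparison: it keeps only the leading numeric fields of the version and release, so it is only meaningful within the RHEL/CentOS 6 glibc 2.12 series. The function names are invented for this example.

```python
def numeric_prefix(text):
    """Leading dot-separated integer fields, e.g. '1.149.el6' -> (1, 149)."""
    fields = []
    for part in text.split("."):
        if not part.isdigit():
            break  # stop at the first non-numeric field such as 'el6'
        fields.append(int(part))
    return tuple(fields)


def glibc_key(version_release):
    """Split 'VERSION-RELEASE' and keep the leading numeric fields of each."""
    version, _, release = version_release.partition("-")
    return numeric_prefix(version), numeric_prefix(release)


def has_static_tls_limit_fix(version_release, fixed_in="2.12-1.183"):
    """True if the package is at or beyond the changelog entry quoted above."""
    return glibc_key(version_release) >= glibc_key(fixed_in)
```

The input string can be obtained with rpm -q --queryformat '%{VERSION}-%{RELEASE}\n' glibc. For instance, a hypothetical "2.12-1.149.el6" would be reported as lacking the fix, while "2.12-1.192.el6" would be reported as having it.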

Working out which libraries in the load sequence use static TLS would require investigation, but it is probably worthwhile, because the glibc patch (as I understand it) only increases the limit on available storage. I'm also not sure how other distros and upstream glibc have tackled the issue. The Bugzilla report linked above has some information on checking this. If any LArSoft or UPS products are using static TLS, they should probably be patched accordingly.

One candidate (if loaded by the script Martin's using) is Geant4, as this uses the initial-exec TLS model when compiled with multithreading support (for best application performance) unless the TLS model is changed with the GEANT4_BUILD_TLS_MODEL CMake variable. Others might be using -ftls-model directly, or possibly picking up GCC's change in default behaviour under the -fpic flag.
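As a starting point for that investigation, the dynamic section of each candidate library can be inspected for the STATIC_TLS flag, which, to the best of my understanding, the linker sets on shared objects that use the initial-exec TLS model. The Python sketch below wraps readelf -d for this. It is an assumption-laden illustration (the function names are hypothetical, and readelf from binutils must be on the PATH), and it is not necessarily the procedure described in the Bugzilla report.

```python
import subprocess


def has_static_tls_flag(readelf_dynamic_output):
    """True if a (FLAGS) entry in `readelf -d` output lists STATIC_TLS."""
    for line in readelf_dynamic_output.splitlines():
        # The (FLAGS) entry holds the DF_* flags; (FLAGS_1) is a separate tag.
        if "(FLAGS)" in line and "STATIC_TLS" in line.split():
            return True
    return False


def libraries_with_static_tls(paths):
    """Run `readelf -d` on each path and keep those flagged STATIC_TLS."""
    flagged = []
    for path in paths:
        result = subprocess.run(
            ["readelf", "-d", path],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
        if result.returncode == 0 and has_static_tls_flag(result.stdout):
            flagged.append(path)
    return flagged
```

The list of paths could be taken from the successful open() calls on .so files in the strace log, so that only libraries actually loaded before the failure are examined.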

#19 Updated by Katherine Lato over 2 years ago

  • Status changed from Feedback to Resolved

From: "Haigh, Martin" <>
Date: Friday, March 10, 2017 at 4:00 AM
To: Katherine Lato <>
Subject: Re: redmine issue

Hi Katherine,

We didn't get to the root of what caused the issue but we were able to stop it on our system by updating our system glibc library to more exactly match what is present at FNAL. So yes, this issue can be closed.

Thanks,
Jen

#20 Updated by Katherine Lato over 2 years ago

  • Status changed from Resolved to Closed

