Bug #20246

Illegal Instruction in hep_concurrency

Added by Christopher Backhouse about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 06/29/2018
Due date:
% Done: 100%
Estimated time: 2.00 h
Spent time:
Occurs In:
Scope: Internal
Experiment: -
SSI Package: art
Duration:

Description

Running essentially anything with art v2_11_01 fails with an Illegal Instruction (SIGILL) in hep_concurrency, reached through messagefacility. This same machine used to work with art1 releases.

Program received signal SIGILL, Illegal instruction.
hep::concurrency::getTSCP (cpuidx=@0x7ffffffe66d4: 0)
   at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_01/src/hep_concurrency/tsan.cc:28
28    /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_01/src/hep_concurrency/tsan.cc: No such file or directory.

#1  0x00007fffee94506b in hep::concurrency::RecursiveMutex::lock (
   this=0x7ffff7b4a440 <mf::(anonymous namespace)::msgMutex_>, opName=...)
   at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_01/src/hep_concurrency/RecursiveMutex.cc:72
72    /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_01/src/hep_concurrency/RecursiveMutex.cc: No such file or directory.

#2  0x00007fffee946d40 in hep::concurrency::RecursiveMutexSentry::RecursiveMutexSentry (this=0x7ffffffe67e0, mutex=..., name=...)
   at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_01/src/hep_concurrency/RecursiveMutex.cc:283
283    in /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_01/src/hep_concurrency/RecursiveMutex.cc

#3  0x00007ffff78db6aa in mf::(anonymous namespace)::logMessage (msg=0x44c4ec0)
   at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_01/src/messagefacility/MessageLogger/MessageLogger.cc:438
438    /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_01/src/messagefacility/MessageLogger/MessageLogger.cc: No such file or directory.

#4  0x00007ffff78dba4f in mf::LogErrorObj (msg=0x44c4ec0)
   at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_01/src/messagefacility/MessageLogger/MessageLogger.cc:547
547    in /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_01/src/messagefacility/MessageLogger/MessageLogger.cc

#5  0x00007ffff7d55ec3 in mf::MaybeLogger_<(mf::ELseverityLevel::ELsev_)3, false>::~MaybeLogger_ (this=0x7ffffffe8b28, __in_chrg=<optimized out>)
   at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_01/include/messagefacility/MessageLogger/MessageLogger.h:143
143    /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_01/include/messagefacility/MessageLogger/MessageLogger.h: No such file or directory.

#6  0x00007ffff7d51ff9 in art::run_art_common_ (main_pset=...)
   at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_01/src/art/Framework/Art/run_art.cc:287
287    /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_01/src/art/Framework/Art/run_art.cc: No such file or directory.

#7  0x00007ffff7d51315 in art::run_art (argc=4, argv=0x7ffffffe9608,
   in_desc=..., lookupPolicy=..., handlers=...)
   at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_01/src/art/Framework/Art/run_art.cc:206
206    in /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_01/src/art/Framework/Art/run_art.cc

/proc/cpuinfo says this:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
stepping        : 10
microcode       : 2571
cpu MHz         : 2826.477
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm pti retpoline tpr_shadow vnmi flexpriority
bogomips        : 5652.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Is it possible to rebuild with less aggressive CPU requirements? It's obviously bad to have machines that used to work stop working. I wouldn't be surprised if we've also made some fraction of the grid unusable.
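The flags line above lists tsc and constant_tsc but has no rdtscp entry, which is how Linux reports support for the RDTSCP instruction. To survey other machines (for example grid worker nodes), a minimal sketch of a whole-word flag check follows. The helper name hasCpuFlag is hypothetical and is not part of art or hep_concurrency; it only inspects a line of text copied from /proc/cpuinfo.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of art or hep_concurrency).
// Returns true if `flag` appears as a whole token in a /proc/cpuinfo
// "flags" line such as "flags : fpu vme ... constant_tsc ...".
// Whole-token matching matters here: "tsc" and "constant_tsc" must not
// be mistaken for "rdtscp".
inline bool hasCpuFlag(std::string const& flagsLine, std::string const& flag)
{
  // Drop the "flags :" label if the line carries one.
  auto const colon = flagsLine.find(':');
  std::istringstream tokens{
    colon == std::string::npos ? flagsLine : flagsLine.substr(colon + 1)};

  std::string token;
  while (tokens >> token) {
    if (token == flag) {
      return true;
    }
  }
  return false;
}
```

Calling hasCpuFlag(line, "rdtscp") on each line of /proc/cpuinfo that starts with "flags" would identify machines likely to hit this SIGILL. A substring search is deliberately avoided so that similarly named flags do not produce false matches.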


Related issues

Has duplicate cet-is - Bug #20488: DUNE jobs on some off-site worker nodes are terminated with exit status 4 (SIGILL) (Rejected, 2018-07-30)

History

#1 Updated by Kyle Knoepfel about 1 year ago

  • Status changed from New to Feedback

We need more details, Chris:

  • Which platform (SLF6?)
  • Printout of 'ups active'
  • Sample job that reproduces the problem

#2 Updated by Christopher Backhouse about 1 year ago

SLF6

art               v2_11_01        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
artdaq_core       v3_01_08        -f Linux64bit+2.6-2.12  -q debug:e15:s67   -z /cvmfs/nova.opensciencegrid.org/externals
awscli            v1_7_15         -f Linux64bit+2.6-2.12                     -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
boost             v1_66_0a        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
bpf               v02.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
caffe             v1_0i           -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
calibcsvs         v12.06          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
calibfixnd        v01.00          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
canvas            v3_03_01        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
canvas_root_io    v1_01_05        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
castxml           v0_00_00_f20180122 -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
ccache            v03.03.03       -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
cetbuildtools     v7_03_01        -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
cetlib            v3_03_00        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
cetlib_except     v1_02_00        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
cetpkgsupport     v1_14_01        -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
cigetcert         v1_16_1         -f Linux64bit+2.6-2.12                     -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
cigetcertlibs     v1_1            -f Linux64bit+2.6-2.12                     -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
clhep             v2_3_4_6        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
cmake             v3_10_1         -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
condb             v2_0b           -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
cpn               v1.7            -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
cppunit           v1_13_2c        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
cry               v1_7k           -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
cstxsd            v4_0_0h         -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
cvn               v01.04          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
cvnprong          v01.00          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
cvnreg            v01.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
dk2nudata         v01_06_01b      -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
dk2nugenie        v01_06_01e      -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
eid               v01.00          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
FCHelperAna2017   v01.02          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
fftw              v3_3_6_pl2      -f Linux64bit+2.6-2.12  -q debug           -z /cvmfs/nova.opensciencegrid.org/externals
fhiclcpp          v4_06_07        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
fife_utils        v3_1_3          -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
g4abla            v3_0            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4emlow           v6_50           -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4neutron         v4_5            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4neutronxs       v1_4            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4nucleonxs       v1_1            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4nuclide         v2_1            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4photon          v4_3_2          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4pii             v1_3            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4radiative       v5_1_1          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4surface         v1_0            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
g4tendl           v1_3            -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
gcc               v6_4_0          -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
gdb               v8_0_1          -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
geant4            v4_10_3_p01d    -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
genie_fluxopt     v17_03_14a      -f NULL                 -q nova            -z /cvmfs/nova.opensciencegrid.org/externals
genie             v2_12_10b       -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
genie_phyopt      v2_12_10        -f NULL                 -q dkcharmtau      -z /cvmfs/nova.opensciencegrid.org/externals
genie_xsec        v2_12_10        -f NULL                 -q DefaultPlusMECWithNC -z /cvmfs/nova.opensciencegrid.org/externals
gflags            v2_2_1          -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
gibuu_libs        v00.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
glog              v0_3_5          -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
gsl               v2_4            -f Linux64bit+2.6-2.12  -q debug           -z /cvmfs/nova.opensciencegrid.org/externals
hdf5              v1_10_1c        -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
hep_concurrency   v1_00_01        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
ifbeam            v2_2_3          -f Linux64bit+2.6-2.12  -q debug:e15:p2714b -z /cvmfs/nova.opensciencegrid.org/externals
ifdh_art          v2_06_01        -f Linux64bit+2.6-2.12  -q debug:e15:s67   -z /cvmfs/nova.opensciencegrid.org/externals
ifdhc_config      v2_3_3          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
ifdhc             v2_3_3          -f Linux64bit+2.6-2.12  -q debug:e15:p2714b -z /cvmfs/nova.opensciencegrid.org/externals
jobsub_client     v1_2_6_2        -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
kx509             v3_1_1          -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
lapack            v3_7_1          -f Linux64bit+2.6-2.12  -q e15:prof        -z /cvmfs/nova.opensciencegrid.org/externals
lemlittle         v01.03          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
leveldb           v1_20a          -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
lhapdf            v5_9_1k         -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
library_shim      v03.03          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
libwda            v2_24_0         -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
libxml2           v2_9_5          -f Linux64bit+2.6-2.12  -q debug           -z /cvmfs/nova.opensciencegrid.org/externals
lid               v01.03          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
lmdb              v0_9_21         -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
log4cpp           v1_1_3a         -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
messagefacility   v2_02_01        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
monopoleid        v01.00          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
mysql_client      v5_5_58a        -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
ncid              v01.03          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
novaproduction    v02.49          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
nucondb           v2_2_3          -f Linux64bit+2.6-2.12  -q debug:e15:p2714b -z /cvmfs/nova.opensciencegrid.org/externals
nuecosrej         v01.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
nuededx           v01.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
nuone             v01.02          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
nusdata           v00.10          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
nusimdata         v1_13_00        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
nutools           v2_22_01        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
opencv            v3_3_0c         -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
pdfsets           v5_9_1b         -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
poms_client       v3_0_0          -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
postgresql        v9_6_6a         -f Linux64bit+2.6-2.12  -q p2714b          -z /cvmfs/nova.opensciencegrid.org/externals
ppfx              v02_03          -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
protobuf          v3_3_1a         -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
psycopg2          v2_5_p2_7       -f Linux64bit+2.6                          -z /cvmfs/nova.opensciencegrid.org/externals
pycurl            v7_16_4         -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
pygccxml          v1_9_1          -f NULL                 -q p2714b          -z /cvmfs/nova.opensciencegrid.org/externals
pythia            v6_4_28k        -f Linux64bit+2.6-2.12  -q debug:gcc640    -z /cvmfs/nova.opensciencegrid.org/externals
python            v2_7_14b        -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
python_request    v2_9_1          -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
pyyaml            v3_12           -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
qepid             v01.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
range             v3_0_3_0        -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
remid             v01.03          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
root              v6_12_06a       -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
rvp               v01.00          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
sam_web_client    v2_0            -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
setpath           v1_11           -f NULL                                    -z /cvmfs/fermilab.opensciencegrid.org/products/common/db
snappy            v1_1_7a         -f Linux64bit+2.6-2.12  -q e15             -z /cvmfs/nova.opensciencegrid.org/externals
sqlite            v3_20_01_00     -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
tbb               v2018_2a        -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
tensorflow        v1_3_0c         -f Linux64bit+2.6-2.12  -q debug:e15:p2714b -z /cvmfs/nova.opensciencegrid.org/externals
TRACE             v3_13_05        -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
ucana             v01.07          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
ups               v6_0_6          -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
valgrind          v3_13_0         -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
wsnumu            v01.00          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
xerces_c          v3_2_0a         -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
xgboost           v0.60           -f Linux64bit+2.6-2.12                     -z /cvmfs/nova.opensciencegrid.org/externals
xrootd            v4_8_0b         -f Linux64bit+2.6-2.12  -q debug:e15       -z /cvmfs/nova.opensciencegrid.org/externals
xsecccpi0inc      v01.01          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals
xsecncpi0         v01.04          -f NULL                                    -z /cvmfs/nova.opensciencegrid.org/externals

Specifically, we have messagefacility v2_02_01 and hep_concurrency v1_00_01 in there.

I'm just running nova -c eventdump.fcl <some art file>. I really suspect this happens on all messagefacility calls.

#3 Updated by Kyle Knoepfel about 1 year ago

  • Status changed from Feedback to Accepted

Thanks, Chris. Yes, I suspect you're correct. Based on the backtrace you provided, the code in question (hep::concurrency::getTSCP) calls the __rdtscp intrinsic, which emits the RDTSCP instruction, and that instruction is not supported on all CPU models. We will discuss a path forward at next week's SciSoft team meeting.
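For illustration only, one common way to avoid this class of SIGILL is to ask CPUID whether RDTSCP exists before executing it. The sketch below is not the fix that went into hep_concurrency (the ticket does not show it); the names cpuHasRdtscp and readTimestamp are hypothetical. It assumes GCC or Clang on x86 Linux, and relies on the documented feature bit CPUID leaf 0x80000001, EDX bit 27.

```cpp
#include <cstdint>

#include <cpuid.h>     // __get_cpuid (GCC/Clang, x86 only)
#include <sched.h>     // sched_getcpu (Linux/glibc)
#include <x86intrin.h> // __rdtsc, __rdtscp

// Hypothetical helper: true if the CPU advertises RDTSCP
// (CPUID leaf 0x80000001, EDX bit 27).
inline bool cpuHasRdtscp()
{
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx)) {
    return false; // extended leaf not available on this CPU
  }
  return (edx & (1u << 27)) != 0;
}

// Hypothetical guarded replacement for a bare __rdtscp call.
// Uses RDTSCP when available; otherwise falls back to RDTSC plus
// sched_getcpu(). In the fallback the CPU index is a plain CPU number,
// is not read atomically with the counter, and is not the raw
// IA32_TSC_AUX value that RDTSCP returns.
inline std::uint64_t readTimestamp(unsigned& cpuidx)
{
  static bool const hasRdtscp = cpuHasRdtscp(); // probe once
  if (hasRdtscp) {
    return __rdtscp(&cpuidx);
  }
  int const cpu = sched_getcpu();
  cpuidx = cpu < 0 ? 0u : static_cast<unsigned>(cpu);
  return __rdtsc();
}
```

The runtime probe keeps a single binary usable on both old and new hardware, at the cost of one cached branch per call. Whether the weaker fallback semantics are acceptable depends on how the caller uses the CPU index.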

#4 Updated by Christopher Backhouse about 1 year ago

Thanks!

#5 Updated by Kyle Knoepfel about 1 year ago

  • Assignee set to Kyle Knoepfel
  • Estimated time set to 2.00 h

There is a straightforward fix to this issue. We will issue bug-fix releases, but the builds will not be available until the middle of the month.

#6 Updated by Kyle Knoepfel about 1 year ago

  • Category set to Infrastructure
  • Status changed from Accepted to Resolved
  • SSI Package art added

Implemented with commit hep_concurrency:0a04dd3e.

#7 Updated by Kyle Knoepfel about 1 year ago

  • % Done changed from 0 to 100

#8 Updated by Kyle Knoepfel about 1 year ago

  • Target version set to 2.11.03

#9 Updated by Kyle Knoepfel about 1 year ago

  • Status changed from Resolved to Closed

#10 Updated by Christopher Green 12 months ago

  • Has duplicate Bug #20488: DUNE jobs on some off-site worker nodes are terminated with exit status 4 (SIGILL) added

