Bug #20488

DUNE jobs on some off-site worker nodes are terminated with exit status 4 (SIGILL)

Added by Vito Di Benedetto about 1 year ago. Updated about 1 year ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
Start date: 07/30/2018
Due date:
% Done: 0%
Estimated time:
Scope: Internal
Experiment: -
SSI Package:
Co-Assignees:
Duration:

Description

When running DUNE (and uBooNE) jobs off-site, some of them are terminated with exit status 4 (SIGILL).

Details of the code I run are:

dunetpc: v06_84_00 with qualifier "debug:e15" 
art: v2_11_02

command:

lar --rethrow-all -c prodgenie_nue_dune10kt_1x2x6.fcl -n 1 -o prodgenie_nue_dune10kt_1x2x6_pass_0.root

The gdb backtrace is the following:

Reading symbols from lar...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
hep::concurrency::getTSCP (cpuidx=@0x7ffffffe2954: 0)
    at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/tsan.cc:28
#0  hep::concurrency::getTSCP (cpuidx=@0x7ffffffe2954: 0)
    at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/tsan.cc:28
#1  0x00002aaab4b7806b in hep::concurrency::RecursiveMutex::lock (
    this=0x2aaaab4a41e0 , opName=...)
    at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/RecursiveMutex.cc:72
#2  0x00002aaab4b79d40 in hep::concurrency::RecursiveMutexSentry::RecursiveMutexSentry (this=0x7ffffffe2a60, mutex=..., name=...)
    at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/RecursiveMutex.cc:283
#3  0x00002aaaab23548a in mf::(anonymous namespace)::logMessage (
    msg=0x6150960)
    at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_02/src/messagefacility/MessageLogger/MessageLogger.cc:438
#4  0x00002aaaab23582f in mf::LogErrorObj (msg=0x6150960)
    at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_02/src/messagefacility/MessageLogger/MessageLogger.cc:547
#5  0x00002aaaaaed6d0b in mf::MaybeLogger_<(mf::ELseverityLevel::ELsev_)3, false>::~MaybeLogger_ (this=0x7ffffffe4da8, __in_chrg=)
    at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_02/include/messagefacility/MessageLogger/MessageLogger.h:129
#6  0x00002aaaaaed2e41 in art::run_art_common_ (main_pset=...)
    at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/src/art/Framework/Art/run_art.cc:287
#7  0x00002aaaaaed215d in art::run_art (argc=8, argv=0x7ffffffe5888, 
    in_desc=..., lookupPolicy=..., handlers=...)
    at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/src/art/Framework/Art/run_art.cc:206
#8  0x00002aaaaaece117 in artapp (argc=8, argv=0x7ffffffe5888)
    at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/build-Linux64bit+2.6-2.12-e15-debug/art/Framework/Art/artapp.cc:51
#9  0x0000000000401628 in main (argc=8, argv=0x7ffffffe5888)
    at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/build-Linux64bit+2.6-2.12-e15-debug/art/Framework/Art/lar.cc:9
A debugging session is active.

    Inferior 1 [process 84628] will be killed.

Jobs terminated this way are running on worker nodes with the following CPU info from /proc/cpuinfo:

processor    : 0
vendor_id    : AuthenticAMD
cpu family    : 21
model        : 1
model name    : AMD Opteron(tm) Processor 6282 SE
stepping    : 2
cpu MHz        : 2599.948
cache size    : 2048 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 16
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4
bogomips    : 5199.89
TLB size    : 1536 4K pages
clflush size    : 64
cache_alignment    : 64
address sizes    : 42 bits physical, 48 bits virtual
power management:

processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 46
model name    : Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz
stepping    : 6
microcode    : 4294967295
cpu MHz        : 2260.949
cache size    : 24576 KB
physical id    : 0
siblings    : 16
core id        : 0
cpu cores    : 16
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 11
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm
bogomips    : 4521.89
clflush size    : 64
cache_alignment    : 64
address sizes    : 42 bits physical, 48 bits virtual
power management:

processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 62
model name    : Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
stepping    : 4
microcode    : 4294967295
cpu MHz        : 2199.980
cache size    : 25600 KB
physical id    : 0
siblings    : 20
core id        : 0
cpu cores    : 20
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips    : 4399.96
clflush size    : 64
cache_alignment    : 64
address sizes    : 42 bits physical, 48 bits virtual
power management:

processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 44
model name    : Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
stepping    : 2
microcode    : 4294967295
cpu MHz        : 2659.976
cache size    : 12288 KB
physical id    : 0
siblings    : 12
core id        : 0
cpu cores    : 12
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 11
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 popcnt aes hypervisor lahf_lm
bogomips    : 5319.95
clflush size    : 64
cache_alignment    : 64
address sizes    : 40 bits physical, 48 bits virtual
power management:
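
The backtrace does not show which instruction faulted. The function name getTSCP suggests the RDTSCP instruction, and none of the four flag lists above includes rdtscp, but that is an inference rather than a confirmed diagnosis. Under that assumption, a worker node could be screened with a minimal sketch like the following (cpu_flags and lacks_flag are hypothetical helper names, not part of art or hep_concurrency):

```python
import re


def cpu_flags(cpuinfo_text):
    """Return one set of flags per 'flags' line in /proc/cpuinfo-style text."""
    return [
        set(line.split(":", 1)[1].split())
        for line in cpuinfo_text.splitlines()
        if re.match(r"flags\s*:", line)
    ]


def lacks_flag(cpuinfo_text, flag="rdtscp"):
    """True if at least one listed processor does not advertise the flag.

    The default flag is an assumption based on the name getTSCP.
    Returns False when the text contains no 'flags' lines at all.
    """
    return any(flag not in flags for flags in cpu_flags(cpuinfo_text))
```

On a Linux worker node this could be called as lacks_flag(open("/proc/cpuinfo").read()) before starting lar, so that the job can report the node instead of dying with SIGILL.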

Is this a known issue?


Related issues

Is duplicate of art - Bug #20246: Illegal Instruction in hep_concurrency (Closed, 06/29/2018)

History

#1 Updated by Christopher Green about 1 year ago

  • Is duplicate of Bug #20246: Illegal Instruction in hep_concurrency added

#2 Updated by Christopher Green about 1 year ago

  • Status changed from New to Rejected

This is a duplicate of #20246, which was resolved and the fix incorporated into release 2.11.03.
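
To tell whether a given setup predates the fix, the art version can be compared against the fixed release. The sketch below assumes that release 2.11.03 is written v2_11_03 in the same underscore style as the v2_11_02 quoted in the description; parse_version and has_fix are hypothetical helper names, and any letter suffix on a version is ignored.

```python
import re


def parse_version(version):
    """Turn a string such as 'v2_11_02' into a tuple of ints, here (2, 11, 2)."""
    return tuple(int(number) for number in re.findall(r"\d+", version))


def has_fix(art_version, fixed_in="v2_11_03"):
    """True if art_version is at or after the release assumed to carry the fix."""
    return parse_version(art_version) >= parse_version(fixed_in)
```

With the version from the description, has_fix("v2_11_02") is False, so jobs built against that art release would still be exposed on the affected nodes.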


