Bug #20488
DUNE jobs on some off-site worker nodes are terminated with exit status 4 (SIGILL)
Status: Rejected
Priority: Normal
Assignee: -
Target version: -
Start date: 07/30/2018
Due date:
% Done: 0%
Estimated time:
Scope: Internal
Experiment: -
SSI Package:
Co-Assignees:
Description
Running DUNE (and uBooNE) jobs off-site, I see some of those jobs terminated with exit status 4 (SIGILL).
Details of the code I run:
dunetpc: v06_84_00 with qualifier "debug:e15"
art: v2_11_02
command:
lar --rethrow-all -c prodgenie_nue_dune10kt_1x2x6.fcl -n 1 -o prodgenie_nue_dune10kt_1x2x6_pass_0.root
gdb backtrace is the following:
Reading symbols from lar...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGILL, Illegal instruction.
hep::concurrency::getTSCP (cpuidx=@0x7ffffffe2954: 0) at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/tsan.cc:28
#0 hep::concurrency::getTSCP (cpuidx=@0x7ffffffe2954: 0) at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/tsan.cc:28
#1 0x00002aaab4b7806b in hep::concurrency::RecursiveMutex::lock (this=0x2aaaab4a41e0 , opName=...) at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/RecursiveMutex.cc:72
#2 0x00002aaab4b79d40 in hep::concurrency::RecursiveMutexSentry::RecursiveMutexSentry (this=0x7ffffffe2a60, mutex=..., name=...) at /scratch/workspace/art-release-build/SLF6/debug/build/hep_concurrency/v1_00_02/src/hep_concurrency/RecursiveMutex.cc:283
#3 0x00002aaaab23548a in mf::(anonymous namespace)::logMessage (msg=0x6150960) at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_02/src/messagefacility/MessageLogger/MessageLogger.cc:438
#4 0x00002aaaab23582f in mf::LogErrorObj (msg=0x6150960) at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_02/src/messagefacility/MessageLogger/MessageLogger.cc:547
#5 0x00002aaaaaed6d0b in mf::MaybeLogger_<(mf::ELseverityLevel::ELsev_)3, false>::~MaybeLogger_ (this=0x7ffffffe4da8, __in_chrg=) at /scratch/workspace/art-release-build/SLF6/debug/build/messagefacility/v2_02_02/include/messagefacility/MessageLogger/MessageLogger.h:129
#6 0x00002aaaaaed2e41 in art::run_art_common_ (main_pset=...) at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/src/art/Framework/Art/run_art.cc:287
#7 0x00002aaaaaed215d in art::run_art (argc=8, argv=0x7ffffffe5888, in_desc=..., lookupPolicy=..., handlers=...) at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/src/art/Framework/Art/run_art.cc:206
#8 0x00002aaaaaece117 in artapp (argc=8, argv=0x7ffffffe5888) at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/build-Linux64bit+2.6-2.12-e15-debug/art/Framework/Art/artapp.cc:51
#9 0x0000000000401628 in main (argc=8, argv=0x7ffffffe5888) at /scratch/workspace/art-release-build/SLF6/debug/build/art/v2_11_02/build-Linux64bit+2.6-2.12-e15-debug/art/Framework/Art/lar.cc:9
A debugging session is active.
Inferior 1 [process 84628] will be killed.
Jobs terminated this way were running on worker nodes with the following CPU info from /proc/cpuinfo:
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 1
model name : AMD Opteron(tm) Processor 6282 SE
stepping : 2
cpu MHz : 2599.948
cache size : 2048 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4
bogomips : 5199.89
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 46
model name : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
stepping : 6
microcode : 4294967295
cpu MHz : 2260.949
cache size : 24576 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 16
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm
bogomips : 4521.89
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
stepping : 4
microcode : 4294967295
cpu MHz : 2199.980
cache size : 25600 KB
physical id : 0
siblings : 20
core id : 0
cpu cores : 20
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms
bogomips : 4399.96
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
microcode : 4294967295
cpu MHz : 2659.976
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good unfair_spinlock pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 popcnt aes hypervisor lahf_lm
bogomips : 5319.95
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
Is this a known issue?
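The CPU flag lists above can be checked mechanically on a worker node before a job starts. The following is a minimal sketch, not part of art or hep_concurrency: the helper names cpu_flags and has_flag are hypothetical, and the flag name "rdtscp" is only a guess based on the faulting function name getTSCP; this report does not confirm which instruction is illegal. All four nodes listed above do report the hypervisor flag, and none of them reports rdtscp.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (e.g. empty input or a non-x86 cpuinfo layout).
    return set()


def has_flag(cpuinfo_text, flag):
    """True if the exact flag name appears in the CPU flag list."""
    return flag in cpu_flags(cpuinfo_text)


# Abbreviated sample taken from one of the flag lists in this report.
SAMPLE = (
    "processor : 0\n"
    "flags : fpu vme de pse tsc msr syscall nx lm constant_tsc hypervisor lahf_lm\n"
    "bogomips : 4521.89\n"
)

# "rdtscp" is an assumed flag name, see the note above.
SAMPLE_HAS_RDTSCP = has_flag(SAMPLE, "rdtscp")
```

On a real node the same check would read the file directly, for example has_flag(open("/proc/cpuinfo").read(), "rdtscp"). Matching is on whole flag names, so the presence of tsc or constant_tsc does not count as rdtscp.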
Related issues
History
#1 Updated by Christopher Green over 2 years ago
- Is duplicate of Bug #20246: Illegal Instruction in hep_concurrency added
#2 Updated by Christopher Green over 2 years ago
- Status changed from New to Rejected