Support #9892
Primary on-call
Description
Log various actions related to primary on-call work.
History
#1 Updated by Natalia Ratnikova over 4 years ago
Accepted and acknowledged the page (OK-ed by Lisa).
Cleared the Zabbix alarm and updated/closed the incidents in SNOW.
#2 Updated by Natalia Ratnikova over 4 years ago
Checked the process; it looks OK. Ack-ed the alarm.
#3 Updated by Natalia Ratnikova over 4 years ago
cmsstor361 had an error 3 days ago, and its warranty contract ran out yesterday.
The previous primary missed both this and the cmsstor292 node (still under warranty).
The error on cmsstor361 was auto-corrected; see the kernel messages in [1] below.
Ack-ed the alarm, no repair for now.
cmsstor292 continues to throw errors with "no action required"; see [2].
The node is still under warranty.
Sent email to dcso, asking for advice.
[1]
/var/log/messages-20150816:Aug 16 03:34:02 cmsstor361 xinetd3333: Error parsing attribute server - DISABLING SERVICE [file=/etc/xinetd.d/telnet] [line=15]
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: MC4 Error (node 3): DRAM ECC error detected on the NB.
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: EDAC amd64 MC3: CE ERROR_ADDRESS= 0x777472770
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: CPU:12 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c00c10008080a13
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: MC4_ADDR: 0x0000000777472770
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[2]
[root@cmsstor292 ~]# grep -i error /var/log/messages
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 05:58:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 07:05:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 11:23:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b410004080a13
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 15:45:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 21:43:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b410004080a13
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 24 16:30:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 24 17:28:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 25 01:15:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[root@cmsstor292 ~]#
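To judge how often the corrected errors recur, the EDAC "CE" lines in excerpts like [1] and [2] can be tallied per host, address and day. The following is only a minimal sketch: the regular expression is fitted to the line format quoted in this ticket, and the function name and output layout are hypothetical.

```python
import re
import sys
from collections import Counter

# Fitted to the EDAC lines quoted in this ticket, e.g.
#   Aug 23 05:58:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
# A leading "file:" prefix (grep over several files) is tolerated.
CE_LINE = re.compile(
    r"(?:^|:)(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+kernel:\s+EDAC\s+\S+\s+MC\d+:\s+CE\s+"
    r"ERROR_ADDRESS=\s*(?P<addr>0x[0-9a-fA-F]+)"
)


def summarize_corrected_errors(lines):
    """Count corrected ECC errors keyed by (host, address, 'Mon D')."""
    counts = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:
            day = f"{m['month']} {int(m['day'])}"
            counts[(m["host"], m["addr"], day)] += 1
    return counts


if __name__ == "__main__":
    # Reads syslog lines on stdin and prints one summary line per key.
    for (host, addr, day), n in sorted(summarize_corrected_errors(sys.stdin).items()):
        print(f"{host} {addr} {day}: {n}")
```

Saved under a name of your choice, it could be fed with the output of the grep shown in [2]. A single address repeating many times a day, as on cmsstor292, is easier to report to the vendor in this summarized form.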
#4 Updated by Natalia Ratnikova over 4 years ago
Dear ECF-CIS,
cmsstor292 is still under warranty until Sep-27-2015 and has been reporting HW errors (please see below) for a few days.
A similar problem occurs on cmsstor361, which went out of warranty just yesterday, Aug-24-2015.
Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: MC4 Error (node 3): DRAM ECC error detected on the NB.
Should we open support tickets with the vendor and your group respectively, even though the kernel says the errors were corrected?
Please advise!
Thanks,
Natalia.
On 8/25/15 1:00 PM, Lisa Giacchetti wrote:
I would ask ECF-CIS what they think.
lisa
On 8/25/15 12:53 PM, Natalia Ratnikova wrote:
Hi,
This node is still under warranty, and it has been throwing hardware errors several times a day since Aug 21st (~50 total).
Is it OK to ignore the errors that say "Error Status: Corrected error, no action required."?
Or should we still file a service call?
Thanks,
Natalia.
#5 Updated by Natalia Ratnikova over 4 years ago
Check who manages this node:
in the ENC, the environment is ptader's, so check with him; it is a work in progress.
[root@cmsosgce2 ~]# facter primary
unknown
[root@cmsosgce2 ~]# facter secondary
dcso
[root@cmsosgce2 ~]# facter role
cms_ce
[root@cmsosgce2 ~]# facter puppetenvironment
ptader_htcondor_ce
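When the same lookup has to be repeated for several nodes, a pasted session like the one above can be turned into a small table. This is a rough sketch only: it assumes the prompt format shown here and that each `facter <fact>` call prints its value on the next non-empty line. The function names are hypothetical, and `primary`, `secondary`, `role` and `puppetenvironment` are the site's own facts as they appear in the transcript.

```python
import re

# Prompt lines as pasted above: "[root@host ~]# facter <fact>"
PROMPT = re.compile(
    r"^\[[^@\]]+@(?P<host>\S+)\s+[^\]]*\]#\s+facter\s+(?P<fact>\S+)\s*$"
)


def parse_facter_transcript(text):
    """Return {host: {fact: value}} from a pasted 'facter <fact>' session."""
    facts = {}
    pending = None  # (host, fact) still waiting for its value line
    for raw in text.splitlines():
        line = raw.strip()
        m = PROMPT.match(line)
        if m:
            pending = (m["host"], m["fact"])
            continue
        if pending and line:
            host, fact = pending
            facts.setdefault(host, {})[fact] = line
            pending = None
    return facts


def hosts_without_primary(facts):
    """List hosts whose 'primary' fact is missing or reported as 'unknown'."""
    return sorted(
        host for host, f in facts.items()
        if f.get("primary", "unknown") == "unknown"
    )
```

For the session above this would flag cmsosgce2, whose primary is reported as unknown while the Puppet environment points to ptader.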
#6 Updated by Natalia Ratnikova over 4 years ago
Informed Merina about both issues on cmswn2002 and cmswn1275.
Below are the initial investigation details passed on to Merina.
cmswn2002:
Swatch: end_request: I/O error
See system log excerpt below.
cmswn1275 - unavailable. Connected to the console; it looks like it is trying to reboot.
On the console:
PXE-M0F: Exiting Intel Boot Agent.
Intel(R) Boot Agent GE v1.2.70
Copyright (C) 1997-2007, Intel Corporation
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Intel Boot Agent.
Operating System not found
[root@cmswn2002 ~]# grep 'I/O' /var/log/messages
Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS2 at I/O 0x3e8 (irq = 5) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: 00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: 00:09: ttyS2 at I/O 0x3e8 (irq = 5) is a 16550A
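Note that a plain grep on 'I/O' also matches the serial-port boot messages shown above. A narrower filter on the Swatch string "end_request: I/O error" could look like the sketch below. The optional "dev ..., sector ..." fields are an assumption about how the kernel formats this message, and the function name is hypothetical.

```python
import re

# "end_request: I/O error" is the Swatch string quoted in this ticket.
# The optional ", dev <name>, sector <n>" tail is an assumed message layout.
IO_ERROR = re.compile(
    r"end_request: I/O error(?:, dev (?P<dev>\S+?), sector (?P<sector>\d+))?"
)


def find_io_errors(lines):
    """Return (line, device, sector) for each block-layer I/O error line.

    Device and sector are None when the line lacks those fields.
    """
    hits = []
    for line in lines:
        m = IO_ERROR.search(line)
        if m:
            hits.append((line.rstrip("\n"), m["dev"], m["sector"]))
    return hits
```

Run over /var/log/messages, this skips the serial8250 lines and keeps only entries that carry the Swatch string.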
#7 Updated by Natalia Ratnikova over 4 years ago
cmswn1844 -
The HW errors filled up the system log.
Notified Merina (system primary):
Also cmswn1844 has hardware errors, so frequent that check_mk has an alarm about the system log exceeding its max size.
The node is OOW; time to retire?
=========================
cmspnfs1 - alarm from Aug 20 about IPMI - ignore for now (likely related to reboot and mount problems)
=========================
cmsxrootd1 - critical number of threads; informed Gerard via AIM:
Hi Gerard,
check_mk shows a critical number of threads on cmsxrootd1:
CRIT  Number of threads  CRIT - 20816 threads (critical at 8000)  2015-08-22 12:50:42  54 sec
=========================
14:10 - many alarms on cmsdcacheadmindisk for slow response
- checked; everything looks fine and monitoring works. Alarms were gone in 3 mins.
#8 Updated by Natalia Ratnikova over 4 years ago
This is the LPC condor collector; sent email to Krista, the system primary.