Support #9892

Primary on-call

Added by Natalia Ratnikova over 4 years ago. Updated over 4 years ago.

Status: Assigned
Priority: Normal
Start date: 08/25/2015
Due date:
% Done: 0%
Estimated time:
Spent time:
component: base
Scope: Internal
Experiment: -
Stakeholders:
Co-Assignees:
Categorization: -
Duration:

Description

Log various actions related to primary on-call work.

History

#1 Updated by Natalia Ratnikova over 4 years ago

Accepted and acknowledged the page (OK-ed by Lisa).

Cleared the zabbix alarm and updated/closed the incidents in SNOW.

#2 Updated by Natalia Ratnikova over 4 years ago

Checked that the process looks OK; ack-ed the alarm.

#3 Updated by Natalia Ratnikova over 4 years ago

cmsstor361 had an error 3 days ago, and its warranty contract ran out yesterday.
The previous primary missed this, as well as the errors on the cmsstor292 node (still under warranty).

The error on 361 was corrected automatically; see the kernel messages in [1] below.
Ack-ed the alarm, no repair for now.

cmsstor292 continues to throw errors with "no action required"; see [2].
The node is still under warranty.

Sent email to dcso, asking for advice.

[1]

/var/log/messages-20150816:Aug 16 03:34:02 cmsstor361 xinetd3333: Error parsing attribute server - DISABLING SERVICE [file=/etc/xinetd.d/telnet] [line=15]
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: MC4 Error (node 3): DRAM ECC error detected on the NB.
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: EDAC amd64 MC3: CE ERROR_ADDRESS= 0x777472770
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: CPU:12 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c00c10008080a13
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: MC4_ADDR: 0x0000000777472770
/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
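
The flags the kernel prints inside MC4_STATUS[...] can be cross-checked against the raw register value. The following is a minimal sketch, assuming the standard x86 machine-check status layout (Val, Over, UC, En, MiscV, AddrV, PCC in bits 63 to 57) and the AMD ECC bits (CECC in bit 46, UECC in bit 45); the function names are made up for illustration and are not part of any tool used on these nodes.

```python
# Bit positions of the status flags. Assumption: standard x86 MCA status
# layout plus the AMD-specific ECC bits; verify against the CPU family docs.
FLAGS = {
    "Val": 63,    # register contents are valid
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # MISC register holds valid data
    "AddrV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # correctable ECC error
    "UECC": 45,   # uncorrectable ECC error
}


def decode_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}


def is_corrected(status):
    """True if the register is valid and reports a corrected error."""
    flags = decode_status(status)
    return flags["Val"] and not flags["UC"] and not flags["PCC"]


if __name__ == "__main__":
    # Value taken from the cmsstor361 excerpt above.
    value = 0x9C00C10008080A13
    set_flags = [name for name, on in decode_status(value).items() if on]
    print(set_flags, "corrected:", is_corrected(value))
```

For the cmsstor361 value this agrees with the kernel's own summary: UC and PCC are clear and CECC is set, which is what "Corrected error, no action required" reflects.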

[2]

[root@cmsstor292 ~]# grep -i error /var/log/messages
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 05:58:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 07:05:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 07:05:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 11:23:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b410004080a13
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 11:23:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 15:45:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 15:45:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 23 21:43:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b410004080a13
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 23 21:43:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 24 16:30:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 24 16:30:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 24 17:28:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 24 17:28:04 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: MC4 Error (node 6): DRAM ECC error detected on the NB.
Aug 25 01:15:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: CPU:24 (10:9:1) MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c4b400004080a13
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150
Aug 25 01:15:34 cmsstor292 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[root@cmsstor292 ~]#
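
To judge how often the corrected errors recur and whether they always hit the same address, the EDAC lines can be tallied per host, day, and address. This is a minimal sketch, assuming the syslog format shown in the excerpts above; the function name and the sample list are illustrative only.

```python
import re
from collections import Counter

# Matches the "EDAC amd64 ... CE ERROR_ADDRESS=" lines shown above.
# re.search is used so a leading "/var/log/messages-YYYYMMDD:" prefix
# (as produced by grep over several files) is tolerated.
EDAC_RE = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d+)\s+[\d:]+\s+(?P<host>\S+)\s+"
    r"kernel: EDAC amd64 MC\d+: CE ERROR_ADDRESS= (?P<addr>0x[0-9a-f]+)"
)


def count_corrected(lines):
    """Count corrected-error reports keyed by (host, 'Mon DD', address)."""
    counts = Counter()
    for line in lines:
        m = EDAC_RE.search(line)
        if m:
            day = "{} {}".format(m.group("month"), m.group("day"))
            counts[(m.group("host"), day, m.group("addr"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpts above.
    sample = [
        "Aug 23 05:58:04 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150",
        "Aug 23 05:58:04 cmsstor292 kernel: [Hardware Error]: MC4_ADDR: 0x0000000df29a8150",
        "Aug 23 07:05:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150",
        "Aug 24 16:30:34 cmsstor292 kernel: EDAC amd64 MC6: CE ERROR_ADDRESS= 0xdf29a8150",
        "/var/log/messages-20150823:Aug 22 05:01:05 cmsstor361 kernel: EDAC amd64 MC3: CE ERROR_ADDRESS= 0x777472770",
    ]
    for key, n in sorted(count_corrected(sample).items()):
        print(key, n)
```

In practice the input would be the lines of /var/log/messages and its rotated copies. A single address repeating many times a day, as on cmsstor292, points at one memory location rather than scattered errors.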

#4 Updated by Natalia Ratnikova over 4 years ago

Dear ECF-CIS,

cmsstor292 is still under warranty until Sep-27-2015 and has been reporting HW errors (please see below) for a few days.

A similar problem occurred on cmsstor361, which went out of warranty just yesterday, Aug-24-2015.

Aug 22 05:01:05 cmsstor361 kernel: [Hardware Error]: MC4 Error (node 3): DRAM ECC error detected on the NB.

Should we open support tickets with the vendor and with your group, respectively, even though the kernel says the errors were corrected?
Please advise!

Thanks,
Natalia.

On 8/25/15 1:00 PM, Lisa Giacchetti wrote:

I would ask ECF-CIS what they think.

lisa

On 8/25/15 12:53 PM, Natalia Ratnikova wrote:

Hi,

this node is still under warranty, and it has been throwing hardware errors several times a day since Aug 21st (~50 total).

Is it OK to ignore the errors that say: "Error Status: Corrected error, no action required."?

Or should we still file a service call?

Thanks,
Natalia.

#5 Updated by Natalia Ratnikova over 4 years ago

Checked who manages this node:
the ENC environment is ptader's; check with him. It is a work in progress.

[root@cmsosgce2 ~]# facter primary
unknown
[root@cmsosgce2 ~]# facter secondary
dcso
[root@cmsosgce2 ~]# facter role
cms_ce
[root@cmsosgce2 ~]# facter puppetenvironment
ptader_htcondor_ce

#6 Updated by Natalia Ratnikova over 4 years ago

Informed Merina about both issues on cmswn2002 and cmswn1275.

Below are the initial investigation details passed on to Merina.

cmswn2002:
Swatch: end_request: I/O error

See system log excerpt below.

cmswn1275: unavailable. Connected to the console; it looks like it is trying to reboot.

On the console:

PXE-M0F: Exiting Intel Boot Agent.

Intel(R) Boot Agent GE v1.2.70
Copyright (C) 1997-2007, Intel Corporation

PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Intel Boot Agent.
Operating System not found

[root@cmswn2002 ~]# grep 'I/O' /var/log/messages
Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS2 at I/O 0x3e8 (irq = 5) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: 00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: 00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
Aug 25 13:02:15 cmswn2002 kernel: 00:09: ttyS2 at I/O 0x3e8 (irq = 5) is a 16550A
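
The grep for 'I/O' above also matches serial port boot messages, so a narrower pattern helps isolate the block-device errors that Swatch reported. This is a minimal sketch, assuming the usual kernel message form "end_request: I/O error, dev <device>, sector <number>"; the positive sample line is hypothetical and is not taken from cmswn2002.

```python
import re

# Assumed kernel message form: "end_request: I/O error, dev sdX, sector N".
IO_ERROR_RE = re.compile(
    r"kernel: end_request: I/O error, dev (?P<dev>\S+), sector (?P<sector>\d+)"
)


def find_io_errors(lines):
    """Return (device, sector) for each block-layer I/O error line."""
    hits = []
    for line in lines:
        m = IO_ERROR_RE.search(line)
        if m:
            hits.append((m.group("dev"), int(m.group("sector"))))
    return hits


if __name__ == "__main__":
    sample = [
        # Copied from the excerpt above: a boot message, not an error.
        "Aug 25 13:02:15 cmswn2002 kernel: serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A",
        # HYPOTHETICAL line, for illustration only.
        "Aug 25 12:00:00 examplehost kernel: end_request: I/O error, dev sda, sector 12345",
    ]
    print(find_io_errors(sample))
```

Only the second sample line is reported; the serial port lines from the excerpt are skipped.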

#7 Updated by Natalia Ratnikova over 4 years ago

cmswn1844:

The HW errors filled up the system log.
Notified Merina (system primary):

Also, cmswn1844 has hardware errors so frequent that check_mk has an alarm about the system log exceeding its max size.
The node is OOW; time to retire?

=========================

cmspnfs1 - alarm from Aug 20 about IPMI - ignore for now (likely related to reboot and mount problems)

=========================
cmsxrootd1 - critical number of threads: informed Gerard via AIM:

Hi Gerard,
check_mk shows a critical number of threads on cmsxrootd1:
CRIT Number of threads: CRIT - 20816 threads (critical at 8000) 2015-08-22 12:50:42 54 sec

=========================
14:10 - many alarms on cmsdcacheadmindisk for slow response

- Checked: everything looks fine and monitoring works. The alarms were gone in 3 minutes.

#8 Updated by Natalia Ratnikova over 4 years ago

This is the LPC condor collector; sent email to Krista, the system primary.


