Bug #1066

DCM stops reporting to Run Control

Added by Peter Shanahan over 8 years ago. Updated over 8 years ago.

Status: New
Priority: High
Assignee:
Category: -
Start date: 03/17/2011
Due date:
% Done: 0%
Estimated time:
component: base
Duration:

Description

This seems to have started on or around 3/15/11. The main symptom is that the DCM stops responding to Run Control heartbeat messages and may stop sending Ganglia data, but still sends out normal hit data (i.e., the event display continues to look good). So far, whenever this has happened, it has been impossible to log into the DCM.

(As of the creation of this issue, there is no console line available. Coming soon.)

History

#1 Updated by Peter Shanahan over 8 years ago

Note: this effect does not seem to coincide with an increase in CPU usage by the affected DCM.

#2 Updated by Leon Mualem over 8 years ago

This seems to have happened again yesterday. I don't know whether it was also impossible to log in. Here's the logbook entry.

5504, Tony Mann (mann), 04/16/2011 08:12:12 General
Karen K. phoned; she is at the detector and is going to be doing some scintillator filling this morning. She says that the previous experience is that the DAQ crashes after she has had the pumps on for a while.

In an apparently independent happening, dcm-3-02-01 and dcm-302-02 have started skipping heartbeats.
Comments:

Leon Mualem (mualem) 04/16/2011 08:39:02
CPU load is not high on any of the DCMs at this time, so I don’t know why heartbeats (echoes) are not being received.

#3 Updated by Leon Mualem over 8 years ago

Further inspection of Ganglia does show high CPU and data corruption at this time, so it looks like the occurrence yesterday is a different issue.

#4 Updated by Andrew Norman over 8 years ago

This morning (Sunday Apr. 17th) the machine at NDOS which has the console ports appears to no longer allow logins on the public side of the network.

I'm wondering if what we are seeing is a real problem with the network that manifests itself in both the DCMs and now also this machine (e.g., the switch flipping out, or a network scanner running that causes problems).


