Dcm problems March2013 » History » Version 45

Version 44 (Peter Shanahan, 03/19/2013 03:12 PM) → Version 45/61 (Peter Shanahan, 03/19/2013 03:13 PM)

{{TOC}}

h1. Dcm problems March2013

h2. DCM "hard" failures.

DCMs that can never be contacted.

|DCM| First reported by| Date first reported| Occurrences| Comments | Resolution |

h2. Fragile DCMs

DCMs that can be contacted for some amount of time before they seize.

Notes:

* Please use the full DCM name in the table below, to aid searching.
* The "Occurrences" count is the number of confirmed times the problem has occurred. The true number may well be greater.

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-03 | dcm-1143 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "808":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=808 | Highly repeatable 2 | Andrew noted it was using its SN name on 3/8/13 | |
| dcm-2-03-01 | dcm1220 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Highly repeatable | Got it to crash with repeated ssh "ps aux : grep ps" | |
| dcm-2-03-02 | dcm-1225 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | 2 | Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects different manifestation of same problem. | |
| dcm-2-03-03 | dcm-1227 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 | Highly repeatable | Doesn't seem to finish reboot (Andrew) | |
| dcm-2-03-08 | dcm-1228 | ? | ? | | | One clear failure, during ssh test on 3/17/13| |
| dcm-2-04-07 | dcm-1150 | | needs work | needs work | highly repeatable | | |
| dcm-2-04-10 | dcm-1135 | Peter | 3/8/13 | "810":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=810 "830":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=830 | Highly repeatable | Hung during ps/grep command, crashed during reserve resources. Got it to crash twice with repeated ssh "ps aux : grep ps" | |
| dcm-2-04-11 | dcm-1075 | Peter | 2/15/13 | "494":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=494 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | highly repeatable | | |
| sent to FNAL | dcm-1095 | Peter | near Feb. 15 | | | Back at FNAL (15 Mar). Tested with ssh ps aux + memtester + top: at least 1 occurrence of a hang with no console output | |
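
Several of the comments above mention provoking a hang with repeated ssh "ps aux : grep ps" commands. The exact procedure is not recorded here, so the following is only a minimal sketch of such a loop. It assumes the ":" in the table stands for a shell pipe (a literal pipe would break the table markup), and the function names, attempt count and timeouts are hypothetical choices, not the values used in the actual tests.

```python
import subprocess


def build_probe_command(host):
    """Return the ssh command line for one 'ps aux | grep ps' probe."""
    # BatchMode avoids waiting on a password prompt and ConnectTimeout bounds
    # the connection attempt; both are standard OpenSSH client options.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            host, "ps aux | grep ps"]


def probe_until_failure(host, max_attempts=100, timeout_s=30, run=subprocess.run):
    """Repeat the probe; return the 1-based attempt that failed, or None.

    A probe counts as failed if ssh exits non-zero or does not return
    within timeout_s seconds (treated here as a hung DCM).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = run(build_probe_command(host),
                         capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return attempt
        if result.returncode != 0:
            return attempt
    return None


# Example (not executed here): probe_until_failure("dcm-2-03-01")
```

The @run@ parameter exists only so the loop can be exercised without a real DCM; by default it calls @subprocess.run@. A returned attempt number is a hint that the DCM stopped responding, not proof of the failure mode described in the table, so a console check is still needed.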

h2. DCMs that were in the "fragile" list, but whose problems were apparently due to other causes.

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-08 | dcm-1039 (?) | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "801":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=801 | 1 | Seems to have recovered. | |
| dcm-2-03-06 | dcm-1223 | ? | ? | | | Not included in partition since 2/20/13, but no comment why in ECL. Likely OK| |
| dcm-2-03-09 | dcm-1222 | Peter | 2/18/12 | | | Problems were general network/DDS issues. Likely OK. | |
| dcm-2-04-03 | dcm-1151 | | | | | Likely OK | |
| dcm-2-04-06 | dcm-1085 | | | | | Likely OK | |
| dcm-2-04-09 | dcm-1224 | | | | | Maybe OK | |
| dcm-2-04-12 | dcm-1096 | | | | | | |

h2. DCMs from the first batch of 50

Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better
than the rest.

As a start, here is where they live:
| DCM (location) | DCM (S/N) | Comments |
|dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13. Pegged CPU at times, but A) that is not really a symptom of the flaky DCMs, and B) it looks consistent with the high data rates on this DCM. |
|dcm-2-01-07 | dcm1038| |
|dcm-2-01-08 | dcm1039| |
|dcm-2-01-09 | dcm1041| |
|dcm-2-01-12 | dcm1043| |
|dcm-2-01-11 | dcm1044| |
|dcm-2-04-05 | dcm1051| |
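
To help with Rick's question, a small check of S/Ns against the first-batch range (1006-1055) can be sketched as below. The helper names are hypothetical, and the @fragile@ list is simply copied from the S/N column of the "Fragile DCMs" table on this page, so it is only as complete as that table.

```python
import re

# First batch of 50 DCMs: S/Ns 1006 through 1055 inclusive.
FIRST_BATCH = range(1006, 1056)


def serial_number(name):
    """Extract the integer S/N from 'dcm-1039', 'dcm1039' or 'dcm-1039 (?)'."""
    match = re.search(r"dcm-?(\d{4})\b", name)
    if match is None:
        raise ValueError(f"no serial number in {name!r}")
    return int(match.group(1))


def is_first_batch(name):
    """True if the S/N in name falls inside the first-batch range."""
    return serial_number(name) in FIRST_BATCH


# S/Ns as written in the "Fragile DCMs" table.
fragile = ["dcm-1143", "dcm1220", "dcm-1225", "dcm-1227", "dcm-1228",
           "dcm-1150", "dcm-1135", "dcm-1075", "dcm-1095"]

print([name for name in fragile if is_first_batch(name)])  # prints []
```

With the S/Ns currently listed, none of the fragile DCMs is from the first batch, while dcm-1039 in the "other problems" table is. Since only 7 first-batch DCMs are listed here as installed, this is a starting point rather than a conclusion.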

h2. Test Procedures

The following pages document the different types of testing that were done:

* [[CPU Burning Tests]]
* [[Network Data Copy Tests]]