Dcm problems March2013 » History » Version 59

Version 58 (Peter Shanahan, 03/21/2013 11:19 AM) → Version 59/61 (Peter Shanahan, 03/27/2013 10:01 AM)

{{TOC}}

h1. Dcm problems March2013

h2. DCM "hard" failures.

DCMs that can never be contacted.

|DCM| First reported by| Date first reported| Occurrences| Comments | Resolution |

h2. Fragile DCMs

DCMs that can be contacted for some amount of time before they seize.

Notes:

* Please use the full dcm name in the table below, to aid searching.
* The "Occurrences" count is the number of confirmed times the problem has occurred. The true count may well be greater.

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-03 | dcm-1143 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "808":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=808 | Highly repeatable | Andrew noted it was using its SN name on 3/8/13 | |
| dcm-2-03-01 | dcm1220 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Highly repeatable | Got it to crash with repeated ssh "ps aux : grep ps" | |
| dcm-2-03-02 | dcm-1225 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Quite repeatable | Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects different manifestation of same problem. Hung during tests on 3/20/13 | |
| dcm-2-03-03 | dcm-1227 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 | Highly repeatable | Doesn't seem to finish reboot (Andrew) | |
| dcm-2-03-06 | dcm-1223 | | | | | Was on parole, but now a clear failure on 3/20/13 during testing | |
| dcm-2-03-08 | dcm-1228 | | | | Highly repeatable |Fails during ssh tests | |
| dcm-2-05-12 | dcm-1083 | Peter | 3/15/13 | | Quite repeatable | Failed in first ~12 hours of ssh test (every 2 seconds). Failed again on 3/20, again 3/21. | |
| dcm-2-06-07 | dcm-1098 | Peter | 3/20/13 | | Quite repeatable | Failed in first ~30 minutes of first tests on 3/20. Failed again later on 3/20, but with ssh exchange key error, not a crash. | |
| sent to fnal (was dcm-2-02-11) | dcm-1095 | Peter | near Feb.15 | | | Back at fnal (15Mar), tested with ssh-ps-aux+memtester+top: at least 1 occurrence of a hang with no console output | |
| sent to fnal (was dcm-2-02-03) | dcm-1158 | Peter | near Feb.15 | | | Rick K can make it fail on the bench in minutes by doing 10ms top torture test | |
| removed 3/26/13 (dcm-2-01-03) | dcm-1143 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "808":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=808 | Highly repeatable | Andrew noted it was using its SN name on 3/8/13 | |
| removed 3/26/13 (dcm-2-03-01) | dcm1220 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Highly repeatable | Got it to crash with repeated ssh "ps aux : grep ps" | |
| removed 3/26/13 (dcm-2-03-02) | dcm-1225 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Quite repeatable | Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects different manifestation of same problem. Hung during tests on 3/20/13 | |
| removed 3/26/13 (dcm-2-03-03) | dcm-1227 |Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 | Highly repeatable | Doesn't seem to finish reboot (Andrew) | |
| removed 3/20/13 (dcm-2-04-07) | dcm-1150 | | needs work | needs work | highly repeatable | | |
| removed 3/20/13 (dcm-2-04-10) | dcm-1135 | Peter | 3/8/13, 3/19/13 | "810":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=810 "830":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=830 | Highly repeatable | Hung during ps/grep command, crashed during reserve resources. Got it to crash twice with repeated ssh "ps aux : grep ps", http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=977 using "top tests" after about 20 minutes | |
| removed 3/20/13 (dcm-2-04-11) | dcm-1075 | Peter | 2/15/13 | "494":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=494 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | highly repeatable | | |

h2. DCMs that were in "fragile" list, but apparently due to other problems.

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-08 | dcm-1039 (?) | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "801":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=801 | 1 | Seems to have recovered. | |
| dcm-2-03-09 | dcm-1222 | Peter | 2/18/12 | | | Problems were general network/DDS issues. Likely OK. | |
| dcm-2-04-03 | dcm-1151 | | | | | Likely OK | |
| dcm-2-04-06 | dcm-1085 | | | | | | Likely OK |
| dcm-2-04-09 | dcm-1224 | | | | | Maybe OK | |
| dcm-2-04-12 | dcm-1096 | | | | | | |

h2. DCMs from the first batch of 50

Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better than the rest.

As a start, here is where they live:
| DCM (location) | DCM (S/N) | Comments |
|dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13. Pegged CPU at times, which (A) is not really a symptom of the flaky DCMs and (B) looks consistent with the high data rates on this DCM |
|dcm-2-01-07 | dcm1038| |
|dcm-2-01-08 | dcm1039| |
|dcm-2-01-09 | dcm1041| |
|dcm-2-01-12 | dcm1043| |
|dcm-2-01-11 | dcm1044| |
|dcm-2-04-05 | dcm1051| |
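
To help sort the DCMs in the tables on this page by batch, here is a small Python sketch that checks whether an S/N name falls in the first-batch range quoted above (1006-1055). The helper names @serial_number@ and @is_first_batch@ are made up for this example, and the parsing assumes names of the form "dcm-1143" or "dcm1220" as they appear in the tables.

```python
import re

# S/Ns 1006-1055 inclusive, as quoted in the text; range() excludes its end value.
FIRST_BATCH = range(1006, 1056)


def serial_number(name):
    """Extract the numeric S/N from a name such as 'dcm-1143' or 'dcm1220'."""
    match = re.fullmatch(r"dcm-?(\d+)", name.strip())
    if match is None:
        raise ValueError(f"not a DCM S/N name: {name!r}")
    return int(match.group(1))


def is_first_batch(name):
    """Return True if the S/N lies in the first batch of 50 DCMs."""
    return serial_number(name) in FIRST_BATCH
```

With these definitions, @is_first_batch("dcm1032")@ is true and @is_first_batch("dcm-1143")@ is false. The range also contains exactly 50 serial numbers, which matches the batch size.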

h2. Test Procedures

The following pages document the different types of testing that were done:

* [[CPU Burning Tests]]
* [[Network Data Copy Tests]]
* [[Multiple TOP Tests]]
* [[Repeated ssh]]
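
As a rough illustration of the repeated ssh test mentioned in the table comments (a remote "ps aux | grep ps" issued every couple of seconds until the DCM stops responding), here is a minimal Python sketch. It is not the script that was actually used; the function names, the ssh options, the timeout values and the injectable @run@ and @sleep@ hooks are all assumptions made for this example.

```python
import subprocess
import time


def build_ssh_command(host, remote_cmd="ps aux | grep ps", connect_timeout_s=10):
    """Return the argv for one non-interactive ssh probe of `host`."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt for a password
        "-o", f"ConnectTimeout={connect_timeout_s}",
        host,
        remote_cmd,
    ]


def repeated_ssh_test(host, interval_s=2.0, max_iterations=100, timeout_s=30,
                      run=subprocess.run, sleep=time.sleep):
    """Probe `host` repeatedly over ssh.

    Returns the 1-based iteration of the first failure (non-zero exit
    status or timeout), or None if every probe succeeded.
    `run` and `sleep` are hypothetical hooks so the loop can be tested
    without a real DCM.
    """
    cmd = build_ssh_command(host)
    for iteration in range(1, max_iterations + 1):
        try:
            result = run(cmd, capture_output=True, timeout=timeout_s)
            ok = result.returncode == 0
        except subprocess.TimeoutExpired:
            # A seized DCM typically shows up as a probe that never returns.
            ok = False
        if not ok:
            return iteration
        sleep(interval_s)
    return None
```

For example, @repeated_ssh_test("dcm-2-05-12", interval_s=2.0, max_iterations=21600)@ would cover roughly the 12-hour window mentioned above, assuming each probe returns quickly. A failure reported by this loop only says the probe failed; telling a crash from, say, an ssh key exchange error still requires looking at the console or the ssh output.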