Dcm problems March2013¶
DCM "hard" failures.¶
DCMs that can never be contacted.
DCM | First reported by | Date first reported | Occurrences | Comments | Resolution |
Fragile DCMs¶
DCMs that can be contacted for some amount of time before they seize.
Notes:
- Please use the full dcm name in the table below, to aid searching.
- The "Occurrences" count is the number of confirmed times the problem has occurred. It may well be greater...
DCM (location) | DCM (S/N) | First reported by | Date first reported | ECL Entries | Occurrences | Comments | Resolution |
dcm-2-03-06 | dcm-1223 | Was on parole, but now a clear failure on 3/20/13 during testing | |||||
dcm-2-03-08 | dcm-1228 | Highly repeatable | Fails during ssh tests | ||||
dcm-2-05-12 | dcm-1083 | Peter | 3/15/13 | Quite repeatable | Failed in first ~12 hours of ssh test (every 2 seconds). Failed again on 3/20 and again on 3/21. | ||
dcm-2-06-07 | dcm-1098 | Peter | 3/20/13 | Quite repeatable | Failed in first ~30 minutes of first tests on 3/20. Failed again later on 3/20, but with an ssh key exchange error, not a crash. | ||
sent to fnal (was dcm-2-02-11) | dcm-1095 | Peter | near Feb.15 | back@fnal(15Mar)-(ssh-ps-aux+memtester+top) at least 1 occurrence of hang with no console output | |||
sent to fnal (was dcm-2-02-03) | dcm-1158 | Peter | near Feb.15 | Rick K can make it fail on the bench in minutes by doing 10ms top torture test | |||
removed 3/26/13 (dcm-2-01-03) | dcm-1143 | Peter | 3/8/13 | 800 802 808 | Highly repeatable | Andrew noted it was using its SN name on 3/8/13 | |
removed 3/26/13 (dcm-2-03-01) | dcm-1220 | Peter | 3/8/13 | 800 819 | Highly repeatable | Got it to crash with repeated ssh "ps aux : grep ps" | |
removed 3/26/13 (dcm-2-03-02) | dcm-1225 | Peter | 3/8/13 | 800 802 819 | Quite repeatable | Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects different manifestation of same problem. Hung during tests on 3/20/13 | |
removed 3/26/13 (dcm-2-03-03) | dcm-1227 | Peter | 3/8/13 | 800 802 | Highly repeatable | Doesn't seem to finish reboot (Andrew) | |
removed 3/20/13 (dcm-2-04-07) | dcm-1150 | needs work | needs work | highly repeatable | |||
removed 3/20/13 (dcm-2-04-10) | dcm-1135 | Peter | 3/8/13, 3/19/13 | 810 830 | Highly repeatable | Hung during ps/grep command, crashed during reserve resources. Got it to crash twice with repeated ssh "ps aux : grep ps", http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=977 using "top tests" after about 20 minutes | |
removed 3/20/13 (dcm-2-04-11) | dcm-1075 | Peter | 2/15/13 | 494 819 | highly repeatable |
DCMs that were in "fragile" list, but apparently due to other problems.¶
DCM (location) | DCM (S/N) | First reported by | Date first reported | ECL Entries | Occurrences | Comments | Resolution |
dcm-2-01-08 | dcm-1039 (?) | Peter | 3/8/13 | 800 801 | 1 | Seems to have recovered. | |
dcm-2-03-09 | dcm-1222 | Peter | 2/18/12 | Problems were general network/DDS issues. Likely OK. | |||
dcm-2-04-03 | dcm-1151 | Likely OK | |||||
dcm-2-04-06 | dcm-1085 | Likely OK | |||||
dcm-2-04-09 | dcm-1224 | Maybe OK | |||||
dcm-2-04-12 | dcm-1096 |
DCMs from the first batch of 50¶
Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better than the rest.
DCM (location) | DCM (S/N) | Comments |
dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13. Pegged CPU at times, but A) that is not really a symptom of the flaky DCMs, and B) it looks consistent with the high data rates on this DCM |
dcm-2-01-07 | dcm1038 | |
dcm-2-01-08 | dcm1039 | |
dcm-2-01-09 | dcm1041 | |
dcm-2-01-12 | dcm1043 | |
dcm-2-01-11 | dcm1044 | |
dcm-2-04-05 | dcm1051 |
DCMs not failing top torture tests¶
These are the DCMs that have not failed in more than a continuous month of the multi-top torture test, as of 2013-04-20.
Absence from this list in no way indicates badness, since there could be lots of reasons for a reboot in the preceding month.
DCM (location) | DCM (S/N) | Comments |
dcm-2-04-01 | ||
dcm-2-04-02 | ||
dcm-2-04-03 | ||
dcm-2-04-08 | ||
dcm-2-04-09 | ||
dcm-2-04-12 | ||
dcm-2-05-08 | ||
dcm-2-05-09 | ||
dcm-2-05-10 | ||
dcm-2-05-11 |
Test Procedures¶
The following documents the different types of testing that were done:
- CPU Burning Tests
- Network Data Copy Tests
- Multiple TOP Tests
- Repeated ssh
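
As an illustration of the repeated ssh style of test referred to in the tables above (an ssh probe every 2 seconds running a ps/grep command), here is a minimal Python sketch. It is not the script that was actually used. The function names, the ssh options (`BatchMode`, `ConnectTimeout`), the timeouts, and the reading of "ps aux : grep ps" as `ps aux | grep ps` are all assumptions made for the example.

```python
import subprocess
import time


def build_ssh_command(host, remote_cmd="ps aux | grep ps", timeout_s=10):
    """Return the argv list for one non-interactive ssh probe.

    The remote command and the ssh options are assumptions for this sketch.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",   # give up if the DCM does not answer
        host,
        remote_cmd,
    ]


def probe_once(host, runner=subprocess.run, timeout_s=10):
    """Run one probe; return True only if ssh exited with status 0."""
    try:
        result = runner(
            build_ssh_command(host, timeout_s=timeout_s),
            capture_output=True,
            timeout=timeout_s + 5,  # hard limit in case the session hangs
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def repeated_ssh_test(host, interval_s=2.0, max_probes=None,
                      runner=subprocess.run, sleep=time.sleep):
    """Probe `host` every `interval_s` seconds until a probe fails.

    Returns the number of successful probes before the first failure,
    or `max_probes` if that many probes succeed.
    """
    successes = 0
    while max_probes is None or successes < max_probes:
        if not probe_once(host, runner=runner):
            break
        successes += 1
        sleep(interval_s)
    return successes


if __name__ == "__main__":
    # Example: probe one DCM until it stops responding.
    count = repeated_ssh_test("dcm-2-05-12")
    print(f"dcm-2-05-12 survived {count} probes")
```

The `runner` and `sleep` parameters exist only so the loop can be exercised without a real DCM. A failed probe here does not distinguish a crash from, say, an ssh key exchange error, so the console output would still need to be checked by hand.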