DCM problems, March 2013

DCM "hard" failures.

DCMs that can never be contacted.

DCM First reported by Date first reported Occurrences Comments Resolution

Fragile DCMs

DCMs that can be contacted for some amount of time before they seize.

Notes:

  • Please use the full DCM name in the table below, to aid searching.
  • The "Occurrences" count is the number of confirmed times the problem has occurred. It may well be greater...
DCM (location) DCM (S/N) First reported by Date first reported ECL Entries Occurrences Comments Resolution
dcm-2-03-06 dcm-1223 Was on parole, but now a clear failure on 3/20/13 during testing
dcm-2-03-08 dcm-1228 Highly repeatable Fails during ssh tests
dcm-2-05-12 dcm-1083 Peter 3/15/13 Quite repeatable Failed in first ~12 hours of ssh test (every 2 seconds). Failed again on 3/20, again 3/21.
dcm-2-06-07 dcm-1098 Peter 3/20/13 Quite repeatable Failed in first ~30 minutes of first tests on 3/20. Failed again later on 3/20, but with ssh exchange key error, not a crash.
sent to fnal (was dcm-2-02-11) dcm-1095 Peter near Feb.15 back@fnal(15Mar)-(ssh-ps-aux+memtester+top) at least 1 occurrence of hang with no console output
sent to fnal (was dcm-2-02-03) dcm-1158 Peter near Feb.15 Rick K can make it fail on the bench in minutes by doing 10ms top torture test
removed 3/26/13 (dcm-2-01-03) dcm-1143 Peter 3/8/13 800 802 808 Highly repeatable Andrew noted it was using its S/N name on 3/8/13
removed 3/26/13 (dcm-2-03-01) dcm1220 Peter 3/8/13 800 819 Highly repeatable Got it to crash with repeated ssh "ps aux : grep ps"
removed 3/26/13 (dcm-2-03-02) dcm-1225 Peter 3/8/13 800 802 819 Quite repeatable Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects a different manifestation of the same problem. Hung during tests on 3/20/13.
removed 3/26/13 (dcm-2-03-03) dcm-1227 Peter 3/8/13 800 802 Highly repeatable Doesn't seem to finish reboot (Andrew)
removed 3/20/13 (dcm-2-04-07) dcm-1150 needs work needs work highly repeatable
removed 3/20/13 (dcm-2-04-10) dcm-1135 Peter 3/8/13, 3/19/13 810 830 Highly repeatable Hung during ps/grep command, crashed during reserve resources. Got it to crash twice with repeated ssh "ps aux : grep ps", http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=977 using "top tests" after about 20 minutes
removed 3/20/13 (dcm-2-04-11) dcm-1075 Peter 2/15/13 494 819 highly repeatable

DCMs that were in "fragile" list, but apparently due to other problems.

DCM (location) DCM (S/N) First reported by Date first reported ECL Entries Occurrences Comments Resolution
dcm-2-01-08 dcm-1039 (?) Peter 3/8/13 800 801 1 Seems to have recovered.
dcm-2-03-09 dcm-1222 Peter 2/18/12 Problems were general network/DDS issues. Likely OK.
dcm-2-04-03 dcm-1151 Likely OK
dcm-2-04-06 dcm-1085 Likely OK
dcm-2-04-09 dcm-1224 Maybe OK
dcm-2-04-12 dcm-1096

DCMs from the first batch of 50

Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better
than the rest.

As a start, here is where they live:
DCM (location) DCM (S/N) Comments
dcm-2-01-10 dcm1032 Significant usage starting 3/14/13. Pegged CPU at times, which (A) is not really a symptom of the flaky DCMs and (B) looks consistent with the high data rates on this DCM.
dcm-2-01-07 dcm1038
dcm-2-01-08 dcm1039
dcm-2-01-09 dcm1041
dcm-2-01-12 dcm1043
dcm-2-01-11 dcm1044
dcm-2-04-05 dcm1051

DCMs not failing top torture tests

These are the DCMs that have not failed in more than a continuous month of the multi-top torture test, as of 2013-04-20.
Absence from this list in no way indicates badness, since there could be lots of reasons for a reboot in the preceding month.

DCM (location) DCM (S/N) Comments
dcm-2-04-01
dcm-2-04-02
dcm-2-04-03
dcm-2-04-08
dcm-2-04-09
dcm-2-04-12
dcm-2-05-08
dcm-2-05-09
dcm-2-05-10
dcm-2-05-11

Test Procedures

The following pages document the different types of testing that were done:

CPU Burning Tests
Network Data Copy Tests
Multiple TOP Tests
Repeated ssh
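
As an illustration of the repeated ssh test referred to in the tables above (an ssh command issued every 2 seconds until the DCM stops responding), here is a minimal Python sketch. It is not the script that was actually used. The function names `ssh_probe` and `first_failure`, the 10-second timeout, and the reading of "ps aux : grep ps" as the pipeline `ps aux | grep ps` are all assumptions made for illustration. A timeout is reported as a "hang" and a non-zero ssh exit status (for example a key exchange error) as an "error", mirroring the two kinds of failure noted in the comments column.

```python
import subprocess
import time


def ssh_probe(host, command="ps aux | grep ps", count=5, interval=2.0,
              timeout=10.0, run=subprocess.run, sleep=time.sleep):
    """Run `command` on `host` over ssh `count` times, `interval` seconds apart.

    Returns one status string per attempt: "ok", "error" (ssh or the remote
    command exited non-zero) or "hang" (no reply within `timeout` seconds).
    `run` and `sleep` are injectable so the loop can be exercised without a DCM.
    """
    results = []
    for attempt in range(count):
        try:
            proc = run(
                # BatchMode stops ssh from waiting on a password prompt.
                ["ssh", "-o", "BatchMode=yes", host, command],
                capture_output=True, text=True, timeout=timeout,
            )
            results.append("ok" if proc.returncode == 0 else "error")
        except subprocess.TimeoutExpired:
            results.append("hang")
        if attempt < count - 1:
            sleep(interval)
    return results


def first_failure(results):
    """Index of the first attempt that was not "ok", or None if all succeeded."""
    for index, status in enumerate(results):
        if status != "ok":
            return index
    return None
```

A call such as `ssh_probe("dcm-2-05-12", count=21600)` would correspond to roughly 12 hours of probing at the 2-second interval; multiplying the index returned by `first_failure` by the interval gives an approximate time to first failure.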