Peter Shanahan, 03/20/2013 12:50 PM

Dcm problems March2013

DCM "hard" failures.

DCMs that can never be contacted.

| DCM | First reported by | Date first reported | Occurrences | Comments | Resolution |

Fragile DCMs

DCMs that can be contacted for some amount of time before they seize.


  • Please use the full DCM name in the table below, to aid searching.
  • The "Occurrences" count is the number of times the problem has been confirmed to occur; the true number may well be greater.
| DCM (location) | DCM (S/N) | First reported by | Date first reported | ECL Entries | Occurrences | Comments | Resolution |
| dcm-2-01-03 | dcm-1143 | Peter | 3/8/13 | 800 802 808 | Highly repeatable | Andrew noted it was using its S/N name on 3/8/13 | |
| dcm-2-03-01 | dcm1220 | Peter | 3/8/13 | 800 819 | Highly repeatable | Got it to crash with repeated ssh "ps aux : grep ps" | |
| dcm-2-03-02 | dcm-1225 | Peter | 3/8/13 | 800 802 819 | 2 | Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects a different manifestation of the same problem. | |
| dcm-2-03-03 | dcm-1227 | Peter | 3/8/13 | 800 802 | Highly repeatable | Doesn't seem to finish reboot (Andrew) | |
| dcm-2-03-06 | dcm-1223 | ? | ? | | | Was on parole, but now a clear failure on 3/20/13 during testing | |
| dcm-2-03-08 | dcm-1228 | ? | ? | | | Two failures during testing: 3/17/13, 3/20/13 | |
| dcm-2-04-07 | dcm-1150 | needs work | needs work | | Highly repeatable | | |
| dcm-2-04-10 | dcm-1135 | Peter | 3/8/13, 3/19/13 | 810 830 | Highly repeatable | Hung during ps/grep command, crashed during reserve resources. Got it to crash twice with repeated ssh "ps aux : grep ps", using "top tests" after about 20 minutes | |
| dcm-2-04-11 | dcm-1075 | Peter | 2/15/13 | 494 819 | Highly repeatable | | |
| dcm-2-05-12 | dcm-1083 | Peter | 3/15/13 | | 1 | Failed in first ~12 hours of ssh test (every 2 seconds) | |
| sent to fnal (was dcm-2-02-11) | dcm-1095 | Peter | near Feb. 15 | | | back@fnal(15Mar)-(ssh-ps-aux+memtester+top) at least 1 occurrence of hang with no console output | |
| sent to fnal (was dcm-2-02-03) | dcm-1158 | Peter | near Feb. 15 | | | Rick K can make it fail on the bench in minutes by doing 10ms top torture test | |
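
The repeated-ssh test mentioned in the comments above (ssh "ps aux : grep ps", in one case every 2 seconds) could be scripted roughly as follows. This is only a sketch, not the script actually used: the function names, the timeout values, and the reading of "ps aux : grep ps" as `ps aux | grep ps` are all assumptions. The `BatchMode` and `ConnectTimeout` options are standard OpenSSH client options.

```python
import subprocess
import time


def probe_once(host, timeout_s=10.0, run=subprocess.run):
    """Return True if `ps aux | grep ps` completes on `host` within timeout_s."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",     # never block waiting for a password prompt
        "-o", "ConnectTimeout=5",  # give up quickly if the DCM is unreachable
        host,
        "ps aux | grep ps",        # assumed command; the remote shell handles the pipe
    ]
    try:
        result = run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False               # no answer in time: count it as a hang
    return result.returncode == 0


def probes_until_failure(host, interval_s=2.0, max_probes=1000,
                         probe=probe_once, sleep=time.sleep):
    """Probe `host` every interval_s seconds.

    Returns the number of successful probes before the first failure,
    or max_probes if the DCM never failed.
    """
    for n in range(max_probes):
        if not probe(host):
            return n
        sleep(interval_s)
    return max_probes
```

Calling `probes_until_failure("dcm-2-05-12")` would then give a rough count of how many probes a DCM survived, which can be logged next to the "Occurrences" column.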

DCMs that were on the "fragile" list, but whose problems were apparently due to other causes.

| DCM (location) | DCM (S/N) | First reported by | Date first reported | ECL Entries | Occurrences | Comments | Resolution |
| dcm-2-01-08 | dcm-1039 (?) | Peter | 3/8/13 | 800 801 | 1 | Seems to have recovered. | |
| dcm-2-03-09 | dcm-1222 | Peter | 2/18/12 | | | Problems were general network/DDS issues. Likely OK. | |
| dcm-2-04-03 | dcm-1151 | | | | | Likely OK | |
| dcm-2-04-06 | dcm-1085 | | | | | Likely OK | |
| dcm-2-04-09 | dcm-1224 | | | | | Maybe OK | |
| dcm-2-04-12 | dcm-1096 | | | | | | |

DCMs from the first batch of 50

Rick K. was wondering whether the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better
than the rest.

As a start, here is where they live:
| DCM (location) | DCM (S/N) | Comments |
| dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13. Pegged CPU at times, which (A) is not really a symptom of the flaky DCMs and (B) looks consistent with the high data rates on this DCM. |
| dcm-2-01-07 | dcm1038 | |
| dcm-2-01-08 | dcm1039 | |
| dcm-2-01-09 | dcm1041 | |
| dcm-2-01-12 | dcm1043 | |
| dcm-2-01-11 | dcm1044 | |
| dcm-2-04-05 | dcm1051 | |
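
As a rough way to check Rick K.'s question against the tables on this page, the serial numbers can be tested for membership in the first batch (S/Ns 1006-1055). The sketch below is illustrative only: the helper names are made up, and the `fragile` list is copied by hand from the "Fragile DCMs" table, so it needs updating whenever that table changes.

```python
import re

FIRST_BATCH = range(1006, 1056)  # S/Ns 1006-1055 inclusive, as stated above


def serial_number(name):
    """Extract the integer S/N from names like 'dcm-1143' or 'dcm1032'."""
    match = re.fullmatch(r"dcm-?(\d+)", name.strip())
    if match is None:
        raise ValueError(f"unrecognized DCM serial name: {name!r}")
    return int(match.group(1))


def in_first_batch(name):
    """True if the DCM's serial number falls in the first batch of 50."""
    return serial_number(name) in FIRST_BATCH


# S/N column of the "Fragile DCMs" table (hand-copied, an assumption of this sketch).
fragile = ["dcm-1143", "dcm1220", "dcm-1225", "dcm-1227", "dcm-1223", "dcm-1228",
           "dcm-1150", "dcm-1135", "dcm-1075", "dcm-1083", "dcm-1095", "dcm-1158"]

first_batch_fragile = [name for name in fragile if in_first_batch(name)]
print(first_batch_fragile)  # prints []
```

None of the fragile DCMs listed so far is from the first batch. That alone does not show the first batch is more reliable, since it says nothing about how heavily each DCM has been exercised.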

Test Procedures

The following pages document the different types of testing that were done:

CPU Burning Tests
Network Data Copy Tests
Multiple TOP Tests