Dcm problems March2013 » History » Version 60

Peter Shanahan, 04/20/2013 04:26 PM

{{TOC}}

h1. Dcm problems March2013

h2. DCM "hard" failures.

DCMs that can never be contacted.

| DCM | First reported by | Date first reported | Occurrences | Comments | Resolution |

h2. Fragile DCMs

DCMs that can be contacted for some amount of time before they seize up.

Notes:

* Please use the full DCM name in the table below, to aid searching.
* The "Occurrences" count is the number of confirmed times the problem has occurred. The true number may well be greater.

| DCM (location) | DCM (S/N) | First reported by | Date first reported | ECL Entries | Occurrences | Comments | Resolution |
| dcm-2-03-06 | dcm-1223 | | | | | Was on parole, but now a clear failure on 3/20/13 during testing | |
| dcm-2-03-08 | dcm-1228 | | | | Highly repeatable | Fails during ssh tests | |
| dcm-2-05-12 | dcm-1083 | Peter | 3/15/13 | | Quite repeatable | Failed in first ~12 hours of ssh test (every 2 seconds). Failed again on 3/20, again 3/21. | |
| dcm-2-06-07 | dcm-1098 | Peter | 3/20/13 | | Quite repeatable | Failed in first ~30 minutes of first tests on 3/20. Failed again later on 3/20, but with ssh exchange key error, not a crash. | |
| sent to fnal (was dcm-2-02-11) | dcm-1095 | Peter | near Feb. 15 | | | back@fnal(15Mar)-(ssh-ps-aux+memtester+top) at least 1 occurrence of hang with no console output | |
| sent to fnal (was dcm-2-02-03) | dcm-1158 | Peter | near Feb. 15 | | | Rick K can make it fail on the bench in minutes by doing 10ms top torture test | |
| removed 3/26/13 (dcm-2-01-03) | dcm-1143 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "808":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=808 | Highly repeatable | Andrew noted it was using its SN name on 3/8/13 | |
| removed 3/26/13 (dcm-2-03-01) | dcm1220 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Highly repeatable | Got it to crash with repeated ssh "ps aux : grep ps" | |
| removed 3/26/13 (dcm-2-03-02) | dcm-1225 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Quite repeatable | Hung after starting ssh daemon. Also had an instance of days of 100% CPU "wait". Andrew suspects different manifestation of same problem. Hung during tests on 3/20/13 | |
| removed 3/26/13 (dcm-2-03-03) | dcm-1227 | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 | Highly repeatable | Doesn't seem to finish reboot (Andrew) | |
| removed 3/20/13 (dcm-2-04-07) | dcm-1150 | | needs work | needs work | Highly repeatable | | |
| removed 3/20/13 (dcm-2-04-10) | dcm-1135 | Peter | 3/8/13, 3/19/13 | "810":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=810 "830":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=830 | Highly repeatable | Hung during ps/grep command, crashed during reserve resources. Got it to crash twice with repeated ssh "ps aux : grep ps", http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=977 using "top tests" after about 20 minutes | |
| removed 3/20/13 (dcm-2-04-11) | dcm-1075 | Peter | 2/15/13 | "494":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=494 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819 | Highly repeatable | | |

h2. DCMs that were in the "fragile" list, but apparently due to other problems

| DCM (location) | DCM (S/N) | First reported by | Date first reported | ECL Entries | Occurrences | Comments | Resolution |
| dcm-2-01-08 | dcm-1039 (?) | Peter | 3/8/13 | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "801":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=801 | 1 | Seems to have recovered. | |
| dcm-2-03-09 | dcm-1222 | Peter | 2/18/12 | | | Problems were general network/DDS issues. Likely OK. | |
| dcm-2-04-03 | dcm-1151 | | | | | Likely OK | |
| dcm-2-04-06 | dcm-1085 | | | | | | Likely OK |
| dcm-2-04-09 | dcm-1224 | | | | | Maybe OK | |
| dcm-2-04-12 | dcm-1096 | | | | | | |

h2. DCMs from the first batch of 50

Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better
than the rest.

As a start, here is where they live:
| DCM (location) | DCM (S/N) | Comments |
| dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13. Pegged CPU at times, which A) is not really a symptom of the flaky DCMs, and B) looks consistent with the high data rates on this DCM. |
| dcm-2-01-07 | dcm1038 | |
| dcm-2-01-08 | dcm1039 | |
| dcm-2-01-09 | dcm1041 | |
| dcm-2-01-12 | dcm1043 | |
| dcm-2-01-11 | dcm1044 | |
| dcm-2-04-05 | dcm1051 | |
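
To extend the comparison beyond the handful of DCMs listed above, a serial name can be checked against the first-batch range mechanically. The following is only a rough sketch: the helper names @serial_number@ and @is_first_batch@ are made up for illustration, and it assumes serial names look like @dcm-1083@ or @dcm1032@ as they appear in the tables on this page. Location names such as @dcm-2-01-10@ are deliberately rejected.

```python
import re

# S/Ns 1006-1055 inclusive, as quoted in the text above.
FIRST_BATCH = range(1006, 1056)


def serial_number(name):
    """Extract the numeric S/N from a serial name such as 'dcm-1083' or 'dcm1032'.

    Hypothetical helper: the accepted name format is an assumption based on
    how serial names are written in the tables on this page.
    """
    match = re.fullmatch(r"\s*dcm-?(\d+)\s*", name)
    if match is None:
        raise ValueError(f"not a DCM serial name: {name!r}")
    return int(match.group(1))


def is_first_batch(name):
    """Return True if the serial name falls in the first batch of 50."""
    return serial_number(name) in FIRST_BATCH
```

For example, @is_first_batch("dcm1032")@ is true, while @is_first_batch("dcm-1083")@ is false.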

h2. DCMs not failing top torture tests

These are the DCMs that have not failed in more than a continuous month of the multi-top torture test, as of 2013-04-20.
Absence from this list does not mean a DCM is bad, since there could be many reasons for a reboot in the preceding month.

| DCM (location) | DCM (S/N) | Comments |
| dcm-2-04-01 | | |
| dcm-2-04-02 | | |
| dcm-2-04-03 | | |
| dcm-2-04-08 | | |
| dcm-2-04-09 | | |
| dcm-2-04-12 | | |
| dcm-2-05-08 | | |
| dcm-2-05-09 | | |
| dcm-2-05-10 | | |
| dcm-2-05-11 | | |
| dcm-2-06-07 | | |

h2. Test Procedures

The following pages document the different types of testing that were done:

[[CPU Burning Tests]]
[[Network Data Copy Tests]]
[[Multiple TOP Tests]]
[[Repeated ssh]]
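
For readers without access to the linked pages, here is a rough sketch of what a repeated-ssh probe of the kind mentioned in the tables above might look like. It is not the procedure documented on [[Repeated ssh]]. The function names @ssh_probe@ and @repeated_ssh@ are hypothetical, the 2-second interval is taken from the dcm-2-05-12 comment, and the command assumes that "ps aux : grep ps" in the tables stands for a shell pipe.

```python
import subprocess
import time


def ssh_probe(host, command="ps aux | grep ps", timeout=10.0):
    """Run one command on host over ssh; return True if it exited cleanly.

    Hypothetical helper. BatchMode=yes stops ssh from prompting for a
    password, so a broken key exchange shows up as a failure, not a hang.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung DCM typically never answers, so a timeout counts as a failure.
        return False
    return result.returncode == 0


def repeated_ssh(host, attempts, interval=2.0, probe=ssh_probe, sleep=time.sleep):
    """Probe host `attempts` times; return the 0-based indices of failed probes."""
    failures = []
    for i in range(attempts):
        if not probe(host):
            failures.append(i)
        if i + 1 < attempts:
            sleep(interval)  # pause between probes, not after the last one
    return failures
```

The @probe@ and @sleep@ arguments are there so the loop can be exercised without touching a real DCM. In real use, something like @repeated_ssh("dcm-2-05-12", attempts=1800)@ would cover roughly an hour at the default interval.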