Dcm problems March2013 » History » Version 51

Peter Shanahan, 03/20/2013 11:02 AM

{{TOC}}

h1. Dcm problems March2013

h2. DCM "hard" failures

DCMs that can never be contacted.

|DCM| First reported by| Date first reported| Occurrences| Comments | Resolution |

h2. Fragile DCMs

DCMs that can be contacted for some amount of time before they seize.

Notes:

* Please use the full DCM name in the table below, to aid searching.
* The "Occurrences" count is the number of confirmed times the problem has occurred. The true number may well be greater.

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-03   |  dcm-1143          | Peter           |  3/8/13            | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "808":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=808    |  Highly repeatable   | Andrew noted on 3/8/13 that it was using its S/N name |  |
| dcm-2-03-01    |    dcm1220       |Peter            | 3/8/13               |   "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819         |   Highly repeatable       |     Got it to crash with repeated ssh "ps aux : grep ps"  |  |
| dcm-2-03-02    |  dcm-1225      |Peter            | 3/8/13               |   "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819         |   2 | Hung after starting ssh daemon.  Also had an instance of days of 100% CPU "wait".  Andrew suspects different manifestation of same problem.      |          |
| dcm-2-03-03   |  dcm-1227      |Peter            | 3/8/13               |   "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802         |   Highly repeatable | Doesn't seem to finish reboot (Andrew)      |          |
| dcm-2-03-08 |   dcm-1228      |     ?     |        ?      |      |     | One clear failure, during ssh test on 3/17/13|  |
| dcm-2-04-07 |  dcm-1150   |              |  needs work  |  needs work |  highly repeatable | |  |
| dcm-2-04-10 |  dcm-1135   |   Peter          |   3/8/13, 3/19/13   |  "810":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=810  "830":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=830  |  Highly repeatable  | Hung during ps/grep command, crashed during reserve resources.  Got it to crash twice with repeated ssh "ps aux : grep ps", and also using "top tests" after about 20 minutes ("977":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=977) |  |
| dcm-2-04-11     |   dcm-1075        | Peter           | 2/15/13             |    "494":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=494 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819       |      highly repeatable |         |       |        
| dcm-2-05-12     |   dcm-1083        | Peter           | 3/15/13             |       |      1 |    Failed in first ~12 hours of ssh test (every 2 seconds)     |       |        
|  sent to fnal (was dcm-2-02-11)|    dcm-1095     |     Peter     |   near Feb.15  |      |      | Back at FNAL (15 Mar); under ssh ps aux + memtester + top, at least 1 occurrence of a hang with no console output |  |
|  sent to fnal (was dcm-2-02-03)|    dcm-1158     |     Peter     |   near Feb.15  |      |      | Rick K. can make it fail on the bench within minutes by running a 10 ms top torture test |  |

h2. DCMs that were in the "fragile" list, but whose problems were apparently due to other causes

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-08  |  dcm-1039 (?) | Peter |          3/8/13 |  "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800  "801":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=801       |          1 | Seems to have recovered. | |
| dcm-2-03-06 |   dcm-1223      |     ?     |        ?      |      |     | Not included in partition since 2/20/13, but no comment why in ECL. Likely OK|  |
| dcm-2-03-09 |   dcm-1222         |     Peter        |  2/18/12            |      |      | Problems were general network/DDS issues.  Likely OK. |  |
| dcm-2-04-03 |  dcm-1151 |             |              |      |      | Likely OK |  |
| dcm-2-04-06 |  dcm-1085  |             |              |      |      |  | Likely OK  |
| dcm-2-04-09 |  dcm-1224    |             |              |      |      | Maybe OK  |  |
| dcm-2-04-12 |    dcm-1096     |             |              |      |      |  |  |

h2. DCMs from the first batch of 50

Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better than the rest.

As a start, here is where they live:

| DCM (location) | DCM (S/N) | Comments |
|dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13.  Pegged CPU at times, which (A) is not really a symptom of the flaky DCMs and (B) looks consistent with the high data rates on this DCM |
|dcm-2-01-07 | dcm1038| |
|dcm-2-01-08 | dcm1039| |
|dcm-2-01-09 | dcm1041| |
|dcm-2-01-12 | dcm1043| |
|dcm-2-01-11 | dcm1044| |
|dcm-2-04-05 | dcm1051| |
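
One rough way to start on Rick K.'s question is to check which S/Ns from the "Fragile DCMs" table fall in the first-batch range. The sketch below does only that. The S/N lists are copied by hand from the tables on this page (which are themselves incomplete), and the variable and function names are made up for illustration.

```python
# S/N range of the first batch, as stated above (1006-1055 inclusive).
FIRST_BATCH = range(1006, 1056)

# S/Ns copied by hand from the "Fragile DCMs" table on this page.
FRAGILE_SNS = [1143, 1220, 1225, 1227, 1228, 1150, 1135, 1075, 1083, 1095, 1158]

# S/Ns copied by hand from the first-batch location table above.
LOCATED_FIRST_BATCH_SNS = [1032, 1038, 1039, 1041, 1043, 1044, 1051]


def in_first_batch(sn):
    """True if the serial number belongs to the first batch of 50."""
    return sn in FIRST_BATCH


def fragile_first_batch(fragile_sns):
    """Return the fragile S/Ns that come from the first batch, sorted."""
    return sorted(sn for sn in fragile_sns if in_first_batch(sn))


print("fragile DCMs from the first batch:", fragile_first_batch(FRAGILE_SNS))
print("located first-batch DCMs on the fragile list:",
      sorted(set(LOCATED_FIRST_BATCH_SNS) & set(FRAGILE_SNS)))
```

With the S/Ns currently listed on this page, both lists come out empty. That is a small sample, and it does not account for how heavily each DCM has been used, so it should be read as a starting point rather than a conclusion.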

h2. Test Procedures

The following pages document the different types of testing that were done:

[[CPU Burning Tests]]
[[Network Data Copy Tests]]
[[Multiple TOP Tests]]
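
For reference, the repeated ssh "ps aux | grep ps" test mentioned in the tables above could look roughly like the sketch below. This is not the script that was actually used. The function names, the 10 s hang timeout, and the ssh options are assumptions. Only the remote command and the 2-second interval come from the table comments.

```python
import subprocess
import time


def probe(host, timeout_s=10.0, run=subprocess.run):
    """Run one remote `ps aux | grep ps` over ssh; return True if it completed."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
           host, "ps aux | grep ps"]
    try:
        result = run(cmd, stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # no answer within timeout_s: treat as a hang
    return result.returncode == 0


def ssh_test(host, interval_s=2.0, max_probes=None,
             run=subprocess.run, sleep=time.sleep):
    """Probe `host` every `interval_s` seconds until a probe fails.

    Returns (successful_probes, failed). `failed` is False only when
    `max_probes` probes all succeeded.
    """
    ok = 0
    while max_probes is None or ok < max_probes:
        if not probe(host, run=run):
            return ok, True
        ok += 1
        sleep(interval_s)
    return ok, False
```

Calling @ssh_test("dcm-2-04-10")@ would then loop until the DCM stops answering and return how many probes succeeded first. The @run@ and @sleep@ arguments exist only so the loop can be exercised without a real DCM.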