Dcm problems March2013 » History » Version 43

Peter Shanahan, 03/19/2013 03:11 PM

{{TOC}}

h1. Dcm problems March2013

h2. DCM "hard" failures

DCMs that can never be contacted.

|DCM| First reported by| Date first reported| Occurrences| Comments | Resolution |

h2. Fragile DCMs

DCMs that can be contacted for some amount of time before they seize.

Notes:

* Please use the full DCM name in the table below, to aid searching.
* The "Occurrences" count is the number of confirmed times the problem has occurred. The true count may well be greater.

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-01-03   |  dcm-1143          | Peter           |  3/8/13            | "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "808":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=808    |  2   | Andrew noted it was using its SN name on 3/8/13 |  |
| dcm-2-01-08  |  dcm-1039 (?) | Peter |          3/8/13 |  "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800  "801":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=801       |          1 | Seems to have recovered. | |
| dcm-2-03-01    |    dcm1220       |Peter            | 3/8/13               |   "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819         |   Highly repeatable       |     Got it to crash with repeated ssh "ps aux : grep ps"  |  |
| dcm-2-03-02    |  dcm-1225      |Peter            | 3/8/13               |   "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819         |   2 | Hung after starting ssh daemon.  Also had an instance of days of 100% CPU "wait".  Andrew suspects different manifestation of same problem.      |          |
| dcm-2-03-03   |  dcm-1227      |Peter            | 3/8/13               |   "800":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=800 "802":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=802         |   Highly repeatable | Doesn't seem to finish reboot (Andrew)      |          |
| dcm-2-03-08 |   dcm-1228      |     ?     |        ?      |      |     | One clear failure, during ssh test on 3/17/13|  |
| dcm-2-04-07 |  dcm-1150   |              |  needs work  |  needs work |  highly repeatable | |  |
| dcm-2-04-10 |  dcm-1135   |   Peter          |   3/8/13   |  "810":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=810  "830":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=830  |  Highly repeatable  | Hung during ps/grep command, crashed during reserve resources.  Got it to crash twice with repeated ssh "ps aux : grep ps" |  |
| dcm-2-04-11     |   dcm-1075        | Peter           | 2/15/13             |    "494":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=494 "819":http://dbweb0.fnal.gov/ECL/novashriver/E/show?e=819       |      highly repeatable |         |       |        
|  set to fnal |    dcm-1095     |     Peter     |   near Feb.15  |      |      | back@fnal(15Mar)-(ssh-ps-aux+memtester+top) at least 1 occurrence of hang with no console output |  |
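
Several comments above mention crashing a DCM with repeated ssh "ps aux : grep ps". A minimal sketch of how such a repetition test could be scripted is given here. It assumes the ":" in the table stands in for a shell pipe (a pipe character cannot appear inside a table cell), that the DCM accepts key-based ssh logins, and that a timed-out or failed command counts as a hang. The function names, the timeout, and the iteration limit are hypothetical and are not taken from the tests actually run.

```python
import subprocess
import time


def ssh_ps_once(host, timeout_s=30.0):
    """Run one `ps aux | grep ps` over ssh; return True if it finished in time.

    Assumption: a timeout or a non-zero exit status is treated as a failure.
    """
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",       # never prompt for a password
        "-o", "ConnectTimeout=10",   # give up on unreachable hosts
        host,
        "ps aux | grep ps",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def count_until_failure(host, max_iterations=1000, pause_s=1.0, probe=ssh_ps_once):
    """Repeat the probe and return how many calls succeeded before the first failure.

    Returns max_iterations if the probe never failed.
    """
    for completed in range(max_iterations):
        if not probe(host):
            return completed
        time.sleep(pause_s)
    return max_iterations
```

For example, `count_until_failure("dcm-2-03-01")` would return the number of successful commands before the first hang. The `probe` argument exists only so the loop can be exercised without a real DCM.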

h2. DCMs that were in the "fragile" list, but whose problems were apparently due to other causes

|DCM (location) | DCM (S/N) | First reported by| Date first reported| ECL Entries | Occurrences| Comments | Resolution |
| dcm-2-03-06 |   dcm-1223      |     ?     |        ?      |      |     | Not included in partition since 2/20/13, but no comment why in ECL. Likely OK|  |
| dcm-2-03-09 |   dcm-1222         |     Peter        |  2/18/12            |      |      | Problems were general network/DDS issues.  Likely OK. |  |
| dcm-2-04-03 |  dcm-1151 |             |              |      |      | Likely OK |  |
| dcm-2-04-06 |  dcm-1085  |             |              |      |      |  | Likely OK  |
| dcm-2-04-09 |  dcm-1224    |             |              |      |      | Maybe OK  |  |
| dcm-2-04-12 |    dcm-1096     |             |              |      |      |  |  |

h2. DCMs from the first batch of 50

Rick K. was wondering if the 50 DCMs from the first batch (S/Ns 1006-1055) performed any better than the rest.

As a start, here is where they live:

| DCM (location) | DCM (S/N) | Comments |
|dcm-2-01-10 | dcm1032 | Significant usage starting 3/14/13.  Pegged CPU at times, which (A) is not really a symptom of the flaky DCMs and (B) appears consistent with the high data rates on this DCM. |
|dcm-2-01-07 | dcm1038| |
|dcm-2-01-08 | dcm1039| |
|dcm-2-01-09 | dcm1041| |
|dcm-2-01-12 | dcm1043| |
|dcm-2-01-11 | dcm1044| |
|dcm-2-04-05 | dcm1051| |
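
To cross-check the other tables against this batch, a small helper along the following lines could flag which serial-number names fall in the 1006-1055 range. This is only a sketch: the function names are hypothetical, and it assumes serial-number names look like either "dcm-1143" or "dcm1032", the two spellings used on this page.

```python
import re

# First-batch serial numbers, 1006 through 1055 inclusive.
FIRST_BATCH = range(1006, 1056)


def serial_number(name):
    """Extract the integer S/N from a name such as 'dcm-1143' or 'dcm1032'.

    Returns None for anything else, including location names like 'dcm-2-01-03'.
    """
    match = re.fullmatch(r"dcm-?(\d+)", name.strip())
    return int(match.group(1)) if match else None


def in_first_batch(name):
    """Return True if the name is a serial-number name from the first batch."""
    sn = serial_number(name)
    return sn is not None and sn in FIRST_BATCH
```

For instance, `in_first_batch("dcm-1039")` is true while `in_first_batch("dcm-1143")` is false, so of those two only dcm-1039 belongs to the first batch.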

h2. Test Procedures

The following pages document the different types of testing that were done:

[[CPU Burning Tests]]
[[Network Data Copy Tests]]