Bad disks in the RAID

All of our uboonedaq- and ubdaq-prod- machines use RAID6, which tolerates two bad disks. A third is a disaster, so we urge replacing the first bad
one quickly. The ws01, ws02 and uboonelaser servers are not RAIDed. EPICS slowmoncon shows the RAID status, so in principle the shifter
will notice a bad disk because this PV will go into alarm. Bonnie from SCD-SLAM will hopefully tell us too.
The blinking red disk in each ubdaq-prod- machine marks an unused spare, and it is the disk that should be hot-swapped in for any bad
disk on that machine. Pop out both the bad disk and the blinking red one, then put the formerly blinking red one into the bad disk's slot.
It would be a good idea to keep O(10) new, unused, unformatted disks on the shelf in the LArTF computer
room.

My only experience with this right now is on uboonedaq-evb.

Here is how to discover the state of the 3ware RAID of hard drives on our machines. At the bottom we give instructions for
replacing a Western Digital drive that the diagnostics show to be bad. We follow this procedure to get new drives from
Western Digital only if the KOI warranty for the servers has expired while the Western Digital drive warranties are still in effect.
This is a not-unlikely situation for us, as KOI warranties last 3 years (expiring ~2016?) and the WD warranties last 5 years. If the KOI
warranties have not yet expired, the SLAM team (Bonnie K, Renni, Jason, ...) will get the drives swapped out instead, and we do nothing.

Some of our machines do not have 3ware RAID controllers. Run lspci | grep -i RAID and look up the controller it reports to find the equivalent of tw_cli.
As I write these instructions I also note that we have Seagate disks in some places, not Western Digital. I do not yet have experience shipping
disks back to Seagate; I presume the process is the same. Jerry Camacho at KOI is a very useful resource. The rest of this page covers LSI/3ware controllers and WD disks.
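
The lspci check above can also be scripted, which is handy when looping over several machines. This is only a minimal Python sketch: the function names are made up for illustration, and the sample text is hypothetical, not captured from any of our servers.

```python
import subprocess


def raid_controllers(lspci_text):
    """Return the lspci lines that mention RAID, case-insensitively (like grep -i RAID)."""
    return [line.strip() for line in lspci_text.splitlines() if "raid" in line.lower()]


def run_lspci():
    """Run lspci on the local machine and return its text output (needs pciutils)."""
    return subprocess.run(["lspci"], capture_output=True, text=True, check=True).stdout


# Hypothetical sample output, for illustration only.
SAMPLE = """\
00:1f.2 SATA controller: Example Vendor AHCI Controller
03:00.0 RAID bus controller: Example Vendor RAID Adapter
"""

print(raid_controllers(SAMPLE))
```

On a real machine one would call raid_controllers(run_lspci()) and then look up whatever controller name comes back.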

Bad disks show up on uboonedaq-evb, with its LSI/3ware RAID, as follows:

[root@uboonedaq-evb echurch]# tw_cli
//uboonedaq-evb> show

Ctl Model (V)Ports Drives Units NotOpt RRate VRate BBU
------------------------------------------------------------------------
c6 9650SE-24M8 24 13 2 1 1 1 OK
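
The column to watch in this summary is NotOpt, which as far as I understand counts the units on that controller that are not in an optimal state. Below is a minimal Python sketch that picks it out of the pasted text. The function name is hypothetical, and the column position (sixth field) is assumed from the header shown above.

```python
# Summary table as printed by "show" above.
SAMPLE = """\
Ctl Model (V)Ports Drives Units NotOpt RRate VRate BBU
------------------------------------------------------------------------
c6 9650SE-24M8 24 13 2 1 1 1 OK
"""


def controllers_not_optimal(show_text):
    """Map controller name -> NotOpt count, for rows where NotOpt is above zero."""
    result = {}
    for line in show_text.splitlines():
        fields = line.split()
        # Data rows start with a controller id such as c6; header and rule lines do not.
        if len(fields) >= 6 and fields[0].startswith("c") and fields[0][1:].isdigit():
            not_opt = int(fields[5])  # assumed to be the NotOpt column
            if not_opt > 0:
                result[fields[0]] = not_opt
    return result


print(controllers_not_optimal(SAMPLE))
```

An empty result would mean every controller reports all of its units as optimal.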

Now look at the details. Note the model number of the drives: within each unit they are all the same, and any replacement disk must match.
Note that p10 has been pulled out and does not show up below, so the controller reports the RAID as DEGRADED. A further show command on the
individual port gives the serial number, which we use to submit the RMA to Western Digital for a replacement hard drive.

[uboonedaq-evb]# tw_cli
//uboonedaq-evb> /c6 show

Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 OK - - - 232.82 RiW ON
u1 RAID-6 DEGRADED - - 256K 18626.3 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p0 OK u0 233.81 GB SATA 0 - WDC WD2503ABYX-01WE
p1 OK u0 233.81 GB SATA 1 - WDC WD2503ABYX-01WE
p2 OK u1 1.82 TB SATA 2 - WDC WD2003FYYS-02W0
p3 OK u1 1.82 TB SATA 3 - WDC WD2003FYYS-02W0
p4 OK u1 1.82 TB SATA 4 - WDC WD2003FYYS-02W0
p5 OK u1 1.82 TB SATA 5 - WDC WD2003FYYS-02W0
p6 OK u1 1.82 TB SATA 6 - WDC WD2003FYYS-02W0
p7 OK u1 1.82 TB SATA 7 - WDC WD2003FYYS-02W0
p8 OK u1 1.82 TB SATA 8 - WDC WD2003FYYS-02W0
p9 OK u1 1.82 TB SATA 9 - WDC WD2003FYYS-02W0
p11 OK u1 1.82 TB SATA 11 - WDC WD2003FYYS-02W0
p12 OK u1 1.82 TB SATA 12 - WDC WD2003FYYS-02W0
p13 OK u1 1.82 TB SATA 13 - WDC WD2003FYYS-02W0
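
Spotting the DEGRADED unit and the gap at p10 by eye is easy to get wrong on a 24-port controller, so here is a minimal Python sketch that does it from the pasted text. The function names are hypothetical. It assumes ports are numbered consecutively, so a gap could also just be a slot that was never populated, and a drive missing from the top of the range would not be seen.

```python
import re

# Abbreviated copy of the "/c6 show" output above (ports p0-p7 omitted).
SAMPLE = """\
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
u0 RAID-1 OK - - - 232.82 RiW ON
u1 RAID-6 DEGRADED - - 256K 18626.3 RiW ON

VPort Status Unit Size Type Phy Encl-Slot Model
p8 OK u1 1.82 TB SATA 8 - WDC WD2003FYYS-02W0
p9 OK u1 1.82 TB SATA 9 - WDC WD2003FYYS-02W0
p11 OK u1 1.82 TB SATA 11 - WDC WD2003FYYS-02W0
p12 OK u1 1.82 TB SATA 12 - WDC WD2003FYYS-02W0
p13 OK u1 1.82 TB SATA 13 - WDC WD2003FYYS-02W0
"""


def non_ok_units(text):
    """Map unit name -> status for every unit row whose status is not OK."""
    units = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and re.fullmatch(r"u\d+", fields[0]) and fields[2] != "OK":
            units[fields[0]] = fields[2]
    return units


def missing_ports(text):
    """Return port numbers absent between the lowest and highest listed port."""
    present = set()
    for line in text.splitlines():
        fields = line.split()
        if fields and re.fullmatch(r"p\d+", fields[0]):
            present.add(int(fields[0][1:]))
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


print(non_ok_units(SAMPLE), missing_ports(SAMPLE))
```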

Presuming port 9 is the bad drive, we do the following to extract its serial number, here WD-WMAY03195241.

[uboonedaq-evb]# tw_cli
//uboonedaq-evb> /c6/p9 show
/c6/p9 Status = OK
/c6/p9 Model = WDC WD2003FYYS-02W0B0
/c6/p9 Firmware Version = 01.01D01
/c6/p9 Serial = WD-WMAY03195241
/c6/p9 Capacity = 1.82 TB (3907029168 Blocks)
/c6/p9 Reallocated Sectors = 0
/c6/p9 Power On Hours = 25765
/c6/p9 Temperature = 32 deg C
/c6/p9 Spindle Speed = 7200 RPM
/c6/p9 Link Speed Supported = 1.5 Gbps and 3.0 Gbps
/c6/p9 Link Speed = 3.0 Gbps
/c6/p9 NCQ Supported = Yes
/c6/p9 NCQ Enabled = Yes
/c6/p9 Identify Status = N/A
/c6/p9 Belongs to Unit = u1

/c6/p9 Drive SMART Data:
10 00 01 2F 00 C8 C8 00 00 00 00 00 00 00 03 27
00 FD FD 34 21 00 00 00 00 00 04 32 00 64 64 3D
00 00 00 00 00 00 05 33 00 C8 C8 00 00 00 00 00
...
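
To avoid typos when copying the serial number into the RMA form, the "key = value" lines above can be parsed mechanically. A minimal Python sketch follows. The function name is hypothetical, and it assumes every line of interest has the form "/cN/pM Key = Value", as in the output shown.

```python
# First lines of the "/c6/p9 show" output above.
SAMPLE = """\
/c6/p9 Status = OK
/c6/p9 Model = WDC WD2003FYYS-02W0B0
/c6/p9 Firmware Version = 01.01D01
/c6/p9 Serial = WD-WMAY03195241
/c6/p9 Capacity = 1.82 TB (3907029168 Blocks)
"""


def parse_port_show(text):
    """Turn '/cN/pM Key = Value' lines into a dict of Key -> Value."""
    info = {}
    for line in text.splitlines():
        if " = " not in line:
            continue  # skips blank lines and the SMART hex dump
        left, value = line.split(" = ", 1)
        parts = left.strip().split(" ", 1)
        # parts[0] is the port path (e.g. /c6/p9), parts[1] is the key name.
        if len(parts) == 2 and parts[0].startswith("/c"):
            info[parts[1]] = value.strip()
    return info


info = parse_port_show(SAMPLE)
print(info["Model"], info["Serial"])
```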

Go to https://westerndigital.secure.force.com/ind/ID_CreateRMA, make an account, and fill out the form, including the serial number obtained as above.
Wait a while, then request the actual RMA for that serial number. Pop the bad disk out (they're hot-swappable!) and unscrew the 4 Phillips screws from the
mounting bracket. Put the bracket and screws somewhere safe. Walk the disk and the printed-out RMA over to Shipping (the building on the way to the Wilson
exit) during business hours. Fill out a Material Move Form there, using Project number 40ND, Task Number 40ND.02.30.01, your badge ID (visitor or otherwise),
the serial number, the RMA #, etc.

The replacement will arrive at your FNAL office within about a week. Screw the bracket back on, pop the disk in hot, and then use tw_cli to check that it is rebuilding.
The rebuild should take a few hours, up to about 6.
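
Rather than re-running tw_cli by hand for hours, one could poll until the unit status goes back to OK. This is only a sketch under assumptions: the function names and the 10-minute interval are made up, it assumes tw_cli accepts the command on its command line (tw_cli /c6 show) and is run as root, and it only reads the Status column in the layout shown earlier on this page.

```python
import subprocess
import time

# Unit table from the "/c6 show" output earlier on this page.
SAMPLE = """\
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
u0 RAID-1 OK - - - 232.82 RiW ON
u1 RAID-6 DEGRADED - - 256K 18626.3 RiW ON
"""


def unit_status(show_text, unit):
    """Return the Status column for the given unit, or None if it is not listed."""
    for line in show_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == unit:
            return fields[2]
    return None


def wait_until_ok(controller="/c6", unit="u1", interval_s=600, fetch=None, sleep=time.sleep):
    """Poll the controller until the unit reports OK; return the number of polls."""
    if fetch is None:
        # Assumed non-interactive invocation of tw_cli.
        def fetch():
            cmd = ["tw_cli", controller, "show"]
            return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    polls = 0
    while True:
        polls += 1
        status = unit_status(fetch(), unit)
        print(unit, status)
        if status == "OK":
            return polls
        sleep(interval_s)


print(unit_status(SAMPLE, "u1"))
```

The fetch and sleep arguments exist only so the loop can be tried out on pasted text without touching a real controller.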