Project: General

Bug #11525

HPC: fix Samsung 850 PRO SSD performance issues on ZFS Linux (without TRIM)

Added by Gerard Bernabeu Altayo over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Start date: 01/28/2016
Due date:
% Done: 0%
Estimated time:
Duration:

Description

We found that the write speed of the Samsung 850 Pro SSDs decreased from ~500 MB/s to around 30 MB/s of throughput, with fewer than 500 IOPS per disk.

After some research this looks like a known problem, and a solution needs to be found.

---
After observing 'stale nfs' alarms on tev I started looking at the newtevnfs:/fast performance and noticed that it is not very good. Given that this is a RAIDZ1 made of 4 SSD disks, each rated at ~500 MB/s and ~30K IOPS, I expected much higher bandwidth than what I observed:

[root@newtevnfs ~]# time dd if=/dev/zero of=/fast0/gerard.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 28.2246 s, 38.0 MB/s

real 0m28.233s
user 0m0.003s
sys 0m0.593s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast0/gerard.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 8.99906 s, 119 MB/s

real 0m9.013s
user 0m0.000s
sys 0m0.490s
The first test ran while there was other activity on the /fast zpool and the second while it was idle. I did several runs under similar conditions and the values are stable. In any case the values are too low for a well-behaving SSD RAIDZ1, and iostat shows 100% utilization for all 4 disks (sda-sdd) whenever there is activity (which peaks at less than 150 MB/s).

In order to rule out a SATA issue I also did some test writes to /data, and there I got better results than on the /fast zpool:

[root@newtevnfs ~]# time dd if=/dev/zero of=/data0/gerard.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.63504 s, 141 MB/s

real 0m7.661s
user 0m0.000s
sys 0m0.576s

Looking at the server's historical data I can see that on December 9th the workload changed and became write-dominated, which caused a measurable increase in iowait CPU usage, most probably coming from the corresponding increase in context switches.

I don't really know what the issue is; maybe we're on a 'too old' ZFS version (we are on 0.6.4.1-1, and I see releases up to v0.6.5.4)?

Perhaps we should try the ZFS NFS export; I see it is disabled now:
[root@newtevnfs ~]# zfs get sharenfs
NAME PROPERTY VALUE SOURCE
data sharenfs off default
data/data0 sharenfs off default
fast sharenfs off default
fast/fast0 sharenfs off default
[root@newtevnfs ~]#

My past experience with ZFS and NFS was on a Solaris server, where using the ZFS NFS capabilities provided higher performance; maybe this is the source of the very high context-switch rate? But that would not really explain the low local 'dd' performance...

Maybe the 'RAID 5' layout of raidz1 just doesn't perform well with only 4 disks?

Currently there are no user complaints about this, but my guess is that it may be limiting some jobs' performance.

I found this thread http://www.tomshardware.com/answers/id-2715455/samsung-850-pro-256gb-poor-write-speed-recently.html from a user describing similar behavior.

There is no 'Magician' for Linux... I checked that we seem to be running the latest firmware; not much help from the Samsung webpage, but I found https://www.facebook.com/groups/samsungssd850death/ and they say EXM02B6Q is the latest, which we have.
---
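Regarding the sharenfs idea in the notes above: a minimal sketch of how it could be enabled on the Linux ZFS side if we decide to test it (the subnet and options below are illustrative, copied in the style of the existing /etc/exports on tev; the value is passed through to exportfs):

<pre>
# Hand the export over to ZFS instead of /etc/exports.
zfs set sharenfs='rw=@192.168.76.0/24,no_root_squash' fast/fast0
zfs get sharenfs fast/fast0
showmount -e localhost   # confirm the export shows up
</pre>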

For the solution part, Samsung's Magician DC software may be needed:

http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/support/server_downloads.html

http://www.samsung.com/global/business/semiconductor/minisite/SSD/downloads/software/samsung_magician_dc-v1.0_rtm_p2.tar.gz

History

#1 Updated by Gerard Bernabeu Altayo over 3 years ago

Don pointed me to TRIM, which is a feature just added in ZFS 0.6.5. Searching a bit, I found reports of the same issue we are seeing in the pull request https://github.com/zfsonlinux/zfs/pull/3656:

@dweezil: Will this also trim l2arc/slog or just the vdev devices?

I'm asking because l2arc in particular get a lot of writes and some SSDs get themselves into a pretty bad steady-state mode if not given trim commands (eg: Samsung 840pro - I've seen 98% write speed slowdown in steady-state(full) over operation with trimming in use in ZFS and when used as DB storage on ext4 systems)
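For reference, a quick way to confirm which ZFS module version the server is actually running (assuming the module is loaded):

<pre>
cat /sys/module/zfs/version      # e.g. 0.6.4.1-1
dmesg | grep -i 'ZFS: Loaded'    # version string logged when the module loaded
</pre>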

We will probably need to:

1. Make a snapshot and copy all the ZFS data from /fast elsewhere
2. DOWNTIME start
3. Stop NFS and copy the incremental of the ZFS disk
4. Upgrade ZFS
5. Use magician to fully erase/initialize the drives (perhaps TRIM can do it with the data in place)
6. Test write speed
7. If can't do it with data in place, recreate the zpool and restore data
8. Start NFS server
9. DOWNTIME end
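A minimal sketch of steps 1 and 3 (full snapshot now, incremental once NFS is stopped); the snapshot names and the /extra destination are illustrative:

<pre>
# Step 1: full backup while the filesystem is still in use.
zfs snapshot fast/fast0@predowntime
zfs send fast/fast0@predowntime | gzip > /extra/fast0.predowntime.gz

# Step 3: after stopping NFS, send only what changed since the full snapshot.
zfs snapshot fast/fast0@downtime
zfs send -i fast/fast0@predowntime fast/fast0@downtime > /extra/fast0.downtime.incremental
</pre>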

Will now do some research on magician and try to deploy it.

#2 Updated by Gerard Bernabeu Altayo over 3 years ago

Got the SW in:

[root@newtevnfs ~]# cd /opt/samsung/
[root@newtevnfs samsung]# wget http://www.samsung.com/global/business/semiconductor/minisite/SSD/downloads/software/samsung_magician_dc-v1.0_rtm_p2.tar.gz
--2016-01-28 14:12:03--  http://www.samsung.com/global/business/semiconductor/minisite/SSD/downloads/software/samsung_magician_dc-v1.0_rtm_p2.tar.gz
Resolving www.samsung.com... 23.193.21.154
Connecting to www.samsung.com|23.193.21.154|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2518005 (2.4M) [text/plain]
Saving to: “samsung_magician_dc-v1.0_rtm_p2.tar.gz”

100%[===========================================================================================================================================>] 2,518,005   12.1M/s   in 0.2s    

2016-01-28 14:12:03 (12.1 MB/s) - “samsung_magician_dc-v1.0_rtm_p2.tar.gz” saved [2518005/2518005]

[root@newtevnfs samsung]# tar -xvf samsung_magician_dc-v1.0_rtm_p2.tar.gz 
samsung_magician_dc-v1.0_rtm_p2/
samsung_magician_dc-v1.0_rtm_p2/64bin/
samsung_magician_dc-v1.0_rtm_p2/64bin/magician
samsung_magician_dc-v1.0_rtm_p2/32bin/
samsung_magician_dc-v1.0_rtm_p2/32bin/magician
[root@newtevnfs samsung]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician -h
Magician is now configuring the environment for LSI MegaRAID SAS Controller.
Magician is now configuring the environment for LSI SAS IT/IR Controller.
Magician is now configuring the environment for LSI SAS IT/IR2 Controller.
Magician is now configuring the environment for LSI SAS IT/IR3 Controller.
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
Usage:  /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician  [operation] ..

Allowed Operations:
-L[ --list]              Shows a disk(s) attached to the system.           
-F[ --firmware-update]   Updates firmware to specified disk.               
-E[ --erase]             Securely Erases all data from specified disk.     
-O[ --over-provision]    Performs one of the Over-Provisioning related     
                         operations on specified disk.                     
-T[ --trim]              Optimizes specified disk.                         
-S[ --smart]             Shows S.M.A.R.T values of specified disk.         
-M[ --setmax]            Performs SetMax related operations on specified disk.
-W[ --writecache]        Enables/Disables Write Cache on specified disk.   
-X[ --sctcachestate]     Gets the SCT write cache state for specified disk.
-C[ --command-history]   Shows history of the previously executed commands.
-I[ --info]              Displays the disk details to the user.            
-license                 Shows the End User License Agreement.             
-H[ --help]              Shows detailed Help.                              

[root@newtevnfs samsung]# 

And it detects our disks:

[root@newtevnfs samsung]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --list
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
--------------------------------------------------------------------------------------------------
| Disk   | Model                  | Serial         | Firmware  | Capacity | Drive  | Total Bytes |
| Number |                        | Number         |           |          | Health | Written     |
--------------------------------------------------------------------------------------------------
| 0      |Samsung SSD 850 PRO 1TB |S2BBNWAG103029M |EXM02B6Q   | 953 GB   | GOOD   | 119.82 TB   |
--------------------------------------------------------------------------------------------------
| 1      |Samsung SSD 850 PRO 1TB |S2BBNWAG103016M |EXM02B6Q   | 953 GB   | GOOD   | 116.36 TB   |
--------------------------------------------------------------------------------------------------
| 2      |Samsung SSD 850 PRO 1TB |S2BBNWAG103031B |EXM02B6Q   | 953 GB   | GOOD   | 119.45 TB   |
--------------------------------------------------------------------------------------------------
| 3      |Samsung SSD 850 PRO 1TB |S2BBNWAG103020D |EXM02B6Q   | 953 GB   | GOOD   | 116.36 TB   |
--------------------------------------------------------------------------------------------------
[root@newtevnfs samsung]# 

I think it would even be safe to try trimming one disk, as ZFS should detect any corruption, and we have a raidz1 that would rebuild it...

At any rate I will do a backup first. Amitoj said: "Can you kick off step#1 before I send an email to the oversight committee requesting permission for an early next week downtime. To copy the snapshot from /fast to elsewhere there is sufficient space in /extra partition on tev.fnal.gov where we can park this snapshot."

I have exported tev.fnal.gov:/extra

[root@tev ~]# cat /etc/exports 
/usr/local    192.168.76.0/255.255.255.0(rw,sync,no_root_squash) 192.168.176.0/255.255.255.0(rw,sync,no_root_squash)
/home        192.168.76.0/255.255.255.0(rw,sync,no_root_squash) 192.168.176.0/255.255.255.0(rw,sync,no_root_squash)
/opt        192.168.76.0/255.255.255.0(rw,sync,no_root_squash) 192.168.176.0/255.255.255.0(rw,sync,no_root_squash)
/fnal/ups    192.168.76.0/255.255.255.0(rw,sync,no_root_squash) 192.168.176.0/255.255.255.0(rw,sync,no_root_squash)
/extra        192.168.76.0/255.255.255.0(rw,sync,no_root_squash) 192.168.176.0/255.255.255.0(rw,sync,no_root_squash)
[root@tev ~]# exportfs -r

[root@newtevnfs samsung]# mkdir /extra; mount 192.168.176.26:/extra /extra
mkdir: cannot create directory `/extra': File exists
[root@newtevnfs samsung]# df -h /extra/
Filesystem            Size  Used Avail Use% Mounted on
192.168.176.26:/extra
                      1.8T  364G  1.4T  21% /extra
[root@newtevnfs samsung]# ls /extra/
EVAL_L___V8VP-WR7ZWLMK.lic  l_ics_2013.0.028  l_ics_2013.0.028.tgz  lost+found  restore  tev-backup  tmp  usr  vmstat
[root@newtevnfs samsung]# 
[root@newtevnfs samsung]# zfs snapshot fast/fast0@20160181707
[root@newtevnfs samsung]# nohup zfs send fast/fast0@20160181707 | /bin/gzip > /extra/fast0.20160181707.gz &
[1] 22971
[root@newtevnfs samsung]# nohup: ignoring input and redirecting stderr to stdout

[root@newtevnfs samsung]# 
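Once the send finishes, the saved stream can be sanity-checked with zstreamdump (it ships with ZFS on Linux); a sketch using the file name above:

<pre>
# Decompress and parse the stream records; a truncated or malformed stream errors out here.
gzip -dc /extra/fast0.20160181707.gz | zstreamdump | head -20
</pre>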

#3 Updated by Gerard Bernabeu Altayo over 3 years ago

The next thing I should run is:

[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --trim 
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Usage: 

    magician -d [diskindex] -T [ --trim ] [parameter-list]                                    
     Example: magician --disk 1 --trim [or]                                                   
     magician -d 1 -T --force                                                                 
    -d [ --disk ] Disk-Number of the disk to be optimized                                     
    --force Enables the user to perform trim without prompting for any confirmations          

------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --list
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
--------------------------------------------------------------------------------------------------
| Disk   | Model                  | Serial         | Firmware  | Capacity | Drive  | Total Bytes |
| Number |                        | Number         |           |          | Health | Written     |
--------------------------------------------------------------------------------------------------
| 0      |Samsung SSD 850 PRO 1TB |S2BBNWAG103029M |EXM02B6Q   | 953 GB   | GOOD   | 119.90 TB   |
--------------------------------------------------------------------------------------------------
| 1      |Samsung SSD 850 PRO 1TB |S2BBNWAG103016M |EXM02B6Q   | 953 GB   | GOOD   | 116.44 TB   |
--------------------------------------------------------------------------------------------------
| 2      |Samsung SSD 850 PRO 1TB |S2BBNWAG103031B |EXM02B6Q   | 953 GB   | GOOD   | 119.53 TB   |
--------------------------------------------------------------------------------------------------
| 3      |Samsung SSD 850 PRO 1TB |S2BBNWAG103020D |EXM02B6Q   | 953 GB   | GOOD   | 116.44 TB   |
--------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --trim 0

Waiting for the gzip to end!

#4 Updated by Gerard Bernabeu Altayo over 3 years ago

It is not possible to use the TRIM option of magician:

[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --disk 0 --trim
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
Do you want to continue with Optimization (yes to continue):yes
------------------------------------------------------------------------------------------------
Disk Number:  0 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
TRIM:  [INFO] TRIM command is not supported for partition :/dev/sda1 (Only ext4 is supported)
TRIM:  [INFO] TRIM command is not supported for partition :/dev/sda9 (Only ext4 is supported)
TRIM:  [ERROR] Unable to complete the TRIM operation!  
------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# 

So I will have to fail the disk and rebuild it.

#5 Updated by Gerard Bernabeu Altayo over 3 years ago

The gzip last night failed; I restarted it a few hours ago from a screen session, and it is still running:

[root@newtevnfs ~]# ls -ltarh /extra/
total 268G
drwxr-xr-x  11 root root 4.0K Oct 19  2012 l_ics_2013.0.028
drwx------   2 root root 4.0K Oct 22  2012 lost+found
drwxr-xr-x   3 root root 4.0K Oct 25  2012 .ProjectDirectory
-rw-r--r--   1 root root 2.4G Jul  3  2013 l_ics_2013.0.028.tgz
-rw-r--r--   1 root root 1.3K Jul  3  2013 EVAL_L___V8VP-WR7ZWLMK.lic
drwxr-xr-x   3 root root 4.0K Nov  3  2014 tev-backup
drwxr-xr-x   4 root root 4.0K Dec 15  2014 usr
drwxr-xr-x   2 root root  20K Dec 20  2014 vmstat
drwxr-xr-x   2 root root 4.0K Jun 22  2015 restore
drwxrwxrwt  98 root root  44K Jul 29  2015 tmp
-rw-r--r--   1 root root 114G Jan 28 19:49 fast0.20160181707.gz
drwxr-xr-x  10 root root 4.0K Jan 29 11:38 .
dr-xr-xr-x. 31 root root 4.0K Jan 29 12:02 ..
-rw-r--r--   1 root root 152G Jan 29 14:15 fast0.20160191112.gz
[root@newtevnfs ~]# 

For now my plan is to fail a disk and rebuild it...

#6 Updated by Gerard Bernabeu Altayo over 3 years ago

Looks like most of Magician's options do not work on ZFS...

[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --disk 0 --over-provision --set 
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Disk Number:  0 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Over Provisioning: [INFO-OP] User selected to perform Over-Provision with default value 
             (10% of Total Disk Space).
Over Provisioning: [ERROR-OP] The disk (/dev/sda9) provided is not formatted with 
            supported FileSystem. Over-Provisioning not possible!
------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# 

Maybe we should do mdraid+ext4 for this. We are not using snapshots or any of the other nice ZFS features, so for performance it may be better to set this RAID up as a simple ext4 filesystem, where TRIM is supported. We should be careful about a TRIM-related bug that Samsung found and fixed in the Linux kernel:

http://linux.slashdot.org/story/15/07/30/1814200/samsung-finds-fixes-bug-in-linux-trim-code
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f3f5da624e0a891c34d8cd513c57f1d9b0c7dadc

We should make sure the patch is in.
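One rough way to check, assuming an RPM-based kernel whose package changelog mentions backported fixes:

<pre>
# See whether the running kernel's changelog mentions the discard/bio-splitting fix
# referenced above. Absence here is not conclusive, but presence is a good sign.
uname -r
rpm -q --changelog kernel-$(uname -r) | grep -i -B1 'discard'
</pre>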

If we would rather have a workaround (i.e. something we will have to repeat in a few months), we can detach a disk from the RAID and use Magician (or hdparm) to do an ATA ERASE, which restores the disk to its initial (high) performance. We could combine that with a size decrease (emulating over-provisioning) so that the disk's performance does not degrade as quickly, but I'm not sure about its effectiveness.
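A sketch of that workaround with hdparm (the device name is illustrative, the disk must not be "frozen", and the SET MAX step to emulate extra over-provisioning is optional and untested by us):

<pre>
# WARNING: destroys all data on the target disk; only run it on a disk
# that has already been detached from the pool.
hdparm -I /dev/sdX | grep frozen                                   # must report "not frozen"
hdparm --user-master u --security-set-pass PasSWorD /dev/sdX
hdparm --user-master u --security-erase PasSWorD /dev/sdX

# Optional: hide ~10% of the LBAs (SET MAX ADDRESS) to emulate over-provisioning.
# These 1TB 850 PROs report 2000409264 sectors; 90% is about 1800368337.
hdparm -Np1800368337 --yes-i-know-what-i-am-doing /dev/sdX
</pre>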

#7 Updated by Gerard Bernabeu Altayo over 3 years ago

Going to go with this plan:

Given that we are using ZFS on Samsung SSDs in a much more critical application
- the Lustre /lfs MDT - it's really important to run this down.

I would like to understand the current state of the 4 SSDs.  Since their pool is
raidz1, after backing up a snapshot of the zfs filesystem(s) on that pool I
would be in favor of removing one of the disks, leaving the pool degraded but
still functioning.   We can then try assessing the performance of raw I/O to the
removed disk (e.g., using "dd" to do writes) to see if raw (no file system)
performance is bad.  If performance is good, we may have a ZFS issue.  If
performance is bad (writes are much slower than reads) we can do ATA ERASE or
equivalent and see if raw I/O write performance recovers to the expected value.
If so we can add the disk back to the zpool, resilver, rinse and repeat with the
other 3 disks one at a time.   If not, we may have a bad or worn-out disk; we
can try putting down a different file system with TRIM support just in case the
reset (ATA ERASE or whatever) didn't do the job.

It's possible we have 3 good disks and 1 slow disk, since all raidz1 writes have
to touch all 4 disks.  So it may be that the disk we pull will have good raw
write performance.  We'll have to go through all 4.

In addition to the snapshot, we should probably rsync /fast0 to spinning disk
and re-rsync periodically during disk swaps.

At the time that we put an SSD into dscon to handle robinhood, these Samsung
850s were receiving lots of praise but they were not yet in the channel for
purchase.  So we purchased a Micron DC-quality SSD.  We should check on
performance on that filesystem.  It's running XFS which IIRC has trim (discard)
support.

We elected, mostly on the advice of Koi, to move to the Samsung 850s rather than
the Micron DC disks for our /lfs MDT.  We can check with them (Jerry I suppose)
for advice after we get some raw measurements done.

I don't mean to  dictate this path - please can others chime in?

Don
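A minimal sketch of the raw write-vs-read comparison described above, assuming the pulled disk shows up as /dev/disk/by-vdev/s0 (O_DIRECT keeps the page cache out of the measurement):

<pre>
# Raw sequential write, then raw sequential read, on the detached disk.
dd if=/dev/zero of=/dev/disk/by-vdev/s0 bs=1M count=4096 oflag=direct conv=fsync
dd if=/dev/disk/by-vdev/s0 of=/dev/null bs=1M count=4096 iflag=direct
</pre>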

#8 Updated by Gerard Bernabeu Altayo over 3 years ago

Sending the incremental:

[root@newtevnfs ~]# zfs snapshot fast/fast0@201602031334

[root@newtevnfs ~]# zfs send -v -i fast/fast0@20160191119 fast/fast0@201602031334  > /extra/fast0.201602031334.incremental
send from @20160191119 to fast/fast0@201602031334 estimated size is 68.9G
total estimated size is 68.9G
TIME        SENT   SNAPSHOT
13:40:31   40.1M   fast/fast0@201602031334
13:40:32    313M   fast/fast0@201602031334
13:40:33   1.21G   fast/fast0@201602031334
13:40:34   2.10G   fast/fast0@201602031334
13:40:35   3.03G   fast/fast0@201602031334
13:40:36   3.66G   fast/fast0@201602031334
13:40:37   4.05G   fast/fast0@201602031334
13:40:38   4.91G   fast/fast0@201602031334
13:40:39   5.76G   fast/fast0@201602031334
13:40:40   6.55G   fast/fast0@201602031334
13:40:41   6.63G   fast/fast0@201602031334
13:40:42   6.64G   fast/fast0@201602031334
13:40:43   6.64G   fast/fast0@201602031334
13:40:44   6.65G   fast/fast0@201602031334

Compressing takes forever...

Setting the disk offline:

[root@newtevnfs ~]# zpool offline fast s0
[root@newtevnfs ~]# zpool list -v fast
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
fast  3.72T  1.43T  2.28T         -     9%    38%  1.00x  DEGRADED  -
  raidz1  3.72T  1.43T  2.28T         -     9%    38%
    s0      -      -      -         -      -      -
    s1      -      -      -         -      -      -
    s2      -      -      -         -      -      -
    s3      -      -      -         -      -      -
[root@newtevnfs ~]# zpool list fast
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
fast  3.72T  1.43T  2.28T         -     9%    38%  1.00x  DEGRADED  -
[root@newtevnfs ~]# zpool status fast
  pool: fast
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub canceled on Fri Jan 29 14:34:35 2016
config:

    NAME        STATE     READ WRITE CKSUM
    fast        DEGRADED     0     0     0
      raidz1-0  DEGRADED     0     0     0
        s0      OFFLINE      0     0     0
        s1      ONLINE       0     0     0
        s2      ONLINE       0     0     0
        s3      ONLINE       0     0     0

errors: No known data errors
[root@newtevnfs ~]# 

Starting the tests:

The disk is very slow right after offlining it:

[root@newtevnfs ~]# time dd if=/dev/zero of=/dev/disk/by-vdev/s0 bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 40.8072 s, 26.3 MB/s

real    0m40.826s
user    0m0.003s
sys    0m1.035s

This is now a very simple operation directly on the block device, which shows the SSD itself is slow.

I will perform an ATA erase following https://wiki.archlinux.org/index.php/SSD_memory_cell_clearing. THIS DID NOT WORK:

[root@newtevnfs ~]# hdparm -I /dev/disk/by-vdev/s0

/dev/disk/by-vdev/s0:

ATA device, with non-removable media
    Model Number:       Samsung SSD 850 PRO 1TB                 
    Serial Number:      S2BBNWAG103029M     
    Firmware Revision:  EXM02B6Q
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Used: unknown (minor revision code 0x0039) 
    Supported: 9 8 7 6 5 
    Likely used: 9
Configuration:
    Logical        max    current
    cylinders    16383    16383
    heads        16    16
    sectors/track    63    63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 2000409264
    Logical  Sector size:                   512 bytes
    Physical Sector size:                   512 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      976762 MBytes
    device size with M = 1000*1000:     1024209 MBytes (1024 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: Solid State Device
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 1    Current = 1
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4 
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled    Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    WRITE_{DMA|MULTIPLE}_FUA_EXT
       *    64-bit World wide name
            Write-Read-Verify feature set
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Phy event counters
       *    unknown 76[15]
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Asynchronous notification (eg. media change)
       *    Software settings preservation
            unknown 78[8]
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Error Recovery Control (AC3)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
       *    reserved 69[4]
       *    DOWNLOAD MICROCODE DMA command
       *    SET MAX SETPASSWORD/UNLOCK DMA commands
       *    WRITE BUFFER DMA command
       *    READ BUFFER DMA command
       *    Data Set Management TRIM supported (limit 8 blocks)
Security: 
    Master password revision code = 65534
        supported
    not    enabled
    not    locked
        frozen
    not    expired: security count
        supported: enhanced erase
    2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 50025388700755f2
    NAA        : 5
    IEEE OUI    : 002538
    Unique ID    : 8700755f2
Checksum: correct
[root@newtevnfs ~]# hdparm --user-master u --security-set-pass PasSWorD /dev/disk/by-vdev/s0
security_password="PasSWorD" 

/dev/disk/by-vdev/s0:
 Issuing SECURITY_SET_PASS command, password="PasSWorD", user=user, mode=high
SECURITY_SET_PASS: Input/output error
[root@newtevnfs ~]# hdparm -I /dev/disk/by-vdev/s0

/dev/disk/by-vdev/s0:

ATA device, with non-removable media
    Model Number:       Samsung SSD 850 PRO 1TB                 
    Serial Number:      S2BBNWAG103029M     
    Firmware Revision:  EXM02B6Q
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Used: unknown (minor revision code 0x0039) 
    Supported: 9 8 7 6 5 
    Likely used: 9
Configuration:
    Logical        max    current
    cylinders    16383    16383
    heads        16    16
    sectors/track    63    63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 2000409264
    Logical  Sector size:                   512 bytes
    Physical Sector size:                   512 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      976762 MBytes
    device size with M = 1000*1000:     1024209 MBytes (1024 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: Solid State Device
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 1    Current = 1
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4 
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled    Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    WRITE_{DMA|MULTIPLE}_FUA_EXT
       *    64-bit World wide name
            Write-Read-Verify feature set
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Phy event counters
       *    unknown 76[15]
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Asynchronous notification (eg. media change)
       *    Software settings preservation
            unknown 78[8]
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Error Recovery Control (AC3)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
       *    reserved 69[4]
       *    DOWNLOAD MICROCODE DMA command
       *    SET MAX SETPASSWORD/UNLOCK DMA commands
       *    WRITE BUFFER DMA command
       *    READ BUFFER DMA command
       *    Data Set Management TRIM supported (limit 8 blocks)
Security: 
    Master password revision code = 65534
        supported
    not    enabled
    not    locked
        frozen
    not    expired: security count
        supported: enhanced erase
    2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 50025388700755f2
    NAA        : 5
    IEEE OUI    : 002538
    Unique ID    : 8700755f2
Checksum: correct
[root@newtevnfs ~]# hdparm --user-master u --security-erase PasSWorD /dev/disk/by-vdev/s0
security_password="PasSWorD" 

/dev/disk/by-vdev/s0:
 Issuing SECURITY_ERASE command, password="PasSWorD", user=user
ERASE_PREPARE: Input/output error
[root@newtevnfs ~]# 

Going with Magician; an error there too!

[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --list
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
--------------------------------------------------------------------------------------------------
| Disk   | Model                  | Serial         | Firmware  | Capacity | Drive  | Total Bytes |
| Number |                        | Number         |           |          | Health | Written     |
--------------------------------------------------------------------------------------------------
| 0      |Samsung SSD 850 PRO 1TB |S2BBNWAG103029M |EXM02B6Q   | 953 GB   | GOOD   | 124.50 TB   |
--------------------------------------------------------------------------------------------------
| 1      |Samsung SSD 850 PRO 1TB |S2BBNWAG103016M |EXM02B6Q   | 953 GB   | GOOD   | 120.92 TB   |
--------------------------------------------------------------------------------------------------
| 2      |Samsung SSD 850 PRO 1TB |S2BBNWAG103031B |EXM02B6Q   | 953 GB   | GOOD   | 124.13 TB   |
--------------------------------------------------------------------------------------------------
| 3      |Samsung SSD 850 PRO 1TB |S2BBNWAG103020D |EXM02B6Q   | 953 GB   | GOOD   | 120.93 TB   |
--------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician -d 0 --erase 
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
WARNING : All Data on Disk will be Erased and cannot be recovered, Please take a back up of 
any data if necessary. Do not REMOVE SSD in middle of operation, otherwise results 
may be inaccurate. Continue Secure Erase ? [ yes ]:yes
------------------------------------------------------------------------------------------------
Disk Number:  0 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Erase:  Selected Disk is frozen : unplug and replug SATA cable and then try to do Secure Erase.
Erase:  [ERROR-SE]Secure Erase Operation Failed  
------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# 

The SCSI magic did not work; I will have to go out and physically disconnect/reconnect the disk. Going to the LCC to do that now.

#9 Updated by Gerard Bernabeu Altayo over 3 years ago

Unplugging the drive worked for the erase.

To locate the drive (no, sda was not the 1st bay but the 2nd) I triggered some activity so that the LEDs start blinking (they are solidly lit by default). They need quite a lot of activity to start blinking; reads were more effective than writes, and taking the disk online/offline on the zpool gave a good hint.
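For the record, a sketch of the kind of sustained read load I used to make a bay's activity LED blink (device name illustrative):

<pre>
# About 20 GB of reads from one member disk keeps its LED flashing for a while.
dd if=/dev/disk/by-vdev/s1 of=/dev/null bs=1M count=20000 iflag=direct
</pre>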

After the erase the disk now performs back at spec:

[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --list
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
--------------------------------------------------------------------------------------------------
| Disk   | Model                  | Serial         | Firmware  | Capacity | Drive  | Total Bytes |
| Number |                        | Number         |           |          | Health | Written     |
--------------------------------------------------------------------------------------------------
| 0      |Samsung SSD 850 PRO 1TB |S2BBNWAG103016M |EXM02B6Q   | 953 GB   | GOOD   | 120.93 TB   |
--------------------------------------------------------------------------------------------------
| 1      |Samsung SSD 850 PRO 1TB |S2BBNWAG103031B |EXM02B6Q   | 953 GB   | GOOD   | 124.14 TB   |
--------------------------------------------------------------------------------------------------
| 2      |Samsung SSD 850 PRO 1TB |S2BBNWAG103020D |EXM02B6Q   | 953 GB   | GOOD   | 120.94 TB   |
--------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician --list
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
--------------------------------------------------------------------------------------------------
| Disk   | Model                  | Serial         | Firmware  | Capacity | Drive  | Total Bytes |
| Number |                        | Number         |           |          | Health | Written     |
--------------------------------------------------------------------------------------------------
| 0      |Samsung SSD 850 PRO 1TB |S2BBNWAG103029M |EXM02B6Q   | 953 GB   | GOOD   | 124.50 TB   |
--------------------------------------------------------------------------------------------------
| 1      |Samsung SSD 850 PRO 1TB |S2BBNWAG103016M |EXM02B6Q   | 953 GB   | GOOD   | 120.93 TB   |
--------------------------------------------------------------------------------------------------
| 2      |Samsung SSD 850 PRO 1TB |S2BBNWAG103031B |EXM02B6Q   | 953 GB   | GOOD   | 124.14 TB   |
--------------------------------------------------------------------------------------------------
| 3      |Samsung SSD 850 PRO 1TB |S2BBNWAG103020D |EXM02B6Q   | 953 GB   | GOOD   | 120.94 TB   |
--------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# hdparm -I /dev/disk/by-vdev/s0

/dev/disk/by-vdev/s0:

ATA device, with non-removable media
    Model Number:       Samsung SSD 850 PRO 1TB                 
    Serial Number:      S2BBNWAG103029M     
    Firmware Revision:  EXM02B6Q
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Used: unknown (minor revision code 0x0039) 
    Supported: 9 8 7 6 5 
    Likely used: 9
Configuration:
    Logical        max    current
    cylinders    16383    16383
    heads        16    16
    sectors/track    63    63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:  268435455
    LBA48  user addressable sectors: 2000409264
    Logical  Sector size:                   512 bytes
    Physical Sector size:                   512 bytes
    Logical Sector-0 offset:                  0 bytes
    device size with M = 1024*1024:      976762 MBytes
    device size with M = 1000*1000:     1024209 MBytes (1024 GB)
    cache/buffer size  = unknown
    Nominal Media Rotation Rate: Solid State Device
Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 1    Current = 1
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4 
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled    Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
       *    Write cache
       *    Look-ahead
       *    Host Protected Area feature set
       *    WRITE_BUFFER command
       *    READ_BUFFER command
       *    NOP cmd
       *    DOWNLOAD_MICROCODE
            SET_MAX security extension
       *    48-bit Address feature set
       *    Device Configuration Overlay feature set
       *    Mandatory FLUSH_CACHE
       *    FLUSH_CACHE_EXT
       *    SMART error logging
       *    SMART self-test
       *    General Purpose Logging feature set
       *    WRITE_{DMA|MULTIPLE}_FUA_EXT
       *    64-bit World wide name
            Write-Read-Verify feature set
       *    WRITE_UNCORRECTABLE_EXT command
       *    {READ,WRITE}_DMA_EXT_GPL commands
       *    Segmented DOWNLOAD_MICROCODE
       *    Gen1 signaling speed (1.5Gb/s)
       *    Gen2 signaling speed (3.0Gb/s)
       *    Gen3 signaling speed (6.0Gb/s)
       *    Native Command Queueing (NCQ)
       *    Phy event counters
       *    unknown 76[15]
       *    DMA Setup Auto-Activate optimization
            Device-initiated interface power management
       *    Asynchronous notification (eg. media change)
       *    Software settings preservation
            unknown 78[8]
       *    SMART Command Transport (SCT) feature set
       *    SCT Write Same (AC2)
       *    SCT Error Recovery Control (AC3)
       *    SCT Features Control (AC4)
       *    SCT Data Tables (AC5)
       *    reserved 69[4]
       *    DOWNLOAD MICROCODE DMA command
       *    SET MAX SETPASSWORD/UNLOCK DMA commands
       *    WRITE BUFFER DMA command
       *    READ BUFFER DMA command
       *    Data Set Management TRIM supported (limit 8 blocks)
Security: 
    Master password revision code = 65534
        supported
    not    enabled
    not    locked
    not    frozen
    not    expired: security count
        supported: enhanced erase
    2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 50025388700755f2
    NAA        : 5
    IEEE OUI    : 002538
    Unique ID    : 8700755f2
Checksum: correct
[root@newtevnfs ~]# /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician -d 0 --erase --force
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Disk Number:  0 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Erase:  Secure Erase is completed successfully.  
------------------------------------------------------------------------------------------------
Completed [  100% ] 
------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# time dd if=/dev/zero of=/dev/disk/by-vdev/s0 bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.80737 s, 382 MB/s

real    0m2.809s
user    0m0.004s
sys    0m1.073s
[root@newtevnfs ~]# zpool online fast s0
warning: device 's0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
[root@newtevnfs ~]# zpool replace fast s0
invalid vdev specification
use '-f' to override the following errors:
/dev/disk/by-vdev/s0 does not contain an EFI label but it may contain partition
information in the MBR.
[root@newtevnfs ~]# zpool replace -f fast s0
[root@newtevnfs ~]# zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0 in 6h26m with 0 errors on Mon Oct 26 05:38:10 2015
config:

    NAME        STATE     READ WRITE CKSUM
    data        ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        h0      ONLINE       0     0     0
        h1      ONLINE       0     0     0
        h2      ONLINE       0     0     0
        h3      ONLINE       0     0     0

errors: No known data errors

  pool: fast
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Feb  3 17:19:03 2016
    712M scanned out of 1.45T at 71.2M/s, 5h56m to go
    178M resilvered, 0.05% done
config:

    NAME             STATE     READ WRITE CKSUM
    fast             DEGRADED     0     0     0
      raidz1-0       DEGRADED     0     0     0
        replacing-0  UNAVAIL      0     0     0
          old        UNAVAIL      0     0     0  corrupted data
          s0         ONLINE       0     0     0  (resilvering)
        s1           ONLINE       0     0     0
        s2           ONLINE       0     0     0
        s3           ONLINE       0     0     0

errors: No known data errors

I'm leaving it resilvering, knowing that the RAIDZ1's performance will stay degraded unless we do the same with all the disks, and even then the fix will only last for a limited amount of time. Therefore IMO we should find a more lasting solution. Will investigate.

Strangely enough the resilver is quite slow, maybe due to activity on the other 3 disks?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.42    0.13    0.00   99.45

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb              52.00     0.00  545.00    9.00    24.87     0.09    92.27     1.58    2.86   1.64  90.60
sdc              63.00     0.00  540.00   10.00    25.41     0.09    94.92     1.58    2.88   1.67  92.00
sdd               9.00     0.00  586.00    8.00    23.59     0.08    81.60     1.71    2.88   1.53  90.80
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    2.00     0.00     0.00     1.50     0.01    6.00   6.00   1.20
sdj               0.00     0.00    0.00    2.00     0.00     0.00     1.50     0.01    6.50   6.50   1.30
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda               0.00     0.00    0.00  333.00     0.00    26.02   160.00     1.00    3.00   3.00 100.00

^C
[root@newtevnfs ~]# 

#10 Updated by Gerard Bernabeu Altayo over 3 years ago

Now about the replacement:

There are two separate bugs here. One, Samsung drives advertise support for queued TRIM even though it's not properly supported, causing corruption. Two, the kernel had a TRIM bug that affected serial TRIM with mdadm RAID, which is the kernel bug Samsung found and fixed (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f3f5da624e0a891c34d8cd513c57f1d9b0c7dadc). The queued TRIM bug still exists in the Samsung firmware.

Since the fixed kernel bug only affected the linear, raid0 and raid10 personalities over mdraid, I think we should go for MD RAID5 (equivalent to raidz1) plus XFS for the /fast filesystem.

For more details on the bug fix:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f3f5da624e0a891c34d8cd513c57f1d9b0c7dadc
This fixes a data corruption bug when using discard on top of MD linear, raid0 and raid10 personalities.
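If we do go MD RAID5 + XFS, batched TRIM via fstrim is generally considered safer than the 'discard' mount option; note though that the md layer has to pass discard through for RAID5, which older kernels do not, so this needs to be verified first. A sketch, assuming the new filesystem ends up mounted at /fast and fstrim is available:

<pre>
# Weekly batched TRIM instead of mounting with -o discard.
cat > /etc/cron.weekly/fstrim-fast <<'EOF'
#!/bin/sh
/sbin/fstrim -v /fast
EOF
chmod +x /etc/cron.weekly/fstrim-fast
</pre>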

#11 Updated by Gerard Bernabeu Altayo over 3 years ago

This is the data distribution on /fast0:

[root@newtevnfs ~]# du -sh /fast0/*
512 /fast0/ansys
512 /fast0/auger
512 /fast0/booster
512 /fast0/ecloud
571M /fast0/g4
512 /fast0/iota
512 /fast0/lost+found
50G /fast0/macridin
955G /fast0/mi
512 /fast0/mu2e
512 /fast0/projectx
2.0G /fast0/sandbox
512 /fast0/srflinac
512 /fast0/SRFmaterials
512 /fast0/staging
512 /fast0/this.is.fast
512 /fast0/this.is.fast0
29G /fast0/uslarp
[root@newtevnfs ~]# du -sh /fast0/mi/*
43G /fast0/mi/amundson
105K /fast0/mi/debug.00
118G /fast0/mi/egstern
795G /fast0/mi/rainswor

Updated plan:

1. Amitoj: unmount /fast from all TEV nodes

Gerard from now on:

2. Make sure nobody is using it:

[root@newtevnfs ~]# lsof | grep /fast0
[root@newtevnfs ~]#

3. Copy all data with:

 for i in `ls /fast0/ | grep -v mi`; do tar -czpf /extra/fastcopy/$i.tgz /fast0/$i & done
 for i in `ls /fast0/mi`; do tar -czpf /data0/fastcopy/mi/$i.tgz /fast0/mi/$i & done

4. zpool destroy fast

5. Go to LCC and Initialize the 4 disks with the full erase by:

5.1 Physically remove and insert all the SSD disks

5.2. ATA ERASE the disks:

   for i in 0 1 2 3; do /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician -d $i --erase --force; done

6. Create the MDRAID, LV and XFS on top (a verification sketch follows after this plan):

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[abcd]
pvcreate /dev/md0
vgcreate VolGroupArray /dev/md0
lvcreate -l +100%FREE VolGroupArray -n lvfast

export CHUNK_SZ_KB=512
export PARITY_DRIVE_COUNT=1
export NON_PARITY_DRIVE_COUNT=3
mkfs.xfs -d sunit=$(($CHUNK_SZ_KB*2)),swidth=$(($CHUNK_SZ_KB*2*$NON_PARITY_DRIVE_COUNT)) /dev/mapper/VolGroupArray-lvfast

mkdir /fast
mount  /dev/mapper/VolGroupArray-lvfast /fast

7. Move the data back:

 for i in `ls /extra/fastcopy/`; do cd /fast/ && tar -xvpzf /extra/fastcopy/$i --strip-components=1 & done
 mkdir /fast/mi; chown amundson.mi /fast/mi; chmod 755 /fast/mi
 for i in `ls /data0/fastcopy/mi/`; do cd /fast/mi/ && tar -xvpzf /data0/fastcopy/mi/$i --strip-components=2 & done

8. Remount on all TEV machines

ssh root@tev
rgang --rsh=/usr/bin/rsh tevall "mount /fast" 
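Verification sketch for step 6 (names as used in the plan above):

<pre>
cat /proc/mdstat               # RAID5 assembled, initial sync progress
mdadm --detail /dev/md0        # chunk size, state, member disks
xfs_info /fast                 # confirm sunit/swidth match the RAID geometry
df -h /fast
</pre>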

#12 Updated by Gerard Bernabeu Altayo over 3 years ago

I am actually making a couple more copies just in case something goes south:

[root@newtevnfs ~]# zfs send -v fast/fast0@201602041708 > /mnt/fast0.201602041708.gz 2> /root/zfs.send.log &
[24] 30210
[23]   Killed                  zfs send fast/fast0@201602041708 > /mnt/fast0.201602041708.gz 2> /root/zfs.send.log
[root@newtevnfs ~]# tail /root/zfs.send.log
send from @ to fast/fast0@201602041708 estimated size is 1.01T
total estimated size is 1.01T
TIME        SENT   SNAPSHOT
17:13:17   78.2M   fast/fast0@201602041708
17:13:18    253M   fast/fast0@201602041708
17:13:19    494M   fast/fast0@201602041708
17:13:20   1.12G   fast/fast0@201602041708
[root@newtevnfs ~]# 

Where /mnt is:
tev0302:/scratch on /mnt type nfs (rw,vers=4,addr=192.168.76.41,clientaddr=192.168.76.28)

Also making a tar-only (uncompressed) copy of /fast0/mi/rainswor because the gzip is going VERY slowly:

<pre>

/data0/fastcopy/mi/:
total 338G
drwxr-xr-x 3 root root    3 Feb  4 10:56 ..
drwxr-xr-x 2 root root    6 Feb  4 11:09 .
-rw-r--r-- 1 root root  11K Feb  4 11:09 debug.00.tgz
-rw-r--r-- 1 root root  34G Feb  4 11:56 amundson.tgz
-rw-r--r-- 1 root root  65G Feb  4 12:51 egstern.tgz
-rw-r--r-- 1 root root 240G Feb  4 17:16 rainswor.tgz

/extra/fastcopy/:
total 170G
drwxr-xr-x 11 root root 4.0K Feb  4 11:01 ..
-rw-r--r--  1 root root  117 Feb  4 11:09 this.is.fast.tgz
-rw-r--r--  1 root root  119 Feb  4 11:09 this.is.fast0.tgz
-rw-r--r--  1 root root  134 Feb  4 11:09 staging.tgz
-rw-r--r--  1 root root  125 Feb  4 11:09 SRFmaterials.tgz
-rw-r--r--  1 root root  117 Feb  4 11:09 srflinac.tgz
-rw-r--r--  1 root root  119 Feb  4 11:09 projectx.tgz
-rw-r--r--  1 root root  114 Feb  4 11:09 mu2e.tgz
-rw-r--r--  1 root root  114 Feb  4 11:09 iota.tgz
-rw-r--r--  1 root root  117 Feb  4 11:09 ecloud.tgz
-rw-r--r--  1 root root  114 Feb  4 11:09 booster.tgz
-rw-r--r--  1 root root  123 Feb  4 11:09 auger.tgz
-rw-r--r--  1 root root  115 Feb  4 11:09 ansys.tgz
-rw-r--r--  1 root root  57M Feb  4 11:11 g4.tgz
-rw-r--r--  1 root root 395M Feb  4 11:11 sandbox.tgz
-rw-r--r--  1 root root  19G Feb  4 11:36 uslarp.tgz
-rw-r--r--  1 root root  45G Feb  4 12:13 macridin.tgz
drwxr-xr-x  2 root root 4.0K Feb  4 16:42 .
-rw-r--r--  1 root root 106G Feb  4 17:16 mi.rainswor.extranozipped.tar

</pre>

The copy to tev0302 is going at only 12 MB/s due to some network problem I can't figure out... I tried a raw netcat transfer and it was slow too, so the issue is not with ZFS and/or NFS but somewhere below, in the TCP stack (or IB) between newtevnfs and tev0302.
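For reference, a sketch of the raw-TCP test used to take ZFS and NFS out of the picture (the port is arbitrary; listener syntax differs slightly between netcat variants):

<pre>
# On tev0302 (receiver):
nc -l -p 5001 > /dev/null

# On newtevnfs (sender): push 1 GiB of zeros straight over TCP and watch the rate.
dd if=/dev/zero bs=1M count=1024 | nc tev0302 5001
</pre>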

I am going to reseat the drives now so that they get unfrozen, and tonight, once the backups are done, I can erase the whole thing.

The full ZFS stream will most probably not be done in time due to the network issues; I may need to extend the downtime if proceeding without the extra backup is not acceptable to Amitoj.

#13 Updated by Gerard Bernabeu Altayo over 3 years ago

A reseat of the drives has been done. For the 1st one I took the disk offline first, but then I tried without doing that and it worked too, and it is less disruptive to ZFS (for about a second the ATA commands just retry). Now the disks are not frozen, so I'll be able to ATA ERASE them remotely:

[root@newtevnfs ~]# for i in 0 1 2 3; do hdparm -I /dev/disk/by-vdev/s$i | grep froz; done
    not    frozen
    not    frozen
    not    frozen
    not    frozen
[root@newtevnfs ~]# zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0 in 6h26m with 0 errors on Mon Oct 26 05:38:10 2015
config:

    NAME        STATE     READ WRITE CKSUM
    data        ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        h0      ONLINE       0     0     0
        h1      ONLINE       0     0     0
        h2      ONLINE       0     0     0
        h3      ONLINE       0     0     0

errors: No known data errors

  pool: fast
 state: ONLINE
  scan: resilvered 736K in 0h0m with 0 errors on Thu Feb  4 18:04:28 2016
config:

    NAME        STATE     READ WRITE CKSUM
    fast        ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        s0      ONLINE       0     0     0
        s1      ONLINE       0     0     0
        s2      ONLINE       0     0     0
        s3      ONLINE       0     0     0

errors: No known data errors
[root@newtevnfs ~]# 

#14 Updated by Gerard Bernabeu Altayo over 3 years ago

The backups finished successfully; proceeding with the plan:

[root@newtevnfs ~]# ls -ltrah /extra/fastcopy/ /data0/fastcopy/mi/ /root/ibtev0302scratch/
/data0/fastcopy/mi/:
total 605G
drwxr-xr-x 3 root root    3 Feb  4 10:56 ..
drwxr-xr-x 2 root root    6 Feb  4 11:09 .
-rw-r--r-- 1 root root  11K Feb  4 11:09 debug.00.tgz
-rw-r--r-- 1 root root  34G Feb  4 11:56 amundson.tgz
-rw-r--r-- 1 root root  65G Feb  4 12:51 egstern.tgz
-rw-r--r-- 1 root root 506G Feb  5 00:12 rainswor.tgz

/extra/fastcopy/:
total 848G
drwxr-xr-x 11 root root 4.0K Feb  4 11:01 ..
-rw-r--r--  1 root root  117 Feb  4 11:09 this.is.fast.tgz
-rw-r--r--  1 root root  119 Feb  4 11:09 this.is.fast0.tgz
-rw-r--r--  1 root root  134 Feb  4 11:09 staging.tgz
-rw-r--r--  1 root root  125 Feb  4 11:09 SRFmaterials.tgz
-rw-r--r--  1 root root  117 Feb  4 11:09 srflinac.tgz
-rw-r--r--  1 root root  119 Feb  4 11:09 projectx.tgz
-rw-r--r--  1 root root  114 Feb  4 11:09 mu2e.tgz
-rw-r--r--  1 root root  114 Feb  4 11:09 iota.tgz
-rw-r--r--  1 root root  117 Feb  4 11:09 ecloud.tgz
-rw-r--r--  1 root root  114 Feb  4 11:09 booster.tgz
-rw-r--r--  1 root root  123 Feb  4 11:09 auger.tgz
-rw-r--r--  1 root root  115 Feb  4 11:09 ansys.tgz
-rw-r--r--  1 root root  57M Feb  4 11:11 g4.tgz
-rw-r--r--  1 root root 395M Feb  4 11:11 sandbox.tgz
-rw-r--r--  1 root root  19G Feb  4 11:36 uslarp.tgz
-rw-r--r--  1 root root  45G Feb  4 12:13 macridin.tgz
drwxr-xr-x  2 root root 4.0K Feb  4 16:42 .
-rw-r--r--  1 root root 784G Feb  4 22:39 mi.rainswor.extranozipped.tar

/root/ibtev0302scratch/:
total 1.2T
-rw-r--r--  1 nobody nobody  59M Feb  4 17:35 fast0.201602041708.gz.netcat
-rw-r--r--  1 nobody nobody 1.0G Feb  4 17:42 gba.test
drwxrwxrwx  2 nobody nobody 4.0K Feb  4 21:10 .
-rw-r--r--  1 nobody nobody 142G Feb  4 21:12 fast0.201602041708.gz
-rw-r--r--  1 nobody nobody 1.1T Feb  5 01:45 fast0.201602041708IB.gz
dr-xr-x---. 7 root   root   4.0K Feb  5 05:23 ..
[21]-  Done                    tar -czpf /data0/fastcopy/mi/$i.tgz /fast0/mi/$i
[25]+  Done                    zfs send -v fast/fast0@201602041708 > /root/ibtev0302scratch/fast0.201602041708IB.gz 2> /root/zfs.send.log
[root@newtevnfs ~]# 

#15 Updated by Gerard Bernabeu Altayo over 3 years ago

Can't destroy the zpool just like that:

[root@newtevnfs ~]# zpool destroy fast
umount: /fast0: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
cannot unmount '/fast0': umount failed
could not destroy 'fast': could not unmount datasets
[root@newtevnfs ~]# zpool destroy -f fast
umount2: Device or resource busy
umount: /fast0: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
umount2: Device or resource busy
cannot unmount '/fast0': umount failed
could not destroy 'fast': could not unmount datasets
[root@newtevnfs ~]# fuser -c /fast
fast/  fast0/ 
[root@newtevnfs ~]# fuser -c /fast0
[root@newtevnfs ~]# fuser -c /fast
/fast:                   1rce     2rc     3rc     4rc     5rc     6rc     7rc     8rc     9rc    10rc    11rc    12rc    13rc    14rc    15rc    16rc    17rc    18rc    19rc    20rc    21rc    22rc    23rc    24rc    25rc    26rc    27rc    28rc    29rc    30rc    31rc    32rc    33rc    34rc    35rc    36rc    37rc    38rc    39rc    40rc    41rc    42rc    43rc    44rc    45rc    46rc    47rc    48rc    49rc    50rc    51rc    52rc    53rc    54rc    55rc    56rc    57rc    58rc    59rc    60rc    61rc    62rc    63rc    64rc    65rc    66rc    67rc    68rc    69rc    70rc    71rc    72rc    73rc    74rc    75rc    76rc    77rc    78rc    79rc    80rc    81rc    82rc    83rc    84rc    85rc    86rc    87rc    88rc    89rc    90rc    91rc    92rc    93rc    94rc    95rc    96rc    97rc    98rc    99rc   100rc   101rc   102rc   103rc   104rc   105rc   106rc   107rc   108rc   109rc   110rc   111rc   112rc   113rc   114rc   115rc   116rc   117rc   118rc   119rc   120rc   121rc   122rc   123rc   124rc   125rc   126rc   127rc   128rc   129rc   130rc   131rc   132rc   133rc   134rc   135rc   136rc   137rc   138rc   139rc   140rc   141rc   142rc   143rc   144rc   145rc   146rc   147rc   148rc   149rc   150rc   151rc   152rc   153rc   154rc   155rc   156rc   157rc   158rc   159rc   160rc   161rc   162rc   163rc   164rc   165rc   166rc   167rc   168rc   169rc   170rc   171rc   172rc   173rc   174rc   175rc   176rc   177rc   178rc   179rc   180rc   181rc   182rc   183rc   184rc   185rc   186rc   187rc   188rc   189rc   190rc   191rc   192rc   193rc   194rc   195rc   196rc   197rc   198rc   199rc   200rc   201rc   202rc   203rc   204rc   205rc   206rc   207rc   208rc   209rc   210rc   211rc   212rc   213rc   214rc   215rc   216rc   217rc   218rc   219rc   220rc   221rc   222rc   223rc   224rc   225rc   226rc   227rc   228rc   229rc   230rc   231rc   232rc   233rc   234rc   235rc   236rc   237rc   238rc   239rc   240rc   241rc   242rc   243rc   244rc   245rc   246rc   247rc   248rc   249rc   250rc   251rc   252rc   253rc   254rc   255rc   256rc   257rc   259rc   260rc   261rc   262rc   263rc   264rc   265rc   266rc   267rc   268rc   269rc   270rc   271rc   272rc   273rc   274rc   275rc   276rc   277rc   278rc   279rc   280rc   281rc   282rc   283rc   284rc   285rc   286rc   287rc   288rc   289rc   290rc   291rc   292rc   293rc   294rc   295rc   296rc   297rc   298rc   299rc   300rc   301rc   302rc   303rc   304rc   305rc   306rc   307rc   308rc   309rc   310rc   311rc   319rc   320rc   321rc   322rc   323rc   324rc   325rc   326rc   327rc   328rc   329rc   330rc   331rc   332rc   333rc   334rc   335rc   336rc   337rc   338rc   339rc   340rc   341rc   342rc   354rc   355rc   356rc   387rc   388rc   809rc   810rc   811rc   812rc   817rc   818rc   819rc   820rc   821rc   822rc  1182rc  1200rc  1201rc  1300rce  1738rc  1902rc  2277rc  2278rc  2279rc  2280rc  2281rc  2282rc  2283rc  2284rc  2285rc  2286rc  2287rc  2288rc  2289rc  2290rc  2291rc  2292rc  2293rc  2294rc  2295rc  2296rc  2297rc  2298rc  2299rc  2300rc  2301rc  2302rc  2303rc  2304rc  2372rc  2381rc  2382rc  2383rc  2384rc  2385rc  2386rc  2387rc  2388rc  2389rc  2390rc  2391rc  2392rc  2393rc  2394rc  2395rc  2396rc  2397rc  2398rc  2399rc  2400rc  2401rc  2402rc  2403rc  2404rc  2405rc  2406rc  2407rc  2408rc  2409rc  2410rc  2411rc  2412rc  2413rc  2414rc  2415rc  2416rc  2417rc  2418rc  2419rc  2420rc  2421rc  2422rc  2423rc  2424rc  2425rc  2426rc  2427rc  2428rc  2429rc  2430rc  2431rc  2432rc  2433rc  2434rc  2435rc  2436rc  2437rc  
2438rc  2439rc  2559rc  2560rc  2561rc  2562rc  2563rc  2564rc  2565rc  2566rc  2567rc  2568rc  2569rc  2570rc  2571rc  2572rc  2573rc  2574rc  2575rc  2576rc  2577rc  2578rc  2579rc  2580rc  2581rc  2582rc  2583rc  2584rc  2586rc  4838rc  4839rc  4883rc  4911rce  5470rc  5471rc  5472rc  5473rc  5474rc  5475rc  5476rc  5477rc  5478rc  5479rc  5480rc  5481rc  5482rc  5483rc  5484rc  5485rc  5486rc  5487rc  5488rc  5489rc  5490rc  5491rc  5492rc  5493rc  5494rc  5495rc  5496rc  5497rc  5498rc  5499rc  5500rc  5501rc  5502rc  5503rc  5504rc  5505rc  5506rc  5507rc  5508rc  5509rc  5510rc  5511rc  5512rc  5513rc  5514rc  5515rc  5516rc  5517rc  5518rc  5519rc  5520rc  5521rc  5522rc  5523rc  5524rc  5525rc  5526rc  5527rc  5528rc  5529rc  5530rc  5531rc  5532rc  5533rc  5534rc  5535rc  5536rc  5537rc  5538rc  5539rc  5540rc  5541rc  5542rc  5543rc  5544rc  5545rc  5546rc  5547rc  5548rc  5549rc  5550rc  5551rc  5552rc  5553rc  5554rc  5555rc  5556rc  5557rc  5558rc  5559rc  5560rc  5561rc  5562rc  5563rc  5564rc  5565rc  5566rc  5567rc  5568rc  5569rc  5570rc  5571rc  5572rc  5573rc  5574rc  5575rc  5576rc  5577rc  5578rc  5579rc  5580rc  5581rc  5582rc  5583rc  5584rc  5585rc  5586rc  5587rc  5588rc  5589rc  5590rc  5600rc  5617rc  5618rc  5619rc  5624rc  5625rc  5626rc  5627rc  5628rc  5629rc  5630rc  5631rc  5632rc  5633rc  5634rc  5635rc  5636rc  5637rc  5638rc  5639rc  5640rc  5641rc  5642rc  5643rc  5644rc  5645rc  5647rc  5648rc  5649rc  5650rc  5651rc  5652rc  5653rc  5654rc  5655rc  5656rc  5657rc  5658rc  5659rc  5660rc  5661rc  5662rc  5663rc  5664rc  5665rc  5666rc  5667rc  5668rc  5669rc  5670rc  5671rc  5672rc  5673rc  5674rc  5675rc  5676rc  5677rc  5678rc  5679rc  5680rc  5681rc  5682rc  5683rc  5684rc  5685rc  5686rc  5687rc  5688rc  5689rc  5690rc  5691rc  5692rc  5693rc  5694rc  5695rc  5696rc  5697rc  5698rc  5699rc  5700rc  5701rc  5702rc  5703rc  5704rc  5705rc  5706rc  5707rc  5708rc  5709rc  5710rc  5711rc  5712rc  5713rc  5714rc  5715rc  5716rc  5717rc  5718rc  5719rc  5720rc  5721rc  5722rc  5723rc  5724rc  5725rc  5726rc  5727rc  5728rc  5729rc  5730rc  5731rc  5732rc  5733rc  5734rc  5735rc  5736rc  5737rc  5738rc  5739rc  5740rc  5741rc  5742rc  5743rc  5744rc  5745rc  5759rc  5772rc  5773rc  5782rc  5788rc  5987rc  5993rc  5997rc  5998rc  5999rc  6000rc  6001rc  6002rc  6003rc  6004rc  6005rc  6006rc  6007rc  6008rc  6009rc  6010rc  6011rc  6012rc  6013rc  6014rc  6015rc  6016rc  6017rc  6018rc  6019rc  6020rc  6024rc  6041rc  6042rc  6358rce  6366rce  6380rce  6400rce  6422rce  6434rce  6435rce  6440rce  6462rce  6478rce  6504rce  6517rce  6518rce  6550rce  6601rce  6617rce  6653rc  6654rc  6655rc  6656rc  6657rc  6658rc  6659rc  6660rc  6661rc  6662rc  6663rc  6664rc  6665rc  6666rc  6667rc  6668rc  6669rc  6670rc  6671rc  6672rc  6673rc  6674rc  6675rc  6676rc  6685rce  6702rce  6712rc  6713rc  6714rc  6715rc  6716rc  6717rc  6718rc  6719rc  6720rc  6721rc  6722rc  6747rce  6757rce  6778rce  6796rce  6808rce  6817rce  6869rce  6974rce  7106rce  7115rce  7130rce  7161rce  7175rce  7188rce  7190rce  7192rce  7194rce  7196rce  7198rce 11087rce 11088rce 11089rce 11090rce 11091rce 11092rce 11093rce 11094rce 15435rc 15436rc 15437rc 15462rc 16331rce 20832rce 23333rce 23334rce 26262rce 26271rce 26607rce 26962rce 28253rc 28254rc 28255rc 28256rc 28257rc 28258rc 28259rc 28260rc
[root@newtevnfs ~]# ps faux | grpe 28260
-bash: grpe: command not found
[root@newtevnfs ~]# ps faux | grep 28260
root     28260  0.5  0.0      0     0 ?        S    Jan27  62:40  \_ [nfsd]
root     26414  0.0  0.0 103252   832 pts/1    S+   05:32   0:00          \_ grep 28260
[root@newtevnfs ~]# service nfsd stop^C
[root@newtevnfs ~]# 
[root@newtevnfs ~]# service nfs stop
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS quotas:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]
Shutting down RPC idmapd:                                  [  OK  ]
[root@newtevnfs ~]# zpool destroy -f fast
[root@newtevnfs ~]# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS quotas:                                       [  OK  ]
Starting NFS mountd: rpc.mountd: svc_tli_create: could not open connection for udp6
rpc.mountd: svc_tli_create: could not open connection for tcp6
rpc.mountd: svc_tli_create: could not open connection for udp6
rpc.mountd: svc_tli_create: could not open connection for tcp6
rpc.mountd: svc_tli_create: could not open connection for udp6
rpc.mountd: svc_tli_create: could not open connection for tcp6
                                                           [  OK  ]
Starting NFS daemon: rpc.nfsd: address family inet6 not supported by protocol TCP
                                                           [  OK  ]
Starting RPC idmapd:                                       [  OK  ]
[root@newtevnfs ~]# date
Fri Feb  5 05:34:00 CST 2016
[root@newtevnfs ~]# 
[root@newtevnfs ~]#    for i in 0 1 2 3; do /opt/samsung/samsung_magician_dc-v1.0_rtm_p2/64bin/magician -d $i --erase --force; done
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Disk Number:  0 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Erase:  Secure Erase is completed successfully.  
------------------------------------------------------------------------------------------------
Completed [  100% ] 
------------------------------------------------------------------------------------------------
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Disk Number:  1 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Erase:  Secure Erase is completed successfully.  
------------------------------------------------------------------------------------------------
Completed [  100% ] 
------------------------------------------------------------------------------------------------
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Disk Number:  2 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Erase:  Secure Erase is completed successfully.  
------------------------------------------------------------------------------------------------
Completed [  100% ] 
------------------------------------------------------------------------------------------------
================================================================================================
Samsung(R) SSD Magician DC Version 1.0
Copyright (c) 2014 Samsung Corporation
================================================================================================
------------------------------------------------------------------------------------------------
Disk Number:  3 | Model Name: Samsung SSD 850 PRO 1TB | Firmware Version: EXM02B6Q
------------------------------------------------------------------------------------------------
Erase:  Secure Erase is completed successfully.  
------------------------------------------------------------------------------------------------
Completed [  100% ] 
------------------------------------------------------------------------------------------------
[root@newtevnfs ~]# 
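A quick post-erase sanity check (a sketch only, not captured in this log, and assuming the four 850 PROs are still /dev/sda through /dev/sdd as in the mdadm command further down) would be to confirm each drive still reports healthy and the expected EXM02B6Q firmware:

for d in /dev/sd{a,b,c,d}; do
    # print the firmware string and the overall SMART health verdict for each SSD
    smartctl -i -H $d | egrep -i 'firmware|overall-health'
done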
[root@newtevnfs ~]# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdi1[0] sdj1[1]
      513984 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdj3[1] sdi3[0]
      239349824 blocks super 1.1 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
[root@newtevnfs ~]# mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sd[abcd]
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md2 started.
[root@newtevnfs ~]# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md2 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
      3000219648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [>....................]  recovery =  0.0% (728320/1000073216) finish=160.0min speed=104045K/sec
      bitmap: 8/8 pages [32KB], 65536KB chunk

md0 : active raid1 sdi1[0] sdj1[1]
      513984 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdj3[1] sdi3[0]
      239349824 blocks super 1.1 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
[root@newtevnfs ~]# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md2 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
      3000219648 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [>....................]  recovery =  0.3% (3750232/1000073216) finish=137.2min speed=120975K/sec
      bitmap: 0/8 pages [0KB], 65536KB chunk

md0 : active raid1 sdi1[0] sdj1[1]
      513984 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdj3[1] sdi3[0]
      239349824 blocks super 1.1 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
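The [4/3] [UUU_] state right after creation is normal: mdadm builds a new RAID5 array with the last member marked as rebuilding and reconstructs it during the initial sync, which is what the ~104-120 MB/s 'recovery' line above shows. If the rebuild ever needs to go faster, the md throttles can be raised (a sketch; the kernel defaults are roughly 1000 and 200000 KB/s per device):

# raise the per-device md resync speed limits (values in KB/s)
sysctl -w dev.raid.speed_limit_min=50000
sysctl -w dev.raid.speed_limit_max=500000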

Performance looks fine now (this iostat sample is taken while md2 rebuilds sdd):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.10    0.00    0.00   98.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc           25931.00     0.00 5118.00    2.00   121.29     0.00    48.52     8.92    1.74   0.19  96.00
sdd               0.00 25802.00    0.00 5240.00     0.00   121.48    47.48    25.32    4.83   0.19  99.90
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda           23184.00     0.00 7867.00    2.00   121.29     0.00    31.57     2.93    0.37   0.04  31.50
sdb           23729.00     0.00 7322.00    2.00   121.29     0.00    33.92     2.67    0.37   0.04  31.50
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.19    0.00    0.00   98.81

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc           26454.00     0.00 5212.00    0.00   123.76     0.00    48.63     8.84    1.70   0.19  97.80
sdd               0.00 26266.00    0.00 5418.00     0.00   123.64    46.74    26.55    4.90   0.18 100.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda           23772.00     0.00 7890.00    0.00   123.68     0.00    32.10     3.03    0.38   0.04  31.70
sdb           24325.00     0.00 7337.00    0.00   123.68     0.00    34.52     2.75    0.38   0.04  31.70
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

[root@newtevnfs ~]# pvcreate /dev/md2
  Physical volume "/dev/md2" successfully created
[root@newtevnfs ~]# vgcreate VolGroupArray /dev/md2
  Volume group "VolGroupArray" successfully created
[root@newtevnfs ~]# lvcreate -l +100%FREE VolGroupArray -n lvfast
  Logical volume "lvfast" created
[root@newtevnfs ~]# export CHUNK_SZ_KB=512
[root@newtevnfs ~]# export PARITY_DRIVE_COUNT=1
[root@newtevnfs ~]# export NON_PARITY_DRIVE_COUNT=3
[root@newtevnfs ~]# mkfs.xfs -d sunit=$(($CHUNK_SZ_KB*2)),swidth=$(($CHUNK_SZ_KB*2*$NON_PARITY_DRIVE_COUNT)) /dev/mapper/VolGroupArray-lvfast
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/mapper/VolGroupArray-lvfast isize=256    agcount=32, agsize=23439232 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=750054400, imaxpct=5
         =                       sunit=128    swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=366240, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
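The sunit/swidth numbers line up once the units are accounted for: the mkfs.xfs flags take 512-byte sectors while the output reports 4 KiB filesystem blocks, so the values above describe the same geometry. A small sketch of the arithmetic, plus a way to re-check it later:

# sunit  = 512 KiB chunk * 2      = 1024 sectors  -> 1024/8 = 128 blocks (matches sunit=128)
# swidth = 1024 sectors * 3 disks = 3072 sectors  -> 3072/8 = 384 blocks (matches swidth=384)
xfs_info /fast        # prints the same sunit/swidth once the filesystem is mounted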
[root@newtevnfs ~]# mkdir /fast
[root@newtevnfs ~]# mount  /dev/mapper/VolGroupArray-lvfast /fast
[root@newtevnfs ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              225G  4.2G  209G   2% /
tmpfs                  32G     0   32G   0% /dev/shm
/dev/md0              479M   34M  420M   8% /boot
data/data0            5.0T  4.9T  124G  98% /data0
192.168.176.26:/extra
                      1.8T  1.2T  530G  70% /extra
ibtev0302:/scratch    1.7T  1.2T  552G  69% /root/ibtev0302scratch
/dev/mapper/VolGroupArray-lvfast
                      2.8T   34M  2.8T   1% /fast
[root@newtevnfs ~]# 
[root@newtevnfs ~]# lvs
  LV     VG            Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lvfast VolGroupArray -wi-ao---- 2.79t                                                    
[root@newtevnfs ~]# pvs
  PV         VG            Fmt  Attr PSize PFree
  /dev/md2   VolGroupArray lvm2 a--  2.79t    0 
[root@newtevnfs ~]# 
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.91438 s, 274 MB/s

real    0m3.941s
user    0m0.002s
sys     0m1.347s
[root@newtevnfs ~]# 

And this happened while the array was still resyncing, so the new setup is not slow.
[root@newtevnfs ~]# umount /fast
[root@newtevnfs ~]# mount -a
[root@newtevnfs ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              225G  4.2G  209G   2% /
tmpfs                  32G     0   32G   0% /dev/shm
/dev/md0              479M   34M  420M   8% /boot
data/data0            5.0T  4.9T  124G  98% /data0
192.168.176.26:/extra
                      1.8T  1.2T  530G  70% /extra
ibtev0302:/scratch    1.7T  1.2T  552G  69% /root/ibtev0302scratch
/dev/mapper/VolGroupArray-lvfast
                      2.8T  1.1G  2.8T   1% /fast
[root@newtevnfs ~]# grep fast /etc/fstab 
/dev/mapper/VolGroupArray-lvfast         /fast  xfs     defaults        0 0
#fast/fast0 on /fast0 type zfs (rw,xattr)
[root@newtevnfs ~]# 
[root@newtevnfs ~]# for i in `ls /data0/fastcopy/mi/*.tgz`; do cd /fast/mi/ && tar -xzpf $i & done
[1] 31094
[2] 31095
[3] 31096
[4] 31098
[root@newtevnfs ~]# 
[2]   Done                    cd /fast/mi/ && tar -xzpf $i
[root@newtevnfs ~]# 
[root@newtevnfs ~]# 

All the exports are going into the wrong directory now; I will need to do a 'mv' once the tars finish.

Right now the performance is not amazing:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.72    0.00    6.02    0.13    0.00   90.13

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc            8049.00  7671.00 6990.00 4795.00    58.70    48.74    18.67    20.12    1.71   0.08  96.50
sdd               0.00  6607.00    0.00 7150.00     0.00    53.74    15.39     3.19    0.45   0.08  59.20
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00   11.00    0.00     0.73     0.00   136.00     0.03    2.82   1.00   1.10
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda            3354.00  5799.00 11287.00 5878.00    57.20    45.62    12.27     2.19    0.13   0.03  53.70
sdb            3928.00  6073.00 11845.00 5938.00    61.61    46.93    12.50     2.29    0.13   0.03  53.60
md2               0.00     0.00    0.00  249.00     0.00   101.80   837.30     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00  249.00     0.00   101.94   838.43     7.36   29.73   4.02 100.00

^C
[root@newtevnfs ~]# 

But the array is still doing its initial resync, so some slowdown is expected for now.

#16 Updated by Gerard Bernabeu Altayo over 3 years ago

Just waiting on the 2 big restores now:

fast0/macridin/oforodo/inputoptions_files/input_options_dm0p03125

[1]   Done                    cd /fast/ && tar -xvpzf $i
[2]   Done                    cd /fast/ && tar -xvpzf $i
[3]   Done                    cd /fast/ && tar -xvpzf $i
[4]   Done                    cd /fast/ && tar -xvpzf $i
[5]   Done                    cd /fast/ && tar -xvpzf $i
[6]   Done                    cd /fast/ && tar -xvpzf $i
[7]   Done                    cd /fast/ && tar -xvpzf $i
[8]   Done                    cd /fast/ && tar -xvpzf $i
[9]   Done                    cd /fast/ && tar -xvpzf $i
[10]   Done                    cd /fast/ && tar -xvpzf $i
[11]   Done                    cd /fast/ && tar -xvpzf $i
[12]   Done                    cd /fast/ && tar -xvpzf $i
[13]   Done                    cd /fast/ && tar -xvpzf $i
[14]   Done                    cd /fast/ && tar -xvpzf $i
[15]-  Done                    cd /fast/ && tar -xvpzf $i
[16]+  Done                    cd /fast/ && tar -xvpzf $i
[root@newtevnfs ~]# 
[root@newtevnfs ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              225G  4.2G  209G   2% /
tmpfs                  32G     0   32G   0% /dev/shm
/dev/md0              479M   34M  420M   8% /boot
data/data0            5.0T  4.9T  124G  98% /data0
192.168.176.26:/extra
                      1.8T  1.2T  530G  70% /extra
ibtev0302:/scratch    1.7T  1.2T  552G  69% /root/ibtev0302scratch
/dev/mapper/VolGroupArray-lvfast
                      2.8T  232G  2.6T   9% /fast
[root@newtevnfs ~]# 
root     30978  0.0  0.0 104984  5372 ?        Ss   05:51   0:00  \_ sshd: root@pts/3 
root     30991  0.0  0.0 108476  1840 pts/3    Ss   05:51   0:00      \_ -bash
root     31029  0.0  0.0 120896  1320 pts/3    S+   05:51   0:00          \_ screen
root     31031  0.0  0.0 121168  1448 ?        Ss   05:51   0:00              \_ SCREEN
root     31032  0.0  0.0 108472  1956 pts/4    Ss   05:51   0:00                  \_ /bin/bash
root     31096  0.0  0.0 108472   864 pts/4    S    05:53   0:00                      \_ /bin/bash
root     31100  8.9  0.0 116012  1152 pts/4    S    05:53   1:54                      |   \_ tar -xzpf /data0/fastcopy/mi/egstern.tgz
root     31105 39.6  0.0   4436   600 pts/4    D    05:53   8:27                      |       \_ gzip -d
root     31098  0.0  0.0 108472   864 pts/4    S    05:53   0:00                      \_ /bin/bash
root     31101  8.6  0.0 116012  1144 pts/4    S    05:53   1:50                      |   \_ tar -xzpf /data0/fastcopy/mi/rainswor.tgz
root     31104 43.9  0.0   4436   604 pts/4    R    05:53   9:21                      |       \_ gzip -d
root     32734  0.0  0.0 110944  1804 pts/4    R+   06:14   0:00                      \_ ps faux

And we are limited by the single-threaded gzip decompression speed again (see the pigz note after the dd numbers below), but I'm worried that the raw write performance is not that great either:

[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.6246 s, 92.4 MB/s

real    0m11.647s
user    0m0.004s
sys     0m1.764s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 15.0429 s, 71.4 MB/s

real    0m15.250s
user    0m0.000s
sys     0m2.267s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024 count=1024K conv=fsync
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 14.8028 s, 72.5 MB/s

real    0m15.063s
user    0m0.105s
sys     0m2.836s
[root@newtevnfs ~]# 
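Since the extraction is bottlenecked on single-threaded gzip -d, pigz might shave some time off the remaining restores (a sketch, assuming pigz is installed; this was not run as part of this ticket, and note that a gzip stream cannot be split across cores on decompression, so pigz only offloads the read/write/checksum work to extra threads):

# decompress with pigz and stream straight into tar
pigz -dc /data0/fastcopy/mi/rainswor.tgz | tar -xpf - -C /fast/mi/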

#17 Updated by Gerard Bernabeu Altayo over 3 years ago

I enabled discards on the LVM layer:

[root@newtevnfs mi]# cat /etc/lvm/lvm.conf | grep -i discards
    # Issue discards to a logical volumes's underlying physical volume(s) when
    # lvremove, lvreduce, etc).  Discards inform the storage that a region is
    # no longer in use.  Storage that supports discards advertise the protocol
    # specific way discards should be issued by the kernel (TRIM, UNMAP, or
    # from discards but SSDs and thinly provisioned LUNs generally do.  If set
    # to 1, discards will only be issued if both the storage and kernel provide
    issue_discards = 1
    # Specify discards behaviour of the thin pool volume.
    # thin_pool_discards = "passdown"
    #     discards
    #     discards_non_power_2
    # thin_disabled_features = [ "discards", "block_size" ]
[root@newtevnfs mi]#

I am worried that TRIM is not really in place:

[root@newtevnfs ~]# fstrim -v /fast
fstrim: /fast: FITRIM ioctl failed: Operation not supported
[root@newtevnfs ~]# mount | grep xfs
/dev/mapper/VolGroupArray-lvfast on /fast type xfs (rw,discard)
[root@newtevnfs ~]# 
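The FITRIM failure despite the (rw,discard) mount most likely means that, at this point, one of the layers in the sd[a-d] -> md2 -> LVM stack was not advertising discard support; the lsblk -D output in the next update shows a non-zero discard granularity end to end. Also worth noting: the /etc/fstab entry shown in update #15 still says just 'defaults', so if the online discard mount option is what we want to keep across reboots, it needs to be persisted there (a sketch of the adjusted line, not applied as part of this ticket):

/dev/mapper/VolGroupArray-lvfast         /fast  xfs     defaults,discard        0 0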

#18 Updated by Gerard Bernabeu Altayo over 3 years ago

Apparently TRIM is now enabled at all levels of the stack for these SSDs:

[root@newtevnfs ~]# lsblk -D
NAME                            DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdc                                    0      512B       2G         0
└─md2                                  0        2M       2G         0
  └─VolGroupArray-lvfast (dm-0)   524288        2M       2G         0
sdd                                    0      512B       2G         0
└─md2                                  0        2M       2G         0
  └─VolGroupArray-lvfast (dm-0)   524288        2M       2G         0
sde                                    0        0B       0B         0
├─sde1                                 0        0B       0B         0
└─sde9                                 0        0B       0B         0
sdf                                    0        0B       0B         0
├─sdf1                                 0        0B       0B         0
└─sdf9                                 0        0B       0B         0
sdg                                    0        0B       0B         0
├─sdg1                                 0        0B       0B         0
└─sdg9                                 0        0B       0B         0
sdh                                    0        0B       0B         0
├─sdh1                                 0        0B       0B         0
└─sdh9                                 0        0B       0B         0
sdi                                    0        0B       0B         0
├─sdi1                                 0        0B       0B         0
│ └─md0                                0        0B       0B         0
├─sdi2                                 0        0B       0B         0
└─sdi3                                 0        0B       0B         0
  └─md1                                0        0B       0B         0
sdj                                    0        0B       0B         0
├─sdj1                                 0        0B       0B         0
│ └─md0                                0        0B       0B         0
├─sdj2                                 0        0B       0B         0
└─sdj3                                 0        0B       0B         0
  └─md1                                0        0B       0B         0
sda                                    0      512B       2G         0
└─md2                                  0        2M       2G         0
  └─VolGroupArray-lvfast (dm-0)   524288        2M       2G         0
sdb                                    0      512B       2G         0
└─md2                                  0        2M       2G         0
  └─VolGroupArray-lvfast (dm-0)   524288        2M       2G         0
[root@newtevnfs ~]# 
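An alternative to mounting with the discard option is a periodic fstrim, now that discard is advertised through the whole stack. A minimal sketch of a weekly job (not deployed as part of this ticket):

#!/bin/sh
# weekly TRIM of /fast -- drop into /etc/cron.weekly/ and make it executable
fstrim -v /fast >> /var/log/fstrim.log 2>&1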

#19 Updated by Gerard Bernabeu Altayo over 3 years ago

  • Status changed from New to Resolved

I have moved /fast to the original /fast0 location and created the /fast symlink on newtevnfs. I've also mounted /fast on all tev workers, so the work is done. I notified Amitoj so that he can end the downtime now.

Performance cannot be measured accurately yet because the RAID is still syncing (for about one more hour). We should run some iozone tests to make sure that performance is OK.
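For that follow-up benchmark, something along these lines would exercise the array with several concurrent streams (a sketch; the 1 MiB record size, 4 GiB per-thread file size and thread count are arbitrary choices, not values from this ticket):

# write/rewrite (-i 0) and read/reread (-i 1), 4 threads, one test file per thread on /fast
iozone -i 0 -i 1 -r 1m -s 4g -t 4 -F /fast/ioz1 /fast/ioz2 /fast/ioz3 /fast/ioz4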

Now that the array is done syncing:

[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.51419 s, 238 MB/s

real    0m4.545s
user    0m0.004s
sys    0m1.297s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.95686 s, 271 MB/s

real    0m4.178s
user    0m0.002s
sys    0m1.477s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.87424 s, 277 MB/s

real    0m4.090s
user    0m0.002s
sys    0m1.781s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.84044 s, 280 MB/s

real    0m4.119s
user    0m0.003s
sys    0m1.967s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test2 bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.9606 s, 271 MB/s

real    0m3.962s
user    0m0.000s
sys    0m1.313s
[root@newtevnfs ~]# time dd if=/dev/zero of=/fast/gba.test2 bs=1024k count=1024 conv=fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 3.87548 s, 277 MB/s

real    0m4.117s
user    0m0.004s
sys    0m1.497s
[root@newtevnfs ~]# 

We've done a set of performance tests, all documented at https://fermipoint.fnal.gov/organization/cs/scd/sci_comp_acq/Pages/Technical-Evaluations.aspx

Closing this ticket.


