Bug #9248

decommissioning cmseos6.fnal.gov

Added by Gerard Bernabeu Altayo over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Start date:
06/18/2015
Due date:
06/19/2015
% Done:

0%

Estimated time:
Duration: 2

Description

Lisa,

More sanity checking on satabeasts: that one should have gone away the
year before last; it's a 500 GB unit. It doesn't have to be tomorrow, and
I guess it doesn't much matter if you want to keep it, but it's old and
crusty, so please let it go if you can.

History

#1 Updated by Gerard Bernabeu Altayo over 4 years ago

Lisa started the drain last week but stopped it on 2 FS to avoid overloading the node. I'm restarting it now.

I see that cmseos6 is still on:

[root@cmssrv222 ~]# eos fs ls | grep -v rw

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted ro nodrain online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted ro nodrain online

I've restarted the drain:

[root@cmssrv222 ~]# eos fs ls | grep cmseos6.fnal
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted drain draining online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted drain prepare online
[root@cmssrv222 ~]#

Hopefully will be done by tomorrow morning.

#2 Updated by Gerard Bernabeu Altayo over 4 years ago

cmseos6 is almost empty now:

[root@cmseos6 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 57G 9.1G 45G 17% /
tmpfs 3.9G 0 3.9G 0% /dev/shm
/dev/sda1 976M 86M 839M 10% /boot
/dev/sde 11T 26G 11T 1% /storage/data3
/dev/sdc 11T 2.7G 11T 1% /storage/data1
/dev/sdd 11T 3.8G 11T 1% /storage/data2
eosmain 6.9P 2.7P 4.3P 39% /eos

But it's not totally empty yet:

[root@cmssrv222 uscms]# eos fs ls | grep cmseos6.fnal
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted drain expired online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted drain expired online

There are definitely some files in it still:

[root@cmssrv222 uscms]# eos fs dumpmd 77 -path
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
[root@cmssrv222 uscms]# eos fs dumpmd 78 -path
[root@cmssrv222 uscms]# eos fs dumpmd 79 -path
path=/eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe
[root@cmssrv222 uscms]#

One of the FS has some fsck issues that may be preventing the migration:

[root@cmssrv222 uscms]# eos fsck report -a | grep fsid=77
[root@cmssrv222 uscms]# eos fsck report -a | grep fsid=78
[root@cmssrv222 uscms]# eos fsck report -a | grep fsid=79
timestamp=1435161509 tag="d_mem_sz_diff" fsid=79 n=2
timestamp=1435161509 tag="rep_diff_n" fsid=79 n=14
[root@cmssrv222 uscms]#

[root@cmssrv222 uscms]# eos fsck report -a -l | grep fsid=79
timestamp=1435161509 tag="d_mem_sz_diff" fsid=79 n=2 lfn="/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_7_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_65_SumJetMass_AnalysisTree.root"
timestamp=1435161509 tag="rep_diff_n" fsid=79 n=14 lfn="/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_7_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_65_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_106_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_157_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_219_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_245_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-600toInf_Tune4C_13TeV-madgraph-tauola_248_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-400to600_Tune4C_13TeV-madgraph-tauola_25_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-400to600_Tune4C_13TeV-madgraph-tauola_67_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-400to600_Tune4C_13TeV-madgraph-tauola_238_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-400to600_Tune4C_13TeV-madgraph-tauola_274_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-400to600_Tune4C_13TeV-madgraph-tauola_294_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.DYJetsToLL_M-50_HT-400to600_Tune4C_13TeV-madgraph-tauola_293_SumJetMass_AnalysisTree.root","/eos/uscms/store/user/awhitbe1/13TeVGJetsStudies/13TeV_50ns40PU.GJets_400HT600_13TeV_PU_S14_57_SumJetMass_AnalysisTree.root"
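
Not part of the original session: a small sketch for condensing the `eos fsck report -a` output above into one "fsid tag count" line per entry, which may be easier to scan than grepping per fsid. The function name `fsck_summary` is made up for illustration, and the parser assumes the `tag="..." fsid=N n=N` key=value layout seen in this ticket.

```shell
#!/bin/sh
# Summarise `eos fsck report -a` lines read from stdin as "fsid tag count".
# Assumption: each line carries tag="...", fsid=N and n=N as space-separated
# key=value fields, as in the output shown in this ticket.
fsck_summary() {
    awk '{
        tag = ""; fsid = ""; n = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^tag=/)  { tag = $i;  sub(/^tag=/, "", tag);  gsub(/"/, "", tag) }
            if ($i ~ /^fsid=/) { fsid = $i; sub(/^fsid=/, "", fsid) }
            if ($i ~ /^n=/)    { n = $i;    sub(/^n=/, "", n) }
        }
        # Skip lines that do not belong to any filesystem.
        if (fsid != "") print fsid, tag, n
    }'
}
```

It would be used as `eos fsck report -a | fsck_summary`.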

Trying to repair all these issues automagically:

[root@cmssrv222 uscms]# eos fsck repair --all

#3 Updated by Gerard Bernabeu Altayo over 4 years ago

The fsck repair didn't really fix anything on cmseos6, so I'm trying to trigger the drain one last time:

[root@cmssrv222 uscms]# eos fs config 77 configstatus=drain
[root@cmssrv222 uscms]# eos fs config 79 configstatus=drain
[root@cmssrv222 uscms]# eos fs ls | grep cmseos6.fnal
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted drain prepare online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted drain prepare online
[root@cmssrv222 uscms]#

After a few minutes the migration is stalling:

[root@cmssrv222 uscms]# eos fs ls | grep cmseos6.fnal
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted drain stalling online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted drain stalling online
[root@cmssrv222 uscms]#

The same files are affected:

[root@cmssrv222 uscms]# eos fs dumpmd 77 -path
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
[root@cmssrv222 uscms]# eos fs dumpmd 79 -path
path=/eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe
[root@cmssrv222 uscms]#

[root@cmssrv222 uscms]# eos fileinfo /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
File: '/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe' Flags: 0000
Size: 2631870808
Modify: Sun Oct 26 00:50:04 2014 Timestamp: 1414302604.738941000
Change: Sat Oct 25 23:48:36 2014 Timestamp: 1414298916.638898000
CUid: 3373 CGid: 5063 Fxid: 03e1995c Fid: 65116508 Pid: 1209091 Pxid: 00127303
XStype: adler XS: f2 51 36 52 ETAG: 17479579518107648:f2513652
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 77 cmseos6.fnal.gov default.1 /storage/data1 booted drain stalling online
1 66 cmseos5.fnal.gov default.1 /storage/data2 booted rw nodrain online *
[root@cmssrv222 uscms]# eos fileinfo /eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
File: '/eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log' Flags: 0000
Size: 834
Modify: Wed Oct 22 10:01:24 2014 Timestamp: 1413990084.549657000
Change: Wed Aug 6 16:56:22 2014 Timestamp: 1407362182.688704000
CUid: 46497 CGid: 5063 Fxid: 0348ae89 Fid: 55094921 Pid: 1032653 Pxid: 000fc1cd
XStype: adler XS: 00 00 00 01 ETAG: 14789430241918976:00000001
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 79 cmseos6.fnal.gov default.3 /storage/data3 booted drain stalling online
1 106 cmseos26.fnal.gov default.3 /storage/data2 booted rw nodrain online *
[root@cmssrv222 uscms]# eos fileinfo /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe
File: '/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe' Flags: 0000
Size: 3159755379
Modify: Sun Oct 26 00:36:57 2014 Timestamp: 1414301817.245975000
Change: Sat Oct 25 23:45:57 2014 Timestamp: 1414298757.715018000
CUid: 3373 CGid: 5063 Fxid: 03e198c9 Fid: 65116361 Pid: 1209091 Pxid: 00127303
XStype: adler XS: eb bd 36 25 ETAG: 17479540058095616:ebbd3625
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 79 cmseos6.fnal.gov default.3 /storage/data3 booted drain stalling online
1 120 cmseos31.fnal.gov default.3 /storage/data2 booted rw nodrain online *
[root@cmssrv222 uscms]# eos fileinfo /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe
File: '/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe' Flags: 0000
Size: 2622653880
Modify: Sun Oct 26 00:36:39 2014 Timestamp: 1414301799.572047000
Change: Sat Oct 25 23:48:36 2014 Timestamp: 1414298916.291785000
CUid: 3373 CGid: 5063 Fxid: 03e1995a Fid: 65116506 Pid: 1209091 Pxid: 00127303
XStype: adler XS: f2 c7 36 56 ETAG: 17479578981236736:f2c73656
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 113 cmseos29.fnal.gov default.3 /storage/data2 booted rw nodrain online
1 79 cmseos6.fnal.gov default.3 /storage/data3 booted drain stalling online *
[root@cmssrv222 uscms]#

BTW, fsid 78 is empty:

[root@cmssrv222 uscms]# eos fs dumpmd 78 -path
[root@cmssrv222 uscms]#

#4 Updated by Gerard Bernabeu Altayo about 4 years ago

OK, going back to this:

[root@cmssrv222 ~]# eos fs ls | grep cmseos6
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted drain expired online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted drain expired online
cmseos60.fnal.gov (1095) 186 /storage/data1 default.0 booted rw nodrain online
cmseos60.fnal.gov (1095) 187 /storage/data2 default.1 booted rw nodrain online
cmseos61.fnal.gov (1095) 188 /storage/data1 default.2 booted rw nodrain online
cmseos61.fnal.gov (1095) 189 /storage/data2 default.3 booted rw nodrain online
cmseos62.fnal.gov (1095) 190 /storage/data1 default.0 booted rw nodrain online
cmseos62.fnal.gov (1095) 191 /storage/data2 default.1 booted rw nodrain online
cmseos63.fnal.gov (1095) 192 /storage/data1 default.2 booted rw nodrain online
cmseos63.fnal.gov (1095) 193 /storage/data2 default.3 booted rw nodrain online
[root@cmssrv222 ~]# eos fs dumpmd 77 -path
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
[root@cmssrv222 ~]# eos fs dumpmd 79 -path
path=/eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe
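
A side note on the listing above: `grep cmseos6` is a substring match, so it also picks up cmseos60 through cmseos63. A possible way to avoid that (a sketch, not what was run here) is to compare the host column exactly. The helper name `fs_for_host` is hypothetical.

```shell
#!/bin/sh
# Keep only the `eos fs ls` rows (read from stdin) whose first column,
# the host name, equals $1 exactly. Unlike a plain grep, this does not
# match hosts that merely start with the same prefix.
fs_for_host() {
    awk -v host="$1" '$1 == host'
}
```

Usage would be `eos fs ls | fs_for_host cmseos6.fnal.gov`.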

I'll try to move the replica from fsid 77:

[root@cmssrv222 ~]# eos file info /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
File: '/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe' Flags: 0000
Size: 2631870808
Modify: Sun Oct 26 00:50:04 2014 Timestamp: 1414302604.738941000
Change: Sat Oct 25 23:48:36 2014 Timestamp: 1414298916.638898000
CUid: 3373 CGid: 5063 Fxid: 03e1995c Fid: 65116508 Pid: 1209091 Pxid: 00127303
XStype: adler XS: f2 51 36 52 ETAG: 17479579518107648:f2513652
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 77 cmseos6.fnal.gov default.1 /storage/data1 booted drain expired online
1 66 cmseos5.fnal.gov default.1 /storage/data2 booted rw nodrain online *
[root@cmssrv222 ~]# eos file verify /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
success: sending verify to fsid= 77 for path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
success: sending verify to fsid= 66 for path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
[root@cmssrv222 ~]# eos file check /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
path="/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe" fid="03e1995c" size="2631870808" nrep="2" checksumtype="adler" checksum="f251365200000000000000000000000000000000"
nrep="00" fsid="77" host="cmseos6.fnal.gov:1095" fstpath="/storage/data1/0000196f/03e1995c" size="2631870808" statsize="2631870808" checksum="f251365200000000000000000000000000000000"
nrep="01" fsid="66" host="cmseos5.fnal.gov:1095" fstpath="/storage/data2/0000196f/03e1995c" size="2631870808" statsize="2631870808" checksum="f251365200000000000000000000000000000000"
[root@cmssrv222 ~]# eos file move /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe 77 190
success: scheduled move from source fs=77 => target fs=190

Looks like it's actually corrupted:

[root@cmseos6 ~]# tail /var/log/eos/fst/xrdlog.fst
150717 14:12:37 time=1437160357.500193 func=ThreadProc level=NOTE logid=a3087c16-1a8e-11e5-a971-0030485a1628 unit=:1095 tid=00007fafe24fc700 source=ScanDir:588 tident=<service> sec= uid=0 gid=0 name= geo="" Directory: /storage/data1, files=3661 scanduration=0.36 [s] scansize=0 [Bytes] [ 0 MB ] scannedfiles=0 corruptedfiles=0 nochecksumfiles=0 skippedfiles=3661
150717 14:28:52 28412 XrootdXeq: root.18021:36@cmssrv222 login as root
150717 14:28:52 28412 XrootdXeq: root.18021:36@cmssrv222 disc 0:00:00
150717 14:29:29 28393 XrootdXeq: daemon.24251:30@cmseos62 login as daemon
150717 14:29:29 time=1437161369.456538 func=open level=WARN logid=249e5132-2cba-11e5-87a7-0030485a1628 unit=:1095 tid=00007faff09ed700 source=XrdFstOfsFile:709 tident=<service> sec=sss uid=0 gid=0 name=daemon geo="" removing creation flag because of 0 2
150717 14:29:29 time=1437161369.456818 func=stat level=NOTE logid=249e5132-2cba-11e5-87a7-0030485a1628 unit=:1095 tid=00007faff09ed700 source=XrdFstOfsFile:2981 tident=daemon.24251:30@cmseos62 sec= uid=1 gid=1 name=nobody geo="" path=/replicate:03e1995c inode=65116508
150717 14:30:00 28393 FstOfs_read: daemon.24251:30@cmseos62 Unable to read file - wrong file checksum fn= /storage/data1/0000196f/03e1995c; input/output error
150717 14:30:00 28393 XrootdXeq: daemon.24251:30@cmseos62 disc 0:00:31
150717 14:30:00 28393 FstOfs_close: ? Unable to verify checksum - checksum error for file fn= /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe; input/output error
150717 14:30:00 time=1437161400.468778 func=close level=CRIT logid=249e5132-2cba-11e5-87a7-0030485a1628 unit=:1095 tid=00007faff09ed700 source=XrdFstOfsFile:2236 tident=daemon.24251:30@cmseos62 sec= uid=1 gid=1 name=nobody geo="" file-xs error file=&mgm.access=read&mgm.lid=1048834&mgm.cid=1209091&mgm.ruid=1&mgm.rgid=1&mgm.uid=1&mgm.gid=1&mgm.path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe&mgm.manager=cmssrv222.fnal.gov:1094&mgm.fid=03e1995c&mgm.sec=sss|eos|eos|-|-|-|-|eos/replication&mgm.drainfsid=77&mgm.localprefix=/storage/data1&mgm.fsid=77&mgm.sourcehostport=cmseos6.fnal.gov:1095&cap.valid=1437164968
[root@cmseos6 ~]#
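
For reference, a sketch of how the affected local files could be pulled out of the FST log instead of reading it by eye. It only recognises the `FstOfs_read ... wrong file checksum fn= <path>;` message format seen above; the function name `checksum_failures` is made up, and other EOS versions may word the message differently.

```shell
#!/bin/sh
# Print the local path of every file for which the FST log (read from stdin)
# reports "Unable to read file - wrong file checksum".
# Assumption: the message format matches the xrdlog.fst lines in this ticket.
checksum_failures() {
    sed -n 's/.*Unable to read file - wrong file checksum fn= *\([^;]*\);.*/\1/p'
}
```

For example: `checksum_failures < /var/log/eos/fst/xrdlog.fst | sort -u`.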

Dropping this replica:

[root@cmssrv222 ~]# eos file drop /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe 77
success: dropped stripe on fs=77
[root@cmssrv222 ~]# eos file info /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
File: '/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe' Flags: 0000
Size: 2631870808
Modify: Sun Oct 26 00:50:04 2014 Timestamp: 1414302604.738941000
Change: Sat Oct 25 23:48:36 2014 Timestamp: 1414298916.638898000
CUid: 3373 CGid: 5063 Fxid: 03e1995c Fid: 65116508 Pid: 1209091 Pxid: 00127303
XStype: adler XS: f2 51 36 52 ETAG: 17479579518107648:f2513652
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 1 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 66 cmseos5.fnal.gov default.1 /storage/data2 booted rw nodrain online *
[root@cmssrv222 ~]# eos file adjustreplica /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_3_unweighted_events.lhe
success: scheduled replication from source fs=66 => target fs=117
[root@cmssrv222 ~]#

#5 Updated by Gerard Bernabeu Altayo about 4 years ago

I applied the same medicine to all the remaining files, and now cmseos6 is empty:

[root@cmssrv222 ~]# eos fs dumpmd 77 -path
[root@cmssrv222 ~]# eos fs dumpmd 79 -path
path=/eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe
path=/eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe
[root@cmssrv222 ~]# eos file info /eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
File: '/eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log' Flags: 0000
Size: 834
Modify: Wed Oct 22 10:01:24 2014 Timestamp: 1413990084.549657000
Change: Wed Aug 6 16:56:22 2014 Timestamp: 1407362182.688704000
CUid: 46497 CGid: 5063 Fxid: 0348ae89 Fid: 55094921 Pid: 1032653 Pxid: 000fc1cd
XStype: adler XS: 00 00 00 01 ETAG: 14789430241918976:00000001
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2 # fs-id #................................................................................................................................... # host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 79 cmseos6.fnal.gov default.3 /storage/data3 booted drain expired online
1 106 cmseos26.fnal.gov default.3 /storage/data2 booted rw nodrain online *
[root@cmssrv222 ~]# eos file drop /eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log 79
success: dropped stripe on fs=79
[root@cmssrv222 ~]# eos file adjustreplica /eos/uscms/store/user/sdurgut/FourMuon/TxtFiles/STDOUTD/gen_reco_167.log
success: scheduled replication from source fs=106 => target fs=169
[root@cmssrv222 ~]# eos file drop /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe 79
success: dropped stripe on fs=79
[root@cmssrv222 ~]# eos file adjustreplica /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt300400_50k_9_unweighted_events.lhe
success: scheduled replication from source fs=120 => target fs=179
[root@cmssrv222 ~]# eos file drop /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe 79
success: dropped stripe on fs=79
[root@cmssrv222 ~]# eos file adjustreplica /eos/uscms/store/user/ntran/BOOST13/hadronizedFiles_JHU/wlwl_pt500600_50k_5_unweighted_events.lhe
success: scheduled replication from source fs=113 => target fs=179
[root@cmssrv222 ~]# eos fs dumpmd 79 -path
[root@cmssrv222 ~]#
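
The drop and adjustreplica pairs above were typed by hand. With more leftover files, something like the following sketch could generate them from the `eos fs dumpmd <fsid> -path` output. It only prints the commands so they can be reviewed first, because dropping a replica is only safe when the other replica is healthy (for the first file that was confirmed with `eos file check`). The name `replica_fix_cmds` is hypothetical, and paths containing single quotes are not handled.

```shell
#!/bin/sh
# Read `eos fs dumpmd <fsid> -path` output on stdin and print (not run) the
# `eos file drop` / `eos file adjustreplica` pair for each listed file.
# $1 = id of the filesystem whose replica should be dropped.
replica_fix_cmds() {
    fsid="$1"
    while IFS= read -r line; do
        case "$line" in
            path=*)
                p=${line#path=}
                printf "eos file drop '%s' %s\n" "$p" "$fsid"
                printf "eos file adjustreplica '%s'\n" "$p"
                ;;
        esac
    done
}
```

Usage would be `eos fs dumpmd 79 -path | replica_fix_cmds 79`, piping the result to `sh` only after checking it.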

I restarted eos on cmseos6 and set its filesystems to RO:

[root@cmssrv222 ~]# eos fs ls | grep cmseos6.f
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted ro nodrain online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted ro nodrain online
[root@cmssrv222 ~]# eos fs dumpmd 77 -path
[root@cmssrv222 ~]# eos fs dumpmd 78 -path
[root@cmssrv222 ~]# eos fs dumpmd 79 -path
[root@cmssrv222 ~]#

I think it's now safe to retire. I've created and applied the procedure https://cmsweb.fnal.gov/bin/view/Storage/EOSOperationalProcedures#Decommission_an_EOS_FST_node

#6 Updated by Gerard Bernabeu Altayo about 4 years ago

  • Status changed from New to Resolved

[root@cmseos6 ~]# for i in $fsids; do
    ssh $mgm eos fs config $i configstatus=drain
done

[root@cmseos6 ~]# ssh $mgm eos fs ls | grep $HOSTNAME
cmseos6.fnal.gov (1095) 77 /storage/data1 default.1 booted empty drained online
cmseos6.fnal.gov (1095) 78 /storage/data2 default.2 booted empty drained online
cmseos6.fnal.gov (1095) 79 /storage/data3 default.3 booted empty drained online
[root@cmseos6 ~]# fsids=`ssh $mgm eos fs ls -m $HOSTNAME | awk '{print $3}' | grep id= | cut -d= -f2`
[root@cmseos6 ~]# for i in $fsids; do
    ssh $mgm eos fs rm $i
done

success: unregistered cmseos6 77 from the FsView
success: unregistered cmseos6 78 from the FsView
success: unregistered cmseos6 79 from the FsView
[root@cmseos6 ~]# ssh $mgm eos node rm ${HOSTNAME}:1095
error: this node was still sending a heartbeat < 5 seconds ago - stop the FST daemon first!
(errc=16) (Device or resource busy)
[root@cmseos6 ~]# ssh $mgm eos vid remove gateway ${HOSTNAME}
success: rm vid [ eos.rgid=0 eos.ruid=0 mgm.cmd=vid mgm.subcmd=rm mgm.vid.cmd=unmap mgm.vid.key=tident:"*@cmseos6.fnal.gov":uid]
success: rm vid [ eos.rgid=0 eos.ruid=0 mgm.cmd=vid mgm.subcmd=rm mgm.vid.cmd=unmap mgm.vid.key=tident:"*@cmseos6.fnal.gov":gid]
[root@cmseos6 ~]# puppet agent --disable 'This node needs to be reshoot'
[root@cmseos6 ~]# service eosd stop
Stopping eosd:
[ OK ]
[root@cmseos6 ~]# service eos stop
Stopping xrootd: fst [ OK ]

[root@cmseos6 ~]# df -h /storage/*
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 11T 98M 11T 1% /storage/data1
/dev/sdd 11T 98M 11T 1% /storage/data2
/dev/sde 11T 109M 11T 1% /storage/data3
[root@cmseos6 ~]# ssh $mgm eos node rm ${HOSTNAME}:1095
success: removed node '/eos/cmseos6.fnal.gov:1095/fst'
[root@cmseos6 ~]#
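
As a recap of the session above, here is a dry-run sketch that prints the same commands in the order that ended up working: `eos node rm` is refused while the FST still sends heartbeats, so the services are stopped first. The function name `decommission_plan` is made up, port 1095 is taken from this ticket, and nothing is executed. Treat it as a checklist rather than a tested procedure.

```shell
#!/bin/sh
# Print the decommission steps from this ticket in working order.
# $1 = FST host name; remaining arguments = filesystem ids on that host.
# The eos commands are meant for the MGM, the service/puppet ones for the FST.
decommission_plan() {
    host="$1"; shift
    for id in "$@"; do echo "eos fs config $id configstatus=drain"; done
    echo "# wait until 'eos fs ls' shows every filesystem as: empty drained"
    for id in "$@"; do echo "eos fs rm $id"; done
    echo "eos vid remove gateway $host"
    echo "puppet agent --disable   # on $host"
    echo "service eosd stop   # on $host"
    echo "service eos stop    # on $host"
    # Only after the FST daemon is stopped does the node removal succeed.
    echo "eos node rm $host:1095"
}
```

Example: `decommission_plan cmseos6.fnal.gov 77 78 79`.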

[root@cmseos6 ~]# chkconfig eos off
[root@cmseos6 ~]# chkconfig --list | grep eos
eos 0:off 1:off 2:off 3:off 4:off 5:off 6:off
eos-gridftp 0:off 1:off 2:on 3:on 4:on 5:on 6:off
eosd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
eosslave 0:off 1:off 2:off 3:on 4:off 5:on 6:off
[root@cmseos6 ~]# chkconfig eos-gridftp off
[root@cmseos6 ~]# chkconfig eosd off
[root@cmseos6 ~]# chkconfig --list | grep eos
eos 0:off 1:off 2:off 3:off 4:off 5:off 6:off
eos-gridftp 0:off 1:off 2:off 3:off 4:off 5:off 6:off
eosd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
eosslave 0:off 1:off 2:off 3:on 4:off 5:on 6:off
[root@cmseos6 ~]#
[root@cmseos6 ~]# poweroff

Broadcast message from
(/dev/pts/0) at 15:24 ...

The system is going down for power off NOW!
[root@cmseos6 ~]# Connection to cmseos6.fnal.gov closed by remote host.
Connection to cmseos6.fnal.gov closed.
-bash-4.1$
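
One thing visible in the last `chkconfig --list` output above: eosslave is still on in runlevels 3 and 5. The ticket doesn't say whether that was intentional. A sketch for spotting such leftovers (the helper name `still_enabled` is hypothetical):

```shell
#!/bin/sh
# Read `chkconfig --list` output on stdin and print the name of every
# service that is still "on" in at least one runlevel.
still_enabled() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-6]:on$/) { print $1; break }
        }
    }'
}
```

Usage would be `chkconfig --list | grep eos | still_enabled`.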

-bash-4.1$ /srv/admin/bin/cis-please-retire cmseos6
you (DCSO) are in charge of removing the host from zabbix and the ENC
Primary Ticket Information
Number: INC000000572646
Summary: Retire host: cmseos6
Status: Assigned
Submitted: 2015-07-17 15:29:16 CDT
Urgency: 4 - Low
Priority: 3 - Medium
Service Type: Server

Requestor Info
Name: Gerard Bernabeu Altayo
Email:
Created By: cd-srv-cms-snow

Assignee Info
Group: ECF-CIS
Name: (none)
Last Modified: 2015-07-17 15:29:18 CDT

User-Provided Description
Please retire host: cmseos6
-bash-4.1$

Changes activated in check_mk; the node was manually removed from Zabbix too.


