Feature #11694
Retire cmseos50-63 (will be added to dCache disk)
Description
As per David Fagan's request:
Gerard,
Based on our brief conversation, the right machines are from cmseos50 up; as you mentioned, one at a time and see what happens.
Please let me know how things go. I'll watch the network; a lot of this will be local to a switch, but I don't know if the space is spread all over or specific to certain nodes.
If I see the network top out I'll let you know; we are pushing, but they aren't bouncing off the ceiling.
thanks,
David.
This is similar to https://cdcvs.fnal.gov/redmine/issues/10763
History
#1 Updated by Gerard Bernabeu Altayo about 5 years ago
Draining all these nodes means removing ~2 PB of disk from EOS: 71 TB per filesystem * 2 filesystems per node * 14 nodes = 1988 TB.
Following the drain documentation from http://eos.readthedocs.org/en/latest/configuration/draining.html.
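For the record, instead of hard-coding the fs id range (`seq 166 193` in the command below), the ids belonging to one FST can be looked up from the monitoring output of `eos fs ls -m`, the same pattern used later in comment #10. A minimal sketch, assuming the MGM is reachable from this shell and that the id field is the 3rd key=value pair as in that comment:

# Hedged sketch: resolve the fs ids of each node to be drained and set them to drain,
# rather than trusting a hand-picked id range.
for host in cmseos50.fnal.gov cmseos51.fnal.gov; do
  fsids=`eos fs ls -m $host | awk '{print $3}' | grep id= | cut -d= -f2`
  for i in $fsids; do
    eos fs config $i configstatus=drain
  done
done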
[root@cmssrv222 ~]# for i in `seq 166 193`; do eos fs config $i configstatus=drain; done
[root@cmssrv222 ~]# eos fs ls
#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos12.fnal.gov (1095) 41 /storage/data3 default.2 booted rw nodrain online
cmseos12.fnal.gov (1095) 42 /storage/data1 default.3 booted rw nodrain online
cmseos12.fnal.gov (1095) 43 /storage/data2 default.1 booted rw nodrain online
cmseos11.fnal.gov (1095) 44 /storage/data1 default.0 booted rw nodrain online
cmseos11.fnal.gov (1095) 45 /storage/data2 default.3 booted rw nodrain online
cmseos11.fnal.gov (1095) 46 /storage/data3 default.2 booted rw nodrain online
cmseos13.fnal.gov (1095) 47 /storage/data1 default.0 booted rw nodrain online
cmseos13.fnal.gov (1095) 48 /storage/data2 default.1 booted rw nodrain online
cmseos13.fnal.gov (1095) 49 /storage/data3 default.2 booted rw nodrain online
cmseos14.fnal.gov (1095) 50 /storage/data1 default.0 booted rw nodrain online
cmseos14.fnal.gov (1095) 51 /storage/data2 default.1 booted rw nodrain online
cmseos14.fnal.gov (1095) 52 /storage/data3 default.3 booted rw nodrain online
cmseos15.fnal.gov (1095) 53 /storage/data1 default.0 booted rw nodrain online
cmseos15.fnal.gov (1095) 54 /storage/data2 default.1 booted rw nodrain online
cmseos15.fnal.gov (1095) 55 /storage/data3 default.3 booted rw nodrain online
cmseos16.fnal.gov (1095) 56 /storage/data1 default.3 booted rw nodrain online
cmseos16.fnal.gov (1095) 57 /storage/data2 default.1 booted rw nodrain online
cmseos16.fnal.gov (1095) 58 /storage/data3 default.2 booted rw nodrain online
cmseos17.fnal.gov (1095) 59 /storage/data1 default.0 booted rw nodrain online
cmseos17.fnal.gov (1095) 60 /storage/data2 default.1 booted rw nodrain online
cmseos17.fnal.gov (1095) 61 /storage/data3 default.2 booted rw nodrain online
cmseos1.fnal.gov (1095) 62 /storage/data1 default.0 booted rw nodrain online
cmseos1.fnal.gov (1095) 63 /storage/data2 default.1 booted rw nodrain online
cmseos1.fnal.gov (1095) 64 /storage/data3 default.2 booted rw nodrain online
cmseos5.fnal.gov (1095) 65 /storage/data1 default.0 booted drain stalling online
cmseos5.fnal.gov (1095) 66 /storage/data2 default.1 booted drain stalling online
cmseos5.fnal.gov (1095) 67 /storage/data3 default.2 booted drain stalling online
cmseos4.fnal.gov (1095) 68 /storage/data1 default.0 booted ro nodrain online
cmseos4.fnal.gov (1095) 69 /storage/data2 default.1 booted ro nodrain online
cmseos4.fnal.gov (1095) 70 /storage/data3 default.2 booted ro nodrain online
cmseos3.fnal.gov (1095) 71 /storage/data1 default.0 booted ro nodrain online
cmseos3.fnal.gov (1095) 72 /storage/data2 default.2 booted ro nodrain online
cmseos3.fnal.gov (1095) 73 /storage/data3 default.1 booted ro nodrain online
cmseos2.fnal.gov (1095) 74 /storage/data1 default.0 booted ro nodrain online
cmseos2.fnal.gov (1095) 75 /storage/data2 default.1 booted ro nodrain online
cmseos2.fnal.gov (1095) 76 /storage/data3 default.3 booted ro nodrain online
cmseos25.fnal.gov (1095) 103 /storage/data1 default.0 booted rw nodrain online
cmseos25.fnal.gov (1095) 104 /storage/data2 default.1 booted rw nodrain online
cmseos26.fnal.gov (1095) 105 /storage/data1 default.2 booted rw nodrain online
cmseos26.fnal.gov (1095) 106 /storage/data2 default.3 booted rw nodrain online
cmseos27.fnal.gov (1095) 107 /storage/data1 default.0 booted rw nodrain online
cmseos27.fnal.gov (1095) 108 /storage/data2 default.1 booted rw nodrain online
cmseos27.fnal.gov (1095) 109 /storage/data3 default.2 booted rw nodrain online
cmseos28.fnal.gov (1095) 110 /storage/data1 default.0 booted rw nodrain online
cmseos28.fnal.gov (1095) 111 /storage/data2 default.1 booted rw nodrain online
cmseos29.fnal.gov (1095) 112 /storage/data1 default.2 booted rw nodrain online
cmseos29.fnal.gov (1095) 113 /storage/data2 default.3 booted rw nodrain online
cmseos30.fnal.gov (1095) 116 /storage/data1 default.0 booted rw nodrain online
cmseos30.fnal.gov (1095) 117 /storage/data2 default.1 booted rw nodrain online
cmseos31.fnal.gov (1095) 119 /storage/data1 default.2 booted rw nodrain online
cmseos31.fnal.gov (1095) 120 /storage/data2 default.3 booted rw nodrain online
cmseos32.fnal.gov (1095) 122 /storage/data1 default.0 booted rw nodrain online
cmseos32.fnal.gov (1095) 123 /storage/data2 default.1 booted rw nodrain online
cmseos33.fnal.gov (1095) 125 /storage/data1 default.2 booted rw nodrain online
cmseos33.fnal.gov (1095) 126 /storage/data2 default.3 booted rw nodrain online
cmseos39.fnal.gov (1095) 134 /storage/data1 default.2 booted rw nodrain online
cmseos39.fnal.gov (1095) 135 /storage/data2 default.3 booted rw nodrain online
cmseos34.fnal.gov (1095) 136 /storage/data1 default.0 booted rw nodrain online
cmseos36.fnal.gov (1095) 137 /storage/data1 default.0 booted rw nodrain online
cmseos36.fnal.gov (1095) 138 /storage/data2 default.1 booted rw nodrain online
cmseos34.fnal.gov (1095) 139 /storage/data2 default.1 booted rw nodrain online
cmseos38.fnal.gov (1095) 140 /storage/data1 default.0 booted rw nodrain online
cmseos38.fnal.gov (1095) 141 /storage/data2 default.1 booted rw nodrain online
cmseos37.fnal.gov (1095) 142 /storage/data1 default.2 booted rw nodrain online
cmseos37.fnal.gov (1095) 143 /storage/data2 default.3 booted rw nodrain online
cmseos35.fnal.gov (1095) 144 /storage/data1 default.2 booted rw nodrain online
cmseos35.fnal.gov (1095) 145 /storage/data2 default.3 booted rw nodrain online
cmseos42.fnal.gov (1095) 146 /storage/data1 default.0 booted rw nodrain online
cmseos42.fnal.gov (1095) 147 /storage/data2 default.1 booted rw nodrain online
cmseos40.fnal.gov (1095) 148 /storage/data1 default.0 booted rw nodrain online
cmseos40.fnal.gov (1095) 149 /storage/data2 default.1 booted rw nodrain online
cmseos41.fnal.gov (1095) 150 /storage/data1 default.2 booted rw nodrain online
cmseos41.fnal.gov (1095) 151 /storage/data2 default.3 booted rw nodrain online
cmseos43.fnal.gov (1095) 152 /storage/data1 default.2 booted rw nodrain online
cmseos43.fnal.gov (1095) 153 /storage/data2 default.3 booted rw nodrain online
cmseos44.fnal.gov (1095) 154 /storage/data1 default.0 booted rw nodrain online
cmseos44.fnal.gov (1095) 155 /storage/data2 default.1 booted rw nodrain online
cmseos45.fnal.gov (1095) 156 /storage/data1 default.2 booted rw nodrain online
cmseos45.fnal.gov (1095) 157 /storage/data2 default.3 booted rw nodrain online
cmseos46.fnal.gov (1095) 158 /storage/data1 default.0 booted rw nodrain online
cmseos46.fnal.gov (1095) 159 /storage/data2 default.1 booted rw nodrain online
cmseos47.fnal.gov (1095) 160 /storage/data1 default.2 booted rw nodrain online
cmseos47.fnal.gov (1095) 161 /storage/data2 default.3 booted rw nodrain online
cmseos49.fnal.gov (1095) 162 /storage/data1 default.2 booted rw nodrain online
cmseos49.fnal.gov (1095) 163 /storage/data2 default.3 booted rw nodrain online
cmseos48.fnal.gov (1095) 164 /storage/data1 default.0 booted rw nodrain online
cmseos48.fnal.gov (1095) 165 /storage/data2 default.1 booted rw nodrain online
cmseos50.fnal.gov (1095) 166 /storage/data1 default.0 booted drain prepare online
cmseos50.fnal.gov (1095) 167 /storage/data2 default.1 booted drain prepare online
cmseos51.fnal.gov (1095) 168 /storage/data1 default.2 booted drain prepare online
cmseos51.fnal.gov (1095) 169 /storage/data2 default.3 booted drain prepare online
cmseos52.fnal.gov (1095) 170 /storage/data1 default.0 booted drain prepare online
cmseos52.fnal.gov (1095) 171 /storage/data2 default.1 booted drain prepare online
cmseos53.fnal.gov (1095) 172 /storage/data1 default.2 booted drain prepare online
cmseos53.fnal.gov (1095) 173 /storage/data2 default.3 booted drain prepare online
cmseos56.fnal.gov (1095) 174 /storage/data1 default.0 booted drain prepare online
cmseos56.fnal.gov (1095) 175 /storage/data2 default.1 booted drain prepare online
cmseos54.fnal.gov (1095) 176 /storage/data1 default.0 booted drain prepare online
cmseos54.fnal.gov (1095) 177 /storage/data2 default.1 booted drain prepare online
cmseos55.fnal.gov (1095) 178 /storage/data1 default.2 booted drain prepare online
cmseos55.fnal.gov (1095) 179 /storage/data2 default.3 booted drain prepare online
cmseos57.fnal.gov (1095) 180 /storage/data1 default.2 booted drain prepare online
cmseos57.fnal.gov (1095) 181 /storage/data2 default.3 booted drain prepare online
cmseos58.fnal.gov (1095) 182 /storage/data1 default.0 booted drain prepare online
cmseos58.fnal.gov (1095) 183 /storage/data2 default.1 booted drain prepare online
cmseos59.fnal.gov (1095) 184 /storage/data1 default.2 booted drain prepare online
cmseos59.fnal.gov (1095) 185 /storage/data2 default.3 booted drain prepare online
cmseos60.fnal.gov (1095) 186 /storage/data1 default.0 booted drain prepare online
cmseos60.fnal.gov (1095) 187 /storage/data2 default.1 booted drain prepare online
cmseos61.fnal.gov (1095) 188 /storage/data1 default.2 booted drain prepare online
cmseos61.fnal.gov (1095) 189 /storage/data2 default.3 booted drain prepare online
cmseos62.fnal.gov (1095) 190 /storage/data1 default.0 booted drain prepare online
cmseos62.fnal.gov (1095) 191 /storage/data2 default.1 booted drain prepare online
cmseos63.fnal.gov (1095) 192 /storage/data1 default.2 booted drain prepare online
cmseos63.fnal.gov (1095) 193 /storage/data2 default.3 booted drain prepare online
[root@cmssrv222 ~]#
#2 Updated by Gerard Bernabeu Altayo about 5 years ago
The drains are not working very well, so I will now set all these disks to Read Only and (re)start the drain for only one of the pools:
[root@cmssrv222 ~]# eos fs ls -d
#....................................................................................................................................
# host (#...) # id # path # drain # progress # files # bytes-left # timeleft #retry #wopen
#....................................................................................................................................
cmseos5.fnal.gov (1095) 65 /storage/data1 expired 0 188.00 284.18 MB 99999999999 0 0
cmseos5.fnal.gov (1095) 66 /storage/data2 expired 0 249.00 3.49 GB 99999999999 0 0
cmseos5.fnal.gov (1095) 67 /storage/data3 expired 0 239.00 2.84 GB 99999999999 0 0
cmseos4.fnal.gov (1095) 68 /storage/data1 expired 0 69.72 k 4.10 TB 99999999999 0 0
cmseos4.fnal.gov (1095) 69 /storage/data2 expired 0 68.17 k 3.98 TB 99999999999 0 0
cmseos4.fnal.gov (1095) 70 /storage/data3 expired 0 74.04 k 4.40 TB 99999999999 0 0
cmseos3.fnal.gov (1095) 71 /storage/data1 expired 0 69.38 k 4.03 TB 99999999999 0 0
cmseos3.fnal.gov (1095) 72 /storage/data2 expired 0 103.29 k 6.27 TB 99999999999 0 0
cmseos3.fnal.gov (1095) 73 /storage/data3 expired 0 65.91 k 4.02 TB 99999999999 0 0
cmseos2.fnal.gov (1095) 74 /storage/data1 expired 0 98.74 k 6.05 TB 99999999999 0 0
cmseos2.fnal.gov (1095) 75 /storage/data2 expired 0 98.90 k 6.06 TB 99999999999 0 0
cmseos2.fnal.gov (1095) 76 /storage/data3 expired 0 316.00 281.15 MB 99999999999 0 0
cmseos50.fnal.gov (1095) 166 /storage/data1 draining 0 427.76 k 34.65 TB 4726 0 6
cmseos50.fnal.gov (1095) 167 /storage/data2 draining 0 420.56 k 34.18 TB 4726 0 2
cmseos51.fnal.gov (1095) 168 /storage/data1 draining 0 438.24 k 35.81 TB 4726 0 5
cmseos51.fnal.gov (1095) 169 /storage/data2 stalling 0 1.00 965.53 MB 4726 0 5
cmseos52.fnal.gov (1095) 170 /storage/data1 stalling 0 437.25 k 34.83 TB 4726 0 4
cmseos52.fnal.gov (1095) 171 /storage/data2 stalling 0 418.86 k 34.22 TB 4725 0 4
cmseos53.fnal.gov (1095) 172 /storage/data1 stalling 0 440.64 k 35.84 TB 4726 0 2
cmseos53.fnal.gov (1095) 173 /storage/data2 stalling 0 467.83 k 37.93 TB 4726 0 2
cmseos56.fnal.gov (1095) 174 /storage/data1 stalling 0 435.39 k 34.92 TB 4725 0 4
cmseos56.fnal.gov (1095) 175 /storage/data2 stalling 0 418.34 k 34.62 TB 4726 0 6
cmseos54.fnal.gov (1095) 176 /storage/data1 stalling 0 427.91 k 35.75 TB 4726 0 6
cmseos54.fnal.gov (1095) 177 /storage/data2 stalling 0 422.56 k 34.38 TB 4727 0 3
cmseos55.fnal.gov (1095) 178 /storage/data1 stalling 0 436.75 k 35.99 TB 4727 0 0
cmseos55.fnal.gov (1095) 179 /storage/data2 stalling 0 462.48 k 37.66 TB 4727 0 0
cmseos57.fnal.gov (1095) 180 /storage/data1 stalling 0 437.23 k 36.10 TB 4727 0 9
cmseos57.fnal.gov (1095) 181 /storage/data2 stalling 0 460.50 k 37.95 TB 4727 0 6
cmseos58.fnal.gov (1095) 182 /storage/data1 stalling 0 431.55 k 35.09 TB 4727 0 4
cmseos58.fnal.gov (1095) 183 /storage/data2 stalling 0 418.25 k 34.73 TB 4727 0 9
cmseos59.fnal.gov (1095) 184 /storage/data1 stalling 0 442.29 k 36.11 TB 4727 0 9
cmseos59.fnal.gov (1095) 185 /storage/data2 stalling 0 466.13 k 37.85 TB 4727 0 4
cmseos60.fnal.gov (1095) 186 /storage/data1 stalling 0 434.35 k 35.36 TB 4727 0 4
cmseos60.fnal.gov (1095) 187 /storage/data2 stalling 0 415.03 k 34.95 TB 4727 0 4
cmseos61.fnal.gov (1095) 188 /storage/data1 stalling 0 434.57 k 35.97 TB 4727 0 8
cmseos61.fnal.gov (1095) 189 /storage/data2 stalling 0 470.36 k 37.45 TB 4728 0 7
cmseos62.fnal.gov (1095) 190 /storage/data1 stalling 0 431.21 k 35.71 TB 4728 0 5
cmseos62.fnal.gov (1095) 191 /storage/data2 stalling 0 414.52 k 34.50 TB 4728 0 1
cmseos63.fnal.gov (1095) 192 /storage/data1 stalling 0 435.64 k 35.83 TB 69876 0 0
cmseos63.fnal.gov (1095) 193 /storage/data2 stalling 0 460.88 k 37.47 TB 69876 0 0
[root@cmssrv222 ~]# for i in `seq 65 76` `seq 166 193`; do eos fs config $i configstatus=ro; done
[root@cmssrv222 ~]# for i in 168; do eos fs config $i configstatus=drain; done
[root@cmssrv222 ~]#
[root@cmssrv222 ~]# eos fs ls -d
#....................................................................................................................................
# host (#...) # id # path # drain # progress # files # bytes-left # timeleft #retry #wopen
#....................................................................................................................................
cmseos51.fnal.gov (1095) 168 /storage/data1 draining 0 438.09 k 35.71 TB 86276 0 5
[root@cmssrv222 ~]#
And the good news is that this seems to be working now! Data is coming down and the box is draining at ~500MB/s!
[root@cmssrv222 ~]# eos fs ls -d
#....................................................................................................................................
# host (#...) # id # path # drain # progress # files # bytes-left # timeleft #retry #wopen
#....................................................................................................................................
cmseos51.fnal.gov (1095) 168 /storage/data1 draining 0 436.43 k 35.60 TB 85785 0 5
[root@cmssrv222 ~]#
My new theory is that drains stall whenever they find files in an error state, which means I'll have to dig out everything that's wrong in the FS:
[root@cmssrv222 ~]# eos fsck stat
160216 15:35:32 1455658532.284256 started check
160216 15:35:32 1455658532.284286 Filesystems to check: 115
160216 15:35:42 1455658542.322971 d_cx_diff : 14 (14)
160216 15:35:42 1455658542.322992 d_mem_sz_diff : 13729 (16516)
160216 15:35:42 1455658542.322997 orphans_n : 2 (2)
160216 15:35:42 1455658542.323000 rep_diff_n : 251 (298)
160216 15:35:42 1455658542.323005 rep_offline : 0 (0)
160216 15:35:42 1455658542.323009 unreg_n : 3 (3)
160216 15:35:42 1455658542.323013 zero_replica : 44 (44)
160216 15:35:43 1455658543.937750 stopping check
160216 15:35:43 1455658543.937786 => next run in 30 minutes
[root@cmssrv222 ~]#
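To go from these per-category counters to the actual files, something like the following should work; this is a hedged sketch, `eos fsck report` and its output format are assumed from the EOS docs rather than copied from this instance, so check `eos fsck --help` on our release first:

# Hedged sketch: dump the fsck report and then inspect one of the flagged files.
eos fsck report > /tmp/fsck_report.txt
# for any path that shows up in the report (placeholder path below):
eos file info /eos/uscms/store/<some-reported-file>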
#3 Updated by Gerard Bernabeu Altayo about 5 years ago
I've noticed that the migration speed with a single FST was gradually decreasing. Since there are so many small files, I've increased the drain parallelism per FST from 2 to 20 transfers at a time; this boosted the transfer bandwidth back up :)
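For reference, a hedged sketch of how that knob can be turned; the parameter name (space.drainer.node.ntx) is my reading of the EOS draining documentation linked above, not copied from our configuration, so verify it against `eos space status default` before applying:

# Hedged sketch: raise the number of parallel drain transfers per FST for the
# 'default' space. Parameter name assumed from the EOS drain docs.
eos space config default space.drainer.node.ntx=20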
#4 Updated by Gerard Bernabeu Altayo about 5 years ago
With 20, the load got very high on many FSTs, so I lowered it to 5...
#5 Updated by Gerard Bernabeu Altayo about 5 years ago
The node is almost drained:
[root@cmssrv222 ~]# eos fs ls -d
#....................................................................................................................................
# host (#...) # id # path # drain # progress # files # bytes-left # timeleft #retry #wopen
#....................................................................................................................................
cmseos51.fnal.gov (1095) 168 /storage/data1 draining 99 98.00 2.50 GB 22028 0 5
[root@cmssrv222 ~]#
Starting another node migration:
[root@cmssrv222 ~]# for i in 166 167 169; do eos fs config $i configstatus=drain; done
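Not part of the procedure, just a convenience for keeping an eye on the drain between checks; a minimal loop around the same `eos fs ls -d` view shown above:

# Simple monitoring loop: snapshot the drain view every 10 minutes and keep a log,
# so the progress / bytes-left columns can be compared over time.
while true; do
  date
  eos fs ls -d
  sleep 600
done | tee -a /tmp/drain-progress.log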
#6 Updated by Gerard Bernabeu Altayo about 5 years ago
The drain is working pretty well going FST by FST (it will just take a while). Starting a new one, and this afternoon I'll retire 2 nodes:
[root@cmssrv222 ~]# eos fs ls -d
#....................................................................................................................................
# host (#...) # id # path # drain # progress # files # bytes-left # timeleft #retry #wopen
#....................................................................................................................................
cmseos50.fnal.gov (1095) 166 /storage/data1 expired 99 21.00 396.48 MB 99999999999 0 6
cmseos50.fnal.gov (1095) 167 /storage/data2 expired 99 6.00 1.75 GB 99999999999 0 2
cmseos51.fnal.gov (1095) 168 /storage/data1 expired 99 30.00 596.66 MB 99999999999 0 5
cmseos51.fnal.gov (1095) 169 /storage/data2 expired 0 1.00 965.53 MB 99999999999 0 5
[root@cmssrv222 ~]# for i in 170 171; do eos fs config $i configstatus=drain; done
[root@cmssrv222 ~]#
In order to clear the running transfers, I'm restarting cmseos50 & cmseos51; then I'll look at each individual FST to clean them up.
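The comment above does not say exactly how the nodes were restarted (it may have been a full reboot); as a lighter, hedged sketch, a service-level bounce using the same init scripts that appear later in comment #10 would look like this:

# Hedged sketch: restart the EOS services on the two FSTs to clear stuck transfers.
# 'service eos' (the FST xrootd) and 'service eosd' (the FUSE client) are the scripts
# used during decommissioning in comment #10.
for h in cmseos50 cmseos51; do
  ssh root@$h 'service eos restart; service eosd restart'
done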
#7 Updated by Gerard Bernabeu Altayo about 5 years ago
Nodes keep draining; now it is time to look at the few remaining files and solve the individual issues, for example:
[root@cmssrv222 ~]# for i in `eos fs dumpmd 166 -path | cut -d= -f2`; do eos file info $i; done
File: '/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_24/cache/quant-toys:0-wp3000-c04209a4ad.db'
Flags: 0644
Size: 61440
Modify: Wed Oct 14 05:54:21 2015 Timestamp: 1444820061.136974000
Change: Wed Oct 14 05:54:02 2015 Timestamp: 1444820042.80888040
CUid: 47452 CGid: 5063 Fxid: 081c8ce6 Fid: 136088806 Pid: 2434882 Pxid: 00252742
XStype: adler XS: 32 43 c4 88
ETAG: 36531060695105536:3243c488
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2
# fs-id
#...................................................................................................................................
# host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 166 cmseos50.fnal.gov default.0 /storage/data1 booted ro nodrain online
1 136 cmseos34.fnal.gov default.0 /storage/data1 booted rw nodrain online
*******
File: '/eos/uscms/store/user/drankin/Theta_Oct13/Muon_MT_3fb/Muon_MT_3fb_16/cache/nll_der-toys:0.0-wp3000-04f3872ac8.db'
Flags: 0644
Size: 693248
Modify: Wed Oct 14 05:55:31 2015 Timestamp: 1444820131.970225000
Change: Wed Oct 14 05:55:28 2015 Timestamp: 1444820128.408139999
CUid: 47452 CGid: 5063 Fxid: 081ca568 Fid: 136095080 Pid: 2435048 Pxid: 002527e8
XStype: adler XS: 18 26 2b 73
ETAG: 36532744859156480:18262b73
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2
# fs-id
#...................................................................................................................................
# host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 166 cmseos50.fnal.gov default.0 /storage/data1 booted ro nodrain online
1 158 cmseos46.fnal.gov default.0 /storage/data1 booted rw nodrain online
*******
.... <OUTPUT CUT>
#8 Updated by Gerard Bernabeu Altayo about 5 years ago
I've managed to empty one of the FSTs:
[root@cmssrv222 ~]# for i in `eos fs dumpmd 166 -path | cut -d= -f2`; do eos file move $i 166 45; done
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
success: scheduled move from source fs=166 => target fs=45
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Muon_ML_3fb/Muon_ML_3fb_21/nll_der-toys:0.0-wp2000-ee81029424.cfg
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_MT_3fb/electron_ge1btags_IDMTJet250Lep150MET150_discovery.json
path=/eos/uscms/store/user/drankin/Theta_Oct13/Muon_MT_3fb/Muon_MT_3fb_20/model_summary_general.thtml
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_MT_3fb/electron_ge1btags_IDMTJet250Lep150MET150_discovery.json
path=/eos/uscms/store/user/drankin/Theta_Oct13/Muon_MT_3fb/Muon_MT_3fb_20/model_summary_general.thtml
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Muon_MT_3fb/Muon_MT_3fb_20/model_summary_general.thtml
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Muon_MT_3fb/Muon_MT_3fb_20/model_summary_general.thtml
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Muon_MT_3fb/Muon_MT_3fb_20/model_summary_general.thtml
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
path=/eos/uscms/store/user/drankin/Theta_Oct13/Elec_ML_3fb/Elec_ML_3fb_16/quant-toys:0-wp3000-4160ae2cb8.cfg
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
[root@cmssrv222 ~]# eos fs dumpmd 166 -path
[root@cmssrv222 ~]#
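Rather than re-running dumpmd by hand as above, the same move-and-check cycle can be scripted; a minimal sketch using only the `eos fs dumpmd` and `eos file move` commands already shown in this comment:

# Hedged sketch: keep scheduling moves off fs 166 (to fs 45) until dumpmd reports
# no files left on the source filesystem.
while [ -n "`eos fs dumpmd 166 -path`" ]; do
  for f in `eos fs dumpmd 166 -path | cut -d= -f2`; do
    eos file move "$f" 166 45
  done
  sleep 60   # give the scheduled moves time to complete before re-checking
done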
#9 Updated by Gerard Bernabeu Altayo about 5 years ago
Now just following https://cmsweb.fnal.gov/bin/view/Storage/EOSOperationalProcedures#Decommission_an_EOS_FST_node
Will do cmseos50 first:
-bash-4.1$ /srv/admin/bin/cis-please-retire cmseos50
you (DCSO) are in charge of removing the host from zabbix and the ENC
Primary Ticket Information
Number: INC000000669206
Summary: Retire host: cmseos50
Status: Assigned
Submitted: 2016-02-24 14:59:48 CST
Urgency: 4 - Low
Priority: 3 - Medium
Service Type: Server
Requestor Info
Name: Gerard Bernabeu Altayo
Email: gerard1@fnal.gov
Created By: cd-srv-cms-snow
Assignee Info
Group: ECF-CIS
Name: (none)
Last Modified: 2016-02-24 14:59:53 CST
User-Provided Description
Please retire host: cmseos50
-bash-4.1$
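The FST-side steps for cmseos50 were not pasted here; they are the same sequence shown in full for cmseos51 in the next comment. As a condensed, hedged summary (run on the FST being retired, with $mgm pointing at the MGM, cmssrv222), the wiki page above and comment #10 remain the authoritative references:

# Hedged summary of the per-node retirement steps (see comment #10 for a real run).
fsids=`ssh $mgm eos fs ls -m $HOSTNAME | awk '{print $3}' | grep id= | cut -d= -f2`
for i in $fsids; do ssh $mgm eos fs config $i configstatus=empty; done   # only once the drain has finished
for i in $fsids; do ssh $mgm eos fs rm $i; done                          # fs rm only works on 'empty' filesystems
ssh $mgm eos vid remove gateway ${HOSTNAME}
puppet agent --disable 'node is being retired'
service eosd stop; service eos stop
chkconfig eos off; chkconfig eosd off; chkconfig eos-gridftp off
ssh $mgm eos node rm ${HOSTNAME}:1095
poweroff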
#10 Updated by Gerard Bernabeu Altayo about 5 years ago
Now doing cmseos51, it has some 'sticky' files:
[root@cmssrv222 ~]# for i in `eos fs dumpmd 168 -path | cut -d= -f2`; do eos file move $i 168 46; done
success: scheduled move from source fs=168 => target fs=46
[root@cmssrv222 ~]# eos file info /eos/uscms/store/user/kreis/Hbb/20150608_wide_fa3zz_nohup/log_Wh_CMS_float_observed_minuit_0p01_tries10_strategy2.log
File: '/eos/uscms/store/user/kreis/Hbb/20150608_wide_fa3zz_nohup/log_Wh_CMS_float_observed_minuit_0p01_tries10_strategy2.log'
Flags: 0644
Size: 6602829
Modify: Tue Jun 9 14:42:16 2015 Timestamp: 1433878936.112778000
Change: Tue Jun 9 13:58:57 2015 Timestamp: 1433876337.474202243
CUid: 44456 CGid: 5063 Fxid: 0608092a Fid: 101189930 Pid: 1895919 Pxid: 001cedef
XStype: adler XS: 4e 20 7e 0d
ETAG: 27162965002158080:4e207e0d
replica Stripes: 2 Blocksize: 4k LayoutId: 00100112
#Rep: 2
# fs-id
#...................................................................................................................................
# host # schedgroup # path # boot # configstatus # drain # active # geotag
#...................................................................................................................................
0 61 cmseos17.fnal.gov default.2 /storage/data3 booted rw nodrain online
1 168 cmseos51.fnal.gov default.2 /storage/data1 booted ro nodrain online
*******
[root@cmssrv222 ~]# eos file drop /eos/uscms/store/user/kreis/Hbb/20150608_wide_fa3zz_nohup/log_Wh_CMS_float_observed_minuit_0p01_tries10_strategy2.log 168
success: dropped stripe on fs=168
[root@cmssrv222 ~]#
[root@cmseos51 ~]# for i in $fsids; do ssh $mgm eos fs dumpmd $i -path; done
path=/eos/uscms/store/user/kreis/Hbb/20150608_wide_fa3zz_nohup/log_Wh_CMS_float_observed_minuit_0p01_tries10_strategy2.log
[root@cmseos51 ~]# for i in $fsids; do ssh $mgm eos fs dumpmd $i -path; done
[root@cmseos51 ~]#
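For the record, before dropping a stripe like the one above it is worth confirming that the file still has its other replica on a healthy, non-draining filesystem; a minimal check based only on the `eos file info` output shown above:

# Hedged check: confirm the file has 2 replicas and that one of them lives outside
# the node being drained, before dropping the local stripe.
F=/eos/uscms/store/user/kreis/Hbb/20150608_wide_fa3zz_nohup/log_Wh_CMS_float_observed_minuit_0p01_tries10_strategy2.log
eos file info $F | grep -E '#Rep:|fnal.gov'
# only then:
# eos file drop $F 168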
So now I can move forward with standard procedure:
[root@cmseos51 ~]# fsids=`ssh $mgm eos fs ls -m $HOSTNAME | awk '{print $3}' | grep id= | cut -d= -f2`
[root@cmseos51 ~]# for i in $fsids; do
> ssh $mgm eos fs rm $i
> done
error: you can only remove file systems which are in 'empty' status (errc=22) (Invalid argument)
error: you can only remove file systems which are in 'empty' status (errc=22) (Invalid argument)
[root@cmseos51 ~]# ssh $mgm eos vid remove gateway ${HOSTNAME}
success: rm vid [ eos.rgid=0 eos.ruid=0 mgm.cmd=vid mgm.subcmd=rm mgm.vid.cmd=unmap mgm.vid.key=tident:"*@cmseos51.fnal.gov":uid]
success: rm vid [ eos.rgid=0 eos.ruid=0 mgm.cmd=vid mgm.subcmd=rm mgm.vid.cmd=unmap mgm.vid.key=tident:"*@cmseos51.fnal.gov":gid]
[root@cmseos51 ~]# puppet agent --disable 'This node needs to be reshoot'
[root@cmseos51 ~]# service eosd stop
Stopping eosd: [ OK ]
[root@cmseos51 ~]# service eos stop
Stopping xrootd: fst [ OK ]
[root@cmseos51 ~]# chkconfig eos off
[root@cmseos51 ~]# chkconfig eosd off
[root@cmseos51 ~]# chkconfig eos-gridftp off
[root@cmseos51 ~]# sleep 5
[root@cmseos51 ~]# ssh $mgm eos node rm ${HOSTNAME}:1095
error: unable to remove node '/eos/cmseos51.fnal.gov:1095/fst' - filesystems are not all in empty state - try to drain them or: node config <name> configstatus=empty (errc=16) (Device or resource busy)
[root@cmseos51 ~]# fsids=`ssh $mgm eos fs ls -m $HOSTNAME | awk '{print $3}' | grep id= | cut -d= -f2`
[root@cmseos51 ~]# for i in $fsids; do
> ssh $mgm eos fs config $i configstatus=empty
> done
[root@cmseos51 ~]# ssh $mgm eos node rm ${HOSTNAME}:1095
success: removed node '/eos/cmseos51.fnal.gov:1095/fst'
[root@cmseos51 ~]#
[root@cmseos51 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 40G 6.1G 32G 17% /
tmpfs 32G 0 32G 0% /dev/shm
/dev/sda1 1008M 138M 820M 15% /boot
/dev/sda5 857G 72M 814G 1% /storage/local/data1
/dev/sdb 71T 169M 71T 1% /storage/data1
/dev/sdc 71T 213M 71T 1% /storage/data2
[root@cmseos51 ~]# poweroff
Broadcast message from root@cmseos51.fnal.gov (/dev/pts/0) at 15:09 ...
The system is going down for power off NOW!
[root@cmseos51 ~]# Connection to cmseos51 closed by remote host.
Connection to cmseos51 closed.
-bash-4.1$
-bash-4.1$ /srv/admin/bin/cis-please-retire cmseos51
you (DCSO) are in charge of removing the host from zabbix and the ENC
Primary Ticket Information
Number: INC000000669215
Summary: Retire host: cmseos51
Status: Assigned
Submitted: 2016-02-24 15:10:51 CST
Urgency: 4 - Low
Priority: 3 - Medium
Service Type: Server
Requestor Info
Name: Gerard Bernabeu Altayo
Email: gerard1@fnal.gov
Created By: cd-srv-cms-snow
Assignee Info
Group: ECF-CIS
Name: (none)
Last Modified: 2016-02-24 15:10:56 CST
User-Provided Description
Please retire host: cmseos51
-bash-4.1$
comp-4:hosts gerard1$ HOST=cmseos51; git rm $HOST.fnal.gov.yaml; git commit -m "retiring $HOST"; git push
rm 'hosts/cmseos51.fnal.gov.yaml'
[master 228e747] retiring cmseos51
 1 file changed, 12 deletions(-)
 delete mode 100644 hosts/cmseos51.fnal.gov.yaml
Counting objects: 7, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 296 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
remote:
remote: diff-tree:
remote: :100644 000000 23e5d116ff7eff80ed1f02a4204375ed5824b282 0000000000000000000000000000000000000000 D hosts/cmseos51.fnal.gov.yaml
remote: fatal: Path 'hosts/cmseos51.fnal.gov.yaml' does not exist in '228e7477063198cc7ab6764e1e92c4bcc35c0886'
remote: omd-host-crud update cmseos51 --role UNSET --instance UNSET --extra 'UNSET'
remote: cmseos51 - host updated, will now inventory...
remote: cmseos51 - unknown error
remote: omd-host-crud delete cmseos51
remote: cmseos51 - host deleted
remote: Recieved from stdin:
remote: oldrev: 218f475579d7028a4a2808ce0064c1ca708b7a3b
remote: newrev: 228e7477063198cc7ab6764e1e92c4bcc35c0886
remote: refname: refs/heads/master
remote: Derived Configuration:
remote: REPO: puppet@cms-git:/var/lib/puppet/enc.git
remote: BRANCH: master
remote: BRANCH_DIR: /srv/puppet/enc
remote: PUPPET_SERVERS: puppet@cmssrv166.fnal.gov puppet@cmspuppet2.fnal.gov puppet@cmspuppet1.fnal.gov
remote: Updating remote branch /srv/puppet/enc/master on puppet@cmssrv166.fnal.gov
remote: From cms-git:/var/lib/puppet/enc
remote: * branch master -> FETCH_HEAD
remote: Updating 218f475..228e747
remote: Fast-forward
remote: hosts/cmseos51.fnal.gov.yaml | 12 ------------
remote: 1 files changed, 0 insertions(+), 12 deletions(-)
remote: delete mode 100644 hosts/cmseos51.fnal.gov.yaml
remote: Updating remote branch /srv/puppet/enc/master on puppet@cmspuppet2.fnal.gov
remote: From cms-git:/var/lib/puppet/enc
remote: * branch master -> FETCH_HEAD
remote: Updating 218f475..228e747
remote: Fast-forward
remote: hosts/cmseos51.fnal.gov.yaml | 12 ------------
remote: 1 files changed, 0 insertions(+), 12 deletions(-)
remote: delete mode 100644 hosts/cmseos51.fnal.gov.yaml
remote: Updating remote branch /srv/puppet/enc/master on puppet@cmspuppet1.fnal.gov
remote: From cms-git:/var/lib/puppet/enc
remote: * branch master -> FETCH_HEAD
remote: Updating 218f475..228e747
remote: Fast-forward
remote: hosts/cmseos51.fnal.gov.yaml | 12 ------------
remote: 1 files changed, 0 insertions(+), 12 deletions(-)
remote: delete mode 100644 hosts/cmseos51.fnal.gov.yaml
To puppet@cms-git.fnal.gov:enc
218f475..228e747 master -> master
comp-4:hosts gerard1$
#11 Updated by Gerard Bernabeu Altayo about 5 years ago
cmseos52 retired too.
#12 Updated by Gerard Bernabeu Altayo about 5 years ago
cmseos53 retired, cmseos56 started drain :)
#13 Updated by Gerard Bernabeu Altayo about 5 years ago
cmseos56 retired :)
All the others keep draining...
#14 Updated by Gerard Bernabeu Altayo about 5 years ago
only 2 more nodes to go :D
[root@cmssrv222 ~]# eos fs ls -d
#....................................................................................................................................
# host (#...) # id # path # drain # progress # files # bytes-left # timeleft #retry #wopen
#....................................................................................................................................
cmseos61.fnal.gov (1095) 188 /storage/data1 drained 100 0.00 0.00 B 0 0 0
cmseos61.fnal.gov (1095) 189 /storage/data2 draining 85 59.71 k 5.67 TB 587624 0 0
cmseos63.fnal.gov (1095) 192 /storage/data1 draining 25 283.88 k 26.64 TB 587624 0 0
cmseos63.fnal.gov (1095) 193 /storage/data2 stalling 2 391.70 k 35.78 TB 587625 0 0
[root@cmssrv222 ~]#
#15 Updated by Gerard Bernabeu Altayo about 5 years ago
- Status changed from New to Resolved
All nodes have been retired!