Bug #10913

Run yum update on EOS in preparation for Monday 16/11/2015 downtime

Added by Gerard Bernabeu Altayo almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Start date:
11/11/2015
Due date:
% Done:

0%

Estimated time:
Duration:

Description

I've already done this in the test instance; yum update --exclude=xrootd* works there. I'll run it on Nov 12th in the morning.

I need to assemble the full host list (from the farmlets) and run a pssh across it. Then on Monday just perform reboots across the board. I'll be in a meeting, so this should be done in a very automated fashion that only requires a couple of commands to execute and verify.
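A possible sketch of that automation (a hypothetical wrapper, not the commands actually used; the pssh flags mirror the ones used later in this ticket, and yum check-update exits 100 on a host that still has pending updates):

```shell
# Dry-run sketch (hypothetical, not from the ticket verbatim): build the pssh
# commands that would update and then verify every farmlet host.
# Run each printed command by hand (or eval it) to execute for real.
HOSTFILE=/tmp/eosfsts.gba   # host list built from the farmlets
update_cmd="pssh -l root -p50 --hosts=$HOSTFILE 'yum -y update --exclude=xrootd*'"
# yum check-update exits 100 on a host with pending updates, 0 when current
verify_cmd="pssh -l root -p50 --hosts=$HOSTFILE 'yum check-update --exclude=xrootd*'"
echo "$update_cmd"
echo "$verify_cmd"
```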

History

#1 Updated by Gerard Bernabeu Altayo almost 4 years ago

-bash-4.1$ grep cmseos /usr/local/etc/farmlets/all > /tmp/eosfsts.gba

-bash-4.1$ pssh -l root -p50 --hosts=/tmp/eosfsts.gba -o eosfsts.out -e eosfsts.err 'yum -y update --exclude=xrootd*'

Manually updating cmssrv222 and cmssrv238 with the same command.

Updates successful. On Monday I just need to run the following (according to the procedure in https://cmsweb.fnal.gov/bin/view/Storage/EOSOperationalProcedures#Fast):

ssh cmsadmin1.fnal.gov
CLUSTER=cmssrv222.fnal.gov
scp root@cmssrv222.fnal.gov:/opt/dcso/eos/reboot_or_update_eos_fst.sh.puppet eos/
chmod +x eos/reboot_or_update_eos_fst.sh.puppet 
HOSTLIST=`ssh -t -oBatchMode=yes -oConnectTimeout=5 -oStrictHostKeyChecking=no -q -l root "${CLUSTER}" "eos -b node ls -m" | tr ' :' '\n' | grep hostport | tr '=' ' ' | awk '{print $2}'`
for h in $HOSTLIST; do
      LOG=`mktemp /tmp/${CLUSTER}_upgrade.XXX`
      nohup eos/reboot_or_update_eos_fst.sh.puppet  --force-reboot $CLUSTER $h > $LOG &
      sleep 1
done
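The HOSTLIST pipeline above can be illustrated in isolation: eos node ls -m emits one node per line as key=value pairs, and the hostport=host:port pair carries the FST hostname (the sample line below is invented for illustration; real output has many more fields):

```shell
# Breakdown of the HOSTLIST extraction pipeline.
# tr ' :' '\n'  -> one token per line (also splits host:port)
# grep hostport -> keeps the hostport=... token
# tr '=' ' '    -> "hostport cmseos1.fnal.gov"
# awk           -> prints the hostname field
sample='type=nodesview hostport=cmseos1.fnal.gov:1095 status=online'
host=$(echo "$sample" | tr ' :' '\n' | grep hostport | tr '=' ' ' | awk '{print $2}')
echo "$host"   # cmseos1.fnal.gov
```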

#2 Updated by Gerard Bernabeu Altayo almost 4 years ago

Started:

cmsadmin1.fnal.gov - bastion/production (SLF 6.7)
32-core Opteron 6320 (H8QG6); 62.89 GB RAM, 16.00 GB swap
-bash-4.1$ date
Mon Nov 16 09:37:50 CST 2015
-bash-4.1$ CLUSTER=cmssrv222.fnal.gov
-bash-4.1$ scp root@cmssrv222.fnal.gov:/opt/dcso/eos/reboot_or_update_eos_fst.sh.puppet eos/
reboot_or_update_eos_fst.sh.puppet 100% 7558 7.4KB/s 00:00
-bash-4.1$ chmod +x eos/reboot_or_update_eos_fst.sh.puppet
-bash-4.1$ HOSTLIST=`ssh -t -oBatchMode=yes -oConnectTimeout=5 -oStrictHostKeyChecking=no -q -l root "${CLUSTER}" "eos -b node ls -m" | tr ' :' '\n' | grep hostport | tr '=' ' ' | awk '{print $2}'`
-bash-4.1$ echo $HOSTLIST
cmseos1.fnal.gov cmseos11.fnal.gov cmseos12.fnal.gov cmseos13.fnal.gov cmseos14.fnal.gov cmseos15.fnal.gov cmseos16.fnal.gov cmseos17.fnal.gov cmseos2.fnal.gov cmseos25.fnal.gov cmseos26.fnal.gov cmseos27.fnal.gov cmseos28.fnal.gov cmseos29.fnal.gov cmseos3.fnal.gov cmseos30.fnal.gov cmseos31.fnal.gov cmseos32.fnal.gov cmseos33.fnal.gov cmseos34.fnal.gov cmseos35.fnal.gov cmseos36.fnal.gov cmseos37.fnal.gov cmseos38.fnal.gov cmseos39.fnal.gov cmseos4.fnal.gov cmseos40.fnal.gov cmseos41.fnal.gov cmseos42.fnal.gov cmseos43.fnal.gov cmseos44.fnal.gov cmseos45.fnal.gov cmseos46.fnal.gov cmseos47.fnal.gov cmseos48.fnal.gov cmseos49.fnal.gov cmseos5.fnal.gov cmseos50.fnal.gov cmseos51.fnal.gov cmseos52.fnal.gov cmseos53.fnal.gov cmseos54.fnal.gov cmseos55.fnal.gov cmseos56.fnal.gov cmseos57.fnal.gov cmseos58.fnal.gov cmseos59.fnal.gov cmseos60.fnal.gov cmseos61.fnal.gov cmseos62.fnal.gov cmseos63.fnal.gov
-bash-4.1$ for h in $HOSTLIST; do

LOG=`mktemp /tmp/${CLUSTER}_upgrade.XXX`
nohup eos/reboot_or_update_eos_fst.sh.puppet --force-reboot $CLUSTER $h > $LOG &
sleep 1
done

[1] 1479
nohup: ignoring input and redirecting stderr to stdout
[2] 1494
nohup: ignoring input and redirecting stderr to stdout
... (jobs [3] through [50] elided; each printed the same "nohup: ignoring input and redirecting stderr to stdout" message) ...
[51] 3833
nohup: ignoring input and redirecting stderr to stdout
-bash-4.1$ date
Mon Nov 16 09:40:19 CST 2015

#3 Updated by Gerard Bernabeu Altayo almost 4 years ago

Rebooting the slave node:

[root@cmssrv238 ~]# reboot

Broadcast message from (/dev/pts/0) at 9:42 ...

The system is going down for reboot NOW!
[root@cmssrv238 ~]#

I checked that there was nothing to be updated :)

#4 Updated by Gerard Bernabeu Altayo almost 4 years ago

Some FSTs (like pools) did not come back OK:

[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online' | grep -v 'booting rw '

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos33.fnal.gov (1095) 125 /storage/data1 default.2 booted ro nodrain online
cmseos40.fnal.gov (1095) 148 /storage/data1 default.0 booted ro nodrain online
cmseos57.fnal.gov (1095) 180 /storage/data1 default.2 booted rw nodrain offline
cmseos57.fnal.gov (1095) 181 /storage/data2 default.3 booted ro nodrain offline
[root@cmssrv222 ~]#

Also cmssrv238 had trouble starting (NS would not finish initialize step), I had to restart eossync on cmssrv222 and then restart eos on cmssrv238... Now it's booting.

#5 Updated by Gerard Bernabeu Altayo almost 4 years ago

EOS slave crashed on boot again, opening a JIRA with the stacktrace: https://its.cern.ch/jira/browse/EOS-1303

#6 Updated by Gerard Bernabeu Altayo almost 4 years ago

Fixing the FSTs:

[root@cmssrv222 ~]# eos node set cmseos57.fnal.gov on

[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online' | grep -v 'booting rw '

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos33.fnal.gov (1095) 125 /storage/data1 default.2 booted ro nodrain online
cmseos40.fnal.gov (1095) 148 /storage/data1 default.0 booted ro nodrain online
cmseos57.fnal.gov (1095) 181 /storage/data2 default.3 booted ro nodrain online
[root@cmssrv222 ~]# eos fs config 181 configstatus=rw
[root@cmssrv222 ~]# eos fs config 148 configstatus=rw
[root@cmssrv222 ~]# ssh cmseos33 uptime
11:02:10 up 46 min, 0 users, load average: 0.06, 0.04, 0.03
[root@cmssrv222 ~]# eos fs config 125 configstatus=rw
[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online' | grep -v 'booting rw '
#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
[root@cmssrv222 ~]#

#7 Updated by Gerard Bernabeu Altayo almost 4 years ago

Looks like the slave got the directories file corrupted:

ALL Files 21818486 [booting] (446s)
ALL Directories 81

On cmssrv222 I see:

ALL Files 21818486 [booting] (446s)
ALL Directories 81

The MGM crashed again on the slave... Will remove MGM files on the slave and sync again:

[root@cmssrv222 ~]# puppet agent --disable 'resyncing cmssrv238'
[root@cmssrv222 ~]# service eossync stop

Stopping eossync: [ OK ]
[root@cmssrv222 ~]#

[root@cmssrv238 ~]# service eossync stop

Stopping eossync: [ OK ]
[root@cmssrv238 ~]# service eos stop
Stopping xrootd: mgm [ OK ]
Stopping xrootd: mq [ OK ]
Stopping xrootd: sync [ OK ]
[root@cmssrv238 ~]#
[root@cmssrv238 eos]# pwd
/var/eos
[root@cmssrv238 eos]# mv md md.old; mkdir md; chown daemon.root md; chmod 700 md
[root@cmssrv238 eos]#
[root@cmssrv238 eos]# service eos start sync

Starting xrootd as sync with -n sync -c /etc/xrd.cf.sync -l /var/log/eos/xrdlog.sync -b -Rdaemon
[ OK ]
[root@cmssrv238 eos]#

Stopping eossync: [ OK ]
[root@cmssrv222 ~]# service eossync start
Starting eossync:
FILE 0 => TARGET cmssrv222.fnal.gov:1096 [PASSED]
FILE 0 => TARGET cmssrv238.fnal.gov:1096/usr/bin/dirname: extra operand `&.pid'
Try `/usr/bin/dirname --help' for more information.
[ OK ]
FILE 1 => TARGET cmssrv222.fnal.gov:1096 [PASSED]
FILE 1 => TARGET cmssrv238.fnal.gov:1096/usr/bin/dirname: extra operand `&.pid'
Try `/usr/bin/dirname --help' for more information.
[ OK ]
FILE 2 => TARGET cmssrv222.fnal.gov:1096 [PASSED]
FILE 2 => TARGET cmssrv238.fnal.gov:1096/usr/bin/dirname: extra operand `&.pid'
Try `/usr/bin/dirname --help' for more information.
[ OK ]
CONF => TARGET cmssrv222.fnal.gov:1096
CONF => TARGET cmssrv238.fnal.gov:1096 [PASSED]
/usr/bin/dirname: extra operand `&.pid'
Try `/usr/bin/dirname --help' for more information.
[ OK ]
[root@cmssrv222 ~]#

I see this is syncing now :)

[root@cmssrv238 eos]# ll /var/eos/md
total 1302644
-rw-r--r-- 1 daemon daemon  162946532 Nov 16 11:08 directories.cmssrv222.fnal.gov.mdlog
-rw-r--r-- 1 daemon daemon 1170210816 Nov 16 11:09 files.cmssrv222.fnal.gov.mdlog
-rw-r--r-- 1 daemon daemon     742176 Nov 16 11:09 iostat.cmssrv222.fnal.gov.dump
[root@cmssrv238 eos]#

#8 Updated by Gerard Bernabeu Altayo almost 4 years ago

Sync finished.

Now re-enabling puppet:

[root@cmssrv222 ~]# puppet agent --enable

Starting EOS on cmssrv238 too; now it looks better:

[root@cmssrv238 md]# eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistic
# ------------------------------------------------------------------------------------
ALL Files 16167800 [booting] (51s)
ALL Directories 518555

I've re-enabled puppet too.

BTW, these commands come from http://eos.readthedocs.org/en/latest/configuration/master.html

As soon as the slave is UP I will reboot cmssrv222. There are still some FSTs trying to boot:

[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online'

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos35.fnal.gov (1095) 145 /storage/data2 default.3 booting rw nodrain online
cmseos42.fnal.gov (1095) 146 /storage/data1 default.0 booting rw nodrain online
cmseos40.fnal.gov (1095) 149 /storage/data2 default.1 booting rw nodrain online
cmseos41.fnal.gov (1095) 150 /storage/data1 default.2 booting rw nodrain online
cmseos50.fnal.gov (1095) 167 /storage/data2 default.1 booting rw nodrain online
cmseos56.fnal.gov (1095) 175 /storage/data2 default.1 booting rw nodrain online
cmseos54.fnal.gov (1095) 177 /storage/data2 default.1 booting rw nodrain online
cmseos55.fnal.gov (1095) 179 /storage/data2 default.3 booting rw nodrain online
cmseos61.fnal.gov (1095) 189 /storage/data2 default.3 booting rw nodrain online
[root@cmssrv222 ~]#

#9 Updated by Gerard Bernabeu Altayo almost 4 years ago

A regular reboot of the master server did NOT work, now it believes it's a slave:

=====> mgmofs.alias: cmseos.fnal.gov
151116 11:24:23 time=1447694663.518283 func=Configure level=NOTE logid=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx unit=:1094 tid=00007f54be170740 source=XrdMgmOfsConfigure:1427 tident=<single-exec> sec=(null) uid=0 gid=0 name=(null) geo="" MGM_HOST=cmssrv222.fnal.gov MGM_PORT=1094 VERSION=0.3.127 RELEASE=beryl_aquamarine KEYTABADLER=98706d9e SYMKEY=2PGdUPZuVzkMuZIncdKMW6dmNL4=
config [/config/eosuscmst1prod.fnal.gov/all/] broad cast > [/eos/*]
config [/config/eosuscmst1prod.fnal.gov/fst/] broad cast > [/eos/*/fst]
config [/config/eosuscmst1prod.fnal.gov/mgm/] broad cast > [/eos/*/mgm]
151116 11:24:25 time=1447694665.583374 func=BootNamespace level=ALERT logid=e0debd06-8c86-11e5-b42f-a0369f23d048 unit=:1094 tid=00007f54be170740 source=Master:1861 tident=<service> sec= uid=0 gid=0 name= geo="" msg="running boot sequence (as slave)"
151116 11:24:25 time=1447694665.638528 func=BootNamespace level=NOTE logid=e0debd06-8c86-11e5-b42f-a0369f23d048 unit=:1094 tid=00007f54be170740 source=Master:1920 tident=<service> sec= uid=0 gid=0 name= geo="" eos directory view configure started as slave
151116 11:24:25 time=1447694665.654602 func=BootNamespace level=CRIT logid=e0debd06-8c86-11e5-b42f-a0369f23d048 unit=:1094 tid=00007f54be170740 source=Master:1975 tident=<service> sec= uid=0 gid=0 name= geo="" eos view initialization failed after 0 seconds
151116 11:24:25 time=1447694665.654696 func=BootNamespace level=CRIT logid=e0debd06-8c86-11e5-b42f-a0369f23d048 unit=:1094 tid=00007f54be170740 source=Master:1977 tident=<service> sec= uid=0 gid=0 name= geo="" initialization returned ec=14 File does not exist and Create flag is absent: /var/eos/md/directories.cmssrv238.fnal.gov.mdlog
151116 11:24:25 7721 XrootdConfig: Unable to create file system object via libXrdEosMgm.so
151116 11:24:25 7721 XrootdConfig: Unable to load file system.
------ xrootd protocol initialization failed.
151116 11:24:25 7721 XrdProtocol: Protocol xrootd could not be loaded
------ xrootd :-1 initialization failed.

Long story short, both servers believe they're slaves... Running commands (from http://eos.readthedocs.org/en/latest/configuration/master.html):

[root@cmssrv238 md]# eos -b ns master cmssrv222.fnal.gov #Note that this is not really needed, just to make sure nothing goes nuts...

[root@cmssrv222 ~]# service eos master mgm
Configured MGM on localhost as master [ OK ]
[root@cmssrv222 ~]# service eos start mgm

Starting xrootd as mgm with -n mgm -c /etc/xrd.cf.mgm -m -l /var/log/eos/xrdlog.mgm -b -Rdaemon

After a while I see it booting OK:

[root@cmssrv222 ~]# eos ns
# ------------------------------------------------------------------------------------
# Namespace Statistic
# ------------------------------------------------------------------------------------
ALL Files 4648810 [booting] (9s)
ALL Directories 518563
# ....................................................................................
ALL Compactification status=off waitstart=0 interval=0 ratio-file=0.0:1 ratio-dir=0.0:1
# ....................................................................................
ALL Replication mode=master-rw state=master-rw master=cmssrv222.fnal.gov configdir=/var/eos/config/cmssrv222.fnal.gov/ config=default active=true mgm:cmssrv238.fnal.gov=ok mgm:mode=slave-ro mq:cmssrv238.fnal.gov:1097=ok
# ....................................................................................
ALL File Changelog Size 3.61 GB
ALL Dir Changelog Size 162.95 MB
# ....................................................................................
ALL avg. File Entry Size 776 B
ALL avg. Dir Entry Size 314 B
# ------------------------------------------------------------------------------------
ALL memory virtual 4.91 GB
ALL memory resident 4.53 GB
ALL memory share 9.09 MB
ALL memory growths 4.91 GB
ALL threads 64
ALL uptime 18
# ------------------------------------------------------------------------------------
[root@cmssrv222 ~]#

#10 Updated by Gerard Bernabeu Altayo almost 4 years ago

Apparently there is another bug(?) with the log level (called 'debug' in EOS); I had to reissue:

[root@cmssrv222 ~]# eos debug warning \*
success: switched to mgm.debuglevel=warning on nodes mgm.nodename=/eos/*/fst
success: switched to mgm.debuglevel=warning on nodes mgm.nodename=/eos/*/mgm
[root@cmssrv222 ~]#

The FSTs kept the level to warning, but the MGM did not.

Also I see lots of errors in the logfile, and 'df' does not show the EOS filesystem...

Apparently the FSTs did not register back with the MGM yet!

[root@cmssrv222 ~]# eos fs ls

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos12.fnal.gov (1095) 41 /storage/data3 default.2 rw nodrain
cmseos12.fnal.gov (1095) 42 /storage/data1 default.3 rw nodrain
cmseos12.fnal.gov (1095) 43 /storage/data2 default.1 rw nodrain
cmseos11.fnal.gov (1095) 44 /storage/data1 default.0 rw nodrain
cmseos11.fnal.gov (1095) 45 /storage/data2 default.3 rw nodrain
cmseos11.fnal.gov (1095) 46 /storage/data3 default.2 rw nodrain
cmseos13.fnal.gov (1095) 47 /storage/data1 default.0 rw nodrain
cmseos13.fnal.gov (1095) 48 /storage/data2 default.1 rw nodrain
cmseos13.fnal.gov (1095) 49 /storage/data3 default.2 rw nodrain
cmseos14.fnal.gov (1095) 50 /storage/data1 default.0 rw nodrain
cmseos14.fnal.gov (1095) 51 /storage/data2 default.1 rw nodrain
cmseos14.fnal.gov (1095) 52 /storage/data3 default.3 rw nodrain
cmseos15.fnal.gov (1095) 53 /storage/data1 default.0 rw nodrain
cmseos15.fnal.gov (1095) 54 /storage/data2 default.1 rw nodrain
cmseos15.fnal.gov (1095) 55 /storage/data3 default.3 rw nodrain
cmseos16.fnal.gov (1095) 56 /storage/data1 default.3 rw nodrain
cmseos16.fnal.gov (1095) 57 /storage/data2 default.1

#11 Updated by Gerard Bernabeu Altayo almost 4 years ago

Nodes show up as 'unknown':

[root@cmssrv222 ~]# eos node ls
#----------------------------------------------------------------------------------------------------------------------------------------------
# type # hostport # geotag # status # status # txgw #gw-queued # gw-ntx #gw-rate # heartbeatdelta #nofs
#----------------------------------------------------------------------------------------------------------------------------------------------
nodesview cmseos1.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos11.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos12.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos13.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos14.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos15.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos16.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos17.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos2.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos25.fnal.gov:1095 unknown on off 0 10 120 ~ 2
nodesview cmseos26.fnal.gov:1095 unknown on off 0 10 120 ~ 2
nodesview cmseos27.fnal.gov:1095 unknown on off 0 10 120 ~ 3
nodesview cmseos28.fnal.gov:1095 unknown on off 0 10 120 ~ 2

Will try rebooting one of the FSTs...

[root@cmseos63 ~]# service eos restart
Stopping eosd:
[ OK ]
Stopping xrootd: fst

If this works I'll just restart EOS on all nodes; it will take a while for them to come back. My theory here is that the retry mechanism on the EOS FSTs is not robust enough, and the failure to start the master MGM probably made this mechanism fail...

#12 Updated by Gerard Bernabeu Altayo almost 4 years ago

My theory was wrong; I think the 'mq' component was also confused, believing both servers were slaves. Somehow the system still worked though... I did:

[root@cmssrv222 ~]# service eos stop mq
[root@cmssrv222 ~]# service eos master mq

Now the FS shows up on 'df' and also most of the FS show up online. Of course cmseos63 failed to boot... Restarting it again.

Also some other FSTs failed to boot:

[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online'

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos35.fnal.gov (1095) 145 /storage/data2 default.3 bootfailure rw nodrain online
cmseos42.fnal.gov (1095) 146 /storage/data1 default.0 bootfailure rw nodrain online
cmseos40.fnal.gov (1095) 149 /storage/data2 default.1 bootfailure rw nodrain online
cmseos41.fnal.gov (1095) 150 /storage/data1 default.2 bootfailure rw nodrain online
cmseos50.fnal.gov (1095) 167 /storage/data2 default.1 bootfailure rw nodrain online
cmseos56.fnal.gov (1095) 175 /storage/data2 default.1 bootfailure rw nodrain online
cmseos61.fnal.gov (1095) 189 /storage/data2 default.3 bootfailure rw nodrain online
cmseos63.fnal.gov (1095) 192 /storage/data1 default.2 booting rw nodrain online
cmseos63.fnal.gov (1095) 193 /storage/data2 default.3 booting rw nodrain online
[root@cmssrv222 ~]#

These are the ones that were booting when I restarted the MGM, so this was completely expected. Restarting those FSTs now...
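Comment #13 below issues these boots one at a time; an equivalent loop (a sketch; the fs ids come from the eos fs ls listing above) would be:

```shell
# Loop form of the per-filesystem boots (sketch). Dry-run: the commands are
# printed, not sent; pipe the output to sh, or drop the echo, to actually
# issue the boot messages via the eos CLI.
cmds=$(for id in 145 146 149 150 167 175 189; do echo "eos fs boot $id"; done)
echo "$cmds"
```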

#13 Updated by Gerard Bernabeu Altayo almost 4 years ago

[root@cmssrv222 ~]# eos fs boot 145
success: boot message send to cmseos35.fnal.gov:/storage/data2
[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online'

#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos35.fnal.gov (1095) 145 /storage/data2 default.3 booting rw nodrain online
cmseos42.fnal.gov (1095) 146 /storage/data1 default.0 bootfailure rw nodrain online
cmseos40.fnal.gov (1095) 149 /storage/data2 default.1 bootfailure rw nodrain online
cmseos41.fnal.gov (1095) 150 /storage/data1 default.2 bootfailure rw nodrain online
cmseos50.fnal.gov (1095) 167 /storage/data2 default.1 bootfailure rw nodrain online
cmseos56.fnal.gov (1095) 175 /storage/data2 default.1 bootfailure rw nodrain online
cmseos61.fnal.gov (1095) 189 /storage/data2 default.3 bootfailure rw nodrain online
cmseos63.fnal.gov (1095) 192 /storage/data1 default.2 booting rw nodrain online
cmseos63.fnal.gov (1095) 193 /storage/data2 default.3 booting rw nodrain online
[root@cmssrv222 ~]# eos fs boot 146
success: boot message send to cmseos42.fnal.gov:/storage/data1
[root@cmssrv222 ~]# eos fs boot 149
success: boot message send to cmseos40.fnal.gov:/storage/data2
[root@cmssrv222 ~]# eos fs boot 150
success: boot message send to cmseos41.fnal.gov:/storage/data1
[root@cmssrv222 ~]# eos fs boot 167
success: boot message send to cmseos50.fnal.gov:/storage/data2
[root@cmssrv222 ~]# eos fs boot 175
success: boot message send to cmseos56.fnal.gov:/storage/data2
[root@cmssrv222 ~]# eos fs boot 189
success: boot message send to cmseos61.fnal.gov:/storage/data2
[root@cmssrv222 ~]# eos fs ls | grep -v 'booted rw nodrain online'
#..........................................................................................................................................
# host (#...) # id # path # schedgroup # geotag # boot # configstatus # drain # active
#..........................................................................................................................................
cmseos35.fnal.gov (1095) 145 /storage/data2 default.3 booting rw nodrain online
cmseos42.fnal.gov (1095) 146 /storage/data1 default.0 booting rw nodrain online
cmseos40.fnal.gov (1095) 149 /storage/data2 default.1 booting rw nodrain online
cmseos41.fnal.gov (1095) 150 /storage/data1 default.2 booting rw nodrain online
cmseos50.fnal.gov (1095) 167 /storage/data2 default.1 booting rw nodrain online
cmseos56.fnal.gov (1095) 175 /storage/data2 default.1 booting rw nodrain online
cmseos61.fnal.gov (1095) 189 /storage/data2 default.3 booting rw nodrain online
cmseos63.fnal.gov (1095) 192 /storage/data1 default.2 booting rw nodrain online
cmseos63.fnal.gov (1095) 193 /storage/data2 default.3 booting rw nodrain online
[root@cmssrv222 ~]#

This almost concludes the downtime :)

#14 Updated by Gerard Bernabeu Altayo almost 4 years ago

Summary of the issues that occurred during the reboot:

1. Slave MGM had corrupted metadata and it needed to be regenerated.
2. Master MGM booted as slave and needed to be restarted forcing it to be master.
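The two fixes can be condensed into a dry-run recap (commands as recorded in comments #7, #9 and #12 above, order condensed; the steps are printed rather than executed, so run each on the indicated host to apply for real):

```shell
# Dry-run recap of the recovery (sketch assembled from the history above).
steps="cmssrv238: service eossync stop; service eos stop
cmssrv238: cd /var/eos; mv md md.old; mkdir md; chown daemon.root md; chmod 700 md
cmssrv238: service eos start sync
cmssrv222: service eossync start
cmssrv222: service eos master mgm; service eos master mq
cmssrv222: service eos start mgm"
echo "$steps"
```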

#15 Updated by Gerard Bernabeu Altayo almost 4 years ago

  • Status changed from New to Resolved

