Milestone #24326

Migrating from SL 6 to SL 7

Added by Thomas Carroll 4 months ago. Updated about 1 month ago.

Status:
Accepted
Priority:
Normal
Start date:
04/17/2020
Due date:
06/01/2020
% Done:

10%

Estimated time:
100.00 h
Duration: 46

Related issues

Related to MINOS - Support #24243: Minos FCRSG 2020 computing support request (Work in progress, 03/27/2020 to 04/13/2020)

History

#1 Updated by Thomas Carroll 4 months ago

  • Related to Support #24243: Minos FCRSG 2020 computing support request added

#2 Updated by Thomas Carroll 4 months ago

Requested 2 SL 7 interactive systems on 4/14/20 (see request RITM0956521).

Learned that we already have a test machine set up with SL 7 at minostestgpvm01. We will use this machine for tests before migrating and ignore the request for 2 SL 7 systems for now.

Art - do we still need a "data handling" system?

#3 Updated by Arthur Kreymer 4 months ago

  • Status changed from New to Assigned

Yes, we need to keep the monitoring processes on minos-data separate from regular users.

I think we can shut down minos-nearline, after we have moved any crons to minos-data.
I am looking to see what runs there now.

#4 Updated by Arthur Kreymer 3 months ago

  • Estimated time set to 100.00 h
  • % Done changed from 0 to 10
  • Due date set to 10/01/2020

From: Kenneth Richard Herner <>
To: Arthur Kreymer <>
Subject: SL7 test gpvms for Chips and MINOS
Date: Thu, 7 May 2020 17:55:27 +0000

Hi Art,

My apologies if you aren't the proper contact for this; let me know if there's someone else who is taking care of OS migration issues. Ed has been setting up test gpvms for experiments that haven't started the SL6 -> SL7 migration process yet. For both Chips and MINOS there is a VM available for SL7 testing now, chipstestgpvm01 and minostestgpvm01, respectively. Can you, by the end of May, log into these nodes and make sure that things work as expected for you? Things to check might include

Submitting jobs if needed
Running analysis code
Building analysis code

There is one thing to note with SL7 machines: the /nashome areas are mounted via NFS4 for additional security. The practical impact of that is that you need to have a Kerberos ticket in the gpvm to be able to access the home areas. The best way to do that is to be sure to get a forwardable ticket via kinit -f before you log in. Another consequence of that is that shared accounts, such as production accounts, with home areas in the /nashome area will not work in this way. Shared accounts with home areas in the /experiment/app areas will continue to work.
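
In practice, the forwardable-ticket workflow Ken describes looks like this ("username" is a placeholder; FNAL.GOV is the lab's Kerberos realm). These commands depend on the Fermilab Kerberos environment, so they are shown without expected output:

```shell
# Get a forwardable ticket (-f) before logging in; 'username' is a placeholder.
kinit -f username@FNAL.GOV

# Optional: confirm the ticket carries the forwardable flag (shown as 'F').
klist -f

# Forward the ticket into the session (-K, GSSAPI delegation) so the
# NFS4-mounted /nashome areas are readable on the gpvm.
ssh -K username@minostestgpvm01.fnal.gov
```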

As I said, please test by the end of May and let me know if you see problems. We're planning to start migrating the remaining SL6 nodes on a rolling basis in June and July, and we need to know if there are problems preventing that for any experiments. Thanks very much in advance and let me know if you have questions!

Regards,

Ken

#5 Updated by Thomas Carroll 3 months ago

  • Due date changed from 10/01/2020 to 06/01/2020

Art, have you been able to verify that our cron jobs, and other regularly running scripts, work on the SL7 test machine?

I am going to start working down the list of job submission testing, analysis code running, and building--in that order.

#6 Updated by Arthur Kreymer 3 months ago

I have not looked at the cron jobs recently, but can revive this work.
This was being done under https://cdcvs.fnal.gov/redmine/issues/22598

#7 Updated by Arthur Kreymer about 1 month ago

  • Status changed from Assigned to Accepted

We got mail last week offering SL7 upgrades 07/15.
I have removed non-Minos elements to keep it short.
We discussed this at the 07/06 MAAM.

From: Edward A Simmonds <>
To: cs-liaison <>
Subject: Reinstall interactive nodes with SL7 on July 15th
Date: Thu, 2 Jul 2020 14:12:43 +0000

Greetings CS Liaisons and others,

This email is long, so please read all the way to the bottom.

As you may already be aware, support for Scientific Linux 6 (SL6) ends in November.
We need to start moving all systems currently installed with SL6 to SL7,
and we cannot wait until November to do this,
as there are over 700 of these to move and the process is very tedious.

Just to be clear, there is no upgrade from SL6 to SL7;
this is a full reinstall which will not preserve local user files.
Files in NFS-mounted areas (ex., /nashome, /experiment/app, /experiment/data)
will be unaffected.
In most cases this should be a non-issue,
as for interactive systems we require users to keep their data in NFS-mounted filesystems.

IMPORTANT NOTE: An upgrade will remove any 'user crons' from these systems,
so please warn your users to back up their crons
so they can restore them after the reinstalls.
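
The per-user backup Ed asks for amounts to dumping the crontab into an NFS-mounted area that survives the reinstall (the backup filename here is just an example). These commands act on the user's real crontab, so they are shown without expected output:

```shell
# Save your crontab somewhere NFS-mounted (e.g. under /nashome or
# /experiment/app); the filename is an example.
crontab -l > ~/crontab.sl6.backup

# After the SL7 reinstall, load it back:
crontab ~/crontab.sl6.backup
```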

Those receiving this email should have already been contacted about this.
Test SL7 nodes were provided to all of your respective experiments/groups as follows:

. . .
minostestgpvm01
. . .

We would like to upgrade the 'regular' SL6 interactive nodes
related to these test systems to SL7 on July 15th.
That would include the following systems:
. . .
minosdatagpvm01
minosgpvm03
minosgpvm04
minosgpvm05
minosgpvm06
minosgpvm07
. . .

What we need from each group/experiment is a reply in the form:

"Ed, please reinstall the following systems with SL7 on July 15th: {list of systems here} "

I think this process will be too chaotic to track with tickets,
so this is the most expedient way to do this.

Please note also, that there is a 12 hour dCache downtime planned for July 15th,
which suggests this may be a good time to do these upgrades,
as all these systems are likely to be degraded or offline during this time.

If you have any questions, please let me know.
Thanks much,

Edward Simmonds
SCF/SSI

#8 Updated by Arthur Kreymer about 1 month ago

Outline of possible actions discussed at the Monday 07/06 MAAM :

0) review of letter from Ed
1) listed systems, roles, support,
2) Proposed 7/15 migration
minos-data to SL7
Later shutdown of
minos-slf6
minos-nearline
minostestgpvm01
3) 2020 plan for Oct +
Singularity container
Adam - max for build
grid for running on OSG
Alex - NOvA using this for scidaq, etc ( or Docker ? )
Consensus seems to be this is the only way forward
Work can start in a couple of weeks
Arthur will resume coordination then.

#9 Updated by Arthur Kreymer about 1 month ago

My reply email to Ed sent 07/07

Ed, please reinstall the following systems with SL7 on July 15th:

minosdatagpvm01

Please preserve the /opt private directories and content
/opt/mindata
/opt/minos
/opt/minospro
/opt/minosraw

There is not enough time to contact the users and get their crontab
entries preserved.
So for reference, please place a copy of all minosdatagpvm01 non-root
crontab entries into the private directory /opt/mindata/crontab
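
The copy requested above could be sketched as follows. This is not the script the admins actually ran, just a minimal illustration, assuming the SL6 convention that user crontabs live under /var/spool/cron/<user> and that it runs as root:

```shell
# Copy every non-root user crontab from the cron spool into a directory
# that will be preserved across the reinstall (e.g. /opt/mindata/crontab).
backup_crontabs() {
    spool=$1   # e.g. /var/spool/cron on SL6
    dest=$2    # e.g. /opt/mindata/crontab
    mkdir -p "$dest"
    for f in "$spool"/*; do
        [ -e "$f" ] || continue          # empty spool: glob stays literal
        u=$(basename "$f")
        [ "$u" = "root" ] && continue    # skip root's crontab
        cp -p "$f" "$dest/$u"            # keep ownership/timestamps
    done
}

# On the real system this would be invoked as:
#   backup_crontabs /var/spool/cron /opt/mindata/crontab
```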

By the end of July Minos expects to request the shutdown of 3 of the 8
Minos servers.
We will make a separate RITM for this, after the Jul 15 SL7 upgrade of
minosdatagpvm01.

#10 Updated by Arthur Kreymer about 1 month ago

Glenn Cooper's reply, 07/08

Hi Arthur,

We will do what we can, but...

--- The announcement was sent Thursday morning, almost two weeks before the downtime. Why is that not enough time to contact users? We would very much prefer to leave user files, including crontabs, up to them. Among other reasons, this is a chance to clean out old cron entries that are no longer useful.

--- Note that the /opt directories typically have only a file or two (or none). Again, we definitely encourage experiments to keep track of their own data, including these files.

--- Good to hear about plans to shut down some nodes; thanks. I suggest opening the ticket before the downtime, so we don't waste time upgrading those nodes to SL7.

Thanks,
Glenn

#11 Updated by Arthur Kreymer about 1 month ago

My reply to Glenn:

For reference, see our System Upgrade checklist at
https://cdcvs.fnal.gov/redmine/projects/admin/wiki/Upgrades

Only minosdatagpvm01 is being upgraded Jul 15, and I think it is ready to go.

The grafana monitoring plots show very little activity.

Please remember to restore the sudoers configuration.
I think userkill is all we have, and it is rarely used.

There are probably no regular user crontab entries.
Recent prochistory files seem to show just my interactive logins.
https://minos.fnal.gov/data/computing/dh/procsum/minosdatagpvm01/

I have snapshotted the content of the /opt/min* directories.
Your group will have to recreate the directories after the upgrade.

I suspect that nobody is using kcron on this system,
so it may not be worth preserving user kcron files.
Users can rerun kcroninit if they need to.

#12 Updated by Arthur Kreymer about 1 month ago

Per a conversation with Glenn Cooper, I sent this email to all parties:

Here is a little more background on the Minos request to upgrade
just minosdatagpvm01 to SL7 on July 15.

This issue was discussed at the Minos All Analysis Meeting July 6, 
in response to the SCF/SSI email received July 2.

Minos cannot presently run existing code under SL7.
Minos selects SL6 Grid resources.

The Minos hope for running past Nov 2020, when SL6 becomes unavailable,
has been to build and run code using appropriate Singularity containers.

Work on this has been deferred due to preparation of an important paper,
and preparation for the Nu 2020 meeting, now concluded.

There is a bare handful of people actively working on Minos code development,
as the experiment moves toward a Data Preservation phase.

The specific plan for existing Fermilab servers is something like this,
in the order in which action is likely to be taken :

minosdatagpvm01 (minos-data) - SLF6 to SL7 upgrade July 15
minos-nearline - shutdown request likely in August
minostestgpvm01 - shutdown request likely in August
minosgpvm04 (minos60) - upgrade likely in September, after tests of Singularity containers
minosgpvm05 (minos61) - upgrade likely in September, after tests of Singularity containers
minosgpvm06 (minos62) - upgrade likely in September, after tests of Singularity containers
minosgpvm07 (minos63) - upgrade likely in September, after tests of Singularity containers
minosgpvm03 (minos-slf6) - shut down as late as possible.

We will likely want two SL6 Singularity containers:
SLF6FULL - for building new code from source
SLF6LITE - for running on the Grid

Ideally these would be provided at Fermilab for use by all experiments.
Docker containers are already used to run on the grid, but may go away past Nov.
We hope these can be cloned to Singularity containers which run in user space.
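
As a rough sketch of what running under such a container might look like: the commands below assume the OSG-distributed Fermilab SL6 worker-node image on CVMFS (the image path is an assumption; the SLF6FULL/SLF6LITE images proposed above do not exist yet). They require Singularity and CVMFS on the host, so they are shown without expected output:

```shell
# Enter an SL6 environment interactively, binding CVMFS and home areas
# into the container (-B). Image path is an assumption.
singularity shell \
    -B /cvmfs -B /nashome \
    /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl6:latest

# Or run a single command inside the container, e.g. to confirm the OS:
singularity exec \
    -B /cvmfs \
    /cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl6:latest \
    cat /etc/redhat-release
```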


