
Version 19 (Geoff Savage, 01/27/2020 01:48 PM) → Version 20/40 (Geoff Savage, 01/27/2020 01:50 PM)

h1. Initial DAQ cluster setup checklist.

_Objective: To reduce the number of service desk tickets during the initial setup of DAQ development/production clusters._

h2. Networking

# define subnets for IPMI, fnal/public and data/daq interfaces
** How many cables are needed? Shared IPMI/public?
** Name interfaces by function.
# define host names for all network interfaces and make them consistent
** mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq
** the list of host names should be complete as if all hardware is available
*** Reserve a few IPs for the next computer installs
** put all host names into /etc/hosts and distribute it across all servers
** How do we automate generation of the hosts file?
** Right now /etc/hosts managed by puppet. Remove this?
# make a consistent IP address assignment across all subnets
** use address blocks for the same server roles
** keep the last octet of the IP address the same across all NICs of the same host
*** Discussion with networking
# configure authentication
** Kerberos for the public interface
** public key for the data interface
*** The daq user accesses everything over the private network
*** Users testing artdaq will get instructions to set up their own public keys
# create instructions for rebooting servers using IPMI
# enable 9000-byte (jumbo) MTU frames on -all- DAQ interfaces and networking equipment by default
** Switch configuration by networking.
** Just on DAQ network, not all interfaces.
** NFS on public network - jumbo frames for performance? No jumbo on public.
# configure and verify that multicasting is enabled and working on all networking equipment
** Need testing software to verify the configuration.
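One way to automate the hosts-file question above is to generate /etc/hosts from the naming convention, reusing the same last octet across all subnets. A minimal sketch; the subnet prefixes, role names, and octet assignments below are illustrative placeholders, not the real allocations:

```shell
#!/bin/bash
# Sketch: generate a hosts file from the naming convention above.
# Subnet prefixes and the role/octet table are illustrative assumptions.
set -eu

PUBLIC_NET="131.225.1"    # fnal/public subnet (placeholder)
DAQ_NET="192.168.1"       # data/daq subnet (placeholder)
IPMI_NET="192.168.2"      # IPMI subnet (placeholder)

# role:last-octet pairs; the same octet is reused on every subnet
HOSTS="br01:11 br02:12 eb01:21 eb02:22"

{
  for entry in $HOSTS; do
    name="mydaq-${entry%%:*}"
    octet="${entry##*:}"
    printf '%s\t%s\n' "${PUBLIC_NET}.${octet}" "${name}"
    printf '%s\t%s\n' "${DAQ_NET}.${octet}"    "${name}-daq"
    printf '%s\t%s\n' "${IPMI_NET}.${octet}"   "${name}-ipmi"
  done
} > hosts.generated   # review, then distribute to /etc/hosts on all servers
```

Because the complete host list exists up front (including reserved IPs), regenerating and redistributing this file is the whole update procedure.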
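For the data-interface public-key item, the daq user's key setup might look like the following; the target host name is a placeholder:

```shell
# Generate a passwordless key for the shared daq account (run once as daq)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''

# Distribute it to each server's daq account over the private network
ssh-copy-id -i ~/.ssh/id_ed25519.pub daq@mydaq-eb01-daq
```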
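The IPMI reboot instructions could start from an ipmitool sketch like this; the host name and user are placeholders, and credential handling will need to follow local policy:

```shell
# Query and cycle power over the IPMI LAN interface (names are placeholders).
# -I lanplus : use the IPMI v2.0 RMCP+ LAN interface
# -E         : read the password from the IPMI_PASSWORD environment variable
ipmitool -I lanplus -H mydaq-br01-ipmi -U admin -E chassis power status
ipmitool -I lanplus -H mydaq-br01-ipmi -U admin -E chassis power cycle

# Serial-over-LAN console, useful for watching the reboot
ipmitool -I lanplus -H mydaq-br01-ipmi -U admin -E sol activate
```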
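Enabling and verifying jumbo frames on a DAQ interface (and only there, per the note above) might look like this; the interface name and peer host are placeholders:

```shell
# Set MTU 9000 on the DAQ interface only (interface name is a placeholder)
ip link set dev ens1f0 mtu 9000

# Persist across reboots on SL7/CentOS 7: add to the interface config
echo 'MTU=9000' >> /etc/sysconfig/network-scripts/ifcfg-ens1f0

# Verify end to end: 8972 = 9000 - 20 (IP header) - 8 (ICMP header);
# -M do forbids fragmentation, so this fails if any hop lacks jumbo support
ping -M do -s 8972 -c 3 mydaq-eb01-daq
```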
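For the multicast verification, omping or an iperf2 UDP multicast pair could serve as the testing software. The group address and TTL below are illustrative; note this assumes classic iperf2, since iperf3 does not support multicast:

```shell
# Dedicated tool: omping exchanges multicast between the listed hosts
# (run the same command on each host)
omping mydaq-br01-daq mydaq-eb01-daq

# Or with iperf2: the receiver binds the multicast group...
iperf -s -u -B 239.1.1.1 -i 1
# ...and the sender transmits to it (-T sets the multicast TTL)
iperf -c 239.1.1.1 -u -T 4 -t 10 -b 10M
```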

h2. Users

# define a shared user for
** managing UPS products
** running daq, dcs, databases
** Experiments manage .k5logins for the shared accounts
# add all people from the RSI group to the /root/.k5login
# add all known daq users to the daq and dcs shared accounts
# shared user profiles are not expected to have any customizations
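Experiment-managed access to a shared account might look like the sketch below; the principals are placeholders. Note that .k5login lines are bare Kerberos principals, one per line, with no comment syntax:

```shell
# Maintain the shared daq account's .k5login (principals are placeholders)
printf '%s\n' 'alice@FNAL.GOV' 'bob@FNAL.GOV' > /home/daq/.k5login
chown daq:daq /home/daq/.k5login
chmod 600 /home/daq/.k5login
```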

h2. Storage areas

# set up a reliable NFS server for /home, /daq/products, /daq/database, /daq/log, /daq/tmp, ..., /data, /scratch, /daq/backup
** No mounts from labs central storage or pnfs.
# reserve adequate disk space for each area
# create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server
** a faster NVMe drive such as Samsung 970 Pro or faster is preferred
# set up a nightly backup for /home and a weekly backup for the /daq/backup areas
# NFS performance should be monitored
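The NFS side of the list above might be sketched as the following server export and client mount entries; the server name, network, and mount options are assumptions to adapt:

```
# /etc/exports on the NFS server (network and options are placeholders)
/home          192.168.1.0/24(rw,sync,root_squash)
/daq/products  192.168.1.0/24(rw,sync,no_root_squash)

# /etc/fstab line on each client (server name is a placeholder)
mydaq-nfs-daq:/home  /home  nfs  rw,hard,rsize=1048576,wsize=1048576  0 0
```

Run exportfs -ra on the server after editing /etc/exports.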
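The backup item could start from cron entries like these; the destination host, paths, and schedule are assumptions:

```
# /etc/cron.d/daq-backups (destination host/path are placeholders)
# Nightly /home backup at 02:00; weekly /daq/backup sync on Sundays at 03:00
0 2 * * *  root  rsync -a /home/        mydaq-backup:/backups/home/
0 3 * * 0  root  rsync -a /daq/backup/  mydaq-backup:/backups/daq-backup/
```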

h2. Software

# any base software such as the OS and productivity RPMs should be identical on all servers
# the default list of installed software packages should not impede development/testing work, e.g. emacs, vim, mc, tmux, perf, iperf, strace, dstat, ...; VNC/MATE should be installed by default
# implement system monitoring using ganglia or similar software
# Support for MOSH?
** We should try it. Might be blocked by ACLs.
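A quick conformity check for the default package list could be a loop like the one below. The tool list mirrors the examples above; command -v is used as a portable proxy here, with rpm -q being the authoritative check on an RPM-based system:

```shell
#!/bin/bash
# Check that the expected developer tools are available on this host.
# Writes one "OK <tool>" or "MISSING <tool>" line per tool to pkg-report.txt.
TOOLS="emacs vim mc tmux perf iperf strace dstat"

: > pkg-report.txt
for tool in $TOOLS; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "OK $tool"
    else
        echo "MISSING $tool"
    fi
done | tee -a pkg-report.txt
```

Collecting pkg-report.txt from every server gives a simple way to confirm the base software really is identical across the cluster.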

h2. System Services

# Optional: DNS, Kerberos, NIS, Supervisord, influxdb, prometheus.
# Ganglia, graphite.

h2. Geoff

* Turn off periodic checking of RAID arrays.
* Must RAID arrays be RAID 10? You lose half the disk capacity?
* Do we really need hosts file?
** If we use a hosts file we should use a script to create the file.
* NTP from the Fermilab servers works well. No need for an experiment NTP server.
* Who is in the RSI group?
* Use Ansible to verify the settings from puppet are correct?

* Buffer sizes in network switches
* Database computer specs