Initial daq cluster setup checklist » History » Version 13

Geoff Savage, 01/27/2020 01:28 PM

h1. Initial DAQ cluster setup checklist

_Objective: To reduce the number of service desk tickets during the initial setup of DAQ development/production clusters._

h2. Networking

# define subnets for the IPMI, fnal/public, and data/daq interfaces
** How many cables are needed? Shared IPMI/public?
** Name interfaces by function.
# define host names for all network interfaces and make them consistent
** e.g. mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq
** the list of host names should be complete, as if all hardware were available
*** Reserve a few IPs for the next computer installs.
** put all host names into /etc/hosts and distribute it across all servers
** How do we automate generation of the hosts file?
** Right now /etc/hosts is managed by Puppet. Remove this?
# make the IP address assignment consistent across all subnets
** use address blocks for the same server roles
** make the last octet of the IP address the same across all NICs of the same host
# configure authentication
** Kerberos for the public interface
** publickey for the data interface
# create instructions for rebooting servers using IPMI
# enable 9000-byte MTU (jumbo) frames on -all- DAQ interfaces and networking equipment by default
** Switch configuration is done by networking.
** Just on the DAQ network, not on all interfaces.
** NFS on the public network: jumbo frames for performance? No jumbo frames on the public network.
# configure multicasting and verify that it is enabled and working on all networking equipment
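
One possible answer to the hosts-file question above is to generate /etc/hosts from a small inventory, which also enforces the "same last octet on every NIC" rule. The sketch below is only an illustration: the subnets, the host list, and the octet blocks per role are made-up assumptions, and only the host-name pattern (mydaq-br01, -ipmi, -daq) comes from this checklist.

```python
import ipaddress

# Hypothetical subnets, keyed by the host-name suffix of each interface.
SUBNETS = {
    "": ipaddress.ip_network("192.168.1.0/24"),       # public interface
    "-ipmi": ipaddress.ip_network("192.168.2.0/24"),  # IPMI interface
    "-daq": ipaddress.ip_network("192.168.3.0/24"),   # data/DAQ interface
}

# Hypothetical inventory: host name -> last octet. Each server role gets its
# own address block (here 11-19 for "br" hosts and 21-29 for "eb" hosts).
HOSTS = {"mydaq-br01": 11, "mydaq-br02": 12, "mydaq-eb01": 21}


def render_hosts(hosts, subnets):
    """Return /etc/hosts content with the same last octet on every subnet."""
    lines = ["127.0.0.1 localhost"]
    for name, octet in sorted(hosts.items(), key=lambda item: item[1]):
        for suffix, net in subnets.items():
            # Reject the network and broadcast addresses and anything outside the subnet.
            if not 0 < octet < net.num_addresses - 1:
                raise ValueError(f"octet {octet} does not fit in {net}")
            lines.append(f"{net.network_address + octet} {name}{suffix}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_hosts(HOSTS, SUBNETS), end="")
```

Listing hosts that are not installed yet in the inventory keeps the file complete and reserves their addresses.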

h2. Users

# define a shared user for
** managing UPS products
** running daq, dcs, databases
# add all people from the RSI group to /root/.k5login
# add all known daq users to the daq and dcs shared accounts
# shared user profiles are not expected to have any customizations
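
A .k5login file lists one Kerberos principal per line, so the files for root and the shared accounts can be rendered from plain user lists. This is a minimal sketch; the function name, the user names in the usage note, and the default realm are assumptions made for illustration.

```python
def render_k5login(usernames, realm="FNAL.GOV"):
    """Return .k5login content: one principal per line, sorted and de-duplicated.

    The default realm is an assumption; pass the realm your site uses.
    """
    principals = sorted(
        {f"{user.strip()}@{realm}" for user in usernames if user.strip()}
    )
    return "".join(principal + "\n" for principal in principals)
```

For example, @render_k5login(["alice", "bob"])@ (hypothetical users) yields two lines, which could then be written to the shared account's .k5login by whatever tool distributes configuration.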

h2. Storage areas

# set up a reliable NFS server for /home, /daq/products, /daq/database, /daq/log, /daq/tmp, ..., /data, /scratch, /daq/backup
# reserve adequate disk space for each area
# create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server
** a fast NVMe drive, such as a Samsung 970 Pro or faster, is preferred
# set up a nightly backup for /home and a weekly backup for the /daq/backup areas
# NFS performance should be monitored
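
Whether each area still has adequate space can be checked with a few lines of Python. The sketch below uses @shutil.disk_usage@ from the standard library; the thresholds are invented placeholders, not recommendations.

```python
import shutil

GIB = 1024 ** 3

# Hypothetical minimum free space per area, in GiB.
MIN_FREE_GIB = {"/home": 50, "/daq/log": 100, "/scratch": 200}


def low_space_areas(min_free_gib, usage=shutil.disk_usage):
    """Return {path: free GiB} for areas below their minimum.

    Paths that cannot be inspected (for example, not mounted) map to None.
    """
    problems = {}
    for path, minimum in min_free_gib.items():
        try:
            free_gib = usage(path).free / GIB
        except OSError:
            problems[path] = None
            continue
        if free_gib < minimum:
            problems[path] = round(free_gib, 1)
    return problems
```

Running @low_space_areas(MIN_FREE_GIB)@ from cron on the NFS server or on a client would flag areas that need attention; the @usage@ parameter exists only so the function can be exercised without real mounts.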

h2. Software

# any base software, such as the OS and productivity RPMs, should be identical on all servers
# the default set of installed software packages should not impede development/testing work; packages such as emacs, vim, mc, tmux, perf, iperf, strace, dstat, ..., VNC/MATE should be installed by default
# implement system monitoring using Ganglia or similar software
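
To confirm that servers really carry the same package set, the package lists can be compared host by host. This sketch assumes the lists were already collected (for example from @rpm -qa@ on each server); the function name and data layout are illustrative.

```python
def package_differences(packages_by_host):
    """Map each host to the packages it lacks relative to the union of all hosts.

    Hosts that have every package seen anywhere are left out of the result.
    """
    all_packages = set().union(*packages_by_host.values())
    differences = {}
    for host, installed in packages_by_host.items():
        missing = all_packages - set(installed)
        if missing:
            differences[host] = sorted(missing)
    return differences
```

An empty result means every host has every package that appears on any host. Version mismatches only show up if the input strings include versions, as full @rpm -qa@ output does.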

h2. System Services

# Optional: DNS, Kerberos, NIS, Supervisord, InfluxDB, Prometheus.
# Ganglia, Graphite.

h2. Geoff

* Turn off checking of RAID arrays.
* RAID arrays must be RAID 10? You lose half the disk size?
* Do we really need a hosts file?
** If we use a hosts file, we should use a script to create the file.
* NTP from Fermilab servers works well. No need for an experiment NTP server.
* Who is in the RSI group?
* Use Ansible to verify that the settings from Puppet are correct?

* Buffer sizes in network switches
* Database computer specs
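
On the RAID 10 question in the list above: with classic RAID 1+0 (striping over two-way mirrors) half of the raw capacity is usable. A small sketch of the arithmetic, assuming equal-size disks and an even disk count of at least four:

```python
def raid10_usable_tb(disk_count, disk_size_tb):
    """Usable capacity of a classic RAID 1+0 array built from two-way mirrors."""
    if disk_count < 4 or disk_count % 2:
        raise ValueError("classic RAID 10 needs an even number of disks, at least 4")
    # Every block is stored twice, so half of the raw capacity is usable.
    return disk_count * disk_size_tb / 2
```

For example, eight 4 TB disks give 32 TB raw and 16 TB usable.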