Initial daq cluster setup checklist » History » Version 39

Geoff Savage, 01/27/2020 02:39 PM

h1. Initial DAQ cluster setup checklist.

_Objective: To reduce the number of service desk tickets during the initial setup of DAQ development / production clusters._

h2. Networking

# define subnets for IPMI, fnal/public and data/daq interfaces
** How many cables are needed?  Shared IPMI/public?
** Name interfaces by function.
# define host names for all network interfaces and make them consistent
** mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq
** the list of host names should be complete as if all hardware is available
*** Reserve a few IPs for the next computer installs
** put all host names into /etc/hosts and distribute it across all servers
** How do we automate generation of the hosts file?
** Right now /etc/hosts is managed by puppet.  Remove this?
# make a consistent IP address assignment across all subnets
** use address blocks for the same server roles
** make the last octet of the IP address the same across all NICs of the same host
*** Discussion with networking
# configure authentication
** Kerberos for the public interface
** public key for the data interface
*** Access everything over private network for the daq user
*** User testing artdaq will get instructions to set up their own public key
# create instructions for rebooting servers using IPMI
# enable 9000-byte MTU (jumbo) frames on -all- DAQ interfaces and networking equipment by default
** Switch configuration by networking.
** Just on DAQ network, not all interfaces.
** NFS on public network - jumbo frames for performance?   No jumbo on public.
# configure multicasting and verify that it is enabled and working on all networking equipment
** Need testing software to verify the configuration.
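One possible answer to the question of automating the hosts file is sketched below. It generates @/etc/hosts@ lines from a single host list, using the naming pattern above (@mydaq-br01@, @mydaq-br01-ipmi@, @mydaq-br01-daq@) and keeping the last octet identical across the subnets. The subnets, the host list and the octet values are placeholders (documentation address ranges), not the real cluster assignments, and the function name @hosts_lines@ is made up for this example.

```python
import ipaddress

# Assumption: one /24 per interface role. These are documentation-only
# address ranges and must be replaced with the real subnets.
SUBNETS = {
    "": ipaddress.ip_network("192.0.2.0/24"),          # fnal/public, bare host name
    "-ipmi": ipaddress.ip_network("198.51.100.0/24"),  # IPMI
    "-daq": ipaddress.ip_network("203.0.113.0/24"),    # data/daq
}

# Assumption: hypothetical hosts, with address blocks per server role
# (1x for boardreaders, 2x for event builders).
HOSTS = {
    "mydaq-br01": 11,
    "mydaq-br02": 12,
    "mydaq-eb01": 21,
}


def hosts_lines(hosts, subnets):
    """Return /etc/hosts lines with the same last octet on every subnet."""
    lines = []
    for name, octet in sorted(hosts.items(), key=lambda item: item[1]):
        for suffix, net in subnets.items():
            addr = net.network_address + octet
            if addr not in net:
                raise ValueError(f"octet {octet} does not fit in {net}")
            lines.append(f"{addr}\t{name}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(hosts_lines(HOSTS, SUBNETS)))
```

Keeping the host list in one place means reserved entries for future installs can be added before the hardware arrives; how the resulting file is distributed (puppet or otherwise) is left open here.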

h2. Users

# define a shared user for:
** managing UPS products
** running daq, dcs, databases
** Experiments manage .k5logins for the shared accounts
# add all people from the RSI group to /root/.k5login
# add all known daq users to the daq and dcs shared accounts
# shared user profiles are not expected to have any customizations
# Control room accounts - shared
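As an illustration of keeping the shared-account @.k5login@ files consistent, the sketch below builds the file content from a list of user names, one Kerberos principal per line. The realm @FNAL.GOV@ and the user names are assumptions for the example, and @k5login_text@ is a hypothetical helper, not an existing tool.

```python
REALM = "FNAL.GOV"  # assumption: replace with the realm actually in use


def k5login_text(usernames, realm=REALM):
    """Return .k5login content: one principal per line, sorted, no duplicates."""
    principals = sorted(
        {f"{name.strip()}@{realm}" for name in usernames if name.strip()}
    )
    return "".join(principal + "\n" for principal in principals)


if __name__ == "__main__":
    # Hypothetical user names for the shared daq account.
    print(k5login_text(["alice", "bob", "alice"]), end="")
```

The same list could feed both the daq and dcs accounts, so that adding a user is a single change.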

h2. Storage areas

# set up a reliable NFS server for /home, /daq/software, /daq/log, /daq/run_records, /daq/scratch
** No mounts from the lab's central storage or pnfs.
** cvmfs requires additional configuration to optimize
** reserve adequate disk space for each area
** RAID 10 for the NFS server
# create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server
** a faster NVMe drive such as Samsung 970 Pro or faster is preferred
** Pick a current SSD drive; larger sizes have faster write speeds.
# Backups
** set up a nightly backup for /home
** set up a weekly backup for /daq areas as needed
# the performance of the NFS should be monitored
** Develop monitoring of NFS performance
** Collect metrics
# /data is a local file system on data logger computers
** RAID 10 for performance
** lose half the disk space
# Hardware RAID cards
# Turn off RAID checking on /data
# RAID checking for home areas on NFS space
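The "lose half the disk space" cost of RAID 10 can be made concrete when reserving space for each area. The sketch below assumes a conventional RAID 10 layout (striped two-way mirrors, identical disks); the disk counts and sizes are example numbers only.

```python
def raid10_usable_tb(n_disks, disk_tb):
    """Usable capacity of a conventional RAID 10 array of identical disks.

    Every block is stored on two disks, so half the raw capacity is usable.
    """
    if n_disks < 4 or n_disks % 2 != 0:
        raise ValueError("conventional RAID 10 needs an even number of disks, at least 4")
    return n_disks * disk_tb / 2


if __name__ == "__main__":
    # Example numbers only: eight 4 TB disks.
    print(raid10_usable_tb(8, 4.0))  # 16.0 TB usable out of 32 TB raw
```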

h2. Software

# any base software such as the OS and productivity RPMs should be identical on all servers
** At Fermilab puppet is used
# the default list of installed software packages should not impede development/testing work; include, e.g., emacs, vim, mc, tmux, perf, iperf, strace, dstat, ... VNC/MATE should be installed by default
** Generate a list of packages to install
# Support for MOSH?
** We should try it.  Might be blocked by ACLs.
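Once a package list exists, a small check can report which packages are missing on a given server. The sketch below relies on @rpm -q@ returning a non-zero exit status for a package that is not installed. The list simply repeats the tool names from this section; the actual RPM package names may differ (for example, vim is often packaged under another name), so treat it as a placeholder.

```python
import subprocess

# Assumption: tool names copied from the checklist; real RPM names may differ.
REQUIRED = ["emacs", "vim", "mc", "tmux", "perf", "iperf", "strace", "dstat"]


def rpm_installed(name):
    """Return True if `rpm -q NAME` exits with status 0."""
    result = subprocess.run(
        ["rpm", "-q", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_packages(packages, is_installed=rpm_installed):
    """Return the packages for which the is_installed check fails."""
    return [name for name in packages if not is_installed(name)]


if __name__ == "__main__":
    for name in missing_packages(REQUIRED):
        print(f"missing: {name}")
```

The @is_installed@ parameter exists only so the check can be exercised without @rpm@ present; on a real node the default is used.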

h2. System Services

# Optional: DNS, Kerberos, NIS, Supervisord, influxdb, prometheus.
# -Ganglia-, graphite, grafana
** system monitoring - check_mk, netdata
** singularity container to distribute monitoring software
** graphite/grafana - part of standard installation
# Keep separate hardware monitoring for system administration
# Combined hardware monitoring for DAQ - DAQ monitoring + hardware monitoring
# Disable SELinux enforcing - permissive mode
# Disable firewall on private networks

h2. Other topics

* Buffer sizes in network switches
* Database computer specs
* System parameters
** Set in puppet
** Standard scripts to verify settings
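A "standard script to verify settings" could look roughly like the sketch below, which compares kernel parameters under @/proc/sys@ with an expected table and reports the differences. The parameter names are real Linux sysctls, but the expected values are placeholders, not recommended DAQ settings, and @mismatches@ is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: placeholder values; the real table would come from the
# settings that puppet is supposed to apply.
EXPECTED = {
    "net.core.rmem_max": "33554432",
    "net.core.wmem_max": "33554432",
}


def mismatches(expected, root="/proc/sys"):
    """Return {name: (wanted, found)} for parameters that differ.

    A parameter that cannot be read is reported with found=None.
    """
    bad = {}
    for name, wanted in expected.items():
        path = Path(root, *name.split("."))
        try:
            # Normalise whitespace so multi-value parameters compare cleanly.
            found = " ".join(path.read_text().split())
        except OSError:
            found = None
        if found != wanted:
            bad[name] = (wanted, found)
    return bad


if __name__ == "__main__":
    for name, (wanted, found) in mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Reading the files directly keeps the check read-only and independent of the tool that applied the settings.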

h2. Geoff

* -Turn off checking of raid arrays.-
* -Raid arrays must be raid 10?   You lose half the disk size?-
* -Do we really need hosts file?-
** -If we use a hosts file we should use a script to create the file.-
* NTP from Fermilab servers works well.  No need for an experiment NTP server.
* -Who is in the RSI group?-
* Use Ansible to verify the settings from puppet are correct?