Initial daq cluster setup checklist » History » Version 38
Geoff Savage, 01/27/2020 02:35 PM
1 | 1 | Gennadiy Lukhanin | h1. Initial DAQ cluster setup checklist. |
---|---|---|---|
2 | 1 | Gennadiy Lukhanin | |
3 | 1 | Gennadiy Lukhanin | _Objective: To reduce the number of service desk tickets during the initial setup of DAQ development / production clusters._ |
4 | 1 | Gennadiy Lukhanin | |
5 | 1 | Gennadiy Lukhanin | h2. Networking |
6 | 1 | Gennadiy Lukhanin | |
7 | 11 | Geoff Savage | # define subnets for IPMI, fnal/public and data/daq interfaces |
8 | 10 | Geoff Savage | ** How many cables are needed? Shared IPMI/public? |
9 | 10 | Geoff Savage | ** Name interfaces by function. |
10 | 1 | Gennadiy Lukhanin | # define host names for all network interfaces and make them consistent |
11 | 11 | Geoff Savage | ** mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq |
12 | 1 | Gennadiy Lukhanin | ** the list of host names should be complete as if all hardware is available |
13 | 13 | Geoff Savage | *** Reserve a few IPs for the next computer installs |
14 | 3 | Ron Rechenmacher | ** put all host names into /etc/hosts and distribute it across all servers |
15 | 12 | Geoff Savage | ** How do we automate generation of the hosts file? |
16 | 12 | Geoff Savage | ** Right now /etc/hosts is managed by Puppet. Remove this? |
17 | 1 | Gennadiy Lukhanin | # make a consistent IP address assignment across all subnets |
18 | 1 | Gennadiy Lukhanin | ** use address blocks for the same server roles |
19 | 1 | Gennadiy Lukhanin | ** make the last octet of the IP address the same across all NICs of the same host |
20 | 15 | Geoff Savage | *** Discussion with networking |
21 | 1 | Gennadiy Lukhanin | # configure authentication |
22 | 1 | Gennadiy Lukhanin | ** Kerberos for the public interface |
23 | 16 | Geoff Savage | ** public key for the data interface |
24 | 16 | Geoff Savage | *** Access everything over private network for the daq user |
25 | 16 | Geoff Savage | *** User testing artdaq will get instructions to set up their own public key |
26 | 1 | Gennadiy Lukhanin | # create instructions for rebooting servers using IPMI |
27 | 10 | Geoff Savage | # enable jumbo frames (MTU 9000) on -all- DAQ interfaces and networking equipment by default |
28 | 10 | Geoff Savage | ** Switch configuration by networking. |
29 | 10 | Geoff Savage | ** Just on DAQ network, not all interfaces. |
30 | 11 | Geoff Savage | ** NFS on public network - jumbo frames for performance? No jumbo on public. |
31 | 1 | Gennadiy Lukhanin | # configure and verify that multicasting is enabled and working on all networking equipment |
32 | 18 | Geoff Savage | ** Need testing software to verify the configuration. |
33 | 18 | Geoff Savage | |
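One possible answer to the open question about automating the hosts file is sketched below. It combines three items from the list above: host names derived from the function of each interface, an address block per server role, and the same last octet on every NIC of a host. The subnets, the role offsets, and the function names @host_entries@ and @render_hosts@ are assumptions made for illustration; only the naming pattern (mydaq-br01, mydaq-br01-ipmi, mydaq-br01-daq) comes from the checklist.

```python
import ipaddress

# Hypothetical subnets, keyed by host name suffix. Replace with the real allocations.
SUBNETS = {
    "": ipaddress.ip_network("192.0.2.0/24"),          # fnal/public interface
    "-ipmi": ipaddress.ip_network("192.168.1.0/24"),   # IPMI interface
    "-daq": ipaddress.ip_network("192.168.2.0/24"),    # data/daq interface
}

# Hypothetical address blocks: starting offset of the last octet for each server role.
ROLE_BLOCKS = {"br": 10, "eb": 30}


def host_entries(cluster, role, index):
    """Return (ip, name) pairs for every interface of one host.

    The last octet is identical on all subnets because the same offset
    is added to each network address.
    """
    offset = ROLE_BLOCKS[role] + index
    base = f"{cluster}-{role}{index:02d}"
    return [
        (str(net.network_address + offset), base + suffix)
        for suffix, net in SUBNETS.items()
    ]


def render_hosts(cluster, counts):
    """Render hosts-file lines for counts = {role: number_of_hosts}."""
    lines = []
    for role, count in counts.items():
        for index in range(1, count + 1):
            for ip, name in host_entries(cluster, role, index):
                lines.append(f"{ip}\t{name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # List every planned host, including hardware that has not arrived yet.
    print(render_hosts("mydaq", {"br": 2, "eb": 1}), end="")
```

Generating the file from one table of roles and counts keeps the list complete for hardware that is not yet installed, and the output can be distributed by whatever mechanism replaces or feeds the current Puppet-managed file.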
34 | 1 | Gennadiy Lukhanin | h2. Users |
35 | 1 | Gennadiy Lukhanin | |
36 | 1 | Gennadiy Lukhanin | # define a shared user for |
37 | 1 | Gennadiy Lukhanin | ** managing UPS products |
38 | 1 | Gennadiy Lukhanin | ** running daq, dcs, databases |
39 | 17 | Geoff Savage | ** Experiments manage .k5logins for the shared accounts |
40 | 1 | Gennadiy Lukhanin | # add all people from the RSI group to the /root/.k5login |
41 | 1 | Gennadiy Lukhanin | # add all known daq users to the daq and dcs shared accounts |
42 | 1 | Gennadiy Lukhanin | # shared user profiles are not expected to have any customizations |
43 | 21 | Geoff Savage | # Control room accounts - shared |
44 | 1 | Gennadiy Lukhanin | |
45 | 1 | Gennadiy Lukhanin | h2. Storage areas |
46 | 1 | Gennadiy Lukhanin | |
47 | 31 | Geoff Savage | # set up a reliable NFS server for /home, /daq/software, /daq/log, /daq/run_records, /daq/scratch |
48 | 19 | Geoff Savage | ** No mounts from the lab's central storage or pnfs. |
49 | 29 | Geoff Savage | ** cvmfs requires additional configuration to optimize |
50 | 32 | Geoff Savage | ** reserve adequate disk space for each area |
51 | 32 | Geoff Savage | ** raid 10 for nfs server |
52 | 1 | Gennadiy Lukhanin | # create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server |
53 | 1 | Gennadiy Lukhanin | ** a faster NVMe drive such as Samsung 970 Pro or faster is preferred |
54 | 30 | Geoff Savage | ** Pick a current SSD drive; larger sizes have faster write speeds. |
55 | 34 | Geoff Savage | # Backups |
56 | 34 | Geoff Savage | ** set up a nightly backup for /home |
57 | 34 | Geoff Savage | ** set up a weekly backup for /daq areas as needed |
58 | 1 | Gennadiy Lukhanin | # the performance of the NFS should be monitored |
59 | 34 | Geoff Savage | ** Develop monitoring of NFS performance |
60 | 35 | Geoff Savage | ** Collect metrics |
61 | 32 | Geoff Savage | # /data is a local file system on data logger computers |
62 | 32 | Geoff Savage | ** raid 10 for performance |
63 | 1 | Gennadiy Lukhanin | ** lose half the disk space |
64 | 1 | Gennadiy Lukhanin | # Hardware raid cards |
65 | 34 | Geoff Savage | # Turn off raid checking on /data |
66 | 34 | Geoff Savage | # Raid checking for home areas on nfs space |
67 | 1 | Gennadiy Lukhanin | |
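The NFS monitoring item above does not name a tool. As a starting point for collecting one metric, the sketch below times a synchronous write into a directory and reports the throughput in MB/s. The function name @write_throughput_mb_s@, the probe file prefix, and the default sizes are assumptions, and a single write test is only a rough indicator, not a full picture of NFS performance.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=16, chunk_mb=1):
    """Write a temporary file in `directory`, fsync it, and return MB/s.

    The probe file is removed afterwards, so the directory is left unchanged.
    """
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    written_mb = 0
    fd, path = tempfile.mkstemp(dir=directory, prefix="nfs_probe_")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as probe:
            while written_mb < size_mb:
                probe.write(chunk)
                written_mb += chunk_mb
            probe.flush()
            os.fsync(probe.fileno())  # force the data out to the server
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    # Guard against a zero interval on very small writes.
    return written_mb / max(elapsed, 1e-9)


if __name__ == "__main__":
    # /daq/scratch is one of the areas listed above; any writable directory works.
    print(f"{write_throughput_mb_s('/daq/scratch'):.1f} MB/s")
```

Run periodically from a client node, the returned number could be sent to whichever metrics store the cluster ends up using.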
68 | 1 | Gennadiy Lukhanin | h2. Software |
69 | 1 | Gennadiy Lukhanin | |
70 | 1 | Gennadiy Lukhanin | # any base software such as the OS and productivity RPMs should be identical on all servers |
71 | 36 | Geoff Savage | ** At Fermilab, Puppet is used |
72 | 36 | Geoff Savage | # the default list of installed software packages should not impede development/testing work; include e.g. emacs, vim, mc, tmux, perf, iperf, strace, dstat, ... VNC/MATE should be installed by default |
73 | 36 | Geoff Savage | ** Generate a list of packages to install |
74 | 20 | Geoff Savage | # Support for MOSH? |
75 | 20 | Geoff Savage | ** We should try it. Might be blocked by ACLs. |
76 | 4 | Pengfei Ding | |
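To check that the base software really is identical on all servers, the installed package lists can be compared host by host. The sketch below assumes the lists have already been collected (for example, one package string per line from each server's package manager); the function name @package_differences@ and the sample package strings are hypothetical.

```python
def package_differences(installed):
    """Report, per host, the packages present elsewhere but missing locally.

    `installed` maps a host name to a set of package strings.
    Hosts that have every package seen in the cluster are omitted.
    """
    all_packages = set().union(*installed.values())
    report = {}
    for host, packages in installed.items():
        missing = all_packages - packages
        if missing:
            report[host] = sorted(missing)
    return report


if __name__ == "__main__":
    # Hypothetical sample data for two hosts.
    sample = {
        "mydaq-br01": {"tmux-2.7", "vim-8.0"},
        "mydaq-eb01": {"tmux-2.7"},
    }
    print(package_differences(sample))
```

Because the strings include the version, a host with a different version of a package shows up as missing the version found elsewhere.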
77 | 5 | Pengfei Ding | h2. System Services |
78 | 4 | Pengfei Ding | |
79 | 6 | Pengfei Ding | # Optional: DNS, Kerberos, NIS, Supervisord, influxdb, prometheus. |
80 | 27 | Ron Rechenmacher | # -Ganglia-, graphite, grafana |
81 | 25 | Geoff Savage | ** system monitoring - check_mk, Netdata |
82 | 25 | Geoff Savage | ** singularity container to distribute monitoring software |
83 | 26 | Geoff Savage | ** graphite/grafana - part of standard installation |
84 | 28 | Geoff Savage | # Keep separate hardware monitoring for system administration |
85 | 28 | Geoff Savage | # Combined hardware monitoring for DAQ - DAQ monitoring + hardware monitoring |
86 | 23 | Geoff Savage | # Disable SELinux enforcing - use permissive mode |
87 | 22 | Geoff Savage | # Disable firewall on private networks |
88 | 7 | Geoff Savage | |
89 | 7 | Geoff Savage | h2. Geoff |
90 | 7 | Geoff Savage | |
91 | 38 | Geoff Savage | * -Turn off checking of raid arrays.- |
92 | 38 | Geoff Savage | * -Raid arrays must be raid 10? You lose half the disk size?- |
93 | 38 | Geoff Savage | * -Do we really need hosts file?- |
94 | 38 | Geoff Savage | ** -If we use a hosts file we should use a script to create the file.- |
95 | 8 | Geoff Savage | * NTP from Fermilab servers works well. No need for an experiment NTP server. |
96 | 38 | Geoff Savage | * -Who is in the RSI group?- |
97 | 8 | Geoff Savage | * Use Ansible to verify the settings from puppet are correct? |
98 | 9 | Geoff Savage | |
99 | 9 | Geoff Savage | * Buffer sizes in network switches |
100 | 9 | Geoff Savage | * Database computer specs |