Initial daq cluster setup checklist » History » Version 18
Geoff Savage, 01/27/2020 01:46 PM
1 | 1 | Gennadiy Lukhanin | h1. Initial DAQ cluster setup checklist. |
---|---|---|---|
2 | 1 | Gennadiy Lukhanin | |
3 | 1 | Gennadiy Lukhanin | _Objective: To reduce the number of service desk tickets during the initial setup of DAQ development / production clusters._ |
4 | 1 | Gennadiy Lukhanin | |
5 | 1 | Gennadiy Lukhanin | h2. Networking |
6 | 1 | Gennadiy Lukhanin | |
7 | 11 | Geoff Savage | # define subnets for IPMI, fnal/public and data/daq interfaces |
8 | 10 | Geoff Savage | ** How many cables are needed? Shared IPMI/public? |
9 | 10 | Geoff Savage | ** Name interfaces by function. |
10 | 1 | Gennadiy Lukhanin | # define host names for all network interfaces and make them consistent |
11 | 11 | Geoff Savage | ** mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq |
12 | 1 | Gennadiy Lukhanin | ** the list of host names should be complete as if all hardware is available |
13 | 13 | Geoff Savage | *** Reserve a few IPs for the next computer installs |
14 | 3 | Ron Rechenmacher | ** put all host names into /etc/hosts and distribute it across all servers |
15 | 12 | Geoff Savage | ** How do we automate generation of the hosts file? |
16 | 12 | Geoff Savage | ** Right now /etc/hosts is managed by puppet. Remove this? |
17 | 1 | Gennadiy Lukhanin | # make a consistent IP address assignment across all subnets |
18 | 1 | Gennadiy Lukhanin | ** use address blocks for the same server roles |
19 | 1 | Gennadiy Lukhanin | ** make the last octet of the IP address the same across all NICs of the same host |
20 | 15 | Geoff Savage | *** Discussion with networking |
21 | 1 | Gennadiy Lukhanin | # configure authentication |
22 | 1 | Gennadiy Lukhanin | ** Kerberos for the public interface |
23 | 16 | Geoff Savage | ** public key for the data interface |
24 | 16 | Geoff Savage | *** Access everything over private network for the daq user |
25 | 16 | Geoff Savage | *** User testing artdaq will get instructions to set up their own public key |
26 | 1 | Gennadiy Lukhanin | # create instructions for rebooting servers using IPMI |
27 | 10 | Geoff Savage | # enable the 9000 MTU frames on -all- DAQ interfaces and networking equipment by default |
28 | 10 | Geoff Savage | ** Switch configuration by networking. |
29 | 10 | Geoff Savage | ** Just on DAQ network, not all interfaces. |
30 | 11 | Geoff Savage | ** NFS on public network - jumbo frames for performance? No jumbo on public. |
31 | 1 | Gennadiy Lukhanin | # configure and verify that multicasting is enabled and working on all networking equipment |
32 | 18 | Geoff Savage | ** Need testing software to verify the configuration. |
33 | 18 | Geoff Savage | |
34 | 1 | Gennadiy Lukhanin | |
35 | 1 | Gennadiy Lukhanin | h2. Users |
36 | 1 | Gennadiy Lukhanin | |
37 | 1 | Gennadiy Lukhanin | # define a shared user for |
38 | 1 | Gennadiy Lukhanin | ** managing UPS products |
39 | 1 | Gennadiy Lukhanin | ** running daq, dcs, databases |
40 | 17 | Geoff Savage | ** Experiments manage .k5logins for the shared accounts |
41 | 1 | Gennadiy Lukhanin | # add all people from the RSI group to the /root/.k5login |
42 | 1 | Gennadiy Lukhanin | # add all known daq users to the daq and dcs shared accounts |
43 | 1 | Gennadiy Lukhanin | # shared user profiles are not expected to have any customizations |
44 | 1 | Gennadiy Lukhanin | |
45 | 1 | Gennadiy Lukhanin | h2. Storage areas |
46 | 1 | Gennadiy Lukhanin | |
47 | 1 | Gennadiy Lukhanin | # set up a reliable NFS server for /home, /daq/products, /daq/database, /daq/log, /daq/tmp, ..., /data, /scratch, /daq/backup |
48 | 1 | Gennadiy Lukhanin | # reserve adequate disk space for each area |
49 | 1 | Gennadiy Lukhanin | # create a designated scratch area for doing builds on a local NVMe drive, preferably on the fastest server |
50 | 1 | Gennadiy Lukhanin | ** a fast NVMe drive, such as a Samsung 970 Pro or faster, is preferred |
51 | 1 | Gennadiy Lukhanin | # set up a nightly backup for /home and a weekly backup for the /daq/backup area |
52 | 1 | Gennadiy Lukhanin | # the performance of the NFS should be monitored |
53 | 1 | Gennadiy Lukhanin | |
54 | 1 | Gennadiy Lukhanin | h2. Software |
55 | 1 | Gennadiy Lukhanin | |
56 | 1 | Gennadiy Lukhanin | # any base software such as the OS and productivity RPMs should be identical on all servers |
57 | 1 | Gennadiy Lukhanin | # the default list of installed software packages should not impede development/testing work; include e.g. emacs, vim, mc, tmux, perf, iperf, strace, dstat, ... VNC/MATE should also be installed by default |
58 | 2 | Gennadiy Lukhanin | # implement system monitoring using ganglia or similar software |
60 | 4 | Pengfei Ding | |
61 | 5 | Pengfei Ding | h2. System Services |
62 | 4 | Pengfei Ding | |
63 | 6 | Pengfei Ding | # Optional: DNS, Kerberos, NIS, Supervisord, influxdb, prometheus. |
64 | 6 | Pengfei Ding | # Ganglia, graphite. |
65 | 7 | Geoff Savage | |
66 | 7 | Geoff Savage | h2. Geoff |
67 | 7 | Geoff Savage | |
68 | 8 | Geoff Savage | * Turn off checking of raid arrays. |
69 | 8 | Geoff Savage | * Raid arrays must be raid 10? You lose half the disk size? |
70 | 8 | Geoff Savage | * Do we really need hosts file? |
71 | 8 | Geoff Savage | ** If we use a hosts file we should use a script to create the file. |
72 | 8 | Geoff Savage | * ntp from fermilab servers works well. No need for an experiment ntp server. |
73 | 8 | Geoff Savage | * Who is in the RSI group? |
74 | 8 | Geoff Savage | * Use Ansible to verify the settings from puppet are correct? |
75 | 9 | Geoff Savage | |
76 | 9 | Geoff Savage | * Buffer sizes in network switches |
77 | 9 | Geoff Savage | * Database computer specs |
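
The checklist asks how generation of the hosts file could be automated and suggests using a script if a hosts file is kept. A minimal sketch of such a script follows. It is not an agreed procedure. The subnets, the per-role address blocks, the host counts, and the function name @hosts_lines@ are all hypothetical placeholders; only the naming pattern (mydaq-br01, mydaq-br01-ipmi, mydaq-br01-daq) and the rules "address blocks per role" and "same last octet on every NIC" come from the checklist.

```python
import ipaddress

# Hypothetical subnets, one per interface function; replace with the real ones.
# The key is the host name suffix used for that interface.
SUBNETS = {
    "": ipaddress.ip_network("192.168.1.0/24"),       # fnal/public interface
    "-ipmi": ipaddress.ip_network("192.168.2.0/24"),  # IPMI interface
    "-daq": ipaddress.ip_network("192.168.3.0/24"),   # data/daq interface
}

# Hypothetical address blocks per server role: (first last-octet, planned hosts).
# The count should cover all planned hardware, including reserved entries.
ROLE_BLOCKS = {
    "br": (11, 4),
    "eb": (31, 2),
}


def hosts_lines(cluster="mydaq"):
    """Return /etc/hosts lines with the same last octet on every interface."""
    lines = []
    for role, (first_octet, count) in ROLE_BLOCKS.items():
        for index in range(count):
            octet = first_octet + index
            host = f"{cluster}-{role}{index + 1:02d}"
            for suffix, net in SUBNETS.items():
                # Offset from the network address gives the host address.
                address = net.network_address + octet
                lines.append(f"{address}\t{host}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(hosts_lines()))
```

The output could be written to a file and distributed to all servers, by puppet or by another tool. Keeping the role blocks in one table makes it easy to reserve addresses for the next computer installs.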